The postings on this site are my own and do not represent my Employer's positions, advice or strategies.

LifeAsBob - Blog

 

Home

No Ads ever, except search!
Monday, March 18, 2024 Login
Public

Blog posts for the month of month,2009.
Some Days you need one of these...5/20/2009 3:26:52 PM

Affinity mask keeping SQL Server Agent from starting.

Recently had a unique issue in one of our HP Polyserve clustered environments.  One of the unique aspects of HP Polyserve is that you can cluster different types of hardware.  This can reduce costs and risks (as a bug in firmware or hardware on one machine is also most likely going to exist in the other node, especially if they are the same make, model, and usually purchased together).


We had an affinity mask set on a 16 processor machine, the instance was intentionally re-hosted to another server in the cluster, which was an 8 way machine.  The instance did start, but SQL Agent would not.


The following error messages were found in the sql errorlog:

initconfig: Warning: affinity mask specified is not valid. Defaulting to no affinity. Use sp_configure 'affinity mask' or 'affinity64 mask' to configure the system to be compatible with the CPU mask on the system. You can also configure the system based on the number of licensed CPUs.


SQL Server blocked access to procedure 'dbo.sp_sqlagent_has_server_access' of component 'Agent XPs' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Agent XPs' by using sp_configure. For more information about enabling 'Agent XPs', see "Surface Area Configuration" in SQL Server Books Online.

A quick bit of research showed a similiar issue in a Microsoft Cluster, where the one node did not have the correct permissions for the "lock pages in memory" setting, causing the same result of SQL Agent not starting, you can see a great write up of this in Suhas' blog.


Seems if the "start up" process for a SQL Server instance is interupted that the settings changed from the defaults in sp_configure are not set and the process does not continue.  From what I can determine this "start up" process is the settings contained in sp_configure, if any of these fail along the way the process just stops or rolls them back and the other settings like "Agent XPs" (required to start sql agent), will not occur.


I think this is a behavior that should not exist in SQL Server, why does an incorrect affinity mask keep SQL Server from finishing running the other settings set by sp_configure ?  These options should be mutually exclusive and one failure should not preclude others from running.

I've had similar issues with how SQL Server starts up affecting SQL Agent, and it does not appear to be a priority to fix, see my post on SQL Agent will not start when a user database is in recovery


This was easy to fix, but required setting a correct affinity mask and stopping and starting the service.  Generally we don't use an affinity mask on many instances.  HP Polyserve provides pre and post start up scripts which we'll use to address the issue in the future, though in extreme cases of failure no pre shutdown script will be effective.  Plan your failover capacity carefully.

SAN Patching, you'd better !5/15/2009 1:14:28 PM

Recently I had a nice experience of working an outage of a SQL Server caused by a SAN Issue.  Here is where clustering breaks down.  Fortunately I work in a big shop which uses Microsoft, Veritas, Polyserve and VM Ware clustering technologies; but all of them have a single point of failure, the SAN.

The official response to the problem was:

We are experiencing intermittent {vendor here} issues causing some SAN storage to become read only. Server team is closely monitoring for this condition and putting the setting back to read/write. A fix is available and being planned for Saturday night, unless the issue becomes more prevalent that it is now.

 

Lovely.  What is missing from the statement above is depending on which clustering technology you are using, it may require a reboot to bring the storage back for windows (sometimes all nodes !).  Veritas, Polyserve and VMWare seem to handle san / fiber hickups the best.

 

It may be time to research a stretch cluster with different sans and some type of replication or mirroring.  The uptime of 9's (pick your number) is a difficult task to reach and in my opinion not truly possible with one SAN.  I've seen too many SAN Failures.  SANS are supposed to be built in redundant everything, but somehow almost all my outages on High Availability SQL Implementations are the SAN.

 

Of course it has to be something, i'm not inferring that a SAN is no good or poorly designed, just that as every point of failure is addressed, another one appears. 

 

How the vendor could know about this issue and not let us know, is confusing in itself.  The vendor is responsible for maintenance and patching of the SAN, seems they wanted to keep this bug "close to the vest" and maybe just "roll" it in with some other firmware patching.....i'm not impressed.

 

Keep your vendors accountable and ask them how often they patch the san, and what patches are missing from your environment.  Work with the vendor so they know that you are willing to accept patches and get them applied, don't wait for the bug to affect you before applying it.

 

This may apply to SQL Server as well, how often do we patch to a specific level and try and stay steady there, not wanting to apply all the cumulative updates, unless it affects something.  It may be an affect you don't like.

 

Be more pro-active.

 


Blog Home