Horkay Blog
The postings on this site are my own and do not represent my Employer's positions, advice or strategies.
Tuesday, 01 December 2009

Migrating from one cluster technology to another, or even within the same technology, is fairly easy.  Recently I ran into an issue where we needed to migrate a SQL Server instance from an HP Polyserve cluster to a Microsoft cluster.

There were two issues I found in setting this up:

  • Installing SQL Server on a Microsoft Cluster requires a virtual name
  • Keeping the exact same Port Number

Both mattered because we wanted to keep the downtime to an absolute minimum and ensure no changes were necessary to the application or infrastructure (firewalls).

Fortunately both HP Polyserve and Microsoft Clustering use virtual names; this is what makes the migration possible.

I found the following two links helpful:

How to: Rename a SQL Server 2005 Virtual Server
How to change the network IP addresses of SQL Server failover cluster instances

The key to making this happen is to install SQL Server using a temporary virtual name and IP address, and to be sure to use the EXACT same instance name.  Instance names cannot be changed with SQL Server 2005 (or at least it's not supported to change them).  Changing the port number is pretty standard stuff.

Now you can pre-test your migration of databases and user logins, and load test the new hardware. 

At the designated change time we performed the following.

  • Take the Microsoft Cluster Off line
  • Take the Instance on HP Polyserve and delete the Instance and virtual name (binaries and data files will be kept as a backout plan)
  • Using the SQL Server Configuration editor, change the IP address on all nodes in the Microsoft Cluster
  • Using the Cluster Administrator change the SQL Server IP Address
  • Using the Cluster Administrator change the SQL Server network name
  • Bring the Cluster on-line
  • Test
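For the final Test step, a quick connectivity check against the virtual name and the static port confirms that clients will find the instance where they expect it.  The sketch below is only an illustration in Python; the host name and port in the usage comment are hypothetical, and a successful connection only shows that something is listening, not that logins or databases are healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (hypothetical virtual name and port; substitute your own values):
#   port_is_open("sqlvirtual01.example.com", 1433)
```

Running the same check before and after the switch, from a machine on the application's side of the firewall, shows whether the name, IP address and port all moved together.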

The one issue we ran into was with logical networks and VLANs.  I don't have a complete understanding of network topology, but only certain logical networks within our environment can host certain ranges of IP addresses.  Initially we built the new cluster on a logical network that was unable to host the existing virtual name, and the switch failed.  Be sure to talk to your network, Windows and DNS engineers about exactly what you want to do so they can build things properly the first time; they don't like switching and changing things twice any more than DBAs do!

Tuesday, 01 December 2009 13:54:25 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Thursday, 10 September 2009

Recently we upgraded to a new version of HP SIM (Systems Insight Manager), of course without testing it or letting the DBAs know.  Suddenly some things crashed.  The new version of HP SIM provides a "richer discovery model"; oh it's rich!

It seems the new version performs some type of scan on the SCSI bus, which causes our multi-path software (EMC PowerPath) to lose connectivity to the SAN.  This causes the file system to "panic", and all file systems unmount.  Nice.

SIM is a hardware monitoring solution from HP for HP servers.  The server administration team loves it; the SQL DBAs don't mind it.  It of course uses SQL Server for a back-end database, so it helps keep us employed as well!  Basically SIM provides hardware inventory and monitoring of the servers.

Specifically: 

Version:  Systems Insight Manager 5.3 with SP1 - Windows
Build version:  C.05.03.01.00 

Using the Manual Discovery Task that ships with the product.

The issue was most notable with our clustered servers, especially Polyserve.  Below were the error messages:

I/O error in nodelist_get for filesystem on psv30: nlblocknr=10, blocknr=10, nlsize=8192, size=8192, count=16.
umount: unmounting filesystem from psv30.
Filesystem on psv30 has finished disabling itself, and has no more writes to drain.
A psv-bound subdevice (psv7 - 0x8001) has been removed from the system.
Filesystem on psv39 has suffered a critical I/O error, and will be disabled to protect filesystem integrity.
The device, \Device\Harddisk140\DR645, is not ready for access yet.
\Device\MPIODisk398 is currently in a degraded state. One or more paths have failed, though the process is now complete.

Work closely with your administrators and be careful of how these monitoring solutions will affect your production servers.

Thursday, 10 September 2009 11:05:55 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve | SQL Server#
Tuesday, 25 August 2009


SQL Server SAN Migration

I think this is my 3rd or 4th SAN migration, caused by:
 - completely moving data centers
 - changing the storage backends to different vendors
 - consolidating SANS
 - growing to a bigger san to consolidate more

There are several different scenarios to consider
 1.  Stand alone SQL Servers Instances on Internal Disk to SAN
 2.  Stand alone SQL Servers on SAN Disk (changing sans).
 3.  VM Ware SQL Servers (required to be on SAN)
 4.  Microsoft Clustered SQL Servers
 5.  Polyserve Clustered SQL Servers
 6.  BCV's / SAN Mirroring / replication technologies
---------------------------
The most important thing to remember is to back up.  The next most important thing: no matter what the SAN engineers, Windows engineers or vendors tell you about SAN migration, YOU AS THE DBA ARE RESPONSIBLE.  Understand the migration plan for each scenario; regardless of what anyone says, the DBA is always left holding the bag.  If you don't understand the migration plan or scenario, make them explain it, then learn it, try it and practice it.  I'll explain with an example.

One of the important items in most scenarios is updating to the latest drivers and versions of software, and even this step can be dangerous.  In a recent effort to patch servers to the latest version of drivers, the SAN disk just "disappeared"; when it came back, there was NO DATA.  No amount of research could reveal what happened to the data.  Refer back to the most important thing!
----------------------------

All of the different scenarios are simple, with careful preparation and a good Windows and SAN team.

1.  Stand alone SQL Servers instance on Internal Disk to SAN.

Usually the most difficult thing here is that you are taking an existing stable server and adding a lot of new complexity to it: drivers and hardware for the SAN.  This does not always go well.  If possible I try to get new hardware and completely swap the machines, configuring the new machine in advance for the SAN and installing SQL, then taking several dry runs to ensure it's stable.  The next issue is the downtime in copying the data from the internal disk to the SAN.  Then switch the drive letters and start SQL.  Don't forget the most important thing.

2.  Stand alone SQL Servers on SAN Disk (changing sans).

Here is where having a good Windows and SAN team can help you.  In most of the migrations I've been involved with, the Windows and SAN team set up a mirror between the SANs.  Then on migration day we stop SQL Server, the Windows and SAN team ensure the mirror is up to date, then split the mirror, hook the server up to the new storage and ensure the drive letters and mount points come up.  The SQL DBA restarts SQL and, boom, you're done.

Sometimes, if you're switching SAN vendors, you can't set up a mirror.  Now things get dicey.  Can you get "hooked" up to both SANs simultaneously?  If so, then you're OK: stop SQL, copy the data to the new SAN, reset drive letters and mount points, restart SQL.

If you cannot get "hooked" up to both SANs simultaneously, then you need to fall back to some type of backup and restore mechanism: tapes, or copying SQL files to local disk (if you have room).  I'm usually not a fan of this, as I find that different vendors use different drivers; switching vendors means you now have both vendors' drivers on the machine, and the machine decays and becomes unstable.  NOT FUN.  Don't forget the most important thing.

3.  VM Ware SQL Servers (required to be on SAN).

These have been my easiest SAN migrations.  That's because we have awesome dedicated VM Ware administrators.  They do it all; just schedule the outage.  But trust me, don't forget the most important thing: check that your backups went to tape, and double check your Disaster Recovery plan.

4.  Microsoft Clustered SQL Servers

I've only done Microsoft Cluster migrations with SAN mirrors, and they have been uneventful.  This is because the end result of the mirror is that clustered resources (quorum), drive letters, etc. are preserved.  But don't forget the most important thing.  Again, a great Windows and SAN team makes this easier.  I'd fret to switch vendors on the clusters, but if I ever do I'll update this post.

5.  Polyserve Clustered SQL Servers

This is the SAN Migration worst case scenario, "The clustered file system".  Below are the steps we followed for migrating sans with Polyserve.  Don't forget the most important thing.

-Dump vsql and vsqlinstance information from cluster
 -mx vsql dump >> vsql.txt
 -mx vsqlinstance dump >> vsqlinstance.txt
-Get a listing of all storage by copying the grid on storage summary to excel
-stop and disable all SQL instances and VSQLs
-copy the virtual root for each sql server instance to another server (outside the cluster)
-deport ALL dynamic volumes (paths are automatically unassigned)
-stop cluster services on all cluster nodes
-copy the entire c:\polyserve directory to another server (outside the cluster) for each machine (CYA)
-manage the storage to unpresent all LUNs from the old array
-break the mirror relationships and then present all of the mirrored LUNs
-create three new 1GB LUNs on the new array and present them for new mem parts
-put partitions on the three new LUNs
-go into the config utility on node 1 and delete old membership partitions and add the three new membership partitions
-start cluster services on this node
-export the config to other nodes and start the service on the rest of the nodes
-import all importable dynamic volumes
-assign paths
-enable instances and vsqls
-done
Polyserve SAN Switch.doc (29.5 KB)

 

6.  BCV's / SAN Mirroring / replication technologies

Administering the advanced SAN technologies is different for each vendor and quite proprietary.  You definitely want to test and work with each one individually and ensure it all works.  The details of this are far outside the scope of a simple blog post, but having great SAN engineers will make this easy, as they generally set up the mirrors and clones and handle moving them to different machines or remote locations.

Don't forget the most important thing.

Tuesday, 25 August 2009 10:16:01 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve | SQL Server#
Friday, 17 July 2009

Installing vendor databases on SQL Server is usually pretty straightforward.  Not many DBAs like the process, but the number of 3rd party products that use SQL Server continues to grow.

Recently one particular vendor's product, which the vendor of course insisted was 64 bit and SQL Server 2005 compatible, ran into problems for me.  It's only compatible after you manually add registry entries that make the InstallShield program think there is a 32 bit instance of MSDE!  Yah, that's compatible (oh by the way, those registry entries also break the SQL Browser service, so your whole instance becomes unavailable unless it is accessed directly through a port number).  NICE!

Getting through the above process was painful enough; then the InstallShield installer errored out with a SQL Server error on creating a view, indicating that a dependent object was missing.

Technical support, while helpful, was clueless.  "Never seen that before!"  After 3 days of issues, I'm basically troubleshooting for them, taking traces from our lab environment and comparing them.  Finally we stumble on to the fact that InstallShield is creating a _setup directory where it is putting the SQL Server scripts.  This folder (_setup) happens to be under the data directory where the MDFs are being created (NICE).  This data directory happens to be a mount point!

Ahhh, for whatever reason, Installshield runs the scripts differently from a mount point vs. a drive letter.   We think it's either a bug in Installshield or an issue with the permissions on the mount point.  We were not able to find any differences in permissions, but sure enough when we changed the location of the data files (which changes the location of the _setup directory), objects in the database were created in the correct order.   3 days of my life gone.

This vendor could have done many things differently.  Providing some type of manual work around, scripts to create the database (this was not an option so I was screwed), backups to manually restore, something, someway to escape the installshield install (which was like version 5 !).

And this was not a small dippy vendor; this is compliance software used by many of the big banking and financial institutions.  No wonder so many found themselves in danger of collapse!

Watch those mount points. 

Friday, 17 July 2009 08:57:04 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Wednesday, 20 May 2009

Affinity mask keeping SQL Server Agent from starting.

Recently had a unique issue in one of our HP Polyserve clustered environments.  One of the unique aspects of HP Polyserve is that you can cluster different types of hardware.  This can reduce costs and risks (with identical nodes, a bug in firmware or hardware on one machine most likely also exists on the other node, especially if they are the same make and model and were purchased together).


We had an affinity mask set on a 16 processor machine.  The instance was intentionally re-hosted to another server in the cluster, which was an 8 way machine.  The instance did start, but SQL Agent would not.
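To see why the mask stopped being valid, recall that each bit of the affinity mask stands for one CPU.  A mask built for 16 processors can have bits set for CPUs 8 through 15, which do not exist on an 8 way machine.  The Python sketch below is only an illustration of that bit arithmetic, not how SQL Server itself validates the setting; the function name and the sample mask are made up for the example.

```python
def affinity_mask_fits(mask: int, cpu_count: int) -> bool:
    """Return True if every bit set in mask refers to a CPU that exists.

    Bit 0 stands for CPU 0, bit 1 for CPU 1, and so on.
    """
    if mask < 0 or cpu_count < 1:
        raise ValueError("mask must be non-negative and cpu_count positive")
    # Any bit left after shifting off the valid CPUs points past the last CPU.
    return (mask >> cpu_count) == 0


# Hypothetical mask binding an instance to CPUs 8-15 of a 16 way server.
mask_16_way = 0xFF00
print(affinity_mask_fits(mask_16_way, 16))  # True
print(affinity_mask_fits(mask_16_way, 8))   # False
```

A check like this could run before a planned re-host to flag instances whose mask will not fit on the target node.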


The following error messages were found in the sql errorlog:

initconfig: Warning: affinity mask specified is not valid. Defaulting to no affinity. Use sp_configure 'affinity mask' or 'affinity64 mask' to configure the system to be compatible with the CPU mask on the system. You can also configure the system based on the number of licensed CPUs.


SQL Server blocked access to procedure 'dbo.sp_sqlagent_has_server_access' of component 'Agent XPs' because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of 'Agent XPs' by using sp_configure. For more information about enabling 'Agent XPs', see "Surface Area Configuration" in SQL Server Books Online.

A quick bit of research showed a similar issue in a Microsoft Cluster, where one node did not have the correct permissions for the "lock pages in memory" setting, causing the same result of SQL Agent not starting.  You can see a great write-up of this in Suhas' blog.


It seems that if the "start up" process for a SQL Server instance is interrupted, the settings changed from the defaults in sp_configure are not applied and the process does not continue.  From what I can determine, this "start up" process applies the settings contained in sp_configure; if any of these fail along the way, the process just stops or rolls them back, and the other settings, like "Agent XPs" (required to start SQL Agent), are never applied.


I think this is a behavior that should not exist in SQL Server.  Why does an incorrect affinity mask keep SQL Server from applying the other settings set by sp_configure?  These options should be independent of each other, and one failure should not preclude the others from being applied.

I've had similar issues with how SQL Server starts up affecting SQL Agent, and it does not appear to be a priority to fix, see my post on SQL Agent will not start when a user database is in recovery


This was easy to fix, but required setting a correct affinity mask and stopping and starting the service.  Generally we don't use an affinity mask on many instances.  HP Polyserve provides pre and post start up scripts which we'll use to address the issue in the future, though in extreme cases of failure no pre shutdown script will be effective.  Plan your failover capacity carefully.

Wednesday, 20 May 2009 14:26:52 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server |  SQL Agent#
Friday, 15 May 2009

Recently I had a nice experience of working an outage of a SQL Server caused by a SAN Issue.  Here is where clustering breaks down.  Fortunately I work in a big shop which uses Microsoft, Veritas, Polyserve and VM Ware clustering technologies; but all of them have a single point of failure, the SAN.

The official response to the problem was:

We are experiencing intermittent {vendor here} issues causing some SAN storage to become read only. Server team is closely monitoring for this condition and putting the setting back to read/write. A fix is available and being planned for Saturday night, unless the issue becomes more prevalent that it is now.

 

Lovely.  What is missing from the statement above is that, depending on which clustering technology you are using, it may require a reboot to bring the storage back for Windows (sometimes on all nodes!).  Veritas, Polyserve and VMWare seem to handle SAN / fiber hiccups the best.

 

It may be time to research a stretch cluster with different SANs and some type of replication or mirroring.  An uptime of 9's (pick your number) is a difficult target to reach and, in my opinion, not truly possible with one SAN.  I've seen too many SAN failures.  SANs are supposed to be built with redundant everything, but somehow almost all my outages on High Availability SQL implementations are caused by the SAN.

 

Of course it has to be something.  I'm not implying that a SAN is no good or poorly designed, just that as every point of failure is addressed, another one appears.

 

How the vendor could know about this issue and not let us know is confusing in itself.  The vendor is responsible for maintenance and patching of the SAN; it seems they wanted to keep this bug "close to the vest" and maybe just "roll" it in with some other firmware patching...  I'm not impressed.

 

Keep your vendors accountable and ask them how often they patch the san, and what patches are missing from your environment.  Work with the vendor so they know that you are willing to accept patches and get them applied, don't wait for the bug to affect you before applying it.

 

This may apply to SQL Server as well.  How often do we patch to a specific level and try to stay steady there, not wanting to apply all the cumulative updates unless a bug affects something?  It may be an effect you don't like.

 

Be more pro-active.

 

Friday, 15 May 2009 12:14:28 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve | SQL Server#
Thursday, 29 January 2009

Mountpoints are fun and can easily solve issues with more than 26 drive letters (don't ask), but recently we ran into several issues with monitoring the mountpoints.  In particular we have a report that is based on the past growth history of a database, disk size and disk free space, and it estimates when a drive will be at 80% capacity and when it will fill up.   With hundreds of database servers this report can prioritize and proactively identify which server will encounter a problem next and when.  Of course, the limitation is in one word: drive.
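The report's arithmetic can be approximated with a straight-line projection.  The sketch below is a simplified illustration in Python, not the actual report: it assumes a constant daily growth rate derived from two size samples, and all names and numbers are hypothetical.

```python
from typing import Optional


def days_until_used(capacity_gb: float, used_gb: float,
                    growth_gb_per_day: float,
                    threshold: float = 1.0) -> Optional[float]:
    """Days until used space reaches threshold * capacity.

    Returns 0.0 if already there and None if usage is not growing.
    """
    target_gb = capacity_gb * threshold
    if used_gb >= target_gb:
        return 0.0
    if growth_gb_per_day <= 0:
        return None  # flat or shrinking: no projected fill date
    return (target_gb - used_gb) / growth_gb_per_day


# Hypothetical history: 300 GB used 30 days ago, 360 GB used today, 500 GB disk.
growth = (360.0 - 300.0) / 30  # 2.0 GB per day
print(days_until_used(500.0, 360.0, growth, threshold=0.8))  # days to 80% full
print(days_until_used(500.0, 360.0, growth))                 # days to 100% full
```

Sorting servers by the smaller of the two numbers gives the "who fails next" ordering the report is after, whether the volume is a drive letter or a mount point.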

Report Example:

When we began using Mountpoints the report was not as accurate and it needed to be adjusted.  We have some internal services that collect the drive size and free space to a central DBA database.  Review of this monitoring reveals it is using a WMI Query, a quick review of the WMI SDK shows another call that will pick up mountpoints, Select * from WIN32_Volume.  Life is good.

Not so quick.  After hurdling from drives to drives and mountpoints, a problem was revealed where the WMI call failed on two servers.  Enlightenment.  These two SQL Servers had also been giving us odd issues with SQL Server Management Studio (SSMS), which is highly dependent on WMI, SMO, .NET and probably some other stuff.  Fixing WMI on these two servers fixed the monitoring call and corrected the SSMS issues.

Steps to fix WMI (Thanks to our Windows Team for the steps below):

1. net stop winmgmt
2. del %SystemRoot%\System32\WBEM\Repository /s /q
 
If that does not work, then:
 
1. remove all rights from %SystemRoot%\System32\wbem\Repository\FS
2. disable the "Windows Management Instrumentation" service
3. reboot
4. add rights back to %SystemRoot%\System32\wbem\Repository\FS
5. delete the contents of %SystemRoot%\System32\wbem\Repository\FS
6. set the "Windows Management Instrumentation" service back to Automatic
7. start the "Windows Management Instrumentation" service

Here is a short Visual Basic Script (VBS) you can save to a text file with a .VBS extension to see the call to WMI to check disk space for mount points or drives.  It filters out certain mount points for Polyserve, as we don't want to monitor those.  Also, for some reason z:\ is mapped in our environment and this WMI query returns it with nulls, so you need to test for those.  You can also use PowerShell, but it uses a WMI call under the hood as well, and we have yet to install PowerShell on all our servers.

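' Query every volume (drive letters and mount points) on the target server.
Set DiskSet = GetObject("winmgmts:{impersonationLevel=impersonate}!//BCCMAPP02") _
    .ExecQuery("Select * from Win32_Volume")
For Each objItem In DiskSet
    Ignore = False
    ' Skip the internal PolyServe mount points; we do not monitor those.
    If Len(objItem.Name) >= 51 Then
        If UCase(Left(objItem.Name, 51)) = _
            UCase("C:\Program Files\PolyServe\MatrixServer\conf\mounts") Then
            Ignore = True
        End If
    End If

    ' Mapped drives such as z:\ can come back with null sizes; skip them
    ' (and zero-capacity volumes) to avoid a type mismatch or divide by zero.
    If IsNull(objItem.Capacity) Or IsNull(objItem.FreeSpace) Then
        Ignore = True
    ElseIf CDbl(objItem.Capacity) = 0 Then
        Ignore = True
    End If

    If Ignore = False Then
        MsgBox objItem.Name & vbCrLf & "Percent Free = " & _
            Round((objItem.FreeSpace / objItem.Capacity) * 100, 2) & _
            " = " & objItem.FreeSpace & " = " & objItem.Capacity
    End If

Next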
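The same filtering and percent-free logic can be written as a small pure function, which makes it easy to test without a WMI connection.  This is only a sketch in Python; the function name and the sample rows are hypothetical, and the rows merely imitate the Name, FreeSpace and Capacity values a Win32_Volume query returns.

```python
from typing import Optional

# Prefix of the internal PolyServe mount points that should not be monitored.
POLYSERVE_PREFIX = r"C:\Program Files\PolyServe\MatrixServer\conf\mounts"


def percent_free(name: str, free_space: Optional[int],
                 capacity: Optional[int]) -> Optional[float]:
    """Return percent free for a volume, or None if it should be skipped."""
    if name.upper().startswith(POLYSERVE_PREFIX.upper()):
        return None  # internal PolyServe mount
    if free_space is None or not capacity:
        return None  # mapped drives can report nulls; also avoids divide by zero
    return round(free_space / capacity * 100, 2)


# Hypothetical rows shaped like Win32_Volume results (name, free, capacity).
rows = [
    ("D:\\Mounts\\SysData\\", 25_000_000_000, 100_000_000_000),
    ("Z:\\", None, None),
    (POLYSERVE_PREFIX + "\\psv1\\", 1, 2),
]
for name, free, cap in rows:
    print(name, percent_free(name, free, cap))
```

Only the first row yields a percentage; the mapped drive and the PolyServe mount are skipped.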
Thursday, 29 January 2009 12:33:07 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve | SQL Server#
Thursday, 25 September 2008

Discovered an issue with NetBackup failing when there are more than 64 mount points on a system.

A Patch is to be delivered.

Workaround implemented:


ISSUE:
GENERAL ERROR: bpbkar32 can experience a Application popup runtime error when attempting to backup a Polyserve system with more than 64 mount points.

ERROR CODE/MESSAGE:
Status 41 - network connection timed out

ENVIRONMENT/CONDITIONS:
After introducing over 64 mount points onto a NetBackup client, you will experience a limitation.
EVIDENCE:
Event Viewer:
20070228 13:03:52 Application Popup I26 NA Application popup: Microsoft Visual C++ Runtime Library : Runtime Error!

Program: C:\Program Files\VERITAS\NetBackup\bin\bpbkar32.exe This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.

SOLUTION/WORKAROUND:

ETA of Fix:
Symantec Corporation has acknowledged that the above-mentioned issue is present in the current version(s) of the product(s) mentioned at the end of this article. Symantec Corporation is committed to product quality and satisfied customers.

This issue has already been correct in 6.5GA and 6.0 MP6. However it is currently being considered by Symantec Corporation to be addressed in a forthcoming Maintenance Pack of 5.1. Please note that Symantec Corporation reserves the right to remove any fix from the targeted release if it does not pass quality assurance tests or introduces new risks to overall code stability. Symantec's plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk. Please refer to the maintenance pack readme or contact NetBackup Enterprise Support to confirm this issue (ET1039189) was included in the maintenance pack.

As future maintenance packs are released, please visit the following link for download and readme information:
http://www.symantec.com/enterprise/support/downloads.jsp?pid=15143

WORKAROUND:
Force NetBackup to use BackupRead() API by performing the following changes: (This will not work if you are using Flashbackup to backup the data)

1. Click Start | Run, type regedit, and click OK


Warning: Incorrect use of the Windows registry editor may prevent the operating system from functioning properly. Great care should be taken when making changes to a Windows registry. Registry modifications should only be carried-out by persons experienced in the use of the registry editor application. It is recommended that a complete backup of the registry and workstation be made prior to making any registry changes.


2. Browse to HKEY_LOCAL_MACHINE\Software\VERITAS\NetBackup\CurrentVersion\Config

3. In the Config registry key, create a new key called NTIO

4. In the NTIO registry key, create a REG_DWORD value, give it the name UseNTIO, and the value 0 (zero)

Thursday, 25 September 2008 07:55:45 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Friday, 19 September 2008

This post could be titled as:

  1. Moron
  2. Don't Jump to conclusions
  3. Research things appropriately
  4. Think

Recently I experienced a machine that was running under some cpu pressure.  Seeing that it was QA and during a load test, we were concerned with what was taking place on the machine.  The machine was running 20-30%, not bad, but sqlservr.exe was only using 10%.  Tracking down what else was taking place on the machine was not easy.

I focused in on DLM.EXE.  Wow, what's this...  Well, being a moron, not thinking, jumping to conclusions and doing improper research caused this...

A quick search in Google showed that DLM.exe shouldn't really be running and is often used in viruses etc., so I killed it.  Then, of course, the machine immediately crashed.  Crap.  Like I said, this was QA and was meant to be the playground, but regardless it's embarrassing to have a machine crash during a load test.

Now the light bulb goes off...  Duh, this is "Polyserve"; DLM.EXE has absolutely zero to do with anything found in that Google search.  DLM = Distributed Lock Manager, the mechanism Polyserve uses to control access to the clustered file system.  DLM.exe was running high because the cluster was experiencing large amounts of I/O across many servers, since we were running the load test during the maintenance windows of the servers (DBCCs, backups, reindexes, etc.).

The machine crashed because DLM.exe is part of the polyserve service, killing it caused the file system to become unstable, so the server was "fenced" from the clustered environment, exactly as Polyserve is supposed to do.  Nice to see that Polyserve worked well, even when the operator is not.

Friday, 19 September 2008 07:56:05 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Monday, 15 September 2008


Better known as, "Polyserve is puking and you get to clean it up!".   Technically it's not fair to blame Polyserve, as the root cause of this issue looks to be a corrupt registry from version 3.4 that was never properly corrected.  We've never experienced an issue like this with SQL instances originally installed from version 3.6.

When you have a SQL Server instance in Polyserve that will not fail back to its primary, you know you have a problem.  Best bet, call technical support [that's what maintenance is for!], but here is what happened to us recently and what we did to correct this issue [we have seen this before, called technical support, and they concurred this is the permanent fix].

Automated patching of the development environment caused some SQL instances to fail over; this is expected.  We do not run the instances in "auto fail back" mode, preferring to complete this step manually to minimize "ping-ponging" instances.  After patching we reviewed the environment, and it looked good, with the exception of one instance: it is on its secondary, it is running, it is available, but notice the status of "warning".

There is no nice error message in the console.  Right clicking the instance to "show alerts", displays nothing.  What gives ?

Who cares, right?  Just move the instance back to the primary and let's get working...  No dice.  The instance won't move, and NO error or message is given.  Crap, you know you're in trouble now...  For the newly Polyserve initiated, this is when you STOP and call technical support.  The more you play with things, the worse it will get, and the more consternation it will cause technical support in correcting the issue.  Since this was development I get to play...

The instance will not fail back; if we rehost, the result is no change.  Disabling the instance gives the same result.  I'm pretty sure you could reboot the server and that would cause it to fail back to the primary; matter of fact, I'm positive...  But what if you have other instances on that same physical server and you can't afford an outage?  Also, that may temporarily correct the issue, but it doesn't address the root cause, and future scenarios of patching or failover may result in the same condition.

What the hell is happening?

Finally, digging through all the logs (including the Polyserve logs), I find that Polyserve still thinks the instance is "starting", and since it is in "starting mode", it does not fail it over when requested.  There needs to be some way of overriding this stupid behavior, but it is "by design".  See pic:

What and why?  Based on past experience and knowing this instance existed from version 3.4 to 3.6, I know this deals with the registry issues.  A quick peek at the registry on the server it is currently hosted on shows that some of the registry entries point to the virtual root (3.6) and some of the registry entries point to mxshells (3.4).  Notice the registry entries below; they should all point to the virtual root:

 


Now the correct way to fix this is to delete the sql instance from Polyserve (not the machines).  Verify the registry entries and sql instance on each machine.  Delete any polyserve sql.original, sql.preg etc files (make a copy first).  Re-virtualize it and re-verify everything. 

Obviously if this is a production instance, you might have to wait before doing this, as it is time consuming.  In that case you can manually stop the services and see if you can get things to fail back, though you may have to reboot the server.  At some point, the only way to correct the root cause is to find a maintenance window that allows you to delete the instance from Polyserve, correct each individual SQL instance and re-virtualize.  Fun!

Monday, 15 September 2008 10:06:17 (Central Standard Time, UTC-06:00) | Comments [1] | Polyserve | SQL Server#
Thursday, 11 September 2008

Evaluation of the Polyserve 3.6.1 is underway.

Recently completed the upgrade of our Development cluster.  All went smoothly.

This is another one of those difficult upgrades that requires multiple outages (though small).  First you have to un-install the software and then install the new software.  So this causes an outage.  The File systems also have to be upgraded, so you have to stop using each file system to upgrade it, this causes another outage. 

So far things have been quick and no issues.

Had we not already completed the 3.4 to 3.6 upgrade, this definitely would have been very difficult, as the 3.4 to 3.6 upgrade was more complex.

No word yet on Polyserve support for SQL Server 2008.  Though based on the sparse and ADS (Alternate Data Streams) file system options in 3.6.1 they are very close, and I'd bet you will have to be on 3.6.1, as both the sparse and ADS options seem to be necessary for SQL Server 2008.

Thursday, 11 September 2008 13:06:44 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Sunday, 17 August 2008

I love patching clustered servers (not!), and patching polyserve is usually quite simple and not an issue.   I'm not referring to Polyserve patches, but Operating System (microsoft, hba, other drivers etc) patches.

We normally run a specially designed "rolling" sequential patch at 15 minute intervals for the clustered servers, trying to give each server enough time to recover before the next one rolls.
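To make the "rolling" idea concrete, here is a minimal sketch of how such a schedule could be laid out. The node names and start time are made up for illustration, and the 15 minute spacing is simply the interval mentioned above; this is not the tool we actually use.

```python
from datetime import datetime, timedelta


def rolling_patch_schedule(servers, start, interval_minutes=15):
    """Return (server, patch_time) pairs, one server per interval."""
    step = timedelta(minutes=interval_minutes)
    return [(name, start + i * step) for i, name in enumerate(servers)]


# Hypothetical node names and maintenance window start.
nodes = ["NODE01", "NODE02", "NODE03"]
for name, when in rolling_patch_schedule(nodes, datetime(2008, 8, 17, 2, 0)):
    print(when.strftime("%H:%M"), name)
```

If a node needs longer than the interval to recover, the next node still rolls on schedule, which is roughly how you end up with the ping-pong effect.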

Something happened with that sequence, and things rolled like a ping-pong ! Polyserve stabilized, but there was an issue with two instances. Review of the issue determined that the mount points these instances rely on were never mounted, meaning the mounts failed.

I attempted to remount the volumes and received the following error, "10.10.49.114 assign_drive_letter failed: "D:\Mounts\SysData\" is already a reparse point.". 



I'm not really sure why this happened, meaning whether the root cause was Windows, Polyserve or SAN related. I attempted to reboot the nodes that these failed on and received the same error; hey, a reboot fixes everything under windoze, right ?

I found the only way to correct this was to find the folder where these "reparse points" (junction point or mount point) stem from and to completely delete that folder.  So in the above example I had to delete the folder eamload_data1 under eam (remember to do this on each node that has a problem).  I then assigned the mount point as normal, and all the instances worked fine.

I've seen sporadic issues with mount points in the past, polyserve and non-polyserve.  See a previous post here on how to remove a ghosted mount point...

http://www.lifeasbob.com/2008/03/25/ManuallyRemoveAMountPoint.aspx 

It'll be interesting to see if this happens again or was a "one time" deal.

Sunday, 17 August 2008 14:42:27 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve#
Tuesday, 05 August 2008

Assigning SQL Server a static port number is necessary for many reasons: clusters, firewalls, security through obscurity etc. We use Polyserve and often have to assign port numbers, but we've never really had a good guide to follow on this; even the Polyserve documentation doesn't have a white paper or a short paragraph on a port numbering strategy. After several years now of running static ports, one is definitely needed. Before you create a strategy for your Polyserve environment or SQL Server, the excerpt below is a great generic explanation of port numbering and uses.


This is from December 1999, but is just as relevant today.  Source = http://www.microsoft.com/technet/archive/community/columns/inside/techan23.mspx?mfr=true 


Most everything you ever wanted to know about TCP/IP Port Numbers

Port numbers are divided into three ranges: the Well-Known Ports, the Registered Ports, and the Dynamic and/or Private Ports. The Well-Known Ports are those from 0 through 1023. The Registered Ports are those from 1024 through 49151. The Dynamic and/or Private Ports are those from 49152 through 65535.

Well-Known Ports are assigned by Internet Assigned Numbers Authority (IANA) and should only be used by System Processes or by programs executed by privileged users. An example of this type of port is 80/TCP and 80/UDP. These ports are privileged and reserved for use by the HTTP protocol.

Registered Ports are listed by the IANA and on most systems can be used by ordinary user processes or programs executed by ordinary users. An example of this type of port is 1723/TCP and 1723/UDP. Although other processes can use these ports, they are generally accepted as the connection control port for Point To Point Tunneling Protocol (PPTP).

Dynamic or Private Ports can be used by any process or user, and are unrestricted.
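
As a quick sanity check when picking a static port, the three ranges above can be sketched as a small lookup. The function name is my own, purely for illustration:

```python
def port_range(port: int) -> str:
    """Classify a TCP/UDP port number into the three ranges described above."""
    if not 0 <= port <= 65535:
        raise ValueError(f"not a valid port number: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"


# 80 and 1723 are the examples from the excerpt; 1433 is the SQL Server default port.
for p in (80, 1433, 1723, 49152):
    print(p, port_range(p))
```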

IANA maintains a list of ports on their Web site (http://www.isi.edu/in-notes/iana/assignments/port-numbers)

 

Tuesday, 05 August 2008 07:45:40 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Wednesday, 30 July 2008

When working with the Polyserve SQL Installer or Multi-Node Upgrade Wizard for SQL Server, we noticed an issue where, even after applying the SQL Hotfix, the SQL version was showing as not upgraded on some machines. This was a big concern, as the utility did allow us to re-virtualize with these instances at different versions. You do not want a single SQL instance on the cluster to be at different versions. Different instances can be at different versions, but a single instance should be the same across each machine, see picture:



Surprisingly, when you went to the individual machines and checked, the SQL version was the same (so on the picture above, physically checking DEVPLYSQL01, the SQLTest1 instance was at version 9.2.3228). So why was the utility showing 9.2.3152, even after we applied the hotfix ? No error was reported, but something wasn't right.

Not sure if the instance was not put in maintenance mode properly, or if something else occurred, but using RegMon (registry monitor) while the multi-node installer utility ran, I was able to determine that the Polyserve utility uses a registry entry to populate this screen, whereas we were checking the physical binary sqlservr.exe and @@Version (select @@version). The registry entry is located at:  HKLM\Software\Microsoft\Microsoft SQL Server\MSSQL.n\Setup , and the value is PatchLevel. Note that you may have to determine which instance maps to which MSSQL.n; this can be done by checking another key, HKLM\Software\Microsoft\Microsoft SQL Server\Instance Names. There you will see the mapping of an instance name to its MSSQL.n location, see below:




Obviously something went wrong, even though no error was reported back through the service pack / hotfix installer. The solution is to make sure the instance is in maintenance mode and then, being the "non-trusting" type, I manually applied the hotfix / service pack to the nodes that had the issue. Interestingly, the hotfix installer indicated the instances had already been upgraded, so obviously the hotfix installer doesn't check the same registry entry as the Polyserve installer. I manually checked (selected) the instance in question and ran it again. It reported success. I again opened the registry and checked the PatchLevel entry; this time it reported the correct version number. I opened the Polyserve multi-node installer utility and it also reported all the version numbers correctly and homogeneously (is that a word ?). We then took the instances out of maintenance mode and all was good.

We never experienced any errors or issues, so the moral of the story is, "to run your upgrade / service pack / hotfix and re-verify version numbers across each node, regardless of the messagebox reporting success", also "re-open the multi-node installer (it caches information so completely leave the utility), and verify that it "agrees" that the version numbers are the same".  Do this before going out of maintenance mode !
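
The re-verify step boils down to comparing what each node reports against the build you just applied. A minimal sketch, assuming you have already collected the PatchLevel string from each node by hand or by script (the collection itself is not shown, and the function name is hypothetical):

```python
def nodes_not_at(expected, patch_levels):
    """Return {node: version} for every node whose reported version differs.

    patch_levels maps a node name to the version string reported for ONE
    instance on that node.
    """
    return {node: ver for node, ver in patch_levels.items() if ver != expected}


# Illustrative values only, using the two builds mentioned in this post.
reported = {"DEVPLYSQL01": "9.2.3152", "DEVPLYSQL02": "9.2.3228"}
stale = nodes_not_at("9.2.3228", reported)
if stale:
    print("Do NOT leave maintenance mode yet:", stale)
```

An empty result means every node agrees with the expected build for that instance.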

 

Wednesday, 30 July 2008 07:53:53 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Wednesday, 02 July 2008

Our Polyserve cluster took a deep dive and crashed, all nodes.  Root cause is still under research, but basically we zoned some new storage to the cluster and after a reboot of the nodes the Polyserve software was unable to read or write to the membership partitions.  Of course the error didn't state that, as that would have made troubleshooting the problem easier, instead we received this error:


Event Type: Error
Event Source: sanpulse
Event Category: SAN Storage
Event ID: 17005
Date:  7/2/2008
Time:  9:35:40 AM
User:  N/A
Computer: BCPLYSQL03
Description:
This matrix is unable to take control of SAN because the servers are unable to perform fencing operations, possibly due to a networking or fencing hardware failure or misconfiguration. As a result, some or all filesystem operations may be paused throughout the matrix. In addition, filesystem mounts and unmounts and disk imports and deports cannot be performed.


We have zoned storage to Polyserve many, many times and never had a stability issue. We've had isolated issues with LUNs not showing up, mini / storport issues and Emulex issues, but nothing that caused the cluster to become unstable.

So we eventually de-zoned the new storage, rebooted the entire cluster and everything worked fine. We're not sure if we zoned the storage incorrectly (we have a new SAN Administrator, so maybe it wasn't done correctly), though I don't suspect this. Our SAN Administrator, while new, has successfully zoned storage to our clusters in the past with no issues, and understands how Polyserve works and what it is.

More so, I suspect some internal issue with Windows / Emulex / PowerPath or something that, upon the zoning of the new storage, caused the LUN IDs to change and map incorrectly.

 

Wednesday, 02 July 2008 09:47:50 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Tuesday, 24 June 2008

When you extend an instance of SQL Server on Polyserve to a new server, the startup procedure in the master database is not installed. If the instance on the new server is installed to a different MSSQL.x path, the SQL Agent service may have issues running jobs, as the subsystems will have different paths.

 

For my scenario I had 7 machines in a cluster, with an instance SQLTest1 installed on servers 1, 2 and 3, with a directory of mssql.3. This all previously existed and worked flawlessly for many months, but then I needed to set up SQLTest1 on server number 7. This required installing SQL on server 7, adjusting the properties in Polyserve to include 7, and then failing over to 7. All this worked great, but further inspection showed that there were some SQLAgent jobs failing and/or entering a "suspended" state.

 

A quick review showed that server 7 installed to mssql.1. Polyserve is supposed to handle this, and it does so through a procedure in the master database that is set to start up automatically. I've seen other instances installed to mssql.1, mssql.2 and mssql.1 with no issue, as that stored procedure in the master database handles adjusting the SQL Agent subsystems. I reviewed the SQLTest1 instance and the procedure was definitely missing. I manually added the procedure and ran it, and now the failover between any of the servers works.
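
A quick way to spot this situation before jobs start failing is to group the nodes by the MSSQL.n directory the instance landed in. A rough sketch, assuming you have looked up the directory on each node yourself; the node names mirror my scenario and the function name is made up:

```python
from collections import defaultdict


def group_by_install_dir(node_dirs):
    """Group node names by the MSSQL.n directory an instance was installed to."""
    groups = defaultdict(list)
    for node, directory in node_dirs.items():
        groups[directory.lower()].append(node)
    return dict(groups)


# SQLTest1 as described above: servers 1-3 on mssql.3, server 7 on mssql.1.
sqltest1 = {"server1": "MSSQL.3", "server2": "MSSQL.3",
            "server3": "MSSQL.3", "server7": "MSSQL.1"}
groups = group_by_install_dir(sqltest1)
if len(groups) > 1:
    print("Paths differ, check the startup procedure in master:", groups)
```

More than one group means the SQL Agent subsystem paths differ between nodes, so the startup procedure needs to be present.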

 

I can only surmise that the initial virtualization did not add the procedure, because it was not needed, all the sub systems were the same mssql.3. 

 

I think this may be a bug, it is very simple to fix, the difficult part is recognizing the problem and knowing what the fix is!

I contacted Polyserve on this, and the thinking is that this was caused by the 3.4 to 3.6 upgrade and would not happen on an instance that was "fresh" on 3.6 from day zero, makes sense.

Tuesday, 24 June 2008 13:58:08 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#

Received the following error on installing cumulative update 8


MSI (s) (40:4C) [13:11:16:515]: Product: Microsoft SQL Server 2005 (64-bit) - Update 'Hotfix 3257 for SQL Server Database Services 2005 (64-bit) ENU (KB951217)' could not be installed. Error code 1603. Additional information is available in the log file C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix\SQL9_Hotfix_KB951217_sqlrun_sql.msp.log.

This was on a sql server installation under polyserve, and sometimes they (the instances) get screwed up.

I tried to start the instance manually and it would not start (net start mssql$Instance).

I checked the error log and found 4 entries:


TDSSNIClient initialization failed with error 0x34, status code 0x1e.

Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.

SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.


These errors usually indicate an issue with the virtual IP address under the registry settings for the instance, usually an entry for an IP address that is virtualized on another service. This is a legacy problem from Polyserve that was corrected with version 3.6, but if you're one of the unlucky few that have this problem, you'd better know your SQL registry, or make a simple call to HP support, as they know how to figure this out pretty quickly.

After fixing the registry, the cumulative update applied successfully.

Always remember on Polyserve that once you go into maintenance mode, you should manually start each instance on each node. If the instance will not start, then more than likely your patch (service pack or cumulative update) will not install correctly. This doesn't mean there was a problem with the patch; obviously, if an instance is not starting correctly, it is quite hard to patch it.

Tuesday, 24 June 2008 12:31:50 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Tuesday, 13 May 2008

Team working an issue with an SSIS package failing on a new installation of SQL Server.

Not sure why, but  basically we had an existing instance on a server.  We had capacity to install a second instance on the server.  We installed and prepared the new instance, all works great.

We have an administrative DTS Package that pumps data to an excel spreadsheet and emails the data.

Could not get the package to run, received the below error.  Note the error in green, this option is set to false, we tried all kinds of items to change it with no luck.


Message
Executed as user: VSQLCRM\SYSTEM. ...ute Package Utility  Version 9.00.3042.00 for 32-bit  Copyright (C) Microsoft Corp 1984-2005. All rights reserved.    Started:  10:27:21 AM  Error: 2008-05-13 10:28:52.79     Code: 0xC0014019     Source: Tracing_SSIS      Description: The connection manager "DestinationConnectionExcel" will not acquire a connection because the package OfflineMode property is TRUE. When the OfflineMode is TRUE, connections cannot be acquired.  End Error  Error: 2008-05-13 10:28:52.79     Code: 0xC00291EC     Source: Drop Baseline Tab Execute SQL Task     Description: Failed to acquire connection "DestinationConnectionExcel". Connection may not be configured correctly or you may not have the right permissions on this connection.  End Error  Warning: 2008-05-13 10:28:52.79     Code: 0x80019002     Source: Populate Baseline Tab      Description: SSIS Warning Code DTS_W_MAXIMUMERRORCOUNTREACHED.  The Execution method succeeded, but the number of e...  Process Exit Code 0.  The step succeeded.

Only solution was to uninstall Client components and re-install them.

Not sure why this works, but somehow the installation of the 2nd instance, and applying the associated service pack 2 and CU 6, caused this. It's also possible that the automated push install from Polyserve somehow causes this, though all the Polyserve push does is run an unattended install setup script, so the bug would be with that utility that is run through Polyserve.

Lost 4+ hours worth of work chasing this bug down.

Tuesday, 13 May 2008 14:42:03 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve |  SSIS#
Thursday, 24 April 2008

Upgraded the Cluster to version 3.6 from 3.4, and we lost all network shares on a volume.

These were not shares through the "Polyserve file sharing" components, but standard NT / Windows shares on a Polyserve volume.  [ We do not have the file sharing components, only the database ].

Since there were only 3 shares it was not worth creating a ticket, as investigating root cause on this one-time event (the upgrade) is too much effort.  We just recreated the UNC shares.

I did send a note to the Polyserve HP Support, so they could forward to the engineering team.

Note:  These shares were lost before the upgrade of the dynamic volumes from 3.4 to 3.6, as obviously when you destroy a 3.4 volume and rebuild it as 3.6 you will lose the UNC (non-Polyserve) shares.

Thursday, 24 April 2008 09:12:34 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Wednesday, 23 April 2008

After upgrading to Polyserve version 3.6, and adding a new server to the cluster that does not have any sql server instances hosted on it, an error is received when rehosting or accessing properties.

The error is:  "Failed to get the matrix topology."  Picture below.

The response from HP Polyserve support is that this is a bug in 3.6, mxDB. The workaround is to uninstall the mxDB module from the server, or to set up a SQL instance on the server.

This is not a large issue as the goal of the new server is to host a database on it, so once we setup a sql instance on the machine the error will go away.  A hotfix is in the works for the issue.

The error was concerning because we were not sure if we configured the new server properly, as before we added the new server to the cluster, things worked properly.

Reply from HP Support:

From: Mokhtari, Mostafa [mailto:Mostafa.Mokhtari@hp.com]
Sent: Wednesday, April 23, 2008 11:30 AM
To: Horkay, Robert
Subject: #3601723537

Hi Robert,

What you are experiencing with re-hosting is a bug with MxDB and the fix is going to be in the next version.
Problem is not with the NIC.
The issue occurs when mxdb is installed on a node that does not have any instances installed. So once the instances are on the new node it should go away
 
Two workaround
Uninstall MxDB from the node or install the instance on it.
 
Let me know if you have any question,
 
Regards,
Mostafa Mokhtari
HEWLETT-PACKARD COMPANY
High Availability Team
Monday-Friday 8-4pm PST
Wednesday, 23 April 2008 09:06:17 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Tuesday, 22 April 2008

I've seen two isolated incidents on SQL Server 2005 where, after restoring from a SQL LiteSpeed backup, the permissions on the MDF, NDF and LDF files were changed to the user who performed the restore.

Everything worked fine, but later we decided to move these files and, to our surprise, were unable to move them; we got an error that the files were in use, read-only or did not have permissions. We spent considerable time looking for what process had the files in use (Virus Scan, netbackup, SQL ?), using process and file explorer from sysinternals, and finally read the error again and decided, maybe the files are read-only ! In the process of checking this, we clicked on the security tab, and the individual who performed the restore was the only account with permissions to the files !

We changed the permissions and copied them fine.

I don't know if this was caused by Polyserve, LiteSpeed or SQL Server, but it definitely caused us some frustration. Very strange indeed !

Tuesday, 22 April 2008 13:47:24 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Monday, 21 April 2008

While Upgrading Polyserve 3.4 to 3.6, performing a rolling upgrade, on the first server, after starting the services, while performing the license file upgrade, the following error is thrown across all Other servers in the cluster, and they shutdown ! 

So far we are turning off Data Execution Prevention (DEP), for ClusterPulse.exe. 

UPDATE:  Turns out the root cause was that we started the 3.6 upgrade with the wrong server. You must start with the highest IP Address; we started with the lowest.
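
Picking the "highest IP Address" by eye is easy to get wrong, because sorting the addresses as text does not match numeric order. Here is a small sketch that orders the nodes for the rolling upgrade, assuming "highest" means numerically highest. The addresses are made up and the function name is mine:

```python
import ipaddress


def upgrade_order(node_ips):
    """Return the node addresses sorted from numerically highest to lowest."""
    return sorted(node_ips, key=ipaddress.ip_address, reverse=True)


# Hypothetical cluster addresses; a plain text sort would put .9 first.
nodes = ["10.10.49.9", "10.10.49.114", "10.10.49.20"]
print(upgrade_order(nodes))
```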

To disable DEP for a program or server wide perform the following:

  • Right click on My Computer
  • Click on Properties
  • Click on the Advanced tab
  • Click on the Settings button in the Performance section
  • Click on the Data Execution Prevention tab

Screen Captures of the Error Message:

Screen Captures of the solution:

Screen Captures of Turning it off:

 

Monday, 21 April 2008 16:00:49 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve#
Monday, 14 April 2008

We've found issues with servers having the IRPStackSize too small, causing numerous entries in the eventlog and shares that do not work.  We also have noticed a correlation between crashes and the entries for IRPStackSize.

Increase this value in the registry by performing the following:

  1. Start the registry editor
  2. Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters.
  3. Double-click IRPStackSize (or if this registry setting doesn't exist, create it of type DWORD and ensure the case is correct).
  4. Change the base to decimal, set the value to 15 and click OK.  If a value is already present, add 3 to that number, and set it to the new number.
  5. Reboot the server
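
The rule in step 4 can be written down as a tiny helper. This only computes the number to type in; it does not touch the registry. The upper limit of 50 is my assumption about the documented maximum for this setting, so verify it against Microsoft's documentation before relying on it:

```python
def next_irp_stack_size(current=None, maximum=50):
    """Return the value to set for IRPStackSize (decimal).

    current: the existing value, or None if the setting does not exist.
    maximum: ASSUMED upper limit for the setting; adjust if yours differs.
    """
    if current is None:
        return 15                      # setting missing: create it as 15
    return min(current + 3, maximum)   # setting present: add 3


print(next_irp_stack_size())      # setting does not exist yet
print(next_irp_stack_size(15))    # setting already present
```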

 

 

Monday, 14 April 2008 12:06:17 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Friday, 04 April 2008

Trying to delete a SQL instance from Polyserve 3.6 fails with an internal error.

Detail description is:

java.lang.IllegalArgumentException: No value found for sqldataroot
 at com.polyserve.mssql.common.domain.SpackDAO.getParamaterValue(SpackDAO.java:142)
 at com.polyserve.mssql.common.tasks.TaskFactory.removeService(TaskFactory.java:483)
 at com.polyserve.mssql.common.gui.SpackServicePM.doDelete(SpackServicePM.java:320)
 at com.polyserve.mssql.common.gui.SpackServiceEditor.showDeleteProgress(SpackServiceEditor.java:98)
 at com.polyserve.gui.controller.MonitorController$DeleteMonitorAction.actionPerformed(MonitorController.java:110)
 at com.polyserve.mssql.common.gui.SpackServiceController$DeleteAction.actionPerformed(SpackServiceController.java:139)
 at com.polyserve.gui.controller.AbstractController$ProxyVisualAction.actionPerformed(AbstractController.java:198)
 at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
 at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
 at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
 at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
 at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
 at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1216)
 at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1257)
 at java.awt.Component.processMouseEvent(Component.java:6038)
 at javax.swing.JComponent.processMouseEvent(JComponent.java:3265)
 at java.awt.Component.processEvent(Component.java:5803)
 at java.awt.Container.processEvent(Container.java:2058)
 at java.awt.Component.dispatchEventImpl(Component.java:4410)
 at java.awt.Container.dispatchEventImpl(Container.java:2116)
 at java.awt.Component.dispatchEvent(Component.java:4240)
 at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4322)
 at java.awt.LightweightDispatcher.processMouseEvent(Container.java:3986)
 at java.awt.LightweightDispatcher.dispatchEvent(Container.java:3916)
 at java.awt.Container.dispatchEventImpl(Container.java:2102)
 at java.awt.Window.dispatchEventImpl(Window.java:2429)
 at java.awt.Component.dispatchEvent(Component.java:4240)
 at java.awt.EventQueue.dispatchEvent(EventQueue.java:599)
 at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:273)
 at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:183)
 at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:173)
 at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:168)
 at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:160)
 at java.awt.EventDispatchThread.run(EventDispatchThread.java:121)

Solution provided by technical support below (it worked)...


No value found for sqldataroot (3.4 upgrade to 3.6)
This error indicates that monitor_agent is still using 3.4 parameters, and therefore needs to be updated using one or both of these methods:
 
  • Update the probe parameter by right-clicking on the instance, selecting properties, modify the probe timeout by 1 and hit OK.
  • Kill monitor_agent by opening task manager, selecting the processes tab, find monitor_agent.exe and kill the process.  It will automatically restart.

--------------------------------------------------------------------------


From: Mokhtari, Mostafa [mailto:Mostafa.Mokhtari@hp.com]
Sent: Friday, April 04, 2008 11:29 AM
To: Horkay, Robert
Subject: RE: #3601519915

You would want to do this on any instance that's having the problem.  If changing the instance properties doesn't update the monitor and you decide to kill monitor agent, then killing monitor agent would only need to be performed once per node.

 

Regards,

Mostafa Mokhtari

HEWLETT-PACKARD COMPANY

High Availability Team

(719) 592-6700 ext. 65209

Monday-Friday 8-4pm PST

 


From: Horkay, Robert [mailto:RHorkay@....com]
Sent: Friday, April 04, 2008 9:19 AM
To: Horkay, Robert; Mokhtari, Mostafa
Subject: RE: #3601519915

 

Ok,

 

That worked !

 

Do we need to do this for every instance ?  or would it happen automatically as a box was restarted (as that would cause the monitor_agent to restart)....as we have yet to reboot every box after the 3.6 upgrade...

 

bob

 


From: Horkay, Robert
Sent: Friday, April 04, 2008 11:13 AM
To: 'Mokhtari, Mostafa'
Subject: RE: #3601519915

Yes that is correct.

 

I will try that.

 

On a side note, nowhere in the documentation on the 3.4 to 3.6 upgrade did it mention doing this...?

 

let me see if it works...

 

bob

 


From: Mokhtari, Mostafa [mailto:Mostafa.Mokhtari@hp.com]
Sent: Friday, April 04, 2008 11:11 AM
To: Horkay, Robert
Subject: RE: #3601519915

 

Was this an upgrade from 3.4?  If so,

 

No value found for sqldataroot (3.4 upgrade to 3.6)

This error indicates that monitor_agent is still using 3.4 parameters, and therefore needs to be updated using one or both of these methods:

 

  • Update the probe parameter by right-clicking on the instance, selecting properties, modify the probe timeout by 1 and hit OK.
  • Kill monitor_agent by opening task manager, selecting the processes tab, find monitor_agent.exe and kill the process.  It will automatically restart.

 

Regards,

Mostafa Mokhtari

HEWLETT-PACKARD COMPANY

High Availability Team

(719) 592-6700 ext. 65209

Monday-Friday 8-4pm PST

.

 

 

_____________________________________________
From: Mokhtari, Mostafa
Sent: Friday, April 04, 2008 8:57 AM
To: 'Horkay, Robert'
Subject: #3601519915

 

Hi Robert,

I just picked up your case. Is this x64 OS or x86? What is the version of your SQL?

 

Thanks,

Mostafa Mokhtari

HEWLETT-PACKARD COMPANY

High Availability Team

(719) 592-6700 ext. 65209

Monday-Friday 8-4pm PST

 

 

 

Friday, 04 April 2008 09:50:48 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Wednesday, 02 April 2008

Polyserve upgrade to 3.6 for our development environment is completed.

The directions are pretty clear, but could use some help.

We performed a rolling upgrade.  Nothing major found.

The most difficult part is the completion steps of growing the membership partitions and rebuilding volumes to be a 3.6 file system....this requires some "swing" luns from the SAN so data can be moved around.

 

 

Wednesday, 02 April 2008 09:45:10 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Tuesday, 25 March 2008

Don't ask me why, but the Mount Point failed unmounting via gui.

No error, no message, but the mount point still shows in explorer.  Reboot, still shows in explorer.

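Drop to the command line and run MountVol on its own to list the current mount points, then run it again with /D against the stale mount point to remove it: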

MountVol

MountVol d:\crm /D

 

 

 

Tuesday, 25 March 2008 13:30:38 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve#
Friday, 21 March 2008

We have been experiencing a Polyserve PanPulse error that was difficult to troubleshoot and explain. Most perplexing was the lack of failover, as the Matrix event log entry for the SQL instance indicates it stopped communicating, then started communicating again, almost like a "stutter".

Had we not been carefully monitoring the sql instances, we would not even notice this happens, as the sql instance does not fail over, it is stopped and started by the cluster software.  We run a sql agent starting job that sends an alert whenever the agent stops and starts, which alerts us to this condition.

For some reason this is only happening on 3 of our newer machines: DL585 4-way dual-core and DL580 4-way quad-core machines.

The event log entries are as follows; notice they are 2 seconds apart:

--------------------------------------------

Event Type: Information
Event Source: PANPulse
Event Category: Interface
Event ID: 100
Date:  3/13/2008
Time:  1:21:26 PM
User:  N/A
Computer: BCPLYSQL07
Description:
10.10.50.48     2008-03-13 13:21:26 Interface 10.10.50.48 address 10.10.50.48 has gone down
-------------------------------------------------------

Event Type: Information
Event Source: PANPulse
Event Category: Interface
Event ID: 100
Date:  3/13/2008
Time:  1:21:28 PM
User:  N/A
Computer: BCPLYSQL07
Description:
10.10.50.48     2008-03-13 13:21:28 Interface 10.10.50.48 address 10.10.50.48 has come up because interface statistics indicate there is incoming traffic
-----------------------------------------------------------
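
Since these "stutters" are easy to miss, here is a rough sketch of how the Description lines from the two events above could be paired up to find them. It assumes the descriptions have already been exported as plain text, and the 5 second threshold is an arbitrary choice of mine:

```python
import re
from datetime import datetime

# Matches the Description text of the PANPulse interface events shown above.
LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) Interface (?P<iface>\S+) "
    r"address \S+ has (?P<state>gone down|come up)"
)


def find_flaps(descriptions, max_gap_seconds=5):
    """Return (interface, seconds_down) for each down/up pair within the gap."""
    went_down = {}
    flaps = []
    for text in descriptions:
        m = LINE.search(text)
        if not m:
            continue
        when = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        iface = m["iface"]
        if m["state"] == "gone down":
            went_down[iface] = when
        elif iface in went_down:
            gap = (when - went_down.pop(iface)).total_seconds()
            if gap <= max_gap_seconds:
                flaps.append((iface, gap))
    return flaps
```

Feeding it the two descriptions above would report one flap on 10.10.50.48 lasting 2 seconds.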

What we stumbled across when reviewing this was Flow Control.  The flow control is a nic card setting.  These 3 machines were all set to Auto.  There is an option in the HP Network Configuration Utility where you can select the Information; and it shows the currently selected Flow Control, of which all 3 of these were somehow auto-selecting Rx Pause.  We reconfigured this property to disabled.

We're hoping this resolves the issue, as we couldn't understand why a sql instance would go up and down with a pan pulse error. 

We have 9 servers clustered together and run SQL Instances on all of them.  We only experienced the pan pulse error on these 3 machines, and all of them had the wrong flow control.

Friday, 21 March 2008 11:46:15 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#

When installing SQL Server 2005 with Polyserve, the install fails with error 28001. 

This is because when Polyserve first began supporting SQL Server 2005, the software had an issue / bug with the Password validation.  SQL Authentication password validation did not work under Polyserve.  The bug has since been fixed.  But our installation uses the template.ini, which contains an sa password that does not meet the validation.  So now that the bug is fixed, and password validation is working; the install fails.  Solution is very simple, to edit the template.ini file to contain a password that meets the validation requirements.

To dig down into the 28001 error you have to go to the individual server(s) where the install failed and find the install log file, c:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\Summary.txt

Friday, 21 March 2008 09:33:11 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Thursday, 20 March 2008

How do you determine which IP Address is bound to which NIC card in a multi-homed machine ?

Recently I ran into the task of ensuring the nic cards on our Clustered machines were all named with the standard "Private" and "Public" as opposed to various things like "network adapter 1" etc.

All our Private nics in the cluster start with 192.x.x.x. So to ensure I was naming them properly I needed to find which IP address was bound to "network adapter 1" etc. This seems easy, but for some reason it took me a while to figure out.

( and if you look carefully in this example you will see the current nic card labeled "admin" has a public IP Address, someone goofed ! and reversed them, so this is important stuff to know how to check !)

We found two ways to do this, through windows control panel, and through our vendor's nic card configuration utility (HP).

Select Start -> Control Panel -> Network -> {Adapter Name}, then right click and select Status.

From the Status properties window you can select the "advanced" tab and determine which IP Address is bound to this adapter.

Our particular Vendor is HP, and from the HP Network Utility you can select a NIC Card and then choose, Information, and in the Details section you can find the IP Address.
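
Once you know which IP address is bound to which adapter, applying our naming rule is mechanical. A small sketch, using our own convention that anything starting with 192 is "Private" and everything else is "Public"; the adapter names and addresses below are made up:

```python
def suggest_names(adapters):
    """Map each adapter's current name to the standard name for its IP.

    adapters: {current_adapter_name: ip_address_string}
    Convention assumed here: first octet 192 means Private, otherwise Public.
    """
    return {
        name: "Private" if ip.split(".")[0] == "192" else "Public"
        for name, ip in adapters.items()
    }


current = {"network adapter 1": "192.168.0.7", "admin": "10.10.49.162"}
for old, new in suggest_names(current).items():
    print(f"{old!r} should be named {new!r}")
```

This would also catch the reversed labels mentioned above, since the suggestion comes from the address and not from the current name.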

Thursday, 20 March 2008 09:35:09 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve#
Friday, 14 March 2008

We use an EMC Symmetrix SAN and we zone lots of storage.  Recently we had some new storage zoned and we could not "see" it.  It turns out that 64-bit Windows can only see the first 256 LUNs zoned to an FA channel.  32-bit Windows does not have this issue, so things actually "took a step backwards" with 64-bit.

 

This was a frustrating development.  Currently we have no solution from Microsoft (thank you).  The workaround is to have our SAN Administrator zone LUNs over 256 down another FA channel.
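
For planning purposes, the workaround amounts to splitting the LUN list into groups of at most 256 per FA channel. A minimal Python sketch of that arithmetic follows; the channel names "FA-1" and "FA-2" and the LUN numbering are hypothetical, and this only produces a plan on paper, it does not talk to the SAN.

```python
def zone_luns(lun_ids, channels, limit=256):
    """Assign LUNs to FA channels in order, at most `limit` LUNs per channel."""
    if len(lun_ids) > limit * len(channels):
        raise ValueError("not enough FA channels for this many LUNs")
    plan = {}
    for i, channel in enumerate(channels):
        chunk = lun_ids[i * limit:(i + 1) * limit]
        if chunk:  # skip channels that end up with nothing zoned
            plan[channel] = chunk
    return plan

# 300 hypothetical LUNs spread over two hypothetical FA channels
plan = zone_luns(list(range(300)), ["FA-1", "FA-2"])
```

With 300 LUNs, the first 256 stay on the first channel and the remaining 44 go down the second.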

 

Plan your SAN carefully.

 

Found this:

310072 Adding support for more than eight LUNs in Windows Server 2003 and in Windows 2000

http://support.microsoft.com/default.aspx?scid=kb;EN-US;310072

Friday, 14 March 2008 08:34:09 (Central Standard Time, UTC-06:00) | Comments [0] | General Technology | Polyserve | SQL Server#
Thursday, 13 March 2008

This error is annoying.  I have no idea why it happens; I'm not sure if NetBackup or anti-virus locks up and/or corrupts the Polyserve password file.  So far we have opened a support case, but if I remember correctly you have to reboot the server to correct this error.  Fortunately you can use the matrix console from other machines, so it can wait, but it is very annoying.

The solution of rebooting the server worked, which indicates to me that the file was not corrupt but that some type of "permission denied" happened.  It is almost as though this server did not get a "clean reboot", as both times I remember getting this error were right after a reboot or power cycle of the machine.

Related entry in the Matrix event log:

Event Type: Failure Audit
Event Source: ClusterPulse
Event Category: MxConsole
Event ID: 100
Date:  3/13/2008
Time:  2:54:20 PM
User:  N/A
Computer: DEVPLYSQL02
Description:
10.10.49.162    2008-03-13 14:54:20 FailureEventAudit -  user 'admin' at <10.10.49.162:3201>:  failed authentication, Sys_error(C:\PROGRA~1\POLYSE~1\MATRIX~1\conf\mx_passwd: Permission denied)

Screen shot of the error:

Thursday, 13 March 2008 13:59:09 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Friday, 07 March 2008

While trying to remove Polyserve Hotfix 4, I've consistently run into a problem with Add or Remove Programs not displaying any programs.

Polyserve Hotfix 4 must be removed before applying Hotfix 5.  Open up Control Panel, Add / Remove Programs, and nothing.  The list just never populates.  Reboot, reboot, nothing.  Out of frustration, and knowing the uninstall is just a program, I searched the registry for the uninstall string, cut and pasted it out, and ran it from the command line.  See the picture below for the registry entry.  I'm not sure if the GUID is the same for all installations, so always check.  The string I ran was:

 

 MsiExec.exe /I{6D67BBA1-ECB3-4FBF-80A3-A3A34F57CE89}

 

Worked well, I still have no idea why the list won't populate, but I've had this issue on several windows 2003, 64 bit, sp2 machines running polyserve.  It makes applying a hotfix frustrating.
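
Since the GUID may differ between installations, here is a small Python sketch that pulls the product code out of whatever UninstallString you find and builds an explicit uninstall command from it. Note the assumption: the string I ran used /I, which opens the installer's maintenance mode, whereas this sketch builds the /x (uninstall) form of msiexec instead. The function names are made up for illustration, and nothing here reads the registry or runs anything.

```python
import re

# Matches an MSI product code such as {6D67BBA1-ECB3-4FBF-80A3-A3A34F57CE89}
GUID_RE = re.compile(r"\{[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}\}")

def product_code(uninstall_string):
    """Return the product code GUID found in an UninstallString, or None."""
    match = GUID_RE.search(uninstall_string)
    return match.group(0) if match else None

def uninstall_command(uninstall_string):
    """Build an msiexec /x command line from the product code in the string."""
    code = product_code(uninstall_string)
    if code is None:
        raise ValueError("no product code found in: " + uninstall_string)
    return "MsiExec.exe /x " + code

command = uninstall_command("MsiExec.exe /I{6D67BBA1-ECB3-4FBF-80A3-A3A34F57CE89}")
```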

 

The Add or Remove Programs screen never populates.

Registry entry which shows string to run to uninstall, UninstallString.

Friday, 07 March 2008 10:01:09 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#
Tuesday, 04 March 2008

Want Fries with that ?

Experienced a Polyserve failover today, root cause has been very difficult to flush out. 

One lucky hit was that our SolarWinds monitor was reporting 80% packet loss on the server !

Since we are using server based fencing we wouldn't expect a crash dump.  The problem is that the node is fenced (power_cycled) before the crash dump occurs.  Unfortunately, there is really no way to determine the cause of the crash without a Memory.dmp.  But, one of the side effects of TOE being enabled on the servers is a blue screen. 

TOE, what's that ?

This stands for TCP/IP offload engine. Modern NICs have the ability to offload the processing of network transmission. This allows the CPU to focus on its other responsibilities. This feature is dependent on the  OS supporting the feature. SP2 supports TOE. This is the first iteration of Windows that does so. It definitely has advantages, but we have seen some issues in other areas so turning it off may help solve the problem.

Of course like any good new technology, it doesn't always work right and the side effect of TOE is a blue screen, nice !

Review the attached KB Article on TOE from HP:  TOE_KB.htm (5.76 KB) 

At the same time, we're still working on issues with the CPU synchronization errors in the SQL Server error log:

The time stamp counter of CPU on scheduler id 14 is not synchronized with other CPUs.
The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.

See Microsoft KB Article:  http://support.microsoft.com/default.aspx?scid=kb;EN-US;931279

Now, I thought this error only affected the AMD chipsets, but this particular machine is an Intel 4-way quad core, so 16 processors.  On the AMD chipsets we modified the boot.ini file with /usepmtimer, but on this Intel box we had to go to the BIOS settings.

Hopefully after changing the BIOS power settings to maximum always, we'll see no more failovers and no more synchronization issues.
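
For the AMD boxes, the boot.ini change is just a matter of appending /usepmtimer to each operating system entry. A rough Python sketch of that text edit is below. The sample boot.ini content is a made-up example of the usual layout, and the sketch works on a string only; editing the real file (which is normally hidden and read-only) is left out on purpose.

```python
def add_usepmtimer(boot_ini_text):
    """Append /usepmtimer to each entry under [operating systems] that lacks it."""
    out = []
    in_os_section = False
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only OS entries get the switch
            in_os_section = stripped.lower() == "[operating systems]"
        elif in_os_section and "=" in stripped and "/usepmtimer" not in stripped.lower():
            line = line.rstrip() + " /usepmtimer"
        out.append(line)
    return "\n".join(out)

# Hypothetical boot.ini content for illustration
sample = (
    "[boot loader]\n"
    "timeout=30\n"
    "default=multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS\n"
    "[operating systems]\n"
    'multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="Windows Server 2003" /fastdetect\n'
)
patched = add_usepmtimer(sample)
```

Running the function a second time on `patched` changes nothing, so it is safe to apply to a machine that already has the switch.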

Tuesday, 04 March 2008 12:45:01 (Central Standard Time, UTC-06:00) | Comments [1] | Polyserve | SQL Server#
Wednesday, 06 February 2008

Polyserve clusters SQL Server instances, and can then rehost them or fail them over to other servers.

One of the issues we've run into with Polyserve is when we move a SQL Server instance from one machine to another and the binaries change location from mssql.a to mssql.b.

SQL Agent (SQLAgent.exe, SQLAgent) utilizes a system table in msdb, called syssubsystems, for starting and executing jobs.

Execute: select * from msdb.dbo.syssubsystems and you will see each sub-system of SQL Agent and the path to its binaries.  If a SQL Server instance is moved through some technology like Polyserve, or by manually moving an msdb database from one server to another, and the path to the binaries changes, the system table will need to be updated to the new path.

We've experienced this both on Polyserve and when we moved a SQL Server instance from one server to another and copied msdb.

It is very easy to fix, but frustrating.  Below is a snippet which can be run to correct the issue:

--Get SQLBinRoot (current location of the instance binaries)
declare @ret sysname
exec master..xp_instance_regread 'HKEY_LOCAL_MACHINE','Software\Microsoft\MSSQLServer\Setup','SQLBinRoot',@ret OUTPUT

--Update subsystem_dll to current SQLBinRoot
update sub
set subsystem_dll=@ret+substring(subsystem_dll,charindex('\binn\',subsystem_dll)+5,30)
from msdb..syssubsystems sub
where charindex('\binn\',subsystem_dll)>0

--Stop Sqlagent service if running
--MxDB should restart it automatically
declare @service sysname
select @service = case when charindex('\',@@servername)>0
then N'SQLAgent$'+@@servicename
else N'SQLSERVERAGENT' end

create table #stat(status sysname)
insert #stat
exec master..xp_servicecontrol N'QUERYSTATE', @service

if exists(select * from #stat where status='Running.')
begin
exec master..xp_servicecontrol N'STOP', @service
end

--Clean up tmp table
if object_id('tempdb..#stat') is not null
drop table #stat
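
To make the string handling in the update statement easier to follow, here is the same path rewrite as a small Python sketch. The two paths are hypothetical examples, not values from our servers, and lower-casing before the search is my assumption to mimic a case-insensitive collation. The logic mirrors the T-SQL: find '\binn\', keep everything from its trailing backslash onward (up to 30 characters), and prefix it with the new SQLBinRoot.

```python
def rebased_dll(subsystem_dll, sql_bin_root):
    """Return subsystem_dll re-rooted under sql_bin_root, mirroring the T-SQL update."""
    marker = "\\binn\\"
    pos = subsystem_dll.lower().find(marker)
    if pos < 0:
        # Same as the WHERE clause: rows without \binn\ are left untouched
        return subsystem_dll
    # Start at the trailing backslash of the marker, like charindex(...)+5
    start = pos + len(marker) - 1
    tail = subsystem_dll[start:start + 30]
    return sql_bin_root + tail

# Hypothetical old DLL path and new SQLBinRoot value
old_path = r"C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\binn\SQLATXSS90.DLL"
new_root = r"C:\Program Files\Microsoft SQL Server\MSSQL.2\MSSQL\Binn"
new_path = rebased_dll(old_path, new_root)
```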

 

Wednesday, 06 February 2008 09:49:41 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve |  SQL Agent#
Monday, 28 January 2008

Oh why, why have the hotfixes become such a head ache, this one fought me quite a bit, but finally she gave in...

Also I usually extract the hotfix manually so I can review it, do this from a command prompt (remember to make the target directory).

c:\sqlserver2005-kb943656-x64-enu.exe /x:c:\sqlhotfix\

The errors seen with 3215 are listed below (some of these we have seen with other hotfixes as well).

  1. MSP Error: 29511  Failure creating local group IUSR_DEVPLYSQL01
  2. MSP Error: 29528  The setup has encountered an unexpected error while Setting Internal Properties. The error is: Fatal error during installation.
  3.  MSP Error: 29534  Service 'MSSQL$QAEAM' could not be started. Verify that you have sufficient privileges to start system services. The error code is (52) You were not connected because a duplicate name exists on the network. Go to System in Control Panel to change the computer name and try again.
  4. MSP Error: 29538  SQL Server Setup did not have the administrator permissions required to rename a file: f:\MSSQL\DATA\distmdl1.ldf. To continue, verify that the file exists, and either grant administrator permissions to the account currently running Setup or log in with an administrator account. Then run SQL Server Setup again.
  5. MSP Error: 29537  SQL Server Setup has encountered the following problem: [Microsoft][SQL Native Client][SQL Server]Cannot find the object 'dm_exec_query_resource_semaphores', because it does not exist or you do not have permission.. To continue, correct the problem, and then run SQL Server Setup again.
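
When the same install fails on several nodes, it helps to pull just the error numbers out of the logs. The Python sketch below does that for lines shaped like the ones listed above ("MSP Error: number, then the message"). That shape is the only thing it assumes; the real log files contain much more, and the sample text is trimmed from the list above.

```python
import re

# "MSP Error:" followed by the numeric code and the message text
MSP_RE = re.compile(r"MSP Error:\s*(\d+)\s+(.*)")

def msp_errors(log_text):
    """Return a list of (error number, message) pairs found in the log text."""
    found = []
    for line in log_text.splitlines():
        match = MSP_RE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found

sample_log = (
    "MSP Error: 29511  Failure creating local group IUSR_DEVPLYSQL01\n"
    "some unrelated line\n"
    "MSP Error: 29528  The setup has encountered an unexpected error while Setting Internal Properties.\n"
)
errors = msp_errors(sample_log)
codes = [code for code, _ in errors]
```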

1st failure was a really strange one:

MSP Error: 29511  Failure creating local group IUSR_DEVPLYSQL01

As I didn't need IIS on this machine, and the account already existed, I deleted it. That was a mistake, and it led to the 2nd failure.

2nd failure was on:

MSP Error: 29528  The setup has encountered an unexpected error while Setting Internal Properties. The error is: Fatal error during installation.

This turned out to be a registry issue with deleting items, while not specifically addressing CU 5, it did fix the issue: (http://support.microsoft.com/kb/925976).

For a stand-alone installation of SQL Server 2005

1. Remove the following registry subkeys that store SID settings:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Setup\SQLGroup
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Setup\AGTGroup
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Setup\FTSGroup
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.X\Setup\ASGroup
Note In these registry subkeys, MSSQL.X is a placeholder for the corresponding value on a specific system. You can determine MSSQL.X on a specific system by examining the value of the MSSQLSERVER registry entry under the following registry subkey:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\Instance Names\SQL\
2. Reinstall the SQL Server 2005 service pack or the SQL Server 2005 hotfix package.

Finally after all that it installed, as easy as could be !

---------------------------------------

Experienced this error too:

 MSP Error: 29534  Service 'MSSQL$QAEAM' could not be started. Verify that you have sufficient privileges to start system services. The error code is (52) You were not connected because a duplicate name exists on the network. Go to System in Control Panel to change the computer name and try again.

-------------

The above error is caused by Polyserve modifying the TCP/IP stack; I corrected it with a change to the networking settings in the registry.  I've had some questions on what I corrected.  Basically, under Polyserve the virtual IP "floats" to whatever node is active.  If for some reason it is moved from one node to another and not removed from the previous node, you run into an issue where that other node will not start, and you'll receive some very nice errors like these in the SQL Server errorlog:

2008-05-10 20:52:42.83 Server      A self-generated certificate was successfully loaded for encryption.
2008-05-10 20:52:42.85 Server      Error: 26024, Severity: 16, State: 1.
2008-05-10 20:52:42.85 Server      Server failed to listen on 10.10.48.40 <ipv4> 40020. Error: 0x2741. To proceed, notify your system administrator.
2008-05-10 20:52:42.85 Server      Error: 17182, Severity: 16, State: 1.
2008-05-10 20:52:42.85 Server      TDSSNIClient initialization failed with error 0x2741, status code 0xa.
2008-05-10 20:52:42.85 Server      Error: 17182, Severity: 16, State: 1.
2008-05-10 20:52:42.85 Server      TDSSNIClient initialization failed with error 0x2741, status code 0x1.
2008-05-10 20:52:42.85 Server      Error: 17826, Severity: 18, State: 3.
2008-05-10 20:52:42.85 Server      Could not start the network library because of an internal error in the network library. To determine the cause, review the errors immediately preceding this one in the error log.
2008-05-10 20:52:42.85 Server      Error: 17120, Severity: 16, State: 1.
2008-05-10 20:52:42.85 Server      SQL Server could not spawn FRunCM thread. Check the SQL Server error log and the Windows event logs for information about possible related problems.

This error happens because you can't have two sql server instances use the same IP Address, makes perfect sense when you think through it.
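
The hex code in the "failed to listen" line is worth decoding. 0x2741 is 10049 in decimal, which in the Winsock error table is WSAEADDRNOTAVAIL (the requested address is not valid in its context). That seems to fit the picture: the virtual IP is live on another node, so this node cannot bind to it. A small Python sketch that pulls the address, port, and decoded error out of errorlog text follows; the lookup table only holds the two codes I'm sure of, and the function name is made up.

```python
import re

# Two well-known Winsock codes; anything else is reported as "unknown"
WINSOCK = {10048: "WSAEADDRINUSE", 10049: "WSAEADDRNOTAVAIL"}

LISTEN_RE = re.compile(
    r"Server failed to listen on (\S+) <ipv4> (\d+)\. Error: (0x[0-9A-Fa-f]+)"
)

def listen_failures(errorlog_text):
    """Return (ip, port, decimal error, name) for each 'failed to listen' line."""
    results = []
    for match in LISTEN_RE.finditer(errorlog_text):
        code = int(match.group(3), 16)  # convert hex such as 0x2741 to decimal
        results.append(
            (match.group(1), int(match.group(2)), code, WINSOCK.get(code, "unknown"))
        )
    return results

line = (
    "2008-05-10 20:52:42.85 Server      Server failed to listen on "
    "10.10.48.40 <ipv4> 40020. Error: 0x2741. To proceed, notify your system administrator."
)
failures = listen_failures(line)
```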

The real problem now that you have root cause, is how to fix it.  Polyserve performs some complex tasks with "swapping" registry entries and moving instances, this is not an easy issue to correct and you must be careful.  These errors rarely happen on version 3.6, but were pretty common on 3.4. 

I'd advise engaging HP / Polyserve support to correct the issue. Then, as you become familiar with the SQL registry entries and how Polyserve works, you can correct this quite easily yourself.

Depending on the scope of the problem you may only need to focus on the SuperSocketNetLib IP address entries.  The server that won't start will most likely contain the virtual IP, and there will be another node in your cluster already running the IP.  Sometimes you can edit IP1, IP2 and IP3 to just reflect your Public, Private and loopback IP addresses.  But before starting SQL, double check where your SQL instances are "pointing". If they are pointing to the virtual root, then you have a bigger registry issue, and you must correct that too.

Very much fun !

-----------

Monday, 28 January 2008 22:36:07 (Central Standard Time, UTC-06:00) | Comments [1] | Polyserve | SQL Server#
Sunday, 27 January 2008

4 - Way Quad Core, 16 Processors, DONE.

Tonight is the fail over from 4 way dual core to 4 way quad core.

Interesting subtle changes as well: the dual core is AMD NUMA architecture, the quad core is Intel, non-NUMA.

Interesting to see, how the machine handles the load.  The current SQL Server is maxed out, all cpu's hitting 100% for sustained times, 4-6 hours.  Very little locking and blocking or disk i/o, all waits are on the SOS_SCHEDULER_YIELD.

Very difficult to recreate this issue or test in the lab, so the results are due tomorrow !

exhaustion.

 

Sunday, 27 January 2008 17:18:43 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#

Been fighting this issue where a SQL Server 2005 instance hosted on polyserve would not allow the password policy enforcement.

Something of a fix:  http://support.microsoft.com/kb/926642/en-us

Sunday, 27 January 2008 17:14:40 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#
Thursday, 03 January 2008

Chasing 4 problems.

  1. Polyserve failing / restarting SQL Instances due to no network traffic.
  2. Incorrect durations reporting in SQL Server traces (very large numbers).
  3. Information message logged in SQL Server 2005 log file, The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.
  4. Alert from Monitoring on, Windows cannot obtain the domain controller name for your computer network. (An unexpected network error occurred.). Group Policy processing aborted.

I've been chasing these errors without much support from network or other engineers. It is basically the same as the mechanic who tells you the engine light is on, but all is OK !

Well finally found a customer advisory from HP and all 4 problems roll up to the same issue:

----------------------------

SUPPORT COMMUNICATION - CUSTOMER ADVISORY

Document ID: c01075682

Version: 2
Advisory: (Revision) HP ProLiant Servers Using Dual-Core or More Than One Single-Core AMD Opteron Processor May Experience Incorrect Operating System Time When Running Systems That Use the System Time Stamp Counter
NOTICE: The information in this document, including products and software versions, is current as of the Release Date. This document is subject to change without notice.

Release Date: 2007-07-16

Last Updated: 2007-07-16


DESCRIPTION

Document Version    Release Date    Details
2                   07/16/2007      Added Sun Solaris information.
1                   06/08/2007      Original Document Release.

HP ProLiant servers configured with Dual-Core or with more than one single-core AMD Opteron processor may encounter Time Stamp Counter (TSC) drift in certain conditions. The TSC is used by some operating systems as a timekeeping source. Each processor core, whether it is a single-core processor or a dual-core processor, includes a TSC. The condition where the TSC for different processor cores becomes unsynchronized is known as TSC drift.

Note : The potential for TSC drift if the proper recommendations are not applied when using AMD Opteron 200-series, Opteron 800-series, Opteron 1200-series, Opteron 2200-series and Opteron 8200-series processors is not specific to HP ProLiant servers.

Whether or not the system is affected by TSC drift depends on the specific ProLiant server generation, the number and type of AMD Opteron processors installed, the operating system, and whether the AMD PowerNow! feature is being utilized. TSC drift can result in different symptoms and behaviors based on the operating system environment, as detailed below:

Microsoft Windows Server 2003
This condition affects operations such as network communications and performance monitoring tasks that are sensitive to system time. For example, Microsoft Active Directory domain controllers can report an Unexpected Network Error (Event ID 1054) with the following description:

Event Description:
Windows cannot obtain the domain controller name for your computer network. (An unexpected network error occurred.). Group Policy processing aborted.

In addition, a negative PING time or larger than actual PING time may be returned after issuing the PING command. The negative PING time occurs because of a Time Stamp Counter drift occurring on AMD Opteron platforms which include more than one processor core.

Red Hat Enterprise Linux, SUSE Linux Enterprise Server and Sun Solaris
Earlier releases of Red Hat Enterprise Linux 4, SUSE Linux Enterprise Server 9 and Sun Solaris 10 will default to using the Time Stamp Counter as the default time source for gettimeofday() calls. When the time stamp counter is used, the server may exhibit some inconsistent timekeeping and the following symptoms may be observed:

When a command such as "date" is typed, an incorrect system time may be displayed.
The kernel may report an error similar to the following:
kernel: Your time source seems to be instable or some driver is hogging interrupts

Newer operating systems typically do not use the TSC by default if other timers are available in the system which can be used as a timekeeping source. Other available timers include the PM_Timer and the High Precision Event Timer (HPET). All HP ProLiant servers include the PM_Timer, and the latest generation of HP ProLiant servers supporting AMD Opteron 2200-series and 8200-series processors support HPET. These timers are not affected by this condition. New operating systems such as Red Hat Enterprise Linux (RHEL) 5, SUSE Linux Enterprise Server (SLES) 10, and Microsoft Windows Server 2008 (codename Longhorn) are not affected by this issue.

Note: Some applications (e.g., Microsoft SQL Server 2005) use the Time Stamp Counter even though the operating system is configured to use a different timer as the timekeeping source. To determine if a specific application uses the TSC as the timekeeping source, contact the software vendor.

SCOPE

Any HP ProLiant server configured with more than one single-core AMD Opteron processor or configured with one (or more) dual-core AMD Opteron processors running the following operating systems:

Microsoft Windows Server 2003 (any edition)
Microsoft Windows Server 2003 x64 Edition (any edition)
Red Hat Enterprise Linux 4(x86) or earlier
Red Hat Enterprise Linux 4 (AMD64/EM64T) or earlier
SUSE Linux Enterprise Server 9 32-bit (x86) or earlier
SUSE Linux Enterprise Server 9 64-bit (AMD64/EM64T) or earlier
Sun Solaris 9
Sun Solaris 10 3/05 (32/64 bit)
VMware ESX Server 2.5.4 (or earlier)

Note: VMware ESX Server 2.5.4 with the January 2007 (or later) patch is not affected. VMware ESX Server 3.0.0 (or later) uses an alternate mechanism for timekeeping and is not affected by the potential TSC drift.

Note : The issue does not affect systems with only one single-core processor installed.

The following servers are affected when running an affected operating system:

HP ProLiant BL465c Blade Server
HP ProLiant BL685c Blade Server
HP ProLiant BL25p G2 server
HP ProLiant BL45p G2 server
HP ProLiant DL145 G3 server
HP ProLiant DL385 G2 server
HP ProLiant DL585 G2 server
HP ProLiant DL365 server
HP ProLiant ML115 server

The following servers are affected ONLY when using the AMD PowerNow! feature and running an affected operating system:

ProLiant BL25p Blade Server
HP ProLiant BL45p Blade Server
HP ProLiant DL145 G2 server
HP ProLiant DL385 server
HP ProLiant DL585 server

The following operating systems are not affected by TSC drift because these operating systems do not use the TSC as a timekeeping source:

Microsoft Windows Server 2008 (codename Longhorn)
Red Hat Enterprise Linux 5 (x86)
Red Hat Enterprise Linux 5 (AMD64/EM64T)
SUSE Linux Enterprise Server 10 (x86)
SUSE Linux Enterprise Server 10 (AMD64/EM64T)
VMware ESX Server 3.0.0 (or later)

RESOLUTION

To ensure proper operation of tasks sensitive to system time, perform either of the following actions, based on the operating system environment:

Microsoft Windows Server 2003 (any edition)
Edit the BOOT.ini file and add the parameter "/usepmtimer," then reboot the server. Adding the "/usepmtimer" parameter to the BOOT.INI file configures the Windows operating system to use the PM_TIMER, rather than the Time Stamp Counter.

Note: When installing the AMD Opteron Processor with AMD PowerNow! Technology driver Version 1.3.2.16 (or later) from AMD, the BOOT.INI file will automatically be updated with the "/usepmtimer" parameter. While the driver itself does not resolve this issue, the installation process will make the necessary changes to the BOOT.INI file to prevent the issue from occurring.

Red Hat Enterprise Linux 4 or SUSE Linux
Add the boot parameter "clock=pmtmr" to the /boot/grub/menu.lst file. Adding the "clock=pmtmr" to the /boot/grub/menu.lst file configures the operating system to use the PM_TIMER, rather than the Time Stamp Counter.

Sun Solaris
If using Sun Solaris 10 3/05 apply the 1/06 (Update 1) Patch (or later). To locate the latest version of the Solaris 10 patch, click on the following Sun Microsystems URL, and click on the desired patch:

http://www.sun.com/downloads

VMware
If using VMware ESX Server 2.5.4, update to the January 2007 Patch (or later). To locate the latest version of the ESX Server 2.5.4 patch, click on the following VMware URL, and click on the desired patch.

http://www.vmware.com/download/esx/esx2_patches.html#c4317

RECEIVE PROACTIVE UPDATES : Receive support alerts (such as Customer Advisories), as well as updates on drivers, software, firmware, and customer replaceable components, proactively via e-mail through HP Subscriber's Choice. Sign up for Subscriber's Choice at the following URL:
http://www.hp.com/go/myadvisory

SEARCH TIP : For hints on locating similar documents on HP.com, refer to the Search Tips document: http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00638154 .
To search for additional advisories related to System Time, use the following search string:
+ProLiant +Advisory +System Time
KEYWORDS: time sync, clock, track time

-------------------------------------

http://support.microsoft.com/kb/931279/en-us 
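
To see why drift between cores produces the negative or oversized PING times and the huge trace durations, here is a toy Python model. It is purely illustrative: each core's counter is modeled as the true time plus a fixed offset, and the tick values are invented. The point is only that a start reading taken on one core and an end reading taken on another can differ by far more, or far less, than the time that actually passed.

```python
def measured_elapsed(true_elapsed_ticks, start_core_offset, end_core_offset):
    """Elapsed ticks as computed when start and end are read on different cores.

    Toy model: each core's counter equals true time plus that core's offset.
    """
    start_reading = 0 + start_core_offset
    end_reading = true_elapsed_ticks + end_core_offset
    return end_reading - start_reading

# The true elapsed time is 1000 ticks in every case below
same_core = measured_elapsed(1000, 0, 0)         # both readings on one core
went_negative = measured_elapsed(1000, 5000, 0)  # started on a core running ahead
too_large = measured_elapsed(1000, 0, 5000)      # ended on a core running ahead
```

With both readings on the same core the result is the true 1000 ticks; with a 5000-tick offset between cores it comes out as -4000 or 6000 depending on which core took which reading.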


 

 

Thursday, 03 January 2008 14:55:22 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve | SQL Server#

When using Polyserve clustering technology from HP, you need to ensure Automount is turned off.

This is done using diskpart from a command prompt.

DiskPart

Automount disable

If this is not disabled you end up with "phantom" drives, which can cause problems with everything from virus scanning and backup software to Windows itself !  Basically what happens is the SAN Administrator zones new storage, somewhere along the way the server reboots (got to love that!), and then you have a new drive letter. It is a "phantom", and when NetBackup, virus scanning or some other program attempts to access the new volume through its drive letter, the software abends....nice !

Removing the drive letter is usually a straightforward process of using Disk Management or Veritas Enterprise Disk Manager, but I've had issues with this and had to drop to a command prompt and use DiskMount to remove it.

Thursday, 03 January 2008 12:35:20 (Central Standard Time, UTC-06:00) | Comments [0] | Polyserve#