Our Polyserve cluster took a nosedive and crashed, all nodes. The root cause is still under investigation, but basically we zoned some new storage to the cluster, and after a reboot of the nodes the Polyserve software was unable to read or write to the membership partitions. Of course the error didn't state that, as that would have made troubleshooting the problem easier. Instead we received this error:
Event Type: Error
Event Source: sanpulse
Event Category: SAN Storage
Event ID: 17005
Date: 7/2/2008
Time: 9:35:40 AM
User: N/A
Computer: BCPLYSQL03
Description:
This matrix is unable to take control of SAN because the servers are unable to perform fencing operations, possibly due to a networking or fencing hardware failure or misconfiguration. As a result, some or all filesystem operations may be paused throughout the matrix. In addition, filesystem mounts and unmounts and disk imports and deports cannot be performed.
We have zoned storage to Polyserve many, many times and never had a stability issue. We've had isolated issues with LUNs not showing up, miniport / Storport issues, and Emulex issues, but nothing that caused the cluster to become unstable.
So we eventually de-zoned the new storage and rebooted the entire cluster, and everything worked fine. We're not sure whether we zoned the storage incorrectly (we have a new SAN Administrator, so maybe it wasn't done correctly), though I don't suspect this. Our SAN Administrator, while new, has successfully zoned storage to our clusters in the past with no issues, and understands what Polyserve is and how it works.
Rather, I suspect some internal issue in Windows / Emulex / PowerPath or something similar that, upon the zoning of the new storage, caused the LUN IDs to change and map incorrectly.
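One way to test that theory next time would be to record which LUN ID each device has before zoning and again after the reboot, then compare the two. Below is a minimal sketch of that comparison in Python. It assumes you have already collected each snapshot yourself (from the HBA utility, PowerPath output, or by hand) as a dictionary of device WWN to LUN ID. The function name diff_lun_maps and the sample WWNs and IDs are made up for illustration and are not from our cluster.

```python
def diff_lun_maps(before, after):
    """Compare two {wwn: lun_id} snapshots.

    Returns three dicts:
      changed: wwn -> (old_lun_id, new_lun_id) for devices whose LUN ID moved
      added:   wwn -> lun_id for devices only present after zoning
      removed: wwn -> lun_id for devices that disappeared after zoning
    """
    common = before.keys() & after.keys()
    changed = {w: (before[w], after[w]) for w in common if before[w] != after[w]}
    added = {w: after[w] for w in after.keys() - before.keys()}
    removed = {w: before[w] for w in before.keys() - after.keys()}
    return changed, added, removed


# Hypothetical snapshots: the WWNs and LUN IDs below are invented sample data.
before_zoning = {
    "wwn-membership-1": 0,
    "wwn-membership-2": 1,
    "wwn-data-1": 2,
}
after_zoning = {
    "wwn-membership-1": 0,
    "wwn-membership-2": 3,   # LUN ID shifted after the new storage was zoned
    "wwn-data-1": 2,
    "wwn-new-storage": 1,    # new LUN landed on an ID that was already in use
}

changed, added, removed = diff_lun_maps(before_zoning, after_zoning)
for wwn, (old, new) in sorted(changed.items()):
    print(f"{wwn}: LUN ID changed from {old} to {new}")
for wwn, lun in sorted(added.items()):
    print(f"{wwn}: new device at LUN ID {lun}")
for wwn, lun in sorted(removed.items()):
    print(f"{wwn}: device missing (was LUN ID {lun})")
```

If a membership partition shows up in the changed list, that would at least support the idea that the new storage shuffled the LUN IDs, rather than the zoning itself being wrong.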