Better known as, "Polyserve is puking and you get to clean it up!". Technically it's not fair to blame Polyserve, as the root cause of this issue appears to be a corrupt registry left over from version 3.4 that was never properly corrected. We've never experienced an issue like this with SQL instances originally installed under version 3.6.
When you have a SQL Server instance in Polyserve that will not fail back to its primary, you know you have a problem. Best bet: call technical support [that's what maintenance is for!], but here is what happened to us recently and what we did to correct the issue [we have seen this before, and technical support confirmed this is the permanent fix].
Automated patching of the development environment caused some SQL instances to fail over, which is expected. We do not run the instances in "auto fail back" mode, preferring to complete this step manually to minimize "ping-ponging" instances. After patching we reviewed the environment, and it looked good with the exception of one instance: it is on its secondary, it is running, it is available, but notice the status of "warning".
There is no nice error message in the console. Right-clicking the instance and selecting "show alerts" displays nothing. What gives?
Who cares, right? Just move the instance back to the primary and let's get working... no dice. The instance won't move, and NO error or message is given. Crap, you know you're in trouble now. For the newly Polyserve-initiated, this is when you STOP and call technical support. The more you play with things, the worse it will get, and it will cause technical support consternation in correcting the issue. Since this was development, I get to play...
The instance will not fail back; rehosting produces no change, and disabling the instance gives the same result. I'm pretty sure you could reboot the server and that would cause it to fail back to the primary. As a matter of fact, I'm positive... but what if you have other instances on that same physical server and you can't afford an outage? Also, a reboot may temporarily correct the issue, but it doesn't address the root cause, and future patching or failover scenarios may result in the same condition.
What the hell is happening?
Finally, digging through all the logs (including the Polyserve logs), I find that Polyserve still thinks the instance is "starting", and since it is in "starting mode", it does not fail it over when requested. There needs to be some way of overriding this stupid behavior, but it is "by design". See pic:
What and why? Based on past experience, and knowing this instance existed from version 3.4 through 3.6, I know this deals with the registry issues. A quick peek at the registry on the server currently hosting the instance shows that some of the registry entries point to the virtual root (3.6) and some point to mxshells (3.4). Notice the registry entries below; they should all point to the virtual root:
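The check above amounts to scanning the instance's registry values for anything still referencing the old 3.4-style mxshells path. Here is a minimal sketch of that idea. To be clear, this is an illustration, not PolyServe tooling: the value names and paths below are hypothetical examples, and the values are supplied as plain strings (as you might see them in a registry export) rather than read from the live registry.

```python
# Sketch: flag SQL instance registry values that still point at the old
# 3.4-style "mxshells" location instead of the 3.6 virtual root.
# Value names and paths are hypothetical examples for illustration.

def find_stale_entries(entries, stale_marker="mxshells"):
    """Return the registry values whose data still references the old path."""
    return {name: value
            for name, value in entries.items()
            if stale_marker.lower() in value.lower()}

# Example values as they might appear in an export of the instance's keys:
exported = {
    "SQLDataRoot":  r"V:\virtualroot\MSSQL\Data",               # good: virtual root
    "SQLBinRoot":   r"C:\mxshells\inst1\MSSQL\Binn",            # stale: 3.4 mxshell
    "ErrorLogFile": r"C:\mxshells\inst1\MSSQL\LOG\ERRORLOG",    # stale: 3.4 mxshell
}

stale = find_stale_entries(exported)
for name in sorted(stale):
    print(f"STALE: {name} -> {stale[name]}")
```

In our case, any value still pointing at mxshells is a leftover from the 3.4 install and is exactly what confuses Polyserve into thinking the instance is perpetually "starting".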
Now, the correct way to fix this is to delete the SQL instance from Polyserve (not from the machines), verify the registry entries and the SQL instance on each machine, delete any Polyserve sql.original, sql.preg, etc. files (make a copy first), then re-virtualize the instance and re-verify everything.
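The "make a copy first, then delete" step for the leftover Polyserve files can be sketched as follows. This is purely an illustration under stated assumptions: it stages dummy sql.original/sql.preg files in a throwaway temp directory instead of touching a real instance directory, and any file names beyond those two are yours to fill in for your environment.

```python
# Sketch of "copy first, then delete" for leftover Polyserve files
# (sql.original, sql.preg, etc.). For illustration this runs against a
# temporary directory with dummy files, not a real instance directory.
import shutil
import tempfile
from pathlib import Path

def backup_and_remove(instance_dir: Path, names=("sql.original", "sql.preg")):
    """Copy each matching file into a backup folder, then delete the original."""
    backup_dir = instance_dir / "polyserve_backup"
    backup_dir.mkdir(exist_ok=True)
    removed = []
    for name in names:
        src = instance_dir / name
        if src.exists():
            shutil.copy2(src, backup_dir / name)  # copy first...
            src.unlink()                          # ...then delete
            removed.append(name)
    return backup_dir, removed

# Demo against a throwaway directory with dummy files:
demo = Path(tempfile.mkdtemp())
for name in ("sql.original", "sql.preg"):
    (demo / name).write_text("dummy")

backup_dir, removed = backup_and_remove(demo)
print("removed:", removed)
```

The point of the copy-first discipline is that if re-virtualization goes sideways, you still have the original Polyserve files to hand back to technical support.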
Obviously, if this is a production instance, you may have to defer doing this, as it is time-consuming. In that case you can manually stop the services and see if you can get things to fail back, though you may have to reboot the server. At some point, the only way to correct the root cause is to find a maintenance window that allows you to delete the instance from Polyserve, correct each individual SQL instance, and re-virtualize. Fun!