ericr Posted October 14, 2018
We will be having a short one-to-three-minute outage of S4 to complete a live migration.
ericr Posted October 14, 2018
This stage is complete. We will have another short outage to revert the move later this evening.
ericr Posted October 14, 2018
We will be doing the second outage of the night at 12:20 PM.
ericr Posted October 14, 2018
The work is complete for the evening.
ericr Posted October 14, 2018
I am taking S4 down immediately to resolve an issue that is occurring as a result of the physical server.
ericr Posted October 14, 2018
S4 is booted. Your sites should be back online shortly.
Michael D. Posted October 14, 2018
As the S4 server has been migrated to different hardware that isn't experiencing issues, we do not anticipate further problems. At this time the hardware that S4 was on previously is no longer providing services to any clients, and we're working with Dell to have the hardware replaced under warranty.
Michael D. Posted October 14, 2018
A few clients have reached out asking why the system did not automatically fail over the S4 server to another piece of hardware and why we had to intervene manually. I'm going to provide the same answer here as I did in the tickets so that we're being clear and transparent.

The failure of the host wasn't total: the server was online but degraded. The long and short of it is that the motherboard in the host system is failing in an interesting way. It is corrupting data read from 3 of the 24 memory modules in the server, all 3 of them among the 12 attached to CPU1. Not all VMs on the host were affected, and the host was not entirely down. S4 was most affected because the failing RAM was in use by that server. We use error-correcting code (ECC) memory, which can detect memory errors and correct them automatically, but we have verified that the RAM modules themselves are not the cause of the issues.

If the hardware had failed in a way that took the services 100% offline, they would have come back up on another machine automatically. Because the host server was still online, the automatic system did not migrate the guest servers. We intervened manually, as we monitor the servers closely. We were aware of the issue and working on it in less than 60 seconds.

To avoid file system corruption, data loss, and similar problems, we elected to bring the service down gracefully and bring it back up on another machine, which took more time than forcefully killing the machine and bringing it back online. More time, but less risk. We could always forcefully kill the guests and bring them back online on another machine faster, but killing a server risks data damage and corruption, particularly to MySQL data, so we try to avoid that whenever possible.
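For readers who want the failover reasoning above in a more concrete form, here is a minimal Python sketch of the decision described in the post. It is an illustration only and not the provider's actual tooling: the names `HostStatus`, `classify`, and `choose_action`, the heartbeat flag, and the hardware error counter are all hypothetical. The only assumption taken from the post is the rule itself: a host that is fully offline triggers automatic failover, while a host that is online but degraded does not and needs a manual, graceful migration.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # host still responds, but hardware is misbehaving
    OFFLINE = "offline"    # host no longer responds at all


class Action(Enum):
    NONE = "none"
    AUTOMATIC_FAILOVER = "automatic_failover"
    MANUAL_GRACEFUL_MIGRATION = "manual_graceful_migration"


@dataclass
class HostStatus:
    # Both fields are hypothetical monitoring inputs used for illustration.
    heartbeat_ok: bool    # True while the host answers health checks
    hardware_errors: int  # e.g. count of reported memory errors


def classify(status: HostStatus) -> HostState:
    """Map raw monitoring inputs to a coarse host state."""
    if not status.heartbeat_ok:
        return HostState.OFFLINE
    if status.hardware_errors > 0:
        return HostState.DEGRADED
    return HostState.HEALTHY


def choose_action(state: HostState) -> Action:
    """Only a total failure triggers automatic failover in this sketch."""
    if state is HostState.OFFLINE:
        return Action.AUTOMATIC_FAILOVER
    if state is HostState.DEGRADED:
        # The host is still up, so an operator shuts guests down cleanly
        # and restarts them elsewhere instead of killing them.
        return Action.MANUAL_GRACEFUL_MIGRATION
    return Action.NONE


if __name__ == "__main__":
    degraded = HostStatus(heartbeat_ok=True, hardware_errors=3)
    print(choose_action(classify(degraded)).value)
```

In this sketch the S4 incident corresponds to the degraded case: the heartbeat is still fine, so nothing migrates automatically, and the slower graceful path is chosen to reduce the risk of data corruption.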