MDDHosting Forums

S4 temporary outage at 11:40PM ET 10/13



The S4 server has been migrated to different hardware that isn't experiencing issues, so we do not anticipate further problems. The hardware that S4 was previously on is no longer providing services to any clients, and we're working with Dell to have it replaced under warranty.


A few clients have reached out asking why the system did not automatically fail the S4 server over to another piece of hardware, and why we had to intervene manually.

 

I'm going to provide the same answer here as I did in the tickets so that we're being clear and transparent.

 

 

The failure of the host wasn't total. The server was online but degraded. The long and short of it is that the motherboard in the host system is failing in an interesting way: it's corrupting data read from 3 of the 24 memory modules in the server, all 3 of them among the 12 attached to CPU1. Not all VMs on the host were affected, as the host was not entirely down. S4 was the most affected because the failing RAM was in use by that server. We use error-correcting code (ECC) memory, which can detect memory errors and correct them automatically, but we have verified that the RAM modules themselves are not the cause of the issues.

 

If the hardware had failed in a way that took the services 100% offline, they would have come back up on another machine automatically. Because the host server was still online, the automatic system did not migrate the guest servers. We intervened manually, as we monitor the servers closely: we were aware of the issue in under 60 seconds and were already working on it.
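To illustrate the distinction, here is a minimal sketch in Python. It assumes a failover monitor whose only trigger is whether the host still answers a heartbeat. It is not the actual failover software used here, and the names `HostStatus`, `should_auto_failover`, and `needs_manual_attention` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """Hypothetical snapshot of a host as seen by a failover monitor."""
    heartbeat_ok: bool   # assumption: the host answers the cluster heartbeat
    memory_errors: int   # assumption: count of memory errors reported by the host


def should_auto_failover(status: HostStatus) -> bool:
    # Assumed rule: guests are restarted elsewhere only when the host is
    # completely unreachable. A degraded but reachable host does not qualify.
    return not status.heartbeat_ok


def needs_manual_attention(status: HostStatus) -> bool:
    # A host that is up but reporting memory errors falls outside the
    # automatic rule, so it has to be caught by monitoring and handled by staff.
    return status.heartbeat_ok and status.memory_errors > 0


degraded = HostStatus(heartbeat_ok=True, memory_errors=3)
dead = HostStatus(heartbeat_ok=False, memory_errors=0)

print(should_auto_failover(degraded), needs_manual_attention(degraded))
print(should_auto_failover(dead), needs_manual_attention(dead))
```

Under these assumptions, the degraded host never trips the automatic rule, which matches what happened: the host stayed reachable, so the guests stayed where they were until someone stepped in.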

 

To avoid file system corruption, data loss, and similar problems, we elected to bring the service down gracefully and then bring it back up on another machine, which took more time than forcefully killing the machine and bringing it back online. More time, but less risk. While we could always forcefully kill the guests and bring them back online on another machine faster, killing a server risks data damage and corruption, particularly to MySQL data, so we try to avoid that whenever possible.
