S4 temporary outage at 11:40PM ET 10/13

ericr · October 14, 2018

We will be having a short, one-to-three-minute outage of S4 to complete a live migration move.

ericr · October 14, 2018

This stage is complete. We will have another short outage to revert the move later this evening.

ericr · October 14, 2018

We will be doing the second outage of the night at 12:20 PM.

ericr · October 14, 2018

The work is complete for the evening.

ericr · October 14, 2018

I am taking S4 down immediately to resolve an issue that is occurring as a result of the physical server.

ericr · October 14, 2018

S4 is booted. Your sites should be back online shortly.

Michael D. · October 14, 2018

As the S4 server has been migrated to different hardware that isn't experiencing issues, we do not anticipate further problems. At this time the hardware that S4 was on previously is no longer providing services to any clients, and we're working with Dell to have the hardware replaced under warranty.

Michael D. · October 14, 2018

A few clients have reached out asking why the system did not automatically fail over the S4 server to another piece of hardware and why we had to manually intervene.

I'm going to provide the same answer here as I did in the tickets so that we're being clear and transparent.

The failure of the host wasn't total: the server was online but degraded. The long and short of it is that the motherboard in the host system is failing in an unusual way. It is corrupting data read from 3 of the 24 memory modules in the server, specifically 3 of the 12 attached to CPU1. Not all VMs on the host were affected, and the host was not entirely down. S4 was the most affected because the failing RAM was in use by that server. We use error-correcting code (ECC) memory, which can detect memory errors and correct them automatically, and we have verified that the RAM modules themselves are not the cause of the issues.

If the hardware had failed in a way that took the services 100% offline, they would have come back up on another machine automatically. Because the host server was still online, the automatic system did not migrate the guest servers. We intervened manually because we monitor the servers closely: we were aware of the issue in under 60 seconds and began working on it.
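The distinction between a host that is fully offline and one that is merely degraded can be sketched roughly as follows. This is an illustration only, not the actual failover system: the `HostState` values and the `automatic_action` function are hypothetical names made up for this example.

```python
from enum import Enum


class HostState(Enum):
    ONLINE = "online"      # host responds and is healthy
    DEGRADED = "degraded"  # host responds but misbehaves (e.g. bad memory reads)
    OFFLINE = "offline"    # host does not respond at all


def automatic_action(state: HostState) -> str:
    """Return what a (hypothetical) automatic failover system does for a host state."""
    if state is HostState.OFFLINE:
        # Total failure: guests are brought back up on another machine automatically.
        return "restart guests on another host"
    # The host still answers, so the automation sees nothing to fail over from.
    # A degraded host therefore needs a person to step in.
    return "no automatic migration"


for state in HostState:
    print(f"{state.value}: {automatic_action(state)}")
```

In this sketch the degraded case falls through to "no automatic migration", which is why manual intervention was needed.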

So as not to cause file system corruption, data loss, etc., we elected to bring the service down gracefully and bring it back up on another machine, which took more time than forcefully killing the machine and bringing it back online. More time, but less risk. While we could always forcefully kill the guests and bring them back online on another machine faster, killing a server risks data damage and corruption, particularly to MySQL data, so we try to avoid that whenever possible.
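The trade-off can be shown with a toy model. This is a simplified sketch, not how a real hypervisor or MySQL behaves: the `Guest` class and its `pending_writes` and `on_disk` lists are hypothetical stand-ins for data that has or has not yet been flushed to disk.

```python
from dataclasses import dataclass, field


@dataclass
class Guest:
    """Toy model of a guest server with writes not yet flushed to disk."""
    pending_writes: list = field(default_factory=list)
    on_disk: list = field(default_factory=list)

    def graceful_stop(self) -> int:
        """Flush every pending write before stopping; returns the number flushed."""
        flushed = 0
        while self.pending_writes:
            self.on_disk.append(self.pending_writes.pop(0))
            flushed += 1
        return flushed

    def force_kill(self) -> int:
        """Stop immediately; returns the number of pending writes that were lost."""
        lost = len(self.pending_writes)
        self.pending_writes.clear()
        return lost


graceful = Guest(pending_writes=["row1", "row2", "row3"])
killed = Guest(pending_writes=["row1", "row2", "row3"])

print("graceful stop flushed", graceful.graceful_stop(), "writes:", graceful.on_disk)
print("forced kill lost", killed.force_kill(), "writes:", killed.on_disk)
```

The graceful path does more work before the guest goes down, which is the extra time, but nothing pending is thrown away.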
