Resolved Degraded performance - P1, S1, S4

Scott · November 19, 2017

For several hours in the evening of Nov 18, three of our servers experienced high load, degraded performance, and intermittent issues with web requests. We investigated the issues internally for several hours and escalated to several of our software vendors for assistance in locating the root cause. The problem was caused by a network health metric we were previously not monitoring. At this time, all servers are working normally and the outage is resolved.

Event start time: 7:28pm ET | Nov 18

Resolved time: 1:11am ET | Nov 19

Impacted servers: P1, S1, S4

Additional details may be available at a later time. The staff who have more technical details are currently getting some much needed sleep.

We apologize for any inconvenience the outage may have caused and appreciate the patience and reports so many of you shared with us. As always, please feel free to ask any general questions here. If your question is specific to your account, please direct it to our support department by email or ticket.

Michael D. · November 19, 2017

We have determined the root cause of the issues experienced last night/this morning.

Why did it happen?

This was the result of a combination of two unexpected issues, neither of which would have caused any downtime or disruption on its own.

At approximately 7:28 PM ET we had a drive fail in one of our storage servers. Our storage platform is designed to handle drive failures gracefully and drops the drive from the storage cluster. We maintain 3 copies of all data on 3 distinct drives in 3 distinct systems out of many. The result is that when a drive fails we can recreate a third copy of the missing data onto a new drive from the other 2 copies that remain. This is generally a seamless process.

The first issue was how the RAID controller in this specific storage server handled the drive failure. When the drive failed, the RAID controller handling it disabled write caching on all drives in the system. This is unexpected behavior and not something we or our storage vendor have seen before. The result was increased write latency. This alone would not have created downtime or issues.

The second, compounding issue was that we did not have LiteSpeed configured to write logs asynchronously with AIO. With asynchronous logging, LiteSpeed writes log entries to RAM and then flushes them to disk as it can, which would have given us a buffer to absorb the delayed writes. Because LiteSpeed is an event-driven web server, without AIO enabled for logging it would get stuck waiting to write log entries and would fail to serve any other requests while it waited. Each stall lasted a couple of seconds, which was long enough for the system to see LiteSpeed as down and issue a restart.
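To illustrate the difference in general terms, here is a minimal Python sketch. It is an analogy only, not LiteSpeed's AIO implementation or configuration: `SlowHandler`, `time_logging`, and `run_demo` are hypothetical names, and the slow disk is simulated with a sleep. It compares how long the caller is blocked when each log entry is written synchronously versus queued in memory and flushed by a background thread.

```python
import logging
import logging.handlers
import queue
import time


class SlowHandler(logging.Handler):
    """Hypothetical stand-in for a log file on high-latency storage."""

    def __init__(self, delay_s):
        super().__init__()
        self.delay_s = delay_s
        self.records = []

    def emit(self, record):
        time.sleep(self.delay_s)  # simulate one slow disk write
        self.records.append(record.getMessage())


def time_logging(logger, n):
    """Return the seconds the caller spends blocked while logging n entries."""
    start = time.perf_counter()
    for i in range(n):
        logger.info("request %d", i)
    return time.perf_counter() - start


def run_demo(n=20, delay_s=0.01):
    # Synchronous: every log call waits for the slow write to finish.
    slow_sync = SlowHandler(delay_s)
    sync_logger = logging.getLogger("demo.sync")
    sync_logger.setLevel(logging.INFO)
    sync_logger.propagate = False
    sync_logger.addHandler(slow_sync)
    sync_time = time_logging(sync_logger, n)
    sync_logger.removeHandler(slow_sync)

    # Buffered: log calls only enqueue in memory; a background thread
    # performs the slow writes.
    slow_async = SlowHandler(delay_s)
    q = queue.Queue()
    queue_handler = logging.handlers.QueueHandler(q)
    listener = logging.handlers.QueueListener(q, slow_async)
    async_logger = logging.getLogger("demo.async")
    async_logger.setLevel(logging.INFO)
    async_logger.propagate = False
    async_logger.addHandler(queue_handler)
    listener.start()
    async_time = time_logging(async_logger, n)
    listener.stop()  # waits until queued entries have been written
    async_logger.removeHandler(queue_handler)

    return sync_time, async_time, len(slow_sync.records), len(slow_async.records)
```

In this sketch both loggers end up writing every entry, but only the synchronous one makes the caller wait for each slow write. That waiting is the behavior that left the web server looking unresponsive.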

LiteSpeed writes many thousands of log entries per minute, and S1, P1, and S4 were all using storage that had 1/3 of their redundant data on the storage server that had unexpectedly lost its write cache. As a result, out of many thousands of writes per minute, the write latency would occasionally be high enough that LiteSpeed would be seen as stuck and would get restarted, in some cases many times per minute.
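As a rough illustration of why a rare slow write still matters at this volume, the sketch below computes the chance that at least one write in a minute is slow. The numbers are made up for illustration and are not measurements from this incident. It also assumes each write is slow independently of the others, which is a simplification.

```python
def prob_at_least_one_slow(writes_per_minute, p_slow):
    """Chance that at least one of the writes in a minute is slow.

    Assumes each write is slow independently with probability p_slow,
    which is a simplifying assumption for illustration only.
    """
    return 1.0 - (1.0 - p_slow) ** writes_per_minute


# Hypothetical figures: 5,000 log writes per minute, 1 in 10,000 is slow.
chance = prob_at_least_one_slow(5000, 1 / 10000)
print(f"{chance:.2f}")  # about 0.39
```

Under those assumed figures, a stall would be expected in well over a third of all minutes, even though any single write is very unlikely to be slow.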

The end result is that LiteSpeed would go offline for 10 to 30 seconds seemingly randomly. P1 was affected the most and was offline for 30 to 60 seconds every few minutes while S1 was affected the least and was mostly online. S4 was affected more than S1 but nowhere near as much as P1.

What are we doing to prevent a similar issue from occurring in the future?

We have configured additional monitoring on our storage cluster to detect higher-than-normal write latency so that we can intervene quickly. Now that we are aware of the potential issue with the write cache, we can also proactively check for and resolve it whenever a drive fails, avoiding unexpectedly high write latency.

We have reconfigured LiteSpeed to use AIO log writing so that, should we ever experience higher-than-normal write latency in the future, the impact should be minimal if not invisible to end users.
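For readers curious what a write-latency check can look like in principle, here is a minimal sketch. It is not the monitoring described above: the class name, window size, and threshold are all hypothetical. It raises an alert only when the median of the most recent samples exceeds a threshold, so a single slow write does not trigger it.

```python
from collections import deque
from statistics import median


class WriteLatencyMonitor:
    """Hypothetical rolling check on write-latency samples (milliseconds)."""

    def __init__(self, window=5, threshold_ms=50.0):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.threshold_ms = threshold_ms

    def observe(self, latency_ms):
        """Record a sample; return True when the window median is too high."""
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and median(self.samples) > self.threshold_ms
```

Using the median over a short window is one design choice among many. A percentile over a longer window would catch occasional spikes sooner, at the cost of more false alarms.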

Should you have any questions about any of this please let us know! We apologize for any trouble this may have caused you.

AMC4x4 · November 20, 2017

Thank you for the RCA. As always, the transparency and these RCAs are a big reason why I stay here and re-signed for three more years. Thanks again!

Michael D. · November 20, 2017

Absolutely. I wish we had sorted out the root issue a bit faster, but with the changes we've implemented it should not recur.

It's always best to explain what happened and what we're doing about it in my honest opinion.
