We have determined the root cause of the issues experienced last night/this morning.
Why did it happen?
This was the result of a combination of two unexpected issues; either one on its own would not have caused any downtime or disruption.
At approximately 7:28 PM ET a drive failed in one of our storage servers. Our storage platform is designed to handle drive failures gracefully and drops the failed drive from the storage cluster. We maintain 3 copies of all data, on 3 distinct drives in 3 distinct systems, so when a drive fails we can rebuild the missing third copy onto a new drive from the 2 copies that remain. This is generally a seamless process.
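To make the repair process concrete, here is a minimal conceptual sketch of 3-way replica repair. This is illustrative Python only, not our actual storage platform's code; the Drive class and repair_block function are hypothetical names made up for this example.

```python
# Conceptual sketch of 3-copy repair after a drive failure; not our actual
# storage code. Drive and repair_block are hypothetical names.
REPLICA_COUNT = 3  # every block lives on 3 drives in 3 distinct systems

class Drive:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> data

def repair_block(block_id, drives, failed_drive):
    """Rebuild the lost third copy of a block from the 2 surviving copies."""
    survivors = [d for d in drives
                 if d is not failed_drive and block_id in d.blocks]
    assert len(survivors) >= REPLICA_COUNT - 1, "fewer than 2 copies left"

    # Choose a new drive that does not already hold a copy of this block.
    target = next(d for d in drives
                  if d is not failed_drive and block_id not in d.blocks)
    target.blocks[block_id] = survivors[0].blocks[block_id]
    return target

# Example: 4 drives, block "b1" stored on drives d0-d2, then d1 fails.
drives = [Drive(f"d{i}") for i in range(4)]
for d in drives[:3]:
    d.blocks["b1"] = b"payload"
new_home = repair_block("b1", drives, failed_drive=drives[1])
print(f"b1 re-replicated onto {new_home.name}")  # -> d3
```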
The first issue was with how the RAID controller in this specific storage server handled the drive failure. When the drive failed, the RAID controller managing it disabled write caching on all drives in the system. This is unexpected behavior that neither we nor our storage vendor had seen before. The result was increased write latency. On its own, this would not have caused any downtime or issues.
The second, compounding issue was that we did not have LiteSpeed configured to write its logs asynchronously with AIO, meaning it writes entries to RAM and then flushes them to disk as it can. That buffer would have absorbed the delayed, latent writes. Because LiteSpeed is an event-driven web server, without AIO enabled for logging it would get stuck waiting to write log entries and would fail to serve all other requests while it waited. This could last a couple of seconds, which was long enough for the system to see LiteSpeed as down and issue a restart.
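Here is a minimal sketch of why the buffer matters. This is illustrative Python, not LiteSpeed's actual implementation: a synchronous write stalls the event loop for as long as the disk takes, while a buffered writer hands entries to RAM and lets a background thread flush them at whatever pace the disk allows. BufferedLogger and the log path are hypothetical names for this example.

```python
# Illustrative sketch of buffered ("AIO-style") logging; not LiteSpeed code.
import queue
import threading
import time

class BufferedLogger:
    """Callers append to an in-memory queue and never wait on the disk;
    a background thread flushes entries as fast as the disk allows."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "a")
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def log(self, entry):
        self._queue.put(entry)         # returns immediately, even if disk is slow

    def _flush_loop(self):
        while True:
            entry = self._queue.get()  # the wait happens here, off the event loop
            self._file.write(entry + "\n")
            self._file.flush()

# Without buffering, the event loop itself would sit inside the file write
# during a latency spike and stop serving requests; here it only enqueues.
logger = BufferedLogger("access.log")  # hypothetical log path
logger.log("GET / 200")
time.sleep(0.2)  # demo only: give the background thread a moment to flush
```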
LiteSpeed writes many thousands of log entries per minute, and S1, P1, and S4 were all using storage that kept one third of its redundant data on the server that had unexpectedly lost its write cache. Out of those many thousands of writes per minute, the latency would occasionally be high enough that LiteSpeed would be seen as stuck and would get restarted, in some cases many times per minute.
The end result was that LiteSpeed would go offline for 10 to 30 seconds at seemingly random intervals. P1 was affected the most and was offline for 30 to 60 seconds every few minutes, while S1 was affected the least and was mostly online. S4 was affected more than S1 but nowhere near as much as P1.
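For context on the restarts, here is a minimal sketch of how a health-check watchdog of this kind typically behaves. Everything in it is hypothetical for illustration: the probe URL, the timeout, and the restart command are not our actual monitoring configuration.

```python
# Illustrative watchdog sketch; the probe URL, timeout, and restart command
# are hypothetical, not our actual monitoring configuration.
import subprocess
import time
import urllib.request

PROBE_URL = "http://127.0.0.1/health"  # hypothetical probe endpoint
TIMEOUT_SECONDS = 2                    # a stuck server looks "down" after this

def server_is_up():
    try:
        urllib.request.urlopen(PROBE_URL, timeout=TIMEOUT_SECONDS)
        return True
    except OSError:
        return False                   # no response within the timeout

while True:
    if not server_is_up():
        # A server blocked on a slow log write for a couple of seconds fails
        # this probe and gets restarted, even though it is otherwise healthy.
        subprocess.run(["systemctl", "restart", "lsws"])  # hypothetical unit
    time.sleep(5)
```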
What are we doing to prevent a similar issue from occurring in the future?
- We have configured additional monitoring on our storage cluster to detect higher-than-normal write latency so that we can intervene quickly (see the sketch after this list). Now that we are aware of the potential write-cache issue, we can proactively check for it and resolve it after a drive failure, before it leads to unexpectedly high write latency.
- We have reconfigured LiteSpeed to use AIO log writing so that, should we ever experience higher-than-normal write latency again, the impact should be minimal, if not invisible, to end users.
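As a rough illustration of the kind of latency check mentioned above, here is a hypothetical sketch of a write-latency probe. The probe path and alert threshold are made up for this example and are not our actual monitoring configuration.

```python
# Hypothetical write-latency probe; the path and threshold are illustrative
# values, not our actual monitoring configuration.
import os
import time

PROBE_FILE = "/var/tmp/latency_probe"  # hypothetical probe location
THRESHOLD_MS = 50                      # hypothetical alert threshold

def probe_write_latency():
    """Time a small synchronous write, including the flush to disk."""
    start = time.monotonic()
    with open(PROBE_FILE, "w") as f:
        f.write("x" * 4096)
        f.flush()
        os.fsync(f.fileno())           # force the write through to the disk
    return (time.monotonic() - start) * 1000

latency_ms = probe_write_latency()
if latency_ms > THRESHOLD_MS:
    print(f"ALERT: write latency {latency_ms:.1f} ms exceeds {THRESHOLD_MS} ms")
```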
Should you have any questions about any of this, please let us know! We apologize for any trouble this may have caused you.