A user reported higher-than-normal latency on their sites this morning and, upon investigation, we could only identify an issue with their specific VPS. We couldn't replicate or identify the issue on any other servers.
Within a couple of hours the issue began to spread to the rest of our servers, and at that point we reached out to our storage vendor, StorPool, to investigate on an emergency basis. During this time all servers were online but running much slower than they should have. Simple static sites and heavily cached sites would have noticed little to no impact, while heavily dynamic sites could have been slow to the point of being unresponsive.
Our storage platform is regularly updated to add new features, to increase speed, and to improve reliability and redundancy. These upgrades are seamless and ordinarily would not have any negative impact on our servers or our clients. In the event that we were performing maintenance that could impact our services we would reach out via email as well as here on our forums to schedule a window for maintenance.
StorPool, just like us, is always working to improve, and last night they performed some minor network maintenance. Some VLAN [networking] changes were made to the storage network for increased redundancy in the event of hardware failure. Ultimately the issues we experienced this morning were due to simple human error: a single setting was not saved properly. As a result, rather than completing in under a millisecond as usual, storage requests began taking tens of milliseconds to complete.
Here is a graph that shows normal storage latency, the incident this morning, as well as the resolution:
StorPool noted that none of their monitoring or alerting picked up on this and, as such, they are making some changes to the monitoring of our storage cluster so that issues like this are detected. Should our storage cluster ever experience abnormal latency in the future, we and StorPool will be able to proactively investigate and resolve the issue before it starts to affect our services and our clients.
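For anyone curious what a latency check of this kind might look like, here is a minimal sketch in Python. It is purely illustrative and is not StorPool's actual monitoring: the function name, the 1 ms baseline, and the 5x alert factor are all assumptions chosen to match the "sub-millisecond versus tens of milliseconds" figures described above.

```python
from statistics import median

# Hypothetical values for illustration only; not StorPool's real thresholds.
BASELINE_MS = 1.0    # normal storage requests complete in under a millisecond
ALERT_FACTOR = 5.0   # flag the cluster when the median is more than 5x the baseline


def latency_is_abnormal(samples_ms, baseline_ms=BASELINE_MS, factor=ALERT_FACTOR):
    """Return True when the median of recent latency samples exceeds baseline * factor."""
    if not samples_ms:
        return False  # no samples collected, so there is nothing to judge
    return median(samples_ms) > baseline_ms * factor


# Made-up sample readings in milliseconds.
normal = [0.4, 0.6, 0.5, 0.7, 0.5]
incident = [12.0, 25.0, 18.0, 30.0, 22.0]

print(latency_is_abnormal(normal))    # False
print(latency_is_abnormal(incident))  # True
```

Using the median rather than a single reading means one slow request does not trigger an alert, while a sustained shift like the one this morning would.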
StorPool has also updated their internal documentation for the update process that was conducted last night to ensure that the human error that caused this issue is avoided entirely.
Keep in mind that this was not caused by the recent migrations or by our recent move to this platform. It was simply due to human error or omission during a normal day-to-day upgrade, and it would have happened even if we had been on this platform for months or years beforehand.
We, as well as StorPool, are sorry for any inconvenience or trouble this caused. If you have any questions or concerns please just let us know.