Increased Latency / Service Intermittence - 10/03/2017 from 8:20 AM to 10:10 AM ET

Michael D. · October 3, 2017

A user reported higher than normal latency / lag to their sites this morning and upon investigation we could only identify an issue with their specific VPS. We couldn't replicate or identify the issue on any additional servers.

Within a couple of hours the issue began to propagate to the rest of our servers, and at that point we reached out to our storage vendor, StorPool, to investigate on an emergency basis. During this time all servers were online but functioning much slower than they should have. Static or simple sites, and those that are heavily cached, would have noticed little to no impact, while heavily dynamic sites could have been slow to the point of being unresponsive.

Our storage platform is regularly updated to add new features, to increase speed, and to improve reliability and redundancy. These upgrades are seamless and ordinarily would not have any negative impact on our servers or our clients. In the event that we were performing maintenance that could impact our services we would reach out via email as well as here on our forums to schedule a window for maintenance.

StorPool, just like us, is always working to improve, and last night there was some minor network maintenance. Some VLAN [networking] changes were made to the storage network for increased redundancy in the event of hardware failure. Ultimately the issues we experienced this morning were due to simple human error: a single setting was not saved properly. As a result, rather than completing in under a millisecond as usual, requests began taking tens of milliseconds to complete.

Here is a graph that shows normal storage latency, the incident this morning, as well as the resolution:

StorPool noted that none of their monitoring or alerting picked up on this and, as such, they are making some changes to the monitoring of our storage cluster so that issues like this are detected. Should our storage cluster ever experience abnormal latency in the future, we and StorPool will be able to proactively investigate and resolve the issue before it starts to affect our services and our clients.
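For anyone curious what a latency check of this kind might look like, here is a rough sketch in Python. It is only an illustration and not StorPool's actual monitoring, which is not described in this post. The 1 ms threshold, the five-sample window, the sample values, and the function name `latency_alert` are all assumptions made up for the example.

```python
from statistics import median

# Hypothetical values: requests normally complete in under 1 ms,
# so anything consistently above that is treated as abnormal.
LATENCY_THRESHOLD_MS = 1.0
WINDOW = 5  # number of most recent samples to evaluate


def latency_alert(samples_ms, threshold_ms=LATENCY_THRESHOLD_MS, window=WINDOW):
    """Return True when the median of the last `window` samples exceeds the threshold."""
    if len(samples_ms) < window:
        return False  # not enough data to judge yet
    recent = samples_ms[-window:]
    # The median ignores a single slow request but reacts to a sustained shift.
    return median(recent) > threshold_ms


# Made-up sample data: sub-millisecond readings versus tens of milliseconds.
normal = [0.3, 0.4, 0.35, 0.5, 0.4]
incident = [0.4, 12.0, 25.0, 30.0, 18.0]

print(latency_alert(normal))
print(latency_alert(incident))
```

Using the median over a short window is one possible design choice: a single slow request would not page anyone, while a sustained jump from sub-millisecond to tens of milliseconds would trip the alert within a few samples.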

StorPool has also updated their internal documentation for the update process that was conducted last night to ensure that the human error that caused this issue is avoided entirely.

Keep in mind that this was not due to the recent migrations or to our having moved to this platform recently. It was simply due to human error or omission, and it would have happened as part of this normal day-to-day upgrade even if we had been on this platform for months or years prior to the upgrade.

We, as well as StorPool, are sorry for any inconvenience or trouble this caused. If you have any questions or concerns please just let us know.

cziv · October 3, 2017

Yeap. I have seen that at S1.

Michael D. · October 3, 2017

cziv said: "Yeap. I have seen that at S1."

All servers were eventually affected this morning but the issue is now resolved.

Dehenderson · October 3, 2017

Did not notice any issues on R3 but StorPool raises a question -- the company is located in Sofia. Does that mean you are hosting from Bulgaria? Or are you still using Handy in Denver?

AMC4x4 · October 4, 2017

@Dehenderson - Perhaps it's an issue where Storpool can log into MDD's storage and monitor the services? We have some Infinidat boxes in our lab, and they are always contacting us to see when they can schedule something or other. Either way, I'm curious as well.

Michael D. · October 4, 2017

Dehenderson said: "Did not notice any issues on R3 but StorPool raises a question -- the company is located in Sofia. Does that mean you are hosting from Bulgaria? Or are you still using Handy in Denver?"

StorPool is a software company - not a hardware company. We are running their software on our hardware in Denver. We couldn't get under 0.4 ms [less than half a millisecond] read times from anything but local storage.

AMC4x4 said: "@Dehenderson - Perhaps it's an issue where Storpool can log into MDD's storage and monitor the services? We have some Infinidat boxes in our lab, and they are always contacting us to see when they can schedule something or other. Either way, I'm curious as well."

Correct - they install and manage the storage platform itself. They're most familiar with it while we're still learning it and taking on additional tasks. Even once we're fully comfortable 100% managing it ourselves - they will still continue monitoring the systems.

Within about 30 seconds of me opening an emergency case with them on this issue they were already investigating. Normally they're proactive to the point that we don't have time to notify them - they reach out to us - but in this case we ran across an issue neither they nor we had seen before. Changes have been made to detect the issue we faced so proactive resolution is possible.

Dehenderson · October 4, 2017

Thanks much for explaining. The speed is impressive!
