Resolved Kobold Server -> I/O Spikes under Investigation

Michael D. · October 15, 2014

Hello,

We are aware of I/O spikes occurring regularly on the Kobold server, causing slowness and related issues for about 30 to 45 seconds at a time, once every couple of hours or so. We have been actively investigating and attempting to resolve this issue for several days now and are leaning towards it being either a RAID controller firmware issue or a drive issue.

Just now we disabled SSD caching on the server and re-configured it using a different set of drives to help rule out the caching drives. This did cause a couple of minutes of latency/slowness which you can see at the end of this graph:

http://www.screen-shot.net/2014-10-15_15-35-17.png

None of those spikes should exist. So far we've been unable to find a software reason for them, which is why we're leaning towards a hardware problem. Disabling the SSD caching, even temporarily to do a drive swap, is going to impact performance, but only for a minute or so. We're doing our best to get this sorted with as little interruption to the service as possible.

I just wanted to make this post for anybody who may have noticed the issue, so that you can be assured we're aware of it and actively working to resolve it.

Michael D. · October 15, 2014

Kobold has normalized after the SSD swap. We will monitor it for the next 4 hours to see if the spikes are resolved. If they are, we have identified a bad disk. If not, we will need to do one more swap to rule out the other drive responsible for caching, but we will not do so until later this evening.

The investigation and tweaking we've been able to do on this issue so far have greatly reduced the impact of the spikes on the system and users, but the spikes still exist and ultimately still need to be resolved.

Michael D. · October 15, 2014

For clarity, here is a comparison between two servers with the exact same hardware over the same time period. The graph outlined in green is from Jasmine and is 100% normal; you will notice that it tops out at 9.0. The graph outlined in red is from Kobold, and you will notice that it tops out at 30. Keeping the scale difference in mind, you can see how frequent and how large the spikes are on Kobold. If not for the spikes, the graphs would look almost exactly the same.

http://www.screen-shot.net/2014-10-15_15-41-47.png
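As a rough illustration of why the differing scales matter, here is a minimal Python sketch that flags spikes relative to each series' own baseline instead of relying on the graph's axis. The function name `find_spikes`, the 3x threshold, and the sample values are all assumptions made up for illustration; they are not real measurements from Jasmine or Kobold and not the tooling used in this investigation.

```python
from statistics import median


def find_spikes(samples, factor=3.0):
    """Return the indices of samples that exceed `factor` times the series median.

    The median is used as the baseline because a few large spikes
    barely move it, unlike the mean.
    """
    baseline = median(samples)
    return [i for i, value in enumerate(samples) if value > factor * baseline]


# Hypothetical load samples, invented for this example only.
steady = [2.0, 2.5, 3.0, 2.2, 2.8, 2.4]   # a "normal" server
spiky = [2.0, 2.5, 28.0, 2.2, 2.8, 2.4]   # same baseline plus one spike

print(find_spikes(steady))  # []
print(find_spikes(spiky))   # [2]
```

Both series share nearly the same baseline, so the only difference the check reports is the spike itself, which mirrors the point that the two graphs would look almost identical without the spikes.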

We have been and still are actively working to resolve this; however, we're running out of options that are completely impact-free. For example, the drive swaps should have had almost no impact, but they clearly did have some, as evidenced by the graph and by some tickets that customers opened.

We will be keeping this thread updated regarding this matter.

Michael D. · October 15, 2014

Another administrator pointed out to me that the changes I made on the system earlier today did greatly reduce I/O latency even though the I/O wait spikes are still happening. The spike at the far right of this graph is where caching was temporarily disabled to switch caching drives.

http://www.screen-shot.net/2014-10-15_16-06-19.png

Ultimately the root cause still needs to be determined but, at least, latency when the spikes happen should be low enough that it largely goes unnoticed.

Michael D. · October 15, 2014

I believe we've identified and rectified the issue. We need to allow at least 6 hours before we can confirm this, but within 2 hours we can be fairly certain.

slushatwork · October 16, 2014

Thanks for letting us know - I have Uptime robot monitoring my account, and I've been getting lots of downtime warnings. I look forward to a solution.

Michael D. · October 17, 2014

There shouldn't be any additional unexpected downtime. There has been none since we resolved this, but we're always watching.

Michael D. · November 2, 2014

We had greatly lessened the issue earlier, and after working with our software vendor, CloudLinux, we were finally able to get a full fix for it.

See the bottom graph:

http://www.screen-shot.net/2014-11-02_0221.png

http://www.screen-shot.net/2014-11-02_0238.png
