MDDHosting Forums

Kobold Server -> I/O Spikes under Investigation




We are aware of I/O spikes that have been occurring regularly on the Kobold server, causing slowness for about 30 to 45 seconds at a time, roughly once every couple of hours. We have been actively investigating and attempting to resolve this issue for several days and are leaning towards it being either a RAID controller firmware issue or a drive issue.
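For anyone curious how spikes like these show up in the raw numbers: on Linux, time spent waiting on I/O accumulates in the fifth counter of the `cpu` line in `/proc/stat`. Below is a minimal sketch of detecting a spike from two samples of that line; the threshold and the sample values are illustrative only, not figures from our actual monitoring:

```python
# Sketch: detect I/O wait spikes from the aggregate "cpu" line in
# /proc/stat (Linux). Field 5 of that line is iowait, in USER_HZ ticks.
# Threshold and sampling interval are illustrative assumptions.

def parse_cpu_ticks(stat_line):
    """Return the tick counters from a 'cpu ...' line of /proc/stat."""
    fields = stat_line.split()
    assert fields[0] == "cpu"
    return [int(x) for x in fields[1:]]

def iowait_percent(before, after):
    """Percentage of elapsed ticks spent in iowait between two samples."""
    delta = [b - a for a, b in zip(before, after)]
    total = sum(delta)
    return 100.0 * delta[4] / total if total else 0.0  # index 4 == iowait

def is_spike(before, after, threshold=20.0):
    """Flag the interval as a spike when iowait exceeds the threshold."""
    return iowait_percent(before, after) >= threshold

# Two synthetic samples taken a few seconds apart:
before = parse_cpu_ticks("cpu 100 0 50 800 50 0 0 0 0 0")
after = parse_cpu_ticks("cpu 150 0 70 850 230 0 0 0 0 0")
```

In a real loop you would read `/proc/stat` twice, a few seconds apart, and log any interval the function flags; in the synthetic samples above, 180 of the 300 elapsed ticks were iowait, so the interval is flagged.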


We have just disabled SSD caching on the server and reconfigured it using a different set of drives to help rule out the caching drives. This did cause a couple of minutes of latency/slowness, which you can see at the end of this graph:



None of those spikes should exist. So far we've been unable to find a software cause, which is why we're leaning towards a hardware problem. Disabling the SSD caching, even temporarily to do a drive swap, will impact performance, but only for a minute or so. We're doing our best to get this sorted with as little interruption to the service as possible.


I just wanted to make this post for anybody that may have noticed the issue so that you can be assured we're aware of it and actively working to resolve it.


Kobold has normalized after the SSD swap. We will monitor it for the next 4 hours to see if the spikes are resolved. If they are, we have identified a bad disk; if not, we will need to do one more swap to rule out the other caching drive, but we will not do so until later this evening.


The investigation and tweaking we've done on this issue so far has greatly reduced the impact of the spikes on the system and its users, but the spikes still exist and still need to be resolved at the root.


For clarity, here is a comparison between two servers with identical hardware over the same time period. The graph outlined in green is from Jasmine and is 100% normal; note that it tops out at 9.0. The red graph is from Kobold and tops out at 30. Keeping that scale difference in mind, you can see how frequent and how large the spikes on Kobold are. If not for the spikes, the graphs would look almost identical.
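One scale-independent way to quantify how "spiky" a graph is, regardless of where its y-axis tops out, is a peak-to-median ratio. A small sketch with made-up numbers (not the actual Jasmine/Kobold data):

```python
# Sketch: compare "spikiness" across series on different y-axis scales
# using a peak-to-median ratio. A flat series scores near 1; periodic
# spikes push the ratio well above 1. Sample data below is invented.
from statistics import median

def peak_to_median(series):
    """Ratio of the peak value to the median of the series."""
    return max(series) / median(series)

steady = [4, 5, 5, 6, 5, 4, 5]    # steady load: peak barely above median
spiky = [5, 4, 30, 5, 6, 28, 5]   # same baseline, with occasional spikes
```

Because the ratio divides out the absolute scale, it lets you compare two graphs directly even when one tops out at 9 and the other at 30.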



We are actively working to resolve this, and have been for days; however, we're running out of options that are completely impact-free. For example, the drive swaps should have had almost no impact, but they clearly did have some, as evidenced by the graph and by several tickets opened by customers.


We will be keeping this thread updated regarding this matter.


Another administrator pointed out to me that the changes I made on the system earlier today did greatly reduce I/O latency even though the I/O wait spikes are still happening. The spike at the far right of this graph is where caching was temporarily disabled to switch caching drives.



Ultimately the root cause still needs to be determined, but latency during the spikes should now be low enough that they largely go unnoticed.


  • 3 weeks later...

We had lessened the issue greatly, and after working with our software vendor, CloudLinux, we were finally able to get a full fix for the issue.


See the bottom graph:




