Michael D. Posted October 15, 2014
Hello,
We are aware of I/O spikes that have been occurring regularly on the Kobold server, causing slowness and related issues for about 30 to 45 seconds at a time, once every couple of hours or so. We have been actively investigating and attempting to resolve this for several days now, and we are leaning towards it being a RAID controller firmware issue or a drive issue.
Just now we disabled SSD caching on the server and reconfigured it using a different set of drives to help rule out the caching drives. This caused a couple of minutes of latency and slowness, which you can see at the end of this graph: http://www.screen-shot.net/2014-10-15_15-35-17.png
None of those spikes should exist. So far we've been unable to find a software cause, which is why we're leaning towards a hardware problem. Disabling the SSD caching, even temporarily to do a drive swap, will impact performance, but only for a minute or so.
We're doing our best to get this sorted with as little interruption to the service as possible. I wanted to make this post so that anybody who has noticed the issue can be assured that we're aware of it and actively working to resolve it.
Michael D. Posted October 15, 2014
Kobold has normalized after the SSD swap. We will monitor it for the next 4 hours to see whether the spikes are resolved. If they are, we have identified a bad disk. If not, we will need to do one more swap to rule out the other caching drive, but we will not do so until later this evening.
The investigation and tweaking we've done on this issue so far has greatly reduced the impact of the spikes on the system and on users, but the spikes still exist and still need to be resolved.
Michael D. Posted October 15, 2014
For clarity, here is a comparison between two servers with the exact same hardware over the same time period. The graph outlined in green is from Jasmine and is 100% normal; note that it tops out at 9.0. The graph outlined in red is from Kobold, and it tops out at 30. Keeping the scale difference in mind, you can see how frequent and how large the spikes are on Kobold. If not for the spikes, the two graphs would look almost exactly the same. http://www.screen-shot.net/2014-10-15_15-41-47.png
We have been and still are actively working to resolve this; however, we're running out of options that are completely impact-free. For example, the drive swap should have had almost no impact, but it clearly did have some, as evidenced by the graph and by some tickets that customers opened.
We will keep this thread updated.
Michael D. Posted October 15, 2014
Another administrator pointed out to me that the changes I made on the system earlier today did greatly reduce I/O latency, even though the I/O wait spikes are still happening. The spike at the far right of this graph is where caching was temporarily disabled to switch caching drives: http://www.screen-shot.net/2014-10-15_16-06-19.png
The root cause still needs to be determined, but latency during the spikes should at least be low enough now that they go largely unnoticed.
Michael D. Posted October 15, 2014
I believe we've identified and rectified the issue. We will need to allow at least 6 hours before we can confirm this, but within 2 hours we can be fairly certain.
slushatwork Posted October 16, 2014
Thanks for letting us know - I have Uptime robot monitoring my account, and I've been getting lots of downtime warnings. I look forward to a solution.
Michael D. Posted October 17, 2014
There shouldn't be any additional unexpected downtime. There has been none since we resolved this, but we're always watching.
Michael D. Posted November 2, 2014
We had lessened the issue greatly, and after working with our software vendor, CloudLinux, we were finally able to get a full fix for it. See the bottom graph:
http://www.screen-shot.net/2014-11-02_0221.png
http://www.screen-shot.net/2014-11-02_0238.png