Michael D. Posted May 16, 2015
We have identified an issue with the storage drivers on these servers that is causing intermittent high latency and unresponsiveness. We were planning to schedule the necessary reboots a week or two out, but the condition is degrading and we must act quickly to keep things stable and to avoid damage to data integrity. We will be performing these reboots as soon as possible. While we expect each server to be offline for no more than 15 minutes, it is possible we will run into unexpected issues. We estimate a 2 hour maintenance window while the reboots are performed and the updates verified. There is a small possibility that any of these servers could require file system checks, which would add time to the process. We will keep this thread updated as we have new information to report.
Michael D. Posted May 16, 2015
We will update this thread when the reboots are starting. At this time we are working to ensure that all of the maintenance we need to do gets done correctly and quickly, to minimize actual downtime.
Michael D. Posted May 16, 2015
VPS1 is going down for a reboot.
Michael D. Posted May 16, 2015
SR1 and SD1 are going down for a reboot now as well.
Michael D. Posted May 16, 2015
The reboots have completed.
Michael D. Posted May 16, 2015
We need to issue reboots once more because some of the driver updates didn't apply properly. We'll keep this thread updated.
Michael D. Posted May 16, 2015
We're about to begin the reboots, within about a minute.
Michael D. Posted May 16, 2015
Reboots completed quickly.
ericr Posted May 22, 2015
We are continuing this work to improve the storage system on these servers. I am restarting SD1 at this time to adjust settings.
ericr Posted May 22, 2015
The change was successful on SD1 and the server is fully online. I am rebooting SR1 at this time.
ericr Posted May 22, 2015
The change has been applied to all of the servers.
ericr Posted May 23, 2015
The underlying fault is continuing to occur. We are going to do a quick restart to implement another change that we hope will address the issue or isolate the cause.
ericr Posted May 23, 2015
The reboots are completed. We will update this thread if a reboot is needed in the future.
Michael D. Posted May 30, 2015
The server is once again reporting storage access errors, causing intermittent downtime and instability. We are upgrading the firmware on the RAID controller and restarting the host. This should take no more than 10 minutes and we will post updates here.
Michael D. Posted May 30, 2015
The host went offline during the software update - we're currently investigating and working to bring all services back online as quickly as possible.
Michael D. Posted May 30, 2015
We ended up power cycling the server and it booted back up. We are checking to see if the firmware update stuck or if we are still on the older version.
Michael D. Posted May 30, 2015
The firmware update did take. With luck the server will be stable moving forward. We will continue to monitor the server closely for at least the next 72 hours.
nztim Posted May 30, 2015
Thanks for the updates!
Michael D. Posted June 3, 2015
The firmware update did resolve the storage instability. We are still seeing poor write capacity compared to what is expected and have adjusted the server caches to compensate. We're working on a hardware solution. If that solution requires any further downtime, we will make a new post and send an email to all affected customers. We're doing our best to keep things as seamless as possible and are working hard to ensure things remain online and stable. At this time I am going to mark this particular thread as resolved.