Michael D. Posted September 8, 2018

Below you will find information on this outage. If you have any questions about any of this, please let us know. Sorry for any trouble this may have caused you.

What happened to cause the outage?

We were alerted to the P1 server being down at 11:50 PM, and upon immediate investigation we found that the server's operating system had crashed (a "kernel panic"). This does, unfortunately, happen from time to time, as operating systems are written by humans and bugs do happen. It's thankfully very rare. The system was rebooted within seconds of going down, as this is the normal resolution for a panic; however, the system failed to boot properly.

Why did the system not recover automatically as it is highly available?

Our servers are highly available and, in most situations, will recover from an issue quickly and automatically. In this case, the issue was with the data itself, which kept the server from coming back online on any hardware and required manual intervention. Initially we believed the issue was with the connection between this specific server and our storage layer; however, we worked with StorPool's emergency response team and verified that the storage platform was behaving as expected and was not the cause.

Why was the issue not resolved more quickly?

I'm not happy to say it, but it took us longer than it should have to identify that the boot loader, the software that sits between the BIOS and the operating system, had become corrupted. This isn't something we'd normally check for, as the boot loader is modified very infrequently and this file system is very rarely touched. It is modified only when an operating system upgrade is performed, and no upgrades have been performed. Once we identified the cause of the failure to boot, we repaired the boot loader and the server came back online. We did take it back offline for approximately a minute to revert a couple of setting changes we made while investigating and working to resolve the outage.

What precipitated the outage and what are we doing to prevent it from recurring?

We believe that the boot loader was corrupted during a bad installation of an operating system update that we had not yet booted into, although we are still investigating to verify this or to identify the actual cause so we can check for it and prevent it. We've added checking the boot loader to our operating procedures for recovering from a failure to boot. Should this issue ever recur, which we do not anticipate, we will be prepared to check for and resolve it quickly, on the order of a couple of minutes at most.
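
For readers wondering what "checking the boot loader" could look like in practice, here is a minimal sketch. It is an illustration only, not the procedure described above: it assumes a BIOS/MBR disk layout, and the function names and the idea of keeping a recorded SHA-256 baseline are hypothetical. It checks that the first sector ends with the standard 0x55 0xAA boot signature and that the first 440 bytes (the bootstrap code area in the common MBR layout) still match a previously recorded hash.

```python
import hashlib

SECTOR_SIZE = 512
BOOT_CODE_SIZE = 440          # bootstrap code area in the common MBR layout
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a valid MBR boot sector


def read_first_sector(device_path: str) -> bytes:
    """Read the first 512 bytes of a disk or disk image (path is an assumption)."""
    with open(device_path, "rb") as disk:
        return disk.read(SECTOR_SIZE)


def has_boot_signature(sector: bytes) -> bool:
    """True if the sector is full-size and ends with the 0x55 0xAA signature."""
    return len(sector) >= SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


def matches_baseline(sector: bytes, baseline_sha256: str) -> bool:
    """True if the boot code area hashes to the recorded known-good digest."""
    digest = hashlib.sha256(sector[:BOOT_CODE_SIZE]).hexdigest()
    return digest == baseline_sha256
```

A check like this only covers the first-stage code in sector 0. Boot loaders such as GRUB keep further stages elsewhere on disk, so a passing result does not prove the whole boot chain is intact.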
AMC4x4 Posted September 8, 2018

As always, thanks for the detailed report. Great work getting to the bottom of it. Hope the rest of your night is uneventful.
Victoria Bampton Posted September 8, 2018

Love your honesty and openness, guys. Well done!