Michael D. Posted December 13, 2012

Today we were doing some network maintenance, and in the process we were identifying the private networking ports on our servers to ensure that we didn't mistakenly change the wrong thing; better safe than sorry. To do this we ran a tool called "ethtool", which makes a port's lights flash so it can be identified. We've run this tool hundreds if not thousands of times before, but this time the process got stuck on all servers and caused load to skyrocket to the point that we could no longer access them.

We could immediately reset each server to bring it back online, but doing so risks data corruption and lengthy file system repairs (2 to 8 hours). Instead, we're going to log into each server locally and kill the stuck processes manually. That will take longer than a reset, but it doesn't incur the same risk of corruption and lengthy repairs.

It's really an odd issue to have: the servers are 'online' in that they're powered up, operating, and have networking. They're just too busy with this stuck process to do anything else. We'll update this thread as we have more information.
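For readers wondering what "killing the stuck processes manually" can look like, here is a minimal Python sketch. It is not the procedure the staff actually used. It assumes the stuck processes appear under the command name `ethtool` in `ps` output, and the helper names `list_processes`, `find_pids`, and `terminate` are hypothetical, chosen only for illustration.

```python
import os
import signal
import subprocess


def list_processes() -> str:
    """Return one line per process: the PID followed by the command name."""
    # Empty column headers ("pid=", "comm=") suppress the ps header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_pids(ps_output: str, name: str) -> list[int]:
    """Return the PIDs whose command name matches `name` exactly."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # split into PID and the rest of the line
        if len(parts) == 2 and parts[0].isdigit() and parts[1].strip() == name:
            pids.append(int(parts[0]))
    return pids


def terminate(pids: list[int], sig: int = signal.SIGTERM) -> None:
    """Send `sig` to each PID, skipping processes that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited before it could be signalled
```

A possible use would be `terminate(find_pids(list_processes(), "ethtool"))`, run with sufficient privileges. Matching the command name exactly, and printing the PID list before signalling anything, is one way to guard against acting on the wrong process.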
Scott Posted December 13, 2012

All services should be returning to normal as the servers catch up.
Michael D. (Author) Posted December 13, 2012

Echo requires a file system check; it's the only server we were not able to get unlocked without a reset. On the rest, loads are stabilizing and things are returning to normal.
cvos Posted December 13, 2012

How long will this take?
8thos Posted December 13, 2012

Darn.
Michael D. (Author) Posted December 13, 2012

In reply to "how long will this take?":

The FSCK on Echo will take anywhere from 30 minutes to 4 hours; I really have no way to be more specific than that. It can go up to 40% and then jump to "100%", or it could go up to 99% in 30 minutes and then take 3.5 hours to get the last 1%.

Here is an idea of what happened (it's an image): http://www.screencast.com/t/o7LfO3MyJl

Loads skyrocketed into the thousands, which made the servers unresponsive even though they were online. Our data center techs were able to log in and fix every one of them except Echo, which needed a forced reset. Jasmine was the last one to get back to stable, which happened just in the last few minutes.

All VPS are being restarted one at a time (takes a few seconds each) to make sure their quotas are accurate, and this should be done soon.

I'll update about Echo as I can; it reports 47.7% completed.
SarisIsop Posted December 13, 2012

All working for me. Thanks.
Scott Posted December 13, 2012

VPS servers are restarting one by one to fix quotas. Each reboot only takes maybe 10 or 15 seconds on average, so all of the VPS servers should be back online shortly.
gentleman Posted December 13, 2012

What about the other servers? I am on Demeter and it has not been working for 1 hour now, and it is still not working.
Scott Posted December 13, 2012

In reply to "What about the other servers? I am on demeter and it has not been working for 1 hour now, and it is still not working.":

It should be stabilizing as we speak. Check again in five minutes.
Michael D. (Author) Posted December 13, 2012

Indeed, Demeter was just now fixed. When we ran the original process kill, it looks like we typo'd it by one letter (that interface doesn't allow copy and paste). I re-did it manually and load on Demeter is stabilizing. Echo reports 75% done on the file system check.
Michael D. (Author) Posted December 13, 2012

All VPS have restarted for their quota checks. All VPS are online, and we expect no further interruptions to VPS service.
Michael D. (Author) Posted December 13, 2012

Here's what this issue looked like; it's similar across all servers: http://www.screencast.com/t/Ap0aIAeuwH

The load numbers in the thousands are extremely high compared to the average. Notice that the server wasn't taken offline, rebooted, or shut down (except in the case of the Echo server).
cvos Posted December 13, 2012

What is happening to Echo?
Michael D. (Author) Posted December 13, 2012

In reply to "what is happening to echo.":

See this earlier post: "Echo is requiring a file system check, it's the only server we were not able to get unlocked without a reset. The rest, the loads are stabilizing and things are returning to normal."
Michael D. (Author) Posted December 13, 2012

The Echo server is now rebooting, and should be online and stable within the next 10 minutes. This will mean all services are fully restored and operational.
joshualoy Posted December 13, 2012

cPanel is still inaccessible on Jasmine.
Scott Posted December 13, 2012

Echo is back online. It will be slow while it catches up with requests.
Scott Posted December 13, 2012

In reply to "cPanel is still inaccessible on Jasmine.":

It's working normally from here. Please open/update your support ticket so we can investigate.

Now I'm showing the 500 error as well. It's unrelated to the earlier issue affecting all servers, but we are debugging it now.
Kraken Posted December 13, 2012

I gather this is why realtime Google Analytics was showing 15 connections from Sammamish WA? They're gone now and all looks well.
Michael D. (Author) Posted December 13, 2012

Joshua, I provided some additional detail on this issue in the support ticket you opened. I've gotten in contact with the CEO and VP of Operations at cPanel to get the right people into the server immediately to investigate. cPanel has been trying to track down and reproduce this issue for months but hasn't been able to do so.

We expect to leave Jasmine in this state for no longer than 30 minutes, or until cPanel is done investigating, which they're extremely good at. I'll update this thread and the ticket once cPanel on Jasmine is 100%. I suspect most users aren't having issues, but some are seeing them intermittently.
Michael D. (Author) Posted December 13, 2012

cPanel restarted cPanel/WHM on Jasmine and it should function normally at this point. They did get what they needed and, hopefully, it contains the information necessary to squash this bug!

All services on all servers are 100% online and operational. If you have any general questions about this outage, feel free to ask them here. If you have any questions specific to your account, please open or update a support ticket.