Unexpected Outage due to Stuck Process - All servers

Today we were doing some network maintenance, and as part of it we were identifying the private networking ports on our servers to make sure we didn't change the wrong thing; better safe than sorry. To do that we ran a tool called "ethtool", which makes a port's LED flash so it can be identified. We've run this tool hundreds if not thousands of times before, but this time the process got stuck on all servers and drove load so high that we could no longer access them.
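For anyone unfamiliar with the identify feature, here is a minimal sketch of how such a call can be wrapped. It is not the exact command we ran: the interface name "eth1", the 5 second duration, and both function names are assumptions made up for illustration. ethtool's identify option accepts an optional number of seconds to blink for; bounding it would not necessarily have prevented the hang described above.

```python
import subprocess


def build_identify_command(interface, seconds=5):
    """Build the argv that asks ethtool to blink a port's LED.

    `interface` and `seconds` are illustrative values, not the ones
    used during the maintenance described in this thread.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # `ethtool --identify DEVNAME [N]` blinks the LED for N seconds.
    return ["ethtool", "--identify", interface, str(seconds)]


def identify_port(interface, seconds=5):
    """Run the identify command and return ethtool's exit code.

    Requires root and a network driver that supports identify.
    """
    command = build_identify_command(interface, seconds)
    return subprocess.run(command, check=False).returncode
```

Calling identify_port("eth1") on a server would blink that port for five seconds, assuming the driver supports it.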


We could immediately reset each server to bring it back online, but doing so risks data corruption and lengthy file system repairs (2 to 8 hours). Instead, we're going to log into each server locally and kill the stuck processes manually.


This will take longer than a reset, but it avoids the risk of corruption and lengthy file system repairs.


It's really an odd issue to have: the servers are 'online' in that they're powered up, operating, and have networking. They're just too busy with this stuck process to do anything else.
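To show how a server can be "online" and still unusable, here is a rough sketch of checking load against CPU count. On Linux the load average counts tasks waiting in uninterruptible sleep as well as runnable ones, so stuck processes can push it very high. The function names, the sample numbers, and the factor of 4 are assumptions for illustration, not values from our monitoring.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages as floats.

    `text` is expected to look like the contents of /proc/loadavg,
    e.g. "0.20 0.18 0.12 1/80 11206".
    """
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("unexpected loadavg format")
    return tuple(float(value) for value in fields[:3])


def is_overloaded(load_1min, cpu_count, factor=4.0):
    """Flag a server whose 1 minute load far exceeds its CPU count.

    The factor of 4 is an arbitrary illustrative threshold.
    """
    return load_1min > cpu_count * factor


# Hypothetical sample resembling a load "in the thousands".
sample = "1523.40 1210.11 640.02 3021/3100 4242"
one_minute, five_minute, fifteen_minute = parse_loadavg(sample)
```

On a real Linux server the text would come from reading /proc/loadavg, and the CPU count from os.cpu_count().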


We'll update this thread as we have more information.

How long will this take?

The FSCK on Echo will take anywhere from 30 minutes to 4 hours; I really have no way to be more specific than that. It can go up to 40% and then jump to "100%", or it could reach 99% in 30 minutes and then take 3.5 hours to get through the last 1%.


Here is an idea of what happened (it's an image): http://www.screencast.com/t/o7LfO3MyJl


Loads skyrocketed into the thousands which, although the servers were online, made them unresponsive. Our data center techs were able to log in and fix every one of them except Echo, which needed a forced reset.


Jasmine was the last one to get back to stable, which happened in the last few minutes. All VPS are being restarted one at a time (takes a few seconds each) to make sure their quotas are accurate, and this should be done soon.


I'll update about Echo as I can; it currently reports 47.7% completed.

Indeed, Demeter was just now fixed. When we ran the original process kill, it looks like we typo'd it by one letter (that interface doesn't allow copy and paste). I re-did it manually and load on Demeter is stabilizing.


Echo reports 75% done on the file system check.

Here's what this issue looked like; it was similar across all servers: http://www.screencast.com/t/Ap0aIAeuwH


The numbers in the thousands are extremely high compared to the average.


Notice that the server wasn't taken offline, rebooted, or shut down (except in the case of the Echo server).

I provided some additional detail on this issue in the support ticket you opened. I've gotten in contact with the CEO and VP of Operations at cPanel to get the right people into the server immediately to investigate. cPanel has been trying to track down and reproduce this issue for months but hasn't been able to do so. We expect to leave it in this state for no longer than 30 minutes, or until cPanel is done investigating, which they're extremely good at.


I'll update this thread and ticket once cPanel on Jasmine is 100%. I suspect most users aren't having issues, but some are seeing them intermittently.

cPanel restarted cPanel/WHM on Jasmine and it should function normally at this point. They did get what they needed and, hopefully, it contains the information necessary to squash this bug!


All services on all servers are 100% online and operational. If you have any general questions about this outage, feel free to ask them here. If you have any questions specific to your account, please open or update a support ticket.

