


Unexpected Outage due to Stuck Process - All servers

Resolved

21 replies to this topic

#1 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 04:55 PM

Today we were doing some network maintenance and, as part of that, identifying the private networking ports on our servers to make sure we didn't mistakenly change the wrong thing; better safe than sorry. To do this we ran a tool called "ethtool", which makes a port's lights flash so it can be identified. We have run this tool hundreds if not thousands of times before, but this time the process got stuck on all servers and drove load so high that we could no longer access them.
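For readers unfamiliar with the tool, here is a minimal sketch of a port-identify call wrapped in a timeout. It assumes the standard `ethtool -p DEVNAME [SECONDS]` form; the function names and the interface name in the usage note are hypothetical, and nothing here is taken from the scripts actually used during the maintenance. A timeout cannot help if the process is stuck in uninterruptible kernel sleep, since signals are not delivered in that state, so this would not necessarily have prevented the outage.

```python
import subprocess


def build_identify_command(iface, seconds):
    """Build the argv for `ethtool -p IFACE SECONDS` (blink the port LED)."""
    return ["ethtool", "-p", iface, str(seconds)]


def run_with_timeout(cmd, timeout):
    """Run cmd and return its exit code, or None if it exceeded the timeout.

    subprocess.run() kills the child when the timeout expires and then
    raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None
```

A call such as `run_with_timeout(build_identify_command("eth0", 5), timeout=10)` would blink the hypothetical interface `eth0` for five seconds and give up after ten.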

We could immediately reset each server to bring it back online, but doing so risks data corruption and lengthy file system repairs (2 to 8 hours). Instead, we're going to log in to each server locally and kill the stuck processes manually.

It will take longer than a reset, but doesn't incur the same risk of corruption and lengthy file system repairs.

It's really an odd issue to have: the servers are 'online' in that they're powered up, operating, and have networking - they're just too busy with this stuck process to do anything else.

We'll update this thread as we have more information.
  • 0
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#2 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 13 December 2012 - 05:10 PM

All services should be returning to normal as the servers catch up.
  • 0
Scott S - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#3 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:14 PM

Echo requires a file system check; it's the only server we were not able to get unlocked without a reset. On the rest, loads are stabilizing and things are returning to normal.
  • 0

#4 cvos

cvos

    Newbie

  • Members
  • Pip
  • 20 posts
  • Gender:Not Telling

Posted 13 December 2012 - 05:18 PM

how long will this take?
  • 0

#5 8thos

8thos

    Newbie

  • Members
  • Pip
  • 1 posts
  • Gender:Male

Posted 13 December 2012 - 05:23 PM

Darn.
  • 0

#6 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:26 PM

how long will this take?

The FSCK on Echo will take anywhere from 30 minutes to 4 hours; I really have no way to be more specific than that. It can go up to 40% and then jump to "100%", or it could go up to 99% in 30 minutes and then take 3.5 hours to get the last 1%.

Here is an idea of what happened: (it's an image): http://www.screencast.com/t/o7LfO3MyJl

Loads skyrocketed into the thousands, which made the servers unresponsive even though they were online. Our data center techs were able to log in and fix every one of them except Echo, which needed a forced reset.
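To put "loads in the thousands" in context, here is a small sketch that normalizes a load average by CPU count. The function names and the threshold of 1.0 per CPU are illustrative assumptions, not figures from our monitoring. On Linux the load average counts processes in uninterruptible sleep as well as runnable ones, which is how stuck processes can inflate it.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Return the load average divided by the number of CPUs."""
    return load_avg / max(cpu_count, 1)


def is_overloaded(load_avg, cpu_count, threshold=1.0):
    """True when per-CPU load exceeds the (assumed) threshold."""
    return load_per_cpu(load_avg, cpu_count) > threshold


def current_status():
    """Read the 1-minute load average of this machine (Unix only)."""
    one_min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return one_min, cpus, is_overloaded(one_min, cpus)
```

With a hypothetical 16-core server, a load of 2000 is 125 per CPU, far beyond anything the machine can work through promptly.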

Jasmine was the last one to get back to stable, which happened in the last few minutes. All VPS are also being restarted one at a time (takes a few seconds each) to make sure their quotas are accurate, and this should be done soon.

I'll post updates on Echo as I can; it currently reports 47.7% completed.
  • 0

#7 SarisIsop

SarisIsop

    Advancing Member

  • Members
  • PipPipPip
  • 155 posts
  • Gender:Not Telling

Posted 13 December 2012 - 05:26 PM

All working for me.

Thanks.
  • 0

#8 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 13 December 2012 - 05:30 PM

VPS servers are restarting one by one to fix quotas. Each reboot only takes maybe 10 or 15 seconds on average, so all of the VPS servers should be back online shortly.
  • 0

#9 gentleman

gentleman

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 13 December 2012 - 05:36 PM

What about the other servers? I am on demeter and it has not been working for 1 hour now, and it is still not working.
  • 0

#10 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 13 December 2012 - 05:43 PM

What about the other servers? I am on demeter and it has not been working for 1 hour now, and it is still not working.


It should be stabilizing as we speak. Check again in five minutes.
  • 0

#11 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:43 PM

Indeed, Demeter was just now fixed - when we ran the original process kill it looks like we typo'd it by one letter (it doesn't allow Copy+Paste in that interface). I re-did it manually and load on Demeter is stabilizing.
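As an illustration of how a one-letter typo in a process name can be caught before any signal is sent, here is a hedged sketch that looks up PIDs by exact name. It assumes a Linux `/proc` layout where `/proc/<pid>/comm` holds the process name; `pids_by_name` is a hypothetical helper, not the command we ran in the data center interface.

```python
import os


def pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals name exactly."""
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    matches.append(int(entry))
        except OSError:
            continue  # process exited or its entry is unreadable
    return sorted(matches)
```

An empty result for a name that should be running is a hint that the name was mistyped, so it is worth checking before concluding that the kill succeeded.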

Echo reports 75% done on the file system check.
  • 0

#12 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:44 PM

All VPS have restarted for their quota checks and are online, and we expect no further interruptions to VPS service.
  • 0

#13 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:45 PM

Here's what this issue looked like; it's similar across all servers: http://www.screencast.com/t/Ap0aIAeuwH

The numbers in the thousands are extremely high compared to the average.

Notice that the server wasn't taken offline, rebooted, or shut down (except in the case of the Echo server).
  • 0

#14 cvos

cvos

    Newbie

  • Members
  • Pip
  • 20 posts
  • Gender:Not Telling

Posted 13 December 2012 - 05:51 PM

what is happening to echo.
  • 0

#15 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:53 PM

what is happening to echo.

See this:

Echo requires a file system check; it's the only server we were not able to get unlocked without a reset. On the rest, loads are stabilizing and things are returning to normal.


  • 0

#16 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 13 December 2012 - 05:54 PM

The Echo server is now rebooting, and should be online and stable within the next 10 minutes. This will mean all services are fully restored and operational.
  • 0

#17 joshualoy

joshualoy

    Newbie

  • Members
  • Pip
  • 19 posts

Posted 13 December 2012 - 05:56 PM

cPanel is still inaccessible on Jasmine.
  • 0

#18 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 13 December 2012 - 05:57 PM

Echo is back online. It will be slow while it catches up with requests.
  • 0

#19 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 13 December 2012 - 05:57 PM

cPanel is still inaccessible on Jasmine.


It's working normally from here. Please open/update your support ticket so we can investigate.

Now I'm showing the 500 error as well. It's unrelated to the earlier issue affecting all servers, but we are debugging it now.
  • 0

#20 Kraken

Kraken

    Newbie

  • Members
  • Pip
  • 13 posts

Posted 13 December 2012 - 06:10 PM

I gather this is why realtime Google Analytics was showing 15 connections from Sammamish WA? They're gone now and all looks well.
  • 0




