
Jasmine Emergency Kernel Update Status

Resolved Jasmine

44 replies to this topic

#21 Jacqui Best

Jacqui Best

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 14 May 2013 - 07:58 PM

1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

2. It is not OK that you have no eta. I have client sites on here funneling thousands of dollars of leads to the email hosted on here. 

3. I look unprofessional having recommended your server over oh lets say RACKSPACE who would never do a  KERNEL update on a live server


  • 0

#22 Ryan

Ryan

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 14 May 2013 - 08:07 PM

Hello everybody,

 

I as well as two other administrators have been working to diagnose the issues with the Jasmine server and we've not been able to get the server to function properly.  Everything is indicating hardware failure - it looks like processor failure.  We're working as hard as we can to get service restored.

Better than hard drive I guess.  Thanks for being on top of everything Mike. First time there has been an issue since I've been a customer for years and it seems like you guys are taking care of it as fast as you can. 


  • 0

#23 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 08:09 PM

1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

We run 'CloudLinux', which is a Linux distribution; it is not a 'cloud server'.

2. It is not OK that you have no eta. I have client sites on here funneling thousands of dollars of leads to the email hosted on here.

If we knew what the issue was we could provide an ETA and would have it resolved as quickly as possible.

3. I look unprofessional having recommended your server over oh lets say RACKSPACE who would never do a  KERNEL update on a live server

Kernel updates are standard fare; the issue is that the new kernel didn't boot, and upon rolling back to the old kernel it also would not boot. Data is intact and the array is working fine, which indicates a hardware issue, most likely with either the CPU or RAM, and processors are being swapped now.
  • 0
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#24 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 08:11 PM

Processors have been swapped, booting it up now.
  • 0

#25 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 14 May 2013 - 08:11 PM

Hi Jacqui,

Thank you for your feedback. Please let me address each comment point by point:

1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

When the kernel update failed, the first thing we did was attempt to roll back the kernel. Unfortunately, this also failed. As Michael explained in a post above yours, this points to a hardware (processor) failure and we are currently having the hardware in the server checked, and replaced if necessary.

2. It is not OK that you have no eta.

Any ETA would be an absolute guess. We wouldn't feel comfortable advising you that the server will be online in any amount of time unless we were sure it would be back online by then. Under normal circumstances, and for most issues, we can give an ETA, but emergency maintenance is different.

I have client sites on here funneling thousands of dollars of leads to the email hosted on here.

3. I look unprofessional having recommended your server over oh lets say RACKSPACE who would never do a KERNEL update on a live server

I'm sorry to hear about your company's image. As I am sure you can imagine, this doesn't look great for us either. The bottom line is that there was a critical zero-day root escalation exploit being used in the wild, and it was imperative that we update our servers before any client site was compromised and used to take full access to the server. We would normally perform extensive testing prior to updating our servers' kernel, but in this case we needed to act quickly before anything bad could happen. The kernel patches we applied came from a trusted source and we felt confident we could update safely. Unfortunately, we didn't anticipate hardware failure in one of our newer servers.

You may wish to read more about the zero day vulnerability in our forum thread for that topic: http://forums.mddhos...-security-hole/
  • 0
Scott S - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#26 Ryan

Ryan

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 14 May 2013 - 08:12 PM

1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

2. It is not OK that you have no eta. I have client sites on here funneling thousands of dollars of leads to the email hosted on here. 

3. I look unprofessional having recommended your server over oh lets say RACKSPACE who would never do a  KERNEL update on a live server

 

That was a fairly rude comment. If thousands of dollars are on the line for a hosting account worth a few bucks per month, I suggest you look into getting multiple dedicated servers and hosting accounts. While it's frustrating, and I myself will have about 3,000 users hitting our site within the next 1/2 hour (from our Twitter account @PHLSportsMedia for after the Phillies game), I think they are handling this very professionally and it seems like they already know the cause/issue and are working on it.


  • 0

#27 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 08:13 PM

It is indeed processor failure - the server has booted up now [into the wrong kernel] and we're issuing a re-boot on the server to move into the right kernel and service should be restored. ETA 5~10 min.
  • 0

#28 Flix

Flix

    Newbie

  • Members
  • Pip
  • 14 posts

Posted 14 May 2013 - 08:14 PM

Jasmine users right now: http://i.imgur.com/Z7tEt.gif


  • 1

#29 ashkir

ashkir

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 14 May 2013 - 08:15 PM

I didn't even know it was a cloud server. Cool! Is this why sometimes some people report an outdated super old page (rare but I get them now/then) randomly?

 

And anyways. This is the first problem I've had. I'm patient. You guys have been amazing thus far. Keep up the good work! :)


  • 0

#30 Scott

Scott

    MDDHosting Staff

  • Staff Administrator
  • PipPipPipPip
  • 421 posts
  • Gender:Male

Posted 14 May 2013 - 08:19 PM

Jasmine users right now: http://i.imgur.com/Z7tEt.gif


I suspect that is our competition, not our clients... But thanks for the chuckle, either way.
 
 

I didn't even know it was a cloud server.


Lots of people define "the cloud" differently. This server doesn't fit under 95% of those definitions, and isn't what I would call a "cloud" server. We are running software called "CloudLinux," but the two are unrelated.
  • 0

#31 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 08:20 PM

Jasmine is online; it's playing catch-up to the monumental pile of requests it's facing due to being offline. I estimate 5~10 minutes before things normalize.
  • 2

#32 El Skeptico

El Skeptico

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 14 May 2013 - 08:23 PM

Jasmine users right now: http://i.imgur.com/Z7tEt.gif

 

Oh yeah...  :D


  • 1

#33 dlyons

dlyons

    Newbie

  • Members
  • Pip
  • 8 posts
  • Gender:Female
  • Location:New Jersey

Posted 14 May 2013 - 08:26 PM

Awesome news! Thanks for all the hard work guys and keeping us updated as much as you could. It's been tense watching the thread to say the least and I can't even imagine how crazy it must be over by you.


  • 0

#34 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 08:27 PM

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.
  • 0

#35 El Skeptico

El Skeptico

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 14 May 2013 - 08:36 PM

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.

 

If this system meltdown thing was a pattern with MDD like it is other places, then and only then would I worry. Been a number of years since I moved over here, first time this has ever happened...can't say the same about other places.

 

Hopefully someone is trained in CPR.

 

and so it goes...


  • 0

#36 dlyons

dlyons

    Newbie

  • Members
  • Pip
  • 8 posts
  • Gender:Female
  • Location:New Jersey

Posted 14 May 2013 - 08:36 PM

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.

 

I give you guys a ton of credit, I don't think I could handle that myself if I were in your shoes.

 

Yeah it would be interesting to find out why exactly it went out, I'm just glad it was a relatively easy repair. Good luck with the other server!


  • 0

#37 FirestormIS

FirestormIS

    Newbie

  • Members
  • Pip
  • 5 posts
  • Gender:Male

Posted 14 May 2013 - 08:37 PM

Thanks for the hard work guys, as well as the updates along the way. Despite the downtime I am probably more confident in your service than I was before this happened, due to you guys staying in communication with us about it while it happened, and already addressing how you will go about preventing this in the future.


  • 0

#38 Juan

Juan

    Newbie

  • Members
  • Pip
  • 9 posts
  • Gender:Male

Posted 14 May 2013 - 08:45 PM

Great job MDD! Thank you for keeping us all updated. :)


  • 0

#39 jonwatson

jonwatson

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 14 May 2013 - 09:05 PM

My last host had a similar issue: rebooting into a new kernel and had problems bringing things back up. That resulted in over 28 hours of downtime.

 

This, MDD, is good work.

 

Thanks!

 

Jon


  • 0

#40 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 14 May 2013 - 10:05 PM

I tend to handle most low-level administration tasks such as system failure, data corruption, etc... My immediate feeling was that there was hardware failure but I couldn't wrap my head around it happening to two separate servers simultaneously. I almost felt sure that it had to be something in the new kernel as that was the only thing that had changed.

Ultimately, had I gone with my first instinct and had the processors swapped, I could have saved a lot of downtime. I put off the hardware swap in order to try and fix the issue in software, both because I didn't really believe both servers could have the exact same hardware failure at the same time and because I didn't want to have the on-site staff pull apart the servers and swap parts - at the end of the day I didn't want to make them do work for nothing [i.e. if it wasn't failed processors].

Honestly should anything like this happen again, and here's hoping not, and we even have the hint that it's hardware - we're going to swap hardware immediately. I've learned quite a bit from this experience and will definitely not make the same mistakes in the future that I've made tonight when directing how we were working on this issue. Ultimately it was resolved quicker than most providers in our situation would resolve it, but not as quick as I feel appropriate - I try to hold us to a higher standard.

I sincerely apologize to all of our customers that were offline due to this issue. You certainly don't like being offline and we're certainly not in the business of providing downtime.

That said, I'm going to go through one-by-one and respond to every support ticket that was opened during the outage and address questions/comments/concerns as best I can. Once I've done this, I'm going to try and get some sleep so I can get up early tomorrow and begin testing those failed processors to see if we can determine what caused the failure so we can prevent it. I have to say that having [at least] two processors fail in two separate servers at the same time has to be the weirdest instance of failure I've seen since I've been in the industry.

If you have any general questions about the issue, feel free to post them here.
  • 0




