Resolved Jasmine Emergency Kernel Update Status

Ryan · May 15, 2013

1. This is unreal... really, a cloud server that cannot be rolled back while you figure out what is wrong? Are you kidding me?
2. It is not OK that you have no ETA. I have client sites on here funneling thousands of dollars of leads to the email hosted on here.
3. I look unprofessional having recommended your server over, oh, let's say RACKSPACE, who would never do a KERNEL update on a live server.

That was a fairly rude comment. If thousands of dollars are on the line for a hosting account worth a few bucks per month, I suggest you look into getting multiple dedicated servers and hosting accounts. While it's frustrating, and I myself will have about 3,000 users hitting our site within the next 1/2 hour (from our Twitter account @PHLSportsMedia, after the Phillies game), I think they are handling this very professionally, and it seems like they already know the cause/issue and are working on it.

Michael D. · May 15, 2013

It is indeed processor failure - the server has booted up now [into the wrong kernel] and we're issuing a re-boot on the server to move into the right kernel and service should be restored. ETA 5~10 min.

Flix · May 15, 2013

Jasmine users right now: http://i.imgur.com/Z7tEt.gif

ashkir · May 15, 2013

I didn't even know it was a cloud server. Cool! Is this why sometimes some people report an outdated super old page (rare but I get them now/then) randomly?

And anyways. This is the first problem I've had. I'm patient. You guys have been amazing thus far. Keep up the good work!

Scott · May 15, 2013

Jasmine users right now: http://i.imgur.com/Z7tEt.gif

I suspect that is our competition, not our clients... But thanks for the chuckle, either way.

I didn't even know it was a cloud server.

Lots of people define "the cloud" differently. This server doesn't fit under 95% of those definitions, and isn't what I would call a "cloud" server. We are running software called "CloudLinux," but the two are unrelated.

Michael D. · May 15, 2013

Jasmine is online; it's playing catch-up on the monumental pile of requests it's facing due to being offline. I estimate 5~10 minutes before things normalize.

El Skeptico · May 15, 2013

Jasmine users right now: http://i.imgur.com/Z7tEt.gif

Oh yeah...

dlyons · May 15, 2013

Awesome news! Thanks for all the hard work guys and keeping us updated as much as you could. It's been tense watching the thread to say the least and I can't even imagine how crazy it must be over by you.

Michael D. · May 15, 2013

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.

El Skeptico · May 15, 2013

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.

If this system meltdown thing was a pattern with MDD like it is other places, then and only then would I worry. Been a number of years since I moved over here, first time this has ever happened...can't say the same about other places.

Hopefully someone is trained in CPR.

and so it goes...

dlyons · May 15, 2013

I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure but once we have more details as to what exactly caused this I will be sure to post them up.

I give you guys a ton of credit, I don't think I could handle that myself if I were in your shoes.

Yeah it would be interesting to find out why exactly it went out, I'm just glad it was a relatively easy repair. Good luck with the other server!

FirestormIS · May 15, 2013

Thanks for the hard work guys, as well as the updates along the way. Despite the downtime I am probably more confident in your service than I was before this happened, due to you guys staying in communication with us about it while it happened, and already addressing how you will go about preventing this in the future.

Juan · May 15, 2013

Great job MDD! Thank you for keeping us all updated.

jonwatson · May 15, 2013

My last host had a similar issue: it rebooted into a new kernel and had problems bringing things back up. That resulted in over 28 hours of downtime.

This, MDD, is good work.

Thanks!

Jon

Michael D. · May 15, 2013

I tend to handle most low-level administration tasks such as system failure, data corruption, etc... My immediate feeling was that there was hardware failure but I couldn't wrap my head around it happening to two separate servers simultaneously. I almost felt sure that it had to be something in the new kernel as that was the only thing that had changed.

Ultimately had I gone with my first instinct and had the processors swapped I could have saved a lot of downtime. I put off the hardware swap in order to try and fix the issue in software both because I didn't really believe both servers could have the exact same hardware failure at the same time and I also didn't want to have the on-site staff have to pull apart the servers and swap parts - at the end of the day I didn't want to make them do work for nothing [i.e. if it wasn't failed processors].

Honestly, should anything like this happen again (and here's hoping not) and we have even a hint that it's hardware, we're going to swap hardware immediately. I've learned quite a bit from this experience and will definitely not repeat the mistakes I made tonight when directing how we worked on this issue. Ultimately it was resolved quicker than most providers in our situation would resolve it, but not as quickly as I feel is appropriate - I try to hold us to a higher standard.

I sincerely apologize to all of our customers that were offline due to this issue. You certainly don't like being offline and we're certainly not in the business of providing downtime.

That said, I'm going to go through one-by-one and respond to every support ticket that was opened during the outage and address questions/comments/concerns as best I can. Once I've done this, I'm going to try and get some sleep so I can get up early tomorrow and begin testing those failed processors to see if we can determine what caused the failure so we can prevent it. I have to say that having [at least] two processors fail in two separate servers at the same time has to be the weirdest instance of failure I've seen since I've been in the industry.

If you have any general questions about the issue, feel free to post them here.

ithelpme · May 15, 2013

1. This is unreal... really, a cloud server that cannot be rolled back while you figure out what is wrong? Are you kidding me?
2. It is not OK that you have no ETA. I have client sites on here funneling thousands of dollars of leads to the email hosted on here.
3. I look unprofessional having recommended your server over, oh, let's say RACKSPACE, who would never do a KERNEL update on a live server.

I sort of agree with Jacqui. I'm an IT professional in the corporate world and Hardware Failure should really be a thing of the past with cloud servers and redundancy. In our organization, we have redundant servers so if 1 server fails, all software and services are moved to another server without any interaction. Once the failure occurs, I'm alerted of the failure and fix the hardware issue but I don't need to worry about services since they have already been moved or migrated to another server. My job has become stress free because of this design. How come MDDHosting doesn't have this type of system in place? Also, we have a test development environment in which we duplicate the existing production environment and before any major update, we perform it in the test environment before production. If these two things are adopted by MDDHosting, I believe the downtime could have been avoided or at the very least, minimized to several minutes.

Michael D. · May 15, 2013

I sort of agree with Jacqui. I'm an IT professional in the corporate world and Hardware Failure should really be a thing of the past with cloud servers and redundancy.

We're moving in that direction but we're still too small to have the capital on-hand to buy the very expensive equipment it takes to do a cloud right.

I wish we could now and it's a matter of time but we're not there yet.

In our organization, we have redundant servers so if 1 server fails, all software and services are moved to another server without any interaction. Once the failure occurs, I'm alerted of the failure and fix the hardware issue but I don't need to worry about services since they have already been moved or migrated to another server. My job has become stress free because of this design. How come MDDHosting doesn't have this type of system in place?

cPanel in and of itself doesn't support that functionality and we're not to the point of being able to invest in a larger cloud infrastructure.

Also, we have a test development environment in which we duplicate the existing production environment and before any major update, we perform it in the test environment before production. If these two things are adopted by MDDHosting, I believe the downtime could have been avoided or at the very least, minimized to several minutes.

There is no way our test environment would know to replicate a processor failing - the update would have been successful and taken less than 5 minutes as it did on all other servers.

The issue wasn't the software update, the issue was that a processor in the server completely failed.

ithelpme · May 15, 2013

We're moving in that direction but we're still too small to have the capital on-hand to buy the very expensive equipment it takes to do a cloud right.

I wish we could now and it's a matter of time but we're not there yet.

cPanel in and of itself doesn't support that functionality and we're not to the point of being able to invest in a larger cloud infrastructure.

There is no way our test environment would know to replicate a processor failing - the update would have been successful and taken less than 5 minutes as it did on all other servers.

The issue wasn't the software update, the issue was that a processor in the server completely failed.

Thank you for being forthcoming and forthright. Although this situation is unfortunate, I have to say that MDDHosting is really a great hosting company compared to the other guys. This incident seems to be an exception to the norm rather than something that occurs every week like with the other guys. That is why I will continue to use MDD.

T0M · May 15, 2013

To Mike, Scott and the crew at MDDHosting, thanks for the hard work during this unfortunate event. I'm sure you did what was required as fast as possible. As Gump said - It happens.

I have to mention that there are cloud providers out there that offer redundancy for mission critical applications, but you will pay for it.

Michael D. · May 15, 2013

I have to mention that there are cloud providers out there that offer redundancy for mission critical applications, but you will pay for it.

This is true and once we are able to offer such services the prices will likely be substantially higher than our current offerings as well. At the end of the day the more it costs us to operate a service the more we have to charge for it.

It's something we have been looking forward to/planning on though. When we do it, we'll likely do it from a different location to give our customers a choice of locations at that point as well.
