MDDHosting Forums

Jasmine Emergency Kernel Update Status



1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

2. It is not OK that you have no eta. I have client sites on here funneling thousands of dollars of leads to the email hosted on here.

3. I look unprofessional having recommended your server over, oh, let's say RACKSPACE, who would never do a KERNEL update on a live server

 

That was a fairly rude comment. If thousands of dollars are on the line for a hosting account worth a few bucks per month, I suggest you look into getting multiple dedicated servers and hosting accounts. While it's frustrating, and I myself will have about 3,000 users hitting our site within the next half hour (from our Twitter account @PHLSportsMedia, after the Phillies game), I think they are handling this very professionally, and it seems like they already know the cause and are working on it.


I didn't even know it was a cloud server. Cool! Is this why sometimes some people report an outdated super old page (rare but I get them now/then) randomly?

 

And anyways. This is the first problem I've had. I'm patient. You guys have been amazing thus far. Keep up the good work! :)


Jasmine users right now: http://i.imgur.com/Z7tEt.gif

I suspect that is our competition, not our clients... But thanks for the chuckle, either way.

 

 

I didn't even know it was a cloud server.

Lots of people define "the cloud" differently. This server doesn't fit under 95% of those definitions, and isn't what I would call a "cloud" server. We are running software called "CloudLinux," but the two are unrelated.


I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

 

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure, but once we have more details as to what exactly caused this I will be sure to post them.


I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

 

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure, but once we have more details as to what exactly caused this I will be sure to post them.

 

If this system meltdown thing was a pattern with MDD like it is at other places, then and only then would I worry. It's been a number of years since I moved over here, and this is the first time this has ever happened... can't say the same about other places.

 

Hopefully someone is trained in CPR.

 

and so it goes...


I have to say, this is the first time in over 5 years I've felt like I could have a heart attack while working to resolve an issue. We do still have another server offline with the exact same issue but it's the exact same hardware build so we're swapping processors on it as well. If all goes well it will come back online and all services will be 100% restored.

 

We're then going to investigate why this happened and see what we have to do to correct it - be it installing larger/different coolers on the processors, more fans in the chassis, discussing cooling with the facility, etc... At this point I don't have a 'Reason For Outage' beyond hardware failure, but once we have more details as to what exactly caused this I will be sure to post them.

 

I give you guys a ton of credit, I don't think I could handle that myself if I were in your shoes.

 

Yeah it would be interesting to find out why exactly it went out, I'm just glad it was a relatively easy repair. Good luck with the other server!


Thanks for the hard work guys, as well as the updates along the way. Despite the downtime I am probably more confident in your service than I was before this happened, due to you guys staying in communication with us about it while it happened, and already addressing how you will go about preventing this in the future.


I tend to handle most low-level administration tasks such as system failure, data corruption, etc... My immediate feeling was that there was hardware failure but I couldn't wrap my head around it happening to two separate servers simultaneously. I almost felt sure that it had to be something in the new kernel as that was the only thing that had changed.

 

Ultimately, had I gone with my first instinct and had the processors swapped, I could have saved a lot of downtime. I put off the hardware swap to try to fix the issue in software, both because I didn't really believe both servers could have the exact same hardware failure at the same time and because I didn't want the on-site staff to have to pull apart the servers and swap parts - at the end of the day I didn't want to make them do work for nothing [i.e. if it wasn't failed processors].

 

Honestly, should anything like this happen again (and here's hoping not), if we have even a hint that it's hardware, we're going to swap hardware immediately. I've learned quite a bit from this experience and will definitely not repeat the mistakes I made tonight when directing how we worked on this issue. Ultimately it was resolved more quickly than most providers in our situation would have resolved it, but not as quickly as I feel is appropriate - I try to hold us to a higher standard.

 

I sincerely apologize to all of our customers that were offline due to this issue. You certainly don't like being offline and we're certainly not in the business of providing downtime.

 

That said, I'm going to go through one-by-one and respond to every support ticket that was opened during the outage and address questions/comments/concerns as best I can. Once I've done this, I'm going to try and get some sleep so I can get up early tomorrow and begin testing those failed processors to see if we can determine what caused the failure so we can prevent it. I have to say that having [at least] two processors fail in two separate servers at the same time has to be the weirdest instance of failure I've seen since I've been in the industry.

 

If you have any general questions about the issue, feel free to post them here.


1. This is unreal.. really a cloud server that can not be rolled back while you figure out what is wrong? Are you kidding me?

2. It is not OK that you have no eta. I have client sites on here funneling thousands of dollars of leads to the email hosted on here.

3. I look unprofessional having recommended your server over, oh, let's say RACKSPACE, who would never do a KERNEL update on a live server

 

I sort of agree with Jacqui. I'm an IT professional in the corporate world, and hardware failure should really be a thing of the past with cloud servers and redundancy. In our organization, we have redundant servers, so if one server fails, all software and services are moved to another server without any interaction. Once the failure occurs, I'm alerted and fix the hardware issue, but I don't need to worry about services since they have already been migrated to another server. My job has become stress-free because of this design. How come MDDHosting doesn't have this type of system in place? Also, we have a test environment that duplicates the existing production environment, and we perform any major update there before applying it to production. If these two things were adopted by MDDHosting, I believe the downtime could have been avoided or, at the very least, minimized to several minutes.


I sort of agree with Jacqui. I'm an IT professional in the corporate world, and hardware failure should really be a thing of the past with cloud servers and redundancy.

We're moving in that direction but we're still too small to have the capital on-hand to buy the very expensive equipment it takes to do a cloud right.

 

I wish we could now and it's a matter of time but we're not there yet.

 

In our organization, we have redundant servers, so if one server fails, all software and services are moved to another server without any interaction. Once the failure occurs, I'm alerted and fix the hardware issue, but I don't need to worry about services since they have already been migrated to another server. My job has become stress-free because of this design. How come MDDHosting doesn't have this type of system in place?

cPanel in and of itself doesn't support that functionality and we're not to the point of being able to invest into a larger cloud infrastructure.

 

Also, we have a test environment that duplicates the existing production environment, and we perform any major update there before applying it to production. If these two things were adopted by MDDHosting, I believe the downtime could have been avoided or, at the very least, minimized to several minutes.

There is no way our test environment would know to replicate a processor failing - the update would have been successful and taken less than 5 minutes as it did on all other servers.

 

The issue wasn't the software update, the issue was that a processor in the server completely failed.


We're moving in that direction but we're still too small to have the capital on-hand to buy the very expensive equipment it takes to do a cloud right.

 

I wish we could now and it's a matter of time but we're not there yet.

 

cPanel in and of itself doesn't support that functionality and we're not to the point of being able to invest into a larger cloud infrastructure.

 

There is no way our test environment would know to replicate a processor failing - the update would have been successful and taken less than 5 minutes as it did on all other servers.

 

The issue wasn't the software update, the issue was that a processor in the server completely failed.

 

Thank you for being forthcoming and forthright. Although this situation is unfortunate, I have to say that MDDHosting is really a great hosting company compared to the other guys. This incident seems to be an exception to the norm rather than something that occurs every week, as with the other guys. That is why I will continue to use MDD.


To Mike, Scott and the crew at MDDHosting, thanks for the hard work during this unfortunate event. I'm sure you did what was required as fast as possible. As Gump said - It happens.

 

I have to mention that there are cloud providers out there that offer redundancy for mission critical applications, but you will pay for it.


I have to mention that there are cloud providers out there that offer redundancy for mission critical applications, but you will pay for it.

This is true and once we are able to offer such services the prices will likely be substantially higher than our current offerings as well. At the end of the day the more it costs us to operate a service the more we have to charge for it.

 

It's something we have been looking forward to/planning on though. When we do it, we'll likely do it from a different location to give our customers a choice of locations at that point as well.
