MDDHosting Forums

Major Outage - 09/21/18+ - Client Discussion


KevinD872

A few things I did and a few things I learned. I have a reseller account with a few clients.

I found out about the outage about 10 minutes after it happened, but it was about 30 minutes before I could verify my sites were truly offline. I think by that time or soon after I received an email from Mike about the issue.

 

The first suggestion I made to one of my clients was to switch the email forwarding that pointed at their domain over to their Gmail account instead. While it was inconvenient, and they did have some mail bounce as the outage went on, they were able to continue to accept and service orders, as their main site is not hosted by MDD. They were very grateful for the suggestion even though their secondary domain hosted on MDD was down.

 

The second thing I did was inform my other clients that there would be an extended outage of their websites, as there was an issue in the data center hosting them, and that I would keep them updated as to when I thought the outage would be over.

 

The third thing was to verify how recent my offline backups were. I auto-generate and auto-download copies of the site files and databases for each client and keep them offline. They were a few days old, and the sites don't change much. Good there, but I was not doing cPanel full backups, so no email was backed up. My bad there; I never thought about it.
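For anyone who wants to automate that "how new are my offline backups" check, here is a minimal sketch in Python. It assumes a layout that is not described in this thread: one folder per client under a single backup root, with the downloaded archives inside. The function names and the 7-day threshold are hypothetical and only for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional


def newest_backup_age_days(backup_dir, now=None) -> Optional[float]:
    """Return the age in days of the newest file under backup_dir.

    Returns None when the folder contains no files at all.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400  # seconds per day


def stale_clients(root, max_age_days=7.0, now=None) -> List[str]:
    """List client folders whose newest backup is missing or too old.

    Assumes one sub-folder per client directly under root (hypothetical layout).
    """
    stale = []
    for client_dir in sorted(Path(root).iterdir()):
        if not client_dir.is_dir():
            continue
        age = newest_backup_age_days(client_dir, now)
        if age is None or age > max_age_days:
            stale.append(client_dir.name)
    return stale
```

Calling `stale_clients("/path/to/backups")` from a scheduled job would give a list of clients to look at. It only checks file dates, so it says nothing about whether a backup actually contains email or would restore cleanly.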

 

Next I waited and monitored until late Friday evening to see if things would recover. When nothing had recovered after several hours, I sent a trouble ticket asking if any servers were still operational and if the outage would go past Noon on Saturday. VPS servers were operational and I could get an account on one to restore the sites if I wanted. Mike thought it would be fine by early morning. This was before they knew the data arrays were corrupted and could not be recovered.

 

Saturday noon came and went, and the news about the situation was not good. I informed my clients and asked if they were good to wait until the servers were restored from backups, which could be Tuesday or Wednesday, or if they wanted me to move their sites temporarily. They elected to just wait, and those domains were not likely to miss any emails.

 

A few things I learned that might help others in the future.

I thought I was prepared, as I have in the past had hosts disappear in the night, an owner die, etc., but I learned I was not fully prepared to recover all data. I need to make sure I have email backed up and a full cPanel copy in addition to the site and database copies that I already make.

I was much better off than those who had stored their clients' backups online, or who did not have current backups, or had none at all.

No matter how good the host service, how big (or small), how much trust there is, how good their customer service is, always be prepared to have all of your data and email, or a customer's data and email, lost in an instant. Heed the "you are responsible for all data, and your own backups, we are not responsible for lost data" that every host posts.

Don't wait forever to inform your customers and be honest with them about the issue.

This I already know from my primary job: whatever the problem, it will take at least 4 times as long to fix as first thought.

Have a plan to bring your clients' websites back online quickly if the outage will be extended, even if it costs you money in the short term.

Send a follow-up letter summarizing the outage and what you will or can do better to prevent extended downtime in the future.


I don't think I've seen anyone mention this, but while my sites on s2 all came back properly, I did lose all of the logs that are used for things like awstats and Webalizer. In fact, Webalizer was turned off (the normal default). Personally, I don't care about this, but if anyone is using the metrics available through cPanel, you might want to take a look at what's there (in the tmp directory).


Command-line commands generally don't ask "are you sure?". If you have the authority to run the command, it executes immediately.

I think that's probably true, but, as has been quite painfully demonstrated at MDD, it makes no sense when the command can be so destructive. I don't expect that administrators can control it, but confirmations would help:

 

Are you sure you want to DELETE all data on ALL SERVERS? Type YES to continue, NO to cancel.

 

Or, if the confirmation text could be configured:

 

Are you sure you want to put MDD out of business? Type YES to continue, NO to cancel.
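As a rough illustration of the kind of confirmation described above, here is a minimal Python sketch of a wrapper that refuses to run an action unless the operator types an exact phrase. Everything in it (the function names, the `template` parameter, the idea of wrapping commands in Python at all) is an assumption for illustration and not how MDD's systems work.

```python
DEFAULT_TEMPLATE = (
    "Are you sure you want to {action}? "
    "Type {expected} to continue, NO to cancel: "
)


def confirm_destructive(action, expected="YES", template=DEFAULT_TEMPLATE,
                        prompt_fn=input):
    """Return True only if the operator types the expected phrase exactly.

    template is the configurable confirmation text; prompt_fn is injectable
    so the check can be exercised without a terminal.
    """
    answer = prompt_fn(template.format(action=action, expected=expected))
    return answer.strip() == expected


def run_guarded(action, command, prompt_fn=input):
    """Run command() only after a successful confirmation."""
    if not confirm_destructive(action, prompt_fn=prompt_fn):
        print("Cancelled.")
        return False
    command()
    return True
```

Anything other than the exact phrase, including a lowercase "yes" or simply pressing Enter, cancels the action, so a careless keystroke is less likely to get through.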

 

Also, is a block discard command EVER used on these systems? I simply can't get my head around how this command, significantly different than the cleanup command, could be entered.

 

Regardless, it happened and Mike and crew did the right thing, an outstanding effort to restore as quickly as possible. And they are addressing the issues that caused it and will make recovery faster and less traumatic in future.



We've actually blacklisted the command. It won't ask, 'Are you sure,' it will say, 'You are not permitted to run this command.'

 

There are reasons you would use a block discard - it's a valid command when used in the right situation. This wasn't one of them.
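To picture the difference between "ask first" and "refuse outright", here is a small Python sketch of a deny-list check. It is only an illustration: the thread does not say which command was run or how the block is enforced, so the `blkdiscard` entry (a real util-linux tool, used here as an assumed example) and all of the names below are hypothetical.

```python
import os
import shlex

# Assumed example entry; the thread does not name the exact command.
BLOCKED_COMMANDS = frozenset({"blkdiscard"})


class CommandNotPermitted(Exception):
    """Raised when a command line starts with a blocked executable."""


def check_command(command_line, blocked=BLOCKED_COMMANDS):
    """Split command_line and refuse it if its executable is on the deny list.

    Returns the parsed argument list when the command is allowed.
    """
    argv = shlex.split(command_line)
    if not argv:
        raise ValueError("empty command")
    # Compare by base name so /sbin/blkdiscard is caught as well.
    if os.path.basename(argv[0]) in blocked:
        raise CommandNotPermitted("You are not permitted to run this command.")
    return argv
```

A check like this only inspects the first word, so it would miss a blocked command hidden behind `sudo` or a shell script. Real enforcement would have to sit at the shell or privilege layer.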


Thanks for the work and updates.

These types of events are never fun for the client or the provider.

 

While it was far from ideal, things do happen. I have had things happen even on AWS with multiple layers of redundancy.

 

The biggest message I would leave for the folks that are really pushing on business disruption is if your services are that critical, have layers of redundancy that you manage as well.

For example, my eggs are not all in one basket for critical services. NameServers, DNS Zones, Web Hosting, and Email are all on separate providers that can be independently managed.

 

If your email is that critical, you really should not be hosting it on your web host. Get a Google or Office365 account or something more specialized.

If your site is that critical, have a backup and mirrors that you can bring online.

 

No technology is perfect, even ones we pay exponentially more for.


We’re evaluating what options there are so that hopefully we can offer such functionality for you. I know it’s doable with a custom script of some kind but it would be nice for it to be built in.

Thanks. Please keep us posted on this progress. It would be nice to implement it as soon as possible.


Mike, just want to thank you again for such great service over the years and for your hard work and transparency during this whole ordeal. I know this has been very stressful and problematic for you and for so many people.

Your professionalism has been exemplary – working hard within the limits of human sleep deprivation and equipment bottlenecks, keeping us updated so frequently, outlining your plans for improving things in the future and responding calmly no matter what the tone of the message.
I've also been impressed by the high levels of professionalism, tolerance, compassion and helpfulness of the user posts. Feels like a solid community that I'm happy to be part of.
I'm following your lead and thinking hard about what I can do differently to minimize disruption should a major problem arise in the future. As a non-IT professional with a couple of low-traffic non-commerce websites and lots of email accounts in my domains, I've been good about keeping website backups with a WordPress plugin but honestly didn't realize that cPanel backups would include all the email accounts so I thought they were redundant and didn't do them. And I was one of the people who didn't understand that the servers were up and I could have restored my websites with my backups until you spelled it out on Monday in this thread (I'm not on Twitter).
So going forward I'm going to keep up-to-date cPanel backups and I'll remember that “the data needs to be restored” does not mean “the server is down” (I'm sure that's glaringly obvious to an IT professional, but we enthusiastic amateurs, well, our minds work in some strange ways).
Maybe this would be a good time to update/expand a couple of articles in the Knowledgebase, to help us do our part? I see there is a category called “Backups and Restorations” that has no articles in it, and two brief articles on backing up and restoring in the “Account Questions” category.
Thanks again for your commitment to maintaining such high standards. We all make mistakes; what matters is how we respond to them, and again, I think you have been exemplary. I'm sure that this episode is costing MDDHosting a bundle. Please let us know if it compromises the viability of your company. I'm sure there are a lot of us who would happily jump on that GoFundMe train rather than see MDDHosting go out of business.

From the 2 cents department, and maybe this has been talked about already (this is a loooong thread), I wonder if you have considered implementing software that requires 2 separate users entering their password for commands that are potentially catastrophic. (and no cheating :-))

It has been mentioned. We already have a system that outright blocks destructive commands but this command wasn’t on the list. It is now, that is for sure.
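For readers curious what the "two separate users" idea could look like, here is a minimal Python sketch of a two-person gate. It is purely illustrative and assumes a made-up credential store (a dictionary of salted password hashes). The class and function names are hypothetical and do not describe any system MDD uses.

```python
import hashlib
import hmac


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256 (standard library)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


class TwoPersonGate:
    """Authorize an action only when two different users both verify."""

    def __init__(self, credentials):
        # credentials: {username: (salt_bytes, password_hash_bytes)}
        self._credentials = credentials

    def _verify(self, user, password):
        entry = self._credentials.get(user)
        if entry is None:
            return False
        salt, expected = entry
        # Constant-time comparison of the derived hash.
        return hmac.compare_digest(hash_password(password, salt), expected)

    def authorize(self, approvals):
        """approvals is a list of (user, password) pairs.

        Returns True only if at least two distinct users verified, so one
        person entering their own password twice does not count.
        """
        verified = {user for user, pw in approvals if self._verify(user, pw)}
        return len(verified) >= 2
```

Collecting verified usernames in a set is what covers the "no cheating" part: the same user approving twice still counts as one approval.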

cPanel, in their infinite wisdom, stores stats data in the temporary folder. It’s something I will be addressing with both JetBackup as well as cPanel on Monday or Tuesday.

 

I have a full cPanel backup of my own. Could I do something with it, such as uploading what AWStats needs to the temp folder?


One more suggestion: I didn't even know about these forums until I searched on Google for info about the outage. Now that I'm looking for it I do see at the bottom of your home page a link to Community Forums, but within the Client Area I don't see any link or information. I would suggest adding a link in the Client-Area Support dropdown menu. I bet this would also cut down on the number of tickets submitted.
