MDDHosting Forums

Major Outage - 09/21/18 - 09/24/2018



We are now starting the restore of R3

 

s1 Completed
p1 Completed
r1 Completed
p2 Completed
s2 Completed
r2 Completed
s3 In Progress
r3 In Progress
s4 Wednesday, September 26, 2018 at 5:00:00 AM EDT
r4 Completed
s5 Wednesday, September 26, 2018 at 11:00:00 AM EDT
s0 Wednesday, September 26, 2018 at 11:00:00 AM EDT

For those worried that something like this could happen again:

 

We have already enabled snapshots on our storage cluster. We're doing one snapshot every hour and keeping them for 10 hours.

 

So from a hypothetical standpoint - let's say that this did manage to happen again. We would simply mount a snapshot from before the incident - within the hour before - and boot everything back up.

 

Total downtime would be about 5 minutes for the whole network. Would there be any data loss? Possibly anything written since the last snapshot, so an hour's worth or less - but nothing compared to the losses of a multi-day outage.
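To make the arithmetic above concrete, here is a minimal sketch of the hourly/10-hour snapshot schedule and of picking the snapshot to roll back to. The function names and the example timestamps are made up for illustration; this is not StorPool's API or our actual tooling, just the selection logic under the assumption that snapshots land exactly on the hour.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(hours=1)  # from the post: one snapshot every hour
RETENTION = timedelta(hours=10)         # from the post: kept for 10 hours


def retained_snapshots(now, first_snapshot):
    """Return the timestamps of snapshots still retained at `now`.

    Assumes (hypothetically) that snapshots are taken at exact
    SNAPSHOT_INTERVAL steps starting from `first_snapshot`.
    """
    snapshots = []
    t = first_snapshot
    while t <= now:
        if now - t <= RETENTION:
            snapshots.append(t)
        t += SNAPSHOT_INTERVAL
    return snapshots


def pick_recovery_snapshot(snapshots, incident_time):
    """Return the latest snapshot taken strictly before the incident, or None."""
    candidates = [s for s in snapshots if s < incident_time]
    return max(candidates) if candidates else None


# Hypothetical example: an incident at 11:20, noticed at 12:30.
now = datetime(2018, 9, 26, 12, 30)
incident = datetime(2018, 9, 26, 11, 20)
snaps = retained_snapshots(now, first_snapshot=datetime(2018, 9, 25, 0, 0))
chosen = pick_recovery_snapshot(snaps, incident)
print("roll back to:", chosen)
print("writes at risk:", incident - chosen)
```

With hourly snapshots the window of writes at risk is always under one hour, which is where the "preceding hour or less" figure comes from. The 10-hour retention also means the incident has to be caught within roughly 10 hours for a pre-incident snapshot to still exist.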

 

It would look literally like we just shut everything down and booted it back up. No 'restorations', no lost emails, nothing. There's a great chance almost nobody would even notice.

 

This is something that our storage vendor, StorPool, set up for us immediately upon seeing what had happened. They actually apologized that it was not already set up and said that as a result of our disaster they are going to make sure that it is a default behavior that has to be actively disabled rather than the other way around.

 

Even with these snapshots, as powerful as they are, we are still going to overhaul our backup servers. We have identified the issues with the present setup that made restorations so slow, and we already have fixes planned for once we are fully online and all of our clients are taken care of.

 

Snapshots are a very powerful tool against data loss and corruption. We actually used them a couple of times on our old storage platform, the Nimble CS500, to recover data on servers when clients made big mistakes themselves.


Once all servers are fully restored we will be performing a quick reboot of each one. Each reboot should take ~30 seconds - we're doing this to get the systems into a clear/fresh state after all of the massive restorations and data transfers.


Identified a setting that needed to be changed - Deferred Webserver Restarts. The webserver was restarting on S4 every ~3 seconds - it now waits 5 minutes between restarts, and things are stable and MUCH faster. Replicated this network-wide [after verification by another admin first].
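For anyone curious why this makes such a difference, here is a rough sketch of the idea behind deferring restarts: requests that arrive inside the waiting window are coalesced into a single restart. The `DeferredRestarter` class below is hypothetical and is not how the control panel implements the setting; only the ~3 second and 5 minute figures come from the post.

```python
class DeferredRestarter:
    """Coalesce restart requests so at most one restart runs per `min_interval` seconds."""

    def __init__(self, min_interval, restart_fn):
        self.min_interval = min_interval
        self.restart_fn = restart_fn
        self.last_restart = None  # time of the most recent restart, if any
        self.pending = False      # True when a request is waiting to be served

    def request(self, now):
        """Record a restart request made at time `now` (in seconds)."""
        self.pending = True
        self.tick(now)

    def tick(self, now):
        """Run the restart if one is pending and the interval has elapsed."""
        due = self.last_restart is None or now - self.last_restart >= self.min_interval
        if self.pending and due:
            self.restart_fn()
            self.last_restart = now
            self.pending = False


def count_restarts(min_interval, request_every, duration):
    """Simulate restart requests every `request_every` seconds for `duration` seconds."""
    restarts = []
    restarter = DeferredRestarter(min_interval, lambda: restarts.append(1))
    for t in range(0, duration, request_every):
        restarter.request(t)
    return len(restarts)


# 15 minutes of restart requests arriving every 3 seconds:
print("no deferral:", count_restarts(0, 3, 900))
print("5 minute deferral:", count_restarts(300, 3, 900))
```

During a mass restore every restored account can trigger a restart request, so without deferral the webserver spends most of its time restarting instead of serving pages.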

 

S4 is still going to be unhappy until the restores are done but we're down to the last 4 accounts.

Link to comment
Share on other sites

Restores are 100% Completed

 

If your site is offline showing a cPanel error page:

 

  • Try connecting to your cPanel by adding "/cpanel" on to the end of your domain. If you can sign in, this verifies your account was restored.
  • Check to see if you're using our nameservers - if you aren't, you'll need to get your server's IP from cPanel and update the records at your third-party DNS provider.
  • Make sure you're not just reloading the error page - hitting reload while viewing the error just reloads the error page.
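
If you are on third-party DNS and want to confirm where your domain currently points, a small script along these lines can help. It is only a sketch: `example.com` and `203.0.113.10` are placeholders for your own domain and the IP shown in your cPanel, and the function names are made up for this example.

```python
import socket


def resolved_ips(domain, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses that `domain` currently resolves to."""
    _hostname, _aliases, addresses = resolver(domain)
    return set(addresses)


def dns_points_to_server(domain, cpanel_ip, resolver=socket.gethostbyname_ex):
    """True if one of the domain's A records matches the IP shown in cPanel."""
    return cpanel_ip in resolved_ips(domain, resolver)


if __name__ == "__main__":
    domain = "example.com"      # placeholder: your own domain
    cpanel_ip = "203.0.113.10"  # placeholder: the IP shown in your cPanel
    try:
        if dns_points_to_server(domain, cpanel_ip):
            print("DNS already points at the server")
        else:
            print("DNS points elsewhere:", sorted(resolved_ips(domain)))
    except socket.gaierror as exc:
        print("could not resolve", domain, "-", exc)
```

Keep in mind that this asks your local resolver, so a recently changed record may still show the old IP until its cached copy expires.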

If you are not using third party DNS and your site doesn't appear but you can get into cPanel - try clearing your browser cache and restarting your browser. If that doesn't work try another browser. If it loads for you on one browser but not another - that's a caching issue and not a server or network issue.

 

If you are having any issues with your mail client - what we have seen work most often is removing the email account from the client and adding it back. We haven't yet identified why this helps. You can also add "/webmail" to the end of your domain to access your email if your mail client isn't working.

 

We do expect there to be a lot of little issues that we have to resolve so if you have issues and can't sort them please reach out in a ticket.

 

We are doing our best to keep up with support tickets. I am sorry if it takes us longer than normal to reply, but we are answering tickets in the order received and doing our best to fully resolve any issues and to offer good, proper, non-copy-and-pasted advice.


All servers are online and all accounts are restored!

 

We reached out to our storage platform vendor after the incident and have worked with them to take steps to prevent an issue like this from happening again. Changes have also been implemented that will allow us to recover from a catastrophic event such as the one we just experienced in as little as a few minutes, with little to no impact on the services themselves.

 

We are going to be conducting a thorough review of the events leading up to this incident and making changes to our policies and procedures based upon our findings. How the incident was handled is also going to be reviewed and we are going to develop a new comprehensive backup and emergency response plan.

 

If you are still experiencing any issues at all or need help with anything please do not hesitate to reach out to us. We are here to help and will do our best to assist you in recovering from this incident in any way that we can.

 

Thank you,
