ericr Posted September 25, 2018
We are now starting the restore of R3.
s1: Completed
p1: Completed
r1: Completed
p2: Completed
s2: Completed
r2: Completed
s3: In Progress
r3: In Progress
s4: Wednesday, September 26, 2018 at 5:00:00 AM EDT
r4: Completed
s5: Wednesday, September 26, 2018 at 11:00:00 AM EDT
s0: Wednesday, September 26, 2018 at 11:00:00 AM EDT
Michael D. (Author) Posted September 25, 2018
80% restored and back online.
ericr Posted September 25, 2018
Due to an unexpected reboot of the backup server, we are going to have to swap s5 and s4 so that we do not lose the progress already made towards s4's restoration.
Michael D. (Author) Posted September 26, 2018
R3 is about 50% restored. It's taking longer than expected as it has a lot of tiny accounts. S3 is just about done. S5 will be starting soon.
Michael D. (Author) Posted September 26, 2018
The S3 server is fully restored.
Michael D. (Author) Posted September 26, 2018
For those worried that something like this could happen again: we have already enabled snapshots on our storage cluster. We're taking one snapshot every hour and keeping each for 10 hours.

So, hypothetically, let's say that this did manage to happen again. We would simply mount a snapshot taken within the hour before the incident and boot everything back up. Total downtime would be ~5 minutes for the whole network. Would there be any data loss? Possibly anything written in the preceding hour or less, but nothing compared to the losses of a multi-day outage. It would look literally like we just shut everything down and booted it back up. No 'restorations', no lost emails, nothing. There's a great chance almost nobody would even notice.

This is something that our storage vendor, StorPool, set up for us immediately upon seeing what had happened. They actually apologized that it was not already set up and said that, as a result of our disaster, they are going to make it a default behavior that has to be actively disabled rather than the other way around.

Even with these snapshots, as powerful as they are, we are still going to overhaul our backup servers. We have identified the issues with the present setup that caused restorations to be so slow, and fixes for those issues are already planned for once we are fully online and all of our clients are taken care of.

Snapshots are a very powerful tool against data loss and corruption. We actually used them a couple of times on our old storage platform, the Nimble CS500, to recover data on servers when clients made big mistakes themselves.
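To make the snapshot policy in the post above concrete, here is a minimal Python sketch of how an hourly schedule with 10-hour retention bounds the data-loss window. It is only an illustration: the function names, the on-the-hour schedule, and the example timestamps are assumptions, and it does not use or represent StorPool's actual tooling.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(hours=1)  # from the post: one snapshot per hour
RETENTION = timedelta(hours=10)         # from the post: each kept for 10 hours


def retained_snapshots(now, interval=SNAPSHOT_INTERVAL, retention=RETENTION):
    """Return timestamps of the snapshots still retained at `now`, oldest first.

    Assumption: snapshots are taken exactly on the hour.
    """
    latest = now.replace(minute=0, second=0, microsecond=0)
    count = int(retention / interval)
    return [latest - i * interval for i in range(count - 1, -1, -1)]


def pick_recovery_snapshot(incident_time, snapshots):
    """Pick the newest snapshot taken strictly before the incident, or None."""
    candidates = [s for s in snapshots if s < incident_time]
    return max(candidates) if candidates else None


# Hypothetical incident, noticed 40 minutes after it happened.
now = datetime(2018, 9, 26, 14, 20)
incident = datetime(2018, 9, 26, 13, 40)
snapshots = retained_snapshots(now)
chosen = pick_recovery_snapshot(incident, snapshots)
print("mount snapshot from:", chosen)
print("writes lost since snapshot:", incident - chosen)
```

With an hourly schedule the lost-write window is always under one hour, and an incident has to be noticed within the retention period for a clean snapshot to still exist.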
Michael D. (Author) Posted September 26, 2018
R3 is still restoring and is nearly done. We are restoring the S5 server now and preparing the S4 server for restoration.
Michael D. (Author) Posted September 26, 2018
The R3 server is completed. S5 is still restoring. S4 should start restoring in about 3 hours.
Michael D. (Author) Posted September 26, 2018
S5 is completed. S4 is starting soon.
Michael D. (Author) Posted September 26, 2018
My personal estimate is that we'll be fully done restoring data by 8 AM.
Michael D. (Author) Posted September 26, 2018
Once all servers are fully restored, we will be performing a quick reboot of them all. Each reboot should take ~30 seconds. We're doing this to get the systems into a clean, fresh state after all of the massive restorations and data transfers.
ericr Posted September 26, 2018
We are restoring the 5 remaining s0 accounts straight off the old backup server while the s4 backup data is being prepped onto the SSD server. S4 backups will begin in about 15 minutes.
ericr Posted September 26, 2018
I have started the restores of the s4 server.
ericr Posted September 26, 2018
The account restores on S5 are over halfway done. That is a bit slower than we hoped, but we will be done before too long.
ericr Posted September 26, 2018
The restore speed has slowed noticeably; however, it will complete as soon as possible. The issue is that the restores are now competing with active visitors instead of running on an empty server.
ericr Posted September 26, 2018
The last few accounts on s0 have been restored. We have 263 accounts left to restore on S4.
ericr Posted September 26, 2018
We are down to 122 accounts on s4.
Michael D. (Author) Posted September 26, 2018
Because it is the middle of the day and the servers are busy, the S4 server is bogging down under the restorations we are conducting. Once they are done, performance should go back to normal.
Michael D. (Author) Posted September 26, 2018
Identified a setting that needed to be changed: Deferred Webserver Restarts. The web server was restarting on S4 every ~3 seconds. It now waits 5 minutes between restarts, and things are stable and MUCH faster. We replicated this change network wide [after verification by another admin first]. S4 is still going to be unhappy until the restores are done, but we're down to the last 4 accounts.
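The effect of the change described above can be illustrated with a small Python sketch that coalesces restart requests so that restarts happen at most once per minimum interval. This is not cPanel's implementation; the class name, the method names, and the simulated 10-minute workload are hypothetical and exist only to show why spacing restarts out reduces churn.

```python
class DeferredRestarter:
    """Coalesce restart requests so restarts are at least `min_interval` seconds apart.

    Hypothetical illustration only; times are plain numbers of seconds.
    """

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_restart = None
        self.pending = False
        self.restart_count = 0

    def request(self, now):
        """Record that a restart is wanted, then restart if it is allowed at `now`."""
        self.pending = True
        return self.tick(now)

    def tick(self, now):
        """Perform one pending restart if the minimum interval has elapsed."""
        if not self.pending:
            return False
        too_soon = (
            self.last_restart is not None
            and now - self.last_restart < self.min_interval
        )
        if too_soon:
            return False  # keep the request pending for a later tick
        self.last_restart = now
        self.pending = False
        self.restart_count += 1
        return True


def simulate(min_interval, duration=600, request_every=3):
    """Count restarts when a restart is requested every `request_every` seconds."""
    restarter = DeferredRestarter(min_interval)
    for t in range(0, duration, request_every):
        restarter.request(t)
    return restarter.restart_count


print("no deferral:", simulate(min_interval=0))
print("5-minute deferral:", simulate(min_interval=300))
```

In this simulated workload each restored account triggers a restart request, so deferring turns hundreds of restarts into a handful while still applying the latest configuration at the next allowed restart.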
Michael D. (Author) Posted September 26, 2018
Restores are 100% completed.

If your site is offline showing a cPanel error page:
- Try connecting to your cPanel by adding "/cpanel" to the end of your domain. If you can sign in, this verifies your account was restored.
- Check to see if you're using our nameservers. If you aren't, you'll need to get your IP from cPanel and update your third-party DNS.
- Make sure you're not just reloading the error page - hitting reload while viewing the error just reloads the error page.
- If you are not using third-party DNS and your site doesn't appear but you can get into cPanel, try clearing your browser cache and restarting your browser. If that doesn't work, try another browser. If it loads for you on one browser but not another, that's a caching issue and not a server or network issue.

If you are having any issues with your mail client, what we have seen work most often is removing the email account from the client and adding it back. We haven't yet identified what the difference is. You can also add "/webmail" to the end of your domain to access your email if your mail client isn't working.

We do expect there to be a lot of little issues that we have to resolve, so if you have issues and can't sort them out, please reach out in a ticket. We are doing our best to keep up with support tickets. I am sorry if it takes us longer than normal to reply, but we are answering tickets in the order received and doing our best to fully resolve any issues and to offer good, proper, non-copy-and-pasted advice.
Michael D. (Author) Posted September 26, 2018
We are aware that SpamAssassin is either not scoring spam properly or not scoring it at all, and we have a ticket open with cPanel on this. As soon as this is sorted, we'll fix it network wide.
Michael D. (Author) Posted September 27, 2018
The SpamAssassin issue has been resolved.
Michael D. (Author) Posted September 27, 2018
All servers are online and all accounts are restored!

We reached out to our storage platform vendor after the incident, and we have worked with them to take steps to prevent an issue like this from happening again. Changes have also been implemented that will allow us to recover from a catastrophic event such as the one we just experienced within as little as a few minutes, with little to no impact on the services themselves.

We are going to be conducting a thorough review of the events leading up to this incident and making changes to our policies and procedures based upon our findings. How the incident was handled is also going to be reviewed, and we are going to develop a new comprehensive backup and emergency response plan.

If you are still experiencing any issues at all or need help with anything, please do not hesitate to reach out to us. We are here to help and will do our best to assist you in recovering from this incident in any way that we can.

Thank you,