kevnich83

Member Since 25 Sep 2018
Offline Last Active Sep 28 2018 09:15 AM

Posts I've Made

In Topic: Major Outage - 09/21/18+ - Client Discussion

25 September 2018 - 12:39 PM

I am still trying to figure out and fully understand how one command took out 12 physical servers?

 

I get that we make mistakes, but dang, that's a BIG, BIG mistake. That's like changing a radiator with the engine running.

 

 

Honestly, as I stated before, we're all human.  Every tech and engineer has done something very STUPID like this in their career.  I have broken more things than I can remember, but that's simply part of the process of learning.  I can guarantee that he/she will NEVER forget this, and I sincerely hope they're able to forgive themselves, LEARN and move on.

 

Granted, the fact that they were able to run one command with such devastating effects on a production system is a little hard to believe.  I come from an environment where you have a lab with equipment as close to identical to production as you can get.  You run all of your commands there, build a script from them, and only after verifying everything is fine in the lab do you take that script and run it on production.  But again, I don't know their environment.  If I were MDD, I would be contacting the storage vendor to see if there's a way to disable a drop command like that, or to require a confirmation or escalation before it can be run at all.  Humans are humans and are the weakest link in the tech world.  You have to design your hardware and software around that fact, sadly.


In Topic: Major Outage - 09/21/18+ - Client Discussion

25 September 2018 - 12:32 PM

As I understand it, we were all very lucky that you had another (off-site) backup of the data. If it were only the snapshots, everyone would have been up the creek without a paddle. A lot of hosts would simply point to their 'suicide clause' telling you to maintain and rely only on your own backups. You have that clause in your TOS but still went the extra mile.

 

No one wants downtime, and it's never easy, especially when you have clients hooting and hollering at you over something you really cannot control. It happens and it sucks, but frankly this turned into a good learning opportunity.

 

I learned that I had grown complacent with MDD's stellar uptime and general lack of headaches compared to other hosts. I wasn't pulling down backups as often as I should have. That's on me. 

 

Secondly, only some of my clients are on Cloud Flare. All my clients are going to use Cloud Flare in the future. With Cloud Flare I can at the very least point email at another host with minimal downtime until MDD restores. 

 

Current backups + Cloud Flare means I can restore service to another server pretty much as fast as I'm able to upload them.

 

For additional peace of mind I'm going to keep a 'hot' account on another server with email forwarders set up for all clients. This way I can just do an MX switch at Cloud Flare to restore email (the lifeblood of many clients' businesses) ASAP, then work on restoring the websites.

 

Mike and the team still have my vote of confidence, even more so after seeing how they handled this outage. I really should've had this disaster plan in place already.

 

This would be excellent advice for everyone.  However, I'm also going to add that if you're using WHM (I haven't seen the options for cPanel), there's a backup configuration that can be set up to automatically send daily server backups to an Amazon S3 account.  You can then configure lifecycle rules, versioning rules, etc. on the S3 side so you have continuous remote backups of your hosting reseller or VPS account.  I have backups going back a solid 12 months.  These are daily, weekly and monthly.  I then have lifecycle rules that push all backups over 30 days old to Amazon Glacier and expire them after one year.  If you think it takes hours to set this up, it took me a grand total of about 15 minutes.  Those 15 minutes were well spent knowing my backups are transferred off-site every morning at 2am.
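For anyone who wants to see what that lifecycle setup looks like outside the S3 console, here is a minimal sketch in Python that builds a rule in the JSON shape the S3 bucket lifecycle configuration API accepts. The function name, the rule ID and the empty prefix are my own placeholders, not anything WHM or MDD provides. The 30-day and 365-day values come from the setup described above.

```python
import json


def build_lifecycle_rules(prefix="", glacier_after_days=30, expire_after_days=365):
    """Build an S3 lifecycle configuration as a plain dict.

    prefix is a hypothetical key prefix for the backup objects;
    an empty string applies the rule to the whole bucket.
    """
    if expire_after_days <= glacier_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire-backups",  # placeholder rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                # Move objects to Glacier once they are older than 30 days.
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                # Delete objects once they are a year old.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be reviewed before applying it to a bucket.
    print(json.dumps(build_lifecycle_rules(), indent=2))
```

The printed JSON can be saved to a file and applied with the AWS CLI (`aws s3api put-bucket-lifecycle-configuration`) or entered by hand in the console. If versioning is turned on for the bucket, old versions usually need their own expiration rule as well, so check that on your side.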

 

Always, always have your own backups, and use some scripting to automate getting them off-site so you don't have to remember to download your server backups yourself.  At the end of the day, we're all human.
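As a rough idea of what "some scripting" can mean, here is a small Python sketch that picks the newest backup archive in a folder, copies it to a second location and verifies the copy by checksum. Everything specific in it is an assumption on my part: the `*.tar.gz` pattern, the function names, and the idea that the off-site destination is reachable as a directory (for example a mounted remote share). Swap the copy step for whatever upload tool you actually use.

```python
import hashlib
import shutil
from pathlib import Path


def newest_backup(backup_dir, pattern="*.tar.gz"):
    """Return the most recently modified archive in backup_dir, or None."""
    candidates = [p for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(backup_dir, offsite_dir):
    """Copy the newest archive to offsite_dir and verify the copy."""
    source = newest_backup(backup_dir)
    if source is None:
        raise FileNotFoundError(f"no backup archives found in {backup_dir}")
    destination_dir = Path(offsite_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / source.name
    shutil.copy2(source, destination)
    # Refuse to keep a copy that does not match the original.
    if sha256_of(source) != sha256_of(destination):
        destination.unlink()
        raise IOError(f"checksum mismatch copying {source.name}")
    return destination
```

Run from cron once a day, something like this takes the "remember to download" step out of human hands, which is the whole point.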