Major Outage - 09/21/18+ - Client Discussion


419 replies to this topic

#301 mikeofmandan

mikeofmandan

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 25 September 2018 - 11:29 AM

Just wondering if anyone is experiencing a lag with services being accessible/live after server restoration is complete.  I swear I'm not impatient, but I saw r4 was recently restored but it seems like the sites are slowly flickering on, and I'm still unable to log into WHM from the MDD dashboard.  Just wondering if I need to sit tight or open a ticket?  Also, does anyone know if emails during downtime will eventually come through or are they lost for good?


  • 0

#302 chris.holmes

chris.holmes

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 11:30 AM

I would say if the server is up and you are still having problems, log a ticket.


  • 0

#303 CLINECO

CLINECO

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 11:37 AM

 

So the hard lesson for me is to NOT get lazy and rely on any given company, even those with positive reputations, because at the end of the day, I am responsible for my own business.

 

 

I agree. While 90% of my sites are back online in some capacity, I did lose a major portion of a brand new site that hadn't been backed up by MDD since last Wednesday (r1 server), which means I'm out thousands in development fees. Ultimately it is up to me, and I am responsible for taking care of my own data. Though I never thought it would be possible for somebody to fire a single command and nuke MDD's whole system. Especially since "Multiple copies of your data are stored synchronously across multiple storage drives and servers with no single points of failure."

 

I too appreciate MDD's transparency but I'm still having a tough time swallowing this pill. 

 

I would like to see MDD offer a backup to Dropbox or S3 through the WHM panel. 


  • 1

#304 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 11:38 AM

Hi Mike, not to divert the topic, but I'm both curious and concerned about the person responsible for this error. Honestly, the first thing that came to mind when things got a bit serious was how this would have affected the admin psychologically. I don't know... I would have had some kind of nervous breakdown or something if it were me. How's he/she holding up, BTW? I hope he/she is OK.

 

He's been instrumental to the restoration of services.  I just hope that he'll be able to get some rest once this is over.

 

I take it it's not all back to normal, more messing around with r2? Clients' sites are still down and no WHM for me.

 

Open a ticket.

 

Just wondering if anyone is experiencing a lag with services being accessible/live after server restoration is complete.  I swear I'm not impatient, but I saw r4 was recently restored but it seems like the sites are slowly flickering on, and I'm still unable to log into WHM from the MDD dashboard.  Just wondering if I need to sit tight or open a ticket?  Also, does anyone know if emails during downtime will eventually come through or are they lost for good?

 

Ticket please.

 

I would say if the server is up and you are still having problems, log a ticket.

 

Yup.


  • 0
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#305 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 11:41 AM

Though I never thought it would be possible for somebody to fire a single command and nuke MDD's whole system. Especially since "Multiple copies of your data are stored synchronously across multiple storage drives and servers with no single points of failure."


Neither did we - but we learned that the hard way. If we had snapshots configured properly we'd have been able to flip a few switches and bring everything back.  Hard lesson for sure.

 

That said, this happened inside of the servers, which is why it affected all copies. When you write a file to a server, it's stored on 3 separate disks in 3 separate servers. If you erase it, it's erased from all 3. In this case the block discard happened inside of the server, so it discarded all copies.
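To make the replication point concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not MDD's actual storage system: the `ReplicatedStore` class and its method names are invented for illustration. It shows that a discard issued through the normal write path is applied to every replica, while a separate point-in-time snapshot survives it.

```python
import copy


class ReplicatedStore:
    """Toy model: every write and discard is applied to all replicas at once."""

    def __init__(self, replica_count=3):
        # One dict per replica stands in for one disk in one server.
        self.replicas = [dict() for _ in range(replica_count)]
        self.snapshots = []

    def write(self, name, data):
        for replica in self.replicas:
            replica[name] = data

    def discard(self, name):
        # Replication protects against a failed disk, not against a
        # destructive request: the discard is faithfully replicated too.
        for replica in self.replicas:
            replica.pop(name, None)

    def snapshot(self):
        # A point-in-time copy kept outside the live write path.
        self.snapshots.append(copy.deepcopy(self.replicas[0]))

    def restore_latest(self):
        latest = self.snapshots[-1]
        self.replicas = [copy.deepcopy(latest) for _ in self.replicas]


store = ReplicatedStore()
store.write("index.html", "<h1>hello</h1>")
store.snapshot()
store.discard("index.html")
copies_left = sum("index.html" in r for r in store.replicas)
store.restore_latest()
copies_restored = sum("index.html" in r for r in store.replicas)
print(copies_left, copies_restored)
```

The design point is simply that redundancy and history are different things: three synchronous copies give the first, only snapshots or backups give the second.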

 

Another thing to note is that our backup system is supposed to be taking backups every day, so at most you should have lost hours, not days, of data. This is another issue we're going to make sure is resolved. Not only will we have more regular snapshots, perhaps even on an hourly basis, that we can restore from immediately [as in, just hit 'boot' and you're back online just as you were], but we will also make sure our backup systems are tested and audited on a regular basis moving forward.
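One small piece of "tested and audited" can be automated: checking that the newest backup is actually recent. The Python sketch below is only an illustration under assumptions. The function names and the idea of a single directory of backup files are hypothetical, and a freshness check does not prove a backup can be restored.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there are none."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries if entry.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_fresh(backup_dir, max_age_hours=24, now=None):
    """True only if at least one backup exists and it is recent enough."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours
```

Run from a daily scheduled job, a `False` result is the cue to alert someone before days of backups have silently gone missing. A periodic test restore is still needed on top of this.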


  • 1

#306 Stahlrosen

Stahlrosen

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 25 September 2018 - 11:47 AM

Hi Mike, not to divert the topic, but I'm both curious and concerned about the person responsible for this error. Honestly, the first thing that came to mind when things got a bit serious was how this would have affected the admin psychologically. I don't know... I would have had some kind of nervous breakdown or something if it were me. How's he/she holding up, BTW? I hope he/she is OK.

Hoping so too!!

 

We actually do maintain a full backup locally as we change things on the sites, but our upload speeds are SO slow here, it would have taken longer to restore the sites than to wait it out. Ironically my 2015 501c3 site is on s1 and is up and running, and my 2009 business site is on s3. Not complaining, just found it a bit funny. Fortunately my clients have been very understanding.
I think MDD is doing a great recovery job and we definitely won't be moving.

 

If anyone knows, I am curious how to handle the IP change. Since we are on a dedicated IP, I'm not even sure what this would affect. It has been so long since it has changed!


  • 0

#307 Djqueball

Djqueball

    Newbie

  • Members
  • Pip
  • 5 posts

Posted 25 September 2018 - 12:24 PM

I am still trying to fully understand how 1 command took out 12 physical servers.

I get that we make mistakes, but dang, that's a BIG BIG mistake... that's like changing a radiator with the engine running.

 

 

but..as most have stated

 

1. BACK UP YOUR STUFF

2. Have a backup plan yourself in case of a disaster

3. Don't host a money-making business site on a shared server provider; get yourself your own dedicated machine and have backups in place.

4. Manage your own domain/DNS

Don't put all your eggs in 1 basket when you are a live money-making business.

I don't blame MDD; I'd blame myself for not having my own backup plan (which I do have, I'm just stating the obvious).

 

 

 

I hope that MDD, as they have stated, learns from this really well... I've been with MDD for a long time.

This straight up sucks for them as well as everyone else.

 

 


 


  • 0

#308 Avatar

Avatar

    Newbie

  • Members
  • Pip
  • 6 posts

Posted 25 September 2018 - 12:29 PM

Despite our account being restored, we lost all 27 of our MAILMAN email lists. I was told by support that our mailman lists could not be restored. Manually rebuilding these lists took me the better part of 6 hours. Would you please advise me why mailman lists are not backed up and how I might go about backing up our mailman lists in the future?


  • 0

#309 digibread

digibread

    Newbie

  • Members
  • Pip
  • 9 posts

Posted 25 September 2018 - 12:30 PM

 

I'm also getting an HTTP 500 error for our and all of our clients' sites on the R1 server. I've opened a ticket.

For me it turned out to be the UpdraftPlus plugin. The fix was to rename the folder to updraftplus-bac, log in and go to Plugins (you'll see the plugin disabled), go back to the file system and rename it back to updraftplus, delete it from within Plugins, and then upload and install the new version. All my settings were preserved. I hope that helps. Russ
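For anyone who would rather script the folder rename than click through a file manager, here is a minimal Python sketch of that one step. It assumes the standard WordPress layout of wp-content/plugins/ under the site root; the function names and the "-bac" suffix default are just illustrative choices taken from the post above. The remaining steps (logging in, deleting, reinstalling) still happen in the WordPress admin.

```python
from pathlib import Path


def disable_plugin(wp_root, plugin, suffix="-bac"):
    """Rename wp-content/plugins/<plugin> so WordPress no longer loads it."""
    plugin_dir = Path(wp_root) / "wp-content" / "plugins" / plugin
    disabled_dir = plugin_dir.with_name(plugin + suffix)
    if not plugin_dir.is_dir():
        raise FileNotFoundError(f"no such plugin folder: {plugin_dir}")
    if disabled_dir.exists():
        raise FileExistsError(f"already disabled: {disabled_dir}")
    plugin_dir.rename(disabled_dir)
    return disabled_dir


def enable_plugin(wp_root, plugin, suffix="-bac"):
    """Undo disable_plugin by renaming the folder back to its original name."""
    plugin_dir = Path(wp_root) / "wp-content" / "plugins" / plugin
    disabled_dir = plugin_dir.with_name(plugin + suffix)
    disabled_dir.rename(plugin_dir)
    return plugin_dir
```

For example, `disable_plugin("/home/user/public_html", "updraftplus")` would rename the folder to updraftplus-bac; the path shown is a placeholder for your own site root.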


  • 0

#310 kevnich83

kevnich83

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 25 September 2018 - 12:32 PM

As I understand it, we were all very lucky that you had another (off-site) backup of the data. If it were only the snapshots, everyone would have been up the creek without a paddle. A lot of hosts would simply point to their 'suicide clause' telling you to maintain and only rely on your own backups. You have that clause in your TOS but still went the extra mile. 

 

No one wants down time and it's never easy especially when you have clients hooting and hollering at you over something you really cannot control. It happens and it sucks but frankly this turned into a good learning opportunity. 

 

I learned that I had grown complacent with MDD's stellar uptime and general lack of headaches compared to other hosts. I wasn't pulling down backups as often as I should have. That's on me. 

 

Secondly, only some of my clients are on Cloudflare. All my clients are going to use Cloudflare in the future. With Cloudflare I can at the very least point email at another host with minimal downtime until MDD restores.

Current backups + Cloudflare means I can restore service to another server pretty much as fast as I'm able to upload them.

For additional peace of mind I'm going to keep a 'hot' account on another server with email forwarders set up for all clients. This way I can just do an MX switch at Cloudflare to restore email (the lifeblood of many clients' businesses) ASAP, then work on restoring the websites.

 

Mike and the team still have my vote of confidence even more so seeing how they handled this outage. I really should've had this disaster plan in place already. 

 

This would be excellent advice for everyone. However, I'm also going to add that if you're using WHM (I haven't seen the options for cPanel), there's a backup configuration that can automatically send daily server backups to an Amazon S3 account. You can then configure lifecycle rules, versioning rules, etc. on the S3 side so you have continuous remote backups of your hosting reseller or VPS account. I have backups going back a solid 12 months. These are daily, weekly and monthly. I then have lifecycle rules that push all backups over 30 days old to Amazon Glacier and expire them after one year. If you think it takes hours to set this up, it took me a grand total of about 15 minutes. That 15 min was well spent knowing my backups are transferred offsite every morning at 2am.

 

Always, always have your own backups and use some scripting to try and automate the process to off-site your own backups so you don't have to remember to download your own server backups.  At the end of the day, we're all human.
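As a rough illustration of the lifecycle rules described above (move to Glacier after 30 days, expire after one year), here is a Python sketch that builds the rule document. The helper name, rule ID and empty prefix are assumptions made for the example, not the poster's actual settings. The dictionary follows the shape S3 uses for lifecycle configurations, so it could be passed to boto3's `put_bucket_lifecycle_configuration` as the `LifecycleConfiguration` argument, but the sketch itself only prints it and touches no AWS account.

```python
import json


def build_lifecycle_rules(glacier_after_days=30, expire_after_days=365, prefix=""):
    """Build an S3 lifecycle configuration: archive old backups, then expire them."""
    return {
        "Rules": [
            {
                "ID": "archive-then-expire-backups",  # hypothetical rule name
                "Status": "Enabled",
                # An empty prefix applies the rule to every object in the bucket.
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


rules = build_lifecycle_rules()
print(json.dumps(rules, indent=2))
```

Keeping the rule in code or a JSON file also means it can be reviewed and reapplied if the bucket ever has to be recreated.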


  • 0

#311 kevnich83

kevnich83

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 25 September 2018 - 12:39 PM

I am still trying to fully understand how 1 command took out 12 physical servers.

I get that we make mistakes, but dang, that's a BIG BIG mistake... that's like changing a radiator with the engine running.

 

 

Honestly, as I stated before, we're all human. Every tech and engineer has done something very STUPID like this in their career. I have broken more things than I can remember, but it's simply the process of learning. I can guarantee that he/she will NEVER forget this, and I sincerely hope they're able to forgive themselves, LEARN and move on.

Granted, the fact that they were able to run one command with such devastating effects on a production system is a little hard to believe. I come from an environment where you have a lab with equipment as close to identical as you can get, run all of your commands there, build a script from them, and only after verifying everything is fine in the lab do you take that script and run it on production. But again, I don't know their environment. If I were MDD, I would be contacting the storage vendor to see if there's a way to disable a drop command like that, or to require a confirmation or escalation before it can even be run. Humans are humans and are the weakest link in the tech world. You have to program your hardware and software around that fact, sadly.


  • 0

#312 sf2099

sf2099

    Newbie

  • Members
  • Pip
  • 6 posts

Posted 25 September 2018 - 12:47 PM

 

If I were MDD, I would be contacting the storage vendor to see if there's a way to disable a drop command like that, or to require a confirmation or escalation before it can even be run.

 

Excellent point. 

 

I would go even one step further and explore whether snapshots can be managed by an admin who does *not* have the privilege to issue discard commands.


  • 1

#313 Kevin

Kevin

    Newbie

  • Members
  • Pip
  • 15 posts

Posted 25 September 2018 - 12:47 PM

 

This would be excellent advice for everyone. However, I'm also going to add that if you're using WHM (I haven't seen the options for cPanel), there's a backup configuration that can automatically send daily server backups to an Amazon S3 account. You can then configure lifecycle rules, versioning rules, etc. on the S3 side so you have continuous remote backups of your hosting reseller or VPS account. I have backups going back a solid 12 months. These are daily, weekly and monthly. I then have lifecycle rules that push all backups over 30 days old to Amazon Glacier and expire them after one year. If you think it takes hours to set this up, it took me a grand total of about 15 minutes. That 15 min was well spent knowing my backups are transferred offsite every morning at 2am.

 

Always, always have your own backups and use some scripting to try and automate the process to off-site your own backups so you don't have to remember to download your own server backups.  At the end of the day, we're all human.

 

Are you on a VPS? I've seen this in WHM on various VPSs, but I don't see those options in WHM as a reseller. Would be great to be able to take advantage of those backup options there.


  • 1

#314 djMot

djMot

    Newbie

  • Members
  • Pip
  • 11 posts
  • Gender:Male

Posted 25 September 2018 - 12:57 PM

Question regarding spam filtering in email - I'm getting WAY more spam in at least one of my email accounts than before we crashed and burned.  

 

What is the status of spam filtering?  Is that a service that needs to be retrained, tweaked, or even brought back online?


  • 0

#315 CLINECO

CLINECO

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 01:01 PM

 

Are you on a VPS? I've seen this in WHM on various VPSs, but I don't see those options in WHM as a reseller. Would be great to be able to take advantage of those backup options there.

I'd also like to know this...


  • 0

#316 sputnik

sputnik

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 25 September 2018 - 01:10 PM

 

I would like to see MDD offer a backup to Dropbox or S3 through the WHM panel. 

 

I'd like to see that too.


  • 0

#317 moorejames

moorejames

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 25 September 2018 - 01:18 PM

R1 - reseller account

 

Anyone else having the issue of WordPress sites giving the dreaded White Screen of Death HTTP ERROR 500?

I have tried all the usual.

1. "php_value display_errors on" in .htaccess with no errors displayed

2. renamed .htaccess to .htaccess-bac

3. increased memory limit with define('WP_MEMORY_LIMIT', '64M');

4. disabled plugins by renaming them on the file system.

I guess I'm hoping someone might have experienced this and has solved it when coming online earlier.

Thanks for any help.

Russ

 

Hey Russ, 

 

If you're using MainWP, try renaming the plugin folders for mainWP child and MainWP child reports. They tracked that down to being the issue my sites were having. 

 

jim


  • 0

#318 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,893 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 01:38 PM

I am still trying to fully understand how 1 command took out 12 physical servers.

I get that we make mistakes, but dang, that's a BIG BIG mistake... that's like changing a radiator with the engine running.

 

 

but..as most have stated

 

1. BACK UP YOUR STUFF

2. Have a backup plan yourself in case of a disaster

3. Don't host a money-making business site on a shared server provider; get yourself your own dedicated machine and have backups in place.

4. Manage your own domain/DNS

Don't put all your eggs in 1 basket when you are a live money-making business.

I don't blame MDD; I'd blame myself for not having my own backup plan (which I do have, I'm just stating the obvious).

 

 

 

I hope that MDD, as they have stated, learns from this really well... I've been with MDD for a long time.

This straight up sucks for them as well as everyone else.

 

 

 

Rather than signing into each server to run "fstrim" it was executed via parallel access.  Since an incorrect and destructive command was sent that we hadn't filtered for - it got run on everything.
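The kind of filtering mentioned here can be sketched in a few lines of Python. This is an illustration under assumptions, not the tooling MDD uses: the denylist entries are examples of destructive utilities chosen for the sketch (the thread does not name the exact command that was sent), and `plan_fanout` and the host names are hypothetical. The check deliberately looks at every token so that wrappers such as sudo do not hide the real program.

```python
import shlex

# Illustrative denylist only; a real deployment would maintain its own.
DENIED_COMMANDS = {"blkdiscard", "mkfs", "wipefs", "dd"}


def is_allowed(command_line):
    """True if the command is non-empty and no token names a denied program."""
    tokens = shlex.split(command_line)
    for token in tokens:
        # Compare on the basename so /sbin/blkdiscard is caught as well.
        program = token.rsplit("/", 1)[-1]
        if program in DENIED_COMMANDS:
            return False
    return bool(tokens)


def plan_fanout(command_line, hosts):
    """Return (host, command) pairs, refusing before anything is dispatched."""
    if not is_allowed(command_line):
        raise PermissionError(f"refusing to fan out: {command_line!r}")
    return [(host, command_line) for host in hosts]
```

With this in place, `plan_fanout("fstrim -av", ["r1", "r2"])` yields a plan for both hosts, while a denied command raises before any server is contacted. A denylist only blocks what someone thought to list, so an allowlist of approved maintenance commands may be the safer design.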

 

Despite our account being restored, we lost all 27 of our MAILMAN email lists. I was told by support that our mailman lists could not be restored. Manually rebuilding these lists took me the better part of 6 hours. Would you please advise me why mailman lists are not backed up and how I might go about backing up our mailman lists in the future?

 

Normal cPanel backups [full cPanel backup] contain them. JetBackup, by default, does not back up MailMan lists. This is something I will be talking to them about when I get a chance. It's absurd that it wouldn't be included by default. It's not like MailMan uses a ton of space or anything.

 

For me it turned out to be the UpdraftPlus plugin. The fix was to rename the folder to updraftplus-bac, log in and go to Plugins (you'll see the plugin disabled), go back to the file system and rename it back to updraftplus, delete it from within Plugins, and then upload and install the new version. All my settings were preserved. I hope that helps. Russ

 

We've seen a few plugins that caused issues / didn't restore properly and mostly it was due to those plugins changing their own permissions in such a way the backup system couldn't create a copy.  Why? I don't know.

 

 

Honestly, as I stated before, we're all human. Every tech and engineer has done something very STUPID like this in their career. I have broken more things than I can remember, but it's simply the process of learning. I can guarantee that he/she will NEVER forget this, and I sincerely hope they're able to forgive themselves, LEARN and move on.

Granted, the fact that they were able to run one command with such devastating effects on a production system is a little hard to believe. I come from an environment where you have a lab with equipment as close to identical as you can get, run all of your commands there, build a script from them, and only after verifying everything is fine in the lab do you take that script and run it on production. But again, I don't know their environment. If I were MDD, I would be contacting the storage vendor to see if there's a way to disable a drop command like that, or to require a confirmation or escalation before it can even be run. Humans are humans and are the weakest link in the tech world. You have to program your hardware and software around that fact, sadly.

 

The command that was run is already blacklisted - and we've already put two systems in place to help prevent that sort of thing as well as allowing us to recover in minutes, not days, should catastrophe ever happen to strike a second time.  We're doing what we can to prevent this from being possible again and making sure that if it does - that we can recover much more quickly.

 

Question regarding spam filtering in email - I'm getting WAY more spam in at least one of my email accounts than before we crashed and burned.  

 

What is the status of spam filtering?  Is that a service that needs to be retrained, tweaked, or even brought back online?

 

It's probable that your MX records got reset on restore - do please open a ticket.

 

 

I'd like to see that too.

 

We're going to look into it once we get a chance - if there's a plugin/addon for cPanel that makes it possible we'll see about getting it installed.  Just depends on licensing / security.

 

 

Hey Russ, 

 

If you're using MainWP, try renaming the plugin folders for mainWP child and MainWP child reports. They tracked that down to being the issue my sites were having. 

 

jim

 

I think we traced your issue down to ultimately being that your site needed PHP 7 while the system put you back on 5.6.


  • 0

#319 moorejames

moorejames

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 25 September 2018 - 01:49 PM

 

 

I think we traced your issue down to ultimately being that your site needed PHP 7 while the system put you back on 5.6.

 

That was the issue for our "admin" site. All our client sites had issues with MainWP. Renaming those plugin folders in cPanel File Manager brought the rest of our sites back online. (and I'm pretty sure it was MainWP that required PHP 7 on the admin site.....) 

 

Just trying to help... I would have much preferred being able to fix the issue myself without having to bother you guys.. you got a lot on your plate right now. 


  • 0

#320 chris.holmes

chris.holmes

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 01:59 PM

Once my sites (s3) come back up, should I be able to immediately access them through my FTP client or cPanel? Just looking for a more surefire way to tell if I'm up and running other than hitting F5 on the sites all day.


  • 0



