Major Outage - 09/21/18+ - Client Discussion


419 replies to this topic

#281 digibread

digibread

    Newbie

  • Members
  • Pip
  • 9 posts

Posted 25 September 2018 - 08:10 AM

R1 - reseller account

 

Is anyone else having the issue of WordPress sites giving the dreaded White Screen of Death (HTTP ERROR 500)?

I have tried all the usual.

1. "php_value display_errors on" in .htaccess with no errors displayed

2. renamed .htaccess to .htaccess-bac

3. increased memory limit with define('WP_MEMORY_LIMIT', '64M');

4. disabled plugins by renaming them on the file system.

I guess I'm hoping someone might have experienced this and has solved it when coming online earlier.

Thanks for any help.

Russ


  • 0

#282 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 08:28 AM

Hello all,

This community and forum have been really helpful during the outage.

 

I see r4 is restored.  Most of my sites are working except a few where the IP has changed.  Aside from that, I had two sites that I was developing on http://r4.temporary-...ss.com/~example, which don't seem to be working. 

Any idea why that would be?  Anyone else experience this?

We haven't yet repaired `temporary-access.com`.  We will do so when we have some time to focus on it.

 

my sites with s2 server are up, though it looks like the server ip address was changed from 173.248.187.43 to xxx.xxx.xxx.16

 

all my email account passwords are not working so i will need to check on those to see why.

 

although I have the urge to do a full cpanel and site backups, I am going to wait so that I don't negatively impact the server.

 

good luck everyone

 

Feel free to generate cPanel backups - it's not a bad idea.  Our storage platform is robust enough that it won't cause any issues.

 

R1 - reseller account

 

Is anyone else having the issue of WordPress sites giving the dreaded White Screen of Death (HTTP ERROR 500)?

I have tried all the usual.

1. "php_value display_errors on" in .htaccess with no errors displayed

2. renamed .htaccess to .htaccess-bac

3. increased memory limit with define('WP_MEMORY_LIMIT', '64M');

4. disabled plugins by renaming them on the file system.

I guess I'm hoping someone might have experienced this and has solved it when coming online earlier.

Thanks for any help.

Russ

We've seen a few instances of strangeness like this.  Most of them were due to the PHP version being set incorrectly or a missing file.

 

If you can't sort it out - open a ticket and we'll do what we can to help.  We are swamped so it may take some time - but we will definitely help as much as we can.
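
For anyone checking their own account before opening a ticket, here is a minimal Python sketch that lists the `.htaccess` lines most worth a manual look. The directive list is an illustrative assumption, not an official checklist: `php_value` and `php_flag` lines are only understood when PHP runs as an Apache module and typically produce a 500 on their own under CGI or FastCGI handlers, and on some setups the PHP version is selected by an `AddHandler` or `SetHandler` line. The function name is hypothetical.

```python
import re
from pathlib import Path

# Directives that are worth a manual look when a restored site returns HTTP 500.
# This list is an illustrative assumption, not an exhaustive or official one.
SUSPECT = re.compile(r"^\s*(php_value|php_flag|AddHandler|SetHandler)\b", re.IGNORECASE)


def find_suspect_directives(htaccess_path):
    """Return (line_number, text) pairs for lines matching SUSPECT.

    Returns an empty list if the file does not exist.
    """
    path = Path(htaccess_path)
    if not path.is_file():
        return []
    hits = []
    lines = path.read_text(errors="replace").splitlines()
    for number, line in enumerate(lines, start=1):
        if SUSPECT.match(line):
            hits.append((number, line.strip()))
    return hits
```

Calling `find_suspect_directives("public_html/.htaccess")` only reports lines; it changes nothing. Whether a reported line is actually the cause still has to be confirmed against the server's error log.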


  • 0
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#283 moorejames

moorejames

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 25 September 2018 - 08:48 AM

R1 - reseller account

 

Is anyone else having the issue of WordPress sites giving the dreaded White Screen of Death (HTTP ERROR 500)?

 

I guess I'm hoping someone might have experienced this and has solved it when coming online earlier.

Thanks for any help.

Russ

 

I'm also getting an HTTP 500 error for our own sites and all of our clients' sites on the R1 server. I've opened a ticket.


  • 0

#284 Rhody401

Rhody401

    Newbie

  • Members
  • Pip
  • 12 posts
  • Gender:Male
  • Location:Providence, RI USA
  • Interests:Consultant, Contractor, Beta Tester, IT Director.

Posted 25 September 2018 - 09:31 AM

Just wanted to say thanks again.  ALL of my reseller accounts on R1 are back and working 100%, including a vBulletin forum and several WordPress sites.  The restore was from last Tuesday afternoon, but that only matters for the forum (and I auto back it up twice daily to my home server, so that's not really an issue).

 

NO problems here.

 

We had a similar problem back on 9/11/01, when our previous data center/isp was knocked out by the falling twin towers.  They were down for more than a month.  

 

Great work, team.  After this, you may need to sleep for a week.


  • 1

#285 Brad

Brad

    Member

  • Members
  • PipPip
  • 29 posts

Posted 25 September 2018 - 09:46 AM

Managed to get my S5 server website back online yesterday by moving to another host a couple days ago.  Had I moved when my gut told me to, as soon as all this happened, I could have been back online already on Sunday.  Lesson learned...  When things this serious go wrong, grab another host for a month right away and take shelter.  Or in my case...  Move.


  • 0

#286 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 09:49 AM

Managed to get my S5 server website back online yesterday by moving to another host a couple days ago.  Had I moved when my gut told me to, as soon as all this happened, I could have been back online already on Sunday.  Lesson learned...  When things this serious go wrong, grab another host for a month right away and take shelter.  Or in my case...  Move.

If you could have moved - then we could have re-created your account and you could have restored in-place.  Whatever you moved to another provider could have been quickly brought back online with us.

 

The server has been online since Friday - the only thing we've been working on is copying our backup data over.  Any clients that have their own backups have been online for days already.


  • 3

#287 fifers

fifers

    Newbie

  • Members
  • Pip
  • 1 posts

Posted 25 September 2018 - 10:05 AM

Thank you for all of your hard work these past few days. My five sites on S2 are back up and running perfectly again. And rather than even thinking of leaving, I plan on upgrading my account as soon as everything settles down. Hope you all can get some well-earned rest soon.


  • 0

#288 Brad

Brad

    Member

  • Members
  • PipPip
  • 29 posts

Posted 25 September 2018 - 10:06 AM

Absolutely, I considered that but given the chaos and uncertainty that I was experiencing I just felt more secure on another provider.


  • 0

#289 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 10:07 AM

Absolutely, I considered that but given the chaos and uncertainty that I was experiencing I just felt more secure on another provider.

It's your decision, ultimately.  I wish you the best.


  • 0

#290 sf2099

sf2099

    Newbie

  • Members
  • Pip
  • 6 posts

Posted 25 September 2018 - 10:08 AM

Just wanted to say thanks again.  ALL of my reseller accounts on R1 are back and working 100%, including a vBulletin forum and several WordPress sites.  The restore was from last Tuesday afternoon, but that only matters for the forum (and I auto back it up twice daily to my home server, so that's not really an issue).

 

NO problems here.

 

We had a similar problem back on 9/11/01, when our previous data center/isp was knocked out by the falling twin towers.  They were down for more than a month.  

 

Great work, team.  After this, you may need to sleep for a week.

 

 

Having gone through 9-11 working in tech for a large financial firm, I disagree that this event is remotely comparable to what happened back then.

 

9-11 was an externally driven event; what happened here is self-inflicted. 

 

Setting aside the critical human error (because we all make mistakes), the gap between the advertised backup / redundancy guarantees and the actual recovery effort is too wide, in my opinion.

 

I agree that MDD deserves some kudos for being transparent as it struggled to recover, but I am not ready to write any congratulatory letters: I have two well-paying customers who are angry and ready to fire me, and I have to spend my own time and money to fix this fiasco.

 

So the hard lesson for me is to NOT get lazy and rely on any given company, even those with positive reputations, because at the end of the day, I am responsible for my own business.

 

For those of you impacted, I implore you to:

 

1.  Move any/all domains out to a separate registrar so that you can control your DNS

2.  Regularly back up everything and store the backups in both cloud and local storage

3.  Pay for an alternative hosting company as a standby

4.  Regularly restore backups with standby hosting and test!
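
As a rough illustration of points 2 and 4, here is a minimal Python sketch that archives a local copy of a site, records a SHA-256 checksum, and later checks the archive by test-extracting it into a scratch directory. The function names and the directory layout are hypothetical, and it assumes the site files have already been downloaded to local disk; it does not cover databases or email.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup(site_dir, archive_path):
    """Write a gzip-compressed tar of site_dir and return its SHA-256.

    archive_path should live outside site_dir so the archive
    does not end up containing itself.
    """
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(site_dir, arcname=".")
    return sha256_of(archive_path)


def verify_backup(archive_path, expected_sha256):
    """Check the checksum, then test-extract into a scratch directory.

    Returns the number of regular files restored. Only use this on
    archives you created yourself, since extraction trusts the archive.
    """
    if sha256_of(archive_path) != expected_sha256:
        raise ValueError("checksum mismatch: archive changed or is corrupt")
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        return sum(1 for p in Path(scratch).rglob("*") if p.is_file())
```

A test extraction like this only proves the archive is readable. It is not a substitute for actually restoring to the standby host and loading the site.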

 

Good luck


  • 0

#291 chris.holmes

chris.holmes

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 10:10 AM

I'm on S3, so from a timing perspective when might I reasonably expect to see my sites restored? The status thread says restoration will start soon but I'm not sure what that means.


  • 0

#292 Avatar

Avatar

    Newbie

  • Members
  • Pip
  • 6 posts

Posted 25 September 2018 - 10:29 AM

I missed the notification of what happened; what caused this catastrophic failure?


  • 0

#293 Maal

Maal

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 10:31 AM

MDD helped me set up a new account, and the only thing I had to do was copy my own backup over using FTP and set up the email accounts. That was it.

 

Everything is online again. Thanks, everybody, for being helpful!

 

The question still remains why Cloudflare was not serving a cached version, since these were static sites. Can someone answer this?


  • 0

#294 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 10:32 AM

I'm going to elaborate on what is taking so long in this process.

 

I'll explain the process a little bit so that you can understand what is going on and why.

  1. The backup server we were backing up to in Phoenix is having issues and is not able to send data very quickly. The data rate is between 100 and 500 megabytes/second for a single stream, and as soon as we start more than one stream they all drop to 1~5 megabytes/second.
  2. Due to 1 above - we can't do restorations directly from the backup server at any reasonable rate - it's too slow.
  3. Because a single stream is significantly faster, we copy one stream at a time.  That is still much slower than we'd like, and it is the bottleneck we are dealing with.

 

The process is as follows:

 

  1. Export server backup from Slow Backup server to 4x1.6TB SSD Array.  This takes between 4 and 8 hours per server.  We can't do more than one at a time - or it'll take weeks.
  2. Once export is done, take 4x1.6TB SSDs out of Slow Backup and physically move them to "Fast Backup" - a new server we brought online to perform restorations very quickly.
  3. Once 4x1.6TB SSDs are in Fast Backup we are restoring that backup image to an array of 24x1.6TB SSDs on Hardware Raid.
  4. We then begin 30 to 50 simultaneous restoration processes from this new backup server directly to the server.  This results in the server being fully restored within an hour or two for most from the time we start restoring.

The hold-up is the step where we're copying the data to the SSDs and moving it to the fast server - once it's moved over the process is extremely fast.
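
To make the single-stream versus multi-stream trade-off concrete, here is a small Python sketch of the arithmetic. The 3 TB image size, the 200 megabytes/second single-stream rate, and the choice of ten parallel streams are hypothetical figures picked only for illustration; the real image sizes are not stated in this thread.

```python
def transfer_hours(size_tb, rate_mb_per_s):
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes/second.

    Uses decimal units: 1 TB = 1,000,000 MB.
    """
    size_mb = size_tb * 1_000_000
    return size_mb / rate_mb_per_s / 3600


# Hypothetical server image size, for illustration only.
IMAGE_TB = 3.0

# One stream at an assumed 200 MB/s (inside the 100 to 500 MB/s range quoted above).
single_stream = transfer_hours(IMAGE_TB, 200)

# Ten parallel streams that each collapse to 5 MB/s, so 50 MB/s in aggregate.
ten_streams = transfer_hours(IMAGE_TB, 10 * 5)

print(f"single stream: {single_stream:.1f} h")  # single stream: 4.2 h
print(f"ten streams:   {ten_streams:.1f} h")    # ten streams:   16.7 h
```

Under these assumed numbers, one fast stream finishes roughly four times sooner than ten degraded ones, which is why running the exports one at a time is the quicker option here.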

Here's an image from the transfer process of one of the server images to give you an idea: 2018-09-25_11-24-57.png

If anybody has a backup of their own - we can use that to get you online immediately.  We do have several dozen clients that provided backups in the beginning and were back online on day one.  I will be honest in saying that it has surprised me how few of our clients have their own backups.  I do my best to preach how important having your own backups is, no matter who your provider is or what they promise - even us - but it doesn't seem like anybody listens to this advice until it bites them in the rear.

 

We have learned a lot from this incident.  A few things we weren't doing that we should be doing, a few things we should do differently, and a few things we shouldn't do.

 

One of the changes is snapshotting on our storage platform - which would have allowed us to restore services within minutes.  Another is having more than one backup server and testing both the backup and restoration rates on a regular basis so we can resolve any slowness / issues when we aren't in need of the data from those systems.
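
A regular rate test like the one described can be as simple as timing a copy and comparing the result with a floor. The Python sketch below is one possible shape for such a check, not a description of our actual tooling; the function names and the threshold are hypothetical.

```python
import shutil
import time
from pathlib import Path


def measure_copy_rate(source, destination):
    """Copy source to destination and return the observed rate in megabytes/second."""
    size_bytes = Path(source).stat().st_size
    started = time.perf_counter()
    shutil.copyfile(source, destination)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - started, 1e-9)
    return size_bytes / 1_000_000 / elapsed


def is_fast_enough(rate_mb_per_s, minimum_mb_per_s):
    """Return True if the measured rate meets the chosen floor."""
    return rate_mb_per_s >= minimum_mb_per_s
```

For the number to mean anything, the source should be a large file read from the backup system itself, since a small local file is likely to be served from cache. It is also worth repeating the run with several copies in parallel, because that is the case that degraded here.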

 

There are a lot of internal policies and procedures we're changing as well.  Once we're back online and our clients' issues are resolved we'll detail all of this but for now we're focusing on restoration.


  • 0

#295 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 10:33 AM

I'm on S3, so from a timing perspective when might I reasonably expect to see my sites restored? The status thread says restoration will start soon but I'm not sure what that means.

What that means is we're waiting on the facility to move the SSDs from the Slow Backup server to the Fast Backup server so we can start restoring the image and then restoring data.  I'd estimate 1 hour before we start restoring but that's just my estimate.  This would be step 2 in the process listed in my post just prior to this one.


  • 0

#296 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 25 September 2018 - 10:33 AM

I missed the notification of what happened; what caused this catastrophic failure?

Sadly - human error - and then a chain of unfortunate circumstances got us to where we ended up. A chain that we will be breaking in numerous places so that this cannot happen again.


  • 0

#297 Big Dan

Big Dan

    Member

  • Members
  • PipPip
  • 46 posts
  • Gender:Male

Posted 25 September 2018 - 10:52 AM

As I understand it, we were all very lucky that you had another (off-site) backup of the data. If it had been only the snapshots, everyone would have been up the creek without a paddle. A lot of hosts would simply point to their 'suicide clause' telling you to maintain, and rely only on, your own backups. You have that clause in your TOS but still went the extra mile. 

 

No one wants down time and it's never easy especially when you have clients hooting and hollering at you over something you really cannot control. It happens and it sucks but frankly this turned into a good learning opportunity. 

 

I learned that I had grown complacent with MDD's stellar uptime and general lack of headaches compared to other hosts. I wasn't pulling down backups as often as I should have. That's on me. 

 

Secondly, only some of my clients are on Cloudflare. All my clients are going to use Cloudflare in the future. With Cloudflare I can at the very least point email at another host with minimal downtime until MDD restores. 

 

Current backups + Cloudflare means I can restore service to another server pretty much as fast as I'm able to upload them.

 

For additional peace of mind I'm going to keep a 'hot' account on another server with email forwarders set up for all clients. This way I can just do an MX switch at Cloudflare to restore email (the lifeblood of many clients' businesses) ASAP, then work on restoring the websites. 

 

Mike and the team still have my vote of confidence even more so seeing how they handled this outage. I really should've had this disaster plan in place already. 


  • 3

#298 SarisIsop

SarisIsop

    Advancing Member

  • Members
  • PipPipPip
  • 155 posts
  • Gender:Not Telling

Posted 25 September 2018 - 10:55 AM


 

Mike and the team still have my vote of confidence even more so seeing how they handled this outage. I really should've had this disaster plan in place already. 

 

Same here.


  • 0

#299 npad69

npad69

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 11:19 AM

Sadly - human error - and then a chain of unfortunate circumstances got us to where we ended up. A chain that we will be breaking in numerous places so that this cannot happen again.

Hi Mike, not to divert the topic, but I'm both curious and concerned about the person responsible for this error. Honestly, the first thing that came to my mind when things got a bit serious was how this would have affected that admin psychologically. I don't know... I would have had some kind of nervous breakdown or something if it were me. How's he/she holding up, BTW? I hope he/she is OK.


  • 3

#300 NiKEUS

NiKEUS

    Newbie

  • Members
  • Pip
  • 4 posts

Posted 25 September 2018 - 11:27 AM

I take it it's not all back to normal, with more messing around with r2? Clients' sites are still down and no WHM for me 🙄
  • 0



