MDDHosting Forums

Major Outage - 09/21/18+ - Client Discussion


KevinD872


I am on S2, and this morning around 6:00am Central I found I'm back up.

 

Email is showing some curious backtracking, with messages from quite a long time ago re-arriving. I use Outlook in POP3 mode, so nothing is supposed to be left on the server once it's fetched. Other than that, there's the expected lack of new mail. The hardest part of this will now be verifying that bona fide senders who have me on a list haven't dropped me after around 72 hours of undeliverability.

 

Thanks for getting me back up and running.

 

 

Just want to throw out there that I am on R1 with around 15 sites, and everything looks good after restoration. I need to check a bit deeper but so far so good!

 

 

It seems that my first plan on S2 is OK; I don't see any (obvious) problems.

 

Web and email are all OK.

 

This is comforting to hear for those of us still waiting at the end of the list.

 

Thank you.


Michael & MDD Gang,

 

Appreciate your hard work in getting us back online. I've been with you folks for many years, and even the smallest issues (usually user errors) have always been addressed quickly and professionally. While this was a bit painful for me (and all of us), I know it will make you a better company, and I'll still be a customer after this is all behind us.


Hello all,

This community and forum have been really helpful during the outage.

 

I see R4 is restored. Most of my sites are working except a few where the IP has changed. Aside from that, I had two sites that I was developing on http://r4.temporary-access.com/~example, which don't seem to be working.

Any idea why that would be? Anyone else experience this?


My sites on the S2 server are up, though it looks like the server IP address was changed from 173.248.187.43 to xxx.xxx.xxx.16.

 

None of my email account passwords are working, so I will need to check on those to see why.

 

Although I have the urge to do full cPanel and site backups, I am going to wait so that I don't negatively impact the server.

 

Good luck, everyone.


My sites on the S2 server are up, though it looks like the server IP address was changed from 173.248.187.43 to xxx.xxx.xxx.16.

I'm on S2 also.

 

S2 was on 173.248.191.166, if I'm not mistaken, and it's on 173.248.187.16 now.


R1 - reseller account

 

Anyone else having the issue of WordPress sites giving the dreaded White Screen of Death HTTP ERROR 500?

I have tried all the usual steps (a debug-logging sketch follows the list):

1. "php_value display_errors on" in .htaccess with no errors displayed

2. renamed .htaccess to .htaccess-bac

3. increased memory limit with define('WP_MEMORY_LIMIT', '64M');

4. disabled plugins by renaming them on the file system.
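
A minimal additional check, assuming a standard WordPress install (WP_DEBUG, WP_DEBUG_LOG, and WP_DEBUG_DISPLAY are stock WordPress constants): enabling the built-in debug log in wp-config.php writes the underlying fatal error to wp-content/debug.log even when display_errors shows nothing.

```php
<?php
// Add to wp-config.php, above the "/* That's all, stop editing! */" line.
// Stock WordPress debug constants - remove them again once the 500 is solved.
define( 'WP_DEBUG', true );          // turn on debug mode
define( 'WP_DEBUG_LOG', true );      // log errors to wp-content/debug.log
define( 'WP_DEBUG_DISPLAY', false ); // keep errors out of visitors' browsers
```

A fatal "failed to open stream" or "class not found" entry in that log usually points straight at a missing file or an incompatible PHP version.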

I guess I'm hoping someone who came back online earlier has run into this and already solved it.

Thanks for any help.

Russ


I had two sites that I was developing on http://r4.temporary-access.com/~example, which don't seem to be working.

We haven't yet repaired `temporary-access.com`. We will do so when we have some time to focus on it.

 

Although I have the urge to do full cPanel and site backups, I am going to wait so that I don't negatively impact the server.

 

Feel free to generate cPanel backups - it's not a bad idea. Our storage platform is robust enough that it won't cause any issues.

 

Anyone else having the issue of WordPress sites giving the dreaded White Screen of Death HTTP ERROR 500?

We've seen a few instances of strangeness like this. Most of them were due to the PHP version being set wrong or a missing file.

 

If you can't sort it out - open a ticket and we'll do what we can to help. We are swamped so it may take some time - but we will definitely help as much as we can.
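
For the "PHP version set wrong" case, a throwaway script like the sketch below can confirm what the restored account is actually running (the filename is just an example - drop it in the affected site's document root, load it in a browser, then delete it):

```php
<?php
// Example only: save as check-php.php in the site's docroot, browse to it, then delete it.
// Reports the PHP version and a few extensions WordPress commonly needs,
// so a wrong PHP version or a missing module stands out quickly.
header( 'Content-Type: text/plain' );

echo 'PHP version: ' . PHP_VERSION . "\n";

foreach ( array( 'mysqli', 'curl', 'gd', 'mbstring', 'json' ) as $ext ) {
    echo str_pad( $ext, 10 ) . ( extension_loaded( $ext ) ? 'loaded' : 'MISSING' ) . "\n";
}
```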


Anyone else having the issue of WordPress sites giving the dreaded White Screen of Death HTTP ERROR 500?

 

I'm also getting an HTTP 500 error for our site and all of our clients' sites on the R1 server. I've opened a ticket.


Just wanted to say thanks again. ALL of my reseller accounts on R1 are back and working 100%, including a vBulletin forum and several WordPress sites. The restore was from last Tuesday afternoon, but that only matters for the forum (and I automatically back it up twice daily to my home server, so that's not really an issue).

 

NO problems here.

 

We had a similar problem back on 9/11/01, when our previous data center/ISP was knocked out by the falling Twin Towers. They were down for more than a month.

 

Great work, team. After this, you may need to sleep for a week.


Managed to get my S5 server website back online yesterday by moving to another host a couple days ago. Had I moved when my gut told me to, as soon as all this happened, I could have been back online already on Sunday. Lesson learned... When things this serious go wrong, grab another host for a month right away and take shelter. Or in my case... Move.



If you could have moved - then we could have re-created your account and you could have restored in-place. Whatever you moved to another provider could have been quickly brought back online with us.

 

The server has been online since Friday - the only thing we've been working on is copying our backup data over. Any clients that have their own backups have been online for days already.


We had a similar problem back on 9/11/01, when our previous data center/ISP was knocked out by the falling Twin Towers. They were down for more than a month.

 

 

Having gone through 9-11 working in tech for a large financial firm, I disagree that this event is remotely comparable to what happened back then.

 

9-11 was an externally driven event; what happened here is self-inflicted.

 

Leaving aside the critical human error (because we all make mistakes), the gap between the advertised backup/redundancy guarantees and the actual recovery effort is far too wide, in my opinion.

 

I agree that MDD deserves some kudos for being transparent as it struggled to recover, but I am not ready to write any congratulatory letters: I have two well-paying customers who are angry and ready to fire me, and I have to spend my own time and money to fix this fiasco.

 

So the hard lesson for me is to NOT get lazy and rely on any given company, even those with positive reputations, because at the end of the day, I am responsible for my own business.

 

For those of you impacted, I implore you to:

 

1. Move any/all domains out to a separate registrar so that you can control your DNS

2. Regularly back up everything and store copies in both cloud and local storage (a minimal scripted sketch follows this list)

3. Pay for an alternative hosting company as a standby

4. Regularly restore your backups to the standby host and test!
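
For item 2, a minimal sketch of what a scripted backup could look like (the paths, database name, and credentials are placeholders, and it assumes tar, gzip, and mysqldump are available on the host - an illustration, not a turnkey solution):

```php
<?php
// Minimal, illustrative backup sketch - run from the command line (php backup.php).
// Placeholders throughout: adjust paths and credentials, then copy the output
// files to both cloud and local storage as suggested in item 2 above.
$stamp   = date( 'Y-m-d_His' );
$docroot = '/home/youruser/public_html';  // placeholder path
$outdir  = '/home/youruser/backups';      // placeholder path

if ( ! is_dir( $outdir ) ) {
    mkdir( $outdir, 0700, true );
}

// 1. Archive the site files.
$tarball = "$outdir/site_$stamp.tar.gz";
passthru( sprintf(
    'tar -czf %s -C %s .',
    escapeshellarg( $tarball ),
    escapeshellarg( $docroot )
) );

// 2. Dump the database (mysqldump must be on the PATH; credentials are placeholders).
$dump = "$outdir/db_$stamp.sql.gz";
passthru( sprintf(
    'mysqldump --user=%s --password=%s %s | gzip > %s',
    escapeshellarg( 'dbuser' ),
    escapeshellarg( 'dbpass' ),
    escapeshellarg( 'dbname' ),
    escapeshellarg( $dump )
) );

echo "Wrote $tarball and $dump\n";
```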

 

Good luck


MDD helped me set up a new account, and the only thing I had to do was copy my own backup over using FTP. I set up the email accounts and that was it.

 

Everything is online again - thanks, everybody, for being helpful!

 

The question still remains why Cloudflare was not serving a cached version - these were static sites. Can someone answer this?


I'm going to elaborate on what is taking so long in this process.

 

I'll explain the process a little bit so that you can understand what is going on and why.

  1. The backup server we were backing up to in Phoenix is having issues and is not able to send data very quickly. The data rate is between 100 and 500 megabytes/second for a single stream, and as soon as we start up more than one stream they all drop down to 1 to 5 megabytes/second.
  2. Due to 1 above, we can't do restorations directly from the backup server at any reasonable rate - it's too slow.
  3. Because a single stream is significantly faster (though still much slower than we'd like), the single-stream transfer rate is the bottleneck we are dealing with (a rough illustration of the arithmetic follows this list).
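
To put rough numbers on point 3 (back-of-the-envelope only, using the rates quoted above and an arbitrary stream count): a few dozen parallel streams from the struggling backup server top out around 30-150 MB/s combined, which at best matches - and usually falls well below - a single stream at 100-500 MB/s.

```php
<?php
// Back-of-the-envelope illustration using the rates quoted above (MB/s).
// The stream count is arbitrary - the point is the aggregate, not the exact number.
$single_stream  = array( 'min' => 100, 'max' => 500 ); // one stream from the Phoenix backup server
$per_stream     = array( 'min' => 1,   'max' => 5 );   // per-stream rate once several streams run
$parallel_count = 30;

printf( "single stream:       %d-%d MB/s\n", $single_stream['min'], $single_stream['max'] );
printf( "%d parallel streams: %d-%d MB/s combined\n",
    $parallel_count,
    $parallel_count * $per_stream['min'],
    $parallel_count * $per_stream['max'] );
// Combined: 30-150 MB/s, so adding streams against the slow backup server
// doesn't beat a single export stream - hence one export at a time.
```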

 

The process is as follows:

 

  1. Export the server backup from the Slow Backup server to a 4x1.6TB SSD array. This takes between 4 and 8 hours per server. We can't do more than one at a time - or it'll take weeks.
  2. Once the export is done, take the 4x1.6TB SSDs out of Slow Backup and physically move them to "Fast Backup" - a new server we brought online to effect restorations very quickly.
  3. Once the 4x1.6TB SSDs are in Fast Backup, we restore that backup image to an array of 24x1.6TB SSDs on hardware RAID.
  4. We then begin 30 to 50 simultaneous restoration processes from this new backup server directly to the production server. For most servers this results in a full restore within an hour or two from the time we start restoring.

The hold-up is the step where we're copying the data to the SSDs and moving it to the fast server - once it's moved over the process is extremely fast.

Here's an image from the transfer process of one of the server images to give you an idea: http://www.screen-shot.net/2018-09-25_11-24-57.png

If anybody has a backup of their own - we can use that to get you online immediately. We do have several dozen clients that provided backups in the beginning and were back online on day one. I will be honest in saying that it has surprised me how few of our clients have their own backups. I do my best to preach how important having your own backups is, no matter who your provider is or what they promise - even us - but it doesn't seem like anybody listens to this advice until it bites them in the rear.

 

We have learned a lot from this incident. A few things we weren't doing that we should be doing, a few things we should do differently, and a few things we shouldn't do.

 

One of the changes is snapshotting on our storage platform - which would have allowed us to restore services within minutes. Another is having more than one backup server and testing both the backup and restoration rates on a regular basis so we can resolve any slowness / issues when we aren't in need of the data from those systems.

 

There are a lot of internal policies and procedures we're changing as well. Once we're back online and our clients' issues are resolved we'll detail all of this but for now we're focusing on restoration.


I'm on S3, so from a timing perspective when might I reasonably expect to see my sites restored? The status thread says restoration will start soon but I'm not sure what that means.

What that means is we're waiting on the facility to move the SSDs from the Slow Backup server to the Fast Backup server so we can start restoring the image and then restoring data. I'd estimate 1 hour before we start restoring but that's just my estimate. This would be step 2 in the process listed in my post just prior to this one.


As I understand it, we were all very lucky that you had another (off-site) backup of the data. If it were only the snapshots, everyone would have been up the creek without a paddle. A lot of hosts would simply point to the 'suicide clause' in their TOS that says to maintain and rely only on your own backups. You have that clause in your TOS but still went the extra mile.

 

No one wants downtime, and it's never easy, especially when you have clients hooting and hollering at you over something you really cannot control. It happens and it sucks, but frankly this turned into a good learning opportunity.

 

I learned that I had grown complacent with MDD's stellar uptime and general lack of headaches compared to other hosts. I wasn't pulling down backups as often as I should have. That's on me.

 

Secondly, only some of my clients are on Cloudflare. All my clients are going to use Cloudflare in the future. With Cloudflare I can, at the very least, point email at another host with minimal downtime until MDD restores.

 

Current backups + Cloudflare means I can restore service to another server pretty much as fast as I'm able to upload the backups.

 

For additional peace of mind, I'm going to keep a 'hot' account on another server with email forwarders set up for all clients. This way I can just do an MX switch at Cloudflare to restore email (the lifeblood of many clients' businesses) ASAP, then work on restoring the websites.

 

Mike and the team still have my vote of confidence - even more so after seeing how they handled this outage. I really should've had this disaster plan in place already.


Sadly - human error - and then a chain of unfortunate circumstances got us to where we ended up. A chain that we will be breaking in numerous places so that this cannot happen again.

Hi Mike, not to divert the topic, but I'm both curious and concerned about the person responsible for this error. Honestly, the first thing that came to my mind when things got a bit serious was how this would have affected the admin psychologically. I don't know... I would have had some kind of nervous breakdown or something if it were me. How's he/she holding up, BTW? I hope he/she is OK.

