Resolved SD1, SR1, VPS1 Outage on 10/29/2014 @ 2:04 AM

Michael D. · October 29, 2014

Hello,

First and foremost, I want to apologize for this outage. Our internal monitoring alerted us to the issue at 2:04 AM, and we have been working on restoring service since then. All hands were brought on deck, and after approximately 7 hours we resolved an obscure and undocumented issue and restored all services successfully. I go over the issue in more detail below.

We use Logical Volume Management to provide distinct storage to each piece of hardware. The volumes are thinly provisioned, so they share the same overall pool of storage and consume only the space they actually need, which makes the most efficient use of the storage.

Today at 2:04 AM ET, the storage system for the SD1, SR1, and VPS1 servers went read-only with file system errors. All server administrators were alerted and brought on duty to investigate and resolve the issue. Our initial investigation determined that, due to a single configuration error, the servers were not giving free space back to the pool, so their volumes kept growing and never shrank. We do monitor the storage, but we were not correctly watching this metric.
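To illustrate why a pool that never gets freed space back eventually fills up, here is a minimal toy model in Python. It is only a sketch: the `ThinPool` class, the `reclaim_freed_space` flag, and the sizes are hypothetical names and numbers made up for this illustration, not our actual configuration. It also assumes the worst case, where every new write lands on blocks the pool has not allocated yet.

```python
from typing import Optional


class ThinPool:
    """Toy thin pool: space is allocated on write, not reserved up front."""

    def __init__(self, capacity_gb: int, reclaim_freed_space: bool) -> None:
        self.capacity_gb = capacity_gb
        # Hypothetical flag standing in for "servers give free space back".
        self.reclaim_freed_space = reclaim_freed_space
        self.allocated_gb = 0

    def write(self, gb: int) -> None:
        if self.allocated_gb + gb > self.capacity_gb:
            raise RuntimeError("pool exhausted")
        self.allocated_gb += gb

    def delete(self, gb: int) -> None:
        # Deleted data only shrinks the allocation when reclamation is on.
        if self.reclaim_freed_space:
            self.allocated_gb = max(0, self.allocated_gb - gb)


def cycles_until_full(pool: ThinPool, churn_gb: int,
                      max_cycles: int = 1000) -> Optional[int]:
    """Write then delete churn_gb repeatedly.

    Returns the cycle on which a write no longer fits, or None if the
    pool survives max_cycles.
    """
    for cycle in range(1, max_cycles + 1):
        try:
            pool.write(churn_gb)
        except RuntimeError:
            return cycle
        pool.delete(churn_gb)
    return None


healthy = cycles_until_full(ThinPool(100, reclaim_freed_space=True), 10)
leaking = cycles_until_full(ThinPool(100, reclaim_freed_space=False), 10)
print(healthy, leaking)
```

In this model the pool with reclamation never fills, while the one without it runs out of space even though no data is being kept, which is the "growing and never shrinking" behavior described above.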

All initial research on this specific issue indicated that the data was irreparably destroyed, so we determined we needed to begin restoring our backups from 10:30 PM [3.5 hours before the issue] to hardware. We brought up extra hardware and began restoring backups immediately. While the restorations were in progress, we continued working to recover the data on the original storage. It took us about 7 hours to successfully repair the data, at which point we checked the restoration and it showed at least 3 more hours remaining.

We then brought up the servers with the repaired storage, and SD1 and SR1 came online immediately. VPS1 needed a file system check, but because the storage is solid state this was completed within minutes and VPS1 was brought back online.

We have identified several changes that will prevent this from happening again.
* We will properly monitor storage pool free space to catch this issue before it becomes a problem.
* We will be converting the storage to a state more easily managed.
* We are looking at high availability storage that will prevent this issue.
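As a rough idea of what the first item could look like, here is a minimal Python sketch of a pool usage check. It assumes the pools are LVM thin pools on Linux and that a report in `name,data_percent,metadata_percent` form is available, for example from `lvs --noheadings --separator , -o lv_name,data_percent,metadata_percent`. The pool names, the numbers in `SAMPLE`, and the 80% threshold are made up for illustration and are not our real values.

```python
from typing import List, Tuple


def parse_pool_usage(report: str) -> List[Tuple[str, float, float]]:
    """Parse 'name,data_percent,metadata_percent' lines into tuples."""
    pools = []
    for line in report.splitlines():
        fields = [field.strip() for field in line.split(",")]
        # Skip blank lines and volumes whose usage columns are empty
        # (in this sketch, anything that is not a thin pool).
        if len(fields) != 3 or not fields[1] or not fields[2]:
            continue
        name, data_pct, meta_pct = fields
        pools.append((name, float(data_pct), float(meta_pct)))
    return pools


def pools_over_threshold(report: str, threshold: float = 80.0) -> List[str]:
    """Return pools whose data or metadata usage is at or above threshold."""
    return [
        name
        for name, data_pct, meta_pct in parse_pool_usage(report)
        if max(data_pct, meta_pct) >= threshold
    ]


# Hypothetical sample report; names and percentages are illustrative only.
SAMPLE = """
  pool_sd1,91.40,12.05
  pool_sr1,42.10,5.30
  root,,
"""

print(pools_over_threshold(SAMPLE))
```

A check like this would run on a schedule and raise an alert for any pool it returns, so that a pool that keeps growing is noticed well before it runs out of space.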

You may find it unusual that we did not keep the forums up to date, as this is something we normally do during an outage. I normally handle the forum updates personally, but due to the critical nature of this outage I was focused on resolving the issue. In the future, if an issue arises where I am focused on restoring the service and cannot post updates myself, I will bring in an additional staff member for the sole purpose of keeping customers updated on changes as they happen.

Outages such as these are extremely rare for us and we do apologize again for any trouble it may have caused you.

Michael D. · October 29, 2014

We believe there are residual file system issues, given the nature of the original failure and some oddities we are seeing. These file system errors will require a reboot for a file system check, which we are scheduling for 1 AM ET tonight. We expect the maintenance to take less than 10 minutes, although we are scheduling 1 hour in case we run into anything unexpected.

Fresh backups will be taken 2 hours prior to the maintenance.

We will keep this thread updated during this process.

iutopi · October 29, 2014

This really sucks. My clients are very angry with me for this. I'm trying to temporarily restore their sites at a different hosting provider. But I'm really, really furious with this...

Laimonas · October 29, 2014

Were you restoring via R1Soft?

Michael D. · October 29, 2014

iutopi said: "This really sucks. My clients are very angry with me for this. I'm trying to temporarily restore their sites at a different hosting provider. But I'm really, really furious with this..."

The issue has been resolved for many hours - I'm not sure why as of 2 PM today you would be attempting to restore sites to another provider. The servers have been online since 9 AM [this issue has been resolved for 6 hours now]. If you need help generating backups let us know in a support ticket.

Laimonas said: "Were you restoring via R1Soft?"

Correct.

iutopi · October 29, 2014

One of my WordPress installs crashes hard! Is it possible to restore just one WordPress folder + database?

Michael D. · October 29, 2014

iutopi said: "One of my WordPress installs crashes hard! Is it possible to restore just one WordPress folder + database?"

It is absolutely possible but you'll need to open a ticket.

Michael D. · October 30, 2014

We are shutting down SR1 now for this maintenance. We are doing one server at a time.

Michael D. · October 30, 2014

SR1 is coming back online.

Michael D. · October 30, 2014

SD1 is now going down for this maintenance.

Michael D. · October 30, 2014

SD1 is coming back online.

Michael D. · October 30, 2014

This maintenance is now complete.

slushatwork · October 30, 2014

Though this didn't affect me, I appreciate the detailed explanation and updates. I have five different hosting companies for various clients and different aspects of my business, and you guys are the most dedicated to keeping customers in the loop on any problems.

NickK · October 30, 2014

Yeah, they even e-mailed us, when otherwise we probably wouldn't have noticed. I thought letting us know was cool, although obviously I'm not happy about the downtime. But the support team has earned enough good will that I don't really mind.

Thanks for the updates. Is everything ok with SR1 now?

Michael D. · October 30, 2014

NickK said: "Yeah, they even e-mailed us, when otherwise we probably wouldn't have noticed. I thought letting us know was cool, although obviously I'm not happy about the downtime. But the support team has earned enough good will that I don't really mind."

NickK said: "Thanks for the updates. Is everything ok with SR1 now?"

That is the 'danger' of emailing everybody and letting them know you had an issue - some probably wouldn't notice if you didn't do it. That said it's more important that you know what happened, why it happened, and how we're working to prevent it than trying to hide that it occurred. Many providers take the opposite route which is unfortunate.

SR1 is fine - file system check came back OK.

Laimonas · October 30, 2014

Yes, my previous hosting provider tried to hide some outages and it was one of the reasons I left them. Actually, I was not happy that this outage happened only 3 months after I chose MDD, but I must add that MDD's fairness rocks.

Michael D. · October 30, 2014

This is the worst outage we've had in years but thankfully it only affected 3 servers and not our entire fleet. I saw HostGator was having a bad day as well - all of their reseller servers were offline all day and they're still working on repairing MySQL across their reseller fleet. Yesterday was just a bad day I think.

Laimonas · October 30, 2014

Are there any dates set on these two steps:

* We will be converting the storage to a state more easily managed
* We are looking at high availability storage that will prevent this issue.

Michael D. · October 30, 2014

Laimonas said: "Are there any dates set on these two steps:
* We will be converting the storage to a state more easily managed
* We are looking at high availability storage that will prevent this issue."

As soon as we can get everything lined up. Talking with several hardware vendors currently.

Laimonas · October 30, 2014

Michael D. said: "Yesterday was just a bad day I think."

NASA rocket also failed yesterday.

Michael D. · October 31, 2014

Laimonas said: "NASA rocket also failed yesterday."

Ouch - I wasn't aware. I'll have to do a search. Hopefully nobody was hurt.
