MDDHosting Forums

SD1, SR1, VPS1 Outage on 10/29/2014 @ 2:04 AM



Hello,

First and foremost, I want to apologize for this outage. Our internal monitoring alerted us to the issue at 2:04 AM, and we have been working on restoring service since that time. All hands were brought on deck, and we were able to solve an obscure and undocumented issue, restoring all services successfully after approximately 7 hours. I want to go over the issue in more detail below.

We use Logical Volume Management to provide distinct storage to each piece of hardware. The volumes are thinly provisioned so that they share the same overall pool of storage without using more space than they actually need, which makes the most efficient use of the storage.

Today at 2:04 AM ET the storage system for the SD1, SR1, and VPS1 servers went read-only with file system errors. All server administrators were alerted and brought on duty to investigate and resolve the issue. Upon initial investigation we determined that, due to a single configuration error, the servers were not giving free space back to the pool, so their volumes kept growing and never shrank. We do monitor the storage, but we were not correctly watching this metric.
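To illustrate why unreturned free space matters in a thinly provisioned pool, here is a toy Python model. It is a simplified sketch, not the storage stack involved in this outage: the class, the block counts, and the volume name are hypothetical, and it assumes every write lands on blocks the volume has not touched before.

```python
class ThinPool:
    """Toy model of a thinly provisioned pool (illustrative only)."""

    def __init__(self, capacity_blocks):
        self.capacity_blocks = capacity_blocks
        self.allocated = {}  # volume name -> blocks handed out by the pool

    def used(self):
        return sum(self.allocated.values())

    def write(self, volume, blocks):
        # The pool only hands out blocks when a volume actually writes.
        if self.used() + blocks > self.capacity_blocks:
            raise RuntimeError("pool exhausted")
        self.allocated[volume] = self.allocated.get(volume, 0) + blocks

    def delete(self, volume, blocks, discard):
        # Without discard the pool never learns that the blocks are free,
        # so the volume's allocation only ever grows.
        if discard:
            current = self.allocated.get(volume, 0)
            self.allocated[volume] = max(0, current - blocks)


def simulate(discard, cycles=10):
    """Write 20 blocks and delete 15 per cycle against a 100-block pool.

    Returns (blocks used, cycle at which the pool ran out or None).
    """
    pool = ThinPool(capacity_blocks=100)
    for cycle in range(1, cycles + 1):
        try:
            pool.write("vps1", 20)
        except RuntimeError:
            return pool.used(), cycle
        pool.delete("vps1", 15, discard=discard)
    return pool.used(), None


print(simulate(discard=True))   # freed blocks go back to the pool
print(simulate(discard=False))  # freed blocks are never returned
```

In this model the same workload stays at half the pool when deletes are passed down, and runs the pool out of space partway through when they are not.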

All initial research on this specific issue indicated that the data was irreparably destroyed, so we decided to begin restoring our backups from 10:30 PM [3.5 hours before the issue] to hardware. We brought up extra hardware and began restoring backups immediately. While the restorations were in progress we continued working to recover the data on the original storage. It took us about 7 hours to successfully repair the data, at which point we checked the restoration and it showed at least 3 more hours remaining.

We then brought up the servers with the repaired storage, and SD1 and SR1 came online immediately. VPS1 needed a file system check, but because the storage is solid state the check completed within minutes and VPS1 was brought back online.

We have identified several changes that will prevent this from happening again.
* We will properly monitor storage pool free space to catch this issue before it becomes a problem.
* We will be converting the storage to a state that is more easily managed.
* We are looking at high availability storage that will prevent this issue.
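
For readers curious what a pool free-space check might look like, here is a minimal Python sketch. It assumes the pool is an LVM thin pool and that usage is read from `lvs --noheadings --separator , -o lv_name,data_percent,metadata_percent`; the pool names, the sample output, and the 80% threshold are hypothetical, and this is not the monitoring actually deployed.

```python
from typing import List


def pools_over_threshold(lvs_output: str, threshold: float = 80.0) -> List[str]:
    """Return an alert line for every pool at or above the threshold.

    Expects comma-separated lines of: name, data percent, metadata percent.
    Volumes that report no data percent (not thin) are skipped.
    """
    alerts = []
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3 or not fields[1]:
            continue
        name, data_pct, meta_pct = fields
        # Either data or metadata filling up is a problem for a thin pool.
        worst = max(float(data_pct), float(meta_pct or 0))
        if worst >= threshold:
            alerts.append(f"{name}: {worst:.1f}% used")
    return alerts


# Hypothetical sample output; real values would come from running lvs.
SAMPLE = """
  pool_sd1,91.27,12.40
  pool_sr1,45.00,8.10
  root,,
"""

for alert in pools_over_threshold(SAMPLE):
    print(alert)
```

Run on a schedule, a check like this would flag a pool that keeps growing well before it runs out of space.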

You may find it unusual that we did not keep the forums up to date, as we normally do during an outage. I normally handle the forum updates personally, but due to the critical nature of this outage I was focused on resolving the issue. In the future, should an issue arise where I am focused on restoring service and cannot post updates myself, I will bring in an additional staff member for the sole purpose of keeping customers updated as changes happen.

Outages such as these are extremely rare for us, and we apologize again for any trouble this may have caused you.


We believe there are residual file system issues due to the nature of the original failure and some oddities we are seeing. Fixing these errors will require a reboot for a file system check, which we are scheduling for 1 AM ET tonight. We expect the maintenance to take less than 10 minutes, although we are scheduling 1 hour in case we run into anything unexpected.

Fresh backups will be taken 2 hours prior to the maintenance.

We will keep this thread updated during this process.


This really sucks. My clients are very angry with me over this. I'm trying to temporarily restore their sites at a different hosting provider. But I'm really, really furious about this...

The issue has been resolved for many hours - I'm not sure why as of 2 PM today you would be attempting to restore sites to another provider. The servers have been online since 9 AM [this issue has been resolved for 6 hours now]. If you need help generating backups let us know in a support ticket.

Were you restoring via R1Soft?

Correct.


Yeah, they even e-mailed us, when otherwise we probably wouldn't have noticed ;) . I thought letting us know was cool, although obviously I'm not happy about the downtime. But the support team has earned enough good will that I don't really mind.

Thanks for the updates. Is everything ok with SR1 now?


That is the 'danger' of emailing everybody to let them know you had an issue - some probably wouldn't notice if you didn't. That said, it's more important that you know what happened, why it happened, and how we're working to prevent it than for us to try to hide that it occurred. Many providers take the opposite route, which is unfortunate.

SR1 is fine - file system check came back OK.


This is the worst outage we've had in years but thankfully it only affected 3 servers and not our entire fleet. I saw HostGator was having a bad day as well - all of their reseller servers were offline all day and they're still working on repairing MySQL across their reseller fleet. Yesterday was just a bad day I think.


Are there any dates set on these two steps:

* We will be converting the storage to a state more easily managed

* We are looking at high availability storage that will prevent this issue.

As soon as we can get everything lined up. Talking with several hardware vendors currently.

