SD1, SR1, VPS1 Outage on 10/29/2014 @ 2:04 AM

Resolved SD1 SR1 VPS1 Outage

20 replies to this topic

#1 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 29 October 2014 - 12:37 PM

Hello,

 

First and foremost I want to apologize for this outage.  Our internal monitoring alerted us to the issue at 2:04 AM ET and we have been working on restoring service since that time.  All hands were brought on deck and we were able to solve an obscure and undocumented issue, restoring all services successfully after approximately 7 hours.  I go over the issue in more detail below.

We use Logical Volume Management (LVM) to provide distinct storage to each piece of hardware.  The volumes are thinly provisioned so that they share the same overall pool of storage without using more space than they actually need, which allows maximum efficiency of the storage.

Today at 2:04 AM ET the storage system for the SD1, SR1, and VPS1 servers went read-only with file system errors.  All server administrators were alerted and brought on duty to investigate and resolve the issue.  Upon initial investigation we determined that, due to a single configuration error, the servers were not giving free space back to the pool, so their volumes kept growing and never shrank.  We do monitor the storage but we were not correctly watching this metric.
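
As a rough illustration of this failure mode, here is a minimal toy model in Python. It is not the actual configuration and not real LVM behaviour: `ThinPool`, `churn`, the volume name, and the block counts are hypothetical, and the model assumes every new write lands on a block the pool has not seen before. It only shows why a pool that never gets freed space handed back will eventually fill, even when the data inside the volumes is being deleted.

```python
from dataclasses import dataclass, field


@dataclass
class ThinPool:
    """Toy model of a thinly provisioned pool (illustrative only)."""
    capacity_blocks: int
    discard_enabled: bool  # assumption: stands in for "free space is given back"
    allocated: set = field(default_factory=set)  # (volume, block) pairs backed by the pool

    def write(self, volume: str, block: int) -> None:
        key = (volume, block)
        if key not in self.allocated:
            if len(self.allocated) >= self.capacity_blocks:
                raise RuntimeError("pool exhausted")
            self.allocated.add(key)

    def delete(self, volume: str, block: int) -> None:
        # The volume's file system considers the block free either way;
        # the pool only reclaims it if the free is passed down to it.
        if self.discard_enabled:
            self.allocated.discard((volume, block))

    @property
    def used_percent(self) -> float:
        return 100.0 * len(self.allocated) / self.capacity_blocks


def churn(pool: ThinPool, cycles: int, blocks_per_cycle: int) -> int:
    """Write fresh blocks, then delete them, each cycle.

    Returns the number of cycles completed before the pool filled up.
    """
    next_block = 0
    for cycle in range(cycles):
        blocks = range(next_block, next_block + blocks_per_cycle)
        next_block += blocks_per_cycle
        try:
            for b in blocks:
                pool.write("vps1", b)
        except RuntimeError:
            return cycle
        for b in blocks:
            pool.delete("vps1", b)
    return cycles
```

With a 100-block pool and 10 blocks written and deleted per cycle, the model with space being returned runs indefinitely, while the one without it fills after 10 cycles even though the volume holds no live data at that point.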

All initial research on this specific issue indicated that the data was irreparably destroyed, so we determined we needed to begin restoring our backups from 10:30 PM [3.5 hours before the issue] to hardware.  We brought up extra hardware and began restoring backups immediately.  While restorations were in progress we continued to work at recovering the data on the original storage.  It took us about 7 hours to successfully repair the data, at which time we checked the restoration and it showed at least 3 more hours remaining.

We then brought up the servers with the repaired storage, and SD1 and SR1 came online immediately.  VPS1 needed a file system check, but because the storage is solid state the check completed within minutes and VPS1 was brought back online.

We have identified several changes that will prevent this from happening again.
* We will properly monitor storage pool free space to catch this issue before it becomes a problem.
* We will be converting the storage to a state that is more easily managed.
* We are looking at high availability storage that will prevent this issue.
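
A minimal sketch of what the first item could look like, in Python. This is not the monitoring actually in use, which the post does not describe. It assumes the standard `lvs` command with the `vg_name`, `lv_name`, `lv_attr`, `data_percent`, and `metadata_percent` report fields, and that thin pools show `t` as the first `lv_attr` character. The 80% limits and the names `PoolUsage`, `parse_thin_pools`, `over_threshold`, and `check_host` are hypothetical.

```python
import subprocess
from typing import List, NamedTuple

# Assumed report fields; output is one comma-separated line per logical volume.
LVS_CMD = [
    "lvs", "--noheadings", "--separator", ",",
    "-o", "vg_name,lv_name,lv_attr,data_percent,metadata_percent",
]


class PoolUsage(NamedTuple):
    name: str
    data_percent: float
    metadata_percent: float


def parse_thin_pools(lvs_output: str) -> List[PoolUsage]:
    """Extract thin pools and their usage from the text printed by LVS_CMD."""
    pools = []
    for line in lvs_output.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            continue
        vg, lv, attr, data, meta = fields
        # Keep only thin pools (attr starts with "t") that report both figures.
        if not attr.startswith("t") or not data or not meta:
            continue
        pools.append(PoolUsage(f"{vg}/{lv}", float(data), float(meta)))
    return pools


def over_threshold(pools: List[PoolUsage],
                   data_limit: float = 80.0,
                   meta_limit: float = 80.0) -> List[PoolUsage]:
    """Return the pools whose data or metadata usage has reached a limit."""
    return [p for p in pools
            if p.data_percent >= data_limit or p.metadata_percent >= meta_limit]


def check_host() -> List[PoolUsage]:
    """Run lvs on this host (usually needs root) and return pools needing attention."""
    result = subprocess.run(LVS_CMD, capture_output=True, text=True, check=True)
    return over_threshold(parse_thin_pools(result.stdout))
```

Running `check_host()` on a schedule and alerting whenever it returns a non-empty list would flag a pool that is filling well before it runs out. Watching the metadata figure alongside the data figure is a design choice here, since either one reaching 100% is a problem for a thin pool.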

 

You may find it unusual that we were not keeping the forums up to date, as this is something we normally do during an outage.  I normally handle the forum updates personally, but due to the critical nature of this outage I was focused on resolving the issue.  In the future, should an issue arise where I am focused on restoring service and am not able to post updates myself, I will bring in an additional staff member for the sole purpose of keeping customers updated on changes as they happen.

 

Outages such as these are extremely rare for us and we do apologize again for any trouble it may have caused you.


  • 1
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#2 MikeDVB

Posted 29 October 2014 - 12:44 PM

We believe there to be residual file system issues due to the nature of the original failure and some oddities we are seeing.  These file system errors will require a reboot for a file system check, which we are scheduling for 1 AM ET tonight.  We expect the maintenance to take less than 10 minutes, although we are scheduling 1 hour just in case we run into anything unexpected.

 

Fresh backups will be taken 2 hours prior to the maintenance.

 

We will keep this thread updated during this process.


  • 0

#3 iutopi

iutopi

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 29 October 2014 - 01:07 PM

This really sucks. My clients are very angry with me for this. I'm trying to temporarily restore their sites at a different hosting provider. But I'm really, really furious about this...


  • 0

#4 Laimonas

Laimonas

    Newbie

  • Members
  • Pip
  • 14 posts
  • Gender:Male
  • Location:European Union
  • Interests:wine

Posted 29 October 2014 - 01:56 PM

Were you restoring via R1Soft?


  • 0

#5 MikeDVB

Posted 29 October 2014 - 02:11 PM

This really sucks. My clients are very angry with me for this. I'm trying to temporarily restore their sites at a different hosting provider. But I'm really, really furious about this...

The issue has been resolved for many hours - I'm not sure why as of 2 PM today you would be attempting to restore sites to another provider.  The servers have been online since 9 AM [this issue has been resolved for 6 hours now].  If you need help generating backups let us know in a support ticket.

 

Were you restoring via R1Soft?

Correct.


  • 0

#6 iutopi

Posted 29 October 2014 - 03:34 PM

One of my WordPress installs crashes hard! Is it possible to restore just one WordPress folder + database?


  • 0

#7 MikeDVB

Posted 29 October 2014 - 04:04 PM

One of my WordPress installs crashes hard! Is it possible to restore just one WordPress folder + database?

It is absolutely possible but you'll need to open a ticket.


  • 0

#8 MikeDVB

Posted 30 October 2014 - 12:02 AM

We are shutting down SR1 now for this maintenance.  We are doing one server at a time.


  • 0

#9 MikeDVB

Posted 30 October 2014 - 12:12 AM

SR1 is coming back online.


  • 0

#10 MikeDVB

Posted 30 October 2014 - 12:13 AM

SD1 is now going down for this maintenance.


  • 0

#11 MikeDVB

Posted 30 October 2014 - 12:18 AM

SD1 is coming back online.


  • 0

#12 MikeDVB

Posted 30 October 2014 - 12:18 AM

This maintenance is now complete.


  • 0

#13 slushatwork

slushatwork

    Newbie

  • Members
  • Pip
  • 2 posts

Posted 30 October 2014 - 08:29 AM

Though this didn't affect me, I appreciate the detailed explanation and updates. I have five different hosting companies for various clients and different aspects of my business, and you guys are the most dedicated to keeping customers in the loop on any problems.


  • 0

#14 NickK

NickK

    Newbie

  • Members
  • Pip
  • 1 posts

Posted 30 October 2014 - 08:34 AM

Yeah, they even e-mailed us, when otherwise we probably wouldn't have noticed ;) . I thought letting us know was cool, although obviously I'm not happy about the downtime. But the support team has earned enough good will that I don't really mind.

 

Thanks for the updates. Is everything ok with SR1 now?


  • 0

#15 MikeDVB

Posted 30 October 2014 - 09:32 AM

Yeah, they even e-mailed us, when otherwise we probably wouldn't have noticed ;) . I thought letting us know was cool, although obviously I'm not happy about the downtime. But the support team has earned enough good will that I don't really mind.

 

Thanks for the updates. Is everything ok with SR1 now?

That is the 'danger' of emailing everybody and letting them know you had an issue - some probably wouldn't notice if you didn't do it.  That said it's more important that you know what happened, why it happened, and how we're working to prevent it than trying to hide that it occurred.  Many providers take the opposite route which is unfortunate.

 

SR1 is fine - file system check came back OK.


  • 0

#16 Laimonas

Posted 30 October 2014 - 03:31 PM

Yes, my previous hosting provider tried to hide some outages and it was one of the reasons I left them. Actually, I was not happy that this outage happened only 3 months after I chose MDD, but I must add that MDD's fairness rocks.


  • 0

#17 MikeDVB

Posted 30 October 2014 - 03:34 PM

This is the worst outage we've had in years but thankfully it only affected 3 servers and not our entire fleet.  I saw HostGator was having a bad day as well - all of their reseller servers were offline all day and they're still working on repairing MySQL across their reseller fleet.  Yesterday was just a bad day I think.


  • 0

#18 Laimonas

Posted 30 October 2014 - 05:17 PM

Are there any dates set on these two steps:

* We will be converting the storage to a state more easily managed
* We are looking at high availability storage that will prevent this issue.


  • 0

#19 MikeDVB

Posted 30 October 2014 - 05:19 PM

Are there any dates set on these two steps:

* We will be converting the storage to a state more easily managed
* We are looking at high availability storage that will prevent this issue.

As soon as we can get everything lined up.  We are currently talking with several hardware vendors.


  • 0

#20 Laimonas

Posted 30 October 2014 - 05:40 PM

 Yesterday was just a bad day I think.

NASA rocket also failed yesterday.


  • 0
