First and foremost, I want to apologize for this outage. Our internal monitoring alerted us to the issue at 2:04 AM, and we worked on restoring service from that point on. All hands were brought on deck, and after approximately 7 hours we resolved an obscure and undocumented issue and restored all services successfully. I want to go over the issue in more detail, which you will find below.
We use Logical Volume Management (LVM) to provide distinct storage to each piece of hardware. The volumes are thinly provisioned so that they share the same overall pool of storage without using more space than they actually need, which allows maximum efficiency of the storage.
Today at 2:04 AM ET the storage system for the SD1, SR1, and VPS1 servers went read-only with file system errors. All server administrators were alerted and brought on duty to investigate and resolve the issue. Upon initial investigation we determined that, due to a single configuration error, the servers were not giving free space back to the pool, so their volumes kept growing and never shrank. We do monitor the storage, but we were not correctly watching this metric.
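To illustrate why a volume that never returns freed space is a problem on thinly provisioned storage, here is a toy model in Python. It is not LVM code and does not reflect our actual configuration. The class name, block counts, and the assumption that every write lands on previously unused blocks are all invented for the example.

```python
class ThinPool:
    """Toy model of a shared thin pool (hypothetical, not real LVM behavior)."""

    def __init__(self, capacity_blocks, discard_enabled):
        self.capacity = capacity_blocks
        self.discard_enabled = discard_enabled
        self.allocated = 0  # blocks the pool has handed out to volumes

    def write(self, blocks):
        # Assumption: each write needs fresh blocks from the pool.
        if self.allocated + blocks > self.capacity:
            raise RuntimeError("pool exhausted")
        self.allocated += blocks

    def delete(self, blocks):
        # Freed blocks only return to the pool if discards reach it.
        if self.discard_enabled:
            self.allocated -= blocks


def simulate(discard_enabled, cycles=10):
    """Write and delete 20 blocks per cycle; return (cycles completed, blocks allocated)."""
    pool = ThinPool(capacity_blocks=100, discard_enabled=discard_enabled)
    completed = 0
    for _ in range(cycles):
        try:
            pool.write(20)
        except RuntimeError:
            break  # the pool has no free blocks left
        pool.delete(20)
        completed += 1
    return completed, pool.allocated
```

In this model, the run with discards enabled finishes all ten cycles with nothing left allocated, while the run without them fills the pool after five cycles even though the data was deleted each time.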
All initial research on this specific issue indicated that the data was irreparably destroyed, so we decided to begin restoring our backups from 10:30 PM (about 3.5 hours before the issue) to separate hardware. We brought up extra hardware and began restoring backups immediately. While the restorations were in progress, we continued working to recover the data on the original storage. It took us about 7 hours to successfully repair the data, at which point we checked the restoration and it showed at least 3 more hours remaining.
We then brought up the servers with the repaired storage, and SD1 and SR1 came online immediately. VPS1 needed a file system check, but because the storage is solid state, the check completed within minutes and VPS1 was brought back online.
We have identified several changes that will prevent this from happening again.
* We will properly monitor storage pool free space to catch this issue before it becomes a problem.
* We will be converting the storage to a state that is more easily managed.
* We are looking at high availability storage that will prevent this issue.
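As a rough sketch of what the pool free space check could look like, the Python below parses the output of the LVM `lvs` command and flags thin pools whose data or metadata usage crosses a threshold. This is an illustration only, not our actual monitoring: the 80 percent threshold, the function names, and the `|` separator are assumptions made for the example.

```python
import subprocess

# Hypothetical alert threshold; choose one that leaves time to react.
ALERT_PERCENT = 80.0

LVS_CMD = [
    "lvs", "--noheadings", "--separator", "|",
    "-o", "vg_name,lv_name,lv_attr,data_percent,metadata_percent",
]


def parse_lvs(output):
    """Return (vg, lv, data_percent, metadata_percent) for each thin pool row."""
    pools = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 5:
            continue
        vg, lv, attr, data, meta = fields
        # The first lv_attr character is "t" for a thin pool.
        if not attr.startswith("t") or not data:
            continue
        pools.append((vg, lv, float(data), float(meta) if meta else 0.0))
    return pools


def pools_over_threshold(pools, threshold=ALERT_PERCENT):
    """Keep pools whose data or metadata usage is at or above the threshold."""
    return [p for p in pools if p[2] >= threshold or p[3] >= threshold]


def check_pools():
    """Run lvs (needs LVM installed and sufficient privileges) and return alerts."""
    result = subprocess.run(LVS_CMD, check=True, capture_output=True, text=True)
    return pools_over_threshold(parse_lvs(result.stdout))
```

A scheduler such as cron could call `check_pools()` every few minutes and page an administrator whenever it returns a non-empty list. Checking metadata usage alongside data usage matters because a thin pool can run out of either one.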
You may find it unusual that we were not keeping the forums up to date, as this is something we normally do during an outage. I usually handle the forum updates personally, but due to the critical nature of this outage I was focused on resolving the issue. In the future, if I am focused on restoring service and cannot post updates myself, I will bring in an additional staff member for the sole purpose of keeping customers updated on changes as they happen.
Outages such as this are extremely rare for us, and we apologize again for any trouble it may have caused you.