MDDHosting Forums

Major Outage - 09/21/2018 - 09/24/2018



While I was hoping to save some of this for the official RFO [Reason For Outage] - enough people are getting tremendously upset over this that I'm going to spell out what I can now - keeping in mind that I will provide more details when I can.

**What happened?**

First and foremost - this failure is not something that we planned on or expected. A server administrator, the most experienced administrator we have, made a big mistake. During some routine maintenance where they were supposed to perform a _file system trim_ they mistakenly performed a _block discard_.

**What does this mean?**

The server administrator essentially told our storage platform to drop all data rather than simply dropping data that had been marked as _deleted_ by our servers.
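
To make the difference concrete, here is a purely illustrative toy model (Python; this has nothing to do with how StorPool or the real tools work internally) of why a trim is safe and a discard is catastrophic:

```python
# Toy model only: a "device" as a dict of blocks. The real commands operate
# on actual block devices; this just shows the difference in scope.

device = {
    0: ("in-use", "filesystem metadata"),
    1: ("in-use", "customer data"),
    2: ("deleted", "old temp file"),    # already marked deleted by the filesystem
    3: ("in-use", "customer data"),
    4: ("deleted", "old log segment"),  # already marked deleted by the filesystem
}

def file_system_trim(dev):
    """Release only blocks the filesystem has already marked as deleted."""
    for block, (state, _) in list(dev.items()):
        if state == "deleted":
            dev[block] = ("released", None)

def block_discard(dev):
    """Release every block on the device, live data included."""
    for block in dev:
        dev[block] = ("released", None)

trimmed = dict(device)
file_system_trim(trimmed)   # blocks 2 and 4 released; customer data untouched

discarded = dict(device)
block_discard(discarded)    # every block released, including live customer data
```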

**Why is restoration taking so long?**

Initially we believed that only the primary operating system partition of the servers was damaged - so we worked to bring new machines online to connect to our storage to bring accounts back online. Had our initial belief been correct - we'd have been back online in a few hours at most.

As it turns out our local data was corrupted beyond repair - to the point that we could not even mount the file systems to attempt data recovery.

Normally we would rely on snapshots in our storage platform - simply mounting a snapshot from prior to the incident and booting the servers back up. It would have taken minutes - an hour at the most. We are not sure as of yet, and will need to investigate, but snapshots were disabled. I wish I could tell you why - and I wish I knew why - but we don't know yet and will have to look into it.

We are working to restore cPanel backups from our off-site backup server in Phoenix, Arizona. While you would think the distance and connectivity were the issue - the real issue is the amount of I/O that backup server has available to it. While it is a robust server with 24 drives - it can only read so much data so fast. As these are high-capacity spinning drives - they have limits on speed.

Our disaster recovery server is our **last resort** to restore client data and, as it stands, is the _only_ copy we have remaining of all client data - except that which has already been restored which is back to being stored in triplicate.

**What will you do to prevent this in the future?**

As we've been working on this and running into issues getting things back online quickly, we have been discussing what changes we need to make to ensure both that this doesn't happen again and that we can restore more quickly in the future should the need arise. I will go into more detail about this once we are back online.

**We are sorry - we don't want you to be offline any more than you do.**

Personally I'm not going to be getting any sleep until every customer affected by this is back online. I wish I could snap my fingers and have everybody back online or that I could go into the past and make a couple of _minor_ changes that would have prevented this. I do wish, now that this has happened, that there was a quick and easy solution.

I understand you're upset / mad / angry / frustrated. Believe me - I am sitting here listening to each and every one of you about how upset you are - I know you're upset and I am sorry. We're human - and we make mistakes. In this case **thankfully** we do have a last resort disaster recovery that we can pull data from. There are _many_ providers that, having faced this many failures - a perfect storm so to speak - would have simply lost your data entirely.

This is the **first** major outage we've had in over a decade and while this is definitely major - our servers are online and we are actively working as quickly as possible to get all accounts restored and back online. For clarity - the bottleneck here is not a staffing issue. We evaluated numerous options to speed up the process and unfortunately short of copying the data off to faster disks - which we did try - there's nothing we can do to speed this up. The process of copying the data off to faster disks was going to take just as long, if not longer, than the restoration process is taking on its own.

Once everybody is back online - and there are accounts coming online every minute - we will be performing a complete post-mortem on this and will be writing a clear and transparent Reason For Outage [RFO] which we will be making available to all clients.

I hope that you understand that while this restoration process is ongoing there really isn't much to report beyond, "Accounts are still being restored as quickly as possible." I wish there was some interesting update I could provide you like, "Suddenly things have sped up 100x!" but that's not the case.

I am personally doing my best to reach out to clients that have opened tickets so they are updated as to when their accounts are in the active restoration queue. While we do have thousands of accounts to restore - our disaster recovery system actually transfers data substantially faster with fewer simultaneous transfers. While it sounds counter-intuitive - we're actively watching the restoration processes and balancing the number of accounts being restored at once against the performance of the disaster recovery system to get as many people back online as quickly as possible.
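
For those curious what "balancing the number of accounts being restored at once" looks like in practice, here is a rough sketch of the idea (hypothetical code, not our actual tooling): nudge the number of simultaneous restores up while overall throughput improves, and back off once adding more streams starts to hurt.

```python
def next_concurrency(history, floor=1, ceiling=16):
    """Hypothetical helper: given (concurrency, observed MB/s) samples,
    suggest the next number of simultaneous restores to try."""
    current, current_rate = history[-1]
    if len(history) < 2:
        return min(current + 1, ceiling)
    previous, previous_rate = history[-2]
    improving = current_rate >= previous_rate
    went_up = current >= previous
    # Keep moving in the direction that helped; reverse if it hurt.
    step = 1 if improving == went_up else -1
    return max(floor, min(ceiling, current + step))

# Throughput dropped when going from 6 to 8 simultaneous restores,
# so the suggestion steps back down.
print(next_concurrency([(6, 180.0), (8, 150.0)]))  # -> 7
```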

Most sites are coming back online after restoration without issues, however, if your account has been restored and you are still having issues - we are here to help. While we are quite overwhelmed by tickets like, "WHY IS THIS NOT UP YET!?!?!" "WHY ARE YOU DOWN SO LONG!?!??!!" "FIX THIS NOWWWW!" - we are still trying to wade through all of that to help those that have come back online and are having issues - as few and far between as those have been.

If you have any questions - we will definitely answer them - but please understand that while we're restoring accounts we're really trying to focus on the restoration of services as well as resolving issues for those whose accounts have already been restored.

Again - I am sorry for the trouble this is causing you - we definitely don't want you offline any more than you do and will have all services restored as quickly as we can.


I know everybody wants an ETA. As I have no accurate way of giving an ETA I'm going to go ahead and say 96 hours.

 

Do I really expect this to take 96 hours? No, I don't - and I hope not - but I don't want to say 24 hours and have 10 people not restored yet get really upset about it. At the end of the day I want to be as accurate as I can be and I don't want to lie or mislead.

 

This situation is really bad for us - the worst thing we've faced since we founded this company in 2007.

 

We don't want you offline any more than you do - and we're restoring accounts as quickly as possible to restore service to all clients.


Here is the original text of our Server Status page:

 


Update 7

All of our attempts to speed this process up haven't been successful. As it stands we're restoring several dozen accounts per server all at the same time. If you're seeing a cPanel IP Error page and you're using our DNS/Nameservers - your site isn't restored yet. If you can't log into your cPanel with your cPanel username and password - your site isn't restored yet.

Nobody wants you back online more than us - I wish we could snap our fingers and everybody was online. Better yet if we could go back and avoid this issue entirely that would be great. As it stands - if you're not online yet - we're sorry and we're working on it. If you are back online now - if you have any issues please let us know.

====
Update 6

We are working on two issues right now:
1. The cPanel restoration process is not restoring MySQL data even though the data is in the backup and verified good. We're working with cPanel Emergency Support on this.
2. We are also working on getting a secondary copy of our backup data local to the backup server that we're going to spread across 3 servers with 10 GBPS links so that we can ideally cut the restoration time down to as low as, in theory, 2.5 hours (see the rough numbers just below). In reality I doubt we'll be able to saturate the connections due to cPanel restoration overhead - but we're doing our best to both get a second copy of the data just in case and to give ourselves more throughput for restorations.
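
For rough context on where that "in theory" figure comes from, here is the back-of-the-envelope math (raw line rate only, ignoring all cPanel and restore overhead):

```python
# Back-of-the-envelope only: raw aggregate line rate, no overhead.
links = 3
gbps_per_link = 10
aggregate_gbps = links * gbps_per_link            # 30 Gbps
tb_per_hour = aggregate_gbps / 8 * 3600 / 1000    # gigabits -> GB/s -> GB/hour -> TB/hour
print(f"{tb_per_hour:.1f} TB/hour at full line rate")      # ~13.5 TB/hour
print(f"{tb_per_hour * 2.5:.1f} TB movable in 2.5 hours")  # ~33.8 TB - hence the theoretical floor
```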

Ideally we'll have everything back online within 10 to 20 hours at the most - but this is still very much a work in progress.

====
Update 5

It looks like our luck is against us. The damage to the servers was more extensive than we originally believed. We have the servers in a state of being ready to restore client data, but when we attempted to mount the current data it could not be mounted and isn't workable.

We still have an administrator working on recovering this local data, however, it is looking like we're going to have to restore all client data from our latest backups. We're not any more happy about this than any of our clients. At this point it looks like it could take up to 30 or 40 hours to restore this data.

Once everything is back online we are going to be doing a complete overhaul on our backup system so that a full restore such as this will take only 4 to 6 hours. There are some bottlenecks in our current backup system that are going to keep things from going quickly. We could make some changes to the system now in an attempt to speed things up but as the backups we have now are the last remaining copy of the data we aren't going to be taking any chances. As it stands while we hate to be offline for an extended period - the risk of migrating our backup drives to another server is too great.

====
Update 4

We are performing final updates to the servers and software in preparation for bringing accounts back online. There is the possibility that we may have missed some settings / software and there is some software that isn't a priority, like Softaculous, that we'll focus on once services are restored.

We will be getting started on bringing accounts back online shortly. We're also working to try and make sure that account IP addresses do not change. While we can't make any promises - we're doing our best.

====
Update 3

All servers are online - we're working on configuring all software - cPanel, LiteSpeed, MySQL, etc. - to get them ready to bring accounts back online.

====
Update 2

We are working to bring online the servers that host all clients. This is primarily a function of provisioning the guests on the hosts and configuring them to be ready to accept accounts. Once this is done we will be conducting cPanel restorations to the servers of just the cPanel and MySQL data and re-connecting the accounts to the home directory data from before the outage.

We do have current and up-to-date MySQL data from before the outage, however, we're going to be restoring MySQL data from our backups taken last night and then will be working to restore any MySQL data lost between then and the outage with clients on a one-on-one basis as needed.

We hope to have everything back online tonight, however, we will keep you updated as we progress with the disaster recovery.

Once we are fully back online we will be providing a complete Reason For Outage [RFO] to all affected clients. This RFO will outline what happened, why it happened, what we're changing to prevent it from happening again, and what we're changing so that should anything like this happen again we can recover from it substantially faster.

Our goal is always to be as open and transparent as we can be - and we will continue this. Right now we're focused on restoring services and will provide more details once services are back online.

====
Update 1

We are having to bring new servers online to restore services. We are working hard to get this done as quickly as possible and are going to do our best to do this with as little disruption as possible.

No customer data has been damaged or lost - only system-level data. We will provide full details as to what caused this outage, what we did to resolve it, and what we're going to do to prevent it from happening again as soon as we have the chance.

====
Initial Message

We are experiencing a major outage across all services at this time. We are aware of the issue and are working to restore services as quickly as possible.

We will provide more detail when we can, however, we are focused on restoring services and diverting all energy to those tasks presently.


We are still way further behind on support tickets than we'd like to be - we're doing our best to give an individual response to each and every ticket, however, that takes time. If you have a ticket that hasn't yet been addressed - we aren't ignoring you and will get to you as soon as we can.


One thing I can promise - is that even if this exact circumstance were to happen again - it won't take us offline for more than an hour or two at most. There are systems that should have protected us from this that were either disengaged, and shouldn't have been, or were simply not set up properly.

 

We do perform disaster recovery testing and practice, however, this particular chain of circumstances is simply something we didn't conceive of. We try to plan for everything - but we're human.

 

That said - some of the changes will be in software configuration, hardware configuration, and some in simple policy.

 

I wish I had the time to go into detail right now - but I'm working at a feverish pace on zero sleep to keep up with support tickets. I don't want people that reach out to us with questions or for help to wait longer than necessary even if they're not being nice about it.

 

To all of those of you that have been understanding - I very much appreciate your understanding and patience. I'm not going to offer any excuses - but I will explain when I have more time what happened in more detail and what we've done along the way during this process to speed things up.


I want to provide a side update. We have been able to locate the cause of sites redirecting unexpectedly to other sites. We have a script that runs to ensure that every IP has a default site attached that is not a customer's. Due to an unexpected formatting change in the output of ifconfig, this script was no longer getting the IPs and was leaving that section blank. We have corrected the file and the redirection failures should permanently cease.
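
For those wondering what "no longer getting the IPs" means: the script scraped the text output of ifconfig, and that output format changed. A sketch of a more format-stable approach (illustrative only - this is not the actual script - and it assumes a Linux box with iproute2 installed) would parse `ip -o -4 addr show` instead:

```python
import subprocess

def local_ipv4_addresses():
    """Illustrative only: list local IPv4 addresses by parsing the
    one-line-per-address output of `ip -o -4 addr show` (iproute2),
    which is much more stable across versions than ifconfig's output."""
    output = subprocess.check_output(["ip", "-o", "-4", "addr", "show"], text=True)
    addresses = []
    for line in output.splitlines():
        fields = line.split()
        # fields look like: index, interface, "inet", "192.0.2.10/24", ...
        if len(fields) >= 4 and fields[2] == "inet":
            addresses.append(fields[3].split("/")[0])
    return addresses

if __name__ == "__main__":
    print(local_ipv4_addresses())
```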


We are working as quickly as we can.

 

The current plan is to move all the data off the server that was underperforming in the restoration, using a direct copy method that is performing much faster, onto an SSD-based server with a 10G NIC. I am starting with the S1 server and, as it is 3 TB of data, it will take some time to complete.


This letter is being sent to all affected clients. I'm posting it here as well for everyone to see and in case anybody does not receive the email for any reason.

 

Hello,

 

First and foremost I want to apologize again for any issues caused to you by this extended outage. If you are still offline we are working to restore your services as quickly as possible. I am also sorry if we are not responding to your support ticket as quickly as we normally would. This outage has been an absolute nightmare and something that I honestly never envisioned would happen even though we've always done our best to plan and to try to expect the unexpected.

 

I do understand that downtime is unacceptable and that not being able to recover from a disaster quickly is not acceptable either.

 

We are a small company and we have never pretended otherwise, and we have always been proud of the services and support that we've given to our clients. I founded this company in 2007 after being personally frustrated by how hard it was to find a hosting provider that didn't ignore you, give you copy-and-paste answers, or provide services that were unreliable or offline more often than not.

 

I will be honest in that I have had several companies over the years attempt to buy us out and the offers were always good. The reason I've never sold this company is because I know that if I hand this company and our clients over to somebody else - the quality of service and support will not be maintained. I have been in this industry long enough and seen enough sales to know what happens to the clients of a company that gets sold and that's not something that I want to see happen.

 

My personal goal is that when my two sons, presently 3 and 6, are older, one of them will want to work with me at this company and that we will always remain a family business even when I may not be directly involved in all day-to-day operations. We're not a huge corporation or owned by one and we would very much like to stay that way.

 

Since 2007 we have had a pretty solid track record. We did experience a 72-hour outage in 2008 due to the data center our server [yes, one] was in catching fire and experiencing an explosion. Even then - when we were so small - that was a stressful and exhausting experience. I still remember how helpless I felt to resolve the issue and to do anything for our clients.

 

I have always made sure that we had backups of client data and, in many cases, backups of our backups. I have seen over the years that issues can and will happen and that it's only a matter of time. Google and AWS have both had issues and they invest millions, if not billions, into making sure they have no downtime. We don't have anywhere near their budget, but we have always invested in reliability - while never being foolish enough to believe that we were completely isolated from downtime and unexpected issues.

 

We moved to our new StorPool-powered storage cluster last year and the platform has been absolutely amazing. This outage is in no way related to the platform, and there is a major feature of the platform that, had we been using it properly, would have allowed us to recover from this incident extremely quickly: 'snapshotting'. The honest truth of the matter is that I knew that snapshotting was available and I thought that we were making use of it as a first line of protection against major outages and issues when it comes to data. I do not personally manage the storage platform as that is a bit outside my skill set, but we do have an administrator that manages the storage, and StorPool also oversees the storage and has always been there within minutes to help us when we need it.

 

As a part of managing our storage cluster we do have to keep tabs on the total amount of storage available on the cluster as well as the total free space. On Friday afternoon we received an alert from StorPool monitoring that we were getting low on SSD space so I reached out to the administrator that handles our storage and asked him to make sure that our servers were doing what is called "discarding". On distributed storage like this when a server deletes data it doesn't actually physically overwrite the data but sends a command to the storage platform letting it know that the block of data is no longer in use. This generally happens automatically but there are some situations where it won't happen automatically and we do have to issue a manual file system trim.

 

The administrator that handles our storage, at my request, began to look over the cluster to make sure that everything was good to go. I had discussed that we needed to add more solid state disks to the cluster to increase our capacity but did want him to run a manual file system trim to make sure we weren't wasting any space. Keep in mind that a trim simply makes sure that the operating system running your server has communicated all deleted blocks to the storage platform. The administrator performing this work intended to run "fstrim" [file system trim] to release any blocks no longer in use but actually ran "blkdiscard" [block discard]. This is not an easy mistake to make, as these commands are entirely different and perform different tasks. A block discard, by default, discards all blocks on a device regardless of whether they hold important data, file system data, or anything else.
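
To give a sense of how simple a guard rail against this could be, here is a hypothetical pre-flight check (a sketch only, not something we had in place at the time) that refuses a destructive discard when the target device is mounted or still carries a filesystem signature:

```python
import subprocess
import sys

def device_looks_in_use(device):
    """Hypothetical pre-flight check before any destructive discard."""
    # Is the device currently mounted anywhere?
    with open("/proc/mounts") as mounts:
        mounted = any(line.split()[0] == device for line in mounts)
    # Does blkid still see a filesystem/partition signature on it?
    # (blkid exits non-zero when it finds nothing.)
    has_signature = subprocess.run(["blkid", device], capture_output=True).returncode == 0
    return mounted or has_signature

if __name__ == "__main__":
    dev = sys.argv[1]  # e.g. /dev/sdX
    if device_looks_in_use(dev):
        sys.exit(f"Refusing to discard {dev}: it appears to be in use.")
    print(f"{dev} looks empty; a discard would still need explicit confirmation.")
```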

 

The administrator very quickly realized the huge mistake he had made and did what he could to immediately stop the block discard and to preserve what data he could. I suppose you could say the incredible speed of our storage cluster worked against us in this case: the discard was able to drop enough data, even in a few seconds, to essentially corrupt everything we had stored. This is where snapshots would have saved us.

 

If we were using snapshots as I thought we were, all we'd have had to do is shut down all of the client servers, mount a snapshot from just prior to the block discard, and then boot the servers back up. There would have been a small amount of lost time/data due to rolling back to a prior snapshot but we would have been back online within minutes. Snapshotting simply keeps track of data changed on the storage platform from the time that the snapshot is taken until the time the changes are merged into that snapshot so that a new one can be created. In essence all changes after the snapshot can be discarded or ignored in an emergency to bring everything back to an earlier state without major impact.
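
As a purely conceptual illustration of what a snapshot buys you (a toy sketch in Python - this has nothing to do with StorPool's actual implementation, which is far more efficient):

```python
class ToyVolume:
    """Toy roll-back semantics: keep the pre-snapshot contents of any block
    that changes, so everything after the snapshot can be thrown away."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.preserved = None   # original contents of blocks changed since the snapshot

    def take_snapshot(self):
        self.preserved = {}

    def write(self, block, data):
        if self.preserved is not None and block not in self.preserved:
            self.preserved[block] = self.blocks.get(block)  # remember the old value once
        self.blocks[block] = data

    def roll_back(self):
        for block, old in self.preserved.items():
            self.blocks[block] = old
        self.preserved = {}

vol = ToyVolume({0: "customer data", 1: "customer data"})
vol.take_snapshot()
vol.write(0, None)   # the disaster: a live block wiped
vol.roll_back()      # minutes later the volume is back to its snapshot state
assert vol.blocks[0] == "customer data"
```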

 

I do not have an explanation as to why snapshots were not in use as this is the primary and first line of defense. It was my understanding that snapshots were in use and this is one of the big, and surprisingly simple, changes we are making moving forward. Having the ability to roll individual servers or the whole storage cluster back to a point just prior to a major incident will give us the power to recover from otherwise unrecoverable errors or issues quickly and efficiently with minimal impact.

 

The administrator that made this mistake has been working with me and doing his best to help us recover from this incident. At this moment he is actually taking a few hours off as he has gotten to the point of a total anxiety attack and has been working to hold himself together since the incident occurred. He's distraught because he knows that he could have avoided this mistake by simply sending the correct command to the system, and he doesn't have an explanation for the mistake beyond that he screwed up. This wasn't malicious in nature and, as sad as it is to say, it was human error. I do not know if it was carelessness or a simple lapse in clear thinking but regardless of why - we are where we are now.

 

I have been in this industry long enough both as a customer initially as well as a provider to see how important it is to have backups of your data. I have seen all too often providers that go completely out of business after losing all of their client data and clients that were attacked by malware or performed a bad update and didn't take a backup first. I've also seen hosting clients of other providers lose everything they've worked on for years or even decades due to not keeping their own backups and a provider losing their data. These are things I've always worked to make sure would not happen at our company.

 

For about a year we have had a backup server in a data center in Phoenix, Arizona. Customers have had access to this backup server to conduct their own restorations of their files, databases, email accounts, etc., and it has been a great convenience. The idea behind this backup server is that in an absolute worst-case scenario, such as the primary facility being destroyed by a natural disaster, we would have a safe copy of all client data that could be used to restore services in another facility. This server holds 14 copies of all client data and has been great when a client needed to restore something from a day, a week, or a couple of weeks ago.

 

Now here we stand in the biggest disaster my company has faced in over a decade. The longest outage and the most stressful situation - all due to a simple wrong command on a keyboard sent once a couple of days ago. Due to oversight on our part or sheer ignorance we are not able to simply mount a snapshot and recover within minutes. This is something that we are going to change as soon as we have some time to devote to making it happen.

 

We have been doing our best to keep up with all support tickets opened, tweets sent, posts on our forums, etc. I will be honest in saying that we are getting new support tickets opened at the rate of a few every few seconds - far faster than we can keep up with. I am sure that this adds to the frustration of the situation when you open a ticket and don't get the nearly immediate response you would normally get from our support staff. I am sure some of you feel like you've been ignored or like we're simply doing a poor job or ignoring the issue. I can assure you that although our support response times are far longer than normal it's not because we aren't doing our best. We haven't walked away, we aren't ignoring you, and we are taking the time to do our best to address each individual concern personally and not to send pre-defined replies whenever possible.

 

The backup server that we have in Phoenix is stuffed full of regular spinning hard drives to give us the capacity we need to hold a copy of all data. Due to the number of drives we needed for the capacity, we decided to use some compression to both reduce the total storage footprint needed and to speed the system up. The idea was that, since less data had to be read from or written to the disks thanks to compression, we could trade CPU time spent on compression for space on the storage. This has worked well over the past year for normal one-off restorations and nightly backups of everything.

 

When we set this server up we did get a copy of all data to it over the course of a few weeks. This isn't because we couldn't have done it faster but simply because we were trying to avoid using an inordinate amount of bandwidth. At our level we don't pay for data transfer by the amount transferred but by how much bandwidth we sustain and for how long. We can, for example, use 1,000 megabits per second every second for a month without paying any overages, but if we were to use 2,000 megabits for 3 days our bandwidth bill would double. We did perform testing on the server to make sure that it was capable of receiving, handling, and sending large amounts of data quickly. In short, we wanted to make sure that in the event that we needed to conduct a restoration it would perform up to our requirements.
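
This kind of billing is commonly known as 95th-percentile (burstable) billing. Assuming that is roughly what our upstream uses, here is a toy illustration of why three days at double the rate doubles the bill while a short burst would not:

```python
# Toy 95th-percentile billing: the top 5% of samples in a month are ignored
# and you are billed on the highest remaining sample.

def billable_mbps(samples):
    ordered = sorted(samples)
    return ordered[int(len(ordered) * 0.95) - 1]

days, samples_per_day = 30, 288               # one sample every 5 minutes
steady = [1000] * (days * samples_per_day)    # 1,000 Mbps all month long
print(billable_mbps(steady))                  # -> 1000

burst = [2000] * (3 * samples_per_day) + [1000] * ((days - 3) * samples_per_day)
print(billable_mbps(burst))                   # -> 2000: three days is ~10% of the
                                              #    month, more than the 5% ignored
```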

 

Over the last year we did not have cause to perform any large restorations or operations. I do not, as of right now, have an explanation as to why our backup server has gone from fast to slow. I do not know if it's some sort of fragmentation of the data stored, if it has to do with how our backup system performs incremental copies to keep multiple restoration points, or something else. I do wish I had some sort of explanation to provide but I haven't been able to figure one out and have not been able to find a good way to make it go faster. The bottleneck in this recovery is the disk storage in this backup server.

 

Initially when the mistake happened and the administrator realized he had discarded data he thought it had only been performed on the operating system device - that client data was safe and intact. Some of the servers were still online and serving requests although they were experiencing issues and would crash in short order. The initial belief was that we could reconfigure and reinstall our operating systems on our client machines and then simply reconnect client storage and, while we'd have a lot of tedious things to fix and address, we'd recover fairly quickly and with minimal to no client data loss.

 

It took 2 or 3 hours for us to reprovision all client machines, to install the operating systems, control panels, and all software and settings necessary for the servers to perform their jobs and to serve data for our clients. We do perform this from time to time on a new server as we need more capacity and to be honest I think we did a reasonable job of getting as many servers as we needed configured and online as quickly as we did. I thought that was going to be the end of the major outage and the beginning of days or weeks of fixing small issues and glitches here or there and I was terribly wrong.

 

We found that when we went to mount the client storage to the servers that the devices wouldn't mount, the file systems were damaged, or the data stored was corrupted to the point that we couldn't simply recover from it. I'm sure if we were a substantially larger company with a much larger budget it may have been possible to find some way to recover some of this data. One of the primary reasons I was glad, even during such an event, to think that we were going to be able to bring these disks online was to avoid any extended downtime and issues for our clients.

 

Unfortunately we were wrong and the data was not usable. This meant that we were now not only in a state of disaster but that the disaster was not going to be over nearly as quickly as we had originally thought. If we were restoring a single server it would have been done in a few hours and, while that would have not been fun for anybody, it wouldn't have taken very long. Unfortunately we are in a disaster scenario where we are having to restore all servers. No matter how we slice it - be it restoring all servers at once slower, or restoring one server at a time a bit quicker - the speed of overall recovery is about the same.

 

I know that many of our clients have been extremely frustrated with our inability to provide what is felt as a "simple ETA," and I wish that we were able to do so. One of the issues with this backup system that we are using is that the performance is extremely inconsistent. We may be able to restore data extremely quickly for a few minutes and then the data rate will drop to an extremely low rate for a while. We may restore 5 accounts in a few minutes and then it may take an hour to restore the next account in the queue. Another issue with this is that our backup and restoration software doesn't give us any insight into the process. We can't see backed up accounts by their size so that we can prioritize small accounts to restore as much service as quickly as possible. We can't see how fast the transfer of an account is going, how long it's taken, how long it has remaining, or anything but the account name itself and that it is in progress.

 

Every time I have tried to generate a realistic ETA, that ETA would change from minute to minute or hour to hour - and I have never been a fan of giving incorrect information. The truth is that I hadn't gotten any sleep since we went down until just a few hours ago. I have been sitting at my computer working on restoring services and helping clients from the moment the issue occurred until I was no longer able to perform my duties, and even then I kept going for many hours more doing my best. I am sorry if you feel we haven't communicated effectively due to delays in responding to support tickets or an inability to tell you when you would be back online or how long this issue is going to take to resolve.

 

Due to the issues with the backup server that we are restoring from we have decided to move in a new direction when it comes to restoring services. It is already clear that this is going to be an extended outage and that no matter how much we want things to be back online quickly that we are limited by this backup server and its capabilities. One of the larger issues is that we can perform a single stream or data copy fairly quickly but the second we need to do 2, or 3, or 4, or more at the same time they all slow to a crawl and come nowhere near the capacity of a single data copy.

 

When we are restoring using our backup software there is a data stream for each account on each server. We have 12 client servers we need to restore and even if we go one account at a time we're looking at a minimum of 12 data streams. Normally in such an instance we would be restoring 4 to 16 accounts per server to get things restored as quickly as possible. Normally we would expect to saturate a 1 GBPS or 2 GBPS networking link, and as soon as we determined that we needed to restore from this system we got in touch with the facility hosting the backup server and requested they swap in 10 GBPS networking. We really did believe that we were going to saturate the 1 GBPS link in the server during these restorations due to the number of streams of data we were going to need to support.

 

Unfortunately we found that we were not even able to use 25% of 1 GBPS much less to come anywhere near saturating 1 GBPS or even touching 10 GBPS. Some have asked us why we didn't relocate the backup server closer to where we are conducting the restorations - such as flying it over. The simple answer is that it's not the connectivity between them that is the problem and causing delays. Even if the backup server were sitting right next to the servers being restored the data transfer rate issue would still exist.

 

Due to the fact that there is no quick recovery from this and that we get much faster single-stream throughput than multi-stream throughput, we are replicating the whole-server backups off of this storage and onto solid state RAID arrays one at a time. We are getting between 1 GBPS and 4 GBPS data transfer rates on this single stream, which is many times more than we have been getting trying to restore directly from the backup server to the client servers. The downside is that this means there is an intermediary step that is going to take time before we can actually perform any restorations. For example, the transfer that is running right now is going to take an approximate total of 6 hours to run and is about 3.5 hours into that run.
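
For a sense of scale, the arithmetic behind those numbers (rough, and assuming nothing beyond the figures already mentioned) works out like this:

```python
# Rough arithmetic only: how much data a single 1-4 Gbps stream moves.
for gbps in (1, 4):
    gb_per_hour = gbps / 8 * 3600                # GB moved per hour at that rate
    print(f"{gbps} Gbps ≈ {gb_per_hour:,.0f} GB/hour "
          f"≈ {gb_per_hour * 6 / 1000:.1f} TB over a 6-hour copy")
# 1 Gbps ≈ 450 GB/hour ≈ 2.7 TB in 6 hours
# 4 Gbps ≈ 1,800 GB/hour ≈ 10.8 TB in 6 hours
# The ~3 TB S1 backup mentioned earlier fits that window at the lower end of the range.
```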

 

Once each stream copy of a server's backups is done we are going to remove the solid state array from the backup server and place it into a new chassis with 10 GBPS networking to conduct restorations from Phoenix to Denver as quickly as possible. For comparison, the restoration of our S1 server via the normal restoration process from the backup server could take days or even a week, whereas this process we believe will allow us to restore the server as a whole within hours.

 

The only real upside to the slower restoration process of restoring accounts directly from the backup server is that, as each account is finished, that account comes online. So even if it were to take days - there would be accounts online and recovered within the first hour, first day, etc. With the process we're running now - during this 6 hour copy - nobody new is going to be coming online. The upside to the route we've chosen to take is that once this 6 hour copy is done client accounts from that backup are going to be coming online substantially faster with a shorter overall downtime for everybody.

 

I am personally extremely sorry that this has happened and that you have experienced these issues. I am doing my absolute best to hold everything together and to restore services. Once we are recovered from this incident there are a great many changes that we are going to be making to our backup system as well as to our overall policies and procedures concerning data security and recovery. The most immediate change is going to be that we will be using snapshots on our storage platform to protect individual servers as well as the whole network from major catastrophe. We also plan on having a secondary redundant copy of these snapshots/our data local to the storage cluster.

 

Data storage isn't cheap and the storage platform we are running is expensive before you take into account that we store three copies of every piece of data in our storage platform to protect against drive and server failure. We could lose more than half of our storage servers or drives and our storage cluster would remain online and operational. This means to store 20 TB of data we need more than 60 TB of actual storage - tripling the cost of the storage itself. Another factor that makes this so expensive is that we rely on enterprise class storage and not your standard consumer grade off-the-shelf storage. Even for all of this additional cost we are going to bring online a storage platform capable of holding a local copy of our data as an extra layer of protection beyond the snapshots we will be performing.

 

We will also be overhauling and replacing our off-site disaster recovery backup systems. We will most likely move to having several smaller servers with high performance storage - each handling one or two client servers - rather than one big behemoth of a backup server. Although we are planning and putting steps in place to ensure we never get to the point of needing to do a disaster recovery like this again it would be remiss for us not to plan for it regardless. Should we ever have to perform disaster recovery from our off-site location in the future this setup will allow us to sustain numerous high speed restorations without any one specific bottleneck.

 

The downside to this is that all of it, short of the snapshots we're enabling, is going to be extremely expensive. Now is obviously not going to be a good time for us to be spending money, as I know that there are going to be a lot of our customers that have lost faith in us over this incident. I know that we're going to have cancellations and more than likely we're going to lose a lot of sales and new revenue over this incident. It is unfortunate that the time when we really need the revenue to invest in making sure we are protected from a major incident like this in the future is also the time when we are likely going to suffer the most when it comes to revenue.

 

If you are losing money, sales, clients, or anything else due to this outage I am extremely sorry. I want nothing more than to restore all services and to get everybody back online. The truth is that I didn't start this company to get rich and I don't do it for the money but because I genuinely enjoy providing a solid service and quality support. I enjoy providing a hosting platform that our clients are happy with and providing support that is above and beyond what other providers can or are willing to provide. While I know for a fact that we will get all data restored and that it's only a matter of time - I do hope that you have not lost hope in us and that you understand that we realize we majorly screwed up. We screwed up on more than one level in this situation and we know it, and we are going to be making changes to protect you as well as us from any further incidents like this.

 

We have been in business since 2007 and other than the data center fire early on we have had no major issues or incidents. We've had a server issue here or there but have always been able to recover within, at most, a few hours. We've been in business almost 12 years or 4,250 days with exceptional uptime, reliability, and support. I do understand that we have been down, at this point, for nearly 38 hours. I hope that you can see that while this is a long time that in the grand scheme of things we've been reliable and that we are going to do everything in our power to make sure that we stay reliable and online moving forward.

 

We have about 20 TB of total data to restore and it's going to take us about 2 hours per terabyte of data to copy the data off of our disaster recovery backup server in Phoenix onto the solid state storage arrays. As we do not have to copy the entire 20 TB of data before we can begin restoring from the solid state storage, we will be beginning high-speed restoration of services as soon as 2 hours from now. As soon as a server's data is copied over we will begin its restorations.
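
In plain numbers (the same figures as above, nothing new):

```python
total_tb = 20        # total client data to restore
hours_per_tb = 2     # observed copy rate off the Phoenix backup server
print(total_tb * hours_per_tb, "hours of copying end to end")  # 40 hours in total
# 1 TB every 2 hours is roughly 1e12 / 7200 bytes/s ≈ 139 MB/s ≈ 1.1 Gbps, at the
# low end of the single-stream rates mentioned earlier. Because restorations start
# per server as soon as that server's copy finishes, the first accounts come back
# well before the full copy is done.
```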

 

We will be performing these copies in the order that the machines were provisioned. Ultimately there is no way for us to copy all machines at once, or we would do that, so we had to decide how we were going to handle this. As we have added machines as we've added new customers we want to restore service to our oldest clients that have been with us the longest first. Please don't misunderstand this to say that we don't wish that we could restore data for everybody all at once or that we value older clients more than newer ones. We simply had to choose a way to decide which servers got restored first and no matter how we go about it there are going to be clients that experience longer outages than others.

 

As we go through this process I will be providing status updates on our forums at https://forums.mddhosting.com/topic/1582-major-outage-092118-09222018/. I will also be answering every support ticket that I can personally, responding on Twitter and Facebook. I will send an email to you if there is something important we need to share where we can't rely on you checking other means of communication like our forums - but most centralized communication is going to happen on our forums on that thread.

 

As soon as we begin conducting a restoration of the data from the higher speed storage we're copying our backups to I will do my best to provide an accurate ETA for the restoration of services on a server-by-server basis. Until we begin that restoration process anything I provided would be nothing more than a guess and I don't want to mislead by providing inaccurate information. I know that it has been frustrating not to have an ETA and as soon as I am able to provide one you will have it.

 

Any replies to this email will come directly into the management department which I am handling entirely on my own. Please keep in mind that I am also handling regular tickets so I will do my absolute best to respond to you directly as I get a chance. There is a good chance that over the next several days I am going to be so overwhelmed with support tickets, email, social media, and the like that my responses may come with a fair bit of delay but I will get back to you as soon as I can.

 

Sorry for such a long message as I wanted to make sure that I was detailed and that I was able to give a clear picture of where we stood, why we stood there, and what we are doing to get back where we need to be.

 

I am sorry that we made mistakes that landed us in this situation and that we have dropped the ball and screwed up. I hope that you can give us the chance to recover from this and to continue providing you the solid and reliable service and support that I know we can provide and that we strive for every minute of every day.

 

I am sorry, I really am. I am going to make sure that it is made right even if it takes longer than anybody would like and I will be personally available as much as I can be both until we are fully back online as well as for an extended period after for any feedback, questions, comments, or concerns you may have.

 

I hope that you are able to give us the chance to continue hosting your account once we are able to recover from this.

 

Thank you for taking the time to read my message.

 

Sincerely,

 

Michael Denney

MDDHosting LLC


Sorry guys, I can't understand one thing: why can't MDD put up a notice saying the site is down due to an MDDHosting failure? Everyone thinks my site was hacked or that my webmaster did something wrong, and my phone is ringing non-stop.

If your site is not restored you should see a cPanel error page.

 

That said, I will see about customizing that error page to say something along the lines of, "This site will be back as soon as possible."


If your account has not yet been restored and you want us to create a new/empty copy for you so you can put up a 'we'll be back' site or you want to recreate your email accounts let us know in a ticket.

 

The only caveat is that when we go to restore your account - we will restore our backup of your account over what you put in this new/empty account. In the event you don't want that to happen you'll need to let us know.


Also, if you have backups of your own and want an account to restore to: at your request we can skip overwriting your account and simply provide you with a backup of the data instead. If you have a full cPanel backup of any account of yours we can restore it for you to get you back online now.


The copy of the backup for the first server, S1, is done - we are now going to be restoring that data onto the much improved backup system and beginning restores right to the server. As this is from SSD to SSD to SSD it should all be very fast but I will keep you informed.


I and many of my clients are on s1.

 

The cPanel error page has gone, but I still can't get to the sites. I now get a different error page saying my browser can't establish a secure connection. I tried http: instead of https: but it made no difference.

 

When I try to open cPanel, I'm told my login is invalid.

 

I'm hoping these are just symptoms of the recovery process, and that once the sites are fully recovered I'll have access again.

 

Of course I'm impatient like everyone else, but like many of the members above, I've been nothing but delighted with MDD over the years and have no intention of changing hosts.

 

Mark

We changed the text of the error page at the request of quite a few clients so that their visitors would know that they weren't hacked and will come back as soon as possible.


The backup of S1 is copying to our 24 Disk SSD Array from the 4-Disk array and once this is done we will begin restorations of the S1 server - it should go very quickly.

 

We are presently copying the backup for P1 from the old backup server to another set of 4 SSDs.

