

MikeDVB

Member Since 27 Sep 2008

#7158 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 24 September 2018 - 11:16 AM


Is there a reason why the restoration sequence was changed? S2 was moved to after P2, in contrast to the previous announcement. It's not a major difference in time, but nevertheless that's the kind of detail that is a bit irritating in a sensitive situation like this, where people are losing money and/or customers.

We are simply doing things as quickly as we can.  We had a short gap in our ability to move the disks from Slow Backup to Fast Backup, and during that gap we went ahead and copied the smallest server, which is S2.  We did this rather than just sitting around doing nothing.

 

If P2 was the smallest server it still would have been copied at that point.

 

This is one of the reasons the ETAs are only estimates - we are doing our best to predict how long it will take to copy and restore each server's backups, and this changes based upon the actual data being copied.  For example, a server with 1TB of data usage and 150,000,000 files is going to take a LOT longer to copy than a server with 4TB of data and 25,000,000 files.
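To make the file-count effect concrete, here is a rough back-of-the-envelope sketch.  Every number in it (throughput, per-file overhead) is an assumed placeholder for illustration only, not an actual figure from our systems:

```python
# Rough, illustrative estimate of why file count can dominate copy time.
# All numbers here are assumptions for illustration, not actual figures.

def copy_time_hours(total_tb, file_count,
                    throughput_mb_s=400.0,       # assumed sustained copy throughput
                    per_file_overhead_ms=2.0):   # assumed metadata cost per file
    bulk_seconds = (total_tb * 1024 * 1024) / throughput_mb_s
    overhead_seconds = file_count * (per_file_overhead_ms / 1000.0)
    return (bulk_seconds + overhead_seconds) / 3600.0

# 1 TB spread across 150,000,000 small files...
print(round(copy_time_hours(1, 150_000_000), 1), "hours")   # ~84 hours
# ...versus 4 TB in only 25,000,000 files.
print(round(copy_time_hours(4, 25_000_000), 1), "hours")    # ~17 hours
```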

 

If you have issues with your site or account after the restoration you will need to open a support ticket.  While I wish we could keep up with individual issues here on these forums, it's not feasible.  We have extra staff working on the helpdesk and we're doing our absolute best to keep up given the ticket load.

 

If your site is restored and you are seeing a cPanel error page, there is a good chance your account is not on the same IP as before and you're using third-party DNS.  If you log into your cPanel you can see your new site IP in the status bar or under 'Server Information'.  You'll want to update this at your DNS provider.  Originally we planned on trying to make sure everybody was put back on their original IPs, but the work to do that would have doubled the restoration time or more.
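If you want a quick way to check whether your third-party DNS is still pointing at the old IP, a small sketch like the following will do it.  The domain and the "new IP" value below are placeholders - substitute your own domain and the IP shown in your cPanel:

```python
# Quick check of where third-party DNS currently points a domain.
# "example.com" and the new-IP value are placeholders -- use your own domain
# and the IP shown in cPanel under 'Server Information'.
import socket

domain = "example.com"
new_ip_from_cpanel = "203.0.113.45"   # hypothetical; copy yours from cPanel

# Resolves via your local resolver, so cached answers may lag behind changes.
resolved_ip = socket.gethostbyname(domain)
print(f"{domain} currently resolves to {resolved_ip}")

if resolved_ip != new_ip_from_cpanel:
    print("DNS still points at the old IP -- update the A record at your DNS provider.")
else:
    print("DNS already points at the new IP; allow time for caches/TTL to expire.")
```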


  • 1


#7050 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 23 September 2018 - 06:58 PM

The restorations from the new backup system to the S1 server are going about 1000% faster than from the old backup server, by my estimation.  I'm still trying to work out a good ETA - I need to wait on some more accounts to restore before I have a data set big enough to make that estimate.


  • 2


#7046 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 23 September 2018 - 06:49 PM

The CNAME option should work even if your account's IP changes, I believe - but it's been a while since I've messed with it, as most clients use our nameservers.


  • 1


#7021 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 23 September 2018 - 03:22 PM

The S1 backup image is still copying to the 24-drive SSD array for restoration.  The P1 backup image is being created on the 4-disk SSD array in preparation for copying to the 24-drive system.


  • 1


#7001 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 23 September 2018 - 12:57 PM

The default page that is shown when your site doesn't load should be better now.


  • 2


#6997 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 23 September 2018 - 12:51 PM

 

 

Can you edit the "cgi-sys/defaultwebpage.cgi" file?  Since it is already there, there's no need to kill/create/overwrite anything.

Already working on that.


  • 1


#6987 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 23 September 2018 - 10:42 AM

The copy of the backup for the first server, S1, is done - we are now going to restore that data onto the much-improved backup system and begin restores directly to the server.  As this is SSD to SSD to SSD it should all be very fast, but I will keep you informed.


  • 1


#6980 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 23 September 2018 - 08:58 AM

Also, if you have backups of your own and want an account to restore them to, we can, at your request, skip overwriting your account from our backup and simply provide you with a copy of that data instead.  If you have a full cPanel backup of any account of yours we can restore it for you to get you back online now.


  • 1


#6923 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 22 September 2018 - 11:57 AM

I can't speak for everyone, but while I'm not too happy about this, I can understand and empathize with your side of this situation.

 

That said, I do have one question:  How recent is the data backup on your "last resort disaster recovery" server(s)?

Most are from about 14 hours before the outage; one or two servers are from 48 hours or so before.


  • 1


#6781 IP on S5 Server Null-Routed by Up-Stream due to DDoS

Posted by MikeDVB on 31 March 2018 - 10:04 PM

If you are using CloudFlare and you are on the affected IP, you will need to update the IP for your domain(s) at CloudFlare.

 

I do wish we could email only those affected and let them know - but there is no way for us to email only those affected without it being entirely manual.  While it is a small percentage of accounts on the server, it's enough accounts that notifying them one by one would take a fairly substantial amount of time.

 

If you aren't subscribed to this section of the forums, I do suggest it -> https://forums.mddho...ues-and-events/


  • 1


#6696 Degraded performance - P1, S1, S4

Posted by MikeDVB on 19 November 2017 - 04:13 PM

We have determined the root cause of the issues experienced last night/this morning.

 

 

Why did it happen?

This was the result of a combination of unexpected issues - either of which, had it happened on its own, would not have caused any downtime or disruption.

 

At approximately 7:28 PM ET we had a drive fail in one of our storage servers.  Our storage platform is designed to handle drive failures gracefully and drops the failed drive from the storage cluster.  We maintain 3 copies of all data, on 3 distinct drives in 3 distinct systems.  The result is that when a drive fails we can recreate the third copy of the missing data onto a new drive from the other 2 copies that remain.  This is generally a seamless process.
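As an illustration only - this is a toy sketch of the 3-replica idea, not the actual code of our storage platform - re-replication after a drive failure looks roughly like this:

```python
# Toy sketch of the 3-replica design described above -- not the storage
# platform's implementation, just an illustration of healing after a failure.
from collections import defaultdict

REPLICAS = 3

def place_replicas(chunks, drives):
    """Assign each data chunk to REPLICAS distinct drives."""
    placement = defaultdict(set)
    for i, chunk in enumerate(chunks):
        for r in range(REPLICAS):
            placement[chunk].add(drives[(i + r) % len(drives)])
    return placement

def heal_after_failure(placement, failed_drive, healthy_drives):
    """Re-create the missing copy of any chunk that lived on the failed drive."""
    for chunk, chunk_drives in placement.items():
        if failed_drive in chunk_drives:
            chunk_drives.discard(failed_drive)
            # Copy from one of the 2 surviving replicas onto a new drive.
            replacement = next(d for d in healthy_drives if d not in chunk_drives)
            chunk_drives.add(replacement)
    return placement

drives = [f"drive-{n}" for n in range(6)]
placement = place_replicas(["chunk-a", "chunk-b", "chunk-c"], drives)
placement = heal_after_failure(placement, "drive-0",
                               [d for d in drives if d != "drive-0"])
print({chunk: sorted(d) for chunk, d in placement.items()})
```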

 

The first issue was with how the RAID controller in this specific storage server handled the drive failure.  In this case, when the drive failed, the RAID controller handling that drive disabled write caching on all drives in the system.  This is unexpected behavior and not something we or our storage vendor had seen before.  The result was increased write latency.  This alone would not have created downtime or issues.

 

The second, compounding issue was that we did not have LiteSpeed configured to write logs asynchronously with AIO - meaning entries are written to RAM and then flushed to disk as the system is able.  This would have given us a buffer to absorb the delayed writes.  Because LiteSpeed is an event-driven web server, without AIO enabled for logging it would get stuck waiting to write log entries out and would fail to serve all other requests while it was waiting.  This would last a couple of seconds at a time, which was long enough for the system to see LiteSpeed as down and issue a restart.
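To show the concept (and only the concept - this is a minimal Python sketch, not LiteSpeed's implementation), the difference between blocking on the disk and buffering log writes in RAM looks roughly like this:

```python
# Minimal sketch of buffered (AIO-style) log writes -- not LiteSpeed's code,
# just the idea: the request-handling path never waits on a slow disk.
import queue
import threading
import time

class AsyncLogWriter:
    """Queue log lines in RAM and flush them from a background thread."""

    def __init__(self, path):
        self._queue = queue.Queue()
        self._file = open(path, "a", buffering=1)
        threading.Thread(target=self._drain, daemon=True).start()

    def log(self, line):
        # Returns immediately even if the disk is slow.
        self._queue.put(line)

    def _drain(self):
        while True:
            self._file.write(self._queue.get() + "\n")

writer = AsyncLogWriter("/tmp/access.log")
writer.log("GET /index.html 200")   # request handling continues without waiting
time.sleep(0.1)                     # give the background thread a moment to flush
```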

 

LiteSpeed writes many thousands of log entries per minute, and S1, P1, and S4 were all using storage that had 1/3 of its redundant data on the storage server that had unexpectedly lost its write cache.  This means that out of many thousands of writes per minute - on occasion - the latency to write would be high enough that LiteSpeed would be seen as stuck and would get restarted, in some cases many times per minute.

 

The end result is that LiteSpeed would go offline for 10 to 30 seconds seemingly randomly.  P1 was affected the most and was offline for 30 to 60 seconds every few minutes while S1 was affected the least and was mostly online.  S4 was affected more than S1 but nowhere near as much as P1.

 

What are we doing to prevent a similar issue from occurring in the future?

 

  1. We have configured additional monitoring on our storage cluster to detect higher-than-normal write latency so that we can intervene quickly.  Now that we are aware of the potential issue with the write cache, we can proactively check for and resolve it in the event of a drive failure to avoid unexpectedly high write latency (a rough sketch of such a latency probe is below).
  2. We have reconfigured LiteSpeed to use AIO log writing so that, should we ever experience higher-than-normal write latency in the future, the impact should be minimal if not invisible to end users.
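For item 1, a write-latency probe can be as simple as timing a synced write and alerting on a threshold.  This is a hypothetical sketch - the path and threshold are assumptions, and it is not our actual monitoring stack:

```python
# Hypothetical write-latency probe, in the spirit of the extra monitoring
# described in item 1 above -- path and threshold are assumed values.
import os
import time

THRESHOLD_MS = 50.0   # assumed alert threshold

def probe_write_latency(path="/var/tmp/latency-probe"):
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(b"x" * 4096)
        f.flush()
        os.fsync(f.fileno())      # force the write down to the storage layer
    return (time.monotonic() - start) * 1000.0

latency_ms = probe_write_latency()
if latency_ms > THRESHOLD_MS:
    print(f"ALERT: write latency {latency_ms:.1f} ms exceeds {THRESHOLD_MS} ms")
else:
    print(f"write latency OK: {latency_ms:.1f} ms")
```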

 

Should you have any questions about any of this please let us know!  We apologize for any trouble this may have caused you.


  • 1


#6646 Results of the NimbleStorage CS500 -> StorPool Distributed Storage Platfo...

Posted by MikeDVB on 23 September 2017 - 11:31 PM

These forums are running on the P1 server - and you may notice they're significantly faster now that we've migrated P1 to the new storage platform :).


  • 1


#6640 S4 Server Migration - Friday, September 22nd @ 10 PM ET [GMT-4]

Posted by MikeDVB on 22 September 2017 - 09:40 PM

The server is back online on the new storage platform and we're keeping an eye on it :).


  • 1


#6539 Infrastructure Upgrades - Storage and Servers

Posted by MikeDVB on 23 August 2017 - 11:18 AM

Hello!
 

We have been working hard for about the last year to plan and execute some pretty huge upgrades for all of our customers.  For a couple of years now we've been using a Nimble CS500 SSD Accelerated Storage Area Network to provide highly available storage to our servers.  This worked great initially but as we've grown we've run into some issues here or there with the storage platform not meeting our expectations.

 

We are moving all servers to a StorPool Highly Available, Distributed, Self-Healing SSD Storage Area Network.  Our testing has shown that not only do we get a more than 900% increase in I/O operations per second, but we are also increasing the total bandwidth available to our storage platform by 1,200%.  While our Nimble Storage platform was SSD-cached - so data read from it may or may not have come from SSDs - all data read from our StorPool cluster will come from SSDs.

 

The end result of these migrations will be a faster and more consistent hosting experience across all of our hosting platforms - Shared, Reseller, Premium.  All VPS are already migrated to the new storage platform and all new orders for Shared, Reseller, and Premium hosting after August 15, 2017 are already on the new platform as well.

 

We will be migrating all servers over to the new storage platform on or before September 15, 2017 and we have already begun copying data in the background from the old platform to the new.  As a part of this migration we will be scheduling some downtime and we will post about it here as well as emailing all clients directly with details.

 

As we will be scheduling and announcing the downtime within maintenance windows, and making you aware of those windows, we have only a limited ability to migrate accounts manually prior to the maintenance window.  Please understand that, while nobody wants downtime, we cannot possibly migrate everybody individually prior to the primary migrations; we will do our best in this regard.

 

If you have any questions or concerns about any of this please feel free to either reply to this thread or to open a support ticket directly.


  • 1


#6339 R2 Unplanned Outage - 02/16/17 - 7:30 AM - 8:45 AM ET

Posted by MikeDVB on 16 February 2017 - 09:26 AM

The server is stable at this point - we are slowly re-enabling some of the ancillary services to keep the load down and the server as responsive as possible.

 

If you are still experiencing any issues at all, please do reach out to technical support.

 

We are sorry for the trouble this has caused.  We've taken potential issues such as this into consideration when looking at our new storage platform.  Ideally we'll be moving to the new platform in the very near future.


  • 1