

MikeDVB

Member Since 27 Sep 2008
Offline Last Active Nov 20 2018 09:48 PM

#7521 Minor Test of Storage Snapshots Post-Disaster

Posted by MikeDVB on 23 October 2018 - 04:51 PM

As many of you already know, we experienced a major disaster at the end of September where we were forced to restore data from a backup server due to a misconfiguration of snapshots on our storage platform.  Since this disaster we have resolved the snapshot issue, and today we were able to use the snapshots to help a client.  We couldn't be happier with the experience.

 

We had a client today that had an employee submit a cancellation for services that shouldn't have been canceled.  The result was 43 accounts terminated that shouldn't have been.

 

In the past, we'd normally have turned to our backup server, which takes backups once per day, and restored the latest point we could.  That could have been as far as 24 hours prior to the termination of services.

 

What we did instead, as it would give the client more recent data as well as give us a good test of working with snapshots, was to mount a snapshot of the server taken just prior to the termination of the services.  We booted the server up, generated backups for the accounts, which was exceptionally fast due to the SSD storage, and then restored those backups to the live server.

 

In the event of an actual disaster we'd simply mount the snapshots and boot them up without doing any backups or restorations and services would immediately come back online as they were when the snapshot was taken.

 

All in all we're very happy with this, as we were able to provide more recent data to the client than would otherwise have been possible, and much faster than normal.

 

The process was extremely simple and straightforward.  While we don't ever plan on needing snapshots for disaster recovery, it is good to know that if we did, they are available and work very well for that purpose.


  • 2


#7520 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 23 October 2018 - 09:18 AM

Hardware acceleration is on the networking for our storage so it helps everything. S1, P1, R1 are on the upgraded processors so far.
  • 1


#7512 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 21 October 2018 - 02:37 PM

Hoping next would be P1 but I truly understand if it will take time. :)

 

Great service from you guys. 

P1 is slated for this week - and R1 is to follow.  P1 will likely be Wednesday or Thursday and R1 will likely be Saturday or this next Monday.

 

S1, R1, and P1 are our largest, most populated servers from before our current architecture - they each host 3 to 4 times the clients that S2, R2, P2, etc. host, so once we get done with S1, R1, P1 we'll likely be able to do 2 or 3 servers at a time for upgrades.


  • 1


#7511 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 21 October 2018 - 02:36 PM

Performance is indeed great at the moment. I'm on S1. When it comes to front-end cached pages, I was averaging around 650 ms prior to the processor upgrade, now it's around 500 ms. WP-admin feels a bit more responsive, too.

 

As for NVMe, I had the chance to test the performance of a friend's site which is hosted on a server located in Germany which had NVMe SSDs, and it was indeed quite a bit faster there vs. the previous host. Avg. load time of cached pages went from around 450 ms to 220 ms for that site, with everything else being the same (LSCache enabled). But I also heard that NVMe servers cost up to 4-5x more in the US than in countries like Germany, and perhaps that's why I'm yet to hear about any US-based shared hosting server of any provider being powered by NVMe storage.

 

About this processor upgrade, how big of an upgrade was it in terms of processor architecture? (i.e. Intel 4th gen. vs. 8th gen.) Clock speed hardly tells the whole story, as efficiency goes up every year. So, even though the newer processors are clocked 21% higher, I think the actual performance gain will be much more than that.

They're the same generation - we're actually just going from low-power versions of the processors to higher power versions.  65 Watt TDP to 120 Watt TDP.  It's something we've wanted to do for a long time but at ~$1,750 per processor it's not a cheap endeavor when you have 20+ to swap out.

 

We thought it would just be the clock speed, but the memory controller is handling RAM faster as well - 2 gigabytes/second faster.  I've never seen MySQL on this server use so little CPU while doing more work, and this isn't just because of the clock speed but because of how much faster RAM access is.  Queries are running faster and more consistently.

 

When it comes to NVMe - it's just like standard SSDs when we were all HDD - we're waiting on the price to come down to make it reasonable.  Most already think we're expensive for what we give, not really understanding how much it costs to provide the services we provide.  Unfortunately, shared hosting profit margins have gotten thinner and thinner over the years.  It seems that people don't value quality support, reliability, and consistent speed like they used to, and value low price above anything else.  We're just not willing to cram our servers as full as our competitors do to improve our profit margins at the cost of performance and reliability.

 

If we were to go to all NVMe - we'd most certainly have to double or triple pricing and the average hosting client isn't going to understand and would cancel/leave.

 

If NVMe is something you want - we could certainly build you a custom solution but it would be substantially more expensive than anything else we could offer until NVMe comes down in price.  Believe me - I'd love to be on NVMe and I remember the day when we were able to go pure SSD.  There were days when we were on HDD that I'd just sit and dream about how nice it would be to be on SSD and not have to worry so much about IOPS and latency and throughput.  Where things will 'just work' without having to be babysat so much :).


  • 1


#7504 DDoS Attack Mitigation on P1

Posted by MikeDVB on 21 October 2018 - 12:34 AM

I'm not sure if anybody noticed the attack - our internal monitoring did, and since it runs on the same IP, Pingdom did notice some downtime while the rest of the server was still online.

 

We've mitigated the attack and have reached out to the client that is the target to keep them in the loop.

 

So far so good on mitigating the attack - but we did also move the affected client to their own IP just in case we have to take more drastic actions.  We're keeping an eye on the situation but so far the server is online and 100% operational :).


  • 1


#7503 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 20 October 2018 - 10:06 PM

Yeah, I am definitely surprised at how large of an improvement this is.  We will be upgrading all servers as quickly as we can.  The original goal was to upgrade all servers at once but due to the major disaster and the damage it caused to our revenue we're having to do this in stages.

 

2018-10-20_22-56-17.png


  • 1


#7435 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 27 September 2018 - 01:07 AM

Hi all
Quick question
Considering what happened it would be wise to have full cpanel backups.
Any suggestions on how to automate this process and maybe upload to amazon S3 bucket on a shared hosting account?
thanks!

We’re evaluating what options there are so that hopefully we can offer such functionality for you. I know it’s doable with a custom script of some kind but it would be nice for it to be built in.
  • 2


#7405 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 26 September 2018 - 01:33 PM

Restores are 100% Completed

 

If your site is offline showing a cPanel error page:

 

  • Try connecting to your cPanel by adding "/cpanel" on to the end of your domain.  If you can sign in, this verifies your account was restored.
  • Check to see if you're using our nameservers - if you aren't, you'll need to get your IP from cPanel and update your third party DNS.
  • Make sure you're not just reloading the error page - hitting reload while viewing the error just reloads the error page.

If you are not using third party DNS and your site doesn't appear but you can get into cPanel - try clearing your browser cache and restarting your browser.  If that doesn't work try another browser.  If it loads for you on one browser but not another - that's a caching issue and not a server or network issue.
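For anyone on third party DNS who wants to check this from a script rather than a browser, here is a minimal sketch in Python. It only compares what your domain currently resolves to against the IP shown in cPanel. The function name, the `resolve` parameter, and the example addresses are made up for illustration and are not part of any tool we provide.

```python
import socket


def points_at_server(domain, expected_ip, resolve=socket.gethostbyname):
    """Return True if `domain` currently resolves to `expected_ip`.

    `expected_ip` is the address you copied from cPanel (an assumption of
    this sketch). `resolve` defaults to the system resolver and can be
    swapped out for testing. Only a single IPv4 address is checked.
    """
    try:
        return resolve(domain) == expected_ip
    except OSError:
        # Covers socket.gaierror: the name did not resolve at all.
        return False


if __name__ == "__main__":
    # Hypothetical values; replace with your own domain and cPanel IP.
    print(points_at_server("example.com", "192.0.2.10"))
```

Keep in mind that your local resolver caches answers, so a `False` shortly after a DNS change may just mean the old record has not expired yet.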

 

If you are having any issues with your mail client - what we have seen work most often is removing the email account from the client and adding it back.  We haven't yet identified what the difference is.  You can also add "/webmail" to the end of your domain to access your email if your mail client isn't working.

 

We do expect there to be a lot of little issues that we have to resolve so if you have issues and can't sort them please reach out in a ticket.

 

We are doing our best to keep up with support tickets.  I am sorry if it takes us longer to reply than normal but we are answering tickets in the order received and doing our best to fully resolve any issues and to offer good proper non-copy-and-pasted advice.


  • 2


#7363 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 25 September 2018 - 09:16 PM

For those worried that something like this could happen again.

 

We have already enabled snapshots on our storage cluster.  We're doing one snapshot every hour and keeping them for 10 hours.

 

So from a hypothetical standpoint - let's say that this did manage to happen again.  We would simply mount a snapshot from before the incident - within the hour before - and boot everything back up.

 

Total downtime would be ~5 minutes for the whole network.  Would there be any data loss? Possibly anything written in the preceding hour or less - but nothing compared to the losses of a multi-day outage.
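To make the "within the hour before" part concrete, here is a rough sketch in Python of the hourly schedule with 10 hours of retention described above. It is not how the storage platform implements snapshots; the function names are made up, and it assumes snapshots fire exactly on the hour.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(hours=1)  # from the post: one snapshot every hour
RETENTION = timedelta(hours=10)         # from the post: kept for 10 hours


def retained_snapshots(now, interval=SNAPSHOT_INTERVAL, retention=RETENTION):
    """Return the timestamps of snapshots still kept at `now`, oldest first.

    Assumption: snapshots are taken exactly on the hour.
    """
    latest = now.replace(minute=0, second=0, microsecond=0)
    count = int(retention / interval)
    return [latest - interval * i for i in reversed(range(count))]


def pick_restore_point(snapshots, incident_time):
    """Pick the newest snapshot taken strictly before the incident, or None."""
    candidates = [s for s in snapshots if s < incident_time]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    # Hypothetical incident time, chosen only for illustration.
    now = datetime(2018, 10, 23, 16, 51)
    incident = datetime(2018, 10, 23, 16, 20)
    restore_point = pick_restore_point(retained_snapshots(now), incident)
    print("restore from:", restore_point)
    print("writes at risk:", incident - restore_point)
```

With an hourly schedule the "writes at risk" window can never exceed one hour, and an incident that goes unnoticed for longer than the retention period would leave no usable snapshot, which is why the backup servers still matter.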

 

It would look literally like we just shut everything down and booted it back up.  No 'restorations', no lost emails, nothing.  There's a great chance almost nobody would even notice.

 

This is something that our storage vendor, StorPool, set up for us immediately upon seeing what had happened.  They actually apologized that it was not already set up and said that as a result of our disaster they are going to make sure that it is a default behavior that has to be actively disabled rather than the other way around.

 

Even with these snapshots and as powerful as they are - we are still going to overhaul our backup servers.  We have identified the issues with the present setup that caused restorations to be so slow and already have fixes for those issues planned for once we are fully online and all of our clients are taken care of.

 

Snapshots are a very powerful tool against data loss and corruption.  We actually used them a couple of times on our old storage platform, the Nimble CS500, to recover data on servers when clients made big mistakes themselves.


  • 1


#7303 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 25 September 2018 - 11:41 AM

Though I never thought it would be possible for somebody to fire a single command and nuke MDD's whole system. Especially since "Multiple copies of your data are stored synchronously across multiple storage drives and servers with no single points of failure."


Neither did we - but we learned that the hard way. If we had snapshots configured properly we'd have been able to flip a few switches and bring everything back.  Hard lesson for sure.

 

That said, this happened inside of the servers, which is why it affected all copies.  So when you write a file to a server - it's stored on 3 separate disks in 3 separate servers.  If you erase it - it's erased from all 3.  In this case - this block discard happened inside of the server - so it discarded all copies.
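A toy model in Python may help show why three copies did not save us while a snapshot would have. This is purely illustrative: the class and method names are made up, and a real storage system uses copy-on-write snapshots rather than the full copies used here.

```python
class ReplicatedVolume:
    """Toy model: every write and every discard is applied to all replicas."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]
        self.snapshots = []

    def write(self, name, data):
        for replica in self.replicas:
            replica[name] = data

    def discard(self, name):
        # A discard issued from inside the server is replicated like any
        # other operation, so it removes the data from every copy.
        for replica in self.replicas:
            replica.pop(name, None)

    def snapshot(self):
        # Point-in-time copy that later writes and discards do not touch.
        self.snapshots.append(dict(self.replicas[0]))

    def restore_latest(self):
        latest = self.snapshots[-1]
        self.replicas = [dict(latest) for _ in self.replicas]


if __name__ == "__main__":
    vol = ReplicatedVolume()
    vol.write("site.db", "v1")
    vol.snapshot()
    vol.discard("site.db")
    print(any("site.db" in r for r in vol.replicas))  # gone from all 3 copies
    vol.restore_latest()
    print(all(r.get("site.db") == "v1" for r in vol.replicas))  # back again
```

Replication protects against a disk or a server failing; it does nothing against an operation that is itself replicated. Only a point-in-time copy does.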

 

Another thing to note is that our backup system is supposed to be taking backups every day - so at most you should have lost hours, not days, of data - this is another issue we're going to make sure is resolved.  Not only will we have more regular snapshots - perhaps even on an hourly basis - that we can restore from immediately [as in, just hit 'boot' and you're back online just as you were] - but we will also make sure our backup systems moving forward are tested and audited on a regular basis.


  • 1


#7296 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 25 September 2018 - 11:14 AM

s1    Completed
p1    Completed
r1    Completed
p2    Completed
s2    Completed
r2    Completed
s3    Tuesday, September 25, 2018 at 7:00:00 PM
r3    Wednesday, September 26, 2018 at 3:00:00 AM
s4    Wednesday, September 26, 2018 at 4:00:00 AM
r4    Completed
s5    Wednesday, September 26, 2018 at 10:00:00 AM
s0    Wednesday, September 26, 2018 at 10:00:00 AM


  • 1


#7291 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 25 September 2018 - 10:36 AM

We are now moving the SSDs from Slow Backup to Fast Backup to get started on restoring the S3 server.  This is step 2 on the process outlined here: https://forums.mddho...cussion/?p=7288


  • 1


#7279 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 25 September 2018 - 09:49 AM

Managed to get my S5 server website back online yesterday by moving to another host a couple days ago.  Had I moved when my gut told me to, as soon as all this happened, I could have been back online already on Sunday.  Lesson learned...  When things this serious go wrong, grab another host for a month right away and take shelter.  Or in my case...  Move.

If you could have moved - then we could have re-created your account and you could have restored in-place.  Whatever you moved to another provider could have been quickly brought back online with us.

 

The server has been online since Friday - the only thing we've been working on is copying our backup data over.  Any clients that have their own backups have been online for days already.


  • 3


#7246 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 24 September 2018 - 10:26 PM

R2 is being copied from Slow Backup to Fast Backup in preparation for restoration.

 

R4 Server finished restoring.

 

R1 is almost done.

 

S2 is restoring now.


  • 2


#7240 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 24 September 2018 - 09:06 PM

About a half hour left of copying s2 from Slow Backup.  R4 and R1 are restoring now.


  • 2