

MikeDVB

Member Since 27 Sep 2008
Offline Last Active Oct 26 2019 09:39 PM

#7591 R2 Server - Memory Doubled

Posted by MikeDVB on 01 October 2019 - 11:39 AM

We experienced an outage today on the R2 server and, after close investigation, we believe a very brief but very large memory spike was the cause.

 

We've doubled the memory available to the system and this should prevent further outages of this nature.

 

I'm keeping an eye on the server personally and will keep this thread updated if we have to make any other changes.




#7583 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 04:54 PM

In speaking with the senior network engineer at Handy Networks, our upstream facility, this issue affected both redundant pieces of hardware responsible for routing traffic.  The primary crashed and then the secondary took over and subsequently crashed.  This cycle continued while they were working to determine the cause, which explains why things would show as online for a minute or two and then go back down.

 

The mode of failure is definitely unusual and I still personally believe it to be a bug in the Juniper OS.

 

Juniper as well as Handy Networks are still working to trace the cause and I expect to have an RFO within 72 hours or less.




#7582 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 01:52 PM

It is my current understanding that this issue was due to an unusual hardware failure in a core piece of networking equipment at the facility.  This piece of hardware failed in such a way that it wasn't servicing requests but wasn't 'offline' - sort of like an operating system crash/panic.

 

As this piece of equipment is redundant - there is another identical piece of hardware doing the same job that should pick up the slack - I do not at this point know why redundancy didn't prevent the outage.  It could be due to the nature of the failure, in that the gear stayed online but wasn't actually working, but that's speculation on my part.

 

As it stands everything is back online but we have lost the redundancy of this core piece of hardware until the issue is fully resolved.  It is suspected that this is a bug in the operating system running on the core networking equipment, and the facility is working with Juniper Emergency Support both to investigate the cause of the issue and to ensure it doesn't happen again.

 

Here is a snippet of the kernel/operating system log from the failed piece of networking hardware:

Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498 ecx: 000000ff edx: 20dba494 ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c ebp: af97de98 esi: 21054b98 edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000 eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033 ss: 003b ds: 003b es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b trapno: 0000000c err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498 ecx: 000000ff edx: 20dba494 ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c ebp: af97de98 esi: 21054b98 edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000 eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033 ss: 003b ds: 003b es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b trapno: 0000000c err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
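
For anyone curious how to read the first line of that log, here is a minimal Python sketch that pulls the fields out and decodes the fault flags. It assumes the "x86 fault flags" value is the standard x86 page-fault error code (bit 0 = page present, bit 1 = write access, bit 2 = user mode); the function name `decode_fault` is made up for illustration and is not part of any Juniper tooling.

```python
import re

# Pattern for the BAD_PAGE_FAULT line as it appears in the log snippet.
FAULT_RE = re.compile(
    r"BAD_PAGE_FAULT: pid (?P<pid>\d+) \((?P<proc>[^)]+)\), uid (?P<uid>\d+): "
    r"pc (?P<pc>0x[0-9a-fA-F]+) got a (?P<access>read|write) fault at "
    r"(?P<addr>0x[0-9a-fA-F]+), x86 fault flags = (?P<flags>0x[0-9a-fA-F]+)"
)


def decode_fault(line):
    """Return the fields of a BAD_PAGE_FAULT line, or None if it does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    flags = int(match.group("flags"), 16)
    return {
        "pid": int(match.group("pid")),
        "process": match.group("proc"),
        "pc": int(match.group("pc"), 16),
        "fault_address": int(match.group("addr"), 16),
        "access": match.group("access"),
        # Assumption: these are the standard x86 page-fault error code bits.
        "page_present": bool(flags & 0x1),  # False = page was not mapped
        "write_access": bool(flags & 0x2),  # False = the access was a read
        "user_mode": bool(flags & 0x4),     # True = fault raised in user mode
    }


sample = (
    "Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), "
    "uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4"
)
info = decode_fault(sample)
print(info)
```

Read this way, the line says that the user-mode process fxpc tried to execute at address 0x0, which was not mapped. That is consistent with a software fault such as a call through a null pointer, but it does not by itself prove the cause.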

Once the Reason For Outage [RFO] is available from our facility we will make it available.




#7547 DDoS Attack on S3 Server - 1 IP Affected

Posted by MikeDVB on 25 February 2019 - 09:11 PM

Hello!

 

Unfortunately a Distributed Denial of Service attack with a very high packets-per-second rate hit an IP on the S3 server tonight.  This attack wasn't large in the sense that it overwhelmed our network capacity, but the number of packets was high enough to exhaust the web server's sockets and queues, rendering sites on the IP offline.

 

We did identify the target of the attack and have moved them off to their own IP address - should the attack recur or adapt and we have to take action it should only affect the target site and not others on the server.

 

This attack was a new variant we haven't seen prior to tonight so we're using our packet captures to investigate how we could handle such an attack better and more efficiently should anything like it recur in the future.
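
As a rough illustration of what "high packets-per-second" means when looking at a capture, the sketch below counts packets per destination IP per second and flags any IP whose peak rate crosses a threshold. The input format, the function names, the threshold, and the sample data are all assumptions made up for the example (the addresses are documentation-range placeholders); none of it comes from the actual capture.

```python
from collections import Counter


def packets_per_second(packets):
    """Count packets per (destination IP, whole second) from (timestamp, dst_ip) pairs."""
    counts = Counter()
    for timestamp, dst_ip in packets:
        counts[(dst_ip, int(timestamp))] += 1
    return counts


def peak_pps_by_ip(packets):
    """Return the highest one-second packet count seen for each destination IP."""
    peaks = {}
    for (dst_ip, _second), count in packets_per_second(packets).items():
        peaks[dst_ip] = max(peaks.get(dst_ip, 0), count)
    return peaks


def flag_targets(packets, threshold_pps):
    """Return the destination IPs whose peak packet rate reaches the threshold."""
    peaks = peak_pps_by_ip(packets)
    return sorted(ip for ip, peak in peaks.items() if peak >= threshold_pps)


# Synthetic data: 5,000 packets in one second to one IP, 20 to another.
sample = [(100.0 + i / 5000, "192.0.2.10") for i in range(5000)]
sample += [(100.0 + i / 20, "192.0.2.11") for i in range(20)]

targets = flag_targets(sample, threshold_pps=1000)
print(targets)
```

A count like this only identifies which IP is receiving the flood; it says nothing about bandwidth, which is why an attack can be small in bits per second and still exhaust sockets and queues.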

 

If you have any questions about the attack do please open a support ticket.  Do feel free to reference this thread.




#7521 Minor Test of Storage Snapshots Post-Disaster

Posted by MikeDVB on 23 October 2018 - 04:51 PM

As many of you already know we experienced a major disaster at the end of September where we were forced to restore data from a backup server due to a misconfiguration of snapshots on our storage platform.  Since this disaster we resolved the snapshot issue and today we were able to use the snapshots to help a client and we couldn't be happier with the experience.

 

We had a client today that had an employee submit a cancellation for services that shouldn't have been canceled.  The result was 43 accounts terminated that shouldn't have been.

 

In the past we would normally have turned to our backup server, which takes backups once per day, and restored the latest point we could.  That point could have been as much as 24 hours prior to the termination of services.

 

What we did instead - as it would give the client more recent data and give us a good test of working with snapshots - was to mount a snapshot of the server taken just prior to the termination of the services.  We booted that copy of the server up, generated backups for the accounts, which was exceptionally fast due to the SSD storage, and then restored those backups to the live server.

 

In the event of an actual disaster we'd simply mount the snapshots and boot them up without doing any backups or restorations and services would immediately come back online as they were when the snapshot was taken.

 

All in all we're very happy with this, as we were able to provide the client with more recent data than a backup restore would have allowed, and to do so much faster than would normally have been possible.

 

The process was extremely simple and straightforward.  While we don't ever plan on needing snapshots for disaster recovery it is good to know that if we did - they are available and work very well for that purpose.




#7520 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 23 October 2018 - 09:18 AM

Hardware acceleration is on the networking for our storage so it helps everything. S1, P1, R1 are on the upgraded processors so far.


#7512 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 21 October 2018 - 02:37 PM

Hoping next would be P1 but I truly understand if it will take time. :)

 

Great service from you guys. 

P1 is slated for this week - and R1 is to follow.  P1 will likely be Wednesday or Thursday and R1 will likely be Saturday or this next Monday.

 

S1, R1, and P1 are our largest, most populated servers from before our current architecture - they each host 3 to 4 times as many clients as S2, R2, P2, etc. - so once we get done with S1, R1, and P1 we'll likely be able to do 2 or 3 servers at a time for upgrades.




#7511 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 21 October 2018 - 02:36 PM

Performance is indeed great at the moment. I'm on S1. When it comes to front-end cached pages, I was averaging around 650 ms prior to the processor upgrade, now it's around 500 ms. WP-admin feels a bit more responsive, too.

 

As for NVMe, I had the chance to test the performance of a friend's site which is hosted on a server located in Germany which had NVMe SSDs, and it was indeed quite a bit faster there vs. the previous host. Avg. load time of cached pages went from around 450 ms to 220 ms for that site, with everything else being the same (LSCache enabled). But I also heard that NVMe servers cost up to 4-5x more in the US than in countries like Germany, and perhaps that's why I'm yet to hear about any US-based shared hosting server of any provider being powered by NVMe storage.

 

About this processor upgrade, how big of an upgrade was it in terms of processor architecture? (i.e. Intel 4th gen. vs. 8th gen.) Clock speed hardly tells the whole story, as efficiency goes up every year. So, even though the newer processors are clocked 21% higher, I think the actual performance gain will be much more than that.

They're the same generation - we're actually just going from low-power versions of the processors to higher power versions.  65 Watt TDP to 120 Watt TDP.  It's something we've wanted to do for a long time but at ~$1,750 per processor it's not a cheap endeavor when you have 20+ to swap out.

 

We thought the gain would just be the clock speed, but the memory controller is handling RAM faster as well - 2 Gigabytes/second faster.  I've never seen MySQL on this server use so little CPU while doing more work, and this isn't just because of the clock speed but because of how much faster RAM access is.  Queries are running faster and more consistently.
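
To put the figures quoted in this thread side by side, here is a small worked calculation in Python. It only uses numbers already mentioned (650 ms to 500 ms on S1, 450 ms to 220 ms for the NVMe comparison, roughly $1,750 per processor, and 20 processors as the lower bound of "20+"); the helper names are made up for illustration.

```python
def percent_reduction(before_ms, after_ms):
    """Percentage by which the load time dropped."""
    return (before_ms - after_ms) / before_ms * 100


def speedup(before_ms, after_ms):
    """How many times faster the page loads after the change."""
    return before_ms / after_ms


s1_reduction = percent_reduction(650, 500)    # S1 cached pages, before/after upgrade
s1_speedup = speedup(650, 500)
nvme_reduction = percent_reduction(450, 220)  # friend's site, previous host vs. NVMe host
minimum_swap_cost = 1750 * 20                 # lower bound: 20 processors at ~$1,750

print(f"S1 cached pages: {s1_reduction:.1f}% lower load time ({s1_speedup:.2f}x)")
print(f"NVMe comparison: {nvme_reduction:.1f}% lower load time")
print(f"Processor swap: at least ${minimum_swap_cost:,}")
```

A 1.30x improvement on one site is more than the 21% clock increase alone would suggest, which fits with faster RAM access also contributing, though a single site's page timings are too small a sample to draw firm conclusions from.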

 

When it comes to NVMe - it's just like standard SSDs when we were all HDD - we're waiting on the price to come down to make it reasonable.  Most already think we're expensive for what we give not really understanding how much it costs to provide the services we provide.  Unfortunately shared hosting profit margins have gotten thinner and thinner over the years.  It seems that people don't value quality support, reliability, and consistent speed like they used to and seem to value low price above anything else.  We're just not willing to cram our servers as full as our competitors to improve our profit margins at the cost of performance and reliability.

 

If we were to go to all NVMe - we'd most certainly have to double or triple pricing and the average hosting client isn't going to understand and would cancel/leave.

 

If NVMe is something you want - we could certainly build you a custom solution but it would be substantially more expensive than anything else we could offer until NVMe comes down in price.  Believe me - I'd love to be on NVMe and I remember the day when we were able to go pure SSD.  There were days when we were on HDD that I'd just sit and dream about how nice it would be to be on SSD and not have to worry so much about IOPS and latency and throughput.  Where things will 'just work' without having to be babysat so much :).




#7504 DDoS Attack Mitigation on P1

Posted by MikeDVB on 21 October 2018 - 12:34 AM

I'm not sure if anybody noticed the attack - our internal monitoring did, and since our monitoring runs on the same IP that was attacked, Pingdom did notice some downtime while the rest of the server was still online.

 

We've mitigated the attack and have reached out to the client that is the target to keep them in the loop.

 

So far so good on mitigating the attack - but we did also move the affected client to their own IP just in case we have to take more drastic actions.  We're keeping an eye on the situation but so far the server is online and 100% operational :).




#7503 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 20 October 2018 - 10:06 PM

Yeah, I am definitely surprised at how large of an improvement this is.  We will be upgrading all servers as quickly as we can.  The original goal was to upgrade all servers at once but due to the major disaster and the damage it caused to our revenue we're having to do this in stages.

 

2018-10-20_22-56-17.png




#7435 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 27 September 2018 - 01:07 AM

Hi all,
Quick question:
Considering what happened, it would be wise to have full cPanel backups.
Any suggestions on how to automate this process and maybe upload to an Amazon S3 bucket on a shared hosting account?
Thanks!

We’re evaluating what options there are so that hopefully we can offer such functionality for you. I know it’s doable with a custom script of some kind but it would be nice for it to be built in.


#7405 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 26 September 2018 - 01:33 PM

Restores are 100% Completed

 

If your site is offline showing a cPanel error page:

 

  • Try connecting to your cPanel by adding "/cpanel" on to the end of your domain.  If you can sign in, this verifies your account was restored.
  • Check to see if you're using our nameservers - if you aren't, you'll need to get your IP from cPanel and update your third party DNS.
  • Make sure you're not just reloading the error page - hitting reload while viewing the error just reloads the error page.

If you are not using third party DNS and your site doesn't appear but you can get into cPanel - try clearing your browser cache and restarting your browser.  If that doesn't work try another browser.  If it loads for you on one browser but not another - that's a caching issue and not a server or network issue.
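
If you use third-party DNS and want to confirm where your domain currently points, one option is to compare what it resolves to against the IP shown in cPanel. The Python sketch below is a minimal example of that comparison; `dns_points_at_server`, `fake_resolver`, and the addresses are placeholders made up for illustration, and a real lookup goes through your local resolver, which may itself be serving a cached answer.

```python
import socket


def resolve_ipv4(domain):
    """Return the sorted IPv4 addresses the local resolver gives for a domain."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET)
    return sorted({info[4][0] for info in infos})


def dns_points_at_server(domain, expected_ip, resolver=resolve_ipv4):
    """True if the domain resolves to the IP address shown in cPanel."""
    return expected_ip in resolver(domain)


# Offline demonstration with a stand-in resolver and placeholder addresses.
def fake_resolver(domain):
    return ["192.0.2.20"]


matches = dns_points_at_server("example.com", "192.0.2.20", resolver=fake_resolver)
stale = dns_points_at_server("example.com", "192.0.2.99", resolver=fake_resolver)
print(matches, stale)
```

For a real check, drop the `resolver` argument and pass your own domain and the IP from cPanel. If the result is False, the DNS record at your third-party provider still needs updating or the change has not propagated yet.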

 

If you are having any issues with your mail client - what we have seen work most often is removing the email account from the client and adding it back.  We haven't yet identified why this helps.  You can also add "/webmail" to the end of your domain to access your email if your mail client isn't working.

 

We do expect there to be a lot of little issues that we have to resolve so if you have issues and can't sort them please reach out in a ticket.

 

We are doing our best to keep up with support tickets.  I am sorry if it takes us longer to reply than normal, but we are answering tickets in the order received and doing our best to fully resolve any issues and to offer good, proper, non-copy-and-pasted advice.




#7363 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 25 September 2018 - 09:16 PM

For those worried that something like this could happen again.

 

We have already enabled snapshots on our storage cluster.  We're doing one snapshot every hour and keeping them for 10 hours.

 

So from a hypothetical standpoint - let's say that this did manage to happen again.  We would simply mount a snapshot from before the incident - within the hour before - and boot everything back up.

 

Total downtime would be about 5 minutes for the whole network.  Would there be any data loss? Possibly anything written since the last snapshot, so an hour or less - but nothing compared to the losses of a multi-day outage.

 

It would look literally like we just shut everything down and booted it back up.  No 'restorations', no lost emails, nothing.  There's a great chance almost nobody would even notice.
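
To make the hourly schedule and 10-hour retention concrete, here is a minimal Python sketch that lists which snapshots would exist at a given moment and picks the newest one taken before an incident. It assumes snapshots land exactly on the hour, and the incident time is hypothetical; the function names are made up for illustration and have nothing to do with the storage platform's own tooling.

```python
from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(hours=1)  # one snapshot every hour
SNAPSHOTS_KEPT = 10                     # kept for 10 hours


def retained_snapshots(now):
    """Return retained snapshot times, newest first (assumes on-the-hour snapshots)."""
    latest = now.replace(minute=0, second=0, microsecond=0)
    return [latest - i * SNAPSHOT_INTERVAL for i in range(SNAPSHOTS_KEPT)]


def pick_restore_point(snapshots, incident_time):
    """Return the newest snapshot taken strictly before the incident, or None."""
    candidates = [s for s in snapshots if s < incident_time]
    return max(candidates) if candidates else None


now = datetime(2018, 9, 25, 21, 16)
incident = datetime(2018, 9, 25, 20, 40)  # hypothetical incident time
snapshots = retained_snapshots(now)
restore_point = pick_restore_point(snapshots, incident)
data_loss_window = incident - restore_point

print("Restore from:", restore_point)
print("Writes lost since snapshot:", data_loss_window)
```

The same sketch also shows the limit of a 10-hour window: if the incident happened before the oldest retained snapshot, `pick_restore_point` returns None, and recovery would have to fall back to the backup servers.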

 

This is something that our storage vendor, StorPool, set up for us immediately upon seeing what had happened.  They actually apologized that it was not already set up and said that as a result of our disaster they are going to make sure that it is a default behavior that has to be actively disabled rather than the other way around.

 

Even with these snapshots and as powerful as they are - we are still going to overhaul our backup servers.  We have identified the issues with the present setup that caused restorations to be so slow and already have fixes for those issues planned for once we are fully online and all of our clients are taken care of.

 

Snapshots are a very powerful tool against data loss and corruption.  We actually used them a couple of times on our old storage platform, the Nimble CS500, to recover data on servers when clients made big mistakes themselves.




#7303 Major Outage - 09/21/18+ - Client Discussion

Posted by MikeDVB on 25 September 2018 - 11:41 AM

Though I never thought it would be possible for somebody to fire a single command and nuke MDD's whole system. Especially since "Multiple copies of your data are stored synchronously across multiple storage drives and servers with no single points of failure."


Neither did we - but we learned that the hard way. If we had snapshots configured properly we'd have been able to flip a few switches and bring everything back.  Hard lesson for sure.

 

That said, this happened inside of the servers, which is why it affected all copies.  When you write a file to a server, it's stored on 3 separate disks in 3 separate servers.  If you erase it, it's erased from all 3.  In this case the block discard happened inside of the server, so it discarded all copies.
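
The difference between losing a disk and issuing a discard can be shown with a toy model. The Python sketch below is only an illustration under simplifying assumptions (three in-memory dictionaries standing in for three replicas, and a snapshot that is just a saved copy); the class and method names are made up and it is not how the storage platform is actually implemented.

```python
class ReplicatedVolume:
    """Toy model: every block is written to three replicas on separate servers."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]
        self.snapshots = []

    def write(self, block, data):
        # Writes go to every replica synchronously.
        for replica in self.replicas:
            replica[block] = data

    def discard(self, block):
        # A discard is a logical operation, so it is replicated just like a write.
        for replica in self.replicas:
            replica.pop(block, None)

    def fail_replica(self, index):
        # A hardware failure loses only the copy held on that one replica.
        self.replicas[index].clear()

    def read(self, block):
        # Any surviving replica can serve the read.
        for replica in self.replicas:
            if block in replica:
                return replica[block]
        return None

    def snapshot(self):
        # Save a point-in-time copy of the volume's logical contents.
        merged = {}
        for replica in self.replicas:
            merged.update(replica)
        self.snapshots.append(merged)

    def restore_latest_snapshot(self):
        latest = self.snapshots[-1]
        for replica in self.replicas:
            replica.clear()
            replica.update(latest)


volume = ReplicatedVolume()
volume.write("site.db", "customer data")
volume.snapshot()

volume.fail_replica(0)
after_disk_failure = volume.read("site.db")  # still readable from the other replicas

volume.discard("site.db")
after_discard = volume.read("site.db")       # gone from all replicas

volume.restore_latest_snapshot()
after_restore = volume.read("site.db")       # back, as of the snapshot

print(after_disk_failure, after_discard, after_restore)
```

Replication protects against hardware failing underneath the data; it faithfully repeats a destructive command issued from above. A point-in-time copy is what covers the second case.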

 

Another thing to note is that our backup system is supposed to be taking backups every day - so at most you should have lost hours, not days, of data - and this is another issue we're going to make sure is resolved.  Not only will we have more regular snapshots - perhaps even on an hourly basis - that we can restore from immediately [as in, just hit 'boot' and you're back online just as you were], but we will also make sure that our backup systems are tested and audited on a regular basis moving forward.




#7296 Major Outage - 09/21/18 - 09/24/2018

Posted by MikeDVB on 25 September 2018 - 11:14 AM

s1    Completed
p1    Completed
r1    Completed
p2    Completed
s2    Completed
r2    Completed
s3    Tuesday, September 25, 2018 at 7:00:00 PM
r3    Wednesday, September 26, 2018 at 3:00:00 AM
s4    Wednesday, September 26, 2018 at 4:00:00 AM
r4    Completed
s5    Wednesday, September 26, 2018 at 10:00:00 AM
s0    Wednesday, September 26, 2018 at 10:00:00 AM

