MikeDVB's Content


#7593 S3 Server - Service Disruptions

Posted by MikeDVB on 26 October 2019 - 09:40 PM in Server and Network Announcements

It looks like the OOM killer [out-of-memory killer] was firing repeatedly against a few accounts - and the frequency at which it was running was causing system-wide issues.  The system itself didn't exhaust its RAM, but the user-level memory exhaustion did cause some problems.
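
To illustrate the pattern we were looking for - repeated OOM kills attributed to individual accounts rather than a single system-wide exhaustion - here is a rough, hypothetical Python sketch of how one might tally OOM-killer victims from the kernel log.  The log path and the "Killed process" wording are assumptions and vary by distribution:

import re
from collections import Counter

# Assumed log location; on other distributions this may be /var/log/kern.log or the journal.
LOG_PATH = "/var/log/messages"

# Matches lines such as: "Out of memory: Killed process 1234 (php-fpm) ..."
KILL_RE = re.compile(r"Killed process \d+ \(([^)]+)\)")

def count_oom_kills(path=LOG_PATH):
    """Count OOM-killer victims by process name to spot runaway accounts."""
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = KILL_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for process, kills in count_oom_kills().most_common(10):
        print(f"{process}: {kills} OOM kill(s)")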

 

We adjusted some system settings as per CloudLinux's advice and we believe this issue should not recur.  We are going to continue watching the server closely regardless.

 

The changes recommended by CloudLinux have been applied to all servers.




#7592 S3 Server - Service Disruptions

Posted by MikeDVB on 26 October 2019 - 07:27 PM in Server and Network Announcements

At 11:02 AM ET on Saturday, October 26, 2019 the S3 server exhausted all of its physical RAM, resulting in an outage.  Before bringing the server back online we added another 30GB of RAM to the system, bringing it up to a total of 120GB.

 

At 7:53 PM ET on Sat October 26, 2019 the S3 server once again experienced a brief outage.

 

We presently believe the issue lies with CloudLinux not properly applying RAM limits to all accounts.

 

A ticket has been opened with CloudLinux and we are watching the server closely.




#7591 R2 Server - Memory Doubled

Posted by MikeDVB on 01 October 2019 - 11:39 AM in Server and Network Announcements

We experienced an outage today on the R2 server, and upon closer investigation we believe a very brief but very large memory spike was the cause.

 

We've doubled the memory available to the system and this should prevent further outages of this nature.

 

I'm keeping an eye on the server personally and will keep this thread updated if we have to make any other changes.




#7588 R2 Server File System Check - September 27, 2019 - 11:30 PM ET

Posted by MikeDVB on 27 September 2019 - 01:15 PM in Server and Network Announcements

Hello!

 

We have been seeing some unusual storage performance on the R2 server lately, up to and including short outages such as the one the server just experienced.

 

This server has been online for quite some time and has not had a file system check performed in a while.  We believe the issues faced are related to some file system corruption that needs to be corrected.

 

We are scheduling emergency maintenance for the R2 server for tonight, September 27, 2019 at 11:30 PM Eastern Time [GMT-4].  We expect the file system check to take no more than 15 minutes but we are going to schedule a 2 hour window in case it takes longer than we anticipate.

 

I apologize for the short notice on this, but we need to take action now to prevent further unplanned or unexpected downtime.

 

If you have any questions about this maintenance or the maintenance window feel free to reply here or to open a ticket with technical support.

 

Thank you.




#7587 RFO: S3 Server Data Roll-Back and Restoration - 09/08/2019

Posted by MikeDVB on 10 September 2019 - 09:18 AM in Server and Network Announcements

What happened?
 
On Sunday, September 8, 2019 routine maintenance was scheduled to upgrade MariaDB [MySQL] from version 10.1 to 10.3.  Our testing as well as public documentation demonstrated that MariaDB 10.3 performs substantially faster and is more efficient than 10.1.  Our testing also indicated that the upgrade process was nearly seamless, requiring only a couple of restarts of the MariaDB server [about a minute of total downtime for MariaDB].
 
We upgraded all of our shared cloud and reseller servers from 10.1 to 10.3 and spot tested numerous sites on each server before and after the upgrades to ensure that everything went smoothly.
 
Within about an hour of completing the maintenance we began to receive numerous support tickets from clients on the S3 server reporting their sites were not working, databases were being corrupted, and a myriad of other issues related to MariaDB and databases.
 
We brought the server back online from a snapshot taken on August 24, 2019 and then immediately began restoring data changed/added to the server after that point.
 
Why did it happen?
 
There are several steps required for us to upgrade MariaDB from 10.1 to 10.3.  We run a CloudLinux technology called "MySQL Governor" which is what allows CloudLinux to restrict MySQL [MariaDB] usage to your resource limits [1 CPU core].  The MySQL Governor ships its own versions of the MariaDB binaries with additional code that allows CloudLinux to hook in, monitor, and control usage.
 
In order to upgrade MariaDB from 10.1 to 10.3 we had to remove the MySQL Governor, perform the upgrades, and then reinstall the MySQL Governor.  We performed this on all servers and tested several sites on each server both before and after the upgrade to ensure things were working as expected.
 
Within an hour of performing the maintenance we began to receive support tickets from our clients on the S3 server indicating issues with MariaDB connectivity, corrupted databases, and a myriad of other database-related issues.  At the time we did not know the exact cause, but we did know that it was due to the maintenance we had just performed.
 
Our post-incident investigation determined that the MySQL Governor reinstallation put back the older MariaDB 10.1 binaries instead of installing the new 10.3 binaries.  I don't know by what mechanism this caused the actual corruption experienced but I do know that reinstalling MariaDB 10.3 did not resolve the issue.
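
As an aside, the kind of post-upgrade check that would catch a binary roll-back like this is simple.  Here is a minimal sketch in Python - not our actual tooling - assuming the mysql client is on the PATH and can authenticate locally (for example via root's .my.cnf):

import subprocess

EXPECTED_SERIES = "10.3"  # the release we intended to be running after the upgrade

def running_server_version():
    """Ask the running MariaDB server for its version string."""
    # Assumes the mysql client can authenticate locally without a password prompt.
    output = subprocess.check_output(
        ["mysql", "-N", "-B", "-e", "SELECT VERSION();"], text=True
    )
    return output.strip()  # e.g. "10.3.18-MariaDB"

def verify_upgrade(expected=EXPECTED_SERIES):
    version = running_server_version()
    if not version.startswith(expected):
        raise RuntimeError(f"Expected MariaDB {expected}.x but the server reports {version}")
    print(f"OK: server is running {version}")

if __name__ == "__main__":
    verify_upgrade()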
 
What was done to correct the issue?
 
As a part of our standard procedures for maintenance a manual snapshot of the storage system was taken prior to getting started.  A snapshot allows us to roll back a whole server or even all servers to the point the snapshot was taken nearly instantly.  This protects you and your data from data loss or corruption should an upgrade fail in some catastrophic or unexpected manner that isn't recoverable.
 
As soon as we determined this wasn't something we could fix in place in a timely fashion, restoring from a snapshot is exactly what we chose to do.
 
After we verified the upgrades were successful and before the issues with the S3 server were apparent we dropped the snapshots we had taken prior to the maintenance.  In hindsight we should have allowed more time for potential issues to surface and should have kept the manual snapshot longer - at least a few hours if not a few days.  In this case we dropped the snapshot just prior to actually needing it.
 
We normally would have snapshots every hour on the hour; however, on 08/24 we had reached out to StorPool, our storage software vendor, with some concerns we had about snapshots - namely that we had a few thousand of them and we didn't want to risk data corruption, data loss, performance loss, etc.  While working with them on this, the automatic snapshots were temporarily disabled so the snapshot tree could be cleaned up and extraneous snapshots pruned.  This took a while, and when it was done the automatic snapshots were not re-enabled.
 
Bringing a server online from a snapshot takes only a few minutes - about as long as it takes to identify the disk IDs in our cloud platform [a few seconds] and then to identify the latest snapshot for those disks and mount them.  It's a fantastic way to recover - if you have a recent snapshot.  In this case, as the closest snapshot was from 08/24, we brought the server up from that point and immediately began restoring the data added to the server after that snapshot via our primary backup system.
 
The total actual downtime for the server was only about 30 minutes due to the MariaDB upgrades, corruption, and then bringing the server online from a snapshot.  It has taken about 30 hours after bringing the server online from a snapshot to restore all data added and changed since the snapshot was taken from our primary backup system.  The biggest bottleneck in this case was the cPanel restoration system - we've already drafted plans for our own recovery script that will skip the cPanel checks and hooks and stream data right to the server at up to 20 times the speed.  Unfortunately we weren't able to get this done while restoring the S3 server as we need to test any such tool before putting it into production use.
 
What is being done to prevent this from happening again?
 
As of yesterday we are now monitoring snapshot activity via our internal monitoring system.  This system has been configured, and tested, to alert us if any single storage device goes longer than 6 hours without a snapshot and performs an emergency all-staff-notified alert if any device goes longer than 12 hours without a snapshot.
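
The threshold logic itself is straightforward.  Here is a simplified Python sketch of the check, with hypothetical device names and timestamps standing in for what the monitoring actually pulls from the storage platform's API:

from datetime import datetime, timedelta

WARN_AFTER = timedelta(hours=6)        # alert the on-duty admin
EMERGENCY_AFTER = timedelta(hours=12)  # page all staff

def check_snapshot_age(device, last_snapshot, now=None):
    """Classify a storage device by how stale its newest snapshot is."""
    now = now or datetime.utcnow()
    age = now - last_snapshot
    if age >= EMERGENCY_AFTER:
        return (device, "EMERGENCY", age)
    if age >= WARN_AFTER:
        return (device, "WARNING", age)
    return (device, "OK", age)

if __name__ == "__main__":
    # Hypothetical data; in production these timestamps come from the storage platform.
    now = datetime.utcnow()
    newest_snapshots = {
        "s3-disk-1": now - timedelta(hours=1),
        "r2-disk-1": now - timedelta(hours=7),
        "s0-disk-1": now - timedelta(hours=13),
    }
    for device, stamp in newest_snapshots.items():
        print(check_snapshot_age(device, stamp, now))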
 
Prior to yesterday we were manually checking for the existence of snapshots monthly as a part of our Disaster Recovery Preparedness Plan, or DRPP.  The DRPP was drafted due to the outage we experienced in 2018 where we had no snapshots at all.  Checking for the existence of valid snapshots is only a small portion of the DRPP but it is a very important part.
 
To be as straightforward as I can be - we should have set up automated snapshot monitoring to begin with.  StorPool has been working on a very robust snapshot management tool that includes options such as keeping a set number of snapshots per hour, per day, per week, per month, and per year.  The tool they're working on also includes monitoring and alerts.  We had been checking snapshots manually rather than building our own automated monitoring while waiting on StorPool to release their new tool.
 
Our Disaster Recovery Preparedness Plan has been updated as a result of this incident and we have added some new standard operating procedures when it comes to performing maintenance.  While it did already state that snapshots were to be taken before maintenance and then removed when completed and verified - we've changed this so that we will keep the manual snapshot for at least 24 hours after a maintenance window.  While we can't prevent an upgrade from going wrong - we can make sure that we protect and insulate our clients from such incidents as much as possible.  
 
Additional Information
 
I do understand that, from the client perspective, an outage is an outage - and many may not care that each outage had a distinct and new cause.  We do our best to avoid outages, but we're human and we aren't perfect.  When we screw up or make a mistake we'll acknowledge and accept that and then learn from it so as not to make the same mistake twice.  This is the first outage of this kind for us and, with our new operating procedures governing maintenance and snapshots, it should be the last.
 
I am sorry for any trouble this caused you and if you have any questions or concerns don't hesitate to respond here or to reach out directly.



#7585 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 15 August 2019 - 08:39 AM in Server and Network Announcements

I was waiting on the RFO before updating this - but I haven't seen one yet so I at least wanted to post that the maintenance on the 13th went well and that we are fully redundant once again.

 

Juniper is still investigating the cause but from my conversations with the networking department at our upstream provider a filter has been put in place that should prevent the issue from recurring.

 

Once I have the RFO I will make it available.




#7584 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 13 August 2019 - 12:34 PM in Server and Network Announcements

Our upstream facility has scheduled a maintenance window tonight from 11 PM to 4 AM Eastern Time.

 

They expect we may see a couple instances of downtime of up to 15 minutes but are going to strive to keep any downtime to a minimum.

 

For full details you can read their status at https://helpdesk.han...wsItem/View/276




#7583 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 04:54 PM in Server and Network Announcements

In speaking with the senior network engineer at Handy Networks, our upstream facility, this issue affected both redundant pieces of hardware responsible for routing traffic.  The primary crashed, the secondary took over, and then the secondary subsequently crashed.  This cycle continued while they worked to determine the cause, which explains why things would show as online for a minute or two and then go back down.

 

The mode of failure is definitely unusual and I still personally believe it to be a bug in the Juniper OS.

 

Juniper as well as Handy Networks are still working to trace the cause and I expect to have an RFO within 72 hours or less.




#7582 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 01:52 PM in Server and Network Announcements

It is my current understanding that this issue was due to an unusual hardware failure in a core piece of networking equipment at the facility.  This piece of hardware failed in such a way that it wasn't servicing requests but wasn't 'offline' - sort of like an operating system crash/panic.

 

As this piece of equipment is redundant - there is another identical piece of hardware doing the same job that should pick up the slack - I do not at this point know why the failure caused an outage that redundancy didn't prevent.  It could be due to the nature of the failure in that the gear stayed online but wasn't actually working but that's speculation on my part.

 

As it stands everything is back online but we have lost the redundancy of this core piece of hardware until the issue is fully resolved.  It is suspected that this is a bug in the operating system running on the core networking equipment and the facility is working with Juniper Emergency Support to both investigate the cause of the issue as well as working to ensure it doesn't happen again.

 

Here is a snippet of the kernel/operating system log from the failed piece of networking hardware:

Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498  ecx: 000000ff  edx: 20dba494  ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c  ebp: af97de98  esi: 21054b98  edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000  eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033  ss: 003b  ds: 003b  es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b  trapno: 0000000c  err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498  ecx: 000000ff  edx: 20dba494  ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c  ebp: af97de98  esi: 21054b98  edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000  eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033  ss: 003b  ds: 003b  es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b  trapno: 0000000c  err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0

Once the Reason For Outage [RFO] is provided by our facility we will make it available here.




#7577 Not pre-sales, but account set-up: on-death contact?

Posted by MikeDVB on 22 July 2019 - 12:34 PM in Pre-Sales Enquiries

You could add her as a contact.  So long as you give her the ability to sign into the client area [full access] she would be able to obtain a support pin and request a details change.




#7573 iOS mail shows two “Junk” folders. Why?

Posted by MikeDVB on 28 May 2019 - 04:25 PM in Shared Hosting Support

One is 'local' to the phone and one is on the server.  In the phone's mail settings you can point the 'local' Junk folder at the on-server one.




#7555 Issues with S0 server. 04/18/2019

Posted by MikeDVB on 18 April 2019 - 04:25 PM in Server and Network Announcements

The hardware that was powering the S0 server powered off unexpectedly and without warning.  We brought S0 back online on an alternate hardware node.

We're still investigating what caused the hardware to power off.  The hardware has been back online since the incident, but it is not presently being used to provide services to any clients while we continue to investigate.




#7552 Slow performance on: 04/14/19

Posted by MikeDVB on 14 April 2019 - 11:56 AM in Server and Network Announcements

The underlying issue was a failing disk that did not eject from the cluster properly.  We reached out to StorPool and they identified and ejected the disk manually.  We're working with them to identify ways that we can avoid issues like this in the future, such as detecting a failed ejection and/or resolving the issue that caused the ejection to fail.

 

The S2 server was the only server that needed a reboot; all other servers were merely slower than normal on writes to storage.

 

If you have any questions about this let us know.




#7549 Network Disruption on April 11, 2019

Posted by MikeDVB on 13 April 2019 - 12:27 PM in Server and Network Announcements

On April 11th our monitoring alerted us to an outage on the network.

This outage affected the entire facility in which we have our servers and was unfortunately outside of our realm of control.

We received the first alert at 04/11/2019 12:47:56 PM and verified all services back online at 04/11/2019 01:11:49 PM, after roughly 24 minutes of downtime.  Times are Eastern.

The RFO (reason for outage) can be seen at our upstream provider here: https://handynetwork...RFO 4.11.19.pdf. I've also attached it to this post.





#7547 DDoS Attack on S3 Server - 1 IP Affected

Posted by MikeDVB on 25 February 2019 - 09:11 PM in Server and Network Announcements

Hello!

 

Unfortunately a Distributed Denial of Service attack with a very high packets-per-second rate hit an IP on the S3 server tonight.  The attack wasn't large in the sense of overwhelming our network capacity, but the packet rate was high enough to exhaust the web server's sockets and queues, rendering sites on that IP offline.

 

We did identify the target of the attack and have moved them off to their own IP address - should the attack recur or adapt and we have to take further action, it should only affect the target site and not others on the server.

 

This attack was a new variant we hadn't seen prior to tonight, so we're using our packet captures to investigate how we could handle such an attack better and more efficiently should anything like it recur in the future.

 

If you have any questions about the attack please open a support ticket, and feel free to reference this thread.




#7543 S0 and S1 Servers - Server IPs Null-Routed - How to access cPanel, Webmail, E...

Posted by MikeDVB on 19 January 2019 - 11:41 AM in Server and Network Announcements

Hello!
 
We're seeing a couple of very large attacks targeting a couple of our servers - S0 and S1.  While all client sites are online and operational, the IPs used for cPanel, Webmail, and most email access are currently un-routed.  Due to a misconfiguration in our Anti-DDoS protection that we're working to fix, we're not presently able to route those IPs through our Anti-DDoS services.  We expect this to be corrected within a couple of hours.
 
In the meantime you can make the following changes to access cPanel and Webmail.
 
To access cPanel you would want to access the "cpanel" subdomain on your primary domain. So if, for example, your cPanel's primary domain is "test.com" you would go to "cpanel.test.com" in your browser. You may get an SSL warning but you can safely accept it/pass it.

 

Accessing webmail is similar to cPanel in that you would connect to the "webmail" subdomain of your primary domain. For example, if your cPanel's primary domain is "test.com" you would go to "webmail.test.com" in your browser. You may get an SSL warning but you can safely accept it/pass it.

Email Clients [Mac Mail, Outlook, Thunderbird, etc] - if you have them configured to connect to "s0.supportedns.com" or "s1.supportedns.com" you can change this to point to the mail subdomain of your cPanel's primary domain. If, for example, your primary cPanel domain is "test.com" you would connect your mail client to "mail.test.com". You may get an SSL warning from your mail client which you can permanently accept.
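
If you'd like to confirm that these workaround hostnames resolve for your own domain before changing any settings, here is a small, optional Python sketch; "example.com" below is a placeholder for your cPanel's primary domain:

import socket

def workaround_hosts(primary_domain):
    """Build the cpanel/webmail/mail hostnames described above."""
    return [f"{sub}.{primary_domain}" for sub in ("cpanel", "webmail", "mail")]

def check_resolution(primary_domain):
    for host in workaround_hosts(primary_domain):
        try:
            address = socket.gethostbyname(host)
            print(f"{host} resolves to {address}")
        except socket.gaierror:
            print(f"{host} does not resolve - contact support before relying on it")

if __name__ == "__main__":
    check_resolution("example.com")  # replace with your cPanel's primary domain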

 

FTP - in most cases you can simply connect to your domain name.  There are some situations where this wouldn't work such as if you're using CloudFlare or Sucuri CloudProxy in which case you can connect directly to your account IP address.  You can find the IP address in your cPanel under 'Server Information' or at CloudFlare or Sucuri.

 

We do expect this to be resolved within an hour or two so if you just want to wait it out you can.  If you do make any changes to your mail or FTP clients - you do not have to revert them.




#7542 Drupal Security Update - Critical Vulnerabilities Patched

Posted by MikeDVB on 17 January 2019 - 09:16 AM in General Announcements

Drupal announced an update to Drupal core today to address two critical vulnerabilities. Drupal recommends users update their core:

If you are using Drupal 8.6.x, upgrade to Drupal 8.6.6.
If you are using Drupal 8.5.x or earlier, upgrade to Drupal 8.5.9.
If you are using Drupal 7.x, upgrade to Drupal 7.62.
 
Note: Versions of Drupal 8 prior to 8.5.x are end-of-life and do not receive security coverage. Sites on 8.5.x will receive security coverage until May 2019.
 
The vulnerabilities are announced as:
Drupal Core - Third-party libraries - SA-CORE-2019-001

Drupal Core - Remote code execution - SA-CORE-2019-002




#7541 Network Outage 1/17/2019 - 5:56 AM to 6:54 AM

Posted by MikeDVB on 17 January 2019 - 09:08 AM in Server and Network Announcements

I don't have a formal RCA from the facility yet but I do expect one sometime today or tomorrow.

 

The short version is essentially that there was a mis-ordered routing policy that resulted in unexpected withdrawal of the default networking route when a single carrier was removed from the transit mix.

 

It is normal for the transit mix to change from time to time - if a specific carrier/transit provider is experiencing packet loss or network issues, that network will be removed from the mix to maintain performance.  In this case Hurricane Electric was being dropped from the transit mix due to issues with their network.

 

As this happened at the facility level it occurred on networking equipment we do not maintain nor have access to - unfortunately out of our control.  We are confident that this issue will not recur.

 

If you have any questions about any of this please feel free to reply here or to open a support ticket.




#7536 Server Reboots for Security Update - ~2 Minutes Each

Posted by MikeDVB on 12 January 2019 - 12:03 AM in Server and Network Announcements

We're going to be rebooting all servers to apply a security patch.  The reboot will take 2 minutes or less per server.




#7535 Processor Upgrades - Faster Single-Threaded Performance & Higher Overall...

Posted by MikeDVB on 11 January 2019 - 08:28 PM in Server and Network Announcements

All processors in all compute nodes are fully upgraded :).  All shared, reseller, and VPS nodes will see the benefits of this upgrade.




#7533 November 18, 2018 - DDoS Attack

Posted by MikeDVB on 20 November 2018 - 09:40 PM in Server and Network Announcements

Quote: "Any update on this?  Thanks."

Nothing to update. The attacks were absorbed and ended. 




#7530 November 18, 2018 - DDoS Attack

Posted by MikeDVB on 18 November 2018 - 02:19 PM in Server and Network Announcements

Quote: "First thing I did was go to the Twitter channel.  I did not see anything so I opened a ticket.  I think it is better to announce it there."

It's only affecting a very small subset of clients - so I didn't want to get everybody concerned about it unless the attack gets to the point that we can't manage it or have to take drastic steps.

 

To be completely straightforward - these attacks aren't big enough to take anything we have offline - but our facility is proactively trying to protect us.  I've asked that they stop doing this [stop null-routing IPs] which will result in things staying online.

 

If the attack does get to the point that our infrastructure can't handle it - we do have other options available to us other than de-routing IPs and we'll use those first.

 

It's been a LONG time since we've had any DDoS attacks, much less of this magnitude, and our infrastructure has changed substantially since the last time.  Our facility is doing their best to protect us and is being a bit ... overprotective :).




#7528 November 18, 2018 - DDoS Attack

Posted by MikeDVB on 18 November 2018 - 02:03 PM in Server and Network Announcements

For clarity there is a lot of detail I am not providing here in this thread due to the nature of the attacks - but we are aware of them and working to resolve them as they change and adapt.




#7526 November 18, 2018 - DDoS Attack

Posted by MikeDVB on 18 November 2018 - 09:23 AM in Server and Network Announcements

Hello!

 

It's been a while since we've seen a decent DDoS attack - something large enough that our facility would take any sort of proactive action against it.

 

A decently sized DDoS attack started hitting our network this morning, on the order of 8 or 9 Gbps.  Our facility saw this traffic and began proactively putting blocks in place, resulting in some IP addresses showing as offline.  Only a few IPs were affected due to the very targeted nature of this attack.

 

Our network is capable of absorbing attacks of this size, so for now we've asked the facility to rescind the blocks and let us simply absorb the attack.  All services are online and operational at this time.

 

If there are any major changes we'll update this thread.