

MikeDVB's Content

There have been 21 items by MikeDVB (Search limited from 01-November 19)



#7552 Slow performance on: 04/14/19

Posted by MikeDVB on 14 April 2019 - 11:56 AM in Server and Network Announcements

The underlying issue was a failing disk that did not eject from the cluster properly.  We reached out to StorPool and they identified and ejected the disk manually.  We're working with them to identify ways we can avoid issues like this in the future, such as detecting a failed ejection and/or resolving whatever caused the ejection to fail.

 

The S2 server was the only server that needed to be rebooted; all other servers were only slower than normal on writes to the storage.

 

If you have any questions about this, let us know.




#7600 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 11:01 AM in Server and Network Announcements

All accounts have had their IP addresses updated.  I'm going to be emailing everybody on the server [even those not affected] as there is no way for us to separate out emails by IP.




#7604 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 01:39 PM in Server and Network Announcements

"Good job responding quickly to the attack. If the attack was targeting a specific account, are you going to isolate it?"

If we're able to identify it, we would isolate it and reach out to the owner.

So far the attack hasn't changed IPs.




#7601 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 11:11 AM in Server and Network Announcements

An email has been dispatched to all clients on the S5 server.




#7599 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 10:51 AM in Server and Network Announcements

Most accounts have had their IP addresses updated.  We still have a few more to update.




#7598 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 10:06 AM in Server and Network Announcements

Still working to move accounts to new IPs.




#7597 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 09:57 AM in Server and Network Announcements

This is actually one of the larger DDoS attacks we've ever seen - large enough that action had to be taken at the network level to keep the network stable.

 

Currently there is an IP address that is null routed, and we're going to be working to restore services to other IPs.  Hopefully the attack won't move to the new IPs, but without knowing who, exactly, the target was, it's hard to be sure.

 

We're going to be evaluating all of the logs to do our best to identify who the target of the attacks was so we can isolate the account.




#7596 S5 Server Instability - 02/27/2020

Posted by MikeDVB on 27 February 2020 - 09:39 AM in Server and Network Announcements

The hardware powering the S5 server is experiencing issues.  The issues are not severe enough to have triggered an automatic fail-over, but they are enough to disrupt the busiest services on the server.

 

We're working to restore stability and will likely be balancing the S5 server over to another piece of hardware.

 

We apologize for the trouble and will provide updates here as they come.




#7593 S3 Server - Service Disruptions

Posted by MikeDVB on 26 October 2019 - 09:40 PM in Server and Network Announcements

It looks like the OOM Killer [out-of-memory killer] was going wild on a few accounts, and how frequently it was running was causing system-wide issues.  The system itself didn't exhaust its RAM, but the user-level issues did cause some problems.

 

We adjusted some system settings as per CloudLinux's advice and we believe this issue should not recur.  We are going to continue watching the server closely regardless.

 

The changes recommended by CloudLinux have been applied to all servers.




#7592 S3 Server - Service Disruptions

Posted by MikeDVB on 26 October 2019 - 07:27 PM in Server and Network Announcements

At 11:02 AM ET on Saturday, October 26, 2019 the S3 server exhausted all of its physical RAM, resulting in an outage.  Before bringing the server back online we added an additional 30 GB of RAM to the system, bringing it up to a total of 120 GB.

 

At 7:53 PM ET on Saturday, October 26, 2019 the S3 server once again experienced a brief outage.

 

We presently believe the issue lies with CloudLinux not properly applying RAM limits to all accounts.

 

A ticket has been opened with CloudLinux and we are watching the server closely.




#7587 RFO: S3 Server Data Roll-Back and Restoration - 09/08/2019

Posted by MikeDVB on 10 September 2019 - 09:18 AM in Server and Network Announcements

What happened?
 
On Sunday, September 8, 2019 routine maintenance was scheduled to upgrade MariaDB [MySQL] from version 10.1 to 10.3.  Our testing, as well as public documentation, demonstrated that MariaDB 10.3 performs substantially faster and is more efficient than 10.1.  Our testing also indicated that the upgrade process was nearly seamless, requiring only a couple of restarts of the MariaDB server [about a minute of total downtime for MariaDB].
 
We upgraded all of our shared cloud and reseller servers from 10.1 to 10.3 and spot tested numerous sites on each server before and after the upgrades to ensure that everything went smoothly.
 
Within about an hour of completing the maintenance we began to receive numerous support tickets from clients on the S3 server reporting their sites were not working, databases were being corrupted, and a myriad of other issues related to MariaDB and databases.
 
We brought the server back online from a snapshot taken on August 24, 2019 and then immediately began restoring data changed/added to the server after that point.
 
Why did it happen?
 
There are several steps required for us to upgrade MariaDB from 10.1 to 10.3.  We run a CloudLinux technology called "MySQL Governor" which is what allows CloudLinux to restrict MySQL [MariaDB] usage to your resource limits [1 CPU core].  The MySQL Governor has its own versions of the MariaDB binaries with additional code allowing CloudLinux to hook in, monitor, and control usage.
 
In order to upgrade MariaDB from 10.1 to 10.3 we had to remove the MySQL Governor, perform the upgrades, and then reinstall the MySQL Governor.  We performed this on all servers and tested several sites on each server both before and after the upgrade to ensure things were working as expected.
 
Within an hour of performing the maintenance we began to receive support tickets from our clients on the S3 server indicating issues with MariaDB connectivity, corrupted databases, and a myriad of other database-related issues.  At the time we did not know the exact cause, but we did know that it was due to the maintenance we had just performed.
 
Our post-incident investigation determined that the MySQL Governor reinstallation put back the older MariaDB 10.1 binaries instead of the new 10.3 binaries.  I don't know by what mechanism this caused the corruption we saw, but I do know that simply reinstalling MariaDB 10.3 did not resolve the issue.
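 
For anyone curious, a post-maintenance check along the following lines would have caught the version mismatch before any clients did.  This is only a rough sketch for illustration - not the tooling we actually ran - and it assumes a local MariaDB server reachable with the stock mysql client:

#!/usr/bin/env python3
# Illustrative post-maintenance sanity check: confirm the running MariaDB
# server reports the version we expected after the MySQL Governor reinstall.
# This is a sketch, not our production tooling.

import subprocess
import sys

EXPECTED_PREFIX = "10.3"  # the version we intended to end up on

def running_mariadb_version() -> str:
    # Ask the local server for its version string, e.g. "10.3.17-MariaDB".
    result = subprocess.run(
        ["mysql", "-N", "-B", "-e", "SELECT VERSION();"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def main() -> int:
    version = running_mariadb_version()
    if not version.startswith(EXPECTED_PREFIX):
        print(f"MISMATCH: server reports {version}, expected {EXPECTED_PREFIX}.x")
        return 1
    print(f"OK: server reports {version}")
    return 0

if __name__ == "__main__":
    sys.exit(main())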
 
What was done to correct the issue?
 
As part of our standard maintenance procedures, a manual snapshot of the storage system was taken prior to getting started.  A snapshot allows us to roll back a whole server, or even all servers, to the point the snapshot was taken nearly instantly.  This protects you and your data from loss or corruption should an upgrade fail in some catastrophic or unexpected manner that isn't recoverable.
 
As soon as we determined this wasn't something we could fix in place in a timely fashion, restoring to a snapshot is exactly what we chose to do.
 
After we verified the upgrades were successful, and before the issues with the S3 server were apparent, we dropped the snapshots we had taken prior to the maintenance.  In hindsight we should have allowed more time for potential issues to surface and kept the manual snapshot longer - at least a few hours, if not a few days.  In this case we dropped the snapshot just prior to actually needing it.
 
We would normally have snapshots every hour on the hour; however, on 08/24 we had reached out to StorPool, our storage software vendor, with some concerns we had about snapshots.  Namely, we had a few thousand of them and we didn't want to risk data corruption, data loss, performance loss, etc.  While working with them on this, the automatic snapshots were temporarily disabled so the snapshot tree could be cleaned up and extraneous snapshots pruned.  This took a while, and when it was done snapshots were not re-enabled.
 
Bringing a server online from a snapshot takes only a few minutes - about as long as it takes to identify the disk IDs in our cloud platform [a few seconds] and then to identify the latest snapshot for those disks and mount them.  It's a fantastic way to recover - if you have a recent snapshot.  In this case, as the closest snapshot was from 08/24, we brought the server up from that point and immediately began restoring data added to the server after that snapshot via our primary backup system.
 
The total actual downtime for the server was only about 30 minutes, covering the MariaDB upgrades, the corruption, and then bringing the server online from a snapshot.  It has taken about 30 hours since bringing the server online to restore, from our primary backup system, all data added or changed after the snapshot was taken.  The biggest bottleneck in this case was the cPanel restoration system - we've already drafted plans for our own recovery script that will skip the cPanel checks and hooks and stream data straight to the server at up to 20 times the speed.  Unfortunately we weren't able to get this done while restoring the S3 server, as we need to test any such tool before putting it into production use.
 
What is being done to prevent this from happening again?
 
As of yesterday we are now monitoring snapshot activity via our internal monitoring system.  This system has been configured, and tested, to alert us if any single storage device goes longer than 6 hours without a snapshot, and to send an emergency all-staff alert if any device goes longer than 12 hours without one.
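 
For anyone interested in what that check boils down to, here is a rough sketch of the logic.  The list_latest_snapshot_times and notify_* functions are placeholders standing in for our internal tooling - they are not part of StorPool's API:

#!/usr/bin/env python3
# Rough sketch of the snapshot-age check: warn at 6 hours, page everyone at 12.
# The listing and alerting functions are placeholders for our internal tooling.

from datetime import datetime, timedelta

WARN_AFTER = timedelta(hours=6)
EMERGENCY_AFTER = timedelta(hours=12)

def list_latest_snapshot_times():
    """Placeholder: map of storage device name -> time of its newest snapshot."""
    raise NotImplementedError

def notify_on_call(message):
    """Placeholder: regular alert to the on-call engineer."""
    print("ALERT:", message)

def notify_all_staff(message):
    """Placeholder: emergency alert that pages all staff."""
    print("EMERGENCY:", message)

def check_snapshot_ages(now=None):
    now = now or datetime.utcnow()
    for device, last_snapshot in list_latest_snapshot_times().items():
        age = now - last_snapshot
        if age > EMERGENCY_AFTER:
            notify_all_staff(f"{device} has had no snapshot for {age}")
        elif age > WARN_AFTER:
            notify_on_call(f"{device} has had no snapshot for {age}")

if __name__ == "__main__":
    check_snapshot_ages()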
 
Prior to yesterday we were manually checking for the existence of snapshots monthly as a part of our Disaster Recovery Preparedness Plan, or DRPP.  The DRPP was drafted due to the outage we experienced in 2018 where we had no snapshots at all.  Checking for the existence of valid snapshots is only a small portion of the DRPP but it is a very important part.
 
To be as straightforward as I can be - we should have set up automated snapshot monitoring to begin with.  StorPool has been working on a very robust snapshot management tool that includes options such as taking and keeping a set number of snapshots per hour, per day, per week, per month, and per year.  The tool they're working on also includes monitoring and alerts.  We had been monitoring snapshots manually, rather than building our own automated monitoring, while waiting on StorPool to release their new tool.
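 
To illustrate the kind of retention policy I'm describing [this is my own rough sketch, not StorPool's implementation or our production code], picking which snapshots to keep might look something like this:

# Illustration only: a simple keep-N-per-bucket snapshot retention policy
# (hourly/daily/weekly/monthly/yearly).

from datetime import datetime

# How many periods to keep in each bucket (example values, not our real policy).
KEEP = {"hourly": 24, "daily": 7, "weekly": 4, "monthly": 12, "yearly": 5}

def bucket_key(ts, bucket):
    """Collapse a timestamp to the period it belongs to for a given bucket."""
    if bucket == "hourly":
        return ts.strftime("%Y-%m-%d %H")
    if bucket == "daily":
        return ts.strftime("%Y-%m-%d")
    if bucket == "weekly":
        return ts.strftime("%G-W%V")
    if bucket == "monthly":
        return ts.strftime("%Y-%m")
    return ts.strftime("%Y")  # yearly

def snapshots_to_keep(snapshot_times):
    """Given all snapshot timestamps, return the set we would retain."""
    keep = set()
    for bucket, count in KEEP.items():
        seen = {}
        # Newest first; keep the newest snapshot in each period, up to `count` periods.
        for ts in sorted(snapshot_times, reverse=True):
            key = bucket_key(ts, bucket)
            if key not in seen and len(seen) < count:
                seen[key] = ts
        keep.update(seen.values())
    return keep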
 
Our Disaster Recovery Preparedness Plan has been updated as a result of this incident and we have added some new standard operating procedures for performing maintenance.  While the plan did already state that snapshots were to be taken before maintenance and removed once the maintenance was completed and verified - we've changed this so that we will keep the manual snapshot for at least 24 hours after a maintenance window.  While we can't prevent an upgrade from going wrong - we can make sure that we protect and insulate our clients from such incidents as much as possible.
 
Additional Information
 
I do understand that, from the client perspective, an outage is an outage - and many may not care whether each outage is for a distinct and new issue.  We do our best to avoid outages, but we're human and we aren't perfect.  When we screw up or make a mistake we'll acknowledge and accept that, and then learn from it so as not to make the same mistakes twice.  This is the first outage of this kind for us and it should be the last with our new operating procedures governing maintenance and snapshots.
 
I am sorry for any trouble this caused you and if you have any questions or concerns don't hesitate to respond here or to reach out directly.



#7588 R2 Server File System Check - September 27, 2019 - 11:30 PM ET

Posted by MikeDVB on 27 September 2019 - 01:15 PM in Server and Network Announcements

Hello!

 

We have been seeing some really odd storage performance on the R2 server lately, up to and including short outages such as the one the server just experienced.

 

This server has been online for quite some time and has not had a file system check performed in a while.  We believe the issues faced are related to some file system corruption that needs to be corrected.

 

We are scheduling emergency maintenance for the R2 server for tonight, September 27, 2019 at 11:30 PM Eastern Time [GMT-4].  We expect the file system check to take no more than 15 minutes but we are going to schedule a 2 hour window in case it takes longer than we anticipate.

 

I apologize for the short notice on this, but we need to take action to prevent further unplanned or unexpected downtime.

 

If you have any questions about this maintenance or the maintenance window feel free to reply here or to open a ticket with technical support.

 

Thank you.




#7591 R2 Server - Memory Doubled

Posted by MikeDVB on 01 October 2019 - 11:39 AM in Server and Network Announcements

We experienced an outage today on the R2 server and upon close investigation we believe a very brief but very large memory spike was the cause.

 

We've doubled the memory available to the system and this should prevent further outages of this nature.

 

I'm keeping an eye on the server personally and will keep this thread updated if we have to make any other changes.




#7577 Not pre-sales, but account set-up: on-death contact?

Posted by MikeDVB on 22 July 2019 - 12:34 PM in Pre-Sales Enquiries

You could add her as a contact.  So long as you give her the ability to sign into the client area [full access], she would be able to obtain a support PIN and request a details change.




#7549 Network Disruption on April 11, 2019

Posted by MikeDVB on 13 April 2019 - 12:27 PM in Server and Network Announcements

On April 11th our monitoring alerted us to an outage on the network.

This outage affected the entire facility in which we have our servers and was unfortunately outside of our realm of control.

We received the first alert at 04/11/2019 12:47:56 PM and verified all services back online at 04/11/2019 01:11:49 PM, after roughly 24 minutes of downtime.  Times are Eastern.

The RFO (reason for outage) from our upstream provider can be seen here: https://handynetwork...RFO 4.11.19.pdf. I've also attached it to this post.





#7584 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 13 August 2019 - 12:34 PM in Server and Network Announcements

Our upstream facility has scheduled a maintenance window tonight from 11 PM to 4 AM Eastern Time.

 

They expect we may see a couple of instances of downtime of up to 15 minutes each, but they are going to strive to keep any downtime to a minimum.

 

For full details you can read their status at https://helpdesk.han...wsItem/View/276




#7585 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 15 August 2019 - 08:39 AM in Server and Network Announcements

I was waiting on the RFO before updating this - but I haven't seen one yet so I at least wanted to post that the maintenance on the 13th went well and that we are fully redundant once again.

 

Juniper is still investigating the cause but from my conversations with the networking department at our upstream provider a filter has been put in place that should prevent the issue from recurring.

 

Once I have the RFO I will make it available.




#7582 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 01:52 PM in Server and Network Announcements

It is my current understanding that this issue was due to an unusual hardware failure in a core piece of networking equipment at the facility.  This piece of hardware failed in such a way that it wasn't servicing requests but wasn't 'offline' - sort of like an operating system crash/panic.

 

As this piece of equipment is redundant - there is another identical piece of hardware doing the same job that should pick up the slack - I do not at this point know why the failure caused an outage that the redundancy didn't prevent.  It could be due to the nature of the failure, in that the gear stayed online but wasn't actually working, but that's speculation on my part.

 

As it stands everything is back online, but we have lost the redundancy of this core piece of hardware until the issue is fully resolved.  It is suspected that this is a bug in the operating system running on the core networking equipment, and the facility is working with Juniper Emergency Support to both investigate the cause of the issue and ensure it doesn't happen again.

 

Here is a snippet of the kernel/operating system log from the failed piece of networking hardware:

Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498  ecx: 000000ff  edx: 20dba494  ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c  ebp: af97de98  esi: 21054b98  edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000  eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033  ss: 003b  ds: 003b  es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b  trapno: 0000000c  err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498  ecx: 000000ff  edx: 20dba494  ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c  ebp: af97de98  esi: 21054b98  edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000  eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033  ss: 003b  ds: 003b  es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b  trapno: 0000000c  err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0

Once the Reason For Outage [RFO] is available from our facility we will make it available.




#7583 Network connectivity issues. 08/12/2019

Posted by MikeDVB on 12 August 2019 - 04:54 PM in Server and Network Announcements

From my conversations with the senior network engineer at Handy Networks, our upstream facility, I've learned that this issue affected both redundant pieces of hardware responsible for routing traffic.  The primary crashed, then the secondary took over and subsequently crashed.  This kept happening while they were working to determine the cause, which explains why things would show as online for a minute or two and then go back down.

 

The mode of failure is definitely unusual and I still personally believe it to be a bug in the Juniper OS.

 

Juniper as well as Handy Networks are still working to trace the cause, and I expect to have an RFO within 72 hours.




#7555 Issues with S0 server. 04/18/2019

Posted by MikeDVB on 18 April 2019 - 04:25 PM in Server and Network Announcements

The hardware that was powering the S0 server powered off unexpectedly and without warning.  We brought S0 back online on an alternate hardware node.

We're still investigating what caused the hardware to power off.  The original hardware has been back online since the incident, but it is not presently being used to provide services to any clients while we continue to investigate.




#7573 iOS mail shows two “Junk” folders. Why?

Posted by MikeDVB on 28 May 2019 - 04:25 PM in Shared Hosting Support

One Junk folder is 'local' to the phone and one is on the server.  You can go into the phone's settings and point the 'local' one at the on-server folder.