On Sunday, September 8, 2019, routine maintenance was scheduled to upgrade MariaDB [MySQL] from version 10.1 to 10.3. Our testing, as well as public documentation, showed that MariaDB 10.3 performs substantially faster and is more efficient than 10.1. Our testing indicated that the upgrade process was nearly seamless, requiring only a couple of restarts of the MariaDB server [about a minute of MariaDB downtime in total].
We upgraded all of our shared cloud and reseller servers from 10.1 to 10.3 and spot tested numerous sites on each server before and after the upgrades to ensure that everything went smoothly.
Within about an hour of completing the maintenance, we began to receive numerous support tickets from clients on the S3 server reporting that their sites were not working, that databases were being corrupted, and a myriad of other issues related to MariaDB and databases.
We brought the server back online from a snapshot taken on August 24, 2019, and immediately began restoring data that had been changed or added to the server after that point.
Why did it happen?
Several steps are required for us to upgrade MariaDB from 10.1 to 10.3. We run a CloudLinux technology called "MySQL Governor," which is what allows CloudLinux to restrict MySQL [MariaDB] usage to your resource limits [1 CPU core]. The MySQL Governor ships its own versions of the MariaDB binaries, with additional code that allows CloudLinux to hook in, monitor, and enforce those limits.
To upgrade MariaDB from 10.1 to 10.3, we had to remove the MySQL Governor, perform the upgrade, and then reinstall the MySQL Governor. We did this on all servers and tested several sites on each server, both before and after the upgrade, to ensure things were working as expected.
Within an hour of performing the maintenance, we began to receive support tickets from clients on the S3 server reporting MariaDB connectivity problems, corrupted databases, and a myriad of other database-related issues. At the time we did not know the exact cause, but we did know it stemmed from the maintenance we had just performed.
Our post-incident investigation determined that reinstalling the MySQL Governor put the older MariaDB 10.1 binaries back in place instead of installing the new 10.3 binaries. I don't know by what mechanism this caused the corruption clients experienced, but I do know that reinstalling MariaDB 10.3 did not resolve the issue.
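A lightweight post-maintenance check could have caught this rollback before the snapshot was dropped: compare the version the running server actually reports [the output of `SELECT VERSION();`] against the version the upgrade was supposed to install. A minimal sketch of that check - the helper names here are illustrative, not part of any existing tooling:

```python
def major_minor(version_string: str) -> str:
    """Reduce a server version string like '10.1.41-MariaDB' to its major.minor form."""
    return ".".join(version_string.split("-")[0].split(".")[:2])

def upgrade_took_effect(reported_version: str, expected: str = "10.3") -> bool:
    """True only if the running binaries report the expected major.minor version."""
    return major_minor(reported_version) == expected
```

Fed the string the server reports immediately after the Governor reinstall, a check like this would have flagged that the 10.1 binaries were still in place.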
What was done to correct the issue?
As part of our standard maintenance procedures, a manual snapshot of the storage system was taken before we got started. A snapshot allows us to roll back a whole server, or even all servers, to the point the snapshot was taken, nearly instantly. This protects you and your data from loss or corruption should an upgrade fail in some catastrophic or unexpected manner that isn't otherwise recoverable.
As soon as we determined this wasn't something we could fix in place in a timely fashion, restoring to a snapshot is exactly what we chose to do.
After we verified the upgrades were successful, and before the issues with the S3 server became apparent, we dropped the snapshots we had taken prior to the maintenance. In hindsight, we should have allowed more time for potential issues to surface and kept the manual snapshot longer - at least a few hours, if not a few days. In this case we dropped the snapshot just before we actually needed it.
We would normally have snapshots every hour on the hour. However, on 08/24 we had reached out to StorPool, our storage software vendor, with some concerns about snapshots - namely, that we had accumulated a few thousand of them and didn't want to risk data corruption, data loss, performance loss, and so on. While working with them on this, automatic snapshots were temporarily disabled so the snapshot tree could be cleaned up and extraneous snapshots pruned. This took a while, and when it was done, automatic snapshots were not re-enabled.
Bringing a server online from a snapshot takes only a few minutes - about as long as it takes to identify the disk IDs in our cloud platform [a few seconds], identify the latest snapshot for those disks, and mount them. It's a fantastic way to recover - if you have a recent snapshot. In this case the closest snapshot was from 08/24, so we brought the server up from that point and immediately began restoring data added after that snapshot via our primary backup system.
The total hard downtime for the server was only about 30 minutes, spanning the MariaDB upgrade, the corruption, and bringing the server online from a snapshot. It then took about 30 hours to restore, from our primary backup system, all data added or changed since the snapshot was taken. The biggest bottleneck was the cPanel restoration system - we've already drafted plans for our own recovery script that will skip the cPanel checks and hooks and stream data straight to the server at up to 20 times the speed. Unfortunately we weren't able to build it while restoring the S3 server, as any such tool needs to be tested before going into production use.
What is being done to prevent this from happening again?
As of yesterday, we are monitoring snapshot activity via our internal monitoring system. It has been configured, and tested, to alert us if any single storage device goes longer than 6 hours without a snapshot, and to send an emergency all-staff alert if any device goes longer than 12 hours without one.
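The two thresholds above boil down to a simple age check per storage device. A sketch of that logic - the function and tier names are illustrative, not our actual monitoring code:

```python
from datetime import datetime, timedelta

WARNING_AGE = timedelta(hours=6)     # alert the on-duty team
EMERGENCY_AGE = timedelta(hours=12)  # all-staff emergency alert

def snapshot_alert_level(last_snapshot: datetime, now: datetime) -> str:
    """Classify a storage device by the age of its most recent snapshot."""
    age = now - last_snapshot
    if age > EMERGENCY_AGE:
        return "emergency"
    if age > WARNING_AGE:
        return "warning"
    return "ok"
```

Run against every device on a short interval, a check like this turns "are snapshots still happening?" from a monthly manual task into a continuous, automatic one.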
Prior to yesterday, we were manually checking for the existence of snapshots monthly as part of our Disaster Recovery Preparedness Plan, or DRPP. The DRPP was drafted after the outage we experienced in 2018, when we had no snapshots at all. Checking for the existence of valid snapshots is only a small portion of the DRPP, but it is a very important part.
To be as straightforward as I can be - we should have set up automated snapshot monitoring from the beginning. StorPool has been working on a very robust snapshot management tool that includes options such as taking and keeping a set number of snapshots per hour, per day, per week, per month, and per year, along with monitoring and alerts. While waiting for StorPool to release that tool, we had been monitoring snapshots manually rather than building our own automated monitoring.
Our Disaster Recovery Preparedness Plan has been updated as a result of this incident, and we have added new standard operating procedures for performing maintenance. While the plan already stated that snapshots were to be taken before maintenance and removed once the work was completed and verified, we've changed it so that the manual snapshot is kept for at least 24 hours after a maintenance window. While we can't prevent an upgrade from going wrong, we can make sure we protect and insulate our clients from such incidents as much as possible.
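Under the updated procedure, whether a pre-maintenance snapshot may be dropped becomes a simple time check rather than a judgment call. A minimal sketch - the 24-hour figure comes from the new procedure, while the function itself is illustrative:

```python
from datetime import datetime, timedelta

# New SOP: keep the manual pre-maintenance snapshot at least 24 hours
# past the end of the maintenance window before dropping it.
POST_MAINTENANCE_RETENTION = timedelta(hours=24)

def may_drop_snapshot(maintenance_ended: datetime, now: datetime) -> bool:
    """True once the mandatory post-maintenance retention period has elapsed."""
    return now - maintenance_ended >= POST_MAINTENANCE_RETENTION
```

Had this rule been in effect on September 8, the snapshot taken before the upgrade would still have existed when the S3 issues surfaced an hour later.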
I understand that from the client's perspective an outage is an outage - many may not care that each outage has had a distinct, new cause. We do our best to avoid outages, but we're human and we aren't perfect. When we screw up or make a mistake, we'll acknowledge and accept it, then learn from it so as not to make the same mistake twice. This is the first outage of this kind for us, and with our new operating procedures governing maintenance and snapshots it should be the last.
I am sorry for any trouble this caused you. If you have any questions or concerns, don't hesitate to respond here or to reach out directly.