For those worried that something like this could happen again:
We have already enabled snapshots on our storage cluster: one snapshot every hour, retained for 10 hours.
So, hypothetically, let's say this did manage to happen again. We would simply mount a snapshot taken before the incident (at most an hour old) and boot everything back up.
Total downtime would be roughly 5 minutes for the whole network. Would there be any data loss? Possibly whatever was written in the hour (or less) before the snapshot, but nothing compared to the losses of a multi-day outage.
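The policy described above (hourly snapshots, kept for 10 hours) boils down to a simple pruning rule. This is an illustrative sketch of that retention logic, not StorPool's actual API; the function and names are hypothetical:

```python
from datetime import datetime, timedelta

# Assumed retention window from the policy above: keep 10 hours of snapshots.
RETENTION = timedelta(hours=10)

def prune(snapshot_times, now):
    """Return only the snapshot timestamps still inside the retention window."""
    cutoff = now - RETENTION
    return [t for t in snapshot_times if t > cutoff]

# Example: hourly snapshots taken over the last 12 hours.
now = datetime(2024, 1, 10, 12, 0)
snaps = [now - timedelta(hours=h) for h in range(12)]

kept = prune(snaps, now)
# The 10 most recent hourly snapshots survive; older ones are pruned.
# Worst-case data loss after a rollback is the gap since the newest
# pre-incident snapshot, i.e. under one hour.
```

The point of the sketch is the recovery math: with an hourly cadence, the newest usable snapshot is never more than an hour old, which bounds the possible data loss.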
It would literally look like we just shut everything down and booted it back up. No 'restorations', no lost emails, nothing. There's a good chance almost nobody would even notice.
This is something our storage vendor, StorPool, set up for us immediately upon seeing what had happened. They actually apologized that it was not already in place and said that, as a result of our disaster, they are going to make snapshots a default that has to be actively disabled rather than the other way around.
Even with these snapshots, as powerful as they are, we are still going to overhaul our backup servers. We have identified the issues in the current setup that made restorations so slow, and we already have fixes planned for once we are fully online and all of our clients are taken care of.
Snapshots are a very powerful tool against data loss and corruption. We have actually used them a couple of times on our old storage platform, the Nimble CS500, to recover data on servers after clients made big mistakes of their own.