

MikeDVB


Topics I've Started

November 18, 2018 - DDoS Attack

18 November 2018 - 09:23 AM

Hello!

 

It's been a while since we've seen a decent DDoS attack - something large enough that our facility would take any sort of proactive action against it.

 

A decently sized DDoS attack started hitting our network this morning, on the order of 8 to 9 Gbps.  Our facility saw this traffic and began proactively putting blocks in place, resulting in some IP addresses showing as offline.  Only a few IPs were affected due to the very targeted nature of this attack.

 

Our network is capable of absorbing attacks of this size so for now we've asked the facility to rescind the blocks so that we can just absorb this attack.  All services are online and operational at this time.

 

If there are any major changes we'll update this thread.


Minor Test of Storage Snapshots Post-Disaster

23 October 2018 - 04:51 PM

As many of you already know, we experienced a major disaster at the end of September where we were forced to restore data from a backup server due to a misconfiguration of snapshots on our storage platform.  Since then we've resolved the snapshot issue, and today we were able to use the snapshots to help a client - we couldn't be happier with the experience.

 

We had a client today whose employee submitted a cancellation for services that shouldn't have been canceled.  The result was 43 accounts being terminated that shouldn't have been.

 

In the past, what we'd normally have done is turn to our backup server, which takes backups once per day, and restore from the most recent point we could.  That point could have been as much as 24 hours prior to the termination of the services.

 

What we did instead - since it would give the client more recent data and give us a good test of working with snapshots - was to mount a snapshot of the server taken just prior to the termination of the services.  We booted that server up, generated backups for the accounts (which was exceptionally fast thanks to the SSD storage), and then restored those backups to the live server.
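
 

For anyone curious what that looked like mechanically, here is a rough sketch in Python.  The storage commands, hostnames, and account names are illustrative assumptions (a ZFS-style snapshot/clone plus cPanel's stock pkgacct / restorepkg scripts), not necessarily our exact tooling:

    # Illustrative sketch only - assumes a ZFS-style snapshot that can be cloned
    # read-write and cPanel's standard pkgacct / restorepkg scripts. All names are made up.
    import subprocess

    ACCOUNTS = ["acct01", "acct02"]          # the accounts that were terminated (hypothetical)
    SNAPSHOT = "tank/p1@pre-termination"     # snapshot taken just before the cancellation
    CLONE    = "tank/p1-recovery"            # temporary read-write clone of that snapshot

    def run(cmd):
        """Run a command and stop if it fails."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Clone the snapshot so a recovery copy can be booted without touching live data.
    run(["zfs", "clone", SNAPSHOT, CLONE])

    # 2. Boot a recovery server from the clone (hypervisor-specific, omitted here).

    # 3. On the recovery server, package each affected account (fast on SSD storage).
    for acct in ACCOUNTS:
        run(["ssh", "recovery-server", "/scripts/pkgacct", acct, "/home/recovery"])

    # 4. Pull the packages over and restore them onto the live server.
    for acct in ACCOUNTS:
        run(["scp", f"recovery-server:/home/recovery/cpmove-{acct}.tar.gz", "/home/recovery/"])
        run(["/scripts/restorepkg", f"/home/recovery/cpmove-{acct}.tar.gz"])

    # 5. Clean up the clone once everything checks out.
    run(["zfs", "destroy", CLONE])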

 

In the event of an actual disaster we'd simply mount the snapshots and boot them up without doing any backups or restorations and services would immediately come back online as they were when the snapshot was taken.

 

All in all we're very happy with this, as we were able to provide the client with more recent data than would otherwise have been possible, and much faster than a normal restore would have allowed.

 

The process was extremely simple and straightforward.  While we don't ever plan on needing snapshots for disaster recovery, it is good to know that if we do, they are available and work very well for that purpose.


DDoS Attack Mitigation on P1

21 October 2018 - 12:34 AM

I'm not sure if anybody noticed the attack - our internal monitoring did, and since it runs on the same IP that was targeted, Pingdom also registered some downtime while the rest of the server remained online.

 

We've mitigated the attack and have reached out to the client that is the target to keep them in the loop.

 

So far so good on mitigating the attack - but we did also move the affected client to their own IP just in case we have to take more drastic actions.  We're keeping an eye on the situation but so far the server is online and 100% operational :).


S3 - Recent Intermittent Brief Outages

20 October 2018 - 04:03 AM

We identified an issue where 'doveadm', a tool that is part of the Dovecot IMAP mail server, can in a specific situation go into an endless loop, using more and more RAM.  This single process in a single account had grown to 20 GB of RAM usage on its own - enough to put pressure on the system to the point that things slow down and become very slow to respond.

 

(Screenshot attachment: 2018-10-20_04-18-35.png)

 

We have reached out to the account owner to inform them of this - not that it is specifically their fault - but so that they are aware of what is going on and the changes we've made to their account.

 

We're also reporting this issue to cPanel, as they provide Dovecot to us, and reporting it to them is, I believe, the best course of action to get this behavior changed/fixed.  While it will likely need to be the Dovecot developers themselves who fix this - cPanel has a working relationship with them and can likely get this fixed faster than we could.

 

Additionally we're reaching out to CloudLinux about this as it should not be possible for a process owned by a user to consume this much system memory.

 

Beyond that we've also made some changes to Dovecot configurations network-wide to help prevent this from causing issues for any other servers until the core issue is fixed.
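
 

For reference - and purely as an illustration rather than the exact change we rolled out - Dovecot exposes a per-process address-space cap that will stop a service process before it can grow to tens of gigabytes.  Something along these lines in the Dovecot configuration is one way to contain a runaway process (the value here is an assumption):

    # Illustration only - value is an assumption, not necessarily what was deployed.
    default_vsz_limit = 1G   # ceiling on a single Dovecot service process's address space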

 

I just wanted to provide some details as to what has been going on with the S3 server, as it has had some intermittent and short-lived downtime over the last few days that we have been working to track down.  Thankfully we were able to trace it tonight, and ideally - assuming there are no other causes - the server should be rock-solid stable again.

 

We're definitely keeping an eye on it :).


Processor Upgrades - Faster Single-Threaded Performance & Higher Overall Performance

18 October 2018 - 09:46 AM

Why we're upgrading processors.

 

The new processors will have the same number of cores and threads as our old processors; however, they will be clocked roughly 28% higher at 2.3 GHz, compared to the 1.8 GHz processors currently in our compute nodes.

 

Originally, when we set up our compute cluster, we were billed for power by our actual metered usage and, as such, we went with low-power-consumption processors to keep our usage down.  We are now billed by the circuit, meaning that regardless of how much power we use, we pay the same amount so long as we aren't drawing more than the circuit can provide.  The new processors we will be rolling out are rated at 120 Watts versus the 65 Watts of our current processors - roughly 85% more power per processor.
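
 

For the curious, the rough math behind those two figures:

    2.3 GHz / 1.8 GHz ≈ 1.28  ->  about a 28% higher base clock
    120 W / 65 W ≈ 1.85       ->  about 85% more power per processor (55 W extra)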

 

While it's less important for a server than a desktop or laptop - the current processors have a turbo frequency of 2.5 GHz and the new processors turbo at 3.1 GHz.  Servers do not get to use Turbo frequencies as often as desktops and laptops due to their fairly constant workload but it does happen occasionally.

 

When the server hosting *your* site will be upgraded.

 

We will be placing our heaviest servers on the new processors first as they have the most client accounts and we want as many people to benefit from the upgrades as possible.  We always watch our resource usage carefully - especially CPU - and will lean towards leaving breathing room for the servers to ensure smooth and stable operation post-upgrade.

 

As of right now, the S1, P1, and R1 servers will most likely be placed on the upgraded processors when the first upgrades are completed.  We will evaluate CPU usage once these servers are on the new processors to determine whether or not we can move any additional hosting servers over during the first upgrade.  We will lean towards not overloading the servers and ensuring stable performance.

 

We will be upgrading all compute nodes as soon as we're able to do so and, as a result, everybody will be getting upgraded at some point in the near future.  We have been planning on upgrading the processors in our compute nodes for some time now.  We're a small provider and we do our best to continue investing in our platform to increase performance and reliability.  Unfortunately upgrading the processors is not a cheap endeavor and is something that we're going to have to work on over time.

 

How the maintenance will be performed.

 

The plan is to perform the upgrades one hardware node at a time.  We may need to shut down and boot back up some hosting servers as we move them around during the upgrades; this process takes approximately 30 to 60 seconds per hosting server.  While we can live-migrate the hosting servers, we have found that, due to how busy they are, live migration can actually impact performance for much longer than the 30 to 60 seconds a shutdown and boot-up takes.  In short, from an actual ability-to-use-the-service standpoint, we've found it's better to shut down and boot up than to live-migrate.
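
 

For those interested in the mechanics, here is a rough sketch of what a per-server move can look like.  The hypervisor tooling shown (libvirt/virsh) and all names are assumptions for illustration, not a statement about our actual stack:

    # Illustration only - assumes a libvirt/KVM-style setup with shared storage and
    # the guest already defined on both nodes. Names are made up.
    import subprocess, time

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def state(host, domain):
        out = subprocess.run(["ssh", host, "virsh", "domstate", domain],
                             capture_output=True, text=True)
        return out.stdout.strip()

    def move_guest(domain, source_host, target_host):
        """Cleanly shut the guest down on the old node, then boot it on the new one."""
        run(["ssh", source_host, "virsh", "shutdown", domain])
        while state(source_host, domain) != "shut off":
            time.sleep(2)                 # total downtime is typically 30-60 seconds
        run(["ssh", target_host, "virsh", "start", domain])

    move_guest("s1-hosting-server", "old-compute-node", "new-compute-node")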

 

Due to the extremely short duration of any downtime that will occur, the lead time on the restarts may be shorter than usual. We will update this thread with details as we perform the upgrades and will also email anybody on a server that is being upgraded prior to performing the work.

 

If you have any questions about any of this feel free to reply to this thread or to open a support ticket referencing this thread.

 

Thank you!