R1 Server - Slowness / Instability - Cause and Resolution

Today our internal monitoring started alerting us to a lack of available CPU on the R1 server.


Our initial investigation focused on identifying what was using so much CPU, but the actual cause was not inside the server itself.


We run a highly available setup, so that if a piece of hardware running a server fails, the server migrates and comes back online on a spare, empty piece of hardware. In this case we had a hardware failure and the server did migrate successfully, but it landed on another busy host rather than on an empty spare.


Once we identified the cause of the CPU issues, namely overcommitted CPU on the host, we began moving servers between pieces of hardware to balance CPU usage and ensure every server has free CPU to work with. This took approximately 30 minutes, as live-migrating servers takes some time.
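
For readers curious what "overcommitted CPU" means in practice, here is a minimal sketch, not our actual tooling. It assumes overcommit is measured as allocated virtual CPUs divided by physical cores, and that a failover target is chosen as the host that would have the lowest ratio after taking the migrated server. The `Host` class, `pick_failover_target` function, host names, and core counts are all hypothetical and for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical model of one piece of hardware and the servers on it.
    name: str
    physical_cores: int
    guest_vcpus: list[int] = field(default_factory=list)

    def overcommit_ratio(self) -> float:
        """Allocated vCPUs divided by physical cores; above 1.0 means overcommitted."""
        return sum(self.guest_vcpus) / self.physical_cores


def pick_failover_target(hosts: list[Host], vcpus_needed: int) -> Host:
    """Choose the host whose ratio would be lowest after adding the migrated server."""
    return min(
        hosts,
        key=lambda h: (sum(h.guest_vcpus) + vcpus_needed) / h.physical_cores,
    )


# Illustrative numbers only: one busy host and one empty spare.
hosts = [
    Host("busy-host", physical_cores=16, guest_vcpus=[8, 8, 4]),
    Host("spare-host", physical_cores=16),
]
target = pick_failover_target(hosts, vcpus_needed=8)
print(target.name, hosts[0].overcommit_ratio())
```

In this sketch the busy host is already at 20 vCPUs on 16 cores, so placing another 8-vCPU server there would leave every server on it competing for CPU, which is the kind of situation described above.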


At this time R1 is stable again and should be back to normal speed, if not faster than it has been recently.


There was no data loss, data corruption, or any *actual* downtime, although the server was sluggish to the point of being nearly unresponsive off and on during the issue and investigation. If you do have any issues, please open a ticket and we will be happy to check.

