Echo / Fresco - Instability over the last 12 hours

Last night around 3:30 AM EST, our Echo and Fresco servers both went offline unexpectedly due to operating system crashes. We had recently booted both servers into a newer version of the operating system (8.24) to improve disk I/O performance, and everything had been running well until last night's crashes. Because nothing on the servers had changed apart from the OS upgrade, we rolled both servers back to 8.18.


This afternoon the Fresco server crashed again, this time on 8.18, so we rolled it back to 8.17, which is the version it was running before the move to 8.24. The main problem with a crash is that we have to initiate a reboot, which takes around 5 minutes from start to finish, and the server then spends about 5 to 10 minutes playing "catch up" on the flood of requests it receives after a reboot.


Because Fresco crashed on 8.18 and was rolled back to 8.17, we have configured both Echo and Fresco so that, should either crash again, it will reboot into version 7.49, a much older and very stable version of the OS. We made this decision while working directly with the software vendor for the operating system.


We apologize for any trouble this may have caused you. Just as you don't want your site offline, we don't want your sites offline either, which is why we are working closely with our software vendor to identify and resolve these crashes so that we can move back to the newer kernel versions, which are better optimized and offer better overall performance.


If you have any questions about any of this, please feel free to ask. We always do our best to be as transparent as possible, especially when we're facing issues.


Thank you,

For anyone who hasn't kept up with recent events, this upgrade is what ultimately led to the instability: http://forums.mddhosting.com/topic/402-completed-reboots-echo-fresco-and-cypress-estimated-downtime-5-to-7-minutes/


All servers are currently on 8.17 or 8.18; however, if they do crash again (we are always watching), they are set to reboot into 7.49. We understand how extremely frustrating unexpected downtime can be. We do have monitoring in place so that, within 60 seconds of a server going offline, we can manually kick off a reboot and bring it back online as quickly as possible.


We're crossing our fingers that 8.17 and 8.18 stay rock solid and that no more reboots are required until our software vendor confirms that all of the crash issues in 8.24 are resolved.

