Michael D. Posted January 12, 2016
Hello! One of the great features of our new infrastructure is that we can migrate a server from one piece of hardware to another seamlessly, which lets us perform maintenance without downtime. Unfortunately, the network cards in the new hardware were not updated to the latest firmware, resulting in occasional odd latency issues. We've been monitoring both servers on a second-by-second basis since we resolved the LiteSpeed issue, and the server has been stable. The problem appears when we do try to migrate the server seamlessly: the host goes unresponsive for 30 to 60 seconds as the network interface crashes and restarts. We will be bringing down the S1 and R1 servers tonight for approximately 5 minutes [as long as it takes to shut down and boot back up] to update the networking firmware, after which we should be able to perform future maintenance without any scheduled downtime whatsoever. We expect to begin around 9 PM ET, with downtime no greater than 5 minutes. We do apologize for the growing pains we are experiencing with this new hardware. It's a completely different setup from what we have run for years, and we're running into small edge cases we didn't anticipate. If you have any questions about this, let us know. We'll keep it as seamless as possible, and there is nothing you need to do.
Michael D. Posted January 13, 2016
It looks like the S1 server's networking card has gone unresponsive. We're bringing it back online on another piece of hardware, and it should be available momentarily.
Michael D. Posted January 13, 2016
S1 is back online on another piece of hardware. About 1 hour 20 minutes until we begin updating firmware on the networking controllers.
Michael D. Posted January 13, 2016
It looks like we can no longer wait to flash this update; we're going to have to proceed immediately.
Michael D. Posted January 13, 2016
S1 should be stable at this time. Everything is updated and running smoothly.
Michael D. Posted January 18, 2016
For those who want more detail, this is the issue we're currently facing with the S1 server:

Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#44 stuck for 23s! [imap-login:1021336]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#11 stuck for 22s! [migration/11:159]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#1 stuck for 144s! [mysqld:807721]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#17 stuck for 140s! [mysqld:4000]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#22 stuck for 133s! [migration/22:214]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#19 stuck for 151s! [mysqld:4374]

When this happens, we also lose connectivity to our server storage. The trouble is that we're unsure whether the storage losing connectivity is causing the CPU errors or the CPU errors are causing the storage connectivity issues. The plan today is for me to spin up several test systems on the new hardware and work as hard as I can to identify and isolate the cause. If we need new network cards, we'll get them. If we need to change the operating system, we'll do it. At the end of the day, I want to apologize deeply for any and all trouble this issue is causing you. We're not any happier about it than you are, and we will have it resolved as quickly as we possibly can. I'm going to do my best to keep this thread up to date so you know what is going on. I have another update and will post it in a moment to keep it separate from this one.
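For the curious: a "soft lockup" means the kernel's watchdog saw a CPU go too long without scheduling. Messages like the above can be triaged with standard text tools; here is a minimal sketch using the log lines pasted above (the /tmp scratch path is just for illustration, not the server's real log location):

```shell
# Save the soft-lockup lines from the post to a scratch file for triage.
cat > /tmp/s1_lockups.log <<'EOF'
kernel:BUG: soft lockup - CPU#44 stuck for 23s! [imap-login:1021336]
kernel:BUG: soft lockup - CPU#11 stuck for 22s! [migration/11:159]
kernel:BUG: soft lockup - CPU#1 stuck for 144s! [mysqld:807721]
kernel:BUG: soft lockup - CPU#17 stuck for 140s! [mysqld:4000]
kernel:BUG: soft lockup - CPU#22 stuck for 133s! [migration/22:214]
kernel:BUG: soft lockup - CPU#19 stuck for 151s! [mysqld:4374]
EOF
# Extract "CPU <n>: <seconds>s" and sort by longest stall first,
# to see which CPUs (and which processes) were hit hardest.
sed -n 's/.*CPU#\([0-9]*\) stuck for \([0-9]*\)s.*/CPU \1: \2s/p' /tmp/s1_lockups.log |
  sort -t: -k2 -rn
```

Sorting by stall length makes it easy to see that the mysqld CPUs were stuck for over two minutes, while the imap-login and migration threads stalled only briefly.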
Michael D. Posted January 18, 2016
The S1 server had another brief outage; this time, however, the OS marked the file system as read-only, and we need to perform a file system integrity check. Thanks to the speed of the new hardware this should not take long, but given the amount of data it could take up to an hour or two. Ideally it will be done in 10 to 15 minutes. I'll keep this thread updated. We really need to do this as soon as possible.
Michael D. Posted January 18, 2016
The server is online; however, you may have issues accessing some files through the browser. If you do, open a ticket and we'll get it fixed for you quickly. We're still going to perform the fsck, but first I'm going to try to send an email to everybody so you know what's going on even if you aren't watching this thread.
Michael D. Posted January 18, 2016
It looks like a glitch in internal communication has resulted in the fsck beginning now. We'll keep this thread updated.
Michael D. Posted January 18, 2016
The file system check is now on Pass 2 (directory structure).
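For readers wondering what the passes mean: on ext filesystems, e2fsck walks five of them (inodes/blocks/sizes, directory structure, directory connectivity, reference counts, group summaries). Here's a hedged sketch of what the check is doing, demonstrated on a throwaway image file rather than a real disk (paths and sizes are purely illustrative):

```shell
# Build a tiny ext2 filesystem inside a regular file so no real disk is touched.
dd if=/dev/zero of=/tmp/demo_fs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/demo_fs.img
# -f forces a full check even on a clean filesystem; -y auto-answers any
# repair prompts. The output walks Pass 1 through Pass 5, with
# "Pass 2: Checking directory structure" being the stage mentioned above.
e2fsck -f -y /tmp/demo_fs.img
```

Pass 2 is usually the slowest stage on servers with many small files, since it reads every directory entry, which is why progress can appear to stall there.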
Michael D. Posted January 18, 2016
It is still in Pass 2.
ericr Posted January 18, 2016
The server is booting up at this time.
ericr Posted January 18, 2016
It is now running its boot-up defrag and will be online soon. It is currently 75.2% of the way through the process.
Michael D. Posted January 18, 2016
And now the system has forced another fsck. This one is just a quick check, so the server should be up very soon.
Michael D. Posted January 18, 2016
The forced check is at 81%.
ericr Posted January 18, 2016
It is now at 86%.
Michael D. Posted January 18, 2016
It is at 95%.
Michael D. Posted January 18, 2016
Completed and rebooting.
Michael D. Posted January 18, 2016
The system has gone into a file system check loop even though it's not repairing anything. We're working on getting it out of this loop.
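One way such a loop is typically broken on ext filesystems, once the volume is known to be clean, is to clear the periodic-check counters with tune2fs. A sketch on a throwaway image file (this may not be the exact fix applied to S1):

```shell
# Make a scratch ext2 image so these commands are safe to run anywhere.
dd if=/dev/zero of=/tmp/loop_fs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/loop_fs.img
# -c 0 disables the mount-count trigger and -i 0 the time-interval trigger,
# so the filesystem is no longer scheduled for automatic re-checks at boot.
tune2fs -c 0 -i 0 /tmp/loop_fs.img
# Confirm: max mount count becomes -1 and the check interval becomes 0.
tune2fs -l /tmp/loop_fs.img | grep -E 'Maximum mount count|Check interval'
```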
ericr Posted January 18, 2016
I apologize for the delay. After the boot the server hung again, and we needed to resolve the underlying issues. Those issues are now resolved and the server is online.
ericr Posted January 18, 2016
The server has initiated a crash dump and a restart. We are investigating the cause.
ericr Posted January 18, 2016
We have found the cause of the restart loop: our backup software was triggering a kernel panic, which caused the server to initiate a reboot. It has been disabled, and the server is currently online.
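Disabling a kernel-mode backup driver generally means unloading the module and blacklisting it so it cannot load again at boot. A hedged sketch of the blacklist part; the module name "hcpdriver" (commonly used by R1Soft's CDP agent) is an assumption, and the demo directory stands in for the real /etc/modprobe.d so the sketch needs no root:

```shell
# Demo stand-in for /etc/modprobe.d; on a real server you would write there
# and first unload the live module with: modprobe -r hcpdriver
MODPROBE_D=/tmp/demo-modprobe.d
mkdir -p "$MODPROBE_D"
# A one-line blacklist entry stops the module from auto-loading at boot.
echo "blacklist hcpdriver" > "$MODPROBE_D/blacklist-hcpdriver.conf"
cat "$MODPROBE_D/blacklist-hcpdriver.conf"
```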
ericr Posted January 18, 2016
I am currently looking into LiteSpeed issues that are causing pages not to load.
ericr Posted January 18, 2016
The server has kernel panicked again. Sadly, R1Soft was not the core cause; we are continuing to search for the reason behind this fault.
Michael D. Posted January 18, 2016
The system is out of the fsck loop; however, it is now producing kernel panics on boot. We disabled one kernel module that we believed was responsible, and the server stayed online longer this time around. We're still working on this.