Michael D. Posted January 12, 2016
Hello! One of the great features of our new infrastructure is that we can migrate a server from one piece of hardware to another seamlessly, which lets us perform maintenance without downtime. Unfortunately, the network cards in the new hardware were not updated to the latest firmware, resulting in occasional odd latency issues. We've been monitoring both servers on a second-by-second basis since we resolved the LiteSpeed issue, and the server has been stable. The problem appears when we do try to migrate the server seamlessly: the host goes unresponsive for 30 to 60 seconds as the network interface crashes and restarts. We will be bringing down the S1 and R1 servers tonight for approximately 5 minutes [as long as it takes to shut down and boot back up] to update the networking firmware, after which we should be able to perform future maintenance without any scheduled downtime whatsoever. We expect to begin around 9 PM ET, with downtime no greater than 5 minutes. We do apologize for the growing pains we are experiencing with this new hardware. It's a completely different setup from what we have run for years, and we're running into small edge cases we didn't anticipate. If you have any questions about this, let us know. We'll keep it as seamless as possible, and there is nothing you need to do.
Michael D. Posted January 13, 2016
It looks like the S1 server's networking card has gone unresponsive. We're bringing it back online on another piece of hardware, and it should be available momentarily.
Michael D. Posted January 13, 2016
S1 is back online on another piece of hardware. About 1 hour 20 minutes until we begin updating firmware on the networking controllers.
Michael D. Posted January 13, 2016
It looks like we can no longer wait to flash this update; we're going to have to proceed immediately.
Michael D. Posted January 13, 2016
S1 should be stable at this time. Everything is updated and running smoothly.
Michael D. Posted January 18, 2016
For those who want more detail, this is the issue we're currently facing with the S1 server:

Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#44 stuck for 23s! [imap-login:1021336]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#11 stuck for 22s! [migration/11:159]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#1 stuck for 144s! [mysqld:807721]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#17 stuck for 140s! [mysqld:4000]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#22 stuck for 133s! [migration/22:214]
Message from syslogd@s1 at Jan 16 17:17:27 ...
kernel:BUG: soft lockup - CPU#19 stuck for 151s! [mysqld:4374]

When this happens, we also lose connectivity to our server storage. The trouble is that we're unsure whether the storage losing connectivity is causing the CPU errors or the CPU errors are causing the storage connectivity issues. The plan today is for me to spin up several test systems on the new hardware and work as hard as I can to identify and isolate the cause. If we need new network cards, we'll get them. If we need to change the operating system, we'll do it. At the end of the day, I want to apologize deeply for any and all trouble this issue is causing you. We're not any happier about it than you are, and we will have it resolved as quickly as we possibly can. I'm going to do my best to keep this thread up to date so you know what is going on. I have another update and will post it in a moment to keep it separate from this one.
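For the curious: a "soft lockup" means the kernel's watchdog saw a CPU go too long without scheduling. Messages like the above can be triaged with standard text tools; here is a minimal sketch using the log lines pasted above (the /tmp scratch path is just for illustration, not the server's real log location):

```shell
# Save the soft-lockup lines from the post to a scratch file for triage.
cat > /tmp/s1_lockups.log <<'EOF'
kernel:BUG: soft lockup - CPU#44 stuck for 23s! [imap-login:1021336]
kernel:BUG: soft lockup - CPU#11 stuck for 22s! [migration/11:159]
kernel:BUG: soft lockup - CPU#1 stuck for 144s! [mysqld:807721]
kernel:BUG: soft lockup - CPU#17 stuck for 140s! [mysqld:4000]
kernel:BUG: soft lockup - CPU#22 stuck for 133s! [migration/22:214]
kernel:BUG: soft lockup - CPU#19 stuck for 151s! [mysqld:4374]
EOF
# Extract "CPU <n>: <seconds>s" and sort by longest stall first,
# to see which CPUs (and which processes) were hit hardest.
sed -n 's/.*CPU#\([0-9]*\) stuck for \([0-9]*\)s.*/CPU \1: \2s/p' /tmp/s1_lockups.log |
  sort -t: -k2 -rn
```

Sorting by stall length makes it easy to see that the mysqld CPUs were stuck for over two minutes, while the imap-login and migration threads stalled only briefly.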
Michael D. Posted January 18, 2016
The S1 server had another brief outage; this time, however, the OS marked the file system as read-only, and we need to perform a file system integrity check. Thanks to the speed of the new hardware this should not take long, but given the amount of data it could take up to an hour or two. Ideally it will be done in 10 to 15 minutes. I'll keep this thread updated. We really need to do this as soon as possible.
Michael D. Posted January 18, 2016
The server is online; however, you may have issues accessing some files through the browser. If you do, open a ticket and we'll get it fixed for you quickly. We're still going to perform the fsck, but first I'm going to try to send an email to everybody so you know what's going on even if you aren't watching this thread.
Michael D. Posted January 18, 2016
It looks like a glitch in internal communication has resulted in the fsck beginning now. We'll keep this thread updated.
Michael D. Posted January 18, 2016
The file system check is now on Pass 2 (directory structure).
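For readers wondering what the passes mean: on ext filesystems, e2fsck walks five of them (inodes/blocks/sizes, directory structure, directory connectivity, reference counts, group summaries). Here's a hedged sketch of what the check is doing, demonstrated on a throwaway image file rather than a real disk (paths and sizes are purely illustrative):

```shell
# Build a tiny ext2 filesystem inside a regular file so no real disk is touched.
dd if=/dev/zero of=/tmp/demo_fs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/demo_fs.img
# -f forces a full check even on a clean filesystem; -y auto-answers any
# repair prompts. The output walks Pass 1 through Pass 5, with
# "Pass 2: Checking directory structure" being the stage mentioned above.
e2fsck -f -y /tmp/demo_fs.img
```

Pass 2 is usually the slowest stage on servers with many small files, since it reads every directory entry, which is why progress can appear to stall there.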
Michael D. Posted January 18, 2016
It is still in Pass 2.
ericr Posted January 18, 2016
The server is booting up at this time.
ericr Posted January 18, 2016
It is now running its boot-up defrag and will be online soon. It is currently 75.2% of the way through the process.
Michael D. Posted January 18, 2016
And now the system has forced another fsck. This one is just a quick check, so the server should be up very soon.
Michael D. Posted January 18, 2016
The forced check is at 81%.
ericr Posted January 18, 2016
It is now at 86%.
Michael D. Posted January 18, 2016
It is at 95%.
Michael D. Posted January 18, 2016
Completed and rebooting.
Michael D. Posted January 18, 2016
The system has gone into a file system check loop even though it's not repairing anything. We're working on getting it out of this loop.
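One way such a loop is typically broken on ext filesystems, once the volume is known to be clean, is to clear the periodic-check counters with tune2fs. A sketch on a throwaway image file (this may not be the exact fix applied to S1):

```shell
# Make a scratch ext2 image so these commands are safe to run anywhere.
dd if=/dev/zero of=/tmp/loop_fs.img bs=1M count=4 2>/dev/null
mke2fs -q -F /tmp/loop_fs.img
# -c 0 disables the mount-count trigger and -i 0 the time-interval trigger,
# so the filesystem is no longer scheduled for automatic re-checks at boot.
tune2fs -c 0 -i 0 /tmp/loop_fs.img
# Confirm: max mount count becomes -1 and the check interval becomes 0.
tune2fs -l /tmp/loop_fs.img | grep -E 'Maximum mount count|Check interval'
```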
ericr Posted January 18, 2016
I apologize for the delay. After the boot the server hung again, and we needed to resolve the underlying issues. Those issues are now resolved and the server is online.
ericr Posted January 18, 2016
The server has initiated a crash dump and a restart. We are investigating the cause.
ericr Posted January 18, 2016
We have found the cause of the restart loop: our backup software was triggering a kernel panic, which caused the server to initiate a reboot. It has been disabled, and the server is currently online.
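Disabling a kernel-mode backup driver generally means unloading the module and blacklisting it so it cannot load again at boot. A hedged sketch of the blacklist part; the module name "hcpdriver" (commonly used by R1Soft's CDP agent) is an assumption, and the demo directory stands in for the real /etc/modprobe.d so the sketch needs no root:

```shell
# Demo stand-in for /etc/modprobe.d; on a real server you would write there
# and first unload the live module with: modprobe -r hcpdriver
MODPROBE_D=/tmp/demo-modprobe.d
mkdir -p "$MODPROBE_D"
# A one-line blacklist entry stops the module from auto-loading at boot.
echo "blacklist hcpdriver" > "$MODPROBE_D/blacklist-hcpdriver.conf"
cat "$MODPROBE_D/blacklist-hcpdriver.conf"
```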
ericr Posted January 18, 2016
I am currently looking into LiteSpeed issues that are causing pages not to load.
ericr Posted January 18, 2016
The server has kernel panicked again. Sadly, R1Soft was not the core cause; we are continuing to search for the reason behind this fault.
Michael D. Posted January 18, 2016
The system is out of the fsck loop; however, it is now producing kernel panics on boot. We disabled one kernel module that we believed was responsible, and the server stayed online longer this time around. We're still working on this.