Resolved Atlantis (VPS Node) Instability

Michael D. · June 19, 2012

Hello,

We upgraded our VPS clients from OpenVZ (RHEL) 5.X to 6.X when we moved everybody to the new hardware, as the new version supposedly has a lot of very nice features, performance improvements, optimizations, and so on. Unfortunately, while it does have a lot of that, we've found that the old style of memory calculation (UBC) isn't stable and the new style (vSwap) is not long-term stable either (i.e. it will cause the server to crash about once per day or every two days).

We did research the stability of the newer revision of OpenVZ before provisioning the server and moving customers to it, and we also tested the OS for a while to ensure it was stable, but it turns out that it's actually not.

With the new vSwap, we are seeing issues such as the ones below (the kernel reports being out of memory and kills processes) when the VPS is nowhere near being out of RAM, and it results in a kernel panic like the one in the image after this quote:

Jun 17 21:41:46 atlantis kernel: [142657.283383] OOM killed process 586314 (php) vm:49444kB, rss:30372kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.484016] OOM killed process php (pid=586314, ve=175) exited, free=341522.
Jun 17 21:41:46 atlantis kernel: [142657.484024]
Jun 17 21:41:46 atlantis kernel: [142657.583947] >>> 175 oom generation 613 starts
Jun 17 21:41:46 atlantis kernel: [142657.583958] 586889 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0
Jun 17 21:41:46 atlantis kernel: [142657.585810] Out of memory in UB: Kill process 586514 (php) score 29 or sacrifice child
Jun 17 21:41:46 atlantis kernel: [142657.585971] OOM killed process 586514 (php) vm:49444kB, rss:30628kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.684435] OOM killed process php (pid=586514, ve=175) exited, free=348772.
Jun 17 21:41:46 atlantis kernel: [142657.684442]
Jun 17 21:41:46 atlantis kernel: [142657.889579] >>> 175 oom generation 614 starts
Jun 17 21:41:46 atlantis kernel: [142657.889588] 586950 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0
Jun 17 21:41:46 atlantis kernel: [142657.891072] Out of memory in UB: Kill process 586520 (php) score 29 or sacrifice child
Jun 17 21:41:46 atlantis kernel: [142657.891227] OOM killer in rage, 1 tasks killed in ub 175
Jun 17 21:41:46 atlantis kernel: [142657.891303] OOM killed process 586520 (php) vm:49444kB, rss:30680kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.981907] OOM killed process php (pid=586520, ve=175) exited, free=351070.

http://www.screen-shot.net/2012-06-18_2326.png

We are going to do what we can to make things stable. The fastest solution would result in the most downtime: backing the server up in full, doing a full OS reinstall, and then restoring the VPS data. This would mean an estimated 6 to 12 hours of downtime, but it would restore the stability you have come to expect from us.

The other option is ordering another brand new server that is identical to the current one, migrating everybody over to it, and then finding something to do with this "old" server. This would take the longest as we would have to order the new hardware and wait 7 to 14 days for it to arrive. In the meantime the server is likely going to continue having issues.

We are evaluating all of the options, and trying to get things stable as quickly as possible.
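
For anyone who wants to tally OOM kills like the ones quoted above, here is a minimal sketch, not something taken from the node itself. It assumes the kernel messages land in a plain-text log file whose lines look like the excerpt, and the function name count_oom_by_ve is made up for illustration. It counts the "exited" lines per container ID (the ve= field).

```shell
#!/bin/sh
# Count OOM-killed processes per container ID (the ve= field).
# Only the "... (pid=N, ve=ID) exited, free=..." lines carry ve=,
# so each killed process is counted once.
# Usage (log path is an assumption): count_oom_by_ve /var/log/messages
count_oom_by_ve() {
    sed -n 's/.*OOM killed process .*ve=\([0-9][0-9]*\)).*exited.*/\1/p' "$1" |
        sort -n |
        uniq -c |
        awk '{ print $2, $1 }'   # print "<container-id> <kill-count>"
}
```

Run against the excerpt above, this would print "175 3": three processes killed in container 175.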

Michael D. · June 19, 2012

We have set everybody back to vSwap - it's not stable (i.e. we may end up needing to do a daily 5 minute reboot) but at least while it's up and running, things will "work".

If you have any questions at all, let us know.

Michael D. · June 19, 2012

After discussing this with our team, we've decided to order an identical server, configure it with OpenVZ 5.X, and migrate all existing VPS customers over to it. We're going to place the order immediately and hope to have the server online between Friday and Monday of this coming week.

In the meantime, everybody is going to stay on vSwap and we are going to perform a nightly reboot at 10 PM EST. Between reboots we do have monitoring by the second on the server and in the event that the server does lock up/panic we will immediately reboot it as well.

We apologize for this instability and these issues, and we are going to do everything we can to stabilize this as soon as possible.

Michael D. · June 20, 2012

We are going to issue a reboot now, to help ensure that the server doesn't go down unexpectedly. We're doing it later than we planned (originally planned at 10 PM) and I apologize for that.

Michael D. · June 21, 2012

We are issuing the next nightly reboot now. The new hardware will be here on Friday with any luck, but likely will be Monday.

Michael D. · June 22, 2012

The server just crashed on its own, and we're rebooting it now. The node itself takes about 2 to 3 minutes to reboot, and then each VPS starts up one at a time (so as not to overload the server and make it take longer). Due to the nature of the crash, each VPS will boot up, and once they are all booted they will go offline one at a time to fix their disk quotas, which takes a few minutes apiece.

If you have any questions, let us know - hopefully this new server will be here TOMORROW!

frankacter · June 22, 2012

Will you still be performing the scheduled clean reboot later in the day?

Michael D. · June 22, 2012

No, a reboot is a reboot - so we'll leave it on at this point.

Michael D. · June 22, 2012

The server has crashed again, and is rebooting again. I'm going to check now with the facility to see if the new hardware has arrived *crosses fingers*.

Michael D. · June 22, 2012

All VPS on the node have successfully been re-started.

Michael D. · June 22, 2012

We've received everything but the server chassis itself, which it looks like won't arrive until Monday. We'll keep this thread updated.

frankacter · June 23, 2012

Reminder for other clients on Atlantis: be sure to check your MySQL databases after each reboot or crash. A few of mine were in a crashed state and required repairing, and I didn't realize it until customers started emailing to complain. It's also a good idea to step up backups of your MySQL databases, just to be on the safe side!

Michael D. · June 23, 2012

Indeed, it's not a bad idea to have backups whether the server is stable or not. For those who have reported crashed databases, I have added this to the /etc/rc.local file to repair databases automatically after booting up (about 2 minutes after boot):

(sleep 120; mysqlcheck --auto-repair --force --optimize --all-databases) &
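
If MySQL sometimes takes longer than two minutes to come up, a possible variation on the line above is to poll until the server answers instead of sleeping for a fixed time. This is only a sketch under assumptions: the helper name wait_for is made up, the retry count and delay are arbitrary, and it relies on mysqladmin being able to log in the same way mysqlcheck already does (for example via credentials in a .my.cnf file).

```shell
#!/bin/sh
# wait_for <max_tries> <delay_seconds> <command...>
# Re-runs the command until it succeeds; gives up after max_tries failures.
wait_for() {
    wf_max=$1
    wf_delay=$2
    shift 2
    wf_n=0
    until "$@" >/dev/null 2>&1; do
        wf_n=$((wf_n + 1))
        [ "$wf_n" -ge "$wf_max" ] && return 1
        sleep "$wf_delay"
    done
}

# Hypothetical /etc/rc.local usage (left commented out here):
# "mysqladmin ping" exits 0 once the server accepts connections.
# ( wait_for 60 5 mysqladmin --silent ping &&
#   mysqlcheck --auto-repair --force --optimize --all-databases ) &
```

With these example values the repair starts as soon as MySQL responds, and is skipped entirely if the server has not come up after roughly five minutes.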

Michael D. · June 23, 2012

The 10 PM reboot has been scheduled on the server: at exactly 10:00 PM EST the server will automatically shut down gracefully and start back up. So long as the server does not crash between now and then, there won't be any need for database repairs, as the shutdown will be graceful.

Let us know if you have any questions.

Michael D. · June 24, 2012

The node just went down for the reboot as planned.

Michael D. · June 25, 2012

The nightly reboot has been scheduled, and will happen in approximately 10 minutes.

Michael D. · June 25, 2012

The node just crashed on its own, and is being rebooted.

Michael D. · June 25, 2012

The node is back online. Each VPS will start up (one at a time, about 10 seconds each), and once all are started they will each restart to fix their disk quotas.

Michael D. · June 25, 2012

Tracking information indicates the last piece (the server chassis) should be here tomorrow. Hopefully the server will stay online and stable between now and then and we can get everybody shifted quickly. You will likely see 2 to 5 minutes of brief downtime once or twice as we conduct the migrations, but that will be the end of the outages.

I would like to thank everybody for their patience and understanding on this issue and apologize again for any trouble that it may have caused you.

Michael D. · June 26, 2012

The chassis has arrived at the facility. Last I checked, we have all of the parts necessary to build the server and should have it online by this evening. We'll update this thread if anything unexpected happens or if there are any delays.

Michael D. · June 27, 2012

Everything is going to plan so far, and by morning everybody should be 100% stable again. I'm staying up tonight to handle this instead of getting sleep, as I know that nobody else wants to go another day wondering if the server will stay online without issues or if it will decide to crash on its own.

If you have any questions, let us know.

Michael D. · June 28, 2012

Everybody did get moved over to the new server without issue. The VPS control panel isn't going to function for you at the moment, so if you need something like a reboot done, let us know in a support ticket. We should have the VPS control panel hooked back up this evening.

frankacter · June 28, 2012

Thanks for the continued efforts.

While it sounds like it should be little to no impact, is there a timeline for phase 2 to move everyone back onto the original server once it is rebuilt with OpenVZ 5.X?

Michael D. · June 28, 2012

We will be doing "--online" migrations, so there should only be 2 to 5 seconds of lag for the second transition and that is it. The reason we didn't do this for OVZ6 -> OVZ5 is that the two versions handle memory so differently that it was not an option. Since we will be going OVZ5 -> OVZ5, online transfers should work. If not, we'll update here and let everybody know.
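
For readers curious what an "--online" migration looks like in practice, here is a rough sketch, not the exact procedure used for this move. It assumes the stock OpenVZ vzmigrate tool and key-based SSH access from the source node to the destination. The wrapper name migrate_online and the host name in the usage line are made up for illustration.

```shell
#!/bin/sh
# migrate_online <destination-host> <CTID> [<CTID>...]
# Live-migrates each container in turn and stops at the first failure.
migrate_online() {
    if [ "$#" -lt 2 ]; then
        echo "usage: migrate_online <destination-host> <CTID>..." >&2
        return 2
    fi
    mo_dest=$1
    shift
    for mo_ctid in "$@"; do
        # --online checkpoints the running container and restores it on the
        # destination, so processes inside are paused rather than restarted.
        vzmigrate --online "$mo_dest" "$mo_ctid" || {
            echo "migration of CT $mo_ctid failed" >&2
            return 1
        }
    done
}

# Hypothetical usage: migrate_online newnode.example.com 175 176
```

Migrating one container at a time keeps any pause limited to a single VPS at once.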

Michael D. · June 28, 2012

VPS were still copying overnight, but I've checked now and everybody is back on the proper Atlantis server on OpenVZ 5. Things should remain stable and we will not be doing any more migrations (i.e. this issue is completed/resolved).
