MDDHosting Forums

Atlantis (VPS Node) Instability



Hello,

 

We upgraded our VPS clients from OpenVZ (RHEL) 5.X to 6.X when we moved everybody to the new hardware, as the new version supposedly has a lot of very nice features, performance improvements, and optimizations. Unfortunately, we've found that while it does have a lot of that, the old style of memory calculation (UBC) isn't stable on it and the new style of memory calculation (vSwap) is not long-term stable (i.e. it will cause the server to crash about once every day or two).

 

We did research the stability of the newer revision of OpenVZ before provisioning the server and moving customers to it, and we tested the OS for a while to ensure it was stable, but it turns out that it is not.

 

With the new vSwap, we are seeing issues such as the following (the kernel reports being out of memory and kills processes) when the VPS is nowhere near being out of RAM, and this results in a kernel panic like the one in the image after this log excerpt:

Jun 17 21:41:46 atlantis kernel: [142657.283383] OOM killed process 586314 (php) vm:49444kB, rss:30372kB, swap:0kB

Jun 17 21:41:46 atlantis kernel: [142657.484016] OOM killed process php (pid=586314, ve=175) exited, free=341522.

Jun 17 21:41:46 atlantis kernel: [142657.484024]

Jun 17 21:41:46 atlantis kernel: [142657.583947] >>> 175 oom generation 613 starts

Jun 17 21:41:46 atlantis kernel: [142657.583958] 586889 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0

Jun 17 21:41:46 atlantis kernel: [142657.585810] Out of memory in UB: Kill process 586514 (php) score 29 or sacrifice child

Jun 17 21:41:46 atlantis kernel: [142657.585971] OOM killed process 586514 (php) vm:49444kB, rss:30628kB, swap:0kB

Jun 17 21:41:46 atlantis kernel: [142657.684435] OOM killed process php (pid=586514, ve=175) exited, free=348772.

Jun 17 21:41:46 atlantis kernel: [142657.684442]

Jun 17 21:41:46 atlantis kernel: [142657.889579] >>> 175 oom generation 614 starts

Jun 17 21:41:46 atlantis kernel: [142657.889588] 586950 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0

Jun 17 21:41:46 atlantis kernel: [142657.891072] Out of memory in UB: Kill process 586520 (php) score 29 or sacrifice child

Jun 17 21:41:46 atlantis kernel: [142657.891227] OOM killer in rage, 1 tasks killed in ub 175

Jun 17 21:41:46 atlantis kernel: [142657.891303] OOM killed process 586520 (php) vm:49444kB, rss:30680kB, swap:0kB

Jun 17 21:41:46 atlantis kernel: [142657.981907] OOM killed process php (pid=586520, ve=175) exited, free=351070.

http://www.screen-shot.net/2012-06-18_2326.png
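
For anyone who wants to see how often this is hitting a given container, here is a rough sketch that tallies the "OOM killed process ... exited" lines per container ID (the ve= value). It assumes the kernel log format quoted above; the function name count_oom_kills is made up for this example.

```shell
# count_oom_kills (hypothetical helper): read kernel log text on stdin and
# print "<container id> <number of OOM kills>" for each ve= value seen.
count_oom_kills() {
    awk '/OOM killed process/ && match($0, /ve=[0-9]+/) {
             # RSTART/RLENGTH cover "ve=NNN"; skip the 3-character "ve=" prefix
             id = substr($0, RSTART + 3, RLENGTH - 3)
             kills[id]++
         }
         END { for (id in kills) print id, kills[id] }' | sort -n
}
```

On the node this could be fed from wherever kernel messages are logged, for example `count_oom_kills < /var/log/messages` (that path is an assumption and may differ on your system).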

 

We are going to do what we can to make things stable. The fastest solution would also mean the most downtime: backing the server up in full, doing a full OS reinstall, and then restoring the VPS data. This would mean an estimated 6 to 12 hours of downtime, but once done it would bring back the stability you have come to expect from us.

 

The other option is ordering another brand new server that is identical to the current one, migrating everybody over to it, and then finding something to do with this "old" server. This would take the longest as we would have to order the new hardware and wait 7 to 14 days for it to arrive. In the meantime the server is likely going to continue having issues.

 

We are evaluating all of the options, and trying to get things stable as quickly as possible.


After discussing this with our team, we've decided to order and bring online an identical server, which we will configure with OpenVZ 5.X and migrate all existing VPS customers to. We're going to place the order immediately and hope to have the server online between Friday and Monday of this coming week.

 

In the meantime, everybody is going to stay on vSwap and we are going to perform a nightly reboot at 10 PM EST. Between reboots we have second-by-second monitoring on the server, and in the event that the server does lock up or panic we will immediately reboot it as well.

 

We apologize for this instability and these issues; however, we are going to do everything we can to stabilize this as soon as possible.


The server just crashed on its own; we're rebooting it now. The node itself takes about 2 to 3 minutes to reboot and then each VPS starts up one at a time (so as not to totally overload the server and cause it to take longer). Due to the nature of the crash, each VPS will boot up, and once they are all booted they will go offline one at a time to fix their disk quotas, which takes a few minutes apiece.

 

If you have any questions, let us know - hopefully this new server will be here TOMORROW!


Reminder for other clients on Atlantis: be sure to check your MySQL databases after each reboot or crash. A few of mine were in a crashed state and required repairing, and I didn't realize it until customers started emailing to complain. It's also a good idea to step up backups of your MySQL databases just to be on the safe side!

Indeed, it's not a bad idea to have backups either way, whether the server is stable or not. For those who have reported crashed databases, I have added this to the /etc/rc.local file to repair databases automatically after booting up (about 2 minutes after boot):

(sleep 120;mysqlcheck --auto-repair --force --optimize --all-databases;)&
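
If the fixed two-minute sleep turns out to be too short or too long on a particular VPS, one possible variation is to poll until MySQL answers before running the check. This is only a sketch: wait_for is a hypothetical helper, and the retry count and interval are arbitrary. It relies on mysqladmin ping exiting with status 0 once the server is running.

```shell
# wait_for (hypothetical helper): run a command repeatedly until it succeeds
# or the attempt limit is reached. Usage: wait_for ATTEMPTS INTERVAL CMD...
wait_for() {
    attempts=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0            # command succeeded, stop waiting
        fi
        i=$((i + 1))
        sleep "$interval"       # pause before the next attempt
    done
    return 1                    # gave up after the last attempt
}

# Possible /etc/rc.local line (try for up to about 5 minutes, then repair):
# (wait_for 60 5 mysqladmin ping && mysqlcheck --auto-repair --force --optimize --all-databases) &
```

The original sleep-based line is simpler and works fine when MySQL reliably starts within two minutes; the polling version only helps when startup time varies.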


The 10 PM reboot has been scheduled on the server. At exactly 10:00 PM EST the server will automatically shut down gracefully and start back up. So long as the server does not crash between now and then, there won't be any need for database repairs, as the shutdown will be graceful.

 

Let us know if you have any questions.


Tracking information indicates the last piece (the server chassis) should be here tomorrow. Hopefully the server will stay online and stable between now and then and we can get everybody shifted quickly. You will likely see 2 to 5 minutes of downtime once or twice as we conduct the migrations, but that will be the end of the outages.

 

I would like to thank everybody for their patience and understanding on this issue and apologize again for any trouble that it may have caused you.


The chassis has arrived at the facility. Last I checked, we have all of the parts necessary to build the server and should have it online by this evening. We'll update this thread if anything unexpected happens or if there are any delays.

Everything is going to plan so far, and by morning everybody should be 100% stable again. I'm staying up tonight to handle this instead of getting sleep, as I know that nobody else wants to go another day wondering if the server will stay online without issues or if it will decide to crash on its own.

 

If you have any questions, let us know.


Everybody did get moved over to the new server without issue. The VPS control panel isn't going to function for you at the moment, so if you need something like a reboot done, let us know in a support ticket. We should have the VPS control panel hooked back up this evening.

We will be doing "--online" migrations, so there should only be 2 to 5 seconds of lag for the second transition and that is it. The reason we didn't do this for OVZ6 -> OVZ5 is that the way they handle memory is so different that it was not an option. Since we will be going OVZ5 -> OVZ5, online transfers should work. If not, we'll update here and let everybody know.
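
For the curious, an online migration of this kind is typically run once per container with vzmigrate. The sketch below only prints the commands (a dry run) and is not the exact procedure used here; print_migrations is a hypothetical helper and new-node.example.com is a placeholder destination.

```shell
# print_migrations (hypothetical helper): read container IDs on stdin and
# print one "vzmigrate --online" command per container for the given
# destination host. Nothing is executed; this is a dry run.
print_migrations() {
    dest=$1
    while read -r ctid; do
        [ -n "$ctid" ] || continue      # skip blank lines
        printf 'vzmigrate --online %s %s\n' "$dest" "$ctid"
    done
}

# Example: list the containers on the node and preview the commands.
# vzlist -H -o ctid | print_migrations new-node.example.com
```

Printing the commands first makes it easy to review the list, and to run the migrations one at a time, before anything is actually moved.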

VPS were still copying overnight, but I've checked now and everybody is back on the proper Atlantis server on OpenVZ 5. Things should remain stable and we will not be doing any more migrations (i.e. this issue is completed/resolved).
