Michael D. Posted June 19, 2012

Hello,

We upgraded our VPS clients from OpenVZ (RHEL) 5.X to 6.X when we moved everybody to the new hardware, as the new version supposedly has a lot of very nice features, performance improvements, optimizations, and so on. Unfortunately, we've found that while it does have a lot of that, the old style of memory accounting (UBC) isn't stable on this kernel, and the new style (vSwap) isn't stable long-term either (i.e. it will crash the server about once every day or two). We did research the stability of the newer revision of OpenVZ before provisioning the server and moving customers to it, and we tested the OS for a while to make sure it was stable, but it turns out that it actually is not.

With the new vSwap, we are seeing issues like the following (processes reported as out of memory and killed) when the VPS is nowhere near out of RAM, and it ends in a kernel panic like the one in the screenshot linked after this log:

Jun 17 21:41:46 atlantis kernel: [142657.283383] OOM killed process 586314 (php) vm:49444kB, rss:30372kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.484016] OOM killed process php (pid=586314, ve=175) exited, free=341522.
Jun 17 21:41:46 atlantis kernel: [142657.484024]
Jun 17 21:41:46 atlantis kernel: [142657.583947] >>> 175 oom generation 613 starts
Jun 17 21:41:46 atlantis kernel: [142657.583958] 586889 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0
Jun 17 21:41:46 atlantis kernel: [142657.585810] Out of memory in UB: Kill process 586514 (php) score 29 or sacrifice child
Jun 17 21:41:46 atlantis kernel: [142657.585971] OOM killed process 586514 (php) vm:49444kB, rss:30628kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.684435] OOM killed process php (pid=586514, ve=175) exited, free=348772.
Jun 17 21:41:46 atlantis kernel: [142657.684442]
Jun 17 21:41:46 atlantis kernel: [142657.889579] >>> 175 oom generation 614 starts
Jun 17 21:41:46 atlantis kernel: [142657.889588] 586950 (php) invoked loc oom-killer: gfp 0x200d2 order 0 oomkilladj=0
Jun 17 21:41:46 atlantis kernel: [142657.891072] Out of memory in UB: Kill process 586520 (php) score 29 or sacrifice child
Jun 17 21:41:46 atlantis kernel: [142657.891227] OOM killer in rage, 1 tasks killed in ub 175
Jun 17 21:41:46 atlantis kernel: [142657.891303] OOM killed process 586520 (php) vm:49444kB, rss:30680kB, swap:0kB
Jun 17 21:41:46 atlantis kernel: [142657.981907] OOM killed process php (pid=586520, ve=175) exited, free=351070.

http://www.screen-shot.net/2012-06-18_2326.png

We are going to do what we can to make things stable. The fastest solution would also cause the most downtime: backing the server up in full, doing a full OS reinstall, and then restoring the VPS data. We estimate that would mean 6 to 12 hours of downtime, but once done it would restore the stability you have come to expect from us. The other option is ordering a brand new server identical to the current one, migrating everybody over to it, and then finding something to do with this "old" server. This would take the longest, since we would have to order the new hardware and wait 7 to 14 days for it to arrive, and in the meantime the server would likely continue having issues. We are evaluating all of the options and trying to get things stable as quickly as possible.
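For readers unfamiliar with the two accounting models mentioned above: UBC containers are tuned through a large set of interdependent beancounters, while vSwap containers get a simple RAM/swap pair. The sketch below is illustrative only — the limit values are made up, and the commands are echoed rather than executed, since vzctl only exists on an OpenVZ host:

```shell
CTID=175   # container ID taken from the OOM log above

# Old UBC model: per-container limits set across ~20 interlocking
# beancounters; privvmpages (barrier:limit, in pages) shown as one example.
echo vzctl set "$CTID" --privvmpages 131072:137072 --save

# New vSwap model: just RAM plus swap, behaving more like a real machine.
echo vzctl set "$CTID" --ram 512M --swap 512M --save
```

Getting all of the UBC values consistent with each other is exactly the kind of tuning the vSwap model was meant to eliminate.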
Michael D. Posted June 19, 2012

We have set everybody back to vSwap. It's not stable (i.e. we may end up needing to do a daily five-minute reboot), but at least while the server is up and running, things will "work". If you have any questions at all, let us know.
Michael D. Posted June 19, 2012

After discussing this with our team, we've decided to go ahead and order an identical server, bring it online, configure it with OpenVZ 5.X, and migrate all existing VPS customers over to it. We're going to place the order immediately and hope to have the server online between Friday and Monday of next week. In the meantime, everybody is going to stay on vSwap and we are going to perform a nightly reboot at 10 PM EST. Between reboots we have second-by-second monitoring on the server, and in the event that it does lock up or panic, we will reboot it immediately. We apologize for this instability and these issues; we are going to do everything we can to stabilize this as soon as possible.
Michael D. Posted June 20, 2012

We are going to issue a reboot now to help ensure that the server doesn't go down unexpectedly. We're doing it later than we planned (originally 10 PM), and I apologize for that.
Michael D. Posted June 21, 2012

We are issuing the next nightly reboot now. With any luck the new hardware will be here on Friday, but it will more likely be Monday.
Michael D. Posted June 22, 2012

The server just crashed on its own; we're rebooting it now. The node itself takes about 2 to 3 minutes to reboot, and then each VPS starts up one at a time (so as not to overload the server and make the process take longer). Due to the nature of the crash, each VPS will boot up, and once they are all running they will go offline one at a time to fix their disk quotas, which takes a few minutes apiece. If you have any questions, let us know — hopefully the new server will be here TOMORROW!
frankacter Posted June 22, 2012

Will you still be performing the scheduled clean reboot later in the day?
Michael D. Posted June 22, 2012

No, a reboot is a reboot, so we'll leave it running at this point.
Michael D. Posted June 22, 2012

The server has crashed again and is rebooting again. I'm going to check with the facility now to see if the new hardware has arrived *crosses fingers*.
Michael D. Posted June 22, 2012

All VPS on the node have been successfully restarted.
Michael D. Posted June 22, 2012

We've received everything but the server chassis itself, which it looks like won't arrive until Monday. We'll keep this thread updated.
frankacter Posted June 23, 2012

A reminder for other clients on Atlantis: be sure to check your MySQL databases after each reboot or crash. A few of mine were in a crashed state and required repair, and I didn't realize it until customers started emailing to complain. It's also a good idea to step up backups of your MySQL databases, just to be on the safe side!
Michael D. Posted June 23, 2012

Indeed, it's not a bad idea to have backups either way, whether the server is stable or not. For those who have reported crashed databases: I have added the following to /etc/rc.local to repair databases automatically after booting up (about 2 minutes after boot):

(sleep 120; mysqlcheck --auto-repair --force --optimize --all-databases;) &
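For anyone adapting that rc.local line on their own VPS, a slightly more defensive variant might look like the sketch below. The log path and the assumption that credentials come from /root/.my.cnf are ours, not part of the original setup:

```shell
# /etc/rc.local fragment (sketch). Waits two minutes for mysqld to come
# up after boot, then repairs and optimizes all databases in the
# background, logging the result so failed repairs are visible later.
# Assumes mysqlcheck can authenticate, e.g. via /root/.my.cnf.
(
  sleep 120
  mysqlcheck --auto-repair --force --optimize --all-databases \
    >> /var/log/mysqlcheck-boot.log 2>&1
) &
```

The fixed sleep is crude but matches the original approach; waiting on the MySQL socket would be more robust if your boot times vary.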
Michael D. Posted June 23, 2012

The 10 PM reboot has been scheduled on the server; at 10:00 PM EST on the second, the server will automatically shut down gracefully and start back up. As long as the server does not crash between now and then, there won't be any need for database repairs, since the shutdown will be graceful. Let us know if you have any questions.
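For anyone curious how a nightly graceful reboot like this is typically scheduled, a root crontab entry along these lines would do it (the exact mechanism used on the node wasn't stated, and the message text is illustrative):

```shell
# Root crontab fragment (sketch): reboot gracefully every night at 22:00
# server-local time (EST on this node). `shutdown -r` stops services
# cleanly, so MySQL tables are closed properly before the restart.
0 22 * * * /sbin/shutdown -r now "Scheduled nightly maintenance reboot"
```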
Michael D. Posted June 24, 2012

The node just went down for the reboot as planned.
Michael D. Posted June 25, 2012

The nightly reboot has been scheduled and will happen in approximately 10 minutes.
Michael D. Posted June 25, 2012

The node just crashed on its own and is being rebooted.
Michael D. Posted June 25, 2012

The node is back online. Each VPS will start up one at a time (about 10 seconds each), and once all are started they will each restart to fix their disk quotas.
Michael D. Posted June 25, 2012

Tracking information indicates the last piece (the server chassis) should be here tomorrow. Hopefully the server will stay online and stable between now and then, and we can get everybody shifted quickly. You will likely see 2 to 5 minutes of brief downtime once or twice as we conduct the migrations, but that will be the end of the outages. I would like to thank everybody for their patience and understanding on this issue, and apologize again for any trouble it may have caused you.
Michael D. Posted June 26, 2012

The chassis has arrived at the facility. Last I checked we have all of the parts necessary to build the server, and we should have it online by this evening. We'll update this thread if anything unexpected happens or if there are any delays.
Michael D. Posted June 27, 2012

Everything is going according to plan so far, and by morning everybody should be 100% stable again. I'm staying up tonight to handle this instead of sleeping, as I know nobody wants to go another day wondering whether the server will stay online or decide to crash on its own. If you have any questions, let us know.
Michael D. Posted June 28, 2012

Everybody was moved over to the new server without issue. The VPS control panel isn't functioning at the moment, so if you need something like a reboot done, let us know in a support ticket. We should have the VPS control panel hooked back up this evening.
frankacter Posted June 28, 2012

Thanks for the continued efforts. While it sounds like it should have little to no impact, is there a timeline for phase 2, moving everyone back onto the original server once it is rebuilt with OpenVZ 5.X?
Michael D. Posted June 28, 2012

We will be doing "--online" migrations, so there should only be 2 to 5 seconds of lag for the second transition, and that is it. The reason we didn't do this for the OVZ6 -> OVZ5 move is that the two versions handle memory so differently that it was not an option. Since we will be going OVZ5 -> OVZ5, online transfers should work. If not, we'll update here and let everybody know.
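For reference, OpenVZ live migration is done per container with vzmigrate. A sketch of the kind of loop involved is below — the destination hostname and container IDs are illustrative, and the commands are echoed rather than executed, since vzmigrate only runs on an OpenVZ host:

```shell
DEST=atlantis-new.example.com   # illustrative destination hostname
CTIDS="101 102 175"             # in practice: $(vzlist -H -o ctid)

for CTID in $CTIDS; do
    # --online freezes the container, copies its memory and disk state
    # to $DEST, and resumes it there, so each VPS sees only seconds of
    # lag rather than a full shutdown and restart.
    echo vzmigrate --online "$DEST" "$CTID"   # drop 'echo' to run for real
done
```

This is why a same-version (OVZ5 -> OVZ5) move can be near-seamless: the frozen in-memory state is only meaningful to a kernel with the same memory model.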
Michael D. Posted June 28, 2012

VPS were still copying overnight, but I've checked now and everybody is back on the proper Atlantis server on OpenVZ 5. Things should remain stable, and we will not be doing any more migrations (i.e. this issue is complete/resolved).