Michael D. Posted January 4, 2014

Hello,

Unfortunately our R1Soft backups on the Kobold server didn't trigger until 11 AM yesterday and roughly 12:40 PM today due to delays/slowness on the backup server itself. While this would normally be fine, as our servers have good hardware and plenty of I/O, the R1Soft backup triggers several extremely intensive queries against MySQL as part of its MySQL backup process, and as such it's causing slowness/instability. This was exacerbated by a few accounts using more than they should to begin with; we suspended and notified those users yesterday.

Today, upon investigation, we found numerous 'stuck' queries running in the MySQL server: http://www.screen-shot.net/2014-01-04_13-35-18.png

We did attempt to stop/kill those threads, as you can see indicated by "Killed" at the beginning of each query, however they failed to close out. We've been forced today, just as we were yesterday, to force-quit MySQL. The result is that when it starts back up it will take a few minutes to go over the databases and make sure they're complete/repaired. After this is done we will perform a manual check that will take a couple of hours but should have little to no impact.

We apologize for any trouble this may have caused you or may be causing you, and I can assure you we're doing everything we can both to resolve the current issue and to avoid it in the future. We did stop the R1Soft backup for today and killed these MySQL threads, however manual administrative intervention was still required. We'll update this thread if we have anything new to provide - we're currently working on restoring MySQL and expect it to be back online within roughly 5 minutes.
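For readers curious how 'stuck' queries like the ones in that screenshot get identified: the usual approach is to scan SHOW PROCESSLIST output for long-running queries and issue KILL for each. A minimal Python sketch over simulated processlist rows (the threshold and sample data are hypothetical, not our actual tooling):

```python
# Sketch: pick out long-running queries from SHOW PROCESSLIST-style rows
# and generate the KILL statements an admin would run. The row data here
# is simulated; a real script would fetch it from MySQL.

LONG_QUERY_SECONDS = 300  # hypothetical threshold

def stuck_queries(processlist, threshold=LONG_QUERY_SECONDS):
    """Return (id, query) pairs for queries running longer than `threshold` seconds."""
    return [
        (row["Id"], row["Info"])
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > threshold
    ]

def kill_statements(processlist, threshold=LONG_QUERY_SECONDS):
    """Build the KILL statements for every stuck query found."""
    return ["KILL {};".format(qid) for qid, _ in stuck_queries(processlist, threshold)]

if __name__ == "__main__":
    sample = [
        {"Id": 101, "Command": "Query", "Time": 4210, "Info": "SELECT ..."},
        {"Id": 102, "Command": "Sleep", "Time": 9000, "Info": None},
        {"Id": 103, "Command": "Query", "Time": 12, "Info": "UPDATE ..."},
    ]
    for stmt in kill_statements(sample):
        print(stmt)
```

As the post notes, though, KILL only marks the thread - a query wedged in the storage engine can linger in the "Killed" state, which is exactly what forced the MySQL restart here.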
Michael D. Posted January 4, 2014

MySQL has been restarted and is online - the server should be back to normal within a few minutes. We're about to begin the manual repair scan.
Michael D. Posted January 4, 2014

The server is back to normal. Because we had to stop the R1Soft backup, it will perform a complete block scan [seed backup] tonight rather than an incremental backup of the changes. This means it will likely run somewhere between 14 and 24 hours; however, we did disable the MySQL Databases portion of the backup, as this is what was causing the most issues.

We're in contact with R1Soft support regarding this matter, however we're not holding our breath as their support tends to be slow/incompetent, unfortunately. We will still have backups of all MySQL data on the server, but we will not be able to restore individual databases or tables until this issue is resolved.
Michael D. Posted January 4, 2014

It appears I may have been incorrect - these queries are still firing off and, to the best of my knowledge, it's not R1Soft doing it. I'm going to open an urgent ticket with cPanel to see if they can determine the cause, as I'm not seeing anything myself: http://www.screen-shot.net/2014-01-04_14-46-22.png
Michael D. Posted January 4, 2014

I managed to get the queries under control without having to kill MySQL and will be watching to see if they crop back up. I'll be monitoring the server personally [i.e. having it up on my secondary display so I can watch it all day]. At this time we're still investigating.
GrooveAnatomy Posted January 5, 2014

Thanks for the updates!
mrintech Posted January 6, 2014

For the past 2 days I've been getting Database Connection Errors randomly on every page. While writing this post I am getting the errors, so I thought I'd notify you:

http://i.imgur.com/J6HO2c2.png
http://i.imgur.com/a3VINfL.png

Thanks
Michael D. Posted January 6, 2014

The issue has been identified as a kernel memory-management issue. Unfortunately this is *way* over my head [it's a kernel developer issue]. You can see the issue on this graph: http://www.screen-shot.net/memory-day.png

Notice where the purple [cache] basically disappears and is replaced by orange [slab cache]. What's happening is that the slab cache is filling up all available system RAM; with no RAM left for disk caching, disk I/O slows to a crawl, MySQL queries slow down and back up, and then the system runs out of CPU.

We're working with our kernel developer [CloudLinux] to see what can be done about this, as this isn't intended behavior and it's certainly causing performance issues. While we certainly apologize for this issue, do understand it's not within our control, and we are going to do everything we possibly can to get this fixed as quickly as possible.
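For those who want to watch for this condition themselves: the failure mode in that graph (slab cache displacing page cache) is visible directly in /proc/meminfo on any Linux box. A rough Python sketch, parsing a made-up snapshot rather than the live file (the numbers below are illustrative, not from Kobold):

```python
# Sketch: detect slab cache crowding out the page cache by parsing
# /proc/meminfo-style text. The snapshot below is invented for illustration;
# on a live Linux server you would read the real /proc/meminfo instead.

def parse_meminfo(text):
    """Return a dict of field -> size in kB from /proc/meminfo-style lines."""
    info = {}
    for line in text.strip().splitlines():
        field, rest = line.split(":", 1)
        info[field] = int(rest.split()[0])  # values are reported in kB
    return info

def slab_is_crowding_cache(info):
    """Flag the failure mode from the graph: slab bigger than page cache."""
    return info["Slab"] > info["Cached"]

SNAPSHOT = """\
MemTotal:       65964480 kB
MemFree:          512344 kB
Cached:          1048576 kB
Slab:           58720256 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SNAPSHOT)
    print("Slab crowding cache:", slab_is_crowding_cache(info))
```

On a healthy 64 GB box the Cached figure should dwarf Slab; the graph above shows the opposite.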
JohnUK Posted January 6, 2014

Thanks a lot for the update! I wonder what would cause the kernel to mismanage memory - there is always a cause for these things. I'm an IT tech myself in the UK, and maybe there is some malicious code involved; I know that if anyone were to upload malicious code they could gain root access via Meterpreter and PHP. I really hope that CloudLinux will be able to resolve this. Thanks
Michael D. Posted January 6, 2014

The developer is aware of the issue but has, thus far, been unable to figure out what is causing it [we're not the only ones with this issue]. That said - we use SSD-cached I/O, and the memory-management issue only affects performance *while* R1Soft backups are running on the server, due to the nature of R1Soft. The long and short of it is that R1Soft re-allocates writes while it runs, so as not to cause data inconsistency during the backup, and this slows things down as well. All of this added together results in a sluggish or non-responsive server.

We're currently working on disabling R1Soft on this server and moving to the built-in cPanel backup system in the meantime, so that this performance hit doesn't happen and we still maintain backups. The result should be that this issue is resolved as far as the end user is concerned [i.e. no more I/O or MySQL issues].
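As an aside on why per-database dumps keep individual databases restorable (unlike the block-level backup while the MySQL portion is disabled): each database ends up in its own file. This is only a sketch, not our actual cPanel backup configuration - the database names and destination path are hypothetical:

```python
# Sketch: build per-database mysqldump command lines so each database can be
# restored on its own. This only constructs the commands; it does not run them.
# --single-transaction gives a consistent dump of InnoDB tables without locking.

import os

def dump_commands(databases, dest_dir="/backup/mysql"):
    """One mysqldump command per database, writing <dest_dir>/<db>.sql."""
    return [
        "mysqldump --single-transaction {db} > {path}".format(
            db=db, path=os.path.join(dest_dir, db + ".sql"))
        for db in databases
    ]

if __name__ == "__main__":
    for cmd in dump_commands(["example_blog", "example_shop"], "/backup/mysql"):
        print(cmd)
```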
JohnUK Posted January 6, 2014

Right, understood. Maybe take a look at http://www.unitrends.com/ - it might just work out. Are your servers virtualized? If so, it might be easier to back up with VMware.
Michael D. Posted January 6, 2014

We're hoping to move to an SSD-cached SAN + VMware ESXi soon to offer additional redundancy/HA, but have no ETA on that - we're still working out all of the details. In the meantime we're just going to drop R1Soft and go with traditional backups until the kernel issue itself is resolved. If not for the kernel issue, no impact would be noticed.
T0M Posted January 7, 2014

Thanks for the updates Mike. I'm sure CloudLinux and R1Soft will work out this issue. In a perfect world these things wouldn't happen, but it is what it is. I appreciate the service you guys provide. Happy New Year!
mrintech Posted January 8, 2014

Thanks for the updates. Even HostGator (my previous host of 3+ years) never gave such details.
Michael D. Posted January 10, 2014

We're still working with CloudLinux to investigate this - I have no ETA, but it is actively being investigated and worked on.
Michael D. Posted January 11, 2014

The issue appears to be getting worse. The really irritating part of all of this is that our hands are tied - we have no power to resolve this ourselves, as it's an issue with the system kernel itself. This morning the issue caused all RAM - all 64 GB of it - to be exhausted, resulting in slowness/downtime.

You can see from this graph that the issue is getting more persistent and causing more problems [it should not be spiky like this]: http://www.screen-shot.net/2014-01-11_10-07-08.png

None of the spikes where the purple gets smaller and the orange/yellow gets larger should be happening at all; the system should have 10 to 30 GB of free RAM at all times. I have sent an email to the Chief Executive Officer of CloudLinux concerning this, as it is unacceptable. We may be forced to boot out of CloudLinux and into a standard CentOS 6 system kernel. While we don't wish to do this, ultimately we have to do what we have to do to maintain stability.
Michael D. Posted January 14, 2014

This issue is still ongoing, and we've been working to reduce the load on the server to help mask the issue [i.e. so you don't see problems even though the core issue still exists]. The server is normally loaded to only about 50~60% of what it's capable of - running almost half idle most of the time - but even that is too much with this RAM issue ongoing.
Michael D. Posted January 16, 2014

CloudLinux is making some changes to the server that they believe will alleviate, and possibly resolve, the issue. They're also installing some custom monitoring of their own so that they can investigate the issue as it happens. With luck their modifications will at least restore full stability to the server.
Michael D. Posted January 17, 2014

The developers have made some changes that appear to be reducing the issue; however, the system is dipping into swap now [it shouldn't - it's got 64 GB of RAM]. I'm in the process of dumping swap back to RAM, after which I will re-enable swap, and we will continue to monitor. You can see from this graph, compared to the ones above, that the behavior seems to be improved: http://www.screen-shot.net/2014-01-17_10-37-14.png
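For the curious, "dumping swap back to RAM" is typically a swapoff/swapon cycle, and the one sanity check worth doing first is confirming free RAM can absorb what's currently in swap. A hedged sketch of that check (the headroom factor and sample figures are assumptions, not our exact procedure):

```python
# Sketch: before running `swapoff -a` (which forces swapped pages back into
# RAM), check that free memory can absorb current swap usage with some margin.
# Figures are in kB, as /proc/meminfo reports them; the samples are invented.

def safe_to_swapoff(mem_free_kb, swap_used_kb, headroom=1.5):
    """True if free RAM covers the swap contents with `headroom` margin."""
    return mem_free_kb >= swap_used_kb * headroom

if __name__ == "__main__":
    # e.g. 8 GB free vs 2 GB in swap -> safe to swapoff, then swapon again
    print(safe_to_swapoff(8 * 1024 * 1024, 2 * 1024 * 1024))
```

If the check fails, swapoff would itself trigger the memory pressure you're trying to clear, so you'd wait or free memory first.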
Michael D. Posted January 17, 2014

I really want to call this fixed/resolved, but I'm going to give it a bit more time to be sure, so I will mark this 'tentatively resolved'. I'm going to find out from CloudLinux what changes they made [i.e. what the original issue was] and, for any who are curious, I will post it here.
Jacqui Best Posted January 20, 2014

I am still experiencing serious speed issues on all of my accounts. I wish I could say it was fixed - it actually seems worse today than it has in the past few days.

Thanks Guys
Jac
Michael D. Posted January 20, 2014

The speed issues you have seen over the last 2 days are unrelated to this issue. We had to temporarily use Kobold as a staging area due to data corruption on the Echo server. It wasn't the load of the accounts but the migration of data in and out of the server - we leave enough breathing room on all of our servers that if we had to combine two of them, one could sustain the load. Moving data at 200~300 megabytes/second can put an incredible strain on the server's disks, which can slow things down, but we needed to get the process done as quickly as possible.

I'm actually in the process of removing the 'temporary' accounts from Kobold [wiping out about 300 GB of data], which is fairly intensive. It should finish soon, and when it's done things should go back to normal. I wish there was a way to throttle the deletion, but there is not.

Here is a graph showing the data that was migrated in: http://www.screen-shot.net/2014-01-20_15-05-00.png

That same data is now being removed. As you can see, the amount moved in was small compared to the total amount on the Kobold server, but it is still a significant amount of data at around 300 GB.
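For a sense of scale, the arithmetic on those figures is straightforward: ~300 GB at 200~300 MB/s works out to roughly 17 to 26 minutes of sustained disk saturation. A quick sketch of that calculation:

```python
# Sketch: estimate how long moving ~300 GB at 200-300 MB/s keeps the disks
# busy. Pure arithmetic on the figures quoted above (1 GB = 1024 MB here).

def transfer_minutes(gigabytes, mb_per_sec):
    """Minutes needed to move `gigabytes` of data at `mb_per_sec`."""
    seconds = gigabytes * 1024 / mb_per_sec
    return seconds / 60.0

if __name__ == "__main__":
    for rate in (200, 300):
        print("{} MB/s -> {:.0f} minutes".format(rate, transfer_minutes(300, rate)))
```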