
Kobold - R1Soft / MySQL / Overall Slowness due to Operating System Bug



Hello,

 

Unfortunately our R1Soft backups on the Kobold server didn't trigger until 11 AM yesterday and roughly 12:40 PM today due to some delays/slowness on the backup server itself. Normally this would be fine, as our servers have good hardware and plenty of I/O, but the R1Soft backup triggers several extremely intensive queries against MySQL as part of its MySQL backup process and, as such, it's causing slowness/instability/issues.

 

This was exacerbated by a few accounts using more resources than they should to begin with; we suspended and notified those users yesterday. Today, upon investigation, we found numerous 'stuck' queries running in the MySQL server:

http://www.screen-shot.net/2014-01-04_13-35-18.png

 

We did attempt to stop/kill those threads, as you can see indicated by "Killed" at the beginning of each query, however they have failed to close out. We've been forced today, just as we were yesterday, to force-quit MySQL.
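
For anyone curious what "killing" a stuck thread looks like in practice, here is a minimal sketch of the approach: list long-running queries from the processlist and issue KILL for each. It assumes a local MySQL server and admin credentials, and uses the pymysql client library purely for illustration - it's not necessarily the exact tooling we use.

```python
# Minimal sketch: list long-running queries and issue KILL for each one.
# Assumes a local MySQL server and admin credentials; pymysql is just one
# client library that can run SHOW PROCESSLIST / KILL.
import pymysql

MAX_SECONDS = 300  # treat anything running longer than 5 minutes as stuck

conn = pymysql.connect(host="localhost", user="root", password="secret")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW FULL PROCESSLIST")
        # Columns: Id, User, Host, db, Command, Time, State, Info
        for pid, user, host, db, command, time_s, state, info in cur.fetchall():
            if command == "Query" and time_s and time_s > MAX_SECONDS:
                print(f"Killing thread {pid} after {time_s}s: {(info or '')[:80]}")
                cur.execute("KILL %s", (pid,))
finally:
    conn.close()
```

As this incident shows, a killed thread can sit in the "Killed" state indefinitely if it's blocked on I/O, which is when a full MySQL restart becomes the only option.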

 

The result is that when MySQL starts back up it's going to take a few minutes to go over the databases and make sure they're complete/repaired. After this is done we will be performing a manual check that will take a couple of hours but should have little to no impact.
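
For reference, a pass along these lines can be run with the stock mysqlcheck utility; the flags below are the standard MySQL ones and are shown as an illustration, not necessarily our exact invocation.

```python
# Example integrity pass using the stock mysqlcheck utility (assumes shell
# access and MySQL admin credentials available to the tool).
import subprocess

subprocess.run(
    ["mysqlcheck", "--all-databases", "--check", "--auto-repair"],
    check=True,  # raise if mysqlcheck exits non-zero (i.e. something it couldn't fix)
)
```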

 

We apologize for any trouble this may have caused you or may be causing you, and I can assure you we're doing everything we can both to resolve the current issue and to avoid it in the future. We did stop the R1Soft backup for today and killed these MySQL threads, however manual administrative intervention was still required.

 

We'll update this thread if we have anything new to provide - we're currently working on restoring MySQL and expect it to be back online within roughly 5 minutes.


The server is back to normal. Because we had to stop the R1Soft backup, it will be performing a complete block scan [seed backup] tonight rather than an incremental backup of the changes. This means it will likely run somewhere between 14 and 24 hours; however, we did disable the MySQL Databases portion of the backup, as that is what was causing the most issues.

 

We're in contact with R1Soft support regarding this matter, however we're not holding our breath, as their support unfortunately tends to be slow/incompetent.

 

We will still have backups of all MySQL data on the server, however, we will not be able to restore individual databases or tables until this issue is resolved.


It appears I may have been incorrect - these queries are still firing off and, to the best of my knowledge, it's not R1Soft that's doing it.

 

I'm going to have to open a ticket with cPanel [an urgent one] to see if they can determine the cause, as I'm not seeing anything myself:

http://www.screen-shot.net/2014-01-04_14-46-22.png


I managed to get the queries under control without having to kill MySQL and will be watching to see if they crop back up.

 

I'll be monitoring the server personally [i.e. having it up on my secondary display so I can watch it all day]. At this time we're still investigating.


The issue has been identified as a Kernel Memory Management issue. Unfortunately this is *way* over my head [it's a kernel developer issue].

 

You can see the issue here on this graph:

http://www.screen-shot.net/memory-day.png

 

You can see where the purple [cache] basically disappears and is replaced by orange [slab cache]. What's happening is that the slab cache is filling up all available system RAM and, as such, there is no RAM available for disk I/O; disk I/O slows to a crawl, MySQL queries slow down and back up, and then the system runs out of CPU.
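
For those who like to watch this sort of thing directly, the same numbers the graph is drawn from are exposed by the kernel in /proc/meminfo. A small sketch (standard Linux interface; nothing here is specific to our setup):

```python
# Small sketch: read the memory breakdown the graph shows straight from
# /proc/meminfo (a standard Linux kernel interface). Values are in kB.
def meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # drop the trailing " kB"
    return info

m = meminfo()
for field in ("MemTotal", "MemFree", "Cached", "Slab", "SReclaimable", "SUnreclaim"):
    print(f"{field:>14}: {m[field] / 1024 / 1024:6.1f} GB")
```

When the problem is active, the "Slab" line grows to swallow most of the 64 GB while "Cached" shrinks toward nothing, which is exactly the purple-to-orange swap visible in the graph.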

 

We're getting with our kernel developer [CloudLinux] to see what can be done about this as this isn't intended behavior and it's certainly causing performance issues.

 

While we certainly apologize for this issue, do understand that it's not within our control; we are going to be doing everything we possibly can to get it fixed as quickly as possible.


Thanks a lot for the update! I wonder what would cause the kernel to improperly manage memory...

There is always a cause for these things.

 

I am an IT tech myself in the UK, and maybe there is some malicious code. I know that if anyone were to upload malicious code they could gain root access via Meterpreter and PHP...

 

I really hope that CloudLinux will be able to resolve this.

 

Thanks


The developer is aware of the issue but has, thus far, been unable to figure out what is causing it [we're not the only ones with this issue].

 

That said - we use SSD-cached I/O, and this issue is only causing problems while R1Soft backups are running on the server. The memory management issue only affects performance *while* R1Soft is running due to the nature of R1Soft. The long and short of it is that R1Soft re-allocates writes while it runs so as not to cause data inconsistency during the backup, and this slows things down as well.
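
To illustrate why that write re-allocation costs something, here is a toy copy-on-write sketch. It is emphatically not R1Soft's implementation, just the general idea: while a backup is open, the first write to a block saves the old contents aside so the backup still reads a consistent point-in-time image, and that extra copy per write is where the overhead comes from.

```python
# Toy illustration of the copy-on-write idea behind snapshot-consistent backups
# (NOT R1Soft's actual implementation). While a backup is open, the first write
# to any block preserves the old contents so the backup reads a consistent
# point-in-time image; the extra copy per write is the performance hit.
class SnapshottedDevice:
    def __init__(self, blocks):
        self.blocks = list(blocks)   # live data
        self.snapshot = {}           # preserved contents of blocks overwritten mid-backup
        self.in_backup = False

    def begin_backup(self):
        self.in_backup = True
        self.snapshot.clear()

    def end_backup(self):
        self.in_backup = False
        self.snapshot.clear()

    def write(self, index, data):
        if self.in_backup and index not in self.snapshot:
            self.snapshot[index] = self.blocks[index]  # the extra I/O per write
        self.blocks[index] = data

    def read_for_backup(self, index):
        # The backup sees the preserved copy if the block changed during the run.
        return self.snapshot.get(index, self.blocks[index])
```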

 

All of this added together results in a sluggish or non-responsive server.

 

We're currently working on disabling R1Soft on this server and moving to the built-in cPanel backup system in the meantime, so that this performance hit doesn't happen and we still maintain backups. The result should be that, as far as the end user is concerned, this issue will be resolved [i.e. no more I/O or MySQL issues].


We're hoping to move to SSD-Cached SAN + VMWare ESXi soon to offer additional redundancy/HA but have no ETA on that. Still working out all of the details.

 

In the meantime - we're just going to drop R1Soft and go with traditional backups until the kernel issue itself is resolved. If not for the kernel issue no impact would be noticed.


The issue appears to be getting worse. The really irritating part of all of this is that our hands are tied and we have no power to resolve this ourselves, as it's an issue with the system kernel itself. This morning the issue caused all RAM - all 64 GB of it - to be exhausted, resulting in slowness/downtime. You can see from this graph that the issue is getting more persistent and causing more problems [it should not be as spiky as it is]:

http://www.screen-shot.net/2014-01-11_10-07-08.png

 

All of the spikes where you see the purple get smaller and the orange/yellow get larger should not be happening at all. The system should have 10 to 30 GB of free RAM at all times.
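
As a rough illustration of the kind of thing monitoring could watch for, a cron-style check could warn when effectively-free memory falls below a floor. The 10 GB floor below is just taken from the 10-30 GB of headroom this box normally has; the field names are the standard /proc/meminfo ones.

```python
# Cron-style check: warn when free + reclaimable memory drops below a floor.
# The 10 GB floor is an assumption based on this server's normal headroom.
FLOOR_GB = 10

def kb(field, text):
    for line in text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1])
    return 0

with open("/proc/meminfo") as f:
    text = f.read()

free_gb = (kb("MemFree", text) + kb("Cached", text) + kb("SReclaimable", text)) / 1024 / 1024
if free_gb < FLOOR_GB:
    print(f"WARNING: only {free_gb:.1f} GB effectively free - slab may be ballooning again")
```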

 

I have sent off an email to the Chief Executive Officer of CloudLinux concerning this as this is unacceptable.

 

We may, possibly, be forced to boot out of CloudLinux and into a standard CentOS 6 System Kernel. While we don't wish to do this - ultimately we have to do what we have to do to maintain stability.


This issue is still ongoing and we've been working to reduce the load on the server to help hide the issue [i.e. so you don't have issues even though the core issue still exists].

 

The server is normally loaded to only about 50~60% of what it's capable of, running almost half idle most of the time, but even that is too much with this RAM issue ongoing.
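
For anyone wondering what "50~60% of capacity" means in practice, the simplest view is load average relative to core count; a quick sketch using standard Python stdlib calls (the thresholds implied here are illustrative, not our exact capacity model):

```python
# Compare the 1/5/15-minute load averages against the number of CPU cores.
import os

cores = os.cpu_count()
for minutes, load in zip((1, 5, 15), os.getloadavg()):
    print(f"{minutes:>2} min load: {load:6.2f}  ({load / cores:6.1%} of {cores} cores)")
```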


CloudLinux is making some changes to the server that they believe will either help alleviate and/or resolve the issue. They're also installing some custom monitoring of their own to watch for the issue so that they can investigate the issue as it happens.

 

With luck their modifications will, at least, restore full stability to the server.


The developers have made some changes that appear to be reducing the issue, however the system is now dipping into SWAP [it shouldn't; it has 64 GB of RAM]. I'm in the process of dumping SWAP back to RAM and will then re-enable SWAP, and we will continue to monitor.
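
For those curious, "dumping SWAP back to RAM" is just the standard Linux swapoff/swapon cycle; a minimal sketch (one would normally confirm there is enough free RAM to absorb the swapped-out pages before running it):

```python
# Disable swap so its pages are read back into RAM, then re-enable it.
# Check first (e.g. with `free`) that MemFree exceeds the swap in use,
# or swapoff will grind the box or fail outright.
import subprocess

subprocess.run(["swapoff", "-a"], check=True)  # forces swapped pages back into memory
subprocess.run(["swapon", "-a"], check=True)   # re-enable swap, now empty
```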

 

You can see from this graph compared to the ones above that the behavior seems to be improved:

http://www.screen-shot.net/2014-01-17_10-37-14.png


I really want to call this fixed/resolved but I'm going to give it a bit more time to be sure. I will mark this 'tentatively resolved'.

 

I'm going to find out from CloudLinux what changes they made [i.e. what the original issue was] for anyone who is curious - I will post it here.


The speed issues you have seen over the last 2 days are unrelated to this issue.

We had to temporarily use Kobold as a staging area due to data corruption on the Echo server. It wasn't the load of the accounts but the migration of data in and out of the server. We leave enough breathing room on all of our servers that if we had to combine two of them, one server would be able to sustain the load.

Moving data at 200~300 megabytes/second can put an incredible strain on the server's disks, which can slow things down, but we needed to get the process done as quickly as possible.

I'm actually in the process of removing the 'temporary' accounts from Kobold [which is wiping out about 300 GB of data], which is fairly intensive. It should finish soon, and when it's done things should go back to normal. I wish there were a way to throttle the deletion, but there is not.

Here is a graph showing the data that was migrated in:

http://www.screen-shot.net/2014-01-20_15-05-00.png

 

That same data is now being removed. As you can see, the amount moved in was small compared to the total amount contained within the Kobold server, but at around 300 GB it is still a significant amount of data.

