Kobold - R1Soft / MySQL / Overall Slowness due to Operating System Bug

Tentatively Resolved

21 replies to this topic

#1 MikeDVB

MikeDVB

    Forum Administrator

  • Staff Administrator
  • PipPipPipPipPip
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 04 January 2014 - 01:41 PM

Hello,

 

Unfortunately our R1Soft backups on the Kobold server didn't trigger until 11 AM yesterday and roughly 12:40 PM today due to delays/slowness on the backup server itself.  Normally this would be fine as our servers have good hardware and plenty of I/O, but the R1Soft backup triggers several extremely intensive MySQL queries as part of its MySQL backup process and, as such, it's causing slowness/instability.

 

This was exacerbated by a few accounts using more resources than they should have to begin with; we suspended and notified those users yesterday.  Today, upon investigation, we found numerous 'stuck' queries running in the MySQL server:

2014-01-04_13-35-18.png

 

We did attempt to stop/kill those threads, as you can see indicated by "Killed" at the beginning of the query, but they failed to close out.  Just as we were yesterday, we've been forced today to force-quit MySQL.

 

The result is that when it starts back up it's going to take a few minutes to go over databases and make sure they're complete/repaired.  After this is done we will be performing a manual check that will take a couple of hours but should have little to no impact.

 

We apologize for any trouble this may have caused or may be causing you, and I can assure you we're doing everything we can both to resolve the current issue and to keep it from happening again.  We did stop the R1Soft backup for today and killed these MySQL threads; however, manual administrative intervention was still required.

 

We'll update this thread if we have anything new to provide - we're currently working on restoring MySQL and we expect it to be back online within 5 minutes roughly.
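
A minimal sketch of how threads like the ones described in this post could be picked out of `SHOW FULL PROCESSLIST` output. It assumes the rows have already been fetched as dictionaries keyed by the standard column names (`Id`, `Command`, `Time`, `Info`); the function name, the 300-second threshold and the sample rows are hypothetical and are not taken from the Kobold server.

```python
def find_stuck_threads(rows, min_seconds=300):
    """Return processlist rows that look stuck, longest-running first.

    A row counts as stuck if it is flagged "Killed" but still listed,
    or if it has been in its current state for at least min_seconds.
    Idle ("Sleep") connections are ignored.
    """
    stuck = []
    for row in rows:
        command = row.get("Command") or ""
        seconds = int(row.get("Time") or 0)
        if command == "Sleep":
            continue  # idle connection, not a running query
        if command == "Killed" or seconds >= min_seconds:
            stuck.append(row)
    stuck.sort(key=lambda r: int(r.get("Time") or 0), reverse=True)
    return stuck


# Hypothetical rows shaped like SHOW FULL PROCESSLIST output.
sample_rows = [
    {"Id": 1, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 2, "Command": "Query", "Time": 12, "Info": "SELECT 1"},
    {"Id": 3, "Command": "Killed", "Time": 4100, "Info": "SELECT * FROM example_table"},
    {"Id": 4, "Command": "Query", "Time": 650, "Info": "SELECT * FROM other_table"},
]

stuck_ids = [row["Id"] for row in find_stuck_threads(sample_rows)]
print(stuck_ids)
```

A thread that stays listed as "Killed" has been told to stop but has not yet released its resources, which is why such rows are reported regardless of the time threshold.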


Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#2 MikeDVB

Posted 04 January 2014 - 01:46 PM

MySQL has been restarted and is online - the server should be back to normal within a few minutes.  We're about to begin the manual repair scan.


#3 MikeDVB

Posted 04 January 2014 - 01:53 PM

The server is back to normal.  Because we had to stop the R1Soft backup, it will be performing a complete block scan [seed backup] tonight rather than an incremental backup of the changes.  This means it will likely run somewhere between 14 and 24 hours; however, we did disable the MySQL Databases portion of the backup as this is what is causing the most issues.

We're getting with R1Soft support regarding this matter; however, we're not holding our breath as their support unfortunately tends to be slow/incompetent.

We will still have backups of all MySQL data on the server, but we will not be able to restore individual databases or tables until this issue is resolved.


#4 MikeDVB

Posted 04 January 2014 - 02:46 PM

It appears I may have been incorrect - these queries are still firing off and it's not R1Soft that's doing it to the best of my knowledge.

 

I'm going to have to open a ticket with cPanel [an urgent one] to see if they can determine the cause as I'm not seeing anything myself:

2014-01-04_14-46-22.png


#5 MikeDVB

Posted 04 January 2014 - 03:04 PM

I managed to get the queries under control without having to kill MySQL and will be watching to see if they crop back up.

 

I'll be monitoring the server personally [i.e. having it up on my secondary display so I can watch it all day].  At this time we're still investigating.


#6 GrooveAnatomy

GrooveAnatomy

    Newbie

  • Members
  • Pip
  • 1 posts
  • Gender:Male
  • Location:Washington, DC

Posted 05 January 2014 - 05:35 AM

Thanks for the updates!


Stewart Bernard II

Grooveanatomy.com


#7 mrintech

mrintech

    Newbie

  • Members
  • Pip
  • 12 posts
  • Gender:Male
  • Location:Bhopal, India

Posted 06 January 2014 - 04:49 AM

For the past 2 days, I have been getting Database Connection Errors randomly on every page.

 

While writing this post, I am getting the errors, so thought of notifying you:

 

J6HO2c2.png

a3VINfL.png

 

Thanks 


#8 MikeDVB

Posted 06 January 2014 - 10:12 AM

The issue has been identified as a Kernel Memory Management issue.  Unfortunately this is *way* over my head [it's a kernel developer issue].

 

You can see the issue here on this graph:

memory-day.png

 

Notice where the purple [cache] basically disappears and is replaced by orange [slab cache].  What's happening is that the slab cache is filling up all available system RAM and, as such, there is no RAM left for caching disk I/O, so disk I/O slows to a crawl.  As a result, MySQL queries slow down and back up, and then the system runs out of CPU.

 

We're getting with our kernel developer [CloudLinux] to see what can be done about this as this isn't intended behavior and it's certainly causing performance issues.

 

While we certainly apologize for this issue, please understand that it's not within our control and that we are going to be doing everything we possibly can to get this fixed as quickly as possible.
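
The cache/slab split described in this post can also be read from the text of `/proc/meminfo`, which reports `MemTotal`, `MemFree`, `Cached` and `Slab` in kB. The following is a rough sketch only: the function names are hypothetical, and the sample figures are invented to resemble a 64 GB machine whose slab has crowded out the page cache. They are not measurements from the Kobold server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def memory_shares(info):
    """Return slab, page cache and free memory as fractions of MemTotal."""
    total = info["MemTotal"]
    return {
        "slab": info.get("Slab", 0) / total,
        "page_cache": info.get("Cached", 0) / total,
        "free": info.get("MemFree", 0) / total,
    }


# Invented sample resembling the situation in the graph (values in kB).
sample_text = """\
MemTotal:       67108864 kB
MemFree:          524288 kB
Cached:          1048576 kB
Slab:           50331648 kB
"""

shares = memory_shares(parse_meminfo(sample_text))
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On a live Linux system the same functions could be fed the contents of `/proc/meminfo`; a slab share that keeps growing while the page cache share shrinks would match the pattern in the graph.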


#9 JohnUK

JohnUK

    Newbie

  • Members
  • Pip
  • 3 posts

Posted 06 January 2014 - 11:22 AM

Thanks a lot for the update! I wonder what would cause the kernel to manage memory improperly.

There is always a cause for everything.

I am an IT tech myself in the UK, and maybe there is some malicious code involved. I know that if anyone were to upload malicious code, they could gain root access via Meterpreter and PHP.

I really hope that CloudLinux will be able to resolve this.

Thanks


#10 MikeDVB

Posted 06 January 2014 - 11:25 AM

The developer is aware of the issue but has, thus far, been unable to figure out what is causing it [we're not the only ones with this issue].

 

That said - we use SSD Cached I/O and this issue is only causing problems while R1Soft backups are running on the server.  The memory management issue only affects performance *while* R1Soft is running, due to the nature of R1Soft.  The long and short of it is that R1Soft re-allocates writes while it runs so as not to cause data inconsistency during the backup, and this slows things down as well.

 

All of this added together results in a sluggish or non-responsive server.

 

We're actively working on disabling R1Soft on this server and moving to the built-in cPanel backup system in the meantime so that this performance hit doesn't happen and we still maintain backups.  The result should be that this issue will be resolved as far as the end-user is concerned [i.e. no more I/O or MySQL issues].


#11 JohnUK

Posted 06 January 2014 - 11:33 AM

Right, understood. 

 

Maybe take a look at http://www.unitrends.com/ . It might just work out. Are your servers virtualized? If so, it would be easier to back up with VMware.


#12 MikeDVB

Posted 06 January 2014 - 11:34 AM

We're hoping to move to SSD-Cached SAN + VMware ESXi soon to offer additional redundancy/HA but have no ETA on that.  We're still working out all of the details.

 

In the meantime - we're just going to drop R1Soft and go with traditional backups until the kernel issue itself is resolved.  If not for the kernel issue no impact would be noticed.


#13 T0M

T0M

    Newbie

  • Members
  • Pip
  • 6 posts

Posted 07 January 2014 - 11:29 AM

Thanks for the updates, Mike.  I'm sure CloudLinux and R1Soft will work this issue out.  In a perfect world, these things wouldn't happen but it is what it is.  I appreciate the service you guys provide.  Happy New Year!


#14 mrintech

Posted 08 January 2014 - 09:51 AM

Thanks for the updates :)  :wub:

 

Even HostGator (my previous host; 3+ years) never gave such details  <_<  :mellow:


#15 MikeDVB

Posted 09 January 2014 - 07:11 PM

We're still working with CloudLinux to investigate this - I have no ETA but it is actively being investigated and worked on.


#16 MikeDVB

Posted 11 January 2014 - 10:15 AM

The issue appears to be getting worse.  The really irritating part of all of this is that our hands are tied and we have no power to resolve this ourselves as it's an issue with the system kernel itself.  This morning this issue caused all RAM - 64 GB of it - to be exhausted, resulting in slowness/downtime.  You can see from this graph that the issue is becoming more persistent and causing more problems [it should not be as spiky as it is]:

2014-01-11_10-07-08.png

 

All of the spikes where you see the purple get smaller and the orange/yellow get larger should not be happening at all.  The system should have 10 to 30 GB of free RAM at all times.

 

I have sent off an email to the Chief Executive Officer of CloudLinux concerning this as this is unacceptable.

 

We may, possibly, be forced to boot out of CloudLinux and into a standard CentOS 6 System Kernel.  While we don't wish to do this - ultimately we have to do what we have to do to maintain stability.


#17 MikeDVB

Posted 14 January 2014 - 12:06 AM

This issue is still ongoing and we've been working to reduce the load on the server to help hide the issue [i.e. so you don't have issues even though the core issue still exists].

 

The server is normally loaded to only about 50~60% of what it's capable of, running almost half idle most of the time, but even that is too much with this RAM issue ongoing.


#18 MikeDVB

Posted 15 January 2014 - 10:18 PM

CloudLinux is making some changes to the server that they believe will either alleviate or resolve the issue.  They're also installing some custom monitoring of their own to watch for the issue so that they can investigate it as it happens.

 

With luck their modifications will, at least, restore full stability to the server.


#19 MikeDVB

Posted 17 January 2014 - 10:48 AM

The developers have made some changes that appear to be reducing the issue; however, the system is now dipping into swap [it shouldn't, it's got 64 GB of RAM].  I'm in the process of moving what's in swap back into RAM, after which I'll re-enable swap and we will continue to monitor.

 

You can see from this graph compared to the ones above that the behavior seems to be improved:

2014-01-17_10-37-14.png
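
Moving swap contents back into RAM is commonly done on Linux by running `swapoff -a` followed by `swapon -a`; whether that is exactly what was done here is an assumption. Before doing so it is worth checking that the swapped-out data will actually fit in memory. Below is a rough sketch of such a check. It works on values (in kB) already read from `/proc/meminfo`; the function name, the headroom default and the sample figures are hypothetical. Treating `MemFree + Cached` as usable memory is only an approximation, since not all cached pages can be dropped.

```python
def swap_fits_in_ram(info, headroom_kb=1_000_000):
    """Estimate whether used swap can be pulled back into RAM.

    info holds /proc/meminfo values in kB. Free memory plus page cache
    is used as a rough estimate of what is available; headroom_kb is
    kept in reserve on top of the used swap.
    """
    used_swap = info["SwapTotal"] - info["SwapFree"]
    roughly_available = info["MemFree"] + info["Cached"]
    return used_swap + headroom_kb <= roughly_available


# Hypothetical figures in kB: about 3 GB of swap in use.
roomy = {"MemFree": 8_000_000, "Cached": 4_000_000,
         "SwapTotal": 16_000_000, "SwapFree": 13_000_000}
tight = {"MemFree": 500_000, "Cached": 500_000,
         "SwapTotal": 16_000_000, "SwapFree": 13_000_000}

print(swap_fits_in_ram(roomy))
print(swap_fits_in_ram(tight))
```

If the check fails, disabling swap risks pushing the system into an out-of-memory condition, so it would be safer to free memory first.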


#20 MikeDVB

Posted 17 January 2014 - 02:04 PM

I really want to call this fixed/resolved but I'm going to give it a bit more time to be sure.  I will mark this 'tentatively resolved'.

 

I'm going to find out from CloudLinux what changes they made [i.e. what the original issue was] and, for anyone who is curious, I will post it here.

