MDDHosting Forums

S1 / R1 Servers - Network Device Updates - ~9 PM ET Jan 12, 2016.



I am working on fixing sites with missing content while continuing to search for the source of the restarts. At this time the restarts have slowed down significantly. I do have multiple debug sessions open watching different aspects of the server.


The server is still online and we are spinning up secondary servers to reproduce the problem on.
Also, if you continue to have websites with content issues, please open a new ticket or reply to your existing one. We have addressed the filesystem glitch that was causing some files not to load, as well as the content encoding errors.
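For anyone wondering what "missing content" and encoding errors look like in practice, here is a rough, hypothetical sketch of the kind of scan you could run over a document root to flag empty files and text files that no longer decode as UTF-8. The path and extension list are assumptions, not our actual tooling.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: scan a document root for empty files and content
that fails to decode as UTF-8 (the two symptoms described above).
The default path and extension list are assumptions."""
import sys
from pathlib import Path

TEXT_EXTS = {".html", ".htm", ".php", ".css", ".js", ".txt"}

def scan(docroot: str) -> None:
    for path in Path(docroot).rglob("*"):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            print(f"EMPTY    {path}")
            continue
        if path.suffix.lower() in TEXT_EXTS:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as exc:
                print(f"ENCODING {path}: {exc}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "/home")
```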


Hey Guys,

 

When you first announced upgrades the other month, I think you mentioned more info to follow.

 

I'm just wondering whether this thread is all the info on the new setup. I'm excited to see what you're rolling out; will there be an email blast to customers to follow?

 

Cheers!


We've been working extremely hard all day on resolving this issue. I am going to be bringing down the S1 server momentarily to bring it online on a different piece of hardware.

 

The total downtime should not exceed 10 minutes, although I expect it to be closer to 3 minutes. I'll update this thread once it's going down and again once it's back up.


Hi - as always, MDD's communication is incredible!

 

Question - I've been getting Pingdom up-and-down reports all day for a site on the Gemini server that coincide with the S1 issues. Are the two somehow tied together? I'm not getting any downtime reports from the sites I monitor on other servers at MDD.

 

It's been confusing because the site has been offline according to the uptime robot, my own attempts to access the website in a browser, and MDD's server status page, yet the public report for Gemini shows 100% uptime.

 

Thanks!


I do apologize to everybody who experienced any issues or downtime as a result of these problems.

 

We are making a huge move to vastly improved hardware that is already giving us far more redundancy and speed than we had before. If the hardware the S1 server is running on were to fail catastrophically, the server would come back online on another piece of hardware within a few minutes - far faster than we could ever have handled a hardware failure before today.
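For illustration only, the failover behavior amounts to something like the heartbeat loop below: if the hardware node stops answering for several consecutive checks, the guest is started on standby hardware. The host name, port, timings, and restore command are hypothetical, not our actual stack.

```python
#!/usr/bin/env python3
"""Conceptual sketch only: declare a host dead after several missed heartbeats
and trigger a (hypothetical) restart of the guest on standby hardware.
Host name, port, interval, and the failover command are assumptions."""
import socket
import subprocess
import time

PRIMARY = ("s1-node-a.example.net", 22)  # hypothetical hardware node
MISSED_LIMIT = 5                         # consecutive failures before failover
INTERVAL = 2                             # seconds between heartbeats

def alive(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> None:
    missed = 0
    while True:
        if alive(*PRIMARY):
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_LIMIT:
                # Hypothetical command that boots the S1 guest on standby hardware.
                subprocess.run(["/usr/local/sbin/failover-s1-to-standby"], check=False)
                return
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```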

 

Unfortunately, such a big change has clearly caused us some growing pains, all of which I believe should be resolved at this point. Some of it required working with our software vendors to resolve issues we were experiencing, and some of it involved network consulting.

 

While I cannot promise you won't have issues or that there won't be problems, what I can promise is that if they do happen, we will most likely be working on them before you have a chance to notice them yourself. We are monitoring our new servers by the second, where the old hardware was monitored by the minute. We are still human and may not respond within a second, but it greatly decreases our response time to any incidents that occur.
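As a rough illustration of what "monitoring by the second" means, here is a minimal once-per-second probe sketch; the URL and slowness threshold are assumptions, not the monitoring system we actually run.

```python
#!/usr/bin/env python3
"""Minimal sketch of a once-per-second availability probe.
The monitored URL and the slowness threshold are assumptions."""
import time
import urllib.request

URL = "https://s1.example.net/"  # hypothetical monitored endpoint
SLOW_MS = 2000                   # flag responses slower than this

def probe(url: str) -> float:
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read(1)             # first byte is enough to time the response
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    while True:
        try:
            ms = probe(URL)
            status = "SLOW" if ms > SLOW_MS else "OK"
            print(f"{time.strftime('%H:%M:%S')} {status} {ms:.0f} ms")
        except OSError as exc:
            print(f"{time.strftime('%H:%M:%S')} DOWN {exc}")
        time.sleep(1)
```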

 

Thank you for your patience and understanding throughout this process, and again, I'm sorry for any trouble you may have experienced as a result. We don't like outages or downtime any more than you do, and these changes should ultimately result in the absolute maximum uptime we can provide for a company of our size.


Is it possible to revert to a known previous good configuration for the time being and keep the current S1 setup offline as a testing platform to figure out the issue? Despite Echo's issues, I feel like it was more reliable in recent weeks than S1 has been since the migration. I know we're not paying for anywhere near 100% uptime and I completely understand that outages happen (I'm a Linode customer as well, so it's been a rough few weeks), but it seems, at least in the short term, that this migration has caused more problems than it has solved. I fully support what you're doing to improve reliability down the road, but I feel it would be better for both sides of this issue to get out of emergency/troubleshooting/damage-control mode, get things running, and have time to properly diagnose outside of a production environment - particularly when this appears to be a hardware/configuration problem and not a DDoS or upstream connectivity issue.

 

Please believe me when I say I truly respect what you're doing and all the hard work you're putting into this, but I feel like we're beta testers in this endeavor and, as a result, it is hurting me and my clients. I know how hard it is to maintain servers - that's why I have several clients hosted here: I trust you and I don't want to be spending the late hours doing it myself. I feel like when things are rushed it is more difficult to solve problems. I have no issue at all with copies of all my client sites being duplicated in a testing environment, or whatever it takes to help get this new configuration up and running reliably. I'm just not as comfortable with it happening on the server the A and MX records point to, if that can be avoided at this point.


S1 is and has been online and operational without issues since yesterday as far as I can see.

 

We just restarted R1 to apply some of the optimizations we performed on S1, and you can see that graph here:

http://www.screen-shot.net/2016-01-20_15-04-38.png

 

You will notice the dark yellow CPU usage is almost non-existent to the right of the reboot/restart gap. Optimization for the win :).

 

The yellow is a measure of how much CPU time is spent waiting on data (reads/writes from storage). Taking it from an average of 5 to an average of 0.26 is a phenomenal change.
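For anyone who wants to watch that number on their own Linux box, here is a small sketch that samples the iowait column of /proc/stat and reports it as a percentage of total CPU time; this is just an illustration, not the tool that produced the graph above.

```python
#!/usr/bin/env python3
"""Sketch: sample the iowait field of /proc/stat and report it as a
percentage of all CPU time over a short interval (Linux only)."""
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: cpu user nice system idle iowait irq softirq steal ...
        return [int(x) for x in f.readline().split()[1:]]

def iowait_percent(interval=1.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas) or 1
    return 100.0 * deltas[4] / total  # index 4 is the iowait column

if __name__ == "__main__":
    while True:
        print(f"iowait: {iowait_percent():.2f}%")
```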


Things have been nice and stable since my last fix for this issue. The fix is more of a workaround, but it works. Hopefully cPanel and LiteSpeed will get together and work toward a real integration of LiteSpeed's services into cPanel's offerings. Currently LiteSpeed uses an 'Apache wrapper' to function with cPanel while letting cPanel believe it is Apache that is running.

 

This doesn't work so well on CentOS 7 with systemd, where cPanel overwrites/'fixes' the httpd.service unit every night.

 

LiteSpeed informed me that they are actually modifying their software to watch for cPanel breaking the LiteSpeed integration so it can be fixed on the fly, which feels like a kludge and not a solution. IMHO, cPanel should simply not be breaking LiteSpeed and should be aware that it is, in fact, LiteSpeed and not Apache that is running.
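To make the idea concrete, a watchdog of that kind might look roughly like this sketch: check whether the process actually serving requests is LiteSpeed or Apache, and re-run a restore script if cPanel has swapped Apache back in. The process names and the script path are assumptions for illustration, not LiteSpeed's or our actual implementation.

```python
#!/usr/bin/env python3
"""Illustrative watchdog sketch: if no LiteSpeed process is found,
re-run a (hypothetical) script that restores the LiteSpeed integration."""
import subprocess
import time

RESTORE_CMD = ["/usr/local/sbin/restore-litespeed-integration"]  # hypothetical

def litespeed_running() -> bool:
    # LiteSpeed worker processes typically show up as 'litespeed'/'lshttpd';
    # a plain Apache build shows up as 'httpd'. Adjust for your environment.
    out = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    names = set(out.split())
    return bool(names & {"litespeed", "lshttpd"})

if __name__ == "__main__":
    while True:
        if not litespeed_running():
            print("LiteSpeed not detected; re-applying integration")
            subprocess.run(RESTORE_CMD, check=False)
        time.sleep(60)
```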

 

At the end of the day, even if our software vendors are unable to resolve issues with their software, we will where possible.

 

http://www.screen-shot.net/2016-01-24_13-48-36.png

