MDDHosting Forums

S1 / R1 Servers - Network Device Updates - ~9 PM ET Jan 12, 2016.



I am working on fixing sites with missing content while continuing to search for the source of the restarts. At this time the restarts have slowed down significantly. I do have multiple debug sessions open watching different aspects of the server.


The server is still online and we are spinning up secondary servers to reproduce the problem on.
Also, if you continue to have websites with content issues, please open a new ticket or reply to your existing one. We have addressed the filesystem glitch that was causing some files not to load, as well as the content encoding errors.
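For anyone wondering what "missing content" and encoding errors look like in practice, here is a rough, hypothetical sketch of the kind of scan you could run over a document root to flag empty files and text files that no longer decode as UTF-8. The path and extension list are assumptions, not our actual tooling.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: scan a document root for empty files and content
that fails to decode as UTF-8 (the two symptoms described above).
The default path and extension list are assumptions."""
import sys
from pathlib import Path

TEXT_EXTS = {".html", ".htm", ".php", ".css", ".js", ".txt"}

def scan(docroot: str) -> None:
    for path in Path(docroot).rglob("*"):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            print(f"EMPTY    {path}")
            continue
        if path.suffix.lower() in TEXT_EXTS:
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as exc:
                print(f"ENCODING {path}: {exc}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "/home")
```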


Hey Guys,

 

When you first announced upgrades the other month, I think you mentioned more info to follow.

 

I'm just wondering whether this thread is all the info on the new setup. I'm excited to see what you're rolling out; will there be an email blast to customers to follow?

 

Cheers!


We've been working extremely hard all day on resolving this issue. I am going to be bringing down the S1 server momentarily to bring it online on a different piece of hardware.

 

The total downtime should not exceed 10 minutes, although I expect it to be closer to 3 minutes. I'll update this thread once it's going down and again once it's back up.


Hi - as always, MDD's communication is incredible!

 

Question - I've been getting Pingdom up-and-down reports all day for a site on the Gemini server that coincide with the S1 issues. Are the two somehow tied together? I'm not getting any downtime reports from the sites I monitor on other servers at MDD.

 

It's been confusing because the site has been offline according to the uptime robot, my own attempts to access the website in a browser, and MDD's server status page, yet the public report for Gemini shows 100% uptime.

 

Thanks!


I do apologize to everybody who experienced any issues or downtime as a result of these problems.

 

We are making a huge move to vastly improved hardware that is already giving us far more redundancy and speed than we had before. If the hardware the S1 server is running on were to fail catastrophically, the server would come back online on another piece of hardware within a few minutes - far faster than we could ever have handled a hardware failure before today.
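For illustration only, the failover behavior amounts to something like the heartbeat loop below: if the hardware node stops answering for several consecutive checks, the guest is started on standby hardware. The host name, port, timings, and restore command are hypothetical, not our actual stack.

```python
#!/usr/bin/env python3
"""Conceptual sketch only: declare a host dead after several missed heartbeats
and trigger a (hypothetical) restart of the guest on standby hardware.
Host name, port, interval, and the failover command are assumptions."""
import socket
import subprocess
import time

PRIMARY = ("s1-node-a.example.net", 22)  # hypothetical hardware node
MISSED_LIMIT = 5                         # consecutive failures before failover
INTERVAL = 2                             # seconds between heartbeats

def alive(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main() -> None:
    missed = 0
    while True:
        if alive(*PRIMARY):
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_LIMIT:
                # Hypothetical command that boots the S1 guest on standby hardware.
                subprocess.run(["/usr/local/sbin/failover-s1-to-standby"], check=False)
                return
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```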

 

Unfortunately, such a big change has clearly caused us some growing pains, all of which I believe should be resolved at this point. Some of it required working with our software vendors to resolve issues we were experiencing, and some of it involved network consulting.

 

While I cannot promise you won't have issues or that there won't be problems, what I can promise is that if they do happen, we will most likely be working on them before you have a chance to notice them yourself. We are monitoring our new servers by the second, where the old hardware was monitored by the minute. We are still human and may not respond within a second, but it greatly decreases our response time to any incidents that occur.
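As a rough illustration of what "monitoring by the second" means, here is a minimal once-per-second probe sketch; the URL and slowness threshold are assumptions, not the monitoring system we actually run.

```python
#!/usr/bin/env python3
"""Minimal sketch of a once-per-second availability probe.
The monitored URL and the slowness threshold are assumptions."""
import time
import urllib.request

URL = "https://s1.example.net/"  # hypothetical monitored endpoint
SLOW_MS = 2000                   # flag responses slower than this

def probe(url: str) -> float:
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read(1)             # first byte is enough to time the response
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    while True:
        try:
            ms = probe(URL)
            status = "SLOW" if ms > SLOW_MS else "OK"
            print(f"{time.strftime('%H:%M:%S')} {status} {ms:.0f} ms")
        except OSError as exc:
            print(f"{time.strftime('%H:%M:%S')} DOWN {exc}")
        time.sleep(1)
```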

 

Thank you for your patience and understanding throughout this process, and again, I'm sorry for any trouble you may have experienced as a result. We don't like outages or downtime any more than you do, and these changes should ultimately result in the absolute maximum uptime we can provide for a company of our size.


Is it possible to revert to a known previous good configuration for the time being and keep the current S1 setup offline as a testing platform to figure out the issue? Despite Echo's issues, I feel like it was more reliable in recent weeks than S1 has been since the migration. I know we're not paying for anywhere near 100% uptime and I completely understand that outages happen (I'm a Linode customer as well, so it's been a rough few weeks), but it seems, at least in the short term, that this migration has caused more problems than it has solved. I fully support what you're doing to improve reliability down the road, but I feel it would be better for both sides of this issue to get out of emergency/troubleshooting/damage-control mode, get things running, and have time to properly diagnose outside of a production environment - particularly when this appears to be a hardware/configuration problem and not a DDoS or upstream connectivity issue.

 

Please believe me when I say I truly respect what you're doing and all the hard work you're putting into this, but I feel like we're beta testers in this endeavor and, as a result, it is hurting me and my clients. I know how hard it is to maintain servers - that's why I have several clients hosted here: I trust you and I don't want to be spending the late hours doing it myself. I feel like when things are rushed it is more difficult to solve problems. I have no issue at all with copies of all my client sites being duplicated in a testing environment, or whatever it takes to help get this new configuration up and running reliably. I'm just not as comfortable with it happening on the server the A and MX records point to, if that can be avoided at this point.


S1 is and has been online and operational without issues since yesterday as far as I can see.

 

We just restarted R1 to apply some of the optimizations we performed on S1, and you can see that graph here:

http://www.screen-shot.net/2016-01-20_15-04-38.png

 

You will notice the dark yellow CPU usage is almost non-existent to the right of the reboot/restart gap. Optimization for the win :).

 

The yellow is a measure of how much CPU time is spent waiting on data (reads/writes from storage). Taking it from an average of 5 to an average of 0.26 is a phenomenal change.
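For anyone who wants to watch that number on their own Linux box, here is a small sketch that samples the iowait column of /proc/stat and reports it as a percentage of total CPU time; this is just an illustration, not the tool that produced the graph above.

```python
#!/usr/bin/env python3
"""Sketch: sample the iowait field of /proc/stat and report it as a
percentage of all CPU time over a short interval (Linux only)."""
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: cpu user nice system idle iowait irq softirq steal ...
        return [int(x) for x in f.readline().split()[1:]]

def iowait_percent(interval=1.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas) or 1
    return 100.0 * deltas[4] / total  # index 4 is the iowait column

if __name__ == "__main__":
    while True:
        print(f"iowait: {iowait_percent():.2f}%")
```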


Things have been nice and stable since my last fix for this issue. The fix is more of a workaround, but it works. Hopefully cPanel and LiteSpeed will get together and work toward a real integration of LiteSpeed's services into cPanel's offerings. Currently LiteSpeed uses an 'Apache wrapper' to function with cPanel while letting cPanel believe it is Apache that is running.

 

This doesn't work so well on CentOS 7 with systemd, where cPanel overwrites/'fixes' the httpd.service unit every night.

 

LiteSpeed informed me that they are actually modifying their software to watch for cPanel breaking the LiteSpeed integration so it can be fixed on the fly, which feels like a kludge and not a solution. IMHO, cPanel should simply not be breaking LiteSpeed and should be aware that it is, in fact, LiteSpeed and not Apache that is running.
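To make the idea concrete, a watchdog of that kind might look roughly like this sketch: check whether the process actually serving requests is LiteSpeed or Apache, and re-run a restore script if cPanel has swapped Apache back in. The process names and the script path are assumptions for illustration, not LiteSpeed's or our actual implementation.

```python
#!/usr/bin/env python3
"""Illustrative watchdog sketch: if no LiteSpeed process is found,
re-run a (hypothetical) script that restores the LiteSpeed integration."""
import subprocess
import time

RESTORE_CMD = ["/usr/local/sbin/restore-litespeed-integration"]  # hypothetical

def litespeed_running() -> bool:
    # LiteSpeed worker processes typically show up as 'litespeed'/'lshttpd';
    # a plain Apache build shows up as 'httpd'. Adjust for your environment.
    out = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    names = set(out.split())
    return bool(names & {"litespeed", "lshttpd"})

if __name__ == "__main__":
    while True:
        if not litespeed_running():
            print("LiteSpeed not detected; re-applying integration")
            subprocess.run(RESTORE_CMD, check=False)
        time.sleep(60)
```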

 

At the end of the day, even if our software vendors are unable to resolve issues with their software, we will where possible.

 

http://www.screen-shot.net/2016-01-24_13-48-36.png

