ericr Posted February 12, 2016
I have identified an outage affecting all servers in the new datacenter. I have escalated to our datacenter provider and will update as soon as possible.
ericr (Author) Posted February 12, 2016
The datacenter has confirmed that they are aware of the outage and are working to resolve the issues.
ericr (Author) Posted February 12, 2016
I have adjusted the scope to include the inbound and outbound scanners as they are also in this facility.
ericr (Author) Posted February 12, 2016
The datacenter is in contact with the transit provider to identify any network connectivity problems. They are also on-site reviewing the network gear for issues.
frankacter Posted February 12, 2016
FYI, the link to the public report for P1 on the status page is linking to the MDD support page instead of the Pingdom URL. Also, none of the speedtest links work for any of the servers.
ericr (Author) Posted February 12, 2016
We are aware of the speed test link issues. I will investigate the P1 reporting issue at a later point.
AMGill Posted February 12, 2016
An hour down... any news on an ETA? Bummer.
ericr (Author) Posted February 12, 2016
I do not have an ETA at this time. I am standing by while the datacenter works as fast as they can to locate and fix the fault.
ericr (Author) Posted February 12, 2016
The datacenter is not able to provide any further updates at this time. I will keep pressing them for information.
ericr (Author) Posted February 12, 2016
I have been provided more information. The fault is located in Level 3's network in Denver. Level 3 is working to correct the fault at this time.
AMGill Posted February 12, 2016
Wow, this is a long one. I sure hope we don't have to change servers again... don't think I could handle another move. Thanks for the updates, as I have clients waiting for answers. I know you are doing what you can.
ericr (Author) Posted February 12, 2016
We are showing the link back up. I will continue to monitor to ensure the link is stable.
andamira Posted February 12, 2016
It looks like pages in R1 are reachable again (for now). Thank you for the updates.
Vask Posted February 12, 2016
So, for two hours the problem was in a different datacenter than initially thought? A simple tracert was showing where the network problem was:

Tracing route to ************* [173.248.188.176]
over a maximum of 30 hops:

  1    <1 ms     1 ms     1 ms  10.0.0.1
  2     1 ms     1 ms     1 ms  192.168.1.1
  3    45 ms    39 ms    39 ms  80.107.108.110
  4   599 ms    52 ms    53 ms  athe-crsb-hera-gsra-1.backbone.otenet.net [79.128.224.217]
  5    77 ms    53 ms    47 ms  ten0-1-0-0-crs01.ath.oteglobe.gr [62.75.3.1]
  6    91 ms    91 ms    91 ms  62.75.4.162
  7   320 ms    96 ms   139 ms  40ge1-3.core1.lon2.he.net [195.66.224.21]
  8     *      177 ms   158 ms  100ge1-1.core1.nyc4.he.net [72.52.92.166]
  9   182 ms   176 ms   186 ms  100ge7-2.core1.chi1.he.net [184.105.223.161]
 10   250 ms   208 ms   199 ms  10ge15-2.core1.den1.he.net [184.105.81.82]
 11   199 ms   198 ms     *     handy-networks-llc.gigabitethernet2-11.core1.den1.he.net [216.66.78.126]
 12     *        *        *     Request timed out.
 13     *        *        *     Request timed out.
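A trace like the one above can also be read mechanically: the last hop that returns at least one reply before the timeouts begin is where packets stop getting through. A minimal sketch of that idea (the helper name is my own, not anything from this thread), parsing tracert-style output:

```python
def last_responding_hop(trace: str) -> str:
    """Return the final hop line that received at least one probe reply.

    In tracert output, hops where all three probes failed print
    'Request timed out.'; the hop just before those is where replies stop.
    """
    last = ""
    for line in trace.splitlines():
        line = line.strip()
        # Real hop lines start with the hop number.
        if not line or not line.split()[0].isdigit():
            continue
        if "Request timed out" in line:
            continue
        last = line
    return last


sample = """\
  9   182 ms   176 ms   186 ms  100ge7-2.core1.chi1.he.net [184.105.223.161]
 10   250 ms   208 ms   199 ms  10ge15-2.core1.den1.he.net [184.105.81.82]
 11   199 ms   198 ms     *     handy-networks-llc.gigabitethernet2-11.core1.den1.he.net [216.66.78.126]
 12     *        *        *     Request timed out.
 13     *        *        *     Request timed out.
"""

print(last_responding_hop(sample))
```

On the trace Vask posted, this lands on hop 11 at the facility's upstream edge, which is consistent with the fault being in the provider's network rather than on the servers themselves.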
ericr (Author) Posted February 12, 2016
And they are back offline. I am notifying the datacenter.
kix766 Posted February 12, 2016
(quoting AMGill) "Wow, this is a long one. I sure hope we don't have to change servers again... don't think I could handle another move. Thanks for the updates, as I have clients waiting for answers. I know you are doing what you can."
My thoughts exactly.
ericr (Author) Posted February 12, 2016
The fault was with the second DC location. The datacenter has links from the main datacenter to the second datacenter as well as direct internet links. The current report is that the issue was with Level 3 and the connection to that datacenter in Denver.
Michael D. Posted February 12, 2016
(quoting AMGill) "Wow, this is a long one. I sure hope we don't have to change servers again... don't think I could handle another move."
The migrations were to move us to the new hardware/infrastructure, and then some secondary migrations to move from CentOS7 to CentOS6. Everybody is on CentOS6 now and there is nothing wrong with our new hardware, servers, or network. The issue is outside of our border and we have zero control over it. At this point we're at the mercy of our facility, and they are at the mercy of their transit providers. This isn't affecting just us - it's affecting everybody in the facility, which is tens of thousands if not hundreds of thousands of users - us included. I do apologize for this outage and will most certainly pass on the Reason For Outage, or RFO, once it is available from our upstream provider.

Here are the status updates from them [not very descriptive, but they will give you an idea of the level of information we've had available to us]:

Update - 02:06AM MDT: Connectivity to our DTC location has been restored, but we are still working with Level3 to ensure that the problem has been completely resolved.
==========
Update - 01:50AM MDT: Level3 has identified an issue in the Denver Metro area, and is working to resolve it. We will continue to provide updates as we receive them.
==========
Update - 12:41AM MDT: We are in contact with our transit provider to identify any network connectivity problems. We are also on-site reviewing our network gear for issues.
==========
We have been alerted of connectivity issues at our Denver Tech Center location. We are working on the issue as quickly as possible and will update here.

We are 2 hours ahead of MDT.
ericr (Author) Posted February 12, 2016
The current status of the work is that connectivity to the DTC location has been restored, but they are still working with Level3 to ensure that the problem has been completely resolved.
Michael D. Posted February 12, 2016
That said, the network is online and operational, and has been since prior to my last update of this thread. None of our networking gear or servers had any issues. The best analogy I can make is that there was an accident on the highway between us and the internet - on a portion of the road that is not within our control. This stopped traffic from entering/leaving until the issue was resolved. That said, I am certainly going to get with the facility concerning this, as *one* provider out of several dropping/having issues should not result in a total lack of connectivity. That defeats the whole purpose of having multiple transit providers available to us.
KiwiTek Posted February 12, 2016
Is this the same data center you were using before? It's good that it's back up, but this is the 4th outage I've experienced since the switch to new servers and site users are starting to question the reliability of the site.
Michael D. Posted February 12, 2016
I have just reached out to the facility management requesting an official Reason For Outage. Being that it is 2:30 AM facility time and 4:30 AM here, I don't expect an answer for several hours at the earliest, but I will make it available once I have it.
Michael D. Posted February 12, 2016
(quoting KiwiTek) "Is this the same data center you were using before? It's good that it's back up, but this is the 4th outage I've experienced since the switch to new servers and site users are starting to question the reliability of the site."
The reliability of the *site* has nothing to do with the transfers and how good or bad they went for you. The facility itself was without connectivity as well - not just us. As a matter of fact, to speak plainly - yes - the migration you experienced did not go as smoothly as it should have, and for that I apologize. But this was a networking issue outside of our control. It had nothing to do with migrations. I am sorry that you experienced issues with the migrations; however, if you wish to discuss those, please open a ticket and ask for me and I'll be happy to discuss them with you.
Michael D. Posted February 12, 2016
The very short outage for the S1 server just now was not related to this. It was due to a LiteSpeed licensing issue which I resolved. I am going to be opening a ticket with LiteSpeed, as it is supposed to switch to Apache when there is a LiteSpeed issue with licensing but it did nothing [LSWS didn't start, Apache didn't get started]. I'm just updating this thread as a couple of users have asked if the issues were related.
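For anyone who wants to check which web server is actually answering after a failover like that, the Server response header distinguishes the two (LiteSpeed reports "LiteSpeed", Apache reports "Apache/..."). A minimal sketch of classifying that header - the helper name is my own, not anything from this thread:

```python
def which_server(server_header: str) -> str:
    """Classify an HTTP Server response header value."""
    h = server_header.lower()
    if "litespeed" in h:
        return "litespeed"
    if "apache" in h:
        return "apache"
    return "unknown"


# The header itself can be fetched with e.g.:
#   curl -sI https://your-site.example | grep -i '^Server:'
print(which_server("LiteSpeed"))          # -> litespeed
print(which_server("Apache/2.4 (Unix)"))  # -> apache
```

If the fallback described above had worked, a request during the license failure would have returned an Apache header instead of none at all.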
Michael D. Posted February 12, 2016
I have more detail from the facility. We are located in the H5 location in Denver, and our bandwidth is transported by Level3 over two physically diverse 10G links from H5 to 1801, where our bandwidth reaches the carriers. The idea of physically diverse fiber routes is that one could go down and we should retain connectivity [hardware failure, a fiber cut, etc]. The facility is working with Level3 currently to ascertain why this was able to take down both links, as well as what is going to be done to ensure it doesn't happen again. It is a bad day all around for you, for us, and for our provider. All of us experienced an outage that was not within our control and was unplanned. Once the official RFO is released detailing exactly what happened and why, I will make it available, but as the issue wasn't on our end/within our control I do not have an ETA.