[Resolved] HandyNetworks Core Routing Failure

Michael D. · January 7, 2011

At approximately 2:15 AM EST (GMT-5) our networking team was alerted to an outage. Our data center facility was updating the software on the core routing infrastructure, which unexpectedly caused the gear to go offline. The core networking equipment was brought back online within minutes. We're seeing high-latency connections to the facility at this time while the network plays "catch up", and we expect everything to return to normal within 15 to 20 minutes.

frankacter · January 7, 2011

Do we know if this was planned maintenance for the core router?

If so, is there a published schedule available, so we can get a heads-up on the timing of future maintenance and avoid planning events around it? :-)

Michael D. · January 7, 2011

It was planned maintenance; however, we were not made aware of the maintenance window, as the core router upgrades were supposed to be seamless. I misunderstood the maintenance when I originally spoke with networking. I've clarified this in post 4.

The lead networking engineer just got this to me, but I'm going to get more details:

There are redundant core routers. We attempted an eFSU/ISSU upgrade on these units, which is *supposed* to be a hitless failover, to resolve some high CPU utilization that we were seeing. However, this did not happen as planned and resulted in both core routers rebooting at the same time.

Michael D. · January 7, 2011

More information from the networking team:

The upgrade is supposed to be a dance coordinated between the two routers. However, when we entered the command that would reload the current standby router into the new version of IOS, both of the routers rebooted. Only one was supposed to reboot. The exact reason both rebooted is still unclear.
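To make the "dance" concrete, here is a minimal sketch of the idea, not Cisco's actual eFSU/ISSU procedure. It assumes a simplified model in which the site has connectivity as long as at least one router of the pair is up. The router names (`core-1`, `core-2`), the version labels, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    version: str
    role: str        # "active" or "standby"
    up: bool = True  # True while the router can forward traffic


def connectivity(pair):
    """The site has connectivity while at least one router is up."""
    return any(r.up for r in pair)


def rolling_upgrade(pair, new_version):
    """Intended sequence: reload one router at a time.

    Returns the connectivity state recorded after each step.
    """
    log = []
    standby = next(r for r in pair if r.role == "standby")
    active = next(r for r in pair if r.role == "active")

    # Reload the standby into the new version; the active keeps forwarding.
    standby.up = False
    log.append(connectivity(pair))
    standby.version, standby.up = new_version, True
    log.append(connectivity(pair))

    # Switch roles so the upgraded router takes over.
    active.role, standby.role = "standby", "active"
    log.append(connectivity(pair))

    # Reload the former active; the upgraded router keeps forwarding.
    active.up = False
    log.append(connectivity(pair))
    active.version, active.up = new_version, True
    log.append(connectivity(pair))
    return log


def simultaneous_reload(pair, new_version):
    """What was observed: both routers go down at the same time."""
    log = []
    for r in pair:
        r.up = False
    log.append(connectivity(pair))
    for r in pair:
        r.version, r.up = new_version, True
    log.append(connectivity(pair))
    return log


def make_pair():
    # Hypothetical names and version labels, for illustration only.
    return [Router("core-1", "old", "active"),
            Router("core-2", "old", "standby")]
```

In this model, `rolling_upgrade(make_pair(), "new")` never records a loss of connectivity, while `simultaneous_reload(make_pair(), "new")` records one step with no router up, which corresponds to the outage window.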

The networking engineer made it clear to me that this wasn't scheduled maintenance: while rolling out some new services to parts of the facility, the core routers were showing an extremely high CPU load. Cisco's Technical Assistance Center advised the facility to upgrade the operating systems of the routers to resolve the issue, and when a single router should have rebooted, both did.

I'm going to leave it at this, as this appears to be a one-off situation. If I get more details as to why both routers rebooted, I'll post them. The facility is working closely with Cisco to investigate why this happened and what can be done to ensure it doesn't happen again.

We sincerely apologize for any trouble this may have caused any of our customers.

frankacter · January 7, 2011

It only lasted a few short minutes and you guys were all over it, thanks for that.

Michael D. · January 7, 2011

No problem.

Michael D. · January 7, 2011

The official Reason For Outage:

Date: January 7, 2011
Time: 2:22AM - 2:32AM
Impact: 10 minute loss of routing / network connectivity
Description:
Throughout the week, our engineers have been working to integrate an InterNAP Flow Control Platform (FCP) device into our network. This device, when implemented, significantly optimizes overall network performance. On the afternoon of January 6, 2011 the device went live and started inserting optimized BGP route announcements into our core network. After some time, we noticed abnormally high CPU utilization on our core routers. We decided to proceed with an upgrade of IOS, which should have been a "hitless" upgrade. This process did not go as planned; in fact, it resulted in both of our core routers rebooting at nearly the same time. For approximately 10 minutes, all external connectivity was lost while both routers rebooted. Presently, both core routers are running the upgraded IOS version, all external connectivity has been restored, and the InterNAP FCP device is optimizing routes within our network. Additionally, CPU utilization on the routers has dropped substantially.

We of course always work diligently to prevent customer-impacting outages. It is also our policy to notify customers and schedule maintenance windows when an impact or outage is expected. In this case, we decided to move forward with this maintenance work because of the severity of the CPU load on our routers, and because the upgrade itself should have resulted in zero customer impact. You have our sincerest apologies for the brief outage that occurred this evening.

JFSG · January 9, 2011

Glad that Handy Networks have InterNAP FCP in their network now.

Michael D. · January 9, 2011

As are we! I've seen latency improvements of 10 to 15 ms just between HN and our offices; I can only imagine the level of improvement for other locations such as Europe.
