[Resolved] HandyNetworks Core Routing Failure


8 replies to this topic

#1 MikeDVB

    Forum Administrator

  • Staff Administrator
  • 2,900 posts
  • Gender:Male
  • Location:Central Indiana, USA

Posted 07 January 2011 - 02:36 AM

At approximately 2:15 AM EST (GMT-5), our networking team was alerted to an outage. Our data center facility was updating the software on the core routing infrastructure, which unexpectedly caused the gear to go offline. The core networking equipment was brought back online within minutes. We're seeing high-latency connections to the facility at this time while the network plays "catch up", and we expect everything to return to normal within 15 to 20 minutes.
Michael Denney - MDDHosting LLC - Providing Hosting since 2007
Scalable shared hosting plans in the cloud! Check them out!
Highly Available Cloud Shared, Reseller, and VPS
http://www.mddhosting.com/

#2 frankacter

    Member

  • Clients
  • 46 posts
  • Gender:Male

Posted 07 January 2011 - 02:45 AM

Do we know if this was planned maintenance for the core router?

If so, is there a published schedule that can be made available, so we have a heads-up on the timing of future maintenance and can avoid planning events around it? :-)

#3 MikeDVB

Posted 07 January 2011 - 02:56 AM

It was planned maintenance; however, we were not made aware of the maintenance window because the core router upgrades were supposed to be seamless. I misunderstood the maintenance when I originally spoke with networking. I've clarified this in post 4.

The lead networking engineer just got this to me, but I'm going to get more details:

There are redundant core routers. We attempted an eFSU/ISSU upgrade on these units to resolve some high CPU utilization that we were seeing, which is *supposed* to be a hitless failover. However, this did not happen as planned and resulted in both core routers rebooting at the same time.



#4 MikeDVB

Posted 07 January 2011 - 03:00 AM

More information from the networking team:

The upgrade is supposed to be a coordinated dance between the two routers. However, when we entered the command that would reload the current standby router with the new version of IOS, both routers rebooted. Only one was supposed to reboot. The exact reason both rebooted is still unclear.
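
To illustrate the intended sequence, here is a minimal Python sketch of a one-at-a-time upgrade of a redundant pair. It is only a toy model, not Cisco's actual eFSU/ISSU procedure: the `Router` class, the `hitless_upgrade` function, and the names `core1`/`core2` are all hypothetical. The safety check in `reload` corresponds to the rule that only one unit should ever be down at a time, which is the rule that was broken when both routers rebooted.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Toy stand-in for one core router (hypothetical model, not a Cisco API)."""
    name: str
    version: str
    role: str  # "active" or "standby"
    online: bool = True


def hitless_upgrade(active: Router, standby: Router, new_version: str) -> list[str]:
    """Upgrade a redundant pair one unit at a time and return the event log."""
    log: list[str] = []

    def reload(target: Router, peer: Router) -> None:
        # Safety check: never reload a unit unless its peer is online.
        if not peer.online:
            raise RuntimeError(
                f"{peer.name} is offline; reloading {target.name} would drop all routing"
            )
        target.online = False
        log.append(f"reload {target.name} to {new_version}; {peer.name} carries traffic")
        target.version = new_version
        target.online = True

    # Step 1: reload only the standby unit with the new version.
    reload(standby, active)
    # Step 2: switch over so the upgraded unit becomes active.
    active.role, standby.role = "standby", "active"
    log.append(f"switchover: {standby.name} is now active")
    # Step 3: reload the former active unit, which is now the standby.
    reload(active, standby)
    return log


if __name__ == "__main__":
    core1 = Router("core1", "old", "active")
    core2 = Router("core2", "old", "standby")
    for event in hitless_upgrade(core1, core2, "new"):
        print(event)
```

In this model, at least one router is online at every step, so routing is never lost. The outage matches the case the check guards against: both units offline at once.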

The networking engineer made it clear to me that this wasn't scheduled maintenance. While the facility was rolling out some new services to parts of the network, the core routers were showing extremely high CPU load. Cisco's Technical Assistance Center advised the facility to upgrade the routers' operating systems to resolve the issue, and when a single router should have rebooted, both did.

I'm going to leave it at this, as this appears to be a one-off situation. If I get more details as to why both routers rebooted, I'll post them. The facility is working closely with Cisco to investigate why this happened and what can be done to ensure it doesn't happen again.

We sincerely apologize for any trouble this may have caused any of our customers.

#5 frankacter

Posted 07 January 2011 - 03:13 AM

It only lasted a few short minutes and you guys were all over it, thanks for that.

#6 MikeDVB

Posted 07 January 2011 - 03:14 AM

Quoting frankacter: "It only lasted a few short minutes and you guys were all over it, thanks for that."

No problem.

#7 MikeDVB

Posted 07 January 2011 - 11:30 AM

The official Reason For Outage:

Date: January 7, 2011
Time: 2:22AM - 2:32AM
Impact: 10-minute loss of routing / network connectivity
Description:
Throughout the week, our engineers have been working to integrate an InterNAP Flow Control Platform (FCP) device into our network. This device, when implemented, significantly optimizes overall network performance. On the afternoon of January 6, 2011, the device was brought live and started inserting optimized BGP route announcements into our core network. After some time, we noticed abnormally high CPU utilization on our core routers.

We decided to proceed with an upgrade of IOS, which should have been a "hitless" upgrade. This process did not go as planned; in fact, it resulted in both of our core routers rebooting at nearly the same time. For approximately 10 minutes, all external connectivity was lost while both routers rebooted. Presently, both core routers are running the upgraded IOS version, all external connectivity has been restored, and the InterNAP FCP device is optimizing routes within our network. Additionally, CPU utilization on the routers has dropped substantially.

We always work diligently to prevent customer-impacting outages. It is also our policy to notify customers and schedule maintenance windows when an impact or outage is expected. In this case, we decided to move forward with this maintenance work because of the severity of the CPU load on our routers, and because the upgrade itself should have resulted in zero customer impact. You have our sincerest apologies for the brief outage that occurred this evening.



#8 JFSG

    Newbie

  • Members
  • 12 posts
  • Gender:Male

Posted 09 January 2011 - 03:03 AM

Quoting MikeDVB: "The official Reason For Outage:"

Glad that Handy Networks have InterNAP FCP in their network now. :)
:)

#9 MikeDVB

Posted 09 January 2011 - 06:14 PM

Quoting JFSG: "Glad that Handy Networks have InterNAP FCP in their network now. :)"

As are we! I've seen latency improvements of 10 to 15 ms just between HN and our offices, so I can only imagine the level of improvement for other locations such as Europe.