Network connectivity issues. 08/12/2019

Tim · August 12, 2019

Hello,

Our data center facility is experiencing networking issues. This is out of our direct control, as all of our equipment is up. The facility is aware of the issue and is actively working to resolve it, and once they provide details about the outage we will let everyone know. The outage also affected our support system and forums, so I apologize for any delay in this reply.

Once I have more details and the RFO (Reason For Outage) I will post them here.

Tim · August 12, 2019

As of the time of this post I am able to access our services and your websites again.

SarisIsop · August 12, 2019

My sites are back on-line.

Thank you.

Michael D. · August 12, 2019

It is my current understanding that this issue was due to an unusual hardware failure in a core piece of networking equipment at the facility. This piece of hardware failed in such a way that it wasn't servicing requests but wasn't 'offline' - sort of like an operating system crash/panic.

This piece of equipment is redundant: there is another identical piece of hardware doing the same job that should pick up the slack. At this point I do not know why redundancy didn't prevent the outage. It could be due to the nature of the failure, in that the gear stayed online but wasn't actually working, but that's speculation on my part.

As it stands, everything is back online, but we have lost the redundancy of this core piece of hardware until the issue is fully resolved. It is suspected that this is a bug in the operating system running on the core networking equipment, and the facility is working with Juniper Emergency Support both to investigate the cause and to ensure it doesn't happen again.

Here is a snippet of the kernel/operating system log from the failed piece of networking hardware:

Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498 ecx: 000000ff edx: 20dba494 ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c ebp: af97de98 esi: 21054b98 edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000 eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033 ss: 003b ds: 003b es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b trapno: 0000000c err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4
Aug 12 17:29:05  dist3.denver2 /kernel: Trapframe Register Dump:
Aug 12 17:29:05  dist3.denver2 /kernel: eax: 20dba498 ecx: 000000ff edx: 20dba494 ebx: 20dba468
Aug 12 17:29:05  dist3.denver2 /kernel: esp: af97de6c ebp: af97de98 esi: 21054b98 edi: 00000000
Aug 12 17:29:05  dist3.denver2 /kernel: eip: 00000000 eflags: 00010202
Aug 12 17:29:05  dist3.denver2 /kernel: cs: 0033 ss: 003b ds: 003b es: 003b
Aug 12 17:29:05  dist3.denver2 /kernel: fs: b0b5003b trapno: 0000000c err: 00000004
Aug 12 17:29:05  dist3.denver2 /kernel: PC address 0x0 is inaccessible, PDE = 0x0, ****** = 0x0
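
For anyone who wants to read the log lines above more closely, here is a minimal Python sketch that pulls the fields out of a BAD_PAGE_FAULT line and decodes the low three bits of the x86 fault flags. The regular expression is only an assumption based on the two lines in this snippet, not an official Juniper log format, and the function names (parse_fault, decode_fault_flags) are made up for illustration. The flag meanings are the standard x86 page-fault error-code bits.

```python
import re

# Pattern inferred from the BAD_PAGE_FAULT lines in this thread only;
# it is an assumption, not a documented log format.
FAULT_RE = re.compile(
    r"BAD_PAGE_FAULT: pid (?P<pid>\d+) \((?P<proc>[^)]+)\), uid (?P<uid>\d+): "
    r"pc (?P<pc>0x[0-9a-fA-F]+) got a (?P<kind>\w+) fault at "
    r"(?P<addr>0x[0-9a-fA-F]+), x86 fault flags = (?P<flags>0x[0-9a-fA-F]+)"
)


def decode_fault_flags(flags: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(flags & 0x1),  # False: the page was not mapped
        "write_access": bool(flags & 0x2),  # False: the access was a read
        "user_mode": bool(flags & 0x4),     # True: fault raised in user mode
    }


def parse_fault(line: str):
    """Return the parsed fields of a BAD_PAGE_FAULT line, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "process": m.group("proc"),
        "uid": int(m.group("uid")),
        "pc": int(m.group("pc"), 16),
        "fault_address": int(m.group("addr"), 16),
        "access": m.group("kind"),
        **decode_fault_flags(int(m.group("flags"), 16)),
    }


line = (
    "Aug 12 17:29:05  dist3.denver2 /kernel: BAD_PAGE_FAULT: pid 1972 (fxpc), "
    "uid 0: pc 0x0 got a read fault at 0x0, x86 fault flags = 0x4"
)
info = parse_fault(line)
print(info)
```

Read this way, the line says that process fxpc (pid 1972, running as uid 0) faulted in user mode on a read of an unmapped page, with both the program counter and the faulting address at 0x0. That pattern would be consistent with a jump through a null pointer, but that is my reading of the log and not a confirmed diagnosis.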

Once the Reason For Outage [RFO] is available from our facility we will make it available.

Michael D. · August 12, 2019

In speaking with the senior network engineer at Handy Networks, our upstream facility, I learned that this issue affected both redundant pieces of hardware responsible for routing traffic. The primary crashed, and then the secondary took over and subsequently crashed as well. This cycle kept repeating while they worked to determine the cause, which explains why things would show as online for a minute or two and then go back down.

The mode of failure is definitely unusual and I still personally believe it to be a bug in the Juniper OS.

Juniper and Handy Networks are still working to trace the cause, and I expect to have an RFO within 72 hours.

Michael D. · August 13, 2019

Our upstream facility has scheduled a maintenance window tonight from 11 PM to 4 AM Eastern Time.

They expect we may see a couple of instances of downtime of up to 15 minutes, but they will strive to keep any downtime to a minimum.

For full details you can read their status at https://helpdesk.handynetworks.com/supportsuite/index.php?/News/NewsItem/View/276

Michael D. · August 15, 2019

I was waiting on the RFO before updating this, but I haven't seen one yet, so I wanted to at least post that the maintenance on the 13th went well and that we are fully redundant once again.

Juniper is still investigating the cause, but according to my conversations with the networking department at our upstream provider, a filter has been put in place that should prevent the issue from recurring.

Once I have the RFO I will make it available.

Tim · August 20, 2019

Hello all,

Here are further details about the issue from our data center provider.

https://helpdesk.handynetworks.com/supportsuite/index.php?/News/NewsItem/View/275/network-status
