MDDHosting Forums

All Down!


SarisIsop

I suspect Mike has been in "panic mode" for the last hour or so. My monitors indicated an outage of roughly 13 minutes (possibly as little as 5 minutes, given the monitoring frequency). The problem could also have been between my ISP and the data center, but then you noted the same issue.

 

It takes a while for the servers to come back up to full speed after an outage.
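
The gap between what a monitor reports and how long an outage really lasted comes from polling: a monitor only sees the state at each check. A minimal sketch of that reasoning, assuming a fixed check interval. The function name and the 5-minute interval are illustrative assumptions, not details of the monitoring setup mentioned above.

```python
def outage_bounds(interval_min, failed_checks):
    """Bound the real outage length from polling results.

    Assumes checks run at a fixed interval and that `failed_checks`
    consecutive checks failed between two successful ones.
    Returns (lower, upper) in minutes.
    """
    if interval_min <= 0 or failed_checks < 1:
        raise ValueError("need a positive interval and at least one failed check")
    # Shortest case: outage starts just before the first failed check
    # and ends just after the last failed check.
    lower = (failed_checks - 1) * interval_min
    # Longest case: outage starts just after the last good check
    # and ends just before the next good check.
    upper = (failed_checks + 1) * interval_min
    return lower, upper


if __name__ == "__main__":
    # Hypothetical values: checks every 5 minutes, 2 failed in a row.
    low, high = outage_bounds(interval_min=5, failed_checks=2)
    print(f"Actual outage: between {low} and {high} minutes")
```

The slower the check interval, the wider this range gets, which is why two monitors can report very different durations for the same event.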


After checking with the networking team, it seems that a distribution switch in our cabinet failed. Thankfully we have a secondary switch that takes over in this event, but the changeover can take 2 to 4 minutes, and intermittent connectivity is possible for 5 to 15 minutes afterwards. The total downtime registered by our monitoring is about 2 minutes, but it's possible that individual routes to individual IPs took longer to update.

 

The switches are identical, so we will be swapping the failed switch for a replacement and then setting the replacement up as the backup (so that there is no further downtime from switching traffic back to a new switch).


Here is a layman's example:

1. We have a neighborhood (our network of servers).

2. This neighborhood has two major roads that allow outsiders to get into the neighborhood, and people in the neighborhood to get out. (Traffic)

3. Only one road is used at a time; the other road is there only in case there is an issue with the primary road.

 

The primary road that runs to the neighborhood became inaccessible, and it took time (around 2 minutes) for the traffic on that road to re-route to the other major road into the neighborhood. Even after this change, some cars (web requests) were still trying the old main road before they realized it wasn't working and had to switch to the backup road.

 

The technical view:

Our network has dual redundant uplinks and dual redundant network distribution switches. The reason for having two of everything (redundancy) is so that if one component fails we aren't offline until somebody manually steps in and swaps a piece of hardware or, worse, down until we get replacement hardware. One of the distribution switches (the primary) went offline, and it took a couple of minutes for traffic to re-route over the backup distribution switch.
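
The primary/backup arrangement described above can be modeled roughly as follows. This is only a sketch of the selection logic; in a real network the failover is handled by the switches and routing protocols and takes time to converge, as noted. The `Switch` class, the `active_switch` function, and the switch names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    healthy: bool = True


def active_switch(primary, backup):
    """Pick the switch that should carry traffic.

    Returns the primary while it is healthy, the backup if the primary
    has failed, and None if both are down (a full outage).
    """
    if primary.healthy:
        return primary
    if backup.healthy:
        return backup
    return None


if __name__ == "__main__":
    primary = Switch("dist-a")  # hypothetical names
    backup = Switch("dist-b")
    print(active_switch(primary, backup).name)  # traffic on the primary

    primary.healthy = False  # the primary fails
    print(active_switch(primary, backup).name)  # traffic moves to the backup
```

With only one switch, the second case would return nothing to fall back on, which is the situation redundancy is meant to avoid.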

 

One thing to keep in mind is that both switches are identical, the uplinks are identical, etc. There is no reason for one switch or path to be faster than another. If things seem slower to you, it's likely just your perception, since you're paying attention. A good layman's example: your car may make a certain noise that you never notice, but after you do something to the car (like replacing a part) you suddenly hear it, because you're now listening for differences where you were not before.

