Thursday, April 1st, 2010 by Simon Hackett
Internode engineers a lot of redundancy and resiliency into its network services. It also uses the best equipment and services it can obtain – because this is all a part of giving our customers the best possible service experience.
However, we don’t live in a perfect world, and sometimes things can go wrong that really exceed the capacity of any rational amount of forward planning to predict – and instead the fault has to be managed as it occurs.
Our high end ‘Internode direct’ ADSL2+ services (our ‘Extreme’, Naked Extreme, and some of our Internode Easy Broadband services) are deployed on our own equipment, installed in Telstra exchanges. They’ve been extremely fast and reliable. Like modern airlines, all the routine causes of breakdowns have largely been engineered away.
What is left, therefore (just like the airline situation), are the truly unusual failure modes.
Lately, in South Australia, we’ve suffered from such a truly unusual failure mode. This is a description of what has happened, and what (to this point) we’ve done about it.
The intention of writing this down is transparency – something we love. If you don’t feel like reading about how geeks make high performance, high availability networks run, then you should probably stop here!
Earlier this week, without warning, some of our Extreme ADSL2+ customers started losing access to the Internet. In a national sense, this peaked at less than 1% of our total customer base. But because it was a truly unusual failure mode, and also just because we care a hell of a lot about such things, we dropped everything to resolve it.
It has been, and it remains, a very unusual failure mode, starting with this interesting fact – when things went wrong, rebooting the applicable parts of our equipment did not fix those network elements. Once they caught a cold, they kept it.
But I’m starting in the middle. Let’s start at the start, with some explanations of what some of our ADSL2+ network equipment is, and how it connects your home or business ADSL2+ connection back to the Internode core network in your state. Here goes:
A key element of our ADSL2+ DSLAM deployments is a device that our DSLAM vendor, Ericsson, calls an ECN.
An ECN looks like a 24 port ethernet network switch, but it is much more than that. ECNs are boot servers, control nodes, statistical aggregation points, and a number of other things (including, of course, being the ethernet switches that they appear, at face value, to be).
They have multiple CPUs in them, doing multiple things. There is switch hardware, power control hardware, and a Linux machine, all inside a small rack-mounted box.
Each ECN has 24 ‘downstream’ ports. Each port attaches to a 12 port EDA DSLAM module using 100M powered ethernet, so there are 288 DSLAM ports (12 x 24) that get their power, management and data flow from their ECN. Each DSLAM module boots ‘over the cable’ from the ECN, acquiring its operating code and its configuration that way.
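As a back-of-the-envelope check, the fan-out described above works out like this (a sketch using only the figures in this post; not Internode or Ericsson code):

```python
# Fan-out for one ECN, from the figures above.
# Illustrative only -- the constants come from the text of this post.

DOWNSTREAM_PORTS_PER_ECN = 24   # powered 100M ethernet ports on one ECN
LINES_PER_EDA_MODULE = 12       # ADSL2+ lines per EDA DSLAM module

dslam_lines_per_ecn = DOWNSTREAM_PORTS_PER_ECN * LINES_PER_EDA_MODULE
print(dslam_lines_per_ecn)  # 288 customer lines behind a single ECN
```

So a fault in a single ECN is not a small event: every one of those 288 lines depends on it for power, boot images and data flow.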
Upstream of the ECN are two gigabit ethernet ports, which are used to construct a chain of ECNs in the exchange.
At each end of the chain of ECNs in the exchange, then, there are two gigabit ethernet ports that exit the exchange building on optical fibre cables.
At that exit, those fibre links are attached to inter-exchange fibre paths, and overall what we construct (in most cases) is a wide-area ring.
That ring typically connects several exchanges, to each other and also to (in most cases) two geographically distributed major Internode PoPs. The ring structure means there are two geographically distinct paths between any two points in that overall group of exchanges and the Internode PoPs that service the exchanges concerned.
Accordingly, through this ring architecture, if any single link or device in the ring is lost, the remaining devices in the entire chain concerned discover this and ‘self heal’ in short order (let’s call it a minute or so), sending all their data the other way around the ring instead.
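A toy model of that self-healing behaviour (hypothetical node names, and nothing like the actual Ericsson control logic, but it shows why a ring gives you a second path for free):

```python
# Toy model of ring self-healing: traffic between two nodes normally takes
# one arc of the ring; if a link on that arc fails, it takes the other arc.
# Node names are hypothetical -- not the real Ericsson control logic.

ring = ["PoP-A", "Exch-1", "Exch-2", "Exch-3", "PoP-B"]  # a closed ring

def arcs(src, dst):
    """Return the two candidate paths: clockwise and anticlockwise."""
    i, j = ring.index(src), ring.index(dst)
    n = len(ring)
    cw = [ring[(i + k) % n] for k in range((j - i) % n + 1)]
    acw = [ring[(i - k) % n] for k in range((i - j) % n + 1)]
    return cw, acw

def links(path):
    """The set of links a path traverses (undirected)."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def route(src, dst, failed_link=None):
    """Pick an arc that avoids the failed link, if any."""
    for path in arcs(src, dst):
        if failed_link is None or frozenset(failed_link) not in links(path):
            return path
    return None  # both arcs broken: only a double failure partitions the ring

print(route("Exch-1", "PoP-B"))                        # normal path
print(route("Exch-1", "PoP-B", ("Exch-2", "Exch-3")))  # self-healed path
```

The key property is in the last line: losing the Exch-2 to Exch-3 link simply sends Exch-1’s traffic the other way around, via PoP-A. It takes two simultaneous failures to actually cut anyone off.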
This sort of resiliency is at the heart of why these systems are generally just rock solid in practice.
Ok, so what happened? We obviously had a fault. Here’s what we know so far:
The fault we’ve suffered from has been traced to a software bug in one of the multiple software systems running in the ECNs (as I said, they’re complex devices).
This software release has been running with complete stability, nationally, in our ECNs for months. This week, out of the blue, it started failing in some ECNs in South Australia, with no initially obvious pattern.
The failure mode resulted in some downstream 12 port DSLAM modules spontaneously rebooting, and after boot, those modules started complaining of being handed invalid configuration data and commenced another reboot (and… so on). During reboot, the DSLAM modules maintain physical line sync to the customer, but obviously data flow stops.
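That crash loop can be sketched as a simple cycle (purely illustrative; the real EDA boot-over-the-cable protocol is Ericsson’s, and the config contents here are made up):

```python
# Sketch of the observed crash loop: a DSLAM module boots over the cable,
# is handed configuration by its ECN, rejects it as invalid, and reboots.
# Hypothetical -- not the real Ericsson boot protocol or config format.

def ecn_serve_config(ecn_state_corrupted):
    """The ECN hands out config; a corrupted ECN hands out garbage."""
    return None if ecn_state_corrupted else {"profile": "adsl2+", "vlan": 100}

def module_boot_cycle(ecn_state_corrupted, max_attempts=3):
    """Count boot attempts until the module stays up (capped at max_attempts)."""
    for attempt in range(1, max_attempts + 1):
        config = ecn_serve_config(ecn_state_corrupted)
        if config is not None:
            return attempt             # valid config: module stays up
        # invalid config: line sync is kept, but data stops, and we reboot
    return max_attempts

print(module_boot_cycle(ecn_state_corrupted=False))  # healthy ECN: up first try
print(module_boot_cycle(ecn_state_corrupted=True))   # corrupted ECN: loops forever
```

That is why the symptom looked so odd from the customer end: the modem showed sync the whole time, because line sync survives the module reboot, but no traffic flowed.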
And here’s the hard part:
Rebooting and/or power cycling the ECN controlling the DSLAM modules concerned did not resolve the fault!
Once an ECN catches this ‘cold’, turning it off and on again does not fix it!
The only thing that has been demonstrated to stop the fault is to upgrade the ECN software to a newer release.
It is important to appreciate that the current software release (the faulty one) was rolled out very carefully (as all upgrades normally are), and it ran flawlessly for around five months until this fault developed.
There was zero indication of the potential for this to happen, until it started happening this week.
The current theory, based on vendor feedback, is that the software fault is a latent, load triggered, memory leak that makes the ECN lose the plot in a manner that corrupts some critical item inside one of the ECN systems – an item that remains corrupted after it is rebooted or power cycled.
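Why a reboot doesn’t help can be illustrated with a toy sketch: if what gets corrupted lives in persistent storage rather than RAM, a power cycle clears the RAM and then faithfully reloads the corruption (purely illustrative; what the actual corrupted item inside the ECN is has not been established):

```python
# Toy illustration of why power cycling doesn't clear this class of fault:
# state corrupted *in persistent storage* survives a reboot intact.
# Purely illustrative -- the real corrupted item inside the ECN is unknown.

class ToyECN:
    def __init__(self):
        self.flash = {"config": "valid"}   # persistent storage
        self.ram = {}                      # volatile state, lost on reboot

    def boot(self):
        self.ram = dict(self.flash)        # every boot reloads from flash

    def hit_by_bug(self):
        self.flash["config"] = "corrupt"   # the bug damages *persisted* state

    def healthy(self):
        return self.ram.get("config") == "valid"

ecn = ToyECN()
ecn.boot()
ecn.hit_by_bug()
ecn.boot()                  # power cycle: RAM is rebuilt from flash...
print(ecn.healthy())        # False -- the corruption came straight back

ecn.flash["config"] = "valid"   # only rewriting the persisted state fixes it
ecn.boot()
print(ecn.healthy())        # True
```

In this toy, the software upgrade plays the role of that last step: it replaces the damaged persisted item, which is why upgrading works where rebooting does not.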
Our vendor has subsequently identified a bug that is consistent with our observed outcome, and on their advice, we upgraded to the next revision of the code concerned (in which that specific bug had been fixed).
So far, upgrading an ECN to that newer software release has resolved the issue for all customers attached to the affected ECN.
Last night we saw a few more switches (not yet upgraded) start to demonstrate the fault. In response we proactively upgraded another significant proportion of the switches in the entire SA network.
Based on vendor advice, we are now looking at how best to upgrade the rest of the network, to guard against a further recurrence elsewhere.
This is not a trivial decision, because we are upgrading to a new software release – which brings in the potential for there to be some other, as yet unknown, issue for our service. Obviously we don’t want to trade one problem for a different one, if we can help it. On the other hand, once the fault manifests, we can’t leave the ECN simply broken.
As I write this, the network is back to operating in a normal and stable manner.
We do, however, still have nodes in the network running the ‘vulnerable’ software in their ECNs. These nodes have worked flawlessly for five months, so they might work flawlessly for another five months. Or not.
Should further faults develop, we will of course remedy them immediately, as we have done for the ECNs already affected.
Once we and our vendor have reached consensus that the new software release has no other significant new faults lurking within its bits and bytes, we’ll plan to proactively upgrade the rest of the ECNs in the network to completely resolve this overall issue.
In summary: Some weeks are more complicated than others. This has been a complicated week!