Internode Blog

The anatomy of a complex fault

Thursday, April 1st, 2010 by

Ericsson ECN switch and EDA DSLAM modules

Internode engineers a lot of redundancy and resiliency into its network services. It also uses the best equipment and services it can obtain – because this is all a part of giving our customers the best possible service experience.

However, we don’t live in a perfect world, and sometimes things go wrong in ways that no reasonable amount of forward planning could predict – and instead the fault has to be managed as it occurs.

Our high-end ‘Internode direct’ ADSL2+ services (our ‘Extreme’, NakedExtreme, and some of our Internode Easy Broadband services) are deployed on our own equipment, installed in Telstra exchanges. That equipment has been extremely fast and reliable. As with modern airlines, the routine causes of breakdowns have largely been engineered away.

What is left, therefore (just like the airline situation), are the truly unusual failure modes.

Lately, in South Australia, we’ve suffered from such a truly unusual failure mode. This is a description of what has happened, and what (to this point) we’ve done about it.

The intention of writing this down is transparency – something we love. If you don’t feel like reading about how geeks make high-performance, high-availability networks run, then you should probably stop here!
