AT&T Crash Statement: The Official Report

Path: gmdzi!unido!mcsun!uunet!cs.utexas.edu!mailrus!accuvax.nwu.edu!nucsrl!telecom-request
From: d...@teletech.uucp (Don H Kemp)
Newsgroups: comp.dcom.telecom
Subject: AT&T Crash Statement: The Official Report
Message-ID: <3278@accuvax.nwu.edu>
Date: 28 Jan 90 17:24:48 GMT
Sender: n...@accuvax.nwu.edu
Organization: TELECOM Digest
Lines: 117
Approved: Tele...@eecs.nwu.edu
Posted: Sun Jan 28 18:24:48 1990
X-Submissions-To: tele...@eecs.nwu.edu
X-Administrivia-To: telecom-requ...@eecs.nwu.edu
X-Telecom-Digest: Volume 10, Issue 59, message 1 of 4


Here's AT&T's _official_ report on the Martin Luther King Day network
problems, courtesy of the AT&T Consultant Liaison Program.


Don

=========================================================

Technical background on AT&T's network slowdown,
January 15, 1990

                             *  *  *

At approximately 2:30 p.m. EST on Monday, January 15, one of AT&T's
4ESS toll switching systems in New York City experienced a minor
hardware problem which activated normal fault recovery routines within
the switch.  This required the switch to briefly suspend new call
processing until it completed its fault recovery action -- a
four-to-six second procedure.  Such a suspension is a typical
maintenance procedure, and is normally invisible to the calling
public.

As part of our network management procedures, messages were
automatically sent to connecting 4ESS switches requesting that no new
calls be sent to this New York switch during this routine recovery
interval.  The switches receiving this message made a notation in
their programs to show that the New York switch was temporarily out of
service.

When the New York switch in question was ready to resume call
processing a few seconds later, it sent out call attempts (known as
IAMs -- Initial Address Messages) to its connecting switches.  When
these switches started seeing call attempts from New York, they
started making adjustments to their programs to recognize that New
York was once again up-and-running, and therefore able to receive new
calls.
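
The bookkeeping described in the two paragraphs above can be
illustrated with a minimal Python sketch.  This is not AT&T's 4ESS
software; the class and method names are hypothetical, and the only
behavior modeled is what the statement describes: a neighbor is marked
out of service on request, and marked back in service when a call
attempt (IAM) arrives from it.

```python
from enum import Enum


class SwitchStatus(Enum):
    IN_SERVICE = "in service"
    OUT_OF_SERVICE = "out of service"


class StatusMap:
    """Hypothetical per-switch record of which neighbors accept new calls."""

    def __init__(self, neighbors):
        # Assumption: every connecting switch starts out in service.
        self._status = {name: SwitchStatus.IN_SERVICE for name in neighbors}

    def on_out_of_service_notice(self, switch):
        # The neighbor asked that no new calls be sent during its recovery.
        self._status[switch] = SwitchStatus.OUT_OF_SERVICE

    def on_iam(self, switch):
        # A call attempt from the neighbor implies it is up and running again.
        self._status[switch] = SwitchStatus.IN_SERVICE

    def can_route_to(self, switch):
        return self._status[switch] is SwitchStatus.IN_SERVICE
```

For example, after on_out_of_service_notice("NYC") the map reports
that calls should not be routed to "NYC" until on_iam("NYC") is seen.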

A processor in the 4ESS switch which links that switch to the CCS7
network holds the status information mentioned above.  When this
processor (called a Direct Link Node, or DLN) in a connecting switch
received the first call attempt (IAM) from the previously
out-of-service New York switch, it initiated a process to update its
status map.  As the result of a software flaw, this DLN processor was
left vulnerable to disruption for several seconds.  During this
vulnerable time, the receipt of two call attempts from the New York
switch -- within an interval of 1/100th of a second -- caused some
data to become damaged.  The DLN processor was then taken out of
service to be reinitialized.
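
The timing condition in the paragraph above can be sketched as a small
Python model.  It is only an illustration of the described symptom, not
the actual flaw in the DLN software, which the statement does not
detail.  The names are hypothetical, the 4-second window is an assumed
stand-in for "several seconds", and the sketch assumes the IAM that
starts the status update may itself be one of the closely spaced pair.

```python
VULNERABLE_SECONDS = 4.0      # assumed value for "several seconds"
CLOSE_SPACING_SECONDS = 0.01  # 1/100th of a second, from the statement


class DlnProcessor:
    """Hypothetical model of one DLN processor's vulnerable window."""

    def __init__(self):
        self.in_service = True
        self.update_started_at = None  # time the status-map update began
        self.last_iam_at = None        # arrival time of the previous IAM

    def receive_iam(self, now):
        """Process an IAM arriving at time `now` (seconds).

        Returns True if the processor is still in service afterwards.
        """
        if not self.in_service:
            return False
        if self.update_started_at is None:
            # First IAM from the recovered switch: start the status update.
            self.update_started_at = now
        else:
            in_window = now - self.update_started_at <= VULNERABLE_SECONDS
            close_pair = now - self.last_iam_at <= CLOSE_SPACING_SECONDS
            if in_window and close_pair:
                # Data damaged: processor is removed to be reinitialized.
                self.in_service = False
                return False
        self.last_iam_at = now
        return True
```

In this model, IAMs at 0.0, 1.0 and 1.005 seconds take the processor
out of service, while the same closely spaced pair arriving after the
window has passed does not.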

Since the DLN processor is duplicated, its mate took over the traffic
load.  However, a second couplet of closely spaced new call messages
from the New York 4ESS switch hit the mate processor during the
vulnerable period, causing it to be removed from service and
temporarily isolating the switch from the CCS7 signaling network.  The
effect cascaded through the network as DLN processors in other
switches similarly went out of service.  The unstable condition
continued because of the random nature of the failures and the
constant pressure of the traffic load in the network providing the
call-message triggers.
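
The failover behavior described above (a duplicated processor whose
mate takes over, and a switch that is isolated once both are out) can
be sketched as follows.  This is a simplified, hypothetical model in
Python; it does not reproduce the network-wide cascade, only the
per-switch logic the statement describes.

```python
class DuplicatedDln:
    """Hypothetical model of a duplicated DLN: one active unit, one mate."""

    def __init__(self):
        self.units_in_service = [True, True]
        self.active = 0  # index of the unit currently carrying traffic

    @property
    def isolated(self):
        # With both units out, the switch cannot reach the CCS7 network.
        return not any(self.units_in_service)

    def fail_active(self):
        """Remove the active unit from service; the mate takes over if it can."""
        if self.isolated:
            return
        self.units_in_service[self.active] = False
        mate = 1 - self.active
        if self.units_in_service[mate]:
            self.active = mate

    def reinitialize(self, unit):
        """Return a unit to service after reinitialization."""
        self.units_in_service[unit] = True
        if not self.units_in_service[self.active]:
            self.active = unit
```

One failure leaves the switch connected through the mate; a second
failure before the first unit has been reinitialized isolates it.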

The software flaw was inadvertently introduced into all the 4ESS
switches in the AT&T network as part of a mid-December software
update.  This update was intended to significantly improve the
network's performance by making it possible for switching systems to
access a backup signaling network more quickly in case of problems
with the main CCS7 signaling network.  While the software had been
rigorously tested in laboratory environments before it was introduced,
the unique combination of events that led to this problem couldn't be
predicted.

To troubleshoot the problem, AT&T engineers first tried an array of
standard procedures to reestablish the integrity of the signaling
network.  In the past, these have been more than adequate to regain
call processing.  In this case, they proved inadequate.  So we knew
very early on we had a problem we'd never seen before.

At the same time, we were looking at the pattern of error messages and
trying to understand what they were telling us about this condition.
We have a technical support facility that deals with network problems,
and they became involved immediately.  Bell Labs people in Illinois,
Ohio and New Jersey joined in moments later.  Since we didn't
understand the mechanism we were dealing with, we had to infer what
was happening by looking at the signaling messages that were being
passed, as well as looking at individual switches.  We were able to
stabilize the network by temporarily suspending signaling traffic on
our backup links, which helped cut the load of messages to the
affected DLN processors.  At 11:30 p.m. EST on Monday, we had the last
link in the network cleared.

On Tuesday, we took the faulty program update out of the switches and
temporarily switched back to the previous program.  We then started
examining the faulty program with a fine-toothed comb, found the
suspicious software, took it into the laboratory, and were able to
reproduce the problem.  We have since corrected the flaw, tested the
change and restored the backup signaling links.

We believe the software design, development and testing processes we
use are based on solid, quality foundations.  All future releases of
software will continue to be rigorously tested.  We will use the
experience we've gained through this problem to further improve our
procedures.

It is important to note that Monday's calling volume was not unusual;
in fact, it was less than a normal Monday, and the network handled
normal loads on previous weekdays.  Although nothing can be guaranteed
100% of the time, what happened Monday was a series of events that had
never occurred before.  With ongoing improvements to our design and
delivery processes, we will continue to drive the probability of this
type of incident occurring towards zero.

                              # # #

Don H Kemp                      "Always listen to experts.  They'll
B B & K Associates, Inc.         tell you what can't be done, and
Rutland, VT                      why.  Then do it."
uunet!uvm-gen!teletech!dhk                          Lazarus Long