Date: Fri, 26 Jan 90 14:24:30 PST 
From: "Peter G. Neumann" <neum...@csl.sri.com>
Subject: Cause of AT&T network failure

From Telephony, Jan 22, 1990, p. 11:

    "The fault was in the code" of the new software that AT&T loaded
    into front-end processors of all 114 of its 4ESS switching systems
    in mid-December, said Larry Seese, AT&T's director of technology
    development. In detail:

    The problem began the afternoon of Jan 15 when a piece of trunk
    interface equipment developed internal problems for reasons that
    have yet to be determined. The equipment told the 4ESS switch
    in New York that it was having problems and couldn't correct the
    fault. "The recovery code is written so that the processor will run
    corrective initialization on the equipment. That takes four to six
    seconds. At the same time, new calls are stopped from coming into the
    switch." Seese said.

    The New York switch sent a message to all the other 4ESS switches
    it is linked with that it was not accepting additional traffic. 
    Seese referred to that message as a "congestion signal."  After
    it successfully completed the reinitialization, the New York
    switch went back in service and began processing calls.
    That is when the fault in the new software reared its ugly head. 
    Under the previous system, switch A would send out a message that 
    it was working again, and switch B would double-check that switch 
    A was back in service. With the new software, switch A begins  
    processing calls and sends out call routing signals. The reappearance
    of traffic from switch A is supposed to tell switch B that A is 
    working again.

    "We made an improvement in the way we react to those messages so 
    we can react more quickly. The first common channel signaling system 
    7 initial address message (caused by a call attempt) that switch B 
    receives from switch A alerts B that A is back in service. Switch B 
    then resets its internal logic to indicate that A is back in service," 
    said Seese.

    The problem occurred when switch B got a second call-attempt message 
    from A while it was in the process of resetting its internal logic. 
    "[The message] confused the software. it tried to execute an instruction
    that didn't make any sense. The software told switch B `My CCS7 processor
    is insane'", so switch B shut itself down to avoid spreading the problem,
    Seese explained.
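
    The failure mode, as described, is a race on a non-atomic status
    update.  A minimal sketch in C (assuming a three-state peer record
    and a re-entrant message handler; all names are hypothetical, since
    the real 4ESS code is proprietary):

        /* Sketch of the race described above: the per-peer reset is
         * not atomic, and a second IAM arriving mid-reset lands the
         * handler in a state it was never written to handle.        */
        #include <stdio.h>
        #include <stdlib.h>

        enum peer_status { PEER_UP, PEER_DOWN, PEER_RESETTING };

        struct peer {
            const char      *name;
            enum peer_status status;
        };

        static void ccs7_insane(struct peer *p)
        {
            /* Defensive shutdown rather than running on bad state. */
            printf("%s: `My CCS7 processor is insane' -- going down\n",
                   p->name);
            exit(1);
        }

        static void on_iam(struct peer *p, int second_iam_pending)
        {
            switch (p->status) {
            case PEER_DOWN:              /* first IAM: peer is back  */
                p->status = PEER_RESETTING;
                if (second_iam_pending)  /* second IAM preempts the  */
                    on_iam(p, 0);        /* reset still in progress  */
                p->status = PEER_UP;     /* reset completes          */
                break;
            case PEER_UP:                /* normal call processing   */
                break;
            case PEER_RESETTING:         /* "an instruction that     */
            default:                     /*  didn't make any sense"  */
                ccs7_insane(p);
            }
        }

        int main(void)
        {
            struct peer a = { "record for switch A", PEER_DOWN };
            on_iam(&a, 1);   /* two IAMs inside the reset window */
            return 0;
        }

    With the messages farther apart (second_iam_pending == 0), the same
    code runs the normal path and nothing goes wrong -- which is exactly
    the timing dependence Seese describes below.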

    Unfortunately, switch B then sent a message to other switches that it 
    was out of service and wasn't accepting additional traffic. Once switch
    B reset itself and began operating again, it sent out call processing
    messages via the CCS7 link. That caused identical failures around the
    nation as other 4ESS switches got second messages from switch B while 
    they were in the process of resetting their internal logic to indicate
    switch B was working again.

    "It was a chain reaction. Any switch that was connected to B was put 
    into the same condition."

    "The event just repeated itself in every [4ESS] switch over and over 
    again. If the switches hadn't gotten a second message while resetting,
    there would have been no problem. If the messages had been received 
    farther apart, it would not have triggered the problem."

    AT&T solved the problem by reducing the messaging load of the CCS7 
    network. That allowed the switches to reset themselves and the network 
    to stabilize.
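
    A toy simulation of that chain reaction and of the fix (all numbers
    invented; a sketch of the dynamics, not a model of the real network):

        /* Toy cascade model: a switch that receives two messages
         * closer together than its reset window goes down, recovers,
         * and in turn sends two closely spaced messages onward.
         * Stretching the gap past the window -- in effect what the
         * CCS7 load reduction did -- lets the cascade die out.      */
        #include <stdio.h>

        #define NSWITCH 5    /* hypothetical chain of 4ESS switches */

        static void run(int msg_gap_ms, int reset_window_ms)
        {
            int failures = 0;

            for (int hop = 0; hop < 2 * NSWITCH; hop++) {
                if (msg_gap_ms >= reset_window_ms)
                    break;       /* messages spaced out: no failure */
                failures++;      /* same fault, next switch in line */
            }
            printf("gap %2d ms, window %2d ms: %2d failures (%s)\n",
                   msg_gap_ms, reset_window_ms, failures,
                   failures ? "chain reaction" : "stable");
        }

        int main(void)
        {
            run(10, 50);   /* heavy load: gaps inside the window   */
            run(80, 50);   /* reduced load: gaps outside, settles  */
            return 0;
        }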

From: s...@cs.purdue.edu (Gene Spafford)
Subject: Re: AT&T (RISKS-9.62)
Date: 27 Jan 90 17:52:42 GMT

In article <CMM.0.88.633398480.ri...@hercules.csl.sri.com> ri...@csl.sri.com writes:
>From Telephony, Jan 22, 1990 p11:
>
>    The problem began the afternoon of Jan 15 when a piece of trunk
>    interface equipment developed internal problems for reasons that
>    have yet to be determined.

An interesting twist to this: several members of the media have gotten phone
calls from a rogue hacker claiming that he and a few friends had broken into
the NYC switch and were "looking around" at the time of the incident.

This raises two interesting (at least to me) possibilities: 

  1) They had, indeed, broken in, and were responsible for the crash.
     (Don't blindly accept published statements from AT&T that it
     was all a simple glitch.  Stories told off-the-record by law
     enforcement personnel and telco security indicate this kind of
     break-in is common.)

     If this is true, what to do from here?  Obviously, this raises
     some major security questions about how best to protect our phone
     systems.  It also raises some interesting social/legal questions.
     The nationwide losses here are probably greater than those from
     the Internet Worm, but the federal Computer Fraud and Abuse Act
     doesn't cover it (only one system was tampered with).  Other laws
     may cover it, but is there any hope of proving it and prosecuting?

  2) These guys were not on the machine but are trying to get the
     press to publish their names as the ones responsible.  This would
     greatly enhance their image in the cracker/phreaker community.
     It's akin to having the Islamic Jihad call up and claim that a
     suicide caller had crashed the system (to protest dial-a-prayer
     and dial-a-porn, perhaps; remember that the Great Satan is a
     local call from NYC :-).   It raises interesting questions about how
     the press should handle such claims, and how we should react to them.

A third possibility exists, of course, that those guys had hacked into the
switch, but they had nothing to do with the failure.  That raises both sets of
questions.

I worry that it won't be long before this kind of thing happens and the phone
calls ARE from some terrorist group claiming responsibility: "We are holding
your dial tone hostage until you get your troops out of Panama, make abortion
illegal, stop killing animals for fur, and prevent Peter Neumann from making
more puns."

Or, perhaps AT&T security gets a call like: "We've planted a logic bomb in the
switching code.  Put $1 million in small unmarked bills in the following locker
at the bus station, or in 4 hours every call made in Boston will get routed to
dial-a-porn numbers in NYC.  We'll tell you how to fix it as soon as we get the
money."

Any bets that something like this will happen this year?  Last year's WANK worm
and politically-motivated viruses seem to suggest the time is ripe.

Gene Spafford, NSF/Purdue/U of Florida  Software Engineering Research Center,
Dept. of Computer Sciences, Purdue University, W. Lafayette IN 47907-2004
                         uucp	...!{decwrl,gatech,ucbvax}!purdue!spaf

    [By the way, AT&T is certain it was an open&shut (a no-pun&shut?) case of 
    a hardware-triggered software flaw, reproducible in the testbed ...  PGN]