Date: Fri, 26 Jan 90 14:24:30 PST
From: "Peter G. Neumann" <neum...@csl.sri.com>
Subject: Cause of AT&T network failure

From Telephony, Jan 22, 1990 p11:

"The fault was in the code" of the new software that AT&T loaded into front-end processors of all 114 of its 4ESS switching systems in mid-December, said Larry Seese, AT&T's director of technology development. In detail:

The problem began the afternoon of Jan 15 when a piece of trunk interface equipment developed internal problems for reasons that have yet to be determined. The equipment told the 4ESS switch in New York that it was having problems and couldn't correct the fault. "The recovery code is written so that the processor will run corrective initialization on the equipment. That takes four to six seconds. At the same time, new calls are stopped from coming into the switch," Seese said. The New York switch sent a message to all the other 4ESS switches it is linked with that it was not accepting additional traffic. Seese referred to that message as a "congestion signal." After the switch successfully completed the reinitialization, the New York switch went back in service and began processing calls.

That is when the fault in the new software reared its ugly head. Under the previous system, switch A would send out a message that it was working again, and switch B would double-check that switch A was back in service. With the new software, switch A begins processing calls and sends out call routing signals. The reappearance of traffic from switch A is supposed to tell switch B that A is working again.

"We made an improvement in the way we react to those messages so we can react more quickly. The first common channel signaling system 7 initial address message (caused by a call attempt) that switch B receives from switch A alerts B that A is back in service. Switch B then resets its internal logic to indicate that A is back in service," said Seese.
The problem occurred when switch B got a second call-attempt message from A while it was in the process of resetting its internal logic. "[The message] confused the software. It tried to execute an instruction that didn't make any sense. The software told switch B `My CCS7 processor is insane'", so switch B shut itself down to avoid spreading the problem, Seese explained.

Unfortunately, switch B then sent a message to other switches that it was out of service and wasn't accepting additional traffic. Once switch B reset itself and began operating again, it sent out call processing messages via the CCS7 link. That caused identical failures around the nation as other 4ESS switches got second messages from switch B while they were in the process of resetting their internal logic to indicate switch B was working again.

"It was a chain reaction. Any switch that was connected to B was put into the same condition."

"The event just repeated itself in every [4ESS] switch over and over again. If the switches hadn't gotten a second message while resetting, there would have been no problem. If the messages had been received farther apart, it would not have triggered the problem."

AT&T solved the problem by reducing the messaging load of the CCS7 network. That allowed the switches to reset themselves and the network to stabilize.
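
The timing dependence Seese describes can be illustrated with a toy model. The sketch below is not AT&T's code and says nothing about how the 4ESS software is actually written. The function name handle_messages, the tick-based clock, and the reset_ticks value are all assumptions made purely for illustration. It models switch B's view of one peer, switch A, that has just come back from an outage: the first call-attempt message starts B's status reset, and a second message arriving before that reset finishes is treated as the fault.

```python
def handle_messages(arrival_ticks, reset_ticks=4):
    """Toy model of switch B receiving call attempts from a recovering switch A.

    arrival_ticks: times (in arbitrary ticks) at which messages from A arrive.
    reset_ticks:   hypothetical time B needs to finish marking A as in service.
    Returns "fault" if a message arrives while B is mid-reset, else "ok".
    """
    peer_in_service = False   # B currently believes A is out of service
    reset_done_at = None      # tick at which an in-progress reset completes

    for now in sorted(arrival_ticks):
        # Finish any reset whose window has already elapsed.
        if reset_done_at is not None and now >= reset_done_at:
            peer_in_service = True
            reset_done_at = None

        if peer_in_service:
            continue          # ordinary traffic from a peer known to be up

        if reset_done_at is None:
            # First message after the outage: begin updating A's status.
            reset_done_at = now + reset_ticks
        else:
            # Second message while the reset is still in progress.
            return "fault"

    return "ok"


if __name__ == "__main__":
    print(handle_messages([0, 1]))    # second message inside the reset window
    print(handle_messages([0, 10]))   # messages spaced farther apart
```

In this model, closely spaced arrivals return "fault" and widely spaced arrivals return "ok", which matches Seese's remark that messages received farther apart would not have triggered the problem, and is consistent with the network stabilizing once the CCS7 messaging load was reduced.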
From: s...@cs.purdue.edu (Gene Spafford)
Subject: Re: AT&T (RISKS-9.62)
Date: 27 Jan 90 17:52:42 GMT

In article <CMM.0.88.633398480.ri...@hercules.csl.sri.com> ri...@csl.sri.com writes:
>From Telephony, Jan 22, 1990 p11:
>
>  The problem began the afternoon of Jan 15 when a piece of trunk
>  interface equipment developed internal problems for reasons that
>  have yet to be determined.

An interesting twist to this: several members of the media have gotten phone calls from a rogue hacker claiming that he and a few friends had broken into the NYC switch and were "looking around" at the time of the incident.

This raises two interesting (at least to me) possibilities:

1) They had, indeed, broken in, and were responsible for the crash. (Don't blindly accept published statements from AT&T that it was all a simple glitch. Stories told off-the-record by law enforcement personnel and telco security indicate this kind of break-in is common.) If this is true, what to do from here? Obviously, this raises some major security questions about how best to protect our phone systems. It also raises some interesting social/legal questions. The nationwide losses here are probably greater than those from the Internet Worm, but the federal Computer Fraud and Abuse Act doesn't cover it (only one system tampered with). Other laws may cover it, but is there any hope of proving it and prosecuting?

2) These guys were not on the machine but are trying to get the press to publish their names as the ones responsible. This would greatly enhance their image in the cracker/phreaker community. It's akin to having the Islamic Jihad call up and claim that a suicide caller had crashed the system (to protest dial-a-prayer and dial-a-porn, perhaps; remember that the Great Satan is a local call from NYC :-). It raises interesting questions about how the press should handle such claims, and how we should react to them.
A third possibility exists, of course, that those guys had hacked into the switch, but they had nothing to do with the failure. That raises both sets of questions.

I worry that it won't be long before this kind of thing happens and the phone calls ARE from some terrorist group claiming responsibility: "We are holding your dial tone hostage until you get your troops out of Panama, make abortion illegal, stop killing animals for fur, and prevent Peter Neumann from making more puns."

Or, perhaps AT&T security gets a call like: "We've planted a logic bomb in the switching code. Put $1 million in small unmarked bills in the following locker at the bus station, or in 4 hours every call made in Boston will get routed to dial-a-porn numbers in NYC. We'll tell you how to fix it as soon as we get the money."

Any bets that something like this will happen this year? Last year's WANK worm and politically-motivated viruses seem to suggest the time is ripe.

Gene Spafford
NSF/Purdue/U of Florida Software Engineering Research Center,
Dept. of Computer Sciences, Purdue University, W. Lafayette IN 47907-2004
uucp ...!{decwrl,gatech,ucbvax}!purdue!spaf

   [By the way, AT&T is certain it was an open&shut (a no-pun&shut?) case of a hardware-triggered software flaw, reproducible in the testbed ... PGN]