The following is a magazine excerpt about the “crash” of the AT&T Long Distance Network on January 15, 1990, caused by faulty software that had been installed on the Number 4 ESS (Electronic Switching System) toll tandems throughout the network. The software glitch disabled many switches across the network until the problem was traced to the faulty software.
The AT&T Crash from “The Risks Digest”
The following is an excerpt from The Risks Digest – Volume 9, Issue 62 – January 26, 1990.
Cause of AT&T network failure
“Peter G. Neumann”
Fri, 26 Jan 90 14:24:30 PST
From Telephony, Jan 22, 1990 p11:
“The fault was in the code” of the new software that AT&T loaded into front-end processors of all 114 of its 4ESS switching systems in mid-December, said Larry Seese, AT&T’s director of technology development. In detail:
The problem began the afternoon of Jan 15 when a piece of trunk interface equipment developed internal problems for reasons that have yet to be determined. The equipment told the 4ESS switch in New York that it was having problems and couldn’t correct the fault. “The recovery code is written so that the processor will run corrective initialization on the equipment. That takes four to six seconds. At the same time, new calls are stopped from coming into the switch,” Seese said.
The New York switch sent a message to all the other 4ESS switches it is linked with that it was not accepting additional traffic. Seese referred to that message as a “congestion signal.” After the switch successfully completed the reinitialization, the New York switch went back in service and began processing calls. That is when the fault in the new software reared its ugly head. Under the previous system, switch A would send out a message that it was working again, and switch B would double-check that switch A was back in service. With the new software, switch A begins processing calls and sends out call routing signals. The reappearance of traffic from switch A is supposed to tell switch B that A is working again.
“We made an improvement in the way we react to those messages so we can react more quickly. The first common channel signaling system 7 initial address message (caused by a call attempt) that switch B receives from switch A alerts B that A is back in service. Switch B then resets its internal logic to indicate that A is back in service,” said Seese.
The problem occurred when switch B got a second call-attempt message from A while it was in the process of resetting its internal logic. “[The message] confused the software. It tried to execute an instruction that didn’t make any sense. The software told switch B ‘My CCS7 processor is insane,’” so switch B shut itself down to avoid spreading the problem, Seese explained.
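The race Seese describes can be illustrated with a toy state machine for switch B's record of switch A. This is only a sketch under stated assumptions, not AT&T's code: the class `PeerTracker`, its method names, and the `reset_duration` value are all hypothetical, and time is passed in explicitly so the timing is easy to follow.

```python
from enum import Enum


class PeerState(Enum):
    OUT_OF_SERVICE = "out of service"
    RESETTING = "resetting"
    IN_SERVICE = "in service"


class PeerTracker:
    """Switch B's view of one neighbouring switch (hypothetical model)."""

    def __init__(self, reset_duration):
        self.state = PeerState.IN_SERVICE
        self.reset_duration = reset_duration  # assumed length of B's status update
        self.reset_done_at = 0.0
        self.faulted = False

    def on_congestion_signal(self):
        # The neighbour announced that it is not accepting traffic.
        self.state = PeerState.OUT_OF_SERVICE

    def on_call_attempt(self, now):
        """Handle a call-attempt message; return True if B has faulted."""
        # A reset that has already finished leaves the neighbour in service.
        if self.state is PeerState.RESETTING and now >= self.reset_done_at:
            self.state = PeerState.IN_SERVICE

        if self.state is PeerState.OUT_OF_SERVICE:
            # First message after an outage: start updating the status record.
            self.state = PeerState.RESETTING
            self.reset_done_at = now + self.reset_duration
        elif self.state is PeerState.RESETTING:
            # Second message arrived mid-reset: the case the new code mishandled.
            self.faulted = True
        return self.faulted


tracker = PeerTracker(reset_duration=0.01)
tracker.on_congestion_signal()
tracker.on_call_attempt(now=0.000)            # starts the reset
hit_bug = tracker.on_call_attempt(now=0.005)  # lands inside the reset window
```

In this model a second message at `now=0.02` instead of `now=0.005` would find the reset complete and cause no fault, matching the statement that the timing of the second message is what mattered.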
Unfortunately, switch B then sent a message to other switches that it was out of service and wasn’t accepting additional traffic. Once switch B reset itself and began operating again, it sent out call processing messages via the CCS7 link. That caused identical failures around the nation as other 4ESS switches got second messages from switch B while they were in the process of resetting their internal logic to indicate switch B was working again.
“It was a chain reaction. Any switch that was connected to B was put into the same condition.”
“The event just repeated itself in every [4ESS] switch over and over again. If the switches hadn’t gotten a second message while resetting, there would have been no problem. If the messages had been received farther apart, it would not have triggered the problem.”
AT&T solved the problem by reducing the messaging load of the CCS7 network. That allowed the switches to reset themselves and the network to stabilize.
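The chain reaction, and the reason a lighter messaging load ended it, can be sketched with a small event-driven simulation. Everything here is an illustrative assumption rather than a model of the real network: the function `simulate_cascade`, the fully connected topology, the 5-second recovery time (the article cites four to six seconds), the 10-millisecond `RESET_WINDOW`, and the rule that a recovering switch sends exactly two call-attempt messages to each neighbour, `message_gap` seconds apart.

```python
import heapq
from itertools import count

RECOVERY_TIME = 5.0   # seconds; assumed midpoint of "four to six seconds"
RESET_WINDOW = 0.01   # seconds; hypothetical time to update a peer's status


def simulate_cascade(num_switches, message_gap, horizon=60.0):
    """Count software-fault shutdowns in the first `horizon` seconds."""
    # status[dst][src] = (state, reset_done_at): dst's view of src.
    status = [dict() for _ in range(num_switches)]
    down_until = [0.0] * num_switches
    events = []          # heap of (time, tie_breaker, kind, src, dst)
    tie_breaker = count()
    faults = 0

    def go_out_of_service(switch, now):
        back = now + RECOVERY_TIME
        down_until[switch] = back
        for peer in range(num_switches):
            if peer == switch:
                continue
            # Congestion signal now, then two call attempts after recovery.
            heapq.heappush(events, (now, next(tie_breaker), "congestion", switch, peer))
            heapq.heappush(events, (back, next(tie_breaker), "call", switch, peer))
            heapq.heappush(events, (back + message_gap, next(tie_breaker), "call", switch, peer))

    # The initial outage is the ordinary equipment recovery, not a software fault.
    go_out_of_service(0, 0.0)

    while events:
        now, _, kind, src, dst = heapq.heappop(events)
        if now > horizon:
            break
        if kind == "congestion":
            status[dst][src] = ("down", 0.0)
            continue
        if now < down_until[dst]:
            continue  # dst is itself reinitializing and takes no new calls
        state, reset_done_at = status[dst].get(src, ("up", 0.0))
        if state == "down":
            status[dst][src] = ("resetting", now + RESET_WINDOW)
        elif state == "resetting" and now < reset_done_at:
            faults += 1                      # second message arrived mid-reset
            go_out_of_service(dst, now)
        else:
            status[dst][src] = ("up", 0.0)
    return faults


tight = simulate_cascade(num_switches=5, message_gap=0.005)  # gap shorter than the reset
spaced = simulate_cascade(num_switches=5, message_gap=0.05)  # gap longer than the reset
```

With the tight gap, every neighbour of the first switch faults and the failures keep recurring until the time horizon is reached. With the wider gap, which stands in for a reduced messaging load, each reset finishes before the next message arrives and no switch faults.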