Routine that became a meltdown
By Sue Ashton Davies
from The Australian, 30 June 1998
(Transferred to the CII website because the
original is hard to read.)
A routine Friday afternoon batch job turned into disaster when a computer
meltdown brought a manufacturing system to its knees.
The computer room was humming, and all systems were go for one of Australia's
largest manufacturers.
Then Jeff Steel, project manager of Infact Consultants, reset the system
clock to January 7, 2000, and waited to see what would happen.
The routine batch job, which involved 800 custom-built Cobol and PL-1 programs
in a manufacturing mainframe environment, was expected to take six hours
to run.
Close by, a terminal in the control room was set up to track the programs
as they went through the batch run.
Although he anticipated some problems, Steel was not prepared for anything
coming out of left field.
His team of 12 programmers had worked methodically for nine months, manually
sifting through millions of lines of code, rectifying the double digit issue
to take account of the year 2000.
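
The article does not say which repair technique Steel's team applied, and the original programs were Cobol and PL-1, not Python. Purely as an illustration of the "double digit issue", the sketch below shows how date arithmetic breaks when a two-digit year is assumed to belong to the 1900s, and how one common remedy, a fixed "window", reinterprets it. The pivot value of 50 and all function names are assumptions made for this example.

```python
from datetime import date

# Assumed pivot for this sketch: 00-49 mean 20xx, 50-99 mean 19xx.
PIVOT = 50


def naive_year(yy: int) -> int:
    """Legacy assumption: every two-digit year belongs to the 1900s."""
    return 1900 + yy


def windowed_year(yy: int, pivot: int = PIVOT) -> int:
    """Expand a two-digit year using a fixed window around the pivot."""
    if not 0 <= yy <= 99:
        raise ValueError(f"two-digit year expected, got {yy}")
    return (2000 if yy < pivot else 1900) + yy


def days_between(start_yymmdd: str, end_yymmdd: str, expand) -> int:
    """Days from start to end, both given as YYMMDD strings."""
    def parse(text: str) -> date:
        return date(expand(int(text[0:2])), int(text[2:4]), int(text[4:6]))

    return (parse(end_yymmdd) - parse(start_yymmdd)).days


# 31 December 1999 to 7 January 2000, the date used in the test run.
broken = days_between("991231", "000107", naive_year)     # large negative number
fixed = days_between("991231", "000107", windowed_year)   # 7
```

With the legacy assumption, the week between the two dates comes out as a span of almost a century in the wrong direction; with the window it comes out as seven days.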
Great care had been taken to keep the crew motivated and focused on
their tasks to ensure time was spent productively and any reworking was
kept to a minimum.
At worst, he expected to make some specific changes that could be easily
spotted.
Operations had hardly begun before the first programs started to run slowly.
By the time the sixth program started, the system began to falter. Then,
one after another, programs fell over.
By the time the 10th program failed, Steel decided to let the job run to
the end, because in all likelihood, it would be all over in half an hour
anyway.
Within minutes, 750 programs had fallen over. One of the few programs to
continue running was invoicing, but it was producing invoices for the 43rd
day of the 14th month.
As the job finally ground to a halt, a silence hung over the room as everyone
stared vacantly into the terminal.
Steel stood frozen to the floor in shock, as did his team, which had been
engaged to fulfil a $3 million contract.
Twelve people stared at the terminal where a complete suite of programs
had died instantly.
Fortunately the meltdown had taken place in a test environment.
The search was now on to diagnose the problem. One of the team tracked down
the problem to an obscure mainframe program.
The culprit was a non-Y2K compliant link editor on a PL-1 program that last
ran in 1987.
A link editor takes different modules of a program and puts them together
in the right place at the right time.
With the problem identified and a Y2K compliant link editor installed, the
30 programs were rerun and the problem was solved.
Steel says the use of the test environment saved the company from bankruptcy.
"The consequences in a live environment would have been devastating,"
he says. As well as bringing the business to a standstill, it would have
rendered it unable to operate for six months - and possibly taken suppliers
and customers down with it.
Situations like this are typical of what's happening and testify to the
truth of rumours about large companies not yet meeting Y2K compliance requirements,
Steel says.
The post-mortem meeting found that the collective time required to diagnose
such an obscure problem in a live environment would have been about a month,
and a fix would have taken six months.
"The problem was so unusual, you wouldn't have known if it was hardware,
software or system utilities," Steel says. "The horrible thing about it was
that it was such an obscure component that nobody even thought that it
could fail."
Even with hindsight, the problem could never have been spotted before testing
because it was too obscure.
"In nine months of remediation, no-one had ever got near this problem,"
he says.
Steel says the meltdown was so catastrophic that even a contingency plan
wouldn't have saved the day.
The only way to find the Y2K bugs in a system is to manually trawl the program
code line by line to find the date fields, some of which are very obscure,
he says.
One area for dates was embedded deep in a job control language, where a
sort of 30 characters revealed six characters making up a date.
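
The manual trawl Steel describes was done by reading code, and nothing in the article suggests a tool like the following was used. As a rough sketch of why a date hidden inside a longer field is hard to spot, the example below scans a 30-character key for any run of six digits that would parse as a valid YYMMDD date. The key layout, its contents and the function name are invented for illustration.

```python
import re
from datetime import date

# Lookahead so that overlapping six-digit windows are all examined.
SIX_DIGITS = re.compile(r"(?=(\d{6}))")


def candidate_dates(field: str) -> list[tuple[int, str]]:
    """Return (offset, text) for each six-digit run that is a valid YYMMDD.

    This is a heuristic: it flags candidates for a human to review and
    cannot prove that a run of digits really is a date.
    """
    hits = []
    for match in SIX_DIGITS.finditer(field):
        text = match.group(1)
        yy, mm, dd = int(text[0:2]), int(text[2:4]), int(text[4:6])
        try:
            # 2000 + yy is used only to validate month and day; it is the
            # more permissive choice because 2000 is a leap year.
            date(2000 + yy, mm, dd)
        except ValueError:
            continue
        hits.append((match.start(1), text))
    return hits


# Hypothetical 30-character sort key with a date buried at offset 8.
key = "PLANT07-980630-BATCH-RUN-A----"
found = candidate_dates(key)
```

A scan like this only narrows the search. Someone still has to decide whether each flagged run of digits is a date or, say, a part number that happens to look like one.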
Even though the testing is complete, Steel cannot say definitely that the
system is now 100 per cent Y2K compliant.
As part of the strategy to protect himself and the company from any legal
recourse, he operated with an auditor looking over his shoulder at every
stage concurring that the way he was progressing was the best available
method.
"All I can say to the client is that I can't guarantee that there will
not be any problems after the year 2000," he says.
Steel says most organisations don't understand Y2K.
"Until something like this happens, they don't understand what Y2K
can do to them," he says.
© News Limited 1998
from THE AUSTRALIAN