A handful of articles that have stood out over the years as interesting case studies in the importance of good programming, and what that actually means.
UK Post Office Scandal
- Developers blamed for the Post Office scandal?
- Victim testimony
- Remote access and mistakes blamed on postmasters, a smoking gun
- Example of the terrible quality of the code - were they paid by the line of code submitted? Does this imply a faulty language conversion? Risks of overflow errors?
- One of the independent investigators from 2012 breaks silence
- Evidence that back in 2012 the independent investigators were told about remote access, implying perjury in all later cases
- They Could, And They Did Change Branch Transaction Data
- Miscarriage of justice - the Rose report
- List of current issues / bugs with the software as of 2017(?). One can only imagine how much worse it used to be
- Project manager on the original project discusses his impressions - how not to commission a complex project
- Inquiry Phase 2: Star Witness – Dave gives it both barrels
- Known errors in the software, perjury, and lack of disclosure
- Interview with the IT expert - throwing coconuts
- Interview with the other IT expert
- Attempt by the Post Office to recuse the judge - right after a verdict was handed down. Sour grapes, anyone?
- The cover up
- Why did the lawyer for the post office act this way?
- Thinking of alleging or pleading fraud: better read this first
Chinook Crash
Although only 17% of the code has been analysed [EDS, the code’s assessors, stopped work on the code because of the density of anomalies found], 21 category one and 153 category two anomalies have been revealed. One of these, the reliance on an undocumented and unproved feature of the processor [based on an Intel ASM96 chip], is considered to be positively dangerous.
EDS was not satisfied with Hawker Siddeley’s assurances. In its report on the Fadec software, EDS said: “Correct operation (of the Fadec) depends on an undocumented feature of the Intel ASM 96 microcomputer. EDS-Scicon’s concern is that a change in the mask or process of the ASM chip at some point in the future may cause incorrect operation of the FADEC.”
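The failure mode EDS warned about generalises beyond the Chinook: code that depends on undocumented silicon behaviour can break silently when the vendor revises the mask or process. A minimal C sketch of the pattern follows, assuming a hypothetical engine status register whose reserved bits happen to read as zero on current silicon; the register name, bit layout, and values are invented for illustration and taken from no real chip's documentation:

```c
#include <stdint.h>
#include <stdio.h>

/* Simulated status register. On real hardware this would be a volatile
 * memory-mapped read; here it is a plain variable so the sketch runs
 * anywhere. Bit 0 is the (hypothetical) documented "fuel valve ready"
 * bit; bits 1-7 are reserved and undocumented. */
static uint8_t engine_status;

#define FUEL_VALVE_READY 0x01u

/* FRAGILE: the equality test "works" only while the undocumented
 * reserved bits happen to read as zero on the current mask revision. */
static int ready_fragile(void) { return engine_status == FUEL_VALVE_READY; }

/* ROBUST: mask off the reserved bits and test only the documented one. */
static int ready_robust(void)  { return (engine_status & FUEL_VALVE_READY) != 0; }

int main(void)
{
    engine_status = 0x01;  /* today's silicon: reserved bits read back 0 */
    printf("old mask: fragile=%d robust=%d\n", ready_fragile(), ready_robust());

    engine_status = 0x41;  /* future mask revision: a reserved bit now reads 1 */
    printf("new mask: fragile=%d robust=%d\n", ready_fragile(), ready_robust());
    return 0;
}
```

The fragile check bakes an undocumented guarantee (reserved bits read as zero) into a safety decision; the robust check tests only the documented bit and survives the mask revision.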
Bugs with the Therac-25 computer-controlled radiation therapy machine
If this story has a hero, it’s [Fritz Hager], the staff physicist at the East Texas Cancer Center in Tyler, Texas. After the second incident at his facility, he was determined to get to the bottom of the problem. In both cases, the Therac-25 displayed a “Malfunction 54” message, which was not mentioned in the manuals. AECL explained that Malfunction 54 meant the Therac-25’s computer could not determine whether there had been an underdose or an overdose of radiation.
The same radiotherapy technician had been involved in both incidents, so [Fritz] brought her back into the control room to attempt to recreate the problem. The two “locked the doors” NASA style, working into the night and through the weekend to reproduce it. With the technician running the machine, the two were able to pinpoint the issue. The VT-100 console used to enter Therac-25 prescriptions allowed cursor movement via the cursor up and down keys. If the user selected X-ray mode, the machine would begin configuring itself for high-powered X-rays, a process that took about 8 seconds. If the user switched to Electron mode within those 8 seconds, the turntable would not move to the correct position, leaving it in an unknown state.
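The original PDP-11 assembly was never published, but the behaviour [Fritz] reproduced maps onto a classic check-then-act race: the setup sequence samples the mode once and never re-checks it while the hardware is still moving. A minimal C sketch of that pattern, with the names, the step loop, and which parameter goes stale all invented for illustration:

```c
#include <stdio.h>

typedef enum { XRAY, ELECTRON } beam_mode;

static beam_mode requested = XRAY;     /* what the console currently shows */

int main(void)
{
    beam_mode latched   = requested;   /* BUG: mode sampled once at setup start */
    beam_mode turntable = latched;     /* hardware starts moving immediately */
    beam_mode energy;

    for (int second = 0; second < 8; second++) {
        /* Operator cursors up and corrects the mode mid-setup. */
        if (second == 3)
            requested = ELECTRON;
        /* The setup loop never re-reads `requested` for the turntable. */
    }

    energy = requested;                /* the energy routine *does* see the edit */

    if (energy != turntable)
        printf("HAZARD: beam energy and turntable position disagree\n");
    else
        printf("Setup consistent\n");
    return 0;
}
```

Running this prints the hazard line: the energy routine sees the operator's edit, while the turntable logic acted on the stale value sampled eight seconds earlier, leaving exactly the inconsistent state described above.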
As the investigations and lawsuits progressed, the software for the Therac-25 was placed under scrutiny. The Therac-25’s PDP-11 was programmed completely in assembly language: not only the application, but also the underlying executive, which took the place of an operating system. The computer was tasked with real-time control of the machine, covering both its normal operation and its safety systems. Today this sort of job could be handled by a microcontroller or two, with a PC running a GUI front end.
AECL never publicly released the source code, but several experts including [Nancy Leveson] did obtain access for the investigation. What they found was shocking. The software appeared to have been written by a programmer with little experience coding for real-time systems. There were few comments, and no proof that any timing analysis had been performed. According to AECL, a single programmer had written the software based upon the Therac-6 and 20 code. However, this programmer no longer worked for the company, and could not be found.
The lawsuits were settled out of court. It seemed like the problems were solved until January 17th, 1987, when another patient was overdosed in Yakima, Washington. This problem was a new one: a counter overflow. If the operator sent a command at the exact moment the counter overflowed, the machine would skip setting up some of the beam accessories, including moving the stainless steel aiming mirror. The result was once again an unscanned beam, and an overdose. The patient died 3 months later.
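Leveson and Turner’s published analysis traces this to a one-byte flag that the setup code incremented on every pass instead of setting to a constant, so on every 256th pass it wrapped to zero, the value meaning “setup checks passed”. A hedged C sketch of that pattern (the variable and function names are invented; the real code was PDP-11 assembly):

```c
#include <stdint.h>
#include <stdio.h>

/* 0 means "checks passed"; any nonzero value means "setup incomplete". */
static uint8_t setup_incomplete = 0;

static void setup_loop_pass(void)
{
    setup_incomplete++;               /* BUG: one byte, so 255 wraps to 0 */
}

static void set_instead(void)
{
    setup_incomplete = 1;             /* FIX: set to a constant, never increment */
}

int main(void)
{
    for (int pass = 1; pass <= 256; pass++) {
        setup_loop_pass();
        if (setup_incomplete == 0)
            printf("pass %d: flag wrapped to 0 -- safety check skipped "
                   "if the operator sends a command right now\n", pass);
    }
    (void)set_instead;                /* referenced so the fix compiles too */
    return 0;
}
```

The documented fix was essentially the set_instead variant: assign a fixed nonzero value rather than increment, so the flag can never wrap back to the “safe” value.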
It’s important to note that while the software was the linchpin in the Therac-25 accidents, it wasn’t the root cause. The entire system design was the real problem. Safety-critical loads were placed upon a computer system that was not designed to carry them. Timing analysis wasn’t performed. Unit testing never happened. Fault trees for both hardware and software were never created. These tasks are not only the responsibility of the software engineers, but of the systems engineers on the project. The Therac-25 is long gone, but its legacy lives on. This was the watershed event that showed how badly things can go wrong when software for life-critical systems is not properly designed and adequately tested.
AI Hype Train
Others
- When Small Software Bugs Cause Big Problems