What was the historical impact of Ariane 5’s Flight 501

Tags: bug, history, testing

The disintegration of the Ariane 5 rocket 37 seconds after launch on its maiden voyage (Flight 501) is commonly referred to as one of the most expensive software bugs in history1:

It took the European Space Agency 10 years and $7 billion to produce Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites into orbit with each launch and intended to give Europe overwhelming supremacy in the commercial space business.

All it took to explode that rocket less than a minute into its maiden voyage last June, scattering fiery rubble across the mangrove swamps of French Guiana, was a small computer program trying to stuff a 64-bit number into a 16-bit space.

One bug, one crash. Of all the careless lines of code recorded in the annals of computer science, this one may stand as the most devastatingly efficient. From interviews with rocketry experts and an analysis prepared for the space agency, a clear path from an arithmetic error to total destruction emerges.

What major changes did Flight 501's failure and the subsequent investigations inspire in the research of safety-critical systems and software testing?

I'm not looking for an explanation of the bug itself, but for an explanation of the historical impact of the bug, in terms of research that was inspired by, or directly related to, the investigation(s) of the failure. For example, this paper concludes:

We have used static analysis to:

  • check the initialization of variables,
  • provide the exhaustive list of potential data access conflicts for shared variables,
  • exhaustively list the potential run time errors from the Ada semantics.

To our knowledge this is the first time boolean-based and non boolean-based static analysis techniques are used to validate industrial programs.

Similarly, this paper (pdf) notes:

Abstract interpretation based static program analyses have been used for the static analysis of the embedded ADA software of the Ariane 5 launcher and the ARD. The static program analyser aims at the automatic detection of the definiteness, potentiality, impossibility or inaccessibility of run-time errors such as scalar and floating-point overflows, array index errors, divisions by zero and related arithmetic exceptions, uninitialized variables, data races on shared data structures, etc. The analyzer was able to automatically discover the Ariane 501 flight error. The static analysis of embedded safety critical software (such as avionic software) is very promising.
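The flavour of analysis that paper describes can be sketched with a toy interval domain: the analyser tracks the range of values a variable may take and classifies each narrowing conversion as a definite, potential, or impossible run-time error. This is a purely illustrative sketch in Python, not the actual analyser used on the Ariane 5 software, and the numeric ranges are made up:

```python
# Toy "abstract interpretation": represent each value as an interval
# [lo, hi] and classify a narrowing conversion to a 16-bit signed
# integer as a definite, potential, or impossible overflow.
# Illustrative only; not the analyser from the cited papers.

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def check_convert_to_int16(interval):
    """Classify the conversion of a value known to lie in `interval`."""
    lo, hi = interval
    if hi < INT16_MIN or lo > INT16_MAX:
        return "definite overflow"      # every possible value overflows
    if lo < INT16_MIN or hi > INT16_MAX:
        return "potential overflow"     # some possible values overflow
    return "safe"                       # interval fits entirely in int16

# Hypothetical bounds: an Ariane 4-like envelope for the horizontal
# velocity value fits in 16 bits, an Ariane 5-like envelope does not,
# which is exactly the kind of warning such an analyser emits.
print(check_convert_to_int16((0, 30_000)))   # safe
print(check_convert_to_int16((0, 64_000)))   # potential overflow
```

The point of the sketch is that the warning falls out of the value ranges alone, without ever executing the program, which is why such an analyser could "automatically discover" the Flight 501 error after the fact.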

I would love a thorough explanation of the impact this single event had on software testing approaches and tools.

1 The $7 billion figure likely refers to the total cost of the Ariane 5 programme; Wikipedia reports that the failure itself resulted in a loss of more than $370 million. Still quite an expensive failure, but nowhere near the $7 billion figure.

Best Answer

Technically speaking, it was more a case of "software rot". The flight control software was recycled from the earlier Ariane 4 rocket, a sensible move given how expensive software is to develop, especially mission-critical software, which must be tested and verified to far more rigorous standards than most commercial software.

Unfortunately, nobody tested what effect the change in operating environment would have, or if they did, the testing was not sufficiently thorough.

The software was built on the assumption that certain parameters (thrust, acceleration, fuel consumption rates, vibration levels, etc.) would never exceed certain values. In normal flight on an Ariane 4 this wasn't a problem, because those parameters would never reach invalid values unless something was already spectacularly wrong. The Ariane 5, however, is much more powerful, and values that would seem absurd on the 4 could quite easily occur on the 5.

The parameter that went out of range was a value related to the launcher's horizontal velocity (the "horizontal bias", BH, computed in the inertial reference system). When it overflowed, the software was unable to cope: it suffered an arithmetic exception for which insufficient error checking and recovery code had been implemented. The guidance computer started sending garbage to the engine nozzle gimbals, which swivelled the nozzles pretty much at random. The rocket started to tumble and break up, and the automatic self-destruct system, detecting that the rocket was in an unsafe, irrecoverable attitude, finished the job.
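The core of the failure was an unguarded conversion of a 64-bit floating-point value to a 16-bit signed integer. A minimal sketch of the difference between an unguarded and a guarded conversion, with Python standing in for the original Ada (the specific numbers are illustrative, and the guarded variant is just one possible protection strategy, saturation):

```python
import struct

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_unguarded(value):
    # Mirrors the reused Ariane 4 code path: the conversion is assumed
    # to be safe, so an out-of-range value raises an exception that
    # nothing handles -- analogous to the unhandled Operand Error that
    # shut down the inertial reference system on Flight 501.
    return struct.pack(">h", int(value))  # raises struct.error on overflow

def to_int16_guarded(value):
    # What a protected conversion might look like: saturate instead of
    # letting an exception propagate up and halt the processor.
    clamped = max(INT16_MIN, min(INT16_MAX, int(value)))
    return struct.pack(">h", clamped)

# An Ariane 4-like value fits; an Ariane 5-like value does not
# (illustrative magnitudes, not flight data).
to_int16_unguarded(21_000.0)        # fine
to_int16_guarded(64_000.0)          # clamps to 32767 instead of failing
try:
    to_int16_unguarded(64_000.0)    # overflows: the Flight 501 path
except struct.error as exc:
    print("unhandled overflow:", exc)
```

Note that saturation is not automatically the right fix for flight software; the real lesson was that the behaviour on out-of-range input must be a deliberate, analysed decision rather than an accident.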

To be honest, this incident probably didn't teach any new lessons: these kinds of problems had been unearthed before in all manner of systems, and strategies for finding and fixing such errors were already in place. What the incident did do was ram home the point that being lax in following those strategies can have enormous consequences, in this case hundreds of millions of dollars of destroyed hardware, some extremely pissed off customers and an ugly dent in the reputation of Arianespace.

This particular case was especially glaring because a shortcut taken to save money ended up costing a huge amount, in both money and reputation. If the software had been tested as rigorously in a simulated Ariane 5 environment as it had been when originally developed for the Ariane 4, the error would surely have come to light long before the software was installed in launch hardware and put in command of an actual flight. Moreover, if a developer had deliberately thrown some nonsense input at the software, the error might even have been caught in the Ariane 4 era, since it would have exposed the inadequacy of the error recovery that was in place.
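The "nonsense input" idea above is essentially boundary-value and fuzz testing. A toy harness in Python, using a hypothetical stand-in for the conversion routine, shows how cheaply such testing surfaces inputs the code cannot cope with:

```python
import random

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16(value):
    # Hypothetical routine under test: like the Flight 501 code path,
    # it has no recovery strategy for out-of-range input.
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in 16 bits")
    return result

def fuzz_conversion(trials=1_000, seed=501):
    # Throw boundary values plus random inputs at the routine and
    # record every input it fails on.
    rng = random.Random(seed)
    candidates = [0.0, INT16_MAX, INT16_MAX + 1, INT16_MIN, INT16_MIN - 1]
    candidates += [rng.uniform(-1e6, 1e6) for _ in range(trials)]
    failures = []
    for value in candidates:
        try:
            to_int16(value)
        except OverflowError:
            failures.append(value)
    return failures

failures = fuzz_conversion()
print(f"{len(failures)} inputs overflowed, e.g. {failures[1:3]}")
```

Even the five hand-picked boundary candidates are enough to expose the missing error handling; the random inputs just make the gap harder to miss.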

So in short, it didn't really teach new lessons, but it rammed home the dangers of forgetting old ones. It also demonstrated that the environment in which a software system operates is every bit as important as the software itself: just because the software is verifiably correct for environment X doesn't mean it is fit for purpose in the similar but distinct environment Y. Finally, it highlighted how important it is for mission-critical software to be robust enough to deal with circumstances that supposedly cannot happen.

Contrast Flight 501 with Apollo 11 and its computer problems. While the LGC software suffered a serious glitch during the landing (the 1201/1202 program alarms), it was designed to be extremely robust: it remained in an operational state in spite of the alarms, without putting any astronauts in danger, and was still able to complete its mission.
