Enter the maze

Double or nothing?

Ariane Rocket looking towards moon: Copyright www.iStockphoto.com 000020138771

If you spent billions of dollars on a gadget you'd probably like it to last more than a minute before it blows up. That's what happened to a European Space Agency rocket. How do you make sure the worst doesn't happen to you? How do you make machines reliable?

A powerful way to improve reliability is to use redundancy: double things up. A plane with four engines can keep flying if one fails. Worried about a flat tyre? You carry a spare in the boot. These situations are about making physical parts reliable. Most machines are a combination of hardware and software though. What about software redundancy?

You can have spare copies of software too. Rather than a single version of a program you can have several copies running on different machines. If one program goes wrong another can take over. It would be nice if it was that simple, but software is different to hardware. Two identical programs will fail in the same way at the same time: they are both following the same instructions so if one goes wrong the other will too. That was vividly shown by the maiden flight of the Ariane 5 rocket. Less than 40 seconds from launch things went wrong. The problem was to do with a big number that needed 64 bits of storage space to hold it. The program's instructions moved it to a storage place with only 16 bits. With not enough space, the number was mangled to fit. That led to calculations by its guidance system going wrong. The rocket veered off course and exploded. The program was duplicated, but both versions were the same so both agreed on the same wrong answers. Seven billion dollars went up in smoke.

Can you get round this? One solution is to get different teams to write programs to do the same thing. The separate teams may make mistakes but surely they won't all get the same thing wrong! Run them on different machines and let them vote on what to do. Then as long as more than half agree on the right answer the system as a whole will do the right thing. That's the theory anyway. Unfortunately in practice it doesn't always work. Nancy Leveson, an expert in software safety from MIT, ran an experiment where different programmers were given programs to write. She found they wrote code that gave the same wrong answers. Even if it had used independently written redundant code it's still possible Ariane 5 would have exploded.

Redundancy is a big help but it can't guarantee software works correctly. When designing systems to be highly reliable you have to assume things will still go wrong. You must still have ways to check for problems and to deal with them so that a mistake (whether by human or machine) won't turn into a disaster.