See things clearly
There is an instructive contrast between NASA’s brilliantly successful landing of the Curiosity rover on Mars last week and the recent ‘software glitch’ at Knight Capital, which nearly wrecked the company.
NASA projects like Curiosity are highly complex and highly dangerous, yet, when we consider what they entail, failures are rare – why?
A key explanation is that good engineering practice is at the heart of space exploration projects.
For example, documentation and change control are very important to NASA. Procedures are documented for everything from “Protection of Human Research Subjects” to “Risk Classification for NASA Payloads” to “NASA Software Engineering Requirements”.
Standards too are critical. NASA uses over 2300 engineering standards, including over 70 issued by NASA itself. These range from a “Process for Limiting Orbital Debris”, to “Applying Data Matrix Identification Symbols on Aerospace Parts”, to a “Software Assurance Standard”.
Software will play a key role in the success of the Curiosity mission.
Having touched down safely on Mars, the rover has already had a ‘brain transplant’. The software used to optimise Curiosity’s landing has been updated, so that it is now ‘optimised to drive the rover, operate the robotic arm and scoop up and analyse soil samples.’
At a press conference last week, a NASA senior software engineer said of the process that ‘engineers will take their time, easing into the upgrade to make sure there are no problems’:
"On Sol 5 [Martian day 5], we'll do a toe dip into the new software...We'll install it softly just to check it out. Then if everything looks good, on Sol 6 we'll do the full install on the main computer. Then on Sol 7 we'll start with the backup computer."
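The staged approach the engineer describes (a cautious trial first, then the main computer, then the backup) can be sketched in a few lines of Python. This is only an illustration of the idea, not NASA's actual procedure: the function `staged_rollout`, the `install` and `health_check` callbacks and the stage names are all hypothetical.

```python
from typing import Callable, List


def staged_rollout(
    stages: List[str],
    install: Callable[[str], None],
    health_check: Callable[[str], bool],
) -> List[str]:
    """Install stage by stage, stopping at the first failed check.

    Returns the stages that were installed and passed their check.
    """
    completed: List[str] = []
    for stage in stages:
        install(stage)
        if not health_check(stage):
            # Stop here: later stages (e.g. the backup computer) are never touched.
            break
        completed.append(stage)
    return completed


# Hypothetical stage names, loosely following the Sol 5 / Sol 6 / Sol 7 plan.
STAGES = ["trial run", "main computer", "backup computer"]

installed: List[str] = []  # records every stage where an install was attempted
done = staged_rollout(
    STAGES,
    install=installed.append,
    health_check=lambda stage: stage != "main computer",  # simulate a fault
)
print(done)
```

The point of the design is that a fault found at one stage leaves the later stages untouched. In this simulated run the backup computer never receives the new software and keeps running the old, known-good version.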
Given its mission-critical importance, it is not surprising that such software is subject to rigorous testing and ongoing scrutiny by NASA engineers.
The “NASA Software Safety Standard” provides,
“...the requirements for software safety across all NASA Centers, programs and facilities. It describes the activities necessary to ensure that safety is designed into the software that is acquired or developed by NASA.”
The “Software Formal Inspections Standard”, meanwhile, is designed,
“...to support the inspection process of software developed for NASA. Its goal is to provide a framework and model for an inspection process that will detect and eliminate defects as early as possible in the software life cycle.”
Shareholders in Knight Capital must be wishing that the company’s software development and change controls had been subject to NASA-like degrees of scrutiny.
Knight is a US market-maker which executes about 10 percent of US share volume. On August 1st, a problem resulting from the installation of new software caused Knight’s computers to go on an ‘out-of-control spree of rapid-fire buying and selling’, and produced wild swings in the share prices of over 150 companies.
As a result, Knight lost $440 million in less than an hour. The company’s share price dropped like a stone, and a rescue deal had to be put together, whereby what amounts to 70% of the company was put in the hands of new investors.
Bloomberg reports, citing anonymous sources, that the
‘...trading loss stemmed from old computer software that was inadvertently reactivated when a new program was installed...the dormant system started multiplying stock trades by one thousand...staff looked through eight sets of software before determining what happened...’
But, as has been said over at the Institute of Electrical and Electronics Engineers, we will have to wait until the authorities complete an investigation before we discover, perhaps, how
“...the dormant software awakened and interposed itself when it came to executing trades that were supposed to be initiated by the new software Knight had installed... [and why] Knight would keep ‘eight sets of software’ apparently resident in its execution environment...”
Whatever the explanation, be it poor software coding, a breakdown in change control procedure, a mixture of both, or something else, it was a very expensive mistake. Knight, a company worth $1.5 billion, and employing nearly 1500 people, was almost destroyed in a few minutes.
Given the number of ‘glitches’ in banking and trading during the last couple of years, it is clear that finance has to consider new ways of working. Sooner or later, one of these events will have a major destructive impact on a national economy, perhaps even on the global economy.
Unless the industry adopts a well-proven ‘engineering approach’ to its business-critical systems, similar to the one used at NASA, further problems are inevitable.
NASA itself suffers occasional project failures. Good engineering practice is no guarantee of success, especially when you are at the cutting edge of technology. But an engineering approach gives you the best chance of avoiding disaster.
As Curiosity’s lead engineer said after the rover landed successfully, “We trained ourselves for eight years to think the worst all the time. You can never turn that off.”