Ariane software fault
Based on the extensive documentation and data on the Ariane failure made available to the Board, the following chain of events, their inter-relations and causes have been established, starting with the destruction of the launcher and tracing back in time towards the primary cause.
The internal events of the SRI (the Inertial Reference System) that led to the failure have been reproduced by simulation calculations. Furthermore, both SRIs were recovered during the Board's investigation and the failure context was precisely determined from memory readouts. In addition, the Board has examined the software code, which was shown to be consistent with the failure scenario. The results of these examinations are documented in the Technical Report. Therefore, it is established beyond reasonable doubt that the chain of events set out above reflects the technical causes of the failure of Ariane.
In the failure scenario, the primary technical causes are the Operand Error when converting the horizontal bias variable BH, and the lack of protection of this conversion, which caused the SRI computer to stop.
To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an exception, including an Operand Error. In particular, the conversion of floating point values to integers was analysed and operations involving seven variables were at risk of leading to an Operand Error.
This led to protection being added to four of the variables, evidence of which appears in the Ada code. However, three of the variables were left unprotected. No reference to justification of this decision was found directly in the source code. Given the large amount of documentation associated with any industrial application, the assumption, although agreed, was essentially obscured, though not deliberately, from any external review.
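The difference between a protected and an unprotected conversion can be sketched as follows. This is a minimal Python illustration, not the Ada code from the SRI: the 16-bit signed target range, the function names, the OperandError class and the choice of clamping as the form of protection are all assumptions made for the example. The text above does not say which form the protection took.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed target range


class OperandError(Exception):
    """Hypothetical stand-in for the Operand Error raised by the conversion."""


def to_int16_unprotected(value: float) -> int:
    """Convert a finite float to a 16-bit integer with no guard.

    An out-of-range value raises, as an unprotected conversion would.
    """
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


def to_int16_protected(value: float) -> int:
    """Convert a finite float to a 16-bit integer, clamping out-of-range values.

    Clamping is only one possible form of protection (an assumption here).
    """
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

With the unprotected variant, a single out-of-range value propagates an exception to whatever exception-handling policy the system specifies; with the protected variant, the caller receives a bounded value instead.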
The reason for the three remaining variables, including the one denoting horizontal bias, being unprotected was that further reasoning indicated that they were either physically limited or that there was a large margin of safety, a reasoning which in the case of the variable BH turned out to be faulty. It is important to note that the decision to protect certain variables but not others was taken jointly by project partners at several contractual levels.
There is no evidence that any trajectory data were used to analyse the behaviour of the unprotected variables, and it is even more important to note that it was jointly agreed not to include the Ariane 5 trajectory data in the SRI requirements and specification.
Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure.
In the event of any kind of exception, the system specification stated that the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane), and finally the SRI processor should be shut down. It was the decision to cease the processor operation which finally proved fatal.
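The three specified steps can be sketched as a small Python model. This is an illustration under stated assumptions, not the SRI implementation: the class SriUnit, its fields and the method names are hypothetical, and the databus and EEPROM are modelled as plain lists.

```python
from dataclasses import dataclass, field


@dataclass
class SriUnit:
    """Hypothetical model of one unit following the specified policy."""

    name: str
    running: bool = True
    databus: list = field(default_factory=list)  # messages sent out by the unit
    eeprom: list = field(default_factory=list)   # stored failure contexts

    def handle_exception(self, exc: Exception, context: dict) -> None:
        # Step 1: indicate the failure on the databus.
        self.databus.append(f"{self.name}: FAILURE {type(exc).__name__}")
        # Step 2: store the failure context in non-volatile memory.
        self.eeprom.append(dict(context))
        # Step 3: shut the processor down.
        self.running = False

    def attitude(self, compute):
        """Return compute() while running; None once the unit has shut down."""
        if not self.running:
            return None
        try:
            return compute()
        except Exception as exc:
            where = getattr(compute, "__name__", "?")
            self.handle_exception(exc, {"where": where})
            return None
```

In this model, once handle_exception has run, every later call to attitude returns None: the unit supplies no further attitude data, whatever the cause of the exception.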
Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view, exception (or error) handling mechanisms are designed for a random hardware failure, which can quite rationally be handled by a backup system.
Although the failure was due to a systematic software design error, mechanisms can be introduced to mitigate this type of problem. For example the computers within the SRIs could have continued to provide their best estimates of the required attitude information.
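One way to picture such a mitigation is a source that keeps supplying its last good value, flagged as degraded, when a computation fails. The sketch below is an assumption about what "best estimate" could mean; the text above does not specify a mechanism, and the class and attribute names are hypothetical.

```python
class BestEstimateSource:
    """Keeps returning the last good value when a computation fails."""

    def __init__(self, initial):
        self.last_good = initial
        self.degraded = False

    def read(self, compute):
        """Return (value, degraded_flag) instead of halting on an error."""
        try:
            self.last_good = compute()
            self.degraded = False
        except ArithmeticError:
            # Confine the failure: flag it, but keep supplying an estimate.
            self.degraded = True
        return self.last_good, self.degraded
```

The consumer can then decide how far to trust a degraded value, rather than losing the data source altogether.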
There is reason for concern that a software exception should be allowed, or even required, to cause a processor to halt while handling mission-critical equipment. Indeed, the loss of a proper software function is hazardous because the same software runs in both SRI units.
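Why redundancy does not help against a systematic software error can be shown with a toy model: two units running the same conversion on the same inputs stop at the same point. The function names, the 16-bit range and the input values are hypothetical and chosen only for illustration.

```python
def horizontal_bias_to_int16(value: float) -> int:
    """Unprotected conversion shared by both units (assumed 16-bit range)."""
    result = int(value)
    if not -32768 <= result <= 32767:
        raise OverflowError("operand error")
    return result


def run_unit(inputs):
    """Return the index of the first input that halts the unit, or None."""
    for index, value in enumerate(inputs):
        try:
            horizontal_bias_to_int16(value)
        except OverflowError:
            return index  # halt-on-exception policy: the unit stops here
    return None


def redundant_pair(inputs):
    """Both units run identical software on the same trajectory."""
    return run_unit(inputs), run_unit(inputs)
```

For any input sequence, both elements of the returned pair are equal: the backup halts on exactly the same value as the primary, which is the difference between a design error and a random hardware failure.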
In the case of Ariane, the requirement to halt on any exception resulted in the switch-off of two still healthy critical units of equipment.
The original requirement accounting for the continued operation of the alignment software after lift-off was brought forward more than 10 years ago for the earlier models of Ariane, in order to cope with the rather unlikely event of a hold in the count-down.
The period selected for this continued alignment operation, 50 seconds after the start of flight mode, was based on the time needed for the ground equipment to resume full control of the launcher in the event of a hold. This special feature made it possible, with the earlier versions of Ariane, to restart the count-down without waiting for normal alignment, which takes 45 minutes or more, so that a short launch window could still be used.
In fact, this feature was used once. The same requirement does not apply to Ariane 5, which has a different preparation sequence. It was nevertheless maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4.
Even in those cases where the requirement is found to be still valid, it is questionable for the alignment function to be operating after the launcher has lifted off. Alignment of mechanical and laser strap-down platforms involves complex mathematical filter functions to properly align the x-axis to the gravity axis and to find north direction from Earth rotation sensing. The assumption of preflight alignment is that the launcher is positioned at a known and fixed position.
Therefore, the alignment function is totally disrupted when performed during flight, because the measured movements of the launcher are interpreted as sensor offsets and other coefficients characterising sensor behaviour.
Returning to the software error, the Board wishes to point out that software is an expression of a highly detailed design and does not fail in the same sense as a mechanical system. Furthermore, software is flexible and expressive and thus encourages highly demanding requirements, which in turn lead to complex implementations which are difficult to assess.
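The point made above about alignment during flight, that real movement is mistaken for sensor offset, can be illustrated with a deliberately simplified estimator. This is a toy model of a single accelerometer axis that should read zero when the launcher is stationary, not the filter functions used in the SRI; the offset and acceleration values are hypothetical and in arbitrary units.

```python
def estimate_sensor_offset(samples):
    """Stationary assumption: true acceleration is zero on this axis,
    so the mean reading is taken to be the sensor offset."""
    return sum(samples) / len(samples)


TRUE_OFFSET = 0.02  # hypothetical sensor offset

# On the pad: readings are the offset plus small noise.
on_pad = [TRUE_OFFSET + noise for noise in (0.001, -0.001, 0.002, -0.002)]

# In flight: readings are the offset plus real launcher acceleration.
in_flight = [TRUE_OFFSET + accel for accel in (1.0, 2.0, 3.0, 4.0)]

pad_estimate = estimate_sensor_offset(on_pad)        # close to TRUE_OFFSET
flight_estimate = estimate_sensor_offset(in_flight)  # dominated by real motion
```

On the pad the estimate recovers the offset; in flight the same calculation attributes the launcher's acceleration to the sensor, because the assumption of a fixed position no longer holds.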
An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. The supplier of the SRI was only following the specification given to it, which stipulated that in the event of any detected exception the processor was to be stopped.
The exception which occurred was not due to random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design.
The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.
This means that critical software (in the sense that failure of the software puts the mission at risk) must be identified at a very detailed level, that exceptional behaviour must be confined, and that a reasonable back-up policy must take software failures into account.
The Flight Control System qualification for Ariane 5 follows a standard procedure and is performed at several levels. The logic applied is to check at each level what could not be achieved at the previous level, thus eventually providing complete test coverage of each sub-system and of the integrated system.
Testing at equipment level was in the case of the SRI conducted rigorously with regard to all environmental factors and in fact beyond what was expected for Ariane 5. However, no test was performed to verify that the SRI would behave correctly when being subjected to the count-down and flight time sequence and the trajectory of Ariane 5. It should be noted that for reasons of physical law, it is not feasible to test the SRI as a "black box" in the flight environment, unless one makes a completely realistic flight test, but it is possible to do ground testing by injecting simulated accelerometric signals in accordance with predicted flight parameters, while also using a turntable to simulate launcher angular movements.
Had such a test been performed by the supplier or as part of the acceptance test, the failure mechanism would have been exposed. The main explanation for the absence of this test has already been mentioned above.
The Board has also noted that the systems specification of the SRI does not indicate operational restrictions that emerge from the chosen implementation. Such a declaration of limitation, which should be mandatory for every mission-critical device, would have served to identify any non-compliance with the trajectory of Ariane 5.
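A declaration of limitation of this kind could be checked mechanically against a predicted trajectory. The sketch below assumes a hypothetical format: each limit names a variable and its permitted range, and the trajectory is a list of (time, values) samples. The numbers used with it are illustrative and are not Ariane data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredLimit:
    """Operational restriction declared for one variable (hypothetical format)."""

    name: str
    low: float
    high: float


def find_violations(limits, trajectory):
    """Return (time, variable, value) for every sample outside its declared limit."""
    by_name = {limit.name: limit for limit in limits}
    violations = []
    for time, sample in trajectory:
        for variable, value in sample.items():
            limit = by_name.get(variable)
            if limit is not None and not limit.low <= value <= limit.high:
                violations.append((time, variable, value))
    return violations


# Illustrative use with made-up numbers.
limits = [DeclaredLimit("BH", -32768, 32767)]
predicted = [(0.0, {"BH": 100.0}), (30.0, {"BH": 40000.0})]
report = find_violations(limits, predicted)
```

An empty report would indicate compliance with the declared limits; any entry identifies the time and variable at which the predicted trajectory leaves the range the equipment was built for.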
Despite this, the board did not focus solely on such issues. There was a broad range of potential causes and questions and, as with any investigation, its members wanted to exhaust all possibilities. Later detection logic in the software examined the operand flag, interpreted it as a manifestation of a hardware error, and sent out a diagnostic message that included the location in the software where the operand flag had been set to true.
Any software error that manifested itself to the fault detection logic would have led to the primary system being ignored and the secondary system used, but the systematic error would be interpreted as a failure in the secondary backup system. This would have led the onboard computer to believe the rocket had failed. Much of the problem appeared to revolve around the culture of those involved in the project.
With mechanical systems, if such an incident had occurred and caused the inertial reference system to fail, you would go to a backup. Although the Ariane 5 project went down in history as a monumental failure, the code was well written and a very good software engineering process had been followed throughout.
The organization that had written the software had initially put a guard against this kind of situation into the code, so that if there was an output that was larger than 16 bits, those working on the spacecraft would have been alerted earlier during the testing phases.
However, it was removed because they were motivated to reduce the loading time of the processor and so took away elements they thought were unnecessary. A committee looked at every aspect of the rocket—its reliability, availability, maintainability, and safety.
Despite this, they still failed to appreciate the devastating impact removing the guard would have. Perhaps the fundamental issue was that the whole failure arose from the requirements that had been set.
The problem was that they faithfully reproduced software that met the requirements set for Ariane 4, whereas Ariane 5 had further requirements, particularly in the area where the error arose.
Rotibi feels this slack behavior of building on top of what was there before is inherent in many issues with software, even today.