Explain fail-safe vs fail-operational with examples

Question

Accepted Answer

Fail-safe means that when a fault is detected, the system transitions to a safe state that eliminates the hazard, even if it means the system stops providing its intended function. The key assumption is that a safe state exists and is acceptable.
Examples: an industrial robot detects an encoder fault and stops all motion, because a stationary robot cannot injure workers. An electric heater detects a temperature sensor failure and shuts off the heating element; a cold room is uncomfortable but not dangerous. An automotive ABS system detects an internal fault and disables ABS, reverting to conventional braking, so the driver loses anti-lock capability but retains basic braking. In all these cases, the loss of function is acceptable because the alternative (continued operation with a potential fault) could cause harm.
Fail-operational means that when a fault is detected, the system continues to provide its intended function (or a degraded but still functional version of it), because the loss of function itself is dangerous. Fail-operational requires redundancy: duplicate or diverse hardware and software that can take over when the primary system fails.
Examples: a fly-by-wire aircraft flight control system cannot simply "stop" when a computer fails; it must continue providing flight control. Typically this is achieved with triple-redundant computers using majority voting. A self-driving vehicle (SAE Level 4+) must remain controllable after a fault because there may be no human driver to take over. It uses redundant steering actuators, redundant power supplies, and redundant compute platforms to continue operating in a degraded mode long enough to execute a minimum risk maneuver (pulling over safely). A medical ventilator cannot simply stop either. It uses redundant pressure sensors, backup motor controllers, and a separate safety-critical monitoring processor to ensure airflow continues.
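To make the contrast concrete, here is a minimal Python sketch of the two fault reactions described above. It is an illustration only, not production safety code: the function names (heater_command, vote_2oo3), the integer command values, and the convention that a failed channel reports None are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional, Sequence


def heater_command(sensor_ok: bool, temp_c: float, setpoint_c: float) -> bool:
    """Fail-safe reaction: True energizes the heater, False shuts it off."""
    if not sensor_ok:
        # Fault detected: enter the safe state (heater off) and give up the function.
        return False
    return temp_c < setpoint_c


def vote_2oo3(commands: Sequence[Optional[int]]) -> int:
    """Fail-operational reaction: majority vote across three channels.

    Hypothetical convention: a channel that has detected its own failure
    reports None. The function keeps delivering a command as long as at
    least two channels agree, and raises once no majority exists.
    """
    if len(commands) != 3:
        raise ValueError("expected exactly three channel outputs")
    valid = [c for c in commands if c is not None]
    if valid:
        value, count = Counter(valid).most_common(1)[0]
        if count >= 2:
            return value  # one faulty or silent channel is outvoted
    raise RuntimeError("no majority: redundancy exhausted")
```

With these definitions, heater_command(False, 10.0, 20.0) returns False regardless of the temperature reading, while vote_2oo3([5, 9, 5]) and vote_2oo3([None, 5, 5]) both still return 5. The voter uses exact equality on integers to keep the sketch short; a voter for analog values would need some agreement tolerance instead.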
The tradeoff between fail-safe and fail-operational is cost and complexity versus availability. Fail-safe systems are much simpler and cheaper: once a fault is detected, the only required reaction is a transition to the safe state. Fail-operational systems require redundant hardware (roughly two to three times the hardware cost), diverse software (to avoid common-mode failures), fault isolation mechanisms (so that one failing component cannot corrupt its redundant counterpart), and switchover logic that itself must be verified for safety. ISO 26262 touches on this in its ASIL decomposition concept, in which a high-ASIL requirement is split into redundant requirements allocated to sufficiently independent elements. Decomposition alone does not make a system fail-operational, but it relies on the same independence between channels that a fail-operational architecture needs. The interview question tests whether you understand that "safe" does not always mean "off": in systems where loss of function is itself hazardous, the safety strategy must maintain operation, which fundamentally changes the architecture, cost, and development effort.
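The availability side of that tradeoff can be estimated with a textbook calculation. The sketch below assumes three identical channels that fail independently, each working with probability p, plus a perfect voter. These are simplifying assumptions; common-mode failures, which diverse software is meant to reduce, break the independence assumption. The function name is hypothetical.

```python
def two_out_of_three_reliability(p: float) -> float:
    """Probability that at least 2 of 3 independent channels work.

    P(all three work) + P(exactly two work)
      = p**3 + 3 * p**2 * (1 - p)
      = 3 * p**2 - 2 * p**3
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 3 * p**2 - 2 * p**3


if __name__ == "__main__":
    p = 0.99  # assumed per-channel probability of working over the mission
    print(f"single channel: {p:.6f}")
    print(f"2-out-of-3:     {two_out_of_three_reliability(p):.6f}")
```

For p = 0.99 the three-channel arrangement comes out at about 0.9997, so the probability of losing the function drops from 1 in 100 to roughly 3 in 10,000, at the price of three channels plus a voter.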