What safety mechanisms would you implement for a safety-critical embedded system?

Question

Accepted Answer

Safety mechanisms in embedded systems detect, contain, and respond to hardware faults and software errors before they can cause harm. Which mechanisms you need depends on the required safety integrity level (ASIL under ISO 26262, SIL under IEC 61508), but a safety-critical system typically layers several of them. Hardware-level mechanisms include: ECC (error-correcting code) on RAM and flash, where single-bit errors are corrected transparently and double-bit errors are detected and flagged (the behavior of the common SEC-DED codes); lockstep CPU cores, where two identical cores execute the same code and a comparator checks that their outputs match on every cycle, so any divergence indicates a fault and triggers the transition to a safe state; voltage and clock monitors, which are dedicated supervisory circuits that detect brownouts, overvoltage, and clock frequency deviations that could cause erratic behavior; and hardware watchdog timers, which run from an independent oscillator, cannot be disabled by software, and reset the system if they are not serviced within the timeout window.
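To make the ECC behavior concrete, here is a minimal software sketch of a SEC-DED code (extended Hamming, 4 data bits in an 8-bit codeword). Real ECC is implemented in the memory controller hardware and uses wider words; the names `ecc_encode`, `ecc_decode`, and `ecc_status_t` are made up for this illustration and do not come from any vendor API.

```c
#include <stdint.h>

typedef enum { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE } ecc_status_t;

/* Hamming positions (1..7) that carry the four data bits.
 * Positions 1, 2 and 4 hold parity; bit 0 holds overall parity. */
static const uint8_t DATA_POS[4] = {3, 5, 6, 7};

/* XOR of the positions (1..7) of all set bits. Zero for a valid codeword. */
static uint8_t position_xor(uint8_t cw)
{
    uint8_t x = 0;
    for (uint8_t pos = 1; pos <= 7; pos++) {
        if (cw & (1u << pos)) {
            x ^= pos;
        }
    }
    return x;
}

/* Returns 1 if the byte has an odd number of set bits. */
static uint8_t odd_parity(uint8_t v)
{
    v ^= (uint8_t)(v >> 4);
    v ^= (uint8_t)(v >> 2);
    v ^= (uint8_t)(v >> 1);
    return (uint8_t)(v & 1u);
}

uint8_t ecc_encode(uint8_t nibble)
{
    uint8_t cw = 0;
    for (int i = 0; i < 4; i++) {
        if (nibble & (1u << i)) {
            cw |= (uint8_t)(1u << DATA_POS[i]);
        }
    }
    /* Set parity bits so that the XOR of all set positions becomes zero. */
    uint8_t x = position_xor(cw);
    for (uint8_t p = 1; p <= 4; p <<= 1) {
        if (x & p) {
            cw |= (uint8_t)(1u << p);
        }
    }
    /* Overall parity bit makes the total number of set bits even. */
    cw |= odd_parity(cw);
    return cw;
}

ecc_status_t ecc_decode(uint8_t cw, uint8_t *nibble_out)
{
    uint8_t syndrome = position_xor(cw);
    ecc_status_t status = ECC_OK;

    if (odd_parity(cw)) {
        /* Odd overall parity: assume one flipped bit, located at position
         * `syndrome` (0 means the overall parity bit itself). */
        cw ^= (uint8_t)(1u << syndrome);
        status = ECC_CORRECTED;
    } else if (syndrome != 0) {
        /* Even parity but nonzero syndrome: two bits flipped. */
        return ECC_UNCORRECTABLE;
    }

    uint8_t nibble = 0;
    for (int i = 0; i < 4; i++) {
        if (cw & (1u << DATA_POS[i])) {
            nibble |= (uint8_t)(1u << i);
        }
    }
    *nibble_out = nibble;
    return status;
}
```

Note that three or more flipped bits can be miscorrected or go unnoticed with a code like this, which is one reason ECC is combined with the other mechanisms rather than relied on alone.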

Software-level mechanisms include: stack overflow detection using MPU (Memory Protection Unit) guard regions or canary values, where the MPU marks the region immediately past the stack limit (below the stack, on the usual descending stack) as inaccessible, so an overflow raises a memory protection fault (MemManage on Arm Cortex-M) instead of silently corrupting adjacent data; control flow monitoring, which inserts sequence counters at critical decision points to verify that the code followed the expected path, not just that it reached the right destination; redundant computation, which performs safety-critical calculations twice (ideally with a diverse algorithm or a different data representation) and compares the results; plausibility checks on sensor data, covering range validation (is the temperature reading physically possible?), rate-of-change limits (did the reading jump impossibly fast?), and cross-sensor comparison (do the redundant sensors agree within tolerance?); and periodic memory tests, typically a CRC over ROM or flash contents and a March test on RAM, to detect memory corruption.
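The three plausibility checks on sensor data could be sketched roughly as follows. This assumes integer readings in a fixed unit (for example tenths of a degree Celsius) and one redundant sensor; the type and function names (`plaus_limits_t`, `plaus_check`) and all limit values are hypothetical and would come from the safety analysis in a real project.

```c
#include <stdint.h>

typedef enum {
    PLAUS_OK = 0,
    PLAUS_OUT_OF_RANGE,     /* reading is not physically possible */
    PLAUS_RATE_EXCEEDED,    /* reading changed faster than the process can */
    PLAUS_SENSOR_MISMATCH   /* redundant sensors disagree beyond tolerance */
} plaus_result_t;

typedef struct {
    int32_t min_value;         /* lowest physically possible reading */
    int32_t max_value;         /* highest physically possible reading */
    int32_t max_step;          /* largest allowed change between two samples */
    int32_t max_disagreement;  /* largest allowed gap between the two sensors */
} plaus_limits_t;

/* Absolute difference, widened to 64 bits so it cannot overflow. */
static int64_t abs_diff(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    return (d < 0) ? -d : d;
}

/* Checks one new sample from the primary sensor against the previous
 * accepted sample and against the redundant sensor. */
plaus_result_t plaus_check(const plaus_limits_t *lim,
                           int32_t previous,
                           int32_t primary,
                           int32_t redundant)
{
    if (primary < lim->min_value || primary > lim->max_value) {
        return PLAUS_OUT_OF_RANGE;
    }
    if (abs_diff(primary, previous) > lim->max_step) {
        return PLAUS_RATE_EXCEEDED;
    }
    if (abs_diff(primary, redundant) > lim->max_disagreement) {
        return PLAUS_SENSOR_MISMATCH;
    }
    return PLAUS_OK;
}
```

With limits of -40.0 to 150.0 degrees, a maximum step of 5.0 and a tolerance of 2.0 (all in tenths), a call such as `plaus_check(&lim, 250, 260, 265)` passes, while a jump from 250 to 400 is rejected as a rate violation. Returning a distinct code per check lets the caller log which check failed before reacting.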

System-level mechanisms include: safe state definition, since every safety analysis must identify what "safe" means when a fault is detected (for a motor controller, safe might be cutting power; for a ventilator, safe might be maintaining basic airflow); fault reaction time, the time from fault detection to reaching the safe state, which together with the fault detection time must be shorter than the time the fault takes to cause harm (the fault tolerant time interval in ISO 26262); diagnostic coverage, the percentage of dangerous faults detected by the safety mechanisms, which feeds the ISO 26262 hardware architectural metrics (the single-point fault metric, SPFM, has targets of 90%, 97%, and 99% for ASIL B, C, and D, while the latent fault metric, LFM, has targets of 60%, 80%, and 90%); and independent monitoring, where a separate, simpler monitoring processor watches the main processor's outputs and can activate the safe state on its own if the main processor fails. The interview-ready insight is that no single mechanism is sufficient: safety-critical systems require defense in depth, where each mechanism covers failure modes that the others might miss.
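As a rough illustration of independent monitoring with a latched safe state, the logic running on the monitoring processor might look like the sketch below. It assumes the main processor sends a periodic heartbeat and that the monitor has its own millisecond tick; `monitor_t`, `monitor_heartbeat`, and `monitor_poll` are names invented for this example, and actually driving the safe-state output (for instance opening a power relay) is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { SYS_RUN, SYS_SAFE } sys_state_t;

typedef struct {
    sys_state_t state;
    uint32_t last_heartbeat_ms;  /* time of the last heartbeat from the main CPU */
    uint32_t timeout_ms;         /* longest tolerated gap between heartbeats */
} monitor_t;

void monitor_init(monitor_t *m, uint32_t now_ms, uint32_t timeout_ms)
{
    m->state = SYS_RUN;
    m->last_heartbeat_ms = now_ms;
    m->timeout_ms = timeout_ms;
}

/* Called when a heartbeat message from the main processor arrives. */
void monitor_heartbeat(monitor_t *m, uint32_t now_ms)
{
    m->last_heartbeat_ms = now_ms;
}

/* Called periodically by the monitor. `output_plausible` is the result of
 * the monitor's own check of the main processor's outputs. */
sys_state_t monitor_poll(monitor_t *m, uint32_t now_ms, bool output_plausible)
{
    if (m->state == SYS_SAFE) {
        /* The safe state is latched: only a controlled restart leaves it. */
        return SYS_SAFE;
    }

    /* Unsigned subtraction stays correct across tick counter wraparound. */
    uint32_t elapsed_ms = now_ms - m->last_heartbeat_ms;

    if (elapsed_ms > m->timeout_ms || !output_plausible) {
        m->state = SYS_SAFE;
    }
    return m->state;
}
```

Latching is a deliberate choice here: a main processor that resumes sending heartbeats after a fault should not be trusted again without a defined recovery procedure. The polling period plus the time needed to actuate the safe state also has to fit within the fault reaction time budget.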