Safety mechanisms for embedded systems

Quick Cap

Safety mechanisms are the runtime techniques that detect faults, contain their effects, and either recover or transition to a safe state. They range from simple defensive coding practices (assertions, range checks) through hardware-assisted protections (watchdogs, ECC, MPU) to full architectural redundancy (triple modular redundancy, lockstep dual-core). Choosing the right combination depends on the required safety integrity level and the cost/complexity budget.

Interviewers test whether you can match mechanisms to fault types, explain the overhead of each approach, and articulate the difference between fail-safe and fail-operational design.

Key Facts:

Defensive programming: Assertions, range checks, mandatory default in switch, static analysis enforcement
Watchdog strategies: Windowed watchdog (must feed within a time window, not just "before timeout"), per-task feeding in RTOS
Memory integrity: CRC on flash contents, ECC on SRAM, stack canaries for overflow detection, MPU guard regions
Redundancy: N-version programming, Triple Modular Redundancy (TMR) with majority voting, lockstep dual-core
Fail-safe vs fail-operational: Fail-safe transitions to a safe state (e.g., motor off); fail-operational continues with degraded capability (e.g., limp-home mode)

Deep Dive

At a Glance

Mechanism	What It Detects	Overhead	Typical Use
Assertions / range checks	Logic errors, out-of-range inputs	One compare and branch per check (kept enabled in safety builds)	All safety-critical code
Windowed watchdog	Task hangs, runaway execution, stuck loops	One timer peripheral	RTOS-based systems, motor controllers
CRC on flash	Bit rot, flash corruption	Periodic CPU cycles	Boot-time and periodic runtime checks
ECC on SRAM	Single-bit flips (SEU from radiation or noise)	Hardware (built into SRAM controller)	Automotive, aerospace, industrial
Stack canaries	Stack buffer overflows	One word per protected frame (4 bytes on 32-bit MCUs) plus a check on return	All C/C++ firmware
MPU guard regions	Stack overflow, wild pointer writes	MPU configuration	RTOS with per-task stacks
TMR (voting)	Any single-channel fault	3x hardware + voter logic	Aerospace, nuclear, railway
Lockstep dual-core	CPU logic faults, register corruption	2x CPU silicon (no throughput gain)	Automotive ASIL D, industrial SIL 3

Defensive Programming

Defensive programming treats every input as potentially invalid and every code path as potentially reachable. The goal is to detect faults at the earliest possible point and prevent them from propagating.

Key practices:

Assertions and range checks: Validate function parameters at entry. In safety-critical code, assertions are not compiled out -- they trigger a safe-state transition.
Mandatory default in every switch: Even for enum types, the default case catches memory corruption that could produce an invalid enum value.
Static analysis: MISRA C checkers (Polyspace, PC-lint, cppcheck with MISRA addon) catch patterns that runtime checks cannot, such as unreachable code, uninitialized variables, and implicit type conversions.
Function return value checking: Every function that can fail must have its return value checked. Ignoring a return value from flash_write() or i2c_read() is a common source of silent data corruption.
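The practices above can be combined in a few lines. The following is a minimal sketch, not a reference implementation: `enter_safe_state`, `SAFETY_ASSERT`, `motor_state_t`, and the duty-cycle limits are hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MOTOR_IDLE, MOTOR_RUN, MOTOR_BRAKE } motor_state_t;

/* Hypothetical hook: a real system would log the reason and drive
 * the outputs to their safe values. Here it only records the event. */
volatile bool g_safe_state_entered = false;
void enter_safe_state(const char *reason) {
    (void)reason;
    g_safe_state_entered = true;
}

/* Stays active in release builds, unlike assert() under NDEBUG. */
#define SAFETY_ASSERT(cond) \
    do { if (!(cond)) { enter_safe_state(#cond); } } while (0)

/* Range check at entry: reject duty cycles above 1000 (0.1 % units). */
bool set_duty_permille(uint16_t duty, uint16_t *out) {
    if (out == NULL || duty > 1000u) {
        return false;
    }
    *out = duty;
    return true;
}

/* Mandatory default: a corrupted enum value is caught, not ignored. */
uint16_t duty_for_state(motor_state_t state) {
    switch (state) {
    case MOTOR_IDLE:  return 0u;
    case MOTOR_RUN:   return 600u;
    case MOTOR_BRAKE: return 0u;
    default:
        enter_safe_state("invalid motor_state_t");
        return 0u;
    }
}

/* The return value of set_duty_permille() is checked, never discarded. */
bool apply_state(motor_state_t state, uint16_t *pwm) {
    SAFETY_ASSERT(pwm != NULL);
    if (!set_duty_permille(duty_for_state(state), pwm)) {
        enter_safe_state("set_duty_permille failed");
        return false;
    }
    return true;
}
```

Routing every failed check through one safe-state hook keeps the reaction in a single place that can be reviewed and tested on its own.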

Watchdog Strategies

A basic watchdog resets the system if software fails to "kick" it before a timeout. But a simple watchdog only detects total system hangs -- it misses scenarios where the main loop is running but a critical task is stuck.

Windowed watchdog: Must be fed within a specific time window -- not too early, not too late. If software feeds the watchdog too quickly (indicating a runaway tight loop) or too slowly (indicating a hang), the watchdog triggers a reset.


Windowed Watchdog Timing
  ========================

  |<--- closed window --->|<--- open window --->|
  |   feed here = RESET   |   feed here = OK    |  still no feed = RESET
  |                       |                     |
  t=0                   t=Tmin                t=Tmax (timeout)

  * Feed before Tmin: window violation, system resets
  * Feed between Tmin and Tmax: correct, counter reloads
  * No feed by Tmax: timeout, system resets
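The timing rule can be expressed as a small software model. Real windowed watchdogs enforce this in the timer peripheral, so the sketch below is only a way to reason about the window. The names `wwdg_result_t` and `wwdg_classify_feed` are hypothetical, and treating the end of the open window as the timeout is an assumption of this model.

```c
#include <stdint.h>

typedef enum {
    WWDG_FEED_OK,
    WWDG_FEED_TOO_EARLY,   /* closed window: runaway loop suspected */
    WWDG_FEED_TOO_LATE     /* open window already expired: hang suspected */
} wwdg_result_t;

/* Classify a feed attempt made elapsed_ms after the previous reload. */
wwdg_result_t wwdg_classify_feed(uint32_t elapsed_ms,
                                 uint32_t tmin_ms,
                                 uint32_t tmax_ms) {
    if (elapsed_ms < tmin_ms) {
        return WWDG_FEED_TOO_EARLY;
    }
    if (elapsed_ms >= tmax_ms) {
        return WWDG_FEED_TOO_LATE;
    }
    return WWDG_FEED_OK;
}
```

With an open window of 7 ms to 15 ms, a feed at 10 ms is accepted, while feeds at 3 ms or 20 ms would both correspond to a reset on hardware.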

Per-task watchdog feeding in RTOS: Instead of feeding the hardware watchdog from a single location, each critical task sets a "heartbeat" flag. A dedicated watchdog supervisor task checks that all heartbeat flags are set within each period, then feeds the hardware watchdog. If any task fails to report, the supervisor withholds the feed and the system resets.

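/* Per-task watchdog feeding pattern (FreeRTOS) */
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

#define NUM_TASKS  4u   /* must stay below 32 for the bitmask */
static volatile uint32_t task_heartbeats;  /* bitmask, one bit per task */

extern void hw_watchdog_feed(void);        /* board-specific feed routine */

/* Called by each task periodically (task context only, not from an ISR) */
void task_report_alive(uint8_t task_id) {
    taskENTER_CRITICAL();     /* |= is a non-atomic read-modify-write */
    task_heartbeats |= (1u << task_id);
    taskEXIT_CRITICAL();
}

/* Supervisor task -- runs every 100 ms */
void watchdog_supervisor(void *arg) {
    const uint32_t all_alive = (1u << NUM_TASKS) - 1u;
    uint32_t seen;
    (void)arg;
    for (;;) {
        taskENTER_CRITICAL();     /* snapshot and clear atomically */
        seen = task_heartbeats;
        task_heartbeats = 0;      /* Reset for next period */
        taskEXIT_CRITICAL();
        if ((seen & all_alive) == all_alive) {
            hw_watchdog_feed();   /* All tasks alive */
        }
        /* If any task missed, we do NOT feed the watchdog */
        vTaskDelay(pdMS_TO_TICKS(100));
    }
}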

Escalation levels: Not every watchdog timeout should trigger a hard reset. A common pattern is: first timeout triggers a soft recovery attempt (restart the stuck task), second timeout within a window triggers a warm reset (reinitialize peripherals), and third timeout triggers a cold reset (full power cycle via external watchdog IC).
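One way to sketch the escalation ladder is a counter of timeouts inside a sliding window. The type and function names below (`wdg_escalation_t`, `wdg_escalate`) are hypothetical, and the sketch assumes the state lives in memory that survives a reset, such as a no-init RAM section or backup registers, depending on the MCU.

```c
#include <stdint.h>

typedef enum {
    RECOVER_RESTART_TASK,   /* first timeout: soft recovery */
    RECOVER_WARM_RESET,     /* second timeout in window: reinit peripherals */
    RECOVER_COLD_RESET      /* third timeout in window: full power cycle */
} recovery_t;

typedef struct {
    uint32_t count;      /* timeouts seen inside the current window */
    uint32_t first_ms;   /* timestamp of the first timeout in the window */
} wdg_escalation_t;

/* Decide the reaction to a watchdog timeout observed at now_ms. */
recovery_t wdg_escalate(wdg_escalation_t *e, uint32_t now_ms,
                        uint32_t window_ms) {
    /* Start a new window if none is open or the old one has expired.
     * Unsigned subtraction keeps this correct across tick wraparound. */
    if (e->count == 0u || (now_ms - e->first_ms) > window_ms) {
        e->count = 0u;
        e->first_ms = now_ms;
    }
    e->count++;
    if (e->count == 1u) {
        return RECOVER_RESTART_TASK;
    }
    if (e->count == 2u) {
        return RECOVER_WARM_RESET;
    }
    e->count = 0u;   /* ladder exhausted: start over after the cold reset */
    return RECOVER_COLD_RESET;
}
```

Isolated timeouts that are far apart each get the mildest reaction, while a burst of three inside the window climbs all the way to the cold reset.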

Memory Integrity

CRC on flash contents: At boot and periodically at runtime, compute CRC-32 over the firmware image in flash and compare against a stored reference. Detects bit rot, incomplete flash writes, and radiation-induced bit flips in NOR flash.
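A flash integrity check of this kind could look like the sketch below. It uses the common reflected CRC-32 (polynomial 0xEDB88320, as in Ethernet and zlib) computed bit by bit. The function names are hypothetical, and many MCUs provide a hardware CRC unit that would replace the inner loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: init 0xFFFFFFFF, reflected poly 0xEDB88320, final XOR. */
uint32_t crc32_compute(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Compare the computed CRC of an image against the stored reference. */
bool flash_image_ok(const uint8_t *image, size_t len, uint32_t stored_crc) {
    return crc32_compute(image, len) == stored_crc;
}
```

For the standard check string "123456789" this CRC-32 variant yields 0xCBF43926, which is a convenient self-test before trusting the routine on a real image. The stored reference is normally kept outside the range being checked.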

ECC on SRAM: Many automotive and industrial MCUs (e.g., TI Hercules, Infineon AURIX, Renesas RH850) include hardware ECC on SRAM that transparently corrects single-bit errors and detects double-bit errors. Single-bit corrections are logged; double-bit errors trigger an NMI. Periodic RAM tests (march tests) complement ECC by exercising every cell, so stuck-at faults are found proactively rather than only when the affected location happens to be read.
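To make the march test concrete, here is a word-wide sketch in the style of March C-. It is a simplification under stated assumptions: it uses only all-zeros and all-ones patterns, so it does not target faults between bits of the same word, and it destroys the contents of the tested block. A runtime version would have to save and restore the block with interrupts masked. The name `march_test` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Destructive march test over `words` 32-bit words.
 * Returns true if every read matched the expected pattern. */
bool march_test(volatile uint32_t *mem, size_t words) {
    const uint32_t zero = 0x00000000u;
    const uint32_t ones = 0xFFFFFFFFu;
    size_t i;

    for (i = 0; i < words; i++) {          /* any order: write 0 */
        mem[i] = zero;
    }
    for (i = 0; i < words; i++) {          /* ascending: read 0, write 1 */
        if (mem[i] != zero) { return false; }
        mem[i] = ones;
    }
    for (i = 0; i < words; i++) {          /* ascending: read 1, write 0 */
        if (mem[i] != ones) { return false; }
        mem[i] = zero;
    }
    for (i = words; i-- > 0u; ) {          /* descending: read 0, write 1 */
        if (mem[i] != zero) { return false; }
        mem[i] = ones;
    }
    for (i = words; i-- > 0u; ) {          /* descending: read 1, write 0 */
        if (mem[i] != ones) { return false; }
        mem[i] = zero;
    }
    for (i = 0; i < words; i++) {          /* any order: read 0 */
        if (mem[i] != zero) { return false; }
    }
    return true;
}
```

Testing RAM in small blocks during idle time keeps the per-call cost bounded, at the price of a longer interval before the whole memory has been covered.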

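Stack canaries: The compiler places a guard value ("canary") between a function's local buffers and the saved return address. Before returning, the function checks the canary; a changed value means a buffer overflow has overwritten part of the stack frame. GCC enables this with -fstack-protector-strong, which instruments functions that contain arrays or take the address of local variables. On bare-metal targets the guard value and failure handler (__stack_chk_guard and __stack_chk_fail) must come from the C library or from the firmware itself.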

MPU guard regions: In an RTOS, configure the MPU to place a "guard region" (no-access) at the bottom of each task's stack. On Cortex-M, a stack overflow into the guard triggers a MemManage fault immediately instead of silently corrupting adjacent memory. This is far more reliable than software canary checks because the hardware catches the overflow on the exact instruction that causes it, although an access that jumps clear over the guard (for example, into a large local array) can still go undetected.
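As a rough illustration, the register values for a 32-byte guard can be computed as below. This sketch assumes an ARMv7-M MPU (Cortex-M3/M4/M7 register layout). The macro and function names are hypothetical, and the stack's lowest address is assumed to be 32-byte aligned.

```c
#include <stdint.h>

/* Field encodings for the ARMv7-M MPU_RBAR and MPU_RASR registers. */
#define GUARD_RBAR_VALID    (1u << 4)    /* REGION field selects the region */
#define GUARD_RASR_ENABLE   (1u << 0)
#define GUARD_RASR_SIZE_32B (4u << 1)    /* size = 2^(SIZE+1) = 32 bytes */
#define GUARD_RASR_AP_NONE  (0u << 24)   /* no access at any privilege */
#define GUARD_RASR_XN       (1u << 28)   /* execute never */

/* RBAR value for a guard at the lowest address of a task stack. */
uint32_t mpu_guard_rbar(uint32_t stack_bottom, uint32_t region) {
    return (stack_bottom & ~0x1Fu) | GUARD_RBAR_VALID | (region & 0xFu);
}

/* RASR value: 32-byte, no-access, non-executable, enabled. */
uint32_t mpu_guard_rasr(void) {
    return GUARD_RASR_XN | GUARD_RASR_AP_NONE |
           GUARD_RASR_SIZE_32B | GUARD_RASR_ENABLE;
}
```

On a CMSIS-based project these values would typically be written to `MPU->RBAR` and `MPU->RASR`, with the RTOS reprogramming the guard region on each context switch. A larger guard catches more overshoot but wastes more stack.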

Redundancy Patterns

N-version programming: Multiple independently developed software implementations of the same specification, running in parallel. Their outputs are compared by a voter. Catches specification-interpretation bugs that a single implementation would not reveal. Extremely expensive -- used mainly in aerospace and nuclear.

Triple Modular Redundancy (TMR): Three identical hardware channels compute the same result. A majority voter selects the output that at least two channels agree on. A single channel failure is masked. Used in flight computers and railway signaling.
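Hardware TMR votes in logic, but the same 2-of-3 rule can be applied in software to a critical variable stored three times. The sketch below is one possible form; `tmr_u32_t`, `tmr_write`, and `tmr_read` are hypothetical names, and a real design would usually place the three copies in separate memory regions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Three copies of one value; volatile so no copy is optimized away. */
typedef struct {
    volatile uint32_t a;
    volatile uint32_t b;
    volatile uint32_t c;
} tmr_u32_t;

void tmr_write(tmr_u32_t *v, uint32_t value) {
    v->a = value;
    v->b = value;
    v->c = value;
}

/* Bitwise majority vote. Writes the voted value to *out, repairs all
 * three copies, and returns false if any copy disagreed with the vote. */
bool tmr_read(tmr_u32_t *v, uint32_t *out) {
    uint32_t a = v->a;
    uint32_t b = v->b;
    uint32_t c = v->c;
    uint32_t voted = (a & b) | (a & c) | (b & c);
    bool clean = (a == voted) && (b == voted) && (c == voted);

    v->a = voted;   /* scrub so a second upset does not accumulate */
    v->b = voted;
    v->c = voted;
    *out = voted;
    return clean;
}
```

The return value lets the caller log corrected upsets, which is useful evidence when estimating the field error rate.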

Lockstep dual-core: Two CPU cores execute the same instruction stream in lockstep. A hardware comparator checks their outputs on every clock cycle. Any divergence indicates a fault (e.g., radiation-induced bit flip in a register) and triggers an immediate safe-state transition. Lockstep provides fault detection, not fault tolerance -- the comparator cannot tell which core is wrong, so the system cannot continue on its own. Automotive ASIL D MCUs such as TI TMS570 (Arm Cortex-R4F/R5F cores) and Infineon AURIX (TriCore cores) use this architecture.

Fail-Safe vs Fail-Operational

Property	Fail-Safe	Fail-Operational
On fault detection	Transition to safe state (shut down, power off actuator)	Continue with degraded capability
Example	Industrial robot stops all motion	Electric power steering falls back to reduced assist
Complexity	Lower -- just need reliable detection and shutdown	Higher -- need redundancy to continue operating
Automotive example	Airbag controller (safe state = deployment inhibited)	Brake-by-wire (safe state cannot be "no brakes")
When to use	System can safely stop	Stopping is itself dangerous

Fail-operational design is required when the "off" state is hazardous. A brake-by-wire system cannot simply shut down -- it must provide at least degraded braking. This demands redundant actuators, redundant power supplies, and redundant processing channels.

Debugging Story

A field failure in a battery management system for an electric scooter fleet was traced to cosmic-ray-induced single-event upsets (SEUs) in SRAM. The MCU (Cortex-M4, no hardware ECC) would occasionally read a flipped bit in the battery state-of-charge variable, causing the controller to either shut down a fully charged battery or allow a depleted one to overdischarge. The fix was three-fold: (1) move to an MCU with hardware ECC on SRAM, (2) add periodic march RAM tests during idle periods, and (3) store critical variables in triplicate with majority voting. The SEU rate was roughly 1 per 10,000 device-hours -- invisible in lab testing but statistically certain across a fleet of 5,000 scooters, where 5,000 x 1/10,000 gives about 0.5 expected upsets per hour with the whole fleet powered. The lesson: memory integrity mechanisms are not just for aerospace. Any fleet-scale deployment in outdoor environments will encounter bit flips.

What interviewers want to hear: that you can map each safety mechanism to the fault type it addresses, understand the overhead and limitations of each approach, know the difference between fault detection (lockstep) and fault tolerance (TMR), and can articulate when fail-safe is sufficient versus when fail-operational is required.

Interview Focus

Classic Interview Questions

Q1: "What is a windowed watchdog and why is it better than a standard watchdog?"

Model Answer Starter: "A standard watchdog only detects total system hangs -- as long as software feeds it before the timeout, it is satisfied, even if the system is in a runaway tight loop. A windowed watchdog adds a closed window at the beginning: feeding too early triggers a reset just like feeding too late. This detects both hangs (late feed) and runaway execution (early feed). I configure the window based on the expected task execution time -- for example, if my control loop should take 8-12 ms, I set the window open from 7 ms to 15 ms."

Q2: "How do you implement per-task watchdog monitoring in an RTOS?"

Model Answer Starter: "Each critical task sets a heartbeat flag or increments a counter in its main loop. A dedicated supervisor task runs at a fixed interval, checks that all heartbeat flags are set, and only feeds the hardware watchdog if every task has reported in. If any task misses its heartbeat, the supervisor withholds the feed, and the hardware watchdog resets the system. This catches scenarios where the idle loop is running fine but a specific task is blocked on a deadlocked mutex or stuck in an infinite loop."

Q3: "What is the difference between fail-safe and fail-operational, and when do you need each?"

Model Answer Starter: "Fail-safe means the system transitions to a safe state when a fault is detected -- for example, an industrial robot stops all motion. Fail-operational means the system continues to function with degraded capability. You need fail-operational when the 'off' state is itself dangerous -- a brake-by-wire system cannot just shut down. Fail-operational requires redundancy: duplicate actuators, duplicate processing channels, and independent power supplies. Fail-safe is simpler and cheaper, so I default to it unless the system's safe state would create a hazard."

Q4: "How does lockstep dual-core differ from TMR?"

Model Answer Starter: "Lockstep dual-core runs two cores executing the same instructions in parallel with a cycle-by-cycle comparator. It provides fault detection -- if the cores diverge, an error is flagged -- but the system cannot continue because it does not know which core is correct. TMR uses three independent channels with a majority voter, providing fault tolerance -- a single channel failure is masked and operation continues. Lockstep is cheaper (two cores vs three channels) and is used when detection plus safe-state transition is sufficient, as in automotive ASIL D. TMR is used when continuous operation is mandatory, as in flight computers."

Q5: "How do you protect against SRAM bit flips in an embedded system?"

Model Answer Starter: "The primary defense is hardware ECC on SRAM, which transparently corrects single-bit errors and detects double-bit errors. Many automotive MCUs include this. For MCUs without hardware ECC, I store safety-critical variables in triplicate and use majority voting on every read. I also run periodic march RAM tests during idle time to detect stuck-at faults. Stack canaries detect corruption in stack frames, and MPU guard regions catch stack overflows before they corrupt adjacent memory. For flash, I verify CRC-32 at boot and periodically at runtime."

Trap Alerts

Don't say: "A watchdog is just a timer that resets the system" -- interviewers expect you to know windowed watchdogs and per-task monitoring
Don't forget: Lockstep is fault detection, not fault tolerance -- the system stops, it does not recover
Don't ignore: Memory integrity is relevant beyond aerospace -- fleet-scale IoT deployments encounter bit flips too

Follow-up Questions

"How would you implement a software watchdog for a bare-metal system without an RTOS?"
"What is a march test and how does it complement ECC?"
"How do you handle a double-bit ECC error at runtime?"

Practice

❓ What does a windowed watchdog detect that a standard watchdog misses?

❓ What is the key difference between lockstep dual-core and TMR?

❓ Why should safety-critical firmware keep assertions enabled in release builds?

❓ What is the purpose of an MPU guard region at the bottom of a task stack?

❓ When is fail-operational design required instead of fail-safe?

Real-World Tie-In

Electric Power Steering (ASIL D, Fail-Operational) -- Designed the safety architecture for an EPS system using a lockstep Cortex-R5 with hardware ECC SRAM. The windowed watchdog was configured with a 4-8 ms open window matching the 5 ms control loop period. Per-task heartbeat monitoring covered the torque calculation, CAN communication, and diagnostic tasks. On lockstep fault detection, the system entered limp-home mode (reduced assist) via a backup analog torque path rather than shutting down steering. Over 2 million field-hours, the system detected and safely handled 3 lockstep comparison faults -- all traced to transient radiation effects.

Industrial Motor Controller (SIL 2) -- Implemented a safety supervisor for a 3-phase motor drive controlling a conveyor belt. Defensive programming caught an out-of-range current sensor reading caused by a broken ADC reference, triggering a safe stop before overcurrent could damage the motor winding. Flash CRC checks at boot detected a single bit flip after a power surge event. The combination of software range checks, hardware watchdog, and flash integrity verification met IEC 61508 SIL 2 requirements without the cost of full hardware redundancy.