What are watchdog recovery levels, and why might a simple reset not be enough?
Not all failures warrant a full system reset, and some systems cannot tolerate the downtime. In safety-critical and high-availability applications, a graduated recovery strategy provides the right level of response for each severity of failure:
Level 0 — Software recovery (pre-reset): Before the watchdog fires, attempt to recover within the application. The WWDG's Early Wakeup Interrupt (EWI) fires when the downcounter reaches 0x40, one count before the reset, giving firmware a last chance to save diagnostic state (fault registers, stack trace, task states) to memory that survives the reset, reinitialize the problematic subsystem, and refresh the watchdog to avoid the reset entirely. This is the fastest recovery path, but it only works if the software is still coherent enough to execute the EWI handler correctly.
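As a rough illustration of the "save diagnostic state" step, here is a minimal sketch of a checksummed snapshot that an EWI handler could fill in. Everything in it is an assumption made for illustration: the diag_snapshot_t layout, the DIAG_MAGIC value, the choice of FNV-1a as the integrity check, and the g_diag_store variable, which stands in for a battery-backed or no-init RAM region. Because the EWI window is short, writing to RAM that survives reset is usually more realistic than starting a flash write from the handler.

```c
#include <stddef.h>
#include <stdint.h>

#define DIAG_MAGIC 0x44494147u /* arbitrary marker chosen for this sketch */

/* Hypothetical snapshot layout; a real system would pick its own fields. */
typedef struct {
    uint32_t magic;        /* DIAG_MAGIC when a snapshot is present */
    uint32_t fault_status; /* copy of a fault status register */
    uint32_t stacked_pc;   /* program counter captured by the handler */
    uint32_t task_id;      /* task that failed to check in */
    uint32_t checksum;     /* FNV-1a over all fields above */
} diag_snapshot_t;

/* Stand-in for reset-surviving RAM. On target this would be placed in a
 * linker section that the startup code does not zero (assumption). */
static diag_snapshot_t g_diag_store;

/* 32-bit FNV-1a hash, used here only as a cheap integrity check. */
static uint32_t fnv1a32(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t h = 0x811C9DC5u;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x01000193u;
    }
    return h;
}

/* Intended to be called from the EWI handler: no loops over large data,
 * no blocking calls, just a few word writes. */
void diag_save(uint32_t fault_status, uint32_t stacked_pc, uint32_t task_id)
{
    diag_snapshot_t s;
    s.magic = DIAG_MAGIC;
    s.fault_status = fault_status;
    s.stacked_pc = stacked_pc;
    s.task_id = task_id;
    s.checksum = fnv1a32(&s, offsetof(diag_snapshot_t, checksum));
    g_diag_store = s;
}

/* Called after reset. Returns 1 and copies the snapshot if it is intact,
 * 0 if no snapshot is present or the stored data fails the check. */
int diag_load(diag_snapshot_t *out)
{
    const diag_snapshot_t *s = &g_diag_store;
    if (s->magic != DIAG_MAGIC) {
        return 0;
    }
    if (s->checksum != fnv1a32(s, offsetof(diag_snapshot_t, checksum))) {
        return 0;
    }
    *out = *s;
    return 1;
}
```

A failed diag_load() after a watchdog reset is also a useful hint that reset-surviving RAM cannot be trusted, which is one reason to prefer a slower, more thorough recovery path.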
Level 1 — Warm reset: The watchdog triggers a system reset, but the bootloader distinguishes it from a power-on reset by checking the reset reason register (RCC_CSR on many STM32 families; the register name and flag layout vary by part). On a watchdog reset, the bootloader can take a fast recovery path: skip lengthy self-tests and peripheral enumeration, restore state from battery-backed RAM or a persistent configuration block, and resume operation in milliseconds instead of seconds. Diagnostic data saved during Level 0 remains available for later retrieval.
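The reset-reason check can be sketched as a pure function over the register value, which keeps it testable off-target. The flag positions below are the ones documented for RCC_CSR on the STM32F4 family and should be treated as an assumption to verify against the reference manual for the actual part. The reset_cause_t enum and classify_reset() are hypothetical names for this sketch.

```c
#include <stdint.h>

/* RCC_CSR reset flags as laid out on STM32F4 (assumption: other STM32
 * families may use different positions or a different register). */
#define CSR_BORRSTF  (1u << 25) /* POR/PDR or brown-out reset */
#define CSR_PINRSTF  (1u << 26) /* NRST pin reset */
#define CSR_PORRSTF  (1u << 27) /* POR/PDR reset */
#define CSR_SFTRSTF  (1u << 28) /* software reset */
#define CSR_IWDGRSTF (1u << 29) /* independent watchdog reset */
#define CSR_WWDGRSTF (1u << 30) /* window watchdog reset */
#define CSR_LPWRRSTF (1u << 31) /* low-power management reset */

typedef enum {
    RESET_POWER_ON,
    RESET_WATCHDOG,
    RESET_SOFTWARE,
    RESET_PIN,
    RESET_UNKNOWN
} reset_cause_t;

/* Classify a raw register value. Several flags can be set at once, so the
 * order of the checks encodes a priority. */
reset_cause_t classify_reset(uint32_t csr)
{
    /* Power loss first: volatile state cannot be trusted afterwards,
     * regardless of which other flags are also set. */
    if (csr & (CSR_PORRSTF | CSR_BORRSTF)) {
        return RESET_POWER_ON;
    }
    if (csr & (CSR_IWDGRSTF | CSR_WWDGRSTF)) {
        return RESET_WATCHDOG;
    }
    if (csr & CSR_SFTRSTF) {
        return RESET_SOFTWARE;
    }
    /* The pin flag is checked last because it can accompany other causes. */
    if (csr & CSR_PINRSTF) {
        return RESET_PIN;
    }
    return RESET_UNKNOWN;
}
```

On target, the bootloader would pass in the value read from the register and then clear the flags (on STM32F4, by setting the RMVF bit), since they otherwise accumulate across resets and a later boot could misread an old watchdog flag as a new one.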
Level 2 — Cold reset: A full power-on reset sequence: all hardware reinitialized from scratch, full self-tests executed, configuration loaded from defaults. This is the safest recovery but also the slowest, and it loses all volatile state. Use this when the warm reset path itself is suspect — for example, if battery-backed RAM may be corrupted.
Level 3 — External power cycle: An external watchdog IC (independent of the MCU) cuts and restores power to the entire board. This handles failure modes that survive a software-initiated reset: a peripheral locked in an error state that holds a bus signal low, an ESD-induced latch-up condition where the MCU draws excessive current, or a voltage regulator glitch that leaves the MCU in an undefined state with the reset line not asserted.
Production systems often implement escalation: the first watchdog reset attempts Level 1 recovery. If a second watchdog reset occurs within a short window (e.g., 60 seconds), escalate to Level 2. If resets continue, escalate to Level 3. If resets persist even after power cycling, enter a safe mode with minimal functionality and signal for human intervention. The reset count that drives this escalation must be kept in storage that survives every recovery level, including the Level 3 power cycle.
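The escalation policy can be sketched as a small state machine. This is one possible reading of the policy, not a reference implementation: escalation_state_t, escalate(), and the saturating streak counter are hypothetical, the 60-second window comes from the example above, and now_s is assumed to come from a clock that keeps running across resets (an RTC, for instance).

```c
#include <stdint.h>

#define ESCALATION_WINDOW_S 60u /* window from the example above */

typedef enum {
    RECOVER_WARM = 1,        /* Level 1 */
    RECOVER_COLD = 2,        /* Level 2 */
    RECOVER_POWER_CYCLE = 3, /* Level 3 */
    RECOVER_SAFE_MODE = 4    /* minimal functionality, ask for help */
} recovery_t;

/* Hypothetical persistent record. It must live in storage that survives
 * every recovery level, including an external power cycle. */
typedef struct {
    uint32_t last_reset_s; /* time of the previous watchdog reset, seconds */
    uint8_t  streak;       /* consecutive watchdog resets inside the window */
} escalation_state_t;

/* Call once per boot that was caused by a watchdog reset. Updates the
 * record and returns the recovery level to attempt. */
recovery_t escalate(escalation_state_t *st, uint32_t now_s)
{
    /* A reset outside the window starts a new streak. Unsigned subtraction
     * keeps the comparison valid across counter wraparound. */
    if (st->streak > 0u && (now_s - st->last_reset_s) > ESCALATION_WINDOW_S) {
        st->streak = 0u;
    }
    if (st->streak < 4u) {
        st->streak++; /* saturate so repeated failures stay in safe mode */
    }
    st->last_reset_s = now_s;

    switch (st->streak) {
    case 1u: return RECOVER_WARM;
    case 2u: return RECOVER_COLD;
    case 3u: return RECOVER_POWER_CYCLE;
    default: return RECOVER_SAFE_MODE;
    }
}
```

In this sketch the streak only clears when a watchdog reset arrives after a quiet period longer than the window. A real system might also clear it after a period of healthy operation, so that one isolated reset a day never counts toward escalation.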
Source: Watchdog Q&A
