Watchdog Basics
Q: Why do watchdog timers exist? What problem do they solve?
Watchdog timers exist because software can fail in ways that the software itself cannot detect or recover from. An infinite loop caused by unexpected input, a deadlock between concurrent tasks, a stack overflow that corrupts the return address and sends the program counter into garbage memory, a hardware glitch that flips a bit in a control register, or a cosmic ray-induced single-event upset that alters RAM — all of these can leave an embedded system in a hung or erratic state with no internal mechanism for recovery. Unlike desktop software, embedded systems often have no user to notice the hang and press Ctrl-C, and no operating system to kill the runaway process.
A watchdog timer is an independent hardware countdown timer that resets the system if firmware does not periodically "feed" (refresh, kick) it before it reaches zero. The logic is deliberately simple: if the software is running correctly and progressing through its intended execution path, it will feed the watchdog on schedule. If the software is stuck, corrupted, or otherwise non-functional, it will miss the feed deadline, and the watchdog timeout triggers a hardware reset. The watchdog operates independently of the CPU and the software — it is clocked by its own oscillator (on the IWDG) and cannot be disabled by runaway code once started (on most MCUs, the IWDG cannot be stopped after activation).
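The feed-or-reset behavior described above can be sketched as a small host-side model. This is only an illustration: the type and function names (wdt_model_t, wdt_model_feed, and so on) are hypothetical and do not correspond to any MCU register interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a countdown watchdog (illustrative names only). */
typedef struct {
    uint32_t reload;      /* value loaded into the counter on every feed */
    uint32_t counter;     /* current countdown value */
    bool     reset_fired; /* latched once the counter reaches zero */
} wdt_model_t;

/* Start the watchdog with a given reload value (in watchdog clock ticks). */
static void wdt_model_start(wdt_model_t *w, uint32_t reload)
{
    assert(reload > 0);
    w->reload = reload;
    w->counter = reload;
    w->reset_fired = false;
}

/* Firmware side: reload the counter. Has no effect once a reset has fired. */
static void wdt_model_feed(wdt_model_t *w)
{
    if (!w->reset_fired) {
        w->counter = w->reload;
    }
}

/* Hardware side: one tick of the watchdog's own clock. */
static void wdt_model_tick(wdt_model_t *w)
{
    if (w->reset_fired) {
        return;
    }
    if (--w->counter == 0) {
        w->reset_fired = true; /* real hardware would assert the reset line here */
    }
}
```

With a reload of 5, firmware that feeds at least once every four ticks never sees a reset in this model; if it stops feeding, the fifth tick after the last feed latches reset_fired.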
In safety-critical systems — medical devices, automotive ECUs, industrial controllers, avionics — a watchdog is not optional. Standards like IEC 61508 (industrial functional safety), ISO 26262 (automotive), and IEC 62304 (medical device software) mandate independent monitoring of program execution. The watchdog is the simplest and most widely used mechanism to satisfy this requirement. Even in non-safety-critical products, a watchdog is considered baseline engineering practice because field failures from software hangs are inevitable over millions of device-hours.
Q: How do you choose the watchdog timeout value?
The timeout must be long enough for the software to complete its longest legitimate execution cycle, but short enough to limit the duration of undetected failure. Choosing this value requires understanding your system's timing characteristics under worst-case conditions, not just typical operation.
Too short: Legitimate code paths that take longer than normal — a flash sector erase (which can block for 20-400 ms depending on the flash technology), a computation-heavy DSP filter pass, or a task waiting for a slow external peripheral response — exceed the timeout, causing spurious resets. These are maddening to debug because the system works most of the time but randomly resets under heavy load, low temperature (flash writes are slower), or specific operational sequences. The symptom looks like a hardware defect, not a timeout misconfiguration.
Too long: The system remains in a failed state for an unacceptably long time before the watchdog fires. A 30-second watchdog timeout on a motor controller means the motor could run uncontrolled for 30 seconds after the software hangs — potentially causing physical damage, injury, or product destruction. A 10-second timeout on a communication gateway means 10 seconds of lost data before the system recovers.
Practical approach: Measure the worst-case main loop or task cycle time under maximum load, with all peripheral interactions active and all error-handling paths exercised. Set the watchdog timeout to roughly 2-5x that value to provide margin for timing variability. For a super-loop that normally completes in 5 ms and worst-case in 20 ms, a 50-100 ms timeout is appropriate. For an RTOS watchdog manager checking in every 100 ms, a 300-500 ms timeout provides comfortable margin. Also consider the reset recovery time: if a watchdog reset takes 500 ms to boot and reinitialize, the total downtime per watchdog event is timeout + recovery, and this total must be acceptable for your application's availability requirements.
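Once a target timeout is chosen, it has to be mapped onto the watchdog's prescaler and reload value. The sketch below assumes an STM32-style IWDG with power-of-two dividers from 4 to 256 and a 12-bit reload register, where timeout = (reload + 1) x divider / f_LSI. The helper names (iwdg_cfg_t, iwdg_pick_config, iwdg_timeout_us) are hypothetical, and the LSI frequency is passed in because it differs between families.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for an IWDG setting (not a HAL type). */
typedef struct {
    uint16_t divider; /* 4, 8, 16, ..., 256 */
    uint16_t reload;  /* 0..0x0FFF (12-bit reload register) */
} iwdg_cfg_t;

/*
 * Pick the smallest divider (finest resolution) whose 12-bit reload range
 * can represent timeout_ms. The tick count is rounded up so the resulting
 * timeout is never shorter than requested. Returns false if the timeout
 * cannot be represented (zero, or too long for divider 256).
 */
static bool iwdg_pick_config(uint32_t lsi_hz, uint32_t timeout_ms, iwdg_cfg_t *out)
{
    assert(out != NULL && lsi_hz > 0);
    for (uint32_t divider = 4; divider <= 256; divider *= 2) {
        uint64_t num = (uint64_t)timeout_ms * lsi_hz;
        uint64_t den = 1000ull * divider;
        uint64_t ticks = (num + den - 1) / den; /* ceiling division */
        if (ticks >= 1 && ticks <= 4096) {
            out->divider = (uint16_t)divider;
            out->reload = (uint16_t)(ticks - 1);
            return true;
        }
    }
    return false;
}

/* Nominal timeout in microseconds for a given configuration. */
static uint32_t iwdg_timeout_us(uint32_t lsi_hz, const iwdg_cfg_t *cfg)
{
    assert(cfg != NULL && lsi_hz > 0);
    return (uint32_t)(((uint64_t)(cfg->reload + 1u) * cfg->divider * 1000000ull) / lsi_hz);
}
```

For a 20 ms worst-case loop with a 3x margin (60 ms) and a nominal 32 kHz LSI, this yields divider 4 and reload 479. Because the LSI is an uncalibrated RC oscillator, the real timeout will deviate from the nominal figure, which is one more reason to keep margin in the chosen value.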
IWDG vs WWDG
Q: What is the difference between the Independent Watchdog (IWDG) and Window Watchdog (WWDG) on STM32?
The IWDG (Independent Watchdog) is clocked by its own dedicated low-speed internal oscillator (LSI, nominally 32 kHz on many STM32 families and around 40 kHz on some older ones such as the STM32F1), making it truly independent of the main system clock and the entire digital clock tree. If the main HSE oscillator fails, if the PLL loses lock, or if the system clock configuration is corrupted, the IWDG continues counting and will reset the system. It is a simple countdown timer with a single constraint: refresh it before it reaches zero. You can refresh at any time; there is no "too early" penalty. The IWDG is best for catching catastrophic failures: complete CPU hangs, infinite loops, clock system failures, and corrupted program flow. Once started, the IWDG cannot be stopped (on most STM32 families), which is a deliberate safety feature: it prevents runaway software from disabling its own watchdog.
The WWDG (Window Watchdog) is clocked from the APB1 bus clock, which is derived from the main system clock. It adds a window constraint: the refresh must occur within a specific time window, after a configurable lower bound and before the counter expires. Because the WWDG is a down-counter, refreshing too early means writing while the counter is still above the window value; that triggers a reset, just like refreshing too late. This catches a broader class of bugs than the IWDG: if the software loop is running faster than expected (skipping critical sections, bypassing sensor reads, or stuck in a tight sub-loop that still reaches the feed point), the early-refresh detection catches the abnormal timing. The WWDG also provides an Early Wakeup Interrupt (EWI) that fires approximately one counter tick before the timeout, giving firmware a last chance to save diagnostic state or perform graceful shutdown before the reset.
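The window rule is easier to reason about with the counter values written out. The following host-side sketch assumes the classic STM32 WWDG behavior: a 7-bit down-counter that causes a reset when it steps from 0x40 to 0x3F, a refresh that is rejected while the counter is still above the window value, and a counter tick of 4096 x prescaler PCLK1 cycles. The names are hypothetical, and the exact boundary behavior should be confirmed against the reference manual for the specific part.

```c
#include <assert.h>
#include <stdint.h>

/* The WWDG resets the MCU when its counter steps from 0x40 down to 0x3F. */
#define WWDG_COUNTER_MIN 0x40u

typedef enum {
    WWDG_REFRESH_OK,
    WWDG_REFRESH_TOO_EARLY, /* counter still above the window value */
    WWDG_REFRESH_TOO_LATE   /* counter already below 0x40: reset has occurred */
} wwdg_refresh_t;

/* Classify a refresh attempt made while the down-counter holds `counter`. */
static wwdg_refresh_t wwdg_classify_refresh(uint8_t counter, uint8_t window)
{
    if (counter < WWDG_COUNTER_MIN) {
        return WWDG_REFRESH_TOO_LATE;
    }
    if (counter > window) {
        return WWDG_REFRESH_TOO_EARLY;
    }
    return WWDG_REFRESH_OK;
}

/* Duration of one counter tick in microseconds: 4096 * prescaler / PCLK1. */
static uint32_t wwdg_tick_us(uint32_t pclk1_hz, uint32_t prescaler)
{
    assert(pclk1_hz > 0);
    return (uint32_t)((4096ull * prescaler * 1000000ull) / pclk1_hz);
}
```

As a worked example under these assumptions: with PCLK1 at 32 MHz and a prescaler of 8, one tick is 1024 us. Starting the counter at 0x7F with a window of 0x50, the window opens after 47 ticks (about 48 ms) and the reset occurs after 64 ticks (about 65.5 ms), so the feed must land between those two points.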
For maximum protection, deploy both: the IWDG catches catastrophic failures and clock loss (since it runs on its own oscillator), while the WWDG verifies timing integrity (since it detects both too-fast and too-slow execution). This combination is commonly used in designs targeting SIL 2+ and ASIL B+ applications.
Q: How do you handle the watchdog during debugging?
Debugging and watchdogs are inherently in conflict. When you hit a breakpoint, the CPU halts — but hardware timers keep counting (they are not on the CPU pipeline). If you pause at a breakpoint for longer than the watchdog timeout, the system resets, the debug session disconnects, and you lose your inspection context. This is especially frustrating when debugging the exact code path that feeds the watchdog, because you cannot step through it without triggering the timeout.
STM32 solution: The DBGMCU (Debug MCU Configuration) peripheral has freeze bits that halt specific timers when the CPU is halted by the debugger. Setting DBGMCU_APB1_FZ.DBG_IWDG_STOP freezes the IWDG counter during debug halt, and DBGMCU_APB1_FZ.DBG_WWDG_STOP freezes the WWDG. The HAL provides convenience macros: __HAL_DBGMCU_FREEZE_IWDG() and __HAL_DBGMCU_FREEZE_WWDG(). Enable these early in your initialization code (before starting the watchdogs) so they take effect for the entire debug session.
Critical rule: These freeze bits must not be present in production firmware. If a debugger probe is accidentally left connected in the field (not uncommon during manufacturing test), or if the DBGMCU registers are not cleared after a debug session, the watchdog might not function correctly when the CPU enters a low-power halt state. Use conditional compilation to restrict watchdog freeze to debug builds only:
#ifdef DEBUG
    __HAL_DBGMCU_FREEZE_IWDG();
    __HAL_DBGMCU_FREEZE_WWDG();
#endif
Some teams also increase the watchdog timeout in debug builds (e.g., 10 seconds instead of 500 ms) as additional margin for stepping through code. This is less safe than the freeze approach but useful when the DBGMCU freeze feature is not available on a particular MCU.
Feeding Strategies
Q: Why should you NEVER feed the watchdog inside a timer ISR?
Feeding the watchdog in a timer ISR is one of the most dangerous design mistakes in embedded firmware because it completely defeats the watchdog's purpose while giving the false appearance of safety. The timer interrupt is driven by hardware — it fires at a fixed rate based on the timer peripheral's clock and configuration, completely independent of what the application code is doing. If the main loop is stuck in an infinite loop, if a task is deadlocked, if the stack has overflowed and corrupted the program counter so the CPU is executing random instructions, or if a critical peripheral driver has hung waiting for a status bit that will never be set — the timer ISR still fires on schedule (because it runs at a higher priority than the stuck code) and faithfully refreshes the watchdog. The system appears healthy to the watchdog but is completely non-functional.
The fundamental principle is: the watchdog feed must prove that the application logic is progressing correctly, not merely that the CPU is executing instructions. A timer ISR proves only that the interrupt controller and timer peripheral are working — it says nothing about the application's functional state. The watchdog feed must be placed at a point in the code that is only reached if all critical functions have executed successfully during the current cycle.
In a super-loop architecture, the correct placement is at the end of the main loop, after all critical subsystems have been serviced — sensor reads completed, communication processed, control outputs updated. If any subsystem hangs, the main loop stalls before reaching the feed point, and the watchdog fires. For additional safety, implement execution path verification: maintain a bitmask where each critical function sets its bit when it completes, and only feed the watchdog if all bits are set (then clear the mask for the next cycle). This detects not just hangs but also skipped subsystems:
#define TASK_SENSOR  (1 << 0)
#define TASK_COMMS   (1 << 1)
#define TASK_CONTROL (1 << 2)
#define ALL_TASKS    (TASK_SENSOR | TASK_COMMS | TASK_CONTROL)

volatile uint32_t task_complete_mask = 0;

while (1) {
    read_sensors();   task_complete_mask |= TASK_SENSOR;
    process_comms();  task_complete_mask |= TASK_COMMS;
    update_control(); task_complete_mask |= TASK_CONTROL;

    if (task_complete_mask == ALL_TASKS) {
        IWDG_feed();
        task_complete_mask = 0;
    }
}
Q: How do you implement a watchdog feeding strategy in an RTOS?
In an RTOS, a single watchdog feed in one task does not protect the other tasks — if the network task hangs but the sensor task keeps feeding the watchdog, the system appears healthy while half of its functionality is dead. The solution is a watchdog manager (sometimes called a watchdog supervisor) pattern that aggregates health information from all critical tasks before feeding the hardware watchdog.
The architecture works as follows: each critical task has a "check-in" flag or counter in a shared array. Within its normal execution cycle, each task sets its flag (for example, task_checkin[TASK_SENSOR] = true) at a point that is only reached after the task has completed its essential work for that cycle. A dedicated watchdog manager task runs periodically (e.g., every 100-200 ms) at a low-to-medium priority. On each run, it inspects every monitored task's flag. Only if all flags are set does the manager feed the hardware watchdog. After feeding (or after detecting a missing flag), the manager clears all flags for the next monitoring cycle.
// Watchdog manager task (FreeRTOS example)
#define NUM_TASKS 4
static volatile bool task_checkin[NUM_TASKS];

void watchdog_manager_task(void *param) {
    (void)param;  // Unused
    while (1) {
        vTaskDelay(pdMS_TO_TICKS(100));
        bool all_alive = true;
        for (int i = 0; i < NUM_TASKS; i++) {
            if (!task_checkin[i]) {
                all_alive = false;
                log_error("Task %d missed check-in", i);
            }
            task_checkin[i] = false;  // Reset for next cycle
        }
        if (all_alive) {
            IWDG_feed();
        }
    }
}

// Called by each monitored task after completing its critical work
void watchdog_checkin(uint8_t task_id) {
    if (task_id < NUM_TASKS) {  // Ignore out-of-range IDs instead of writing past the array
        task_checkin[task_id] = true;
    }
}
This pattern catches several failure modes: (1) any single task hanging — its flag is never set; (2) priority inversion starving a low-priority task — it cannot run and therefore cannot check in; (3) deadlocks between two or more tasks — at least one task's flag will be missing. The watchdog timeout must be set longer than the monitoring period plus the worst-case latency for all tasks to complete one cycle and check in. If the manager runs every 100 ms and the slowest task takes up to 80 ms per cycle, a 300-500 ms watchdog timeout provides adequate margin.
A subtle design decision: the watchdog manager should run at a low-to-medium priority, not the highest. At the highest priority it always gets CPU time, so its own scheduling says nothing about the rest of the system: the check-in flags still expose a monitored task that stops reporting, but a runaway task that hogs the CPU and starves unmonitored lower-priority work goes unnoticed. At a lower priority, the manager is starved along with everything else below the runaway task, misses its feed, and the hardware watchdog fires.
Safety and Recovery
Q: What are watchdog recovery levels, and why might a simple reset not be enough?
Not all failures warrant a full system reset, and some systems cannot tolerate the downtime. In safety-critical and high-availability applications, a graduated recovery strategy provides the right level of response for each severity of failure:
Level 0 — Software recovery (pre-reset): Before the watchdog fires, attempt to recover within the application. The WWDG's Early Wakeup Interrupt (EWI) fires just before the counter expires, giving firmware a last chance to save diagnostic state (fault registers, stack trace, task states) to non-volatile memory, reinitialize the problematic subsystem, and refresh the watchdog to avoid the reset entirely. This is the fastest recovery path but only works if the software is still coherent enough to execute the EWI handler correctly.
Level 1 — Warm reset: The watchdog triggers a system reset, but the bootloader distinguishes it from a power-on reset by checking the reset reason register (RCC_CSR on STM32). On a watchdog reset, the bootloader can take a fast recovery path: skip lengthy self-tests and peripheral enumeration, restore state from battery-backed RAM or a persistent configuration block, and resume operation in milliseconds instead of seconds. Non-volatile diagnostic data saved during Level 0 is available for later retrieval.
Level 2 — Cold reset: A full power-on reset sequence: all hardware reinitialized from scratch, full self-tests executed, configuration loaded from defaults. This is the safest recovery but also the slowest, and it loses all volatile state. Use this when the warm reset path itself is suspect — for example, if battery-backed RAM may be corrupted.
Level 3 — External power cycle: An external watchdog IC (independent of the MCU) cuts and restores power to the entire board. This handles failure modes that survive a software-initiated reset: a peripheral locked in an error state that holds a bus signal low, an ESD-induced latch-up condition where the MCU draws excessive current, or a voltage regulator glitch that leaves the MCU in an undefined state with the reset line not asserted.
Production systems often implement escalation: first watchdog reset attempts Level 1 recovery. If a second watchdog reset occurs within a short window (e.g., 60 seconds), escalate to Level 2. If resets continue, escalate to Level 3. If resets persist even after power cycling, enter a safe mode with minimal functionality and signal for human intervention.
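The escalation policy can be expressed as a small pure function that the boot code calls after a watchdog reset. This is a sketch under stated assumptions: the enum and function names are hypothetical, and it assumes the firmware records the previous recovery level and how long the previous run lasted in storage that survives the reset (for example battery-backed RAM).

```c
#include <assert.h>
#include <stdint.h>

/* Recovery levels from the text; SAFE_MODE is the final fallback. */
typedef enum {
    RECOVERY_NONE = 0,        /* no watchdog recovery attempted yet */
    RECOVERY_WARM = 1,        /* Level 1: warm reset, fast path */
    RECOVERY_COLD = 2,        /* Level 2: full cold-start sequence */
    RECOVERY_POWER_CYCLE = 3, /* Level 3: external power cycle */
    RECOVERY_SAFE_MODE = 4    /* minimal functionality, wait for a human */
} recovery_level_t;

/*
 * Decide the recovery level for the watchdog reset that just happened.
 *   previous      - level used for the previous watchdog reset, if any
 *   prev_uptime_s - how long the system ran before this reset
 *   window_s      - resets closer together than this count as repeats
 */
static recovery_level_t next_recovery_level(recovery_level_t previous,
                                            uint32_t prev_uptime_s,
                                            uint32_t window_s)
{
    /* First reset, or the system ran long enough to be considered stable. */
    if (previous == RECOVERY_NONE || prev_uptime_s >= window_s) {
        return RECOVERY_WARM;
    }
    /* Already at the last step: stay in safe mode. */
    if (previous >= RECOVERY_SAFE_MODE) {
        return RECOVERY_SAFE_MODE;
    }
    /* Repeated reset inside the window: escalate one level. */
    return (recovery_level_t)(previous + 1);
}
```

With a 60-second window, a system that resets again 10 seconds after a warm recovery moves to a cold reset, while one that ran for an hour before the next reset starts over at the warm level.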
Q: What should your system do after a watchdog reset to prevent reset loops?
A watchdog reset indicates something went wrong. If the root cause persists — a corrupted configuration in flash, a peripheral stuck in an error state, a hardware fault, or an environmental condition like sustained over-temperature — the system will hang again immediately after restarting, triggering another watchdog reset, creating an infinite reset loop. This is worse than a single failure: the system never stabilizes, the rapid cycling may wear out flash memory (repeated boot writes to the same sectors), mechanical actuators may cycle dangerously (motors jerking on each boot), and the power supply stress from repeated inrush current can damage components.
Strategy 1 — Count consecutive watchdog resets: Maintain a reset counter in battery-backed RAM (or a persistent register that survives warm resets but not power-on resets). Increment it on each watchdog reset and clear it on normal operation after a stability period (e.g., 30 seconds of successful running). If the counter exceeds a threshold (e.g., 3 resets within 5 minutes), enter a safe mode with minimal functionality — disable the failing subsystem, load factory-default configuration, halt actuation outputs, and wait for external intervention (debugger connection, configuration command, or manual power cycle).
Strategy 2 — Log diagnostics for post-mortem analysis: At startup, read the reset cause flags in RCC_CSR to distinguish watchdog resets from power-on resets, pin resets, and brownout resets. For watchdog resets, retrieve any saved fault information (program counter at fault, stack pointer, fault status registers from a preceding HardFault handler) and store it to a dedicated diagnostic partition in flash or external EEPROM. This data is invaluable for root-cause analysis — it tells the field engineer or developer exactly where the code was when it failed.
void check_reset_reason(void) {
    if (__HAL_RCC_GET_FLAG(RCC_FLAG_IWDGRST)) {
        wdt_reset_count++;  // In battery-backed RAM
        save_reset_log(wdt_reset_count, saved_pc, saved_lr);
        if (wdt_reset_count >= MAX_WDT_RESETS) {
            enter_safe_mode();  // Minimal operation
        }
    } else if (__HAL_RCC_GET_FLAG(RCC_FLAG_PORRST)) {
        wdt_reset_count = 0;  // Clean power-on, reset counter
    }
    __HAL_RCC_CLEAR_RESET_FLAGS();
}
Strategy 3 — Progressive degradation: On first watchdog reset, attempt full normal restart. On second, disable non-essential subsystems (logging, telemetry, secondary sensors). On third, enter a minimal safe state that maintains only critical safety functions — a motor controller parks the motor and holds brakes; a medical infusion pump closes the valve and sounds an alarm; a communication gateway forwards only priority-1 messages. Communicate the fault condition through whatever channel remains operational: an LED blink pattern, a CAN error frame, a serial log message, or a status register readable via debugger.
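One practical detail behind Strategies 1 and 3 is that a counter kept in battery-backed or uninitialized RAM holds arbitrary data the first time it is read, so it needs a validity check before it can be trusted. The sketch below shows one way to combine that check with the three-step degradation described above. The record layout, the magic value, and all names are hypothetical; placing the record in backup RAM or a no-init linker section is left to the platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WDT_RECORD_MAGIC 0x57445431u /* arbitrary marker value ("WDT1") */

/* Record intended to live in RAM that survives a warm reset. */
typedef struct {
    uint32_t magic;
    uint32_t reset_count; /* consecutive watchdog resets */
    uint32_t check;       /* magic ^ reset_count, rejects random garbage */
} wdt_record_t;

typedef enum { MODE_NORMAL, MODE_DEGRADED, MODE_SAFE } boot_mode_t;

static bool wdt_record_valid(const wdt_record_t *r)
{
    return r->magic == WDT_RECORD_MAGIC && r->check == (r->magic ^ r->reset_count);
}

static void wdt_record_set(wdt_record_t *r, uint32_t count)
{
    r->magic = WDT_RECORD_MAGIC;
    r->reset_count = count;
    r->check = WDT_RECORD_MAGIC ^ count;
}

/* Call once at boot; was_watchdog_reset comes from the reset-cause flags. */
static boot_mode_t wdt_record_on_boot(wdt_record_t *r, bool was_watchdog_reset)
{
    /* Treat an invalid record as "no previous watchdog resets". */
    uint32_t count = wdt_record_valid(r) ? r->reset_count : 0u;
    if (was_watchdog_reset) {
        if (count < UINT32_MAX) {
            count++;
        }
    } else {
        count = 0u; /* any other reset cause clears the streak */
    }
    wdt_record_set(r, count);

    if (count <= 1u) return MODE_NORMAL;   /* first reset: full normal restart */
    if (count == 2u) return MODE_DEGRADED; /* second: non-essential subsystems off */
    return MODE_SAFE;                      /* third and later: minimal safe state */
}

/* Call after the stability period (e.g., 30 s of successful running). */
static void wdt_record_mark_stable(wdt_record_t *r)
{
    wdt_record_set(r, 0u);
}
```

The XOR check is deliberately simple; a CRC over the record would give stronger protection against partially corrupted contents at the cost of a little more code.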
Q: What safety standards require watchdog timers, and what do they mandate?
Several international safety standards explicitly require independent monitoring of program execution, and the watchdog timer is the most common mechanism to satisfy this requirement:
IEC 61508 (Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems) is the umbrella standard for industrial systems. It requires monitoring of the program sequence (execution flow) and detection of timing faults. A watchdog satisfies this by verifying that software progresses through its expected sequence within expected time bounds. For SIL 2 and above, the monitoring mechanism must be independent of the monitored software — meaning the watchdog must use an independent clock source. The IWDG's LSI oscillator satisfies this requirement because it is separate from the main system clock; the WWDG (clocked from APB1) does not fully satisfy it because a main clock failure takes down both the application and the watchdog.
ISO 26262 (Road Vehicles: Functional Safety) requires software execution monitoring for ASIL B and above. The watchdog mechanism must be independent of the monitored software, and the monitoring must cover both temporal (too slow or too fast) and logical (correct sequence) aspects of execution. For ASIL C and ASIL D applications, designs commonly add an external watchdog IC to the internal MCU watchdog, providing a second independent fault detection path that can reset or power-cycle the MCU even if the MCU itself is completely non-responsive.
IEC 62304 (Medical Device Software: Software Life Cycle Processes) requires risk mitigation mechanisms for software failures in Class B and Class C medical devices. A watchdog reset is a standard risk control measure. Risk analyses carried out under the standard typically also call for watchdog reset events to be logged (to non-volatile memory if possible), for the system to verify correct startup after a watchdog reset (not just blind restart), and for repeated resets to trigger an escalated fault response rather than an infinite reset loop that could mask a persistent hazard.
All three standards share common requirements: the watchdog must be started early in the boot process (before the application reaches its main loop), it must not be disableable by application software (once started), and the feeding strategy must verify meaningful application progress — not just CPU activity.
Q: What is an external watchdog IC, and when should you use one instead of the internal watchdog?
An external watchdog IC is a separate chip on the PCB, typically in a small SOT-23 or SOIC-8 package, that monitors the MCU by expecting a periodic signal transition (toggle or pulse) on an input pin. If the MCU fails to toggle the pin within the configured timeout, the external watchdog asserts the MCU's hardware reset line (or cuts power via a load switch). Examples include the MAX6369 family, TPS3823, STWD100, and ADM6316.
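On the firmware side, servicing an external watchdog follows the same rule as the internal one: the pin should only change state when the application has proven it is healthy. A minimal sketch, assuming an edge-triggered watchdog input and a caller that writes the returned level to the GPIO; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tracks the level last driven on the external watchdog's input pin. */
typedef struct {
    bool     level; /* current pin level */
    uint32_t edges; /* number of transitions generated so far */
} ext_wdt_t;

/*
 * Service the external watchdog once per monitoring cycle.
 * Returns the level the caller should drive on the watchdog input pin.
 * When the system is not healthy the level is left unchanged, so the
 * external IC sees no edge and eventually times out.
 */
static bool ext_wdt_service(ext_wdt_t *w, bool system_healthy)
{
    assert(w != 0);
    if (system_healthy) {
        w->level = !w->level;
        w->edges++;
    }
    return w->level;
}
```

The system_healthy argument would normally come from the same completion mask or check-in logic that gates the internal watchdog feed, so that a hung task stops both watchdogs from being serviced.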
You need an external watchdog when the internal watchdog's assumptions break down: (1) the internal watchdog shares a failure mode with the MCU itself — a power supply glitch that puts the MCU into a partially functional state may corrupt the IWDG configuration registers while leaving the CPU running erratically, or a brownout may halt the LSI oscillator that clocks the IWDG; (2) the MCU enters a latch-up condition (triggered by ESD, radiation, or overvoltage on an I/O pin) where it draws excessive current and is electrically stuck — no internal watchdog can recover from this because the entire digital core is non-functional; (3) safety standards require a monitoring path that is physically independent of the device being monitored — this is common at ASIL C/D and SIL 3/4, where a single IC failure must not disable both the application and its monitor.
External watchdogs also provide capabilities that internal watchdogs typically lack: power-on reset generation with proper voltage supervision (the IC monitors VDD and holds the MCU in reset until power is stable), brownout detection with configurable thresholds, reset pulse width guarantees (ensuring all peripherals see a long enough reset pulse to properly initialize), and cascaded timeout architectures for multi-processor systems. Some external watchdogs offer window mode (configurable minimum and maximum feed times), providing the same timing integrity check as the STM32 WWDG but as a fully independent device with its own oscillator and power domain.