Watchdog Timer

Quick Cap

A watchdog timer is a hardware safety mechanism that resets the system if software fails to "feed" (refresh) it within a configured timeout period. It serves as the last line of defense against firmware hangs, infinite loops, deadlocks, and task starvation in systems that run unattended with no human to press reset. Interviewers test your understanding of hardware vs software watchdog, the difference between independent and windowed watchdog timers, intelligent feeding strategies (not just feeding blindly in a timer ISR), and the safety implications of watchdog design in critical systems.

Key Facts:

Hardware counter: A free-running down-counter that triggers a system reset when it reaches zero
Feed/kick/refresh: Software must periodically reload the counter to prevent timeout and reset
Independent WDT (IWDG): Clocked from an independent oscillator; survives main clock failure
Window WDT (WWDG): Resets if fed too early OR too late, catching both hangs and runaway code
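Hard to disable: Once started, many hardware watchdogs (including the STM32 IWDG and WWDG) cannot be stopped except by a reset -- this is by design
Safety requirement: Functional-safety standards like IEC 61508 and ISO 26262 list a watchdog with an independent time base among their recommended diagnostic measures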

Deep Dive

At a Glance

Characteristic	Independent WDT (IWDG)	Window WDT (WWDG)
Clock source	Independent (LSI ~32-40 kHz)	System clock (APB)
Timeout range	~100 µs to ~26 s (at 40 kHz LSI)	~100 µs to ~60 ms (depends on APB clock)
Window mode	No (feed anytime before timeout)	Yes (too early = reset)
Survives main clock failure	Yes	No
Reset behavior	Full system reset on timeout	Full system reset on timeout or early feed
Can be stopped after start	No (once started, runs until reset)	No
Typical use	General hang detection	Detecting timing anomalies

Why Watchdogs Matter

Embedded systems operate unattended -- there is no user watching the screen ready to reboot when something goes wrong. A medical infusion pump, a vehicle braking controller, an industrial valve actuator, a remote weather station: all of these must recover from software faults autonomously. The watchdog timer is the mechanism that makes this possible.

The fundamental contract is simple: the system must prove it is alive by periodically feeding the watchdog. If the software is running correctly, it feeds the watchdog well before the timeout expires, and nothing happens. If the software hangs, enters an infinite loop, deadlocks on a mutex, or starves a critical task, the feed never arrives, the counter reaches zero, and the hardware forces a full system reset.
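The contract above can be illustrated with a small host-side model. This is a minimal sketch, not real peripheral code: the type `wdt_model_t` and the functions `wdt_start`, `wdt_feed`, and `wdt_tick` are hypothetical names, and one call to `wdt_tick` stands in for one watchdog clock period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a watchdog down-counter (hypothetical, not hardware). */
typedef struct {
    uint32_t reload;   /* value loaded on every feed */
    uint32_t counter;  /* current count; the reset fires at zero */
    bool     reset;    /* latched once the counter has expired */
} wdt_model_t;

void wdt_start(wdt_model_t *w, uint32_t reload)
{
    assert(w != 0 && reload > 0);
    w->reload  = reload;
    w->counter = reload;
    w->reset   = false;
}

/* Software "feeds" the watchdog: the counter returns to its reload value.
 * Once the reset has fired, feeding no longer has any effect. */
void wdt_feed(wdt_model_t *w)
{
    if (!w->reset) {
        w->counter = w->reload;
    }
}

/* One watchdog clock tick; returns true once a reset has been triggered. */
bool wdt_tick(wdt_model_t *w)
{
    if (!w->reset && w->counter > 0 && --w->counter == 0) {
        w->reset = true;
    }
    return w->reset;
}
```

With a reload value of 3, software that feeds at least once every two ticks never sees a reset. Software that stops feeding is reset on the third tick after its last feed.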

This is not a graceful recovery -- it is the nuclear option. But it must always be there as the ultimate safety net. Safety standards like IEC 61508 (industrial) and ISO 26262 (automotive) list a watchdog with an independent time base among their recommended diagnostic measures for systems where failure could cause harm. In safety-critical design, the watchdog is in practice not optional.

Independent vs Window Watchdog

The two most common watchdog types serve different detection goals. Understanding when to use each -- and why you might use both simultaneously -- is critical for interview discussions.

Feature	Independent WDT (IWDG)	Window WDT (WWDG)
Clock source	Independent oscillator (LSI ~32-40 kHz)	System clock (APB bus)
Survives main clock failure	Yes -- this is its key advantage	No -- dies with the system clock
Window mode	No (feed anytime before timeout)	Yes (too early = reset)
Can detect runaway code	Partial (only catches hangs)	Yes (catches too-fast execution too)
Typical timeout range	~100 µs - ~26 s	~100 µs - ~60 ms
Can be disabled after start	No (once started, runs forever)	No
Best for	General hang detection, clock failure survival	Detecting timing anomalies, execution flow validation
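The timeout ranges follow from each watchdog's clock, prescaler, and counter width. The sketch below assumes the formulas commonly documented for STM32F1-class parts (12-bit IWDG reload with prescalers 4 to 256; WWDG clocked from PCLK1 divided by 4096, with a 2-bit prescaler field and six usable counter bits). The function names are hypothetical, and the formulas should be checked against the reference manual of the actual device.

```c
#include <assert.h>
#include <stdint.h>

/* IWDG timeout in microseconds.
 * Assumed formula: t = prescaler * (reload + 1) / f_LSI,
 * with prescaler in {4, 8, ..., 256} and a 12-bit reload value. */
uint64_t iwdg_timeout_us(uint32_t lsi_hz, uint32_t prescaler, uint32_t reload)
{
    assert(lsi_hz > 0 && reload <= 0xFFFu);
    return ((uint64_t)prescaler * (reload + 1u) * 1000000u) / lsi_hz;
}

/* WWDG timeout in microseconds.
 * Assumed formula: t = 4096 * 2^wdgtb * (t_5_0 + 1) / f_PCLK,
 * where t_5_0 is the low six bits of the 7-bit down-counter. */
uint64_t wwdg_timeout_us(uint32_t pclk_hz, uint32_t wdgtb, uint32_t t_5_0)
{
    assert(pclk_hz > 0 && wdgtb <= 3u && t_5_0 <= 0x3Fu);
    return (4096ull * (1ull << wdgtb) * (t_5_0 + 1u) * 1000000ull) / pclk_hz;
}
```

Under these assumptions, a 40 kHz LSI gives an IWDG range of 100 µs to about 26.2 s, and a 36 MHz PCLK1 gives a WWDG range of about 113 µs to about 58 ms.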

Why windowed mode matters: A simple watchdog only catches code that is stuck -- it times out because the feed never arrives. But what about code that runs too fast? Consider a control loop that should execute every 10 ms. If a corrupted loop counter or a stack overflow causes the loop to complete in microseconds instead of milliseconds, the actuator receives commands at thousands of times the intended rate. A simple watchdog would see the feed arriving and think everything is fine. A windowed watchdog catches this: if the feed arrives before the window opens (too early), it triggers a reset just as surely as if the feed arrived too late.

Diagram: Window Watchdog Timing

Feed only inside the valid window; too early or too late triggers a reset.

Too Early (above W) -- Feeding the watchdog before the counter drops below the window value triggers an immediate reset. This catches code that executes faster than expected.
Valid Window (between W and 0) -- The only safe time to feed. Software must be designed so that its critical path completes within this window.
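Too Late (counter reaches 0) -- The watchdog fires a reset because the software failed to feed in time. (On the STM32 WWDG, the reset fires when the 7-bit counter rolls over from 0x40 to 0x3F rather than at a literal zero.)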
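The three regions can be modelled on a host machine. This is a simplified sketch with hypothetical names (`wwdt_model_t`, `wwdt_start`, `wwdt_feed`, `wwdt_tick`); it treats zero as the expiry value and does not reproduce any particular peripheral's register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of a windowed watchdog (hypothetical, not hardware). */
typedef struct {
    uint32_t reload;   /* counter value after a valid feed */
    uint32_t window;   /* feeds are only legal once counter <= window */
    uint32_t counter;  /* current down-counter value */
    bool     reset;    /* latched once a reset has been triggered */
} wwdt_model_t;

void wwdt_start(wwdt_model_t *w, uint32_t reload, uint32_t window)
{
    assert(w != 0 && reload > 0 && window <= reload);
    w->reload  = reload;
    w->window  = window;
    w->counter = reload;
    w->reset   = false;
}

/* Feed attempt; returns true if the watchdog has triggered a reset. */
bool wwdt_feed(wwdt_model_t *w)
{
    if (w->reset) {
        return true;
    }
    if (w->counter > w->window) {
        w->reset = true;          /* too early: window not yet open */
    } else {
        w->counter = w->reload;   /* inside the window: accept the feed */
    }
    return w->reset;
}

/* One clock tick; returns true once a reset has been triggered. */
bool wwdt_tick(wwdt_model_t *w)
{
    if (!w->reset && w->counter > 0 && --w->counter == 0) {
        w->reset = true;          /* too late: counter expired */
    }
    return w->reset;
}
```

With a reload of 10 and a window of 4, a feed during the first five ticks is rejected as too early, a feed from the sixth tick onward is accepted, and no feed at all leads to a reset on the tenth tick.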

The combination of IWDG and WWDG gives you a dual-watchdog pattern: the IWDG catches hangs and survives clock failures, while the WWDG catches timing anomalies and execution flow corruption. Safety-critical systems commonly deploy both.

Feeding Strategies

How and where you feed the watchdog is just as important as having one. A poorly placed feed can completely defeat the watchdog's purpose.

Bad strategy -- feeding in a timer interrupt:

This is the single most common watchdog mistake and a frequent interview trap. If you feed the watchdog from a periodic timer ISR, the watchdog will be fed as long as interrupts are running -- even if the main application is completely deadlocked. The timer interrupt fires independently of the application logic, so the watchdog happily stays alive while the system is effectively dead. The watchdog has been reduced to an expensive paperweight.

⚠️ Common Trap: Feeding in a Timer ISR

Never feed the watchdog from a periodic timer interrupt. Interrupts can continue firing even when the main application is deadlocked. The watchdog should only be fed when the application has demonstrated that its critical paths are executing correctly.

Good strategy -- health-verified feeding:

Each critical task or subsystem sets a "heartbeat" flag when it completes a successful iteration. A dedicated watchdog monitoring task (or the main loop itself) checks ALL heartbeat flags before feeding the watchdog. If any task has missed its heartbeat deadline, the watchdog is intentionally starved, and the system resets. This approach ensures that the watchdog feed is a proof of system health, not just proof that the CPU is running.

The pattern works as follows:

Each critical task sets its heartbeat flag after completing its core work
The watchdog task periodically checks all heartbeat flags
If ALL flags are set, clear them all and feed the watchdog
If ANY flag is missing, do NOT feed -- let the watchdog expire

This way, if the sensor task hangs on a bus timeout, or the communication task deadlocks on a semaphore, or the control task gets stuck in an infinite retry loop, the corresponding heartbeat flag will never be set, the watchdog will starve, and the system will reset.
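The four steps above can be sketched with one bit per task. The task IDs, `heartbeat_set`, `watchdog_monitor_poll`, and the stand-in `wdt_hw_feed` are all hypothetical names. The sketch also assumes that updates to the shared flag word are made atomic in real firmware (for example, inside a critical section), which this host-side version does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task IDs: one bit per critical task. */
enum { TASK_SENSOR = 0, TASK_COMMS = 1, TASK_CONTROL = 2, TASK_COUNT = 3 };

#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u)

static uint32_t heartbeat_flags;  /* shared; must be updated atomically in real firmware */
static unsigned feed_count;       /* counts feeds so the sketch can be observed */

/* Stand-in for the hardware feed (e.g., a write to the reload register). */
void wdt_hw_feed(void)
{
    feed_count++;
}

/* Called by each task after it completes one successful iteration. */
void heartbeat_set(unsigned task_id)
{
    assert(task_id < TASK_COUNT);
    heartbeat_flags |= 1u << task_id;
}

/* Called periodically by the monitor; returns true if the watchdog was fed. */
bool watchdog_monitor_poll(void)
{
    if ((heartbeat_flags & ALL_TASKS_MASK) != ALL_TASKS_MASK) {
        return false;             /* a heartbeat is missing: starve the watchdog */
    }
    heartbeat_flags = 0;          /* every task must check in again */
    wdt_hw_feed();
    return true;
}
```

The monitor's polling period must be shorter than the watchdog timeout and longer than the slowest task's period. Otherwise a healthy but slow task will look like a missing heartbeat.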

RTOS pattern: In an RTOS environment, the watchdog feed typically lives in either the lowest-priority idle task or a dedicated medium-priority watchdog task. The idle-task approach has a subtle advantage: if any higher-priority task monopolizes the CPU (priority inversion, runaway loop), the idle task never runs, and the watchdog starves. Its weakness is that a task blocked forever on a mutex or queue consumes no CPU, so the idle task keeps running and keeps feeding. A dedicated watchdog task with explicit health checks from all other tasks closes that gap, which is why it is more robust and more commonly used in production systems.
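A dedicated watchdog task often tracks a per-task deadline rather than a single flag, which also identifies the task that failed. The sketch below is one possible shape, with hypothetical names (`task_health_t`, `task_checkin`, `find_overdue_task`). It assumes `now` is a free-running 32-bit tick count, such as the value returned by FreeRTOS's `xTaskGetTickCount()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Health record for one monitored task (hypothetical layout). */
typedef struct {
    const char *name;            /* for diagnostics and logging */
    uint32_t    deadline_ticks;  /* maximum allowed gap between check-ins */
    uint32_t    last_checkin;    /* tick count of the most recent check-in */
} task_health_t;

/* Called by a task each time it completes its core work. */
void task_checkin(task_health_t *t, uint32_t now)
{
    assert(t != NULL);
    t->last_checkin = now;
}

/* Returns the index of the first overdue task, or -1 if all are healthy.
 * Unsigned subtraction keeps the comparison correct across tick wraparound. */
int find_overdue_task(const task_health_t *tasks, size_t n, uint32_t now)
{
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((uint32_t)(now - tasks[i].last_checkin) > tasks[i].deadline_ticks) {
            return (int)i;
        }
    }
    return -1;
}
```

The watchdog task would feed only when `find_overdue_task` returns -1. On any other result, it can store the task name in persistent memory before letting the watchdog expire.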

Recovery Strategies

Not all faults require the scorched-earth approach of a full hardware reset. Well-designed embedded systems implement multi-level recovery that escalates through increasingly aggressive actions, using the watchdog reset as the absolute last resort.

Level	Action	When to Use	Example
1 -- Task restart	Kill and restart the failed task	Single task hang or crash	Sensor driver task restarted after I2C timeout
2 -- Subsystem reset	Reset peripheral + driver state	Communication or peripheral failure	Re-initialize UART peripheral and flush buffers
3 -- Warm reset	Software reset preserving some state	Application-level fault, corrupted state	Trigger NVIC system reset, retain RAM for diagnostics
4 -- Cold reset (watchdog)	Full hardware reset, no application state trusted	Unrecoverable fault, total system hang	Watchdog timeout fires, entire system restarts from scratch

The key insight is that levels 1-3 are software-initiated recovery that the application manages proactively. Level 4 -- the watchdog reset -- fires only when the software is so broken that it cannot even manage its own recovery. The watchdog is the safety net beneath all other safety nets.
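One way to express the software-managed part of this ladder is a function that maps a repeat-failure count to a recovery level. The thresholds and names below (`recovery_level_t`, `next_recovery_level`) are illustrative assumptions, not values from any standard. Level 4 is deliberately absent, because the watchdog reset is what happens when this code can no longer run.

```c
#include <assert.h>

/* Software-initiated recovery levels (levels 1 to 3 of the table). */
typedef enum {
    RECOVER_TASK_RESTART    = 1,
    RECOVER_SUBSYSTEM_RESET = 2,
    RECOVER_WARM_RESET      = 3
} recovery_level_t;

/* Choose a recovery level from the number of consecutive failures of the
 * same fault. The thresholds (2 and 4) are hypothetical tuning values. */
recovery_level_t next_recovery_level(unsigned consecutive_failures)
{
    assert(consecutive_failures > 0);
    if (consecutive_failures <= 2) {
        return RECOVER_TASK_RESTART;
    }
    if (consecutive_failures <= 4) {
        return RECOVER_SUBSYSTEM_RESET;
    }
    return RECOVER_WARM_RESET;
}
```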

In practice, a robust system records the reset cause in persistent memory (battery-backed RAM or flash). The hardware usually provides the raw information in a dedicated register, such as RCC_CSR on STM32. On startup, the firmware reads this register to determine whether the last reset was a power-on, a software reset, or a watchdog reset, then clears the flags so that the next reset is reported cleanly. Watchdog resets are logged and reported because they indicate a failure that the software could not handle -- these are bugs that need investigation.
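Decoding the reset flags might look like the sketch below. The bit positions are an assumption based on the RCC_CSR layout of STM32F1/F4-class parts and must be checked against the reference manual for the actual device. The `RESET_FLAG_*` macros and the function names are hypothetical. The function takes the register value as a parameter so it can be exercised on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed RCC_CSR flag positions (STM32F1/F4-style; verify for your part). */
#define RESET_FLAG_PIN   (1u << 26)
#define RESET_FLAG_POR   (1u << 27)
#define RESET_FLAG_SFT   (1u << 28)
#define RESET_FLAG_IWDG  (1u << 29)
#define RESET_FLAG_WWDG  (1u << 30)
#define RESET_FLAG_LPWR  (1u << 31)

typedef enum {
    RESET_POWER_ON, RESET_PIN, RESET_SOFTWARE,
    RESET_IWDG, RESET_WWDG, RESET_LOW_POWER, RESET_UNKNOWN
} reset_cause_t;

/* Several flags can be set at once (the pin flag often accompanies other
 * causes), so the most specific causes are tested first. */
reset_cause_t decode_reset_cause(uint32_t csr)
{
    if (csr & RESET_FLAG_IWDG) return RESET_IWDG;
    if (csr & RESET_FLAG_WWDG) return RESET_WWDG;
    if (csr & RESET_FLAG_SFT)  return RESET_SOFTWARE;
    if (csr & RESET_FLAG_LPWR) return RESET_LOW_POWER;
    if (csr & RESET_FLAG_POR)  return RESET_POWER_ON;
    if (csr & RESET_FLAG_PIN)  return RESET_PIN;
    return RESET_UNKNOWN;
}

/* Human-readable name for log entries. */
const char *reset_cause_name(reset_cause_t cause)
{
    static const char *const names[] = {
        "power-on", "pin", "software",
        "independent watchdog", "window watchdog", "low-power", "unknown"
    };
    assert((unsigned)cause < sizeof names / sizeof names[0]);
    return names[cause];
}
```

On the target, the firmware would pass the live register value to `decode_reset_cause`, write the result to its persistent log, and then clear the hardware flags.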

Watchdog in Safety-Critical Systems

Safety standards impose specific requirements on watchdog implementation that go beyond simply "have a watchdog timer":

Independent clock source: The watchdog must use a clock source independent of the main system clock. If the main oscillator fails, the watchdog must still be able to detect the hang and trigger a reset. This is why the IWDG uses the LSI oscillator.
Watchdog self-test at startup: The system must verify that the watchdog is actually functional before entering normal operation. A common approach is to deliberately let the watchdog expire during a controlled test sequence and verify that the reset occurs. This catches silicon defects or configuration errors that could silently disable the watchdog.
External hardware watchdog: Some safety integrity levels require a watchdog that is completely external to the MCU -- a separate IC on the board with its own oscillator and reset output connected to the MCU's reset pin. This protects against MCU-level faults (latch-up, runaway execution outside flash) that could disable the internal watchdog.
Dual-watchdog pattern: Use both IWDG and WWDG simultaneously. The IWDG catches hangs and survives clock failures. The WWDG validates execution timing. Together, they cover a broader class of failure modes than either one alone.
Question-and-answer watchdog: Some advanced external watchdog ICs require the MCU to respond to a challenge (e.g., compute a CRC or return a specific sequence) rather than simply toggling a pin. This prevents a simple stuck GPIO from keeping the watchdog alive.
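The startup self-test from the list above needs state that survives the deliberate reset. A minimal sketch, assuming a word of memory that is retained across a watchdog reset (for example, a no-init RAM section or a backup register): the marker value, `selftest_result_t`, and `wdt_selftest_step` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SELFTEST_MAGIC 0x57445431u  /* arbitrary "test in progress" marker */

typedef enum {
    SELFTEST_ARM_AND_WAIT,  /* caller starts the watchdog and stops feeding */
    SELFTEST_PASSED,        /* the watchdog reset the system as expected */
    SELFTEST_FAILED         /* the marker was set but the reset had another cause */
} selftest_result_t;

/* Run once early in boot. `retained` points to memory that survives a
 * watchdog reset; `last_reset_was_watchdog` comes from the reset-cause flags. */
selftest_result_t wdt_selftest_step(uint32_t *retained,
                                    bool last_reset_was_watchdog)
{
    assert(retained != 0);
    if (*retained != SELFTEST_MAGIC) {
        *retained = SELFTEST_MAGIC;   /* first pass: mark test in progress */
        return SELFTEST_ARM_AND_WAIT;
    }
    *retained = 0;                    /* second pass: consume the marker */
    return last_reset_was_watchdog ? SELFTEST_PASSED : SELFTEST_FAILED;
}
```

If the watchdog is defective, the reset never arrives. The caller therefore also needs an independent timeout while waiting, and must treat expiry of that timeout as a failed self-test.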

Watchdog During Development and Debugging

The watchdog creates a tension during development: it is essential for production safety but can be frustrating when you are stepping through code with a debugger.

Debug freeze: Most MCUs provide a debug configuration register (e.g., DBGMCU on STM32) that can freeze the watchdog counter when the CPU is halted by a debugger. This prevents the watchdog from firing a reset while you are inspecting variables at a breakpoint.
Do not disable the watchdog during development: Some developers turn off the watchdog entirely during development to avoid the annoyance of unexpected resets. This is dangerous because bugs that would trigger the watchdog -- infinite loops, deadlocks, task starvation -- go completely unnoticed and only surface in production when the watchdog is enabled. A bug that takes 45 minutes to manifest in the field could have been caught on day one if the watchdog had been running.
Better approach: Keep the watchdog active during development. If the timeout is too aggressive for debugging, extend it to a longer value (e.g., 10 seconds instead of 500 ms). Log the reset cause in persistent memory so you can distinguish between intentional resets (power cycle, flash programming) and watchdog resets (bugs). When you see unexpected watchdog resets during development, investigate them immediately -- they are telling you something important about your code.

Debugging Story

A field-deployed environmental sensor node was rebooting every 45 minutes. The development team could not reproduce the issue on the bench. After adding persistent reset-cause logging, they confirmed the resets were watchdog-triggered. Root cause: a slow memory leak in the JSON parser gradually exhausted and fragmented the heap. After approximately 45 minutes of operation, malloc failed inside the MQTT publish task, which then entered an infinite retry loop attempting to allocate the buffer. The watchdog feeding happened to be done inside the same MQTT task, so when the task got stuck in the retry loop, the feed stopped arriving and the watchdog correctly fired. That was luck rather than design: a hang in any other task would have gone undetected. The fix was twofold: patch the JSON parser to eliminate the memory leak, and move watchdog feeding to a separate health-monitoring task that verified all critical tasks (sensor sampling, data logging, MQTT publishing, and BLE advertising) were meeting their deadlines. With the new architecture, a stuck task anywhere in the system starves the watchdog, not just a stuck MQTT task.

What interviewers want to hear is that you understand the watchdog is not just a simple peripheral you "turn on and forget" -- it requires deliberate architectural decisions about where to feed it, how to verify system health before feeding, and how to layer recovery mechanisms so that the watchdog reset is truly the last resort. They want you to articulate why feeding in a timer ISR defeats the purpose, why windowed mode catches a broader class of faults than a simple timeout, and why safety-critical systems require independent clock sources and potentially external watchdog hardware. Demonstrating awareness of the development vs production tension (debug freeze, reset-cause logging, never disabling the watchdog) shows real-world maturity.

Interview Focus

Classic Watchdog Interview Questions

Q1: "What is a watchdog timer and why is it needed?"

Model Answer Starter: "A watchdog timer is a hardware peripheral with a free-running down-counter that triggers a system reset when it reaches zero. Software must periodically 'feed' or 'kick' the watchdog to reload the counter, proving the system is alive. It is needed because embedded systems run unattended -- there is no human to press reset when firmware hangs, enters an infinite loop, or deadlocks. The watchdog acts as the last line of defense, ensuring the system recovers autonomously from any software fault that prevents normal execution. Functional-safety standards like IEC 61508 and ISO 26262 list watchdogs with an independent time base among the expected diagnostic measures for systems where failure could cause harm."

Q2: "What's the difference between an independent and a windowed watchdog?"

Model Answer Starter: "An independent watchdog (IWDG) runs from its own oscillator, separate from the main system clock. This means it survives main clock failures -- if the HSE crystal stops, the IWDG still counts down and resets the system. However, it only detects code that is stuck: as long as the feed arrives before timeout, it is satisfied. A windowed watchdog (WWDG) runs from the system clock and defines a valid time window for feeding. If the code feeds too early (before the window opens) OR too late (after timeout), it triggers a reset. This catches not just hangs but also code that runs too fast -- for example, a corrupted loop counter causing a control task to execute in microseconds instead of milliseconds. The trade-off is that the WWDG cannot survive main clock failure since it depends on the system clock."

Q3: "Where should you feed the watchdog in your code -- and where should you NOT?"

Model Answer Starter: "You should feed the watchdog only after verifying that all critical tasks are healthy. The best pattern is a dedicated health-monitoring task that checks heartbeat flags from every critical subsystem. Each task sets its flag after completing a successful iteration. Only when ALL flags are set does the monitoring task clear them and feed the watchdog. You should NOT feed the watchdog from a timer interrupt, because interrupts continue firing even when the main application is deadlocked -- the watchdog would be kept alive while the system is effectively dead. You also should not feed unconditionally in the main loop without checking task health, as this could mask a single stuck task while the rest of the system appears fine."

Q4: "How does a watchdog interact with an RTOS?"

Model Answer Starter: "In an RTOS environment, the watchdog feed is typically managed by either a dedicated watchdog task or the idle task. The dedicated-task approach is more common: each critical task reports its health by setting a flag or incrementing a counter within its expected period. The watchdog task periodically checks all reported health indicators. If every task has checked in within its deadline, the watchdog is fed. If any task has missed its deadline, the feed is withheld and the system resets. The idle-task approach works differently -- if any higher-priority task monopolizes the CPU through priority inversion or a runaway loop, the idle task never executes and the watchdog starves. Both approaches are valid, but the dedicated task with explicit health checks is more robust and provides better diagnostics about which task failed."

Q5: "How would you determine the correct watchdog timeout for your application?"

Model Answer Starter: "The timeout must be long enough that the system can complete its longest normal execution path and feed the watchdog under worst-case conditions, but short enough that unacceptable downtime does not accumulate before the reset occurs. I start by measuring the worst-case execution time of the critical path, including any blocking operations like flash writes, I2C transactions with clock stretching, or RTOS scheduling delays. I then add a safety margin -- typically 2x to 3x the measured worst case -- to account for variability. For a system with a 100 ms control loop, a 500 ms watchdog timeout gives ample margin. For safety-critical systems, I also consider the maximum tolerable time-to-recovery: if the system must recover within 1 second of a fault, the watchdog timeout plus boot time must fit within that budget."
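The budgeting described in this answer reduces to two bounds. The sketch below uses hypothetical names (`wdt_budget_t`, `wdt_timeout_budget`): the lower bound is the worst-case execution time multiplied by a safety margin, and the upper bound is the tolerable recovery time minus the boot time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Allowed range for the watchdog timeout, in milliseconds. */
typedef struct {
    uint32_t min_timeout_ms;  /* worst case times safety margin */
    uint32_t max_timeout_ms;  /* recovery budget minus boot time */
    bool     feasible;        /* true if some timeout satisfies both bounds */
} wdt_budget_t;

wdt_budget_t wdt_timeout_budget(uint32_t wcet_ms, uint32_t margin_factor,
                                uint32_t max_recovery_ms, uint32_t boot_ms)
{
    wdt_budget_t b;
    assert(wcet_ms > 0 && margin_factor > 0);
    b.min_timeout_ms = wcet_ms * margin_factor;
    b.max_timeout_ms = (max_recovery_ms > boot_ms) ? max_recovery_ms - boot_ms : 0u;
    b.feasible = b.min_timeout_ms <= b.max_timeout_ms;
    return b;
}
```

For a 100 ms worst case, a 3x margin, a 1-second recovery requirement, and an assumed 200 ms boot time, the allowed range is 300 ms to 800 ms, so a 500 ms timeout fits. If no range exists, the critical path has to be shortened or the recovery requirement renegotiated.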

Trap Alerts

Don't say: "I just feed the watchdog in the main while loop" -- this shows no understanding of health verification or multi-task monitoring
Don't forget: That a simple watchdog only catches hangs, not code that runs too fast -- mention windowed mode to show depth
Don't ignore: The clock source question -- an interviewer asking about watchdog reliability expects you to discuss independent oscillators and main clock failure scenarios

Follow-up Questions

"How would you detect which task caused a watchdog reset in a multi-task system?"
"What happens if the watchdog timeout is shorter than the worst-case execution time of your critical path?"
"How would you implement a watchdog for a system that has legitimate long-running operations like flash erasing?"
"What is the difference between an internal MCU watchdog and an external watchdog IC, and when would you use each?"

Practice

❓ Why is feeding the watchdog from a periodic timer interrupt a bad practice?

❓ What additional fault does a windowed watchdog (WWDG) detect compared to a standard watchdog (IWDG)?

❓ Why does the Independent Watchdog (IWDG) use its own oscillator instead of the system clock?

❓ In an RTOS-based system, what is the recommended pattern for watchdog feeding?

❓ After a watchdog reset, what should the firmware do first?

Real-World Tie-In

Remote Agricultural IoT Gateway -- A solar-powered LoRaWAN gateway deployed in a remote vineyard used an IWDG with a 4-second timeout and a health-monitoring task that verified sensor polling, LoRa radio communication, and solar charge controller status before each feed. Over 18 months, the gateway experienced 3 watchdog resets -- all traced to a rare SPI bus lockup with the LoRa radio module during lightning-induced transients. The watchdog ensured the system recovered within 6 seconds each time, with zero data loss thanks to a battery-backed write buffer.

Automotive Body Control Module -- A BCM controlling door locks, windows, and interior lighting used a dual-watchdog architecture: IWDG for general hang detection and WWDG to validate that the CAN message processing loop executed within its 5 ms timing window. During EMC testing, a burst of conducted interference caused the main oscillator to glitch, freezing the WWDG. The IWDG, running on the independent LSI oscillator, detected the hang and reset the system within 500 ms -- exactly the scenario that dual-watchdog architecture is designed to handle.

Medical Device Infusion Pump -- An infusion pump controller used an external watchdog IC with a challenge-response protocol: every 200 ms, the MCU had to compute a CRC over its current state and return it to the external watchdog. A simple GPIO toggle could not satisfy the watchdog, ensuring that even a latch-up condition or runaway program counter (jumping to random memory) could not accidentally keep the watchdog alive. This design supported the device's compliance case for IEC 62304 Class C software.
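The challenge-response idea can be sketched as follows. This is a hypothetical protocol, not the interface of any specific watchdog IC: the watchdog sends a 32-bit challenge, and the MCU must answer with a CRC-8 (polynomial 0x07, initial value 0) over the challenge bytes in big-endian order. The names `crc8`, `qa_answer`, and `qa_verify` are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-8, polynomial x^8 + x^2 + x + 1 (0x07), initial value 0. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* MCU side: compute the answer to a challenge (big-endian byte order). */
uint8_t qa_answer(uint32_t challenge)
{
    const uint8_t bytes[4] = {
        (uint8_t)(challenge >> 24), (uint8_t)(challenge >> 16),
        (uint8_t)(challenge >> 8),  (uint8_t)challenge
    };
    return crc8(bytes, sizeof bytes);
}

/* Watchdog side: accept the feed only if the answer matches the challenge. */
bool qa_verify(uint32_t challenge, uint8_t answer)
{
    return answer == qa_answer(challenge);
}
```

Because the expected answer changes with every challenge, a stuck output or a replayed previous answer fails verification, and the external watchdog asserts reset.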