Your embedded application crashes intermittently — describe your systematic debugging approach.

Question

Accepted Answer

Intermittent crashes are the hardest class of embedded bugs because they resist simple breakpoint debugging. The systematic approach has three phases: contain, reproduce, isolate.

Contain and log: First, make sure your hard fault handler and watchdog reset handler both log diagnostic data (stacked PC, fault status registers, stack pointer values, and a monotonic timestamp) to a region that survives reset: a dedicated section of flash, battery-backed SRAM, or an unused region of RAM marked no-init in the linker script. Collect several crash instances and look for patterns. Does the stacked PC always point to the same function? Does the crash correlate with a particular peripheral operation, interrupt, or time of day? Also check whether the stack pointer at crash time sits near or below the lowest address of the stack region (Cortex-M stacks grow downward). Stack overflow is one of the most common causes of intermittent hard faults in embedded systems, and it manifests differently each time because the data it corrupts varies.
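As a rough illustration of the logging step, here is a minimal sketch of a crash record that a fault handler could fill in. The names (`crash_record_t`, `crash_record_save`, `crash_record_valid`), the magic value, and the XOR check word are all my own assumptions, not part of any vendor library. The fault status values are passed in as plain arguments so the sketch stays portable; on a real Cortex-M3/M4 target you would read them from the SCB registers (CFSR, HFSR) and take the frame pointer from MSP or PSP in the handler's assembly shim.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CRASH_MAGIC 0xC0DEFA17u /* arbitrary marker value (assumption) */

/* Order of the 8 words Cortex-M hardware stacks on exception entry
 * (basic frame, no floating-point state). */
enum { FRAME_R0, FRAME_R1, FRAME_R2, FRAME_R3,
       FRAME_R12, FRAME_LR, FRAME_PC, FRAME_XPSR };

typedef struct {
    uint32_t magic;     /* CRASH_MAGIC when the record holds data */
    uint32_t pc;        /* stacked PC: instruction at or near the fault */
    uint32_t lr;        /* stacked LR: caller of the faulting code */
    uint32_t sp;        /* stack pointer value supplied by the handler */
    uint32_t cfsr;      /* configurable fault status */
    uint32_t hfsr;      /* hard fault status */
    uint32_t timestamp; /* monotonic tick count */
    uint32_t check;     /* XOR of all preceding words */
} crash_record_t;

static_assert(sizeof(crash_record_t) == 32, "record should have no padding");

/* On target, one instance would live in RAM that startup code does not
 * zero, for example (section name depends on your linker script):
 *   static crash_record_t g_crash __attribute__((section(".noinit")));
 */

static uint32_t crash_xor(const crash_record_t *r)
{
    return r->magic ^ r->pc ^ r->lr ^ r->sp ^ r->cfsr ^ r->hfsr ^ r->timestamp;
}

/* Fill the record from the hardware-stacked frame and status values. */
void crash_record_save(crash_record_t *rec, const uint32_t *frame,
                       uint32_t cfsr, uint32_t hfsr,
                       uint32_t sp, uint32_t timestamp)
{
    rec->magic = CRASH_MAGIC;
    rec->pc = frame[FRAME_PC];
    rec->lr = frame[FRAME_LR];
    rec->sp = sp;
    rec->cfsr = cfsr;
    rec->hfsr = hfsr;
    rec->timestamp = timestamp;
    rec->check = crash_xor(rec);
}

/* True only if the record was written by crash_record_save and has not
 * been disturbed since (e.g. by a power cycle randomizing the RAM). */
bool crash_record_valid(const crash_record_t *rec)
{
    return rec->magic == CRASH_MAGIC && rec->check == crash_xor(rec);
}
```

At boot you would call `crash_record_valid` on the no-init instance, report the contents over whatever channel you have, and then clear the magic word so the same crash is not reported twice.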

Reproduce: If the crash log shows a pattern, stress the triggering condition. If crashes correlate with heavy interrupt activity, artificially increase interrupt rates. If they correlate with memory allocation, run stress tests that fragment the heap. Use data watchpoints (available on Cortex-M3 and above via the DWT unit) to break when specific memory locations are written. A watchpoint on the lowest address of the stack region, which is the last word a downward-growing stack reaches, catches stack overflows, and one placed just past the end of a suspect buffer catches overruns. Enable stack canaries or MPU-based stack guards so that corruption is caught immediately rather than after it propagates.
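If you have no MPU or spare watchpoint, a software canary is a cheap substitute. The following is a minimal sketch, assuming a downward-growing stack whose lowest address is `base`; the function names, the fill pattern, and the guard size are hypothetical choices of mine. It paints the region with a known pattern, checks that the guard words at the low end are still intact, and reports how much of the stack has never been touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL  0xA5A5A5A5u /* paint pattern (assumption) */
#define GUARD_WORDS 4u          /* words checked at the low end (assumption) */

static_assert(GUARD_WORDS >= 1u, "need at least one guard word");

/* Fill the whole region with the pattern. On target, do this before the
 * stack is in use (e.g. at task creation), never on the live stack. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL;
    }
}

/* True while the lowest GUARD_WORDS words still hold the pattern, i.e.
 * the stack has not grown down into (or past) its limit. */
bool stack_guard_intact(const uint32_t *base, size_t words)
{
    size_t n = (words < GUARD_WORDS) ? words : GUARD_WORDS;
    for (size_t i = 0; i < n; i++) {
        if (base[i] != STACK_FILL) {
            return false;
        }
    }
    return true;
}

/* Count untouched words from the low end: the remaining headroom at the
 * deepest point the stack has reached so far. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Calling `stack_guard_intact` from a periodic tick or the idle task turns a silent overflow into an immediate, attributable failure. It is a heuristic: a frame that skips over the guard words without writing them, or that happens to write the fill pattern, will not be detected.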

Isolate: If no pattern emerges from the logs, bisect the system. Disable subsystems one at a time: turn off DMA, reduce interrupt sources, simplify the RTOS task structure. If the crash disappears when a particular interrupt is disabled, the bug is likely in that ISR or in a race on data shared between the ISR and a task. Use static analysis tools (PC-lint, Polyspace) to scan for undefined behavior, uninitialized variables, and race conditions. Check every variable shared between ISR context and thread context for proper volatile qualification and atomic access; volatile alone does not make a multi-word or read-modify-write access atomic. In RTOS environments, verify that every resource shared between tasks is protected by a mutex and that all tasks acquire multiple mutexes in the same order. Two tasks that take the same pair of mutexes in opposite orders can deadlock, which may surface as a watchdog timeout rather than a fault.
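Lock-ordering mistakes are hard to spot by inspection, so it can help to instrument them in debug builds. Below is a minimal sketch of a rank-based checker: each mutex is assigned a fixed rank, and a task may only take a mutex whose rank is higher than every rank it already holds. The names, the task and rank limits, and the bitmask bookkeeping are all assumptions for illustration; this is not an API of any particular RTOS, and you would call these helpers from your own wrappers around the real mutex take/give calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_TASKS 4u  /* number of tracked tasks (assumption) */
#define MAX_RANKS 32u /* one bit per rank in a 32-bit mask */

static_assert(MAX_RANKS <= 32u, "rank bitmask is 32 bits wide");

/* Bit r set means the task currently holds the mutex with rank r.
 * In a real system each entry must only be touched by its own task,
 * or the accesses must sit inside a critical section. */
static uint32_t held_mask[MAX_TASKS];

/* Record an acquisition. Returns false (and records nothing) if the
 * task already holds a mutex of equal or higher rank, which would be
 * an ordering violation. */
bool lock_order_acquire(unsigned task, unsigned rank)
{
    if (task >= MAX_TASKS || rank >= MAX_RANKS) {
        return false;
    }
    if ((held_mask[task] >> rank) != 0u) {
        return false;
    }
    held_mask[task] |= (UINT32_C(1) << rank);
    return true;
}

/* Record a release; out-of-range arguments are ignored. */
void lock_order_release(unsigned task, unsigned rank)
{
    if (task < MAX_TASKS && rank < MAX_RANKS) {
        held_mask[task] &= ~(UINT32_C(1) << rank);
    }
}
```

When `lock_order_acquire` returns false, log the task and both ranks (or trap into the debugger) instead of taking the mutex. The violation is then reported the first time the bad ordering executes, even if the timing needed for an actual deadlock never occurs during the test run.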