How do you detect and handle memory corruption at runtime?

Question

Accepted Answer

Memory corruption in embedded systems can come from hardware faults (single-event upsets from radiation, aging SRAM cells, marginal voltage levels), software bugs (buffer overflows, wild pointers, stack overflow, DMA misconfiguration), or environmental stress (EMI, temperature extremes). It is insidious because it can silently alter data or code without triggering an immediate crash, so detection requires proactive runtime checks at multiple levels.

RAM integrity testing uses March algorithms (March C-, March B), which write and read test patterns to every cell in a systematic order designed to expose stuck-at faults, coupling faults, and address decoder faults. A full March test over the entire RAM takes too long to run in one go, so safety-critical systems typically test a rotating subset of RAM in each main loop iteration (a "walking" test) and reach full coverage over many cycles. The key challenge is testing RAM that is in use: you must save the contents, run the test, and restore the contents, all within a critical section, so that no interrupt handler can observe or overwrite the block while it holds test patterns.

For flash/ROM integrity, a CRC-32 or SHA-256 is computed over the entire code region and compared against a stored reference value. This check runs periodically (every second, or every main loop iteration in high-integrity systems) and detects bit flips in the stored program. MCUs with ECC on flash typically detect and correct single-bit errors in hardware, but the software CRC also catches multi-bit errors and corruption from failed flash writes.
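As a rough illustration of the periodic ROM check, here is a minimal sketch in C. It assumes the standard reflected CRC-32 (polynomial 0xEDB88320) and spreads the work over several calls so that no single main loop iteration pays for the whole region. The names `rom_check_t`, `rom_check_init`, and `rom_check_step` are made up for this example and do not come from any vendor library. On a real target the region bounds and the reference CRC would normally come from the linker script and a post-build step.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, the common zlib/Ethernet
 * variant). The caller owns the running state, so a region can be fed in
 * small chunks. Start with 0xFFFFFFFF and XOR the final state with
 * 0xFFFFFFFF to obtain the CRC. */
static uint32_t crc32_update(uint32_t state, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        state ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            state = (state & 1u) ? (state >> 1) ^ 0xEDB88320u : (state >> 1);
        }
    }
    return state;
}

/* Hypothetical bookkeeping for one incremental pass over the code region. */
typedef struct {
    const uint8_t *start;    /* first byte of the checked region */
    size_t         length;   /* region size in bytes */
    size_t         offset;   /* bytes already processed in this pass */
    uint32_t       state;    /* running CRC state */
    uint32_t       expected; /* reference CRC, assumed stored outside the region */
} rom_check_t;

typedef enum { ROM_CHECK_BUSY, ROM_CHECK_PASS, ROM_CHECK_FAIL } rom_check_result_t;

static void rom_check_init(rom_check_t *c, const uint8_t *start,
                           size_t length, uint32_t expected)
{
    c->start    = start;
    c->length   = length;
    c->expected = expected;
    c->offset   = 0;
    c->state    = 0xFFFFFFFFu;
}

/* Process at most `chunk` bytes. Returns BUSY until the pass is complete,
 * then PASS or FAIL, and rearms itself for the next pass. */
static rom_check_result_t rom_check_step(rom_check_t *c, size_t chunk)
{
    size_t remaining = c->length - c->offset;
    size_t n = (remaining < chunk) ? remaining : chunk;

    c->state = crc32_update(c->state, c->start + c->offset, n);
    c->offset += n;

    if (c->offset < c->length) {
        return ROM_CHECK_BUSY;
    }

    uint32_t crc = c->state ^ 0xFFFFFFFFu;
    c->offset = 0;              /* restart from the beginning next time */
    c->state  = 0xFFFFFFFFu;
    return (crc == c->expected) ? ROM_CHECK_PASS : ROM_CHECK_FAIL;
}
```

In a main loop you would call `rom_check_step` once per iteration with a chunk size that fits your timing budget, and enter a safe state on `ROM_CHECK_FAIL`. The reference CRC has to live outside the checked region (or be excluded from it), otherwise the stored value changes the result it is supposed to match.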

Stack overflow detection is critical because stack corruption produces some of the hardest-to-diagnose failures: variables on or next to the stack are silently overwritten, causing incorrect behavior that may not manifest until much later. Where one is available, the MPU (Memory Protection Unit) is the most reliable mechanism: configure a guard region (typically 32-256 bytes) immediately below the stack as "no access." Any stack growth into this region triggers an immediate MemManage fault (on Arm Cortex-M), catching the overflow at the instruction that caused it. Without an MPU, the stack canary pattern places a known value (like 0xDEADBEEF) at the bottom of the stack, that is, at the lowest address of a descending stack, and periodically checks it; if the value has changed, the stack has overflowed. This is less reliable because a large stack allocation can skip over the canary without writing to it, and because the overflow is only noticed at the next check, but it catches most cases.

For global data integrity, redundant storage with voting keeps three copies of each critical variable and takes a majority vote on every read, so a single corrupted copy is outvoted. Combined with ECC RAM, MPU stack guards, periodic ROM CRC checks, and plausibility validation on all sensor inputs, these mechanisms provide defense-in-depth against memory corruption in safety-critical systems.
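The triple-copy voting scheme can be sketched roughly as follows. This is only one way to do it: it votes on whole words and writes the winning value back to repair the odd copy out. The type and function names (`tmr_u32_t`, `tmr_write`, `tmr_read`) are invented for this example. On real hardware you would probably also place the three copies in different RAM sections or banks, which this sketch does not attempt.

```c
#include <stdint.h>

/* A critical 32-bit value held in three copies. `volatile` keeps the
 * compiler from collapsing the three reads into one. */
typedef struct {
    volatile uint32_t copy[3];
} tmr_u32_t;

typedef enum {
    TMR_OK,        /* all three copies agreed */
    TMR_REPAIRED,  /* one copy disagreed and was rewritten */
    TMR_FAILED     /* no two copies agree; *out is left untouched */
} tmr_status_t;

static void tmr_write(tmr_u32_t *v, uint32_t value)
{
    v->copy[0] = value;
    v->copy[1] = value;
    v->copy[2] = value;
}

/* Word-wise 2-of-3 majority vote on read. */
static tmr_status_t tmr_read(tmr_u32_t *v, uint32_t *out)
{
    uint32_t a = v->copy[0];
    uint32_t b = v->copy[1];
    uint32_t c = v->copy[2];

    if (a == b && b == c) {
        *out = a;
        return TMR_OK;
    }

    uint32_t voted;
    if (a == b || a == c) {
        voted = a;              /* a is part of the majority */
    } else if (b == c) {
        voted = b;              /* a is the odd one out */
    } else {
        return TMR_FAILED;      /* three different values: cannot vote */
    }

    tmr_write(v, voted);        /* scrub the corrupted copy */
    *out = voted;
    return TMR_REPAIRED;
}
```

A caller would treat `TMR_REPAIRED` as a loggable event (repeated repairs at the same address hint at a failing cell) and `TMR_FAILED` as a reason to fall back to a safe default. If an interrupt can write the variable, the read and the repair need to sit inside a critical section, otherwise a half-finished `tmr_write` looks like corruption.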