What is the cache coherency problem on Cortex-M7, and how do you solve it with DMA?

Question

Accepted Answer

Cortex-M7 processors typically include a data cache (D-Cache): a small, fast memory that holds copies of recently accessed main-memory locations. The cache is transparent to normal CPU code, but the DMA controller accesses RAM directly through the bus matrix and never sees the contents of the CPU's cache. This creates two coherency problems:

DMA writes, CPU reads (most common): DMA deposits fresh data into RAM, but the CPU's D-Cache still holds stale copies of those addresses from a previous access. When the CPU reads the buffer, it gets the old cached values instead of the new DMA data. This manifests as the ADC buffer showing the same readings repeatedly, or UART RX data appearing frozen — a maddening bug because everything works on Cortex-M4 (no cache) and breaks on M7.

CPU writes, DMA reads: The CPU updates a transmit buffer, but the writes are sitting in the cache and have not been flushed to main RAM. When DMA reads the buffer to feed the peripheral (UART TX, SPI TX), it reads stale RAM data — potentially garbage from the previous transfer or uninitialized memory.

Solutions, from simplest to most performant:

(1) MPU-based non-cacheable region: Use the Memory Protection Unit to mark a RAM region as non-cacheable and place all DMA buffers there. This is simple and needs no per-transfer maintenance, but every CPU access to that region bypasses the cache, which can hurt performance for frequently accessed buffers.

(2) Explicit cache maintenance: Before DMA reads a CPU-written buffer, call SCB_CleanDCache_by_Addr() to write dirty cache lines back to RAM. After DMA writes to a buffer, call SCB_InvalidateDCache_by_Addr() to discard stale cache lines so the CPU fetches fresh data. Critical requirement: DMA buffers must be aligned to 32-byte cache line boundaries (__attribute__((aligned(32)))) and sized as multiples of 32 bytes. If a buffer shares a cache line with an unrelated variable, invalidating that line also discards any CPU writes to that variable that have not yet reached RAM, which is a subtle corruption bug.

(3) DTCM (Data Tightly Coupled Memory): This memory region is not cached and provides single-cycle access latency. It is ideal for small, frequently accessed DMA buffers, provided the DMA controller on your part can actually reach DTCM (check the bus matrix in the reference manual). The limitation is size: DTCM is typically 64-128 KB, shared with the stack and other performance-critical data.
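As a rough illustration of option (2), here is a minimal sketch of cache-line-aligned buffers with a clean before a TX transfer and an invalidate after an RX transfer. Only SCB_CleanDCache_by_Addr() and SCB_InvalidateDCache_by_Addr() come from CMSIS. The names DMA_BUF_LEN, dma_tx_prepare, dma_rx_complete, and the USE_CMSIS_DEVICE_HEADER switch are made up for this example, and stm32f7xx.h is only an assumed device header. Without that switch, the block uses host-only stand-ins for the two CMSIS calls so the logic can be compiled and checked off-target.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_CMSIS_DEVICE_HEADER   /* hypothetical switch, defined in the firmware build */
#include "stm32f7xx.h"           /* assumed device header; substitute the one for your part */
#else
/* Host-only stand-ins: they just record their arguments. */
static uintptr_t last_clean_addr, last_inval_addr;
static int32_t   last_clean_size, last_inval_size;
static void SCB_CleanDCache_by_Addr(void *addr, int32_t dsize)
{ last_clean_addr = (uintptr_t)addr; last_clean_size = dsize; }
static void SCB_InvalidateDCache_by_Addr(void *addr, int32_t dsize)
{ last_inval_addr = (uintptr_t)addr; last_inval_size = dsize; }
#endif

#define DCACHE_LINE_SIZE 32u     /* Cortex-M7 D-Cache line size in bytes */
/* Round a byte count up to a whole number of cache lines. */
#define DCACHE_ROUND_UP(n) (((n) + DCACHE_LINE_SIZE - 1u) & ~(DCACHE_LINE_SIZE - 1u))
#define DMA_BUF_LEN 100u         /* hypothetical payload size */

/* Aligned to, and padded out to, whole cache lines so that no unrelated
 * variable can share a line with either buffer. */
static uint8_t tx_buf[DCACHE_ROUND_UP(DMA_BUF_LEN)] __attribute__((aligned(32)));
static uint8_t rx_buf[DCACHE_ROUND_UP(DMA_BUF_LEN)] __attribute__((aligned(32)));
static_assert(sizeof(tx_buf) % DCACHE_LINE_SIZE == 0u, "tx_buf must cover whole cache lines");
static_assert(sizeof(rx_buf) % DCACHE_LINE_SIZE == 0u, "rx_buf must cover whole cache lines");

/* CPU fills tx_buf, then cleans it so the data is in RAM before DMA reads it.
 * Returns 0 on success, -1 if the payload does not fit. */
static int dma_tx_prepare(const uint8_t *data, size_t len)
{
    if (len > sizeof(tx_buf)) {
        return -1;
    }
    memcpy(tx_buf, data, len);
    SCB_CleanDCache_by_Addr((void *)tx_buf, (int32_t)sizeof(tx_buf));
    /* Start the memory-to-peripheral DMA transfer here (driver-specific). */
    return 0;
}

/* Call from the DMA transfer-complete handler: drop cached copies of rx_buf
 * so the next CPU reads come from RAM. */
static const uint8_t *dma_rx_complete(void)
{
    SCB_InvalidateDCache_by_Addr((void *)rx_buf, (int32_t)sizeof(rx_buf));
    return rx_buf;
}
```

Both helpers operate on the whole padded buffer rather than just the payload length, which keeps every maintenance call on cache-line boundaries. With a 100-byte payload each buffer occupies 128 bytes, or four cache lines.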