Explain how DMA works. What are some of the issues that you need to worry about when using DMA?
How DMA works
Direct Memory Access lets a dedicated DMA controller move data between memory and peripherals (or memory-to-memory) without the CPU copying each byte, freeing the CPU for other work and greatly increasing throughput.
Typical flow:
- Configuration: The CPU programs a DMA channel with source address, destination address, transfer length, transfer width (byte/half/word), address-increment rules (which side increments), trigger/request source (e.g., UART RX request), and mode (single, block, circular/ring, scatter-gather, etc.).
- Trigger: The peripheral asserts a DMA request (DRQ) when it needs service (e.g., RX FIFO has data, TX FIFO has space), or the CPU starts a memory-to-memory transfer.
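- Bus arbitration: The DMA controller becomes bus master and arbitrates with the CPU for the memory bus. It may take individual bus cycles between CPU accesses ("cycle stealing"), hold the bus for a whole block (burst mode), or transfer only when the CPU is not using the bus (transparent mode).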
- Transfer: The controller moves data directly between endpoints, updating its address/count registers, until the count is exhausted (or continuously, in circular mode).
- Completion: The controller raises a transfer-complete interrupt (and/or half-transfer interrupt for double-buffering). The CPU's ISR then processes the buffer or queues the next transfer.
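The flow above can be sketched as a small software model. This is only an illustration, not a real controller: the `dma_channel_t` fields, `dma_configure`, `dma_service_request`, and the simulated UART register are all hypothetical names, and actual hardware uses vendor-specific registers and performs the copy on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical channel registers; real controllers use different names and layouts. */
typedef struct {
    const uint8_t *src;      /* source address */
    uint8_t       *dst;      /* destination address */
    size_t         count;    /* elements still to transfer */
    size_t         width;    /* bytes per element: 1, 2, or 4 */
    int            src_inc;  /* nonzero: advance source after each element */
    int            dst_inc;  /* nonzero: advance destination after each element */
    int            enabled;
    int            complete; /* stands in for the transfer-complete interrupt flag */
} dma_channel_t;

/* Configuration: the CPU programs addresses, length, width, and increment rules. */
static void dma_configure(dma_channel_t *ch, const void *src, void *dst,
                          size_t count, size_t width, int src_inc, int dst_inc)
{
    assert(width == 1 || width == 2 || width == 4);
    ch->src = src;
    ch->dst = dst;
    ch->count = count;
    ch->width = width;
    ch->src_inc = src_inc;
    ch->dst_inc = dst_inc;
    ch->enabled = 0;
    ch->complete = 0;
}

/* Trigger + transfer: called once per peripheral request, moves one element. */
static void dma_service_request(dma_channel_t *ch)
{
    if (!ch->enabled || ch->count == 0) {
        return;
    }
    memcpy(ch->dst, ch->src, ch->width);
    if (ch->src_inc) ch->src += ch->width;
    if (ch->dst_inc) ch->dst += ch->width;
    if (--ch->count == 0) {   /* completion: count exhausted */
        ch->enabled = 0;
        ch->complete = 1;
    }
}

/* Usage: receive 4 bytes from a simulated UART data register into a buffer.
   The source is fixed (a register), the destination increments (a buffer). */
static int dma_demo(uint8_t out[4])
{
    static const uint8_t incoming[4] = { 'D', 'M', 'A', '!' };
    uint8_t uart_rx_reg = 0;          /* simulated peripheral data register */
    dma_channel_t ch;

    dma_configure(&ch, &uart_rx_reg, out, 4, 1, 0, 1);
    ch.enabled = 1;
    for (size_t i = 0; i < 4; i++) {
        uart_rx_reg = incoming[i];    /* the peripheral receives a byte ... */
        dma_service_request(&ch);     /* ... and asserts its DMA request */
    }
    return ch.complete;
}
```

Note how the peripheral side does not increment while the memory side does; getting that pair wrong is a common configuration mistake.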
Variants: scatter-gather / linked-list descriptors (one logical transfer spanning many noncontiguous buffers), circular/ping-pong buffers (continuous streaming), and DMA engines integrated with peripherals.
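As a rough illustration of the scatter-gather idea, the sketch below walks a linked list of descriptors and gathers the fragments into one destination, which is what the engine would do in hardware. The `sg_desc_t` layout and the `sg_gather` and `sg_demo` functions are hypothetical; real engines define their own descriptor fields and alignment requirements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor layout for illustration only. */
typedef struct sg_desc {
    const uint8_t        *src;   /* start of this fragment */
    size_t                len;   /* bytes in this fragment */
    const struct sg_desc *next;  /* next descriptor, or NULL to end the chain */
} sg_desc_t;

/* Software stand-in for the engine: follow the chain and copy each fragment
   into dst back to back. Returns the total bytes written, or 0 if the chain
   does not fit in dst_cap (dst may then be partially written). */
static size_t sg_gather(const sg_desc_t *d, uint8_t *dst, size_t dst_cap)
{
    size_t written = 0;

    assert(dst != NULL);
    for (; d != NULL; d = d->next) {
        if (d->len > dst_cap - written) {
            return 0;
        }
        memcpy(dst + written, d->src, d->len);
        written += d->len;
    }
    return written;
}

/* Usage: one logical packet built from three noncontiguous buffers. */
static size_t sg_demo(uint8_t *dst, size_t dst_cap)
{
    static const uint8_t header[]  = { 0xAA, 0x55 };
    static const uint8_t payload[] = { 1, 2, 3 };
    static const uint8_t trailer[] = { 0xFF };

    const sg_desc_t d2 = { trailer, sizeof trailer, NULL };
    const sg_desc_t d1 = { payload, sizeof payload, &d2 };
    const sg_desc_t d0 = { header,  sizeof header,  &d1 };

    return sg_gather(&d0, dst, dst_cap);
}
```

The CPU only builds the chain and starts the first descriptor; it is not interrupted between fragments, which is the main attraction over issuing three separate transfers.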
Issues to worry about
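- Cache coherency (the big one): On systems with data caches, the CPU and the DMA controller can see different contents for the same address, because the controller accesses main memory directly and bypasses the CPU's cache.
  - Before a memory→peripheral (TX) DMA, you must clean (write back) the cache lines covering the buffer so DMA reads the latest data.
  - After a peripheral→memory (RX) DMA, you must invalidate the cache lines covering the buffer so the CPU reads fresh data instead of stale cached values. The buffer should also hold no dirty lines when the transfer starts, or a later eviction can overwrite what DMA wrote.
  - Many MCUs lack data caches (so this is moot), but on Cortex-M7/Cortex-A this is a frequent source of bugs. Solutions: cache maintenance operations, non-cacheable/MPU-marked DMA regions, or hardware cache-coherent interconnects.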
- Memory alignment & transfer width: Buffers often must be aligned to the transfer width/cache-line; misalignment causes faults or wrong behavior. Cache-line alignment also prevents a cache-line-straddling hazard where a clean/invalidate on a DMA buffer’s line corrupts an adjacent CPU variable sharing that line. (This is a data-corruption issue, distinct from “false sharing,” which is an SMP coherence performance effect.)
- Buffer ownership / lifetime: The buffer must remain valid and untouched by the CPU for the entire transfer. Don't free it, don't put it on a stack that unwinds, and don't write to it mid-transfer. Use clear ownership handoff between CPU and DMA.
- Address translation: DMA controllers typically work with physical (bus) addresses, not virtual addresses. On an MMU system you must translate and ensure pages are pinned/contiguous (or use an IOMMU/scatter-gather).
- Memory region reachability: Some DMA engines can't reach all memory (e.g., can access SRAM but not certain TCM/flash, or are limited to a 32-bit address window). Verify the source/dest are DMA-accessible.
- Synchronization / race conditions: Polling vs. interrupt completion, half-transfer handling, and the order in which you read DMA count vs. peripheral flags all matter. Use the completion interrupt or the correct flag sequence.
- volatile and compiler/memory ordering: Status flags updated by DMA (or by its ISR) must be volatile so the compiler does not keep them in registers; insert barriers (e.g., DSB) so configuration writes complete before you enable the channel.
- Throughput / bus contention: DMA competing with the CPU for the bus can stall the CPU; choose burst vs. cycle-steal modes and channel priorities carefully for real-time deadlines.
- Error handling: Handle DMA error/abort interrupts (bus errors, FIFO over/underrun), and watch for overrun if the peripheral outpaces the configured transfer.
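- Endianness / element size mismatches: The source and destination may disagree on byte order or element width (e.g., a 16-bit peripheral data register feeding a buffer the CPU reads as 32-bit words), so check how the controller packs and orders data.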
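The cache-line alignment point from the list above can be made concrete with a little address arithmetic. The sketch below assumes a 32-byte cache line (the Cortex-M7 value; check your own core), and `CACHE_ROUND_UP`, `dma_rx_buf_t`, `cache_range_for`, and `dma_buffer_is_line_safe` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* assumption: 32-byte lines, as on Cortex-M7 */

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ROUND_UP(n) \
    (((n) + (CACHE_LINE - 1u)) & ~(size_t)(CACHE_LINE - 1u))

/* A 100-byte payload placed in a buffer that starts on a line boundary and
   occupies whole lines, so no unrelated variable can share a line with it. */
typedef struct {
    _Alignas(CACHE_LINE) uint8_t data[CACHE_ROUND_UP(100)];
} dma_rx_buf_t;

static_assert(sizeof(dma_rx_buf_t) % CACHE_LINE == 0,
              "DMA buffer must cover whole cache lines");

/* The span of whole cache lines that a clean/invalidate would touch. */
typedef struct {
    uintptr_t start;
    size_t    len;
} cache_range_t;

/* Widen an arbitrary [addr, addr + len) buffer to cache-line boundaries. */
static cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    cache_range_t r;
    r.start = addr & ~(uintptr_t)(CACHE_LINE - 1u);
    r.len   = CACHE_ROUND_UP((size_t)(addr - r.start) + len);
    return r;
}

/* Nonzero if maintenance on this buffer cannot touch neighbouring data. */
static int dma_buffer_is_line_safe(uintptr_t addr, size_t len)
{
    return (addr % CACHE_LINE == 0u) && (len % CACHE_LINE == 0u);
}
```

For example, a 100-byte buffer at address 0x20000024 widens to 128 bytes starting at 0x20000020, so an invalidate would also discard 4 bytes before and 24 bytes after the buffer; that is the adjacent-variable hazard. On a Cortex-M7 with CMSIS, the widened range is what you would pass to `SCB_CleanDCache_by_Addr` before TX and `SCB_InvalidateDCache_by_Addr` after RX.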
