Peripherals
intermediate
Weight: 4/10

DMA

DMA (Direct Memory Access) fundamentals including transfer modes, channel configuration, circular buffers, and cache coherency for embedded systems.

peripherals
dma
memory
transfer
performance

Quick Cap

DMA (Direct Memory Access) offloads bulk data transfer from the CPU to a dedicated hardware controller, enabling peripheral-to-memory, memory-to-peripheral, and memory-to-memory transfers without CPU intervention. The CPU configures the transfer parameters -- source address, destination address, transfer count, and data width -- then the DMA controller executes autonomously, freeing the CPU to perform computation or enter low-power sleep. Interviewers test whether you understand when DMA is beneficial versus overkill, how circular buffers enable continuous streaming, and how cache coherency can silently corrupt DMA data on cached processors.

Key Facts:

  • Bus master: DMA controller acts as a bus master, reading from source and writing to destination independently of the CPU
  • Three directions: Peripheral-to-memory, memory-to-peripheral, and memory-to-memory transfers
  • Transfer modes: Normal (one-shot) and circular (auto-restart) with half-transfer and transfer-complete interrupts
  • Channel arbitration: Software-configurable priority levels determine which DMA channel wins when multiple requests are pending
  • Cache coherency: On cached processors (Cortex-M7, Cortex-A), DMA buffers must be in non-cacheable memory or caches must be manually managed
  • Setup overhead: DMA has register configuration cost; for very small transfers (< ~16 bytes), CPU memcpy is often faster

Deep Dive

At a Glance

| Characteristic | Detail |
| --- | --- |
| Transfer Directions | Peripheral-to-Memory, Memory-to-Peripheral, Memory-to-Memory |
| Channels | Typically 4-16 per DMA controller (MCU-dependent) |
| Trigger Sources | Peripheral requests (ADC EOC, UART RXNE, SPI TXE, Timer update) or software trigger |
| Data Widths | Byte (8-bit), half-word (16-bit), word (32-bit) |
| Burst Support | Single transfer, 4-beat, 8-beat, or 16-beat bursts (advanced controllers) |
| Interrupt Types | Transfer Complete (TC), Half Transfer (HT), Transfer Error (TE) |

Why DMA Matters

The fundamental problem DMA solves is straightforward: without DMA, the CPU must execute a load-store loop for every byte transferred between a peripheral and memory. During that time, the CPU cannot run application logic, service other interrupts promptly, or enter a low-power state. For a single UART byte, this is negligible. For streaming tens of thousands of ADC samples per second, receiving continuous SPI data from an external flash, or updating a display framebuffer, the CPU spends a significant fraction of its cycles simply copying bytes -- work that requires no computation, just moving data from point A to point B.

DMA eliminates this bottleneck. The CPU sets up the transfer once -- source address, destination address, number of items, data width, and direction -- and the DMA controller takes over. The controller generates its own bus transactions, reading from the source and writing to the destination one element at a time (or in bursts), without any CPU instruction execution. When the transfer is complete, the DMA controller raises an interrupt so the CPU can process the data or start the next transfer.
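As a concrete picture of the configuration step, here is a minimal software model of a DMA channel. The struct fields and the `dma_run` loop are hypothetical stand-ins for memory-mapped registers, not any vendor's actual API -- a sketch of what the controller does after the CPU writes source, destination, count, and width:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical channel configuration, modeled as a plain struct.
 * A real controller exposes these as memory-mapped registers. */
typedef struct {
    const uint8_t *src;     /* source address */
    uint8_t       *dst;     /* destination address */
    size_t         count;   /* number of items to move */
    size_t         width;   /* bytes per item: 1, 2, or 4 */
    int            src_inc; /* increment source after each item? */
    int            dst_inc; /* increment destination after each item? */
} dma_channel_t;

/* Software model of what the hardware does once the channel is enabled:
 * move 'count' items of 'width' bytes, honoring the increment flags
 * (a peripheral data register, for example, is not incremented). */
void dma_run(dma_channel_t *ch)
{
    for (size_t i = 0; i < ch->count; i++) {
        memcpy(ch->dst, ch->src, ch->width);
        if (ch->src_inc) ch->src += ch->width;
        if (ch->dst_inc) ch->dst += ch->width;
    }
    /* A real controller would now raise the Transfer Complete interrupt. */
}
```

The increment flags are what distinguish the three directions: a peripheral-to-memory transfer holds `src` fixed on the data register and increments `dst` through the buffer.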

This architecture is critical for: ADC streaming at high sample rates, UART and SPI bulk data reception and transmission, display framebuffer updates, audio codec I/O via I2S or SAI, and any application where the CPU needs to remain responsive while large data movements happen in the background.

Peripheral (e.g. ADC)        DMA Controller             Memory (RAM buffer)
        │                          │                            │
        │────── DMA request ──────→│                            │
        │── read from peripheral ─→│                            │
        │                          │───── write to memory ─────→│
        │                          │                            │
        │     (repeat N times)     │                            │
        │                          │                            │
        │                          │── Transfer Complete IRQ ──→ CPU

The DMA controller autonomously moves data from peripheral to memory (or vice versa). The CPU is only involved at setup and when the transfer-complete interrupt fires.

Transfer Directions

| Direction | Source | Destination | Example Use Case |
| --- | --- | --- | --- |
| Peripheral to Memory | ADC data register | RAM buffer | ADC continuous sampling |
| Memory to Peripheral | RAM buffer | UART transmit data register | UART bulk transmit |
| Memory to Memory | Source RAM region | Destination RAM region | Buffer copy, framebuffer update |

Peripheral-to-memory and memory-to-peripheral transfers are paced by the peripheral's request signal -- the DMA controller waits until the peripheral asserts its DMA request line before performing each element transfer. This prevents the DMA from reading an ADC data register before the next conversion is ready, or writing to a UART transmit register before the previous byte has shifted out.

Memory-to-memory transfers have no external pacing signal. The DMA controller runs as fast as the bus arbiter allows, transferring data at maximum bus throughput. This makes M2M DMA useful for large memory copies, but it also means M2M transfers consume significant bus bandwidth and can stall CPU memory accesses during the transfer.

DMA vs CPU Transfer: When to Use Each

| Criteria | CPU Transfer | DMA Transfer |
| --- | --- | --- |
| Setup overhead | None (just a loop) | Must configure source, dest, count, width registers |
| Throughput | Limited by instruction execution speed | Near bus-bandwidth speed |
| CPU load | 100% during transfer | Near 0% during transfer (CPU is free) |
| Latency | Immediate start | Small setup latency (~microseconds) |
| Best for small transfers | Yes (< ~16 bytes) | No -- setup cost dominates |
| Best for large transfers | No -- wastes CPU cycles | Yes -- CPU does useful work instead |
| Continuous streaming | Requires constant CPU attention | Circular mode runs indefinitely |
| Complexity | Simple loop | Channel config, interrupts, cache management |

The key insight is that DMA has a fixed setup overhead: writing to several configuration registers, enabling the channel, and handling the completion interrupt. For transfers smaller than approximately 16 bytes, CPU memcpy is often faster because the setup cost exceeds the time saved. DMA becomes increasingly advantageous as transfer size grows, and it is essential for continuous streaming scenarios where the CPU would otherwise be permanently occupied with data movement.
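That size tradeoff can be captured in a small helper. The 16-byte threshold here is illustrative only -- the real breakeven point is platform-specific and should be measured, not assumed:

```c
#include <stddef.h>

/* Illustrative breakeven point; measure on the actual target.
 * Below it, register setup plus the completion interrupt cost more
 * than a simple load-store loop. */
#define DMA_BREAKEVEN_BYTES 16u

typedef enum { COPY_WITH_CPU, COPY_WITH_DMA } copy_method_t;

/* Pick a transfer method based purely on size. */
copy_method_t pick_copy_method(size_t nbytes)
{
    return (nbytes < DMA_BREAKEVEN_BYTES) ? COPY_WITH_CPU : COPY_WITH_DMA;
}
```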

Transfer Modes

Normal mode: The DMA controller transfers exactly N items, then stops and disables the channel. To perform another transfer, the CPU must reconfigure the DMA registers (or at minimum re-enable the channel and reset the count). Normal mode is appropriate for one-shot operations: sending a UART message, reading a block from SPI flash, or copying a memory region.

Circular mode: After transferring N items, the DMA controller automatically resets its internal counter and source/destination pointers back to the starting values and continues transferring. The channel never stops unless the CPU explicitly disables it. Circular mode is essential for continuous streaming applications -- ADC sampling into a ring buffer, audio I/O, or any scenario requiring an uninterrupted data flow.

The real power of circular mode emerges when combined with the half-transfer interrupt. The DMA controller generates two interrupts per cycle: a half-transfer interrupt (HT) when the first half of the buffer has been filled, and a transfer-complete interrupt (TC) when the second half has been filled and the pointer wraps around. This creates a natural double-buffer pattern: while the DMA fills the second half, the CPU processes the first half, and vice versa. No data is lost, and no explicit buffer swapping is required.

Buffer: [ First Half | Second Half ]
        ↑            ↑             ↑
      start        HT IRQ       TC IRQ
                              (wraps to start)

Time 1: [ DMA filling → | CPU reads     ]  ← HT fires, CPU processes first half
Time 2: [ CPU reads     | DMA filling → ]  ← TC fires, CPU processes second half
Time 3: [ DMA filling → | CPU reads     ]  ← wraps, cycle repeats forever
  • First half: DMA fills elements 0 through N/2-1. When complete, the half-transfer interrupt fires. The CPU can now safely read elements 0 through N/2-1.
  • Second half: DMA fills elements N/2 through N-1. When complete, the transfer-complete interrupt fires. The CPU can now safely read elements N/2 through N-1.
  • Wrap: The DMA pointer resets to element 0 and the cycle repeats indefinitely.
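The HT/TC handoff above reduces to a small piece of logic: map each interrupt to the half of the buffer that is now safe to read. A minimal sketch, assuming a 256-element ADC buffer (all names are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

#define ADC_BUF_LEN 256u                 /* total elements (illustrative) */
static uint16_t adc_buf[ADC_BUF_LEN];    /* filled by circular DMA */

typedef enum { EVT_HALF_TRANSFER, EVT_TRANSFER_COMPLETE } dma_event_t;

/* Map a DMA event to the half of the buffer that is now safe to read:
 * HT means elements [0, N/2) are ready; TC means [N/2, N) are ready. */
const uint16_t *ready_half(dma_event_t evt, size_t *out_len)
{
    *out_len = ADC_BUF_LEN / 2;
    return (evt == EVT_HALF_TRANSFER) ? &adc_buf[0]
                                      : &adc_buf[ADC_BUF_LEN / 2];
}
```

In a real system the HT and TC interrupt handlers would call something like `ready_half` and hand the pointer to the processing task, never touching the half the DMA is currently filling.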

Buffer sizing matters: too small and the CPU cannot process one half before the DMA overwrites it; too large and latency increases. The optimal size depends on the data rate and the processing time per buffer half.
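That sizing constraint can be turned into back-of-the-envelope arithmetic: one half must take at least as long to fill as the worst-case processing time. A sketch of that calculation (integer math, rounded up):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimum elements per half-buffer so that the worst-case processing
 * time fits inside the time the DMA needs to fill the other half.
 * Constraint: half_len / rate_hz >= proc_us / 1e6, rounded up.
 * rate_hz: sample rate; proc_us: worst-case processing time per half. */
size_t min_half_len(uint32_t rate_hz, uint32_t proc_us)
{
    return (size_t)(((uint64_t)rate_hz * proc_us + 999999u) / 1000000u);
}
```

In practice you would then add margin (e.g. double the result) to absorb interrupt latency jitter.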

Channel Priority and Arbitration

Most DMA controllers have multiple channels (typically 4 to 16), and multiple peripherals may request transfers simultaneously. The arbiter resolves conflicts using a two-level priority scheme:

Software priority: Each channel is assigned a priority level -- typically Very High, High, Medium, or Low. The arbiter always services the highest-priority pending request first.

Hardware tiebreaker: If two channels have equal software priority, the channel with the lower number wins. Channel 0 beats Channel 1, Channel 1 beats Channel 2, and so on.

This arbitration happens on a per-transfer basis (or per-burst on controllers that support burst mode). Between each individual element transfer, the arbiter re-evaluates all pending requests. This means a lower-priority channel is not starved indefinitely -- it will be serviced in the gaps between higher-priority transfers.
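The two-level scheme can be modeled in a few lines. The priority encoding below is an assumption for illustration (larger number = more urgent), not any specific vendor's register layout:

```c
#include <stdint.h>

/* Two-level arbitration model: higher software priority wins;
 * on a tie, the lower channel number wins. */
typedef struct {
    uint8_t channel;   /* channel number (lower wins ties) */
    uint8_t priority;  /* software priority, larger = more urgent */
    uint8_t pending;   /* request pending? */
} dma_req_t;

/* Return the index of the request the arbiter services next, or -1
 * if nothing is pending. Re-run between every element transfer. */
int arbitrate(const dma_req_t *reqs, int n)
{
    int winner = -1;
    for (int i = 0; i < n; i++) {
        if (!reqs[i].pending) continue;
        if (winner < 0 ||
            reqs[i].priority > reqs[winner].priority ||
            (reqs[i].priority == reqs[winner].priority &&
             reqs[i].channel < reqs[winner].channel))
            winner = i;
    }
    return winner;
}
```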

Impact on real-time performance: DMA transfers and CPU memory accesses share the same bus (or buses, on multi-layer AHB architectures). During a DMA burst transfer, the CPU may be stalled if it attempts to access the same bus. Keep burst lengths reasonable (4 or 8 beats rather than 16) if you need the CPU to maintain low interrupt latency. On dual-AHB controllers (like STM32F4/F7 DMA2), one DMA port accesses peripherals while the other accesses memory, reducing CPU contention.

Cache Coherency

⚠️Common Trap: DMA + Cache = Stale Data

On any MCU with data cache (Cortex-M7, Cortex-A), DMA buffers MUST be in non-cacheable memory or you must manually clean/invalidate the cache. This is the most common source of "DMA works on M4 but not M7" bugs.

Cache coherency is the single most important DMA pitfall on Cortex-M7 and application processors, and it is a frequent interview topic. The problem arises because the DMA controller and the CPU see memory through different paths:

CPU write, DMA read (Memory-to-Peripheral): The CPU writes data into a transmit buffer. If the D-cache is enabled, those writes may stay in cache and never reach physical RAM. When the DMA controller reads from RAM, it gets stale data -- whatever was in RAM before the CPU's writes. The transmitted data is corrupted.

DMA write, CPU read (Peripheral-to-Memory): The DMA controller writes received data into RAM. The CPU then reads from the buffer, but the D-cache may serve a cached (stale) copy of that memory region from before the DMA wrote to it. The CPU sees old data instead of the freshly received data.

Solutions, from simplest to most flexible:

  1. Non-cacheable memory region (MPU): Configure the Memory Protection Unit to mark the DMA buffer region as non-cacheable. This is the simplest and most robust approach. The CPU reads and writes directly to RAM, bypassing the cache entirely for that region. The tradeoff is slower CPU access to those buffers.

  2. Manual cache maintenance: Before starting a DMA read (memory-to-peripheral), clean (flush) the D-cache for the buffer region to push data from cache to RAM. After a DMA write (peripheral-to-memory) completes, invalidate the D-cache for the buffer region to discard stale cached copies. This approach preserves cache performance but requires careful placement of cache operations and correct address alignment (cache operations work on cache-line granularity, typically 32 bytes).

  3. Write-through cache policy: Configure the cache region as write-through instead of write-back. Every CPU write goes to both cache and RAM simultaneously. This solves the CPU-write/DMA-read problem but not the DMA-write/CPU-read problem (you still need invalidation after DMA writes).
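Option 2's line-granularity requirement means the address and size handed to a clean or invalidate operation must first be rounded out to whole cache lines. A sketch of that rounding, assuming a 32-byte line; on a real Cortex-M7 the resulting span would feed CMSIS's `SCB_CleanDCache_by_Addr` (before a memory-to-peripheral transfer) or `SCB_InvalidateDCache_by_Addr` (after a peripheral-to-memory transfer completes):

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 32u   /* Cortex-M7 D-cache line size */

/* Expand [addr, addr+len) to whole cache lines, as required by
 * clean/invalidate operations that work at line granularity. */
void cache_line_span(uintptr_t addr, size_t len,
                     uintptr_t *out_addr, size_t *out_len)
{
    uintptr_t start = addr & ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t end   = (addr + len + CACHE_LINE - 1u)
                      & ~(uintptr_t)(CACHE_LINE - 1u);
    *out_addr = start;
    *out_len  = (size_t)(end - start);
}
```

Note how the rounded span can reach beyond the buffer itself -- which is exactly why the alignment rules in the next section matter.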

On Cortex-M4 and other processors without data cache, this issue does not arise -- the CPU and DMA both access RAM directly, so coherency is automatic. This is why code that works perfectly on an M4 can produce mysterious data corruption when ported to an M7 without cache management.

Memory Alignment

DMA transfers are most efficient when source and destination addresses are aligned to the transfer data width. For half-word (16-bit) transfers, addresses should be 2-byte aligned. For word (32-bit) transfers, addresses should be 4-byte aligned.

The consequences of misalignment vary by architecture:

  • Cortex-M0/M0+: Misaligned accesses generate a HardFault. DMA buffers must be properly aligned.
  • Cortex-M3/M4/M7: The core supports unaligned accesses, but the DMA controller itself may not. Some DMA controllers silently mask the lower address bits, causing writes to the wrong location. Others generate a transfer error.
  • General rule: Always align DMA buffers to at least the transfer width. Use compiler attributes (__attribute__((aligned(4)))) or linker script placement to guarantee alignment.

Cache maintenance adds another alignment constraint: cache clean and invalidate operations work on entire cache lines (typically 32 bytes on Cortex-M7). If a DMA buffer shares a cache line with other data, invalidating the cache line after a DMA write will discard any CPU modifications to the neighboring data. For this reason, DMA buffers on cached processors should be aligned to cache-line boundaries and sized as multiples of the cache-line size.
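Declaring such a buffer might look like the following, assuming GCC/Clang attribute syntax (the buffer name and length are illustrative):

```c
#include <stdint.h>

#define CACHE_LINE 32u
#define RX_LEN     128u  /* a whole multiple of the cache-line size */

/* DMA RX buffer aligned to a cache-line boundary and sized as a whole
 * number of lines, so a clean/invalidate over it can never touch
 * neighboring variables that happen to share a line. */
static uint8_t rx_buf[RX_LEN] __attribute__((aligned(CACHE_LINE)));
```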

Common DMA Peripherals

| Peripheral | DMA Use Case | Typical Mode | Notes |
| --- | --- | --- | --- |
| ADC | Continuous sampling to buffer | Circular, HT+TC interrupts | Most common DMA use case in interviews |
| UART RX | Receive without polling/per-byte IRQ | Circular, idle line detect | Combine with UART idle interrupt for variable-length messages |
| UART TX | Bulk transmit | Normal, TC interrupt | CPU loads next message on TC |
| SPI | Flash read/write, sensor bulk read | Normal | Often bidirectional (TX+RX DMA simultaneously) |
| I2S / SAI | Audio streaming | Circular, double buffer | Strict timing -- buffer underrun causes audible glitches |
| Memory | Framebuffer copy, buffer init | Normal, M2M | No peripheral pacing; runs at bus speed |
| Timer | Waveform generation (PWM updates) | Normal or circular | DMA updates compare registers on timer events |
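For the UART RX case, the received length is typically recovered from the controller's remaining-items down-counter (NDTR on STM32, used here only as an illustrative reference). On an idle-line interrupt, the number of new bytes is the distance from the last read position, modulo the buffer size. A sketch with illustrative names:

```c
#include <stddef.h>

/* With circular UART RX DMA, the hardware down-counter tells us where
 * the DMA write pointer currently is: pos = buf_len - remaining.
 * The bytes received since the last idle interrupt are the modular
 * distance from last_pos to pos. */
size_t uart_rx_new_bytes(size_t buf_len, size_t remaining,
                         size_t last_pos, size_t *new_pos)
{
    size_t pos = buf_len - remaining;          /* current DMA write index */
    size_t n   = (pos + buf_len - last_pos) % buf_len;
    *new_pos = pos;
    return n;
}
```

The caller saves `*new_pos` as `last_pos` for the next idle interrupt; the modulo handles the wrap-around of the circular buffer.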

Debugging Story

A team was developing an audio processing board based on a Cortex-M7 MCU. The I2S peripheral streamed audio samples via DMA into a circular buffer, the DSP algorithm processed each half-buffer on the HT and TC interrupts, and the output was sent back via another DMA channel. On the previous prototype (Cortex-M4), everything worked flawlessly. On the new M7 board, the audio output had intermittent clicks, pops, and corrupted samples -- sometimes clean for minutes, then suddenly garbled.

After days of investigating DMA configuration, buffer sizes, interrupt priorities, and clock trees, the root cause turned out to be cache coherency. The DMA was writing received audio samples directly to SRAM, but the CPU's D-cache was serving stale copies of those memory locations to the DSP algorithm. Occasionally, when cache lines were naturally evicted, the correct data would appear -- explaining the intermittent nature of the problem.

The fix was straightforward: the team used the MPU to mark the DMA buffer region as non-cacheable normal memory. Audio quality became perfect immediately. The lesson: whenever you port DMA code from a non-cached to a cached processor, cache coherency must be your first checkpoint, not your last resort after days of debugging.

What interviewers want to hear: Demonstrate that you understand DMA as more than "fast memcpy." They want you to articulate why DMA exists (freeing the CPU from data-movement busywork), when it is beneficial versus overkill (transfer size tradeoff), how circular buffers enable continuous streaming without data loss, and what goes wrong on cached processors if you do not manage coherency. Mention bus arbitration and its impact on real-time latency. If you can describe a real debugging experience involving DMA -- especially a cache coherency issue -- it strongly signals production experience.

Interview Focus

Classic DMA Interview Questions

Q1: "When would you choose DMA over CPU-based data transfer?"

Model Answer Starter: "I choose DMA when the transfer is large enough that setup overhead is justified and the CPU's time is better spent elsewhere. The breakeven point is typically around 16-32 bytes -- below that, a simple CPU memcpy or load-store loop is faster because DMA requires configuring source, destination, count, and width registers plus handling the completion interrupt. Above that threshold, DMA becomes increasingly advantageous because the CPU is completely free during the transfer. The classic use cases are ADC streaming, UART/SPI bulk transfers, and audio I/O, where data arrives continuously and the CPU would otherwise spend most of its cycles just copying bytes. DMA is also essential when the CPU needs to enter a low-power sleep mode while data collection continues."

Q2: "What is a DMA circular buffer and when would you use it?"

Model Answer Starter: "A DMA circular buffer is a transfer mode where the DMA controller automatically resets its pointer back to the start of the buffer after reaching the end, creating a continuous ring buffer without any CPU intervention to restart the transfer. I use it for any continuous streaming scenario -- ADC sampling, audio input, UART reception -- where data flows indefinitely. The key technique is combining circular mode with half-transfer and transfer-complete interrupts: while the DMA fills the second half of the buffer, the CPU processes the first half, and vice versa. This creates a natural double-buffer pattern. Buffer sizing is critical -- it must be large enough that the CPU can finish processing one half before the DMA wraps around and overwrites it."

Q3: "How does DMA interact with CPU cache, and what problems can arise?"

Model Answer Starter: "DMA and the CPU access RAM through different paths. The CPU goes through the data cache, while the DMA controller accesses physical RAM directly. This creates a coherency problem in both directions. When the CPU writes data for DMA to transmit, the writes may sit in D-cache and never reach RAM -- the DMA reads stale data. When DMA receives data into RAM, the CPU may read a cached copy of the old contents instead of the fresh data. The solutions are: placing DMA buffers in a non-cacheable region via the MPU, manually cleaning the cache before DMA reads and invalidating after DMA writes, or using a write-through cache policy. This issue only affects processors with data cache -- Cortex-M7 and Cortex-A series -- which is why code that works on Cortex-M4 can fail mysteriously on M7."

Q4: "How does DMA channel priority and arbitration work?"

Model Answer Starter: "DMA channels share the bus, and when multiple channels have pending requests simultaneously, the arbiter decides which goes first. There are two levels of priority: software-configurable priority (Very High, High, Medium, Low) assigned per channel, and a hardware tiebreaker where the lower-numbered channel wins if software priorities are equal. Arbitration happens between each individual element transfer, so lower-priority channels are not completely starved -- they get serviced in the gaps. An important system-level consideration is that DMA burst transfers can stall CPU memory accesses since they share the bus. I keep burst lengths moderate and assign priorities carefully to ensure time-critical peripherals are serviced first without starving others."

Q5: "What happens if the CPU and DMA access the same memory simultaneously?"

Model Answer Starter: "On a single-bus architecture, the CPU and DMA cannot physically access the bus at the exact same clock cycle -- the bus arbiter interleaves their accesses. The CPU may be stalled for a few cycles while a DMA transfer completes. On multi-layer AHB bus architectures, the CPU and DMA can operate in parallel as long as they target different bus slaves. The real danger is not simultaneous access at the electrical level but logical race conditions: if the CPU reads a buffer that DMA is currently writing into, the CPU may see partially updated data. This is why circular DMA uses half-transfer and transfer-complete interrupts to partition the buffer into safe zones. On cached processors, there is the additional cache coherency issue where the CPU and DMA see different versions of the same memory location."

Trap Alerts

  • Don't say: "DMA is always faster than CPU transfer" -- for small transfers, the DMA setup overhead makes it slower than a simple loop
  • Don't forget: Cache coherency on Cortex-M7 and application processors -- this is the single most common DMA bug in production
  • Don't ignore: Bus arbitration and its impact on real-time latency -- uncontrolled DMA bursts can stall the CPU at critical moments

Follow-up Questions

  • "How would you size a DMA circular buffer for an ADC sampling at 100 kHz with 1 ms processing deadlines?"
  • "What happens if the CPU does not process one half of a circular DMA buffer before the DMA wraps around?"
  • "How would you implement scatter-gather DMA for a system with non-contiguous memory buffers?"
  • "What is the difference between DMA and FIFO-based peripheral data handling?"

Practice

What is the primary advantage of DMA over CPU-driven data transfer?

When is CPU memcpy typically faster than DMA?

What is the purpose of the DMA half-transfer interrupt in circular mode?

On a Cortex-M7 with D-cache enabled, what happens if DMA writes data to a cacheable memory region?

What determines which DMA channel is serviced first when multiple channels have pending requests?

Real-World Tie-In

High-Speed Data Logger -- Designed a vibration monitoring system that sampled 8 ADC channels at 50 kHz each using DMA circular mode with half-transfer interrupts. The CPU ran an FFT on each half-buffer during the interrupt, wrote results to SD card via SPI DMA, and spent the remaining time in sleep mode. Without DMA, the CPU would have been 100% occupied just reading ADC samples, leaving no time for signal processing.

Audio Codec on Cortex-M7 -- Implemented a real-time audio effects processor streaming 48 kHz stereo audio through I2S with DMA circular buffers. Discovered cache coherency corruption on first bring-up (clicks and garbled audio). Resolved by placing all DMA buffers in a non-cacheable SRAM region via MPU configuration. The system achieved sub-millisecond audio latency with less than 15% CPU utilization.

Industrial Gateway with Multi-Protocol DMA -- Built a protocol converter bridging UART, SPI, and CAN interfaces, each using its own DMA channel. UART RX used circular DMA with idle-line detection for variable-length messages. SPI used normal-mode DMA for flash logging. Careful channel priority assignment ensured the safety-critical CAN interface was never starved by bulk SPI flash writes.