Transfer Fundamentals
Q: When should you use DMA instead of CPU-driven data transfers?
DMA is beneficial when you need to move blocks of data between memory and peripherals (or memory-to-memory) at rates where CPU involvement becomes wasteful or infeasible. The decision comes down to three factors: data rate, transfer size, and what the CPU should be doing instead.
Use DMA when: (1) high data rates force the CPU to spend most of its time servicing transfers — an ADC sampling at 1 Msps generates a million interrupts per second, each costing 30-50 clock cycles for context switch, flag check, and data copy; at 72 MHz, that alone consumes 40-70% of the CPU. (2) You need deterministic transfer timing independent of interrupt latency — DMA responds to peripheral requests within 1-2 bus cycles, while an ISR has variable latency depending on current interrupt priority and pipeline state. (3) The CPU should be processing previously captured data while new data arrives — the classic producer-consumer pattern where DMA produces and the CPU consumes.
Use CPU transfers when: (1) data volumes are small and infrequent — a single 4-byte SPI command does not justify 10-20 register writes to configure the DMA channel. (2) The transfer involves per-byte logic that DMA's simple increment-and-copy model cannot express — byte stuffing, CRC computation inline, protocol framing, or conditional branching. (3) The system has only a few DMA channels and they are all allocated to higher-priority peripherals. The practical break-even point is roughly 8-16 bytes per transfer at rates above a few hundred transfers per second. Below that, the DMA setup overhead (50-200 CPU cycles) exceeds the time the CPU would spend doing the transfer directly.
Q: What is the overhead of setting up a DMA transfer, and when does it outweigh the benefit?
Setting up a DMA transfer requires configuring multiple registers: source address, destination address, transfer count (number of data items), data width (byte, half-word, or word), increment modes for source and destination, circular vs. normal mode, channel priority, and finally enabling the channel. On STM32, this translates to 10-20 register writes. Including function call overhead if using the HAL (HAL_DMA_Start() or HAL_DMA_Start_IT()), the total setup cost is roughly 50-200 CPU cycles depending on HAL version and compiler optimization level.
This setup cost is amortized over the entire transfer. For a 1000-byte SPI transfer, 100 cycles of setup versus 1000 cycles of CPU byte-banging is a clear win — especially since the CPU is free during the DMA transfer. But for a 2-byte I2C register read, the DMA setup alone takes longer than the CPU would need to simply poll the two bytes. The crossover is typically around 8-16 bytes, depending on the peripheral clock speed and CPU clock.
For applications that perform repeated small transfers to the same peripheral (e.g., periodic 8-byte SPI sensor reads), the solution is to configure DMA once and re-trigger it for each transfer by simply updating the transfer count register and re-enabling the channel — reducing per-transfer overhead to 2-3 register writes. Circular mode eliminates re-configuration entirely for continuous transfers, making DMA effectively zero-overhead after the initial setup. The key insight: DMA is an investment that pays off in throughput and CPU freedom, not in latency for individual small operations.
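The re-arm sequence described above can be sketched in a few lines. This is a minimal illustration assuming an STM32F1-style channel register layout; `dma_channel_t` and `dma_rearm` are illustrative names, and the register struct is mocked here so the sequence is self-contained (on real hardware you would use the CMSIS `DMA_Channel_TypeDef`).

```c
#include <stdint.h>

/* Minimal mock of an STM32F1-style DMA channel (CCR/CNDTR only),
 * so the re-arm sequence can be shown standalone. */
typedef struct {
    volatile uint32_t CCR;   /* control register: bit 0 = EN */
    volatile uint32_t CNDTR; /* remaining transfer count */
} dma_channel_t;

#define DMA_CCR_EN (1u << 0)

/* Re-arm an already-configured channel for a new transfer of `count`
 * items. Addresses, widths, and direction set at init time are reused,
 * so only three register writes are needed per transfer. */
static void dma_rearm(dma_channel_t *ch, uint32_t count) {
    ch->CCR  &= ~DMA_CCR_EN;  /* channel must be disabled to write CNDTR */
    ch->CNDTR = count;        /* reload the transfer count */
    ch->CCR  |= DMA_CCR_EN;   /* re-enable: transfer starts on next request */
}
```

Note the required order: the transfer count register is write-protected while the channel is enabled, so the disable must come first.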
Q: Why does memory alignment matter for DMA transfers?
DMA transfers data in units of the configured transfer width — byte (8-bit), half-word (16-bit), or word (32-bit). Both the source and destination addresses must be naturally aligned to the transfer width: half-word transfers require 2-byte aligned addresses, and word transfers require 4-byte aligned addresses. This alignment requirement exists because the bus infrastructure (AHB/APB) and the memory controller expect aligned accesses for efficient operation.
Misaligned DMA transfers cause different failure modes depending on the controller. On most STM32 families, a source or destination address that is not aligned to the configured data size raises a Transfer Error (TEIF) and the channel shuts down; some controllers instead silently ignore the low address bits and transfer to or from the wrong aligned address. Note that DMA alignment rules are stricter than the CPU's: Cortex-M3, M4, and M7 cores handle most unaligned CPU loads and stores transparently (with a performance penalty; Cortex-M0 HardFaults instead), but the DMA controller does not share this capability. Additionally, when peripheral and memory transfer widths differ (e.g., packing two 16-bit ADC readings into a 32-bit memory word), the DMA's FIFO and packing/unpacking logic must be configured correctly, and the buffer alignment must satisfy the wider of the two widths.
Best practice: always declare DMA buffers with explicit alignment attributes — __attribute__((aligned(4))) for word transfers, __attribute__((aligned(32))) for Cortex-M7 cache-line alignment. When transferring arrays of structs, verify that the struct size is a multiple of the transfer width and that padding does not introduce gaps. A common bug: defining a DMA buffer inside a function as a local variable on the stack — stack alignment is compiler-dependent and may not satisfy DMA requirements, plus the buffer is invalid after the function returns while DMA may still be writing to it.
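The declarations above can be written with compile-time guards so alignment mistakes fail at build time rather than at runtime. A small sketch; the buffer names are illustrative, and `_Alignas` is the portable C11 spelling of the GCC `__attribute__((aligned(n)))` used in most STM32 projects.

```c
#include <stdint.h>
#include <stddef.h>

/* DMA buffers with explicit alignment, declared at file scope
 * (never on the stack, for the reasons given above). */
_Alignas(4)  static uint16_t adc_buf[128]; /* word-aligned for 16->32-bit packing */
_Alignas(32) static uint8_t  spi_buf[64];  /* cache-line aligned for Cortex-M7 */

/* Compile-time guard: size a cache-maintained buffer in whole 32-byte
 * lines so it never shares a line with a neighboring variable. */
_Static_assert(sizeof(spi_buf) % 32 == 0, "spi_buf must fill whole cache lines");

/* Runtime check, useful in asserts during bring-up. */
static int is_aligned(const void *p, size_t n) {
    return ((uintptr_t)p % n) == 0;
}
```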
Modes and Configuration
Q: How do circular buffers with Half-Transfer (HT) and Transfer-Complete (TC) interrupts work?
In circular DMA mode, the DMA controller continuously transfers data between a peripheral and a memory buffer, automatically wrapping the memory pointer back to the beginning when it reaches the end. This creates an infinite ring buffer that operates without any CPU intervention to restart or reconfigure transfers. The hardware maintains the write position internally via the NDTR (Number of Data to Transfer Register), which decrements with each transfer and reloads to the full buffer size on wrap-around.
The buffer is logically divided into two halves, and the DMA controller generates two interrupts per complete cycle. The Half-Transfer (HT) interrupt fires when the DMA has filled the first half of the buffer and begins writing to the second half. The Transfer-Complete (TC) interrupt fires when the second half is full and the DMA wraps back to the buffer start. This creates a natural ping-pong double-buffering scheme: when HT fires, the CPU processes the first half while DMA fills the second; when TC fires, the CPU processes the second half while DMA fills the first. The CPU always has a full half-buffer's worth of time to complete processing before DMA overwrites the data.
```c
// Circular DMA double-buffering pattern (STM32F1-style registers)
#define BUF_SIZE 256
uint16_t adc_buf[BUF_SIZE];

void DMA1_Channel1_IRQHandler(void) {
    if (DMA1->ISR & DMA_ISR_HTIF1) {        // Half-Transfer: first half ready
        DMA1->IFCR = DMA_IFCR_CHTIF1;       // acknowledge the flag
        process(adc_buf, BUF_SIZE / 2);     // First half
    }
    if (DMA1->ISR & DMA_ISR_TCIF1) {        // Transfer-Complete: second half ready
        DMA1->IFCR = DMA_IFCR_CTCIF1;       // acknowledge the flag
        process(adc_buf + BUF_SIZE / 2, BUF_SIZE / 2); // Second half
    }
}
```
This pattern is the backbone of real-time audio processing, continuous ADC sampling, and high-throughput UART/SPI reception. The buffer size must be chosen so that the worst-case CPU processing time for one half is less than the time DMA takes to fill the other half. If processing overruns, data is silently corrupted — there is no hardware protection, no error flag, and no interrupt. Sizing the buffer correctly requires knowing both the data arrival rate and the worst-case processing time.
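The sizing rule above reduces to simple arithmetic: one half of the buffer fills in (buffer length / 2) / sample rate seconds, and worst-case processing must finish inside that window. A helper to compute the budget, with illustrative names:

```c
#include <stdint.h>

/* Half-buffer timing budget for circular DMA: at `rate_hz` samples/s,
 * one half of a `buf_len`-sample buffer fills in (buf_len/2)/rate_hz
 * seconds. Worst-case CPU processing of a half must finish inside this
 * window or data is silently overwritten. Returns microseconds. */
static uint32_t half_buffer_budget_us(uint32_t rate_hz, uint32_t buf_len) {
    return (uint32_t)(((uint64_t)(buf_len / 2) * 1000000u) / rate_hz);
}
```

For example, a 1 Msps ADC feeding a 256-sample buffer gives the CPU 128 microseconds per half; a 48 kHz audio stream with a 96-sample buffer gives exactly one millisecond.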
Q: How does double buffering work in DMA, and how is it different from circular mode with HT/TC?
True double buffering (available on DMA2 streams on STM32F4/F7/H7) uses two completely separate memory buffers with automatic hardware pointer switching. The DMA controller maintains two memory address registers — M0AR and M1AR — each pointing to a different buffer. When the current buffer is full, the hardware atomically switches to the other buffer and sets the Current Target (CT) flag to indicate which buffer is now active. A Transfer-Complete interrupt notifies the CPU, which processes the just-completed buffer while DMA fills the other.
The key advantage over circular HT/TC is physical buffer independence: the two buffers can be at arbitrary, non-contiguous memory addresses. This is useful when you cannot allocate a single large contiguous block (fragmented memory), when buffers need to be in different memory regions (one in SRAM, one in DTCM), or when you want to redirect DMA output dynamically. While DMA fills buffer A, the CPU can even update the M1AR register to point to a completely different location for the next transfer — enabling scatter-gather-like patterns where successive DMA transfers land in different memory regions.
In contrast, circular HT/TC uses a single contiguous buffer split into logical halves — simpler to set up and available on all DMA channels (not just specific streams). The switching is implicit (based on counter position) rather than explicit (hardware pointer swap). In practice, circular HT/TC is more commonly used because it covers the majority of use cases, requires less configuration, and works on all STM32 DMA channels. True double buffering is reserved for advanced scenarios: high-bandwidth video/audio processing, multi-stage DMA pipelines, or when contiguous allocation is impractical.
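The CT-flag bookkeeping can be sketched as follows. This assumes the STM32F4-style stream layout, where CT is bit 19 of the stream's CR register; the stream struct is mocked here so the selection logic stands alone, and `completed_buffer` is an illustrative name.

```c
#include <stdint.h>

/* Mocked DMA stream control register (real hardware: DMA_SxCR). */
typedef struct { volatile uint32_t CR; } dma_stream_t;
#define DMA_SxCR_CT (1u << 19)  /* Current Target: 0 = filling M0AR, 1 = filling M1AR */

/* In the Transfer-Complete ISR, return the buffer the CPU may safely
 * process: the one the DMA is NOT currently writing. buf0/buf1
 * correspond to the buffers behind M0AR and M1AR respectively. */
static uint16_t *completed_buffer(const dma_stream_t *s,
                                  uint16_t *buf0, uint16_t *buf1) {
    return (s->CR & DMA_SxCR_CT) ? buf0 : buf1;
}
```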
Q: How does DMA channel priority and arbitration work?
When multiple DMA channels have pending transfer requests simultaneously, the DMA controller's internal arbiter decides which channel gets bus access. On STM32, this uses a two-level priority scheme:
Software priority (configurable): Each DMA channel/stream is assigned a priority level — Low, Medium, High, or Very High — via the PL bits in the configuration register. The arbiter services the highest-priority pending request first. This lets you ensure time-critical peripherals (audio DAC output that must not underrun, or ADC sampling that must not overrun) preempt less urgent transfers (UART logging, memory-to-memory copies). Assigning priorities incorrectly can cause subtle bugs: a high-priority bulk memory copy can starve the UART DMA channel long enough to cause receive overrun errors.
Hardware priority (tie-breaking): When two channels have the same software priority level, the lower-numbered channel always wins. Channel 1 beats channel 2; stream 0 beats stream 3. This is fixed in silicon and cannot be changed. It means the physical channel assignment matters — map your most time-critical peripheral to the lowest-numbered compatible channel.
Between individual data transfers (not mid-transfer), the arbiter re-evaluates priorities. This means a high-priority channel can preempt an ongoing low-priority transfer at the next data-unit boundary. The maximum latency a high-priority channel experiences is one transfer unit of the low-priority channel, typically 1-4 bus cycles (or one full burst if the low-priority channel uses burst transfers, since a burst is not interruptible). For systems with multiple DMA controllers (DMA1 and DMA2 on many STM32 families), each controller has its own independent arbiter. Contention between the two controllers is resolved by the bus matrix, which has its own arbitration rules and can add latency when both controllers access the same bus slave simultaneously.
Q: Which peripherals are commonly paired with DMA, and what configurations are typical?
ADC: The most common DMA use case in embedded systems. DMA in circular mode transfers conversion results directly to a RAM buffer. For multi-channel scan mode, each channel's result is stored sequentially — the DMA auto-increments the memory pointer while the ADC sequences through channels. Timer-triggered ADC with DMA produces precisely timed, zero-jitter sampling with zero CPU involvement until the buffer is ready to process.
UART/USART: DMA handles both directions. For TX: memory-to-peripheral, normal mode — configure the buffer address and byte count, enable the channel, and the DMA feeds bytes to the UART data register as fast as the UART can accept them. The CPU is free immediately after starting the transfer. For RX: peripheral-to-memory, circular mode — incoming bytes fill a ring buffer continuously. Combine with the UART idle line detection interrupt to process variable-length messages: the idle interrupt tells firmware "the sender paused, process the accumulated bytes." This DMA + idle-line pattern is the gold standard for UART reception in production firmware.
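In the idle-line handler, the firmware's only job is bookkeeping: derive the DMA write position from the hardware counter (write index = buffer size minus NDTR on STM32) and consume everything between its own read pointer and that position, wrapping as needed. A sketch with illustrative names (`rx_extract`, `RX_BUF_SIZE`), fed a raw NDTR value so it runs standalone:

```c
#include <stdint.h>
#include <stddef.h>

#define RX_BUF_SIZE 64

/* Copy newly received bytes out of the circular DMA ring into `out`,
 * handling wrap-around. `ndtr` is the DMA counter read in the idle-line
 * ISR; `read_pos` persists between calls. Returns bytes extracted. */
static size_t rx_extract(const uint8_t *ring, uint32_t ndtr,
                         size_t *read_pos, uint8_t *out) {
    size_t write_pos = RX_BUF_SIZE - ndtr;  /* where DMA will write next */
    size_t n = 0;
    while (*read_pos != write_pos) {        /* consume up to the DMA position */
        out[n++] = ring[*read_pos];
        *read_pos = (*read_pos + 1) % RX_BUF_SIZE;
    }
    return n;
}
```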
SPI: DMA enables full-duplex, high-speed transfers. Configure two DMA channels — one for TX (memory to SPI_DR) and one for RX (SPI_DR to memory) — and start both simultaneously. This is essential for driving displays (pushing framebuffer data), SD card communication, and high-speed sensor interfaces where the CPU cannot keep up with the SPI clock rate. Without DMA, a 42 MHz SPI bus requires the CPU to service an interrupt every 190 ns — faster than the ISR can execute on most Cortex-M4 cores.
DAC: Timer-triggered DAC with circular DMA enables waveform generation. The DMA feeds a pre-computed waveform table (sine, triangle, arbitrary) to the DAC at each timer trigger, producing analog output signals with zero CPU involvement — the CPU just fills the table once and the hardware loops forever.
Memory-to-memory: DMA can copy between RAM regions or from flash to RAM, functioning as a hardware memcpy(). Useful for framebuffer blitting, initializing large data structures, or copying lookup tables from flash to faster RAM at startup.
Cache and Performance
Q: What is the cache coherency problem on Cortex-M7, and how do you solve it with DMA?
Cortex-M7 processors include a data cache (D-Cache) — a small, fast SRAM that holds copies of recently accessed main memory locations. The cache operates transparently to normal CPU code, but DMA transfers bypass the cache entirely because the DMA controller accesses main RAM directly through the bus matrix without going through the CPU's cache. This creates two coherency problems:
DMA writes, CPU reads (most common): DMA deposits fresh data into RAM, but the CPU's D-Cache still holds stale copies of those addresses from a previous access. When the CPU reads the buffer, it gets the old cached values instead of the new DMA data. This manifests as the ADC buffer showing the same readings repeatedly, or UART RX data appearing frozen — a maddening bug because everything works on Cortex-M4 (no cache) and breaks on M7.
CPU writes, DMA reads: The CPU updates a transmit buffer, but the writes are sitting in the cache and have not been flushed to main RAM. When DMA reads the buffer to feed the peripheral (UART TX, SPI TX), it reads stale RAM data — potentially garbage from the previous transfer or uninitialized memory.
Solutions, from simplest to most performant: (1) MPU-based non-cacheable region: Use the Memory Protection Unit to mark a RAM region as non-cacheable and place all DMA buffers there. Simple, foolproof, and zero maintenance — but every CPU access to that region bypasses the cache, which can hurt performance for frequently accessed buffers. (2) Explicit cache maintenance: Before DMA reads a CPU-written buffer, call SCB_CleanDCache_by_Addr() to flush dirty cache lines to RAM. After DMA writes to a buffer, call SCB_InvalidateDCache_by_Addr() to discard stale cache lines so the CPU fetches fresh data. Critical requirement: DMA buffers must be aligned to 32-byte cache line boundaries (__attribute__((aligned(32)))) and sized as multiples of 32 bytes. If a buffer shares a cache line with an unrelated variable, invalidating that line discards the unrelated variable's cached value — a subtle corruption bug. (3) DTCM (Data Tightly Coupled Memory): This memory region is not cached and provides single-cycle access latency. It is ideal for small, frequently accessed DMA buffers. The limitation is size — DTCM is typically 64-128 KB, shared with the stack and other performance-critical data.
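Cache maintenance operates on whole 32-byte lines, so the address passed to the clean/invalidate routines must be rounded down to a line boundary and the size rounded up. The helpers below show just that arithmetic; the `SCB_CleanDCache_by_Addr` / `SCB_InvalidateDCache_by_Addr` calls themselves come from CMSIS and are not reproduced here.

```c
#include <stdint.h>

#define CACHE_LINE 32u  /* Cortex-M7 D-Cache line size */

/* Round an address DOWN to its cache-line boundary. */
static uintptr_t line_align_down(uintptr_t addr) {
    return addr & ~(uintptr_t)(CACHE_LINE - 1);
}

/* Size of the line-aligned region covering [addr, addr + size):
 * round the end UP to a line boundary and subtract the rounded start.
 * This is the size to hand to the SCB_* maintenance functions. */
static uint32_t line_round_size(uintptr_t addr, uint32_t size) {
    uintptr_t start = line_align_down(addr);
    uintptr_t end = (addr + size + CACHE_LINE - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    return (uint32_t)(end - start);
}
```

This rounding is exactly why shared cache lines are dangerous: cleaning or invalidating a buffer that does not start and end on line boundaries drags neighboring variables along with it.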
Q: What happens if the CPU and DMA access the same memory or peripheral simultaneously?
When the CPU and DMA contend for the same bus (AHB or APB), the bus matrix arbiter serializes the accesses. One gets through first, the other is stalled. For the CPU, this stall manifests as the current instruction taking extra clock cycles to complete — software cannot detect it, but it reduces effective CPU throughput. This is called bus contention or cycle stealing. For individual transfers, the stall is typically 1-2 clock cycles and is invisible to software. But for sustained high-bandwidth DMA (memory-to-memory at maximum speed, or multiple DMA channels active simultaneously), the cumulative stalling can reduce CPU performance by 10-30%.
For simultaneous access to the same peripheral register, the bus arbiter serializes the accesses cleanly — no data corruption occurs at the hardware level. However, logical races are a real danger: if the CPU reads the UART data register (DR) while a DMA RX transfer is active on the same UART, both are consuming from the same source. Reading DR clears the RXNE flag, so whichever reads first (CPU or DMA) gets the byte, and the other sees nothing or gets the next byte. This is a configuration error — never use both CPU and DMA to access the same peripheral's data path simultaneously.
Similarly, if both the CPU and DMA write to the same RAM buffer without coordination, data corruption occurs. The HT/TC interrupt mechanism exists precisely to provide this synchronization — the CPU processes one half only after the DMA has moved on to the other half. Breaking this contract (processing too slowly and overrunning into the DMA's active half) produces silent data corruption with no error flag or interrupt. In performance-critical systems, add a runtime check: compare the DMA's current NDTR value against the processing pointer to detect overrun conditions and log them.
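The runtime check mentioned above can be a few lines: derive the DMA write index from NDTR and verify the unprocessed backlog has not exceeded half the buffer. A simple sketch with illustrative names; a production version would also account for NDTR changing while it is read.

```c
#include <stdint.h>
#include <stdbool.h>

#define BUF_LEN 256u

/* Detect a circular-DMA overrun: `ndtr` is the live counter value,
 * `processed_idx` is how far the CPU's processing pointer has gotten.
 * If more than half the buffer is pending, the DMA has invaded the
 * half the CPU has not yet consumed and data is being lost. */
static bool half_overrun(uint32_t ndtr, uint32_t processed_idx) {
    uint32_t write_idx = BUF_LEN - ndtr;  /* where DMA writes next */
    uint32_t pending = (write_idx - processed_idx + BUF_LEN) % BUF_LEN;
    return pending > BUF_LEN / 2;
}
```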
Q: How do you debug DMA transfers that appear to silently fail?
DMA failures are notoriously difficult to debug because they often produce no visible error — data simply does not arrive, arrives corrupted, or the system hangs without a HardFault. A systematic approach is essential.
Step 1 — Verify the DMA channel is actually running. Read the DMA stream/channel enable bit (EN in the configuration register); some error conditions clear EN automatically. Check the DMA error flags: Transfer Error (TEIF), FIFO Error (FEIF, on stream-based controllers such as the STM32F4/F7 DMA), and Direct Mode Error (DMEIF). These flags are often overlooked because the HAL error callbacks are not always connected. Read the NDTR register: if it still equals the initial count, no transfers have occurred; if it is decrementing, transfers are happening but the data may be going to the wrong place.
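Step 1 can be encoded as a triage function. The channel state is mocked here (CCR/CNDTR as in an F1-style channel) so the decision logic runs standalone; on hardware you would read the live registers, and the names `dma_triage` and `dma_diag_t` are illustrative.

```c
#include <stdint.h>

typedef struct { uint32_t CCR, CNDTR; } dma_ch_state_t;  /* mocked snapshot */
#define CCR_EN (1u << 0)

typedef enum { DMA_NOT_ENABLED, DMA_NO_REQUESTS, DMA_RUNNING } dma_diag_t;

/* Classify a stalled-looking channel from two register reads. */
static dma_diag_t dma_triage(const dma_ch_state_t *ch, uint32_t initial_count) {
    if (!(ch->CCR & CCR_EN))
        return DMA_NOT_ENABLED;   /* never started, or an error cleared EN */
    if (ch->CNDTR == initial_count)
        return DMA_NO_REQUESTS;   /* enabled but idle: no peripheral requests */
    return DMA_RUNNING;           /* counter moving: data is flowing somewhere */
}
```

Each verdict points at a different next step: NOT_ENABLED sends you to the error flags, NO_REQUESTS to the request mapping (Step 3), and RUNNING to the addresses (Step 2).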
Step 2 — Check addresses and alignment. Verify the peripheral address register points to the correct data register (e.g., &ADC1->DR, not &ADC1->SR). Verify the memory address points to a valid, writable RAM region — a common bug is pointing to flash (read-only) or to a stack variable that has gone out of scope. Check alignment: a word-width DMA transfer to an address not aligned to 4 bytes causes a Transfer Error on most STM32 families. On Cortex-M7, also check that the buffer is not in a cached region without proper cache maintenance.
Step 3 — Check the peripheral request connection. On STM32, each DMA channel/stream is hard-wired to specific peripheral request signals (e.g., DMA1 Channel 1 is ADC1 on STM32F1; DMA2 Stream 0 with Channel 0 is ADC1 on STM32F4). Using the wrong DMA channel for a peripheral means the DMA never receives a transfer request, so it sits idle with EN set and NDTR unchanged. Always verify the channel/stream/request mapping against the reference manual's DMA request mapping table.