
CPU Fundamentals for Embedded — Interview Questions & Answers

Practice interview questions on cache coherency with DMA, CPU pipeline, memory barriers, bus architecture, and memory types.


Cache & DMA Coherency

Q: Explain cache coherency issues with DMA and how to solve them.

Cache coherency problems arise because the DMA controller and the CPU access main SRAM through different paths. The CPU reads and writes through the data cache (D-Cache), which keeps a local copy of recently accessed memory. The DMA controller bypasses the cache entirely — it reads from and writes to main SRAM directly via the bus matrix. This creates two distinct problems depending on the data flow direction.

DMA RX (DMA writes, CPU reads): The DMA deposits fresh data — ADC samples, UART bytes, SPI responses — into a RAM buffer. But the CPU's D-Cache may still hold stale copies of those same addresses from a previous access. When the CPU reads the buffer after the DMA transfer completes, it gets the old cached values, not the new DMA data. The symptom is maddening: your ADC buffer shows the same readings repeatedly, or UART RX data appears frozen. Everything works perfectly on Cortex-M4 (which has no cache) and breaks on M7. Solution: Call SCB_InvalidateDCache_by_Addr() on the buffer after the DMA transfer completes. This discards the stale cache lines, forcing the CPU to fetch fresh data from SRAM on the next read.
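On STM32H7/F7-class parts with CMSIS, the RX-side maintenance typically looks like the sketch below. The buffer name and the `process_samples` consumer are illustrative; `SCB_InvalidateDCache_by_Addr()` is the standard CMSIS call.

```c
// Illustrative CMSIS sketch: invalidate the D-Cache after a DMA RX transfer.
// The buffer must be 32-byte aligned and a multiple of 32 bytes long so
// invalidation touches only this buffer's cache lines.
__attribute__((aligned(32)))
static uint8_t dma_rx_buf[64];

void on_dma_rx_complete(void)
{
    // Discard any stale cached copies so the CPU re-reads SRAM,
    // which now holds the data the DMA just wrote.
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_rx_buf, sizeof(dma_rx_buf));
    process_samples(dma_rx_buf, sizeof(dma_rx_buf)); // hypothetical consumer
}
```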

DMA TX (CPU writes, DMA reads): The CPU fills a transmit buffer, but the writes may be sitting in the cache's dirty lines and have not been flushed to main SRAM. When DMA reads the buffer to feed the peripheral, it reads stale SRAM contents — potentially garbage from a previous transfer or uninitialized memory. Solution: Call SCB_CleanDCache_by_Addr() before starting the DMA transfer, which pushes dirty cache lines out to SRAM. Alternative solutions include: marking the buffer's memory region as non-cacheable via the MPU (simple but sacrifices cache performance for all CPU accesses to that region), or placing DMA buffers in DTCM/TCM, which is not cached and provides single-cycle access. Critical requirement: DMA buffers must be cache-line aligned (32 bytes on M7) and sized as multiples of 32 bytes, because cache operations work on whole cache lines.
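The TX side mirrors this pattern: clean before the DMA starts. A sketch assuming the STM32 HAL UART API (`huart1` and the buffer are illustrative):

```c
// Illustrative CMSIS/HAL sketch: clean the D-Cache before starting a DMA TX.
__attribute__((aligned(32)))
static uint8_t dma_tx_buf[128];

void start_uart_dma_tx(uint16_t len)
{
    // Push any dirty cache lines out to SRAM so the DMA reads current data,
    // not leftovers from a previous transfer.
    SCB_CleanDCache_by_Addr((uint32_t *)dma_tx_buf, sizeof(dma_tx_buf));
    HAL_UART_Transmit_DMA(&huart1, dma_tx_buf, len); // assumed HAL handle
}
```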

Q: Write-through vs write-back cache — tradeoffs in embedded?

Write-through cache policy means every CPU write goes to both the cache and main SRAM simultaneously. The cache always mirrors SRAM exactly, so there are never dirty cache lines. This is inherently safe for DMA TX — when the DMA reads the buffer from SRAM, it always gets the latest data the CPU wrote. The downside is performance: every store instruction generates a bus transaction to SRAM, which takes multiple clock cycles and consumes bus bandwidth. On a Cortex-M7 running at 400 MHz with SRAM at 200 MHz, every write stalls the CPU until the bus transaction completes (unless a write buffer absorbs it temporarily). For write-heavy code — clearing a framebuffer, building a transmit packet, initializing a large struct — write-through can reduce throughput by 30-50% compared to write-back.

Write-back cache policy means CPU writes update only the cache, marking the cache line as "dirty." The data is written to SRAM later — either when the cache line is evicted to make room for new data, or when software explicitly cleans it. This is significantly faster because multiple writes to the same cache line are absorbed without any bus traffic. The danger in embedded systems is that SRAM contents lag behind the cache. If DMA reads a TX buffer before the CPU cleans those cache lines, the DMA transmits stale data. Similarly, if DMA writes to an RX buffer, the CPU must invalidate those cache lines before reading, or it sees old cached values instead of the new DMA data.

The practical choice depends on the application. Write-through is simpler and safer — use it when DMA buffers are scattered throughout memory and you cannot easily track which buffers need explicit cache maintenance. Write-back is the default on most Cortex-M7 BSPs because it delivers better performance — use it when you can carefully manage cache maintenance calls around every DMA transfer, and when your DMA buffers are cache-line aligned. Many production systems use a hybrid approach: configure most SRAM as write-back for performance, and use the MPU to mark a small region as non-cacheable or write-through specifically for DMA buffers.
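The hybrid approach described above can be sketched with the STM32 HAL MPU API. Region number, base address, and size are assumptions for illustration; consult your part's memory map.

```c
// Illustrative HAL sketch: mark a 32 KB SRAM window non-cacheable so DMA
// buffers placed there need no cache maintenance. Values are assumptions.
MPU_Region_InitTypeDef mpu = {0};

HAL_MPU_Disable();
mpu.Enable           = MPU_REGION_ENABLE;
mpu.Number           = MPU_REGION_NUMBER0;
mpu.BaseAddress      = 0x20010000;             // hypothetical DMA buffer region
mpu.Size             = MPU_REGION_SIZE_32KB;
mpu.AccessPermission = MPU_REGION_FULL_ACCESS;
mpu.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
mpu.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE; // the key attribute
mpu.IsShareable      = MPU_ACCESS_SHAREABLE;
mpu.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
HAL_MPU_ConfigRegion(&mpu);
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
```

The linker script must then place the DMA buffers inside this window (typically via a dedicated output section) for the attribute to apply.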

Q: How do you ensure a DMA buffer is cache-line aligned and why does it matter?

On Cortex-M7, the D-Cache operates on cache lines of 32 bytes. Cache maintenance operations — SCB_InvalidateDCache_by_Addr() and SCB_CleanDCache_by_Addr() — work on entire cache lines, not individual bytes. If a DMA buffer shares a cache line with an unrelated variable, invalidating the buffer's cache line also discards the cached value of that adjacent variable. This causes silent data corruption: the unrelated variable reverts to whatever was last written to SRAM, losing any recent updates that were sitting in the cache.

To prevent this, DMA buffers must be aligned to 32-byte boundaries and sized as multiples of 32 bytes:

```c
// Correct: 32-byte aligned, size is a multiple of 32
__attribute__((aligned(32)))
static uint8_t dma_rx_buf[64];   // 2 cache lines

// Also correct: dedicated linker section plus alignment
__attribute__((section(".dma_buffer"), aligned(32)))
static uint8_t dma_tx_buf[128];
```

Without the alignment attribute, the compiler may place dma_rx_buf at any address that satisfies its natural alignment (1-byte for uint8_t arrays). If the buffer starts at address 0x20000014, it spans cache lines starting at 0x20000000 and 0x20000020. Invalidating the first cache line (0x20000000-0x2000001F) also invalidates bytes 0x20000000-0x20000013, which belong to other variables. With aligned(32), the buffer starts at a cache line boundary, ensuring invalidation only affects the buffer's own data.

The size requirement is equally important. A 50-byte buffer aligned to 32 bytes spans two cache lines (byte offsets 0-31 and 32-63), but bytes 50-63 of the second line may hold other variables. Invalidating the second cache line corrupts those variables. The solution is to round the buffer size up to the next multiple of 32 (64 bytes in this case). Some teams define a macro — #define CACHE_ALIGN_SIZE(x) (((x) + 31) & ~31) — to automate this.

Pipeline & Barriers

Q: Why does pipeline flush matter for interrupt latency?

When an interrupt arrives, the CPU must abandon the instructions currently in the pipeline and begin executing the ISR. All partially-completed instructions in the pipeline stages — fetched but not decoded, decoded but not executed — are discarded (flushed). The CPU then stacks the context (registers R0-R3, R12, LR, PC, xPSR on Cortex-M), fetches the ISR address from the vector table, and begins filling the pipeline with ISR instructions. The total time from interrupt assertion to the first ISR instruction executing is the interrupt latency.

A deeper pipeline means more stages to flush and refill. The Cortex-M0 has a 3-stage pipeline (fetch, decode, execute), so a flush discards at most 2 instructions and the pipeline refills in a few cycles. The Cortex-M7 has a 6-stage pipeline (plus branch prediction and dual-issue), so a flush is more expensive — though the M7 compensates with faster clock speeds and more sophisticated prefetch logic. The net result is that Cortex-M3/M4 have a deterministic 12-cycle interrupt latency (Cortex-M0 is 16 cycles), while Cortex-M7 has a 12-cycle minimum but with more variability depending on the instruction being interrupted.

Multi-cycle instructions amplify the problem. If the CPU is executing a 64-bit multiply (UMULL), a load-multiple (LDM with 8 registers), or a divide (UDIV, which takes 2-12 cycles on M3/M4), the interrupt cannot be serviced until that instruction completes — the pipeline cannot be flushed mid-instruction. This adds unpredictable delay on top of the base latency. Cortex-M3 and later implement interruptible load/store-multiple instructions (LDM/STM can be abandoned and restarted), which reduces worst-case latency from these long instructions. This is why interrupt latency specifications always list both minimum and maximum values, and why real-time systems must account for the worst case.

Q: What are memory barriers (DSB, DMB, ISB) and when do you need them?

ARM Cortex-M provides three memory barrier instructions that control the ordering of memory operations and instruction execution. They exist because modern processors — especially Cortex-M7 with its write buffers, caches, and multi-stage pipeline — can reorder operations for performance in ways that break hardware interaction and configuration sequences.

DSB (Data Synchronization Barrier): Ensures all memory accesses before the DSB complete before any memory access after the DSB begins. The most common use case is after enabling a peripheral clock: RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN; __DSB(); — the clock enable write must propagate through the bus to the peripheral before you write to GPIOA registers. Without the DSB, the subsequent register write may execute before the clock is active, and the write is silently lost. Also required after writing to the MPU to ensure the new memory attributes are in effect before the next memory access.

DMB (Data Memory Barrier): Ensures all memory accesses before the DMB are observed by other bus masters before any memory accesses after the DMB. This is critical for DMA buffer handoff: when the CPU fills a TX buffer and then sets a DMA enable bit, the DMB guarantees that all buffer writes are visible in SRAM before the DMA starts reading. Without it, the DMA might see partially-updated buffer contents on systems with write buffers or out-of-order memory.
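The DMA handoff pattern can be sketched as follows. The control register and buffer are hypothetical stand-ins for a real DMA stream's registers:

```c
// Illustrative sketch: DMB between filling a TX buffer and enabling the DMA.
extern volatile uint32_t dma_stream_cr;   // hypothetical DMA control register
static uint8_t tx_buf[64];

void kick_dma_tx(void)
{
    tx_buf[0] = 0xA5;     // fill the buffer (possibly many writes)...
    __DMB();              // ...guarantee other bus masters observe them...
    dma_stream_cr |= 1u;  // ...before the DMA sees the enable bit
}
```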

ISB (Instruction Synchronization Barrier): Flushes the pipeline so that all subsequent instructions are re-fetched. Required after changes that affect instruction execution: updating the VTOR (Vector Table Offset Register) — without ISB, an interrupt arriving immediately after might use the old vector table still in the pipeline's fetch stage; modifying MPU regions — the pipeline may have already fetched instructions under the old memory attributes; or changing the CONTROL register (switching between MSP/PSP or privilege levels). In practice, the most common pattern is __DSB(); __ISB(); after MPU or VTOR configuration, ensuring both data writes and instruction fetches observe the new settings.
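That common pattern looks like this in practice. The application base address is an assumption for illustration:

```c
// Illustrative sketch: relocating the vector table, e.g. when a bootloader
// jumps to the application. The address is a hypothetical example.
#define APP_VECTOR_TABLE 0x08020000u

void relocate_vectors(void)
{
    SCB->VTOR = APP_VECTOR_TABLE;
    __DSB();   // complete the VTOR write before any further access
    __ISB();   // flush the pipeline so fetches use the new table
}
```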

Q: You enable a peripheral clock and immediately write to its registers. What could go wrong?

The peripheral clock enable is a write to the RCC (Reset and Clock Control) register on the AHB bus. This write enters the CPU's write buffer and propagates through the bus matrix to the RCC peripheral. The RCC then enables the clock to the target peripheral, but this takes additional time — the clock signal must propagate through clock distribution logic and reach the peripheral's clock domain. Meanwhile, the CPU has already moved on to the next instruction: a write to the peripheral's configuration register.

If the peripheral's clock has not arrived by the time the configuration write reaches the peripheral, the write is silently lost. The peripheral was not clocked and therefore could not latch the data. There is no bus error, no fault, no error flag — the write simply vanishes. The peripheral remains in its default/reset state while the firmware believes it is configured. This is one of the most common and difficult-to-diagnose bugs in ARM Cortex-M firmware, because it is timing-dependent: it may work reliably on one board revision (slightly different PCB trace lengths) and fail on another, or work in debug mode (slower execution) and fail in release builds (full optimization).

The fix is straightforward: insert a __DSB() (Data Synchronization Barrier) between the clock enable and the first peripheral register access:

```c
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;
__DSB();              // Wait for clock enable to propagate
GPIOA->MODER = 0x01;  // Now safe to configure
```

The DSB ensures the clock enable write has completed — fully propagated through the write buffer and bus fabric — before any subsequent memory operation begins. Some engineers use a dummy read-back of the RCC register ((void)RCC->AHB1ENR;) as an alternative, since a read forces the bus to complete the pending write. Both approaches work, but __DSB() is the architecturally correct solution and is what ARM recommends in the Cortex-M programming manuals.

Bus Architecture & Memory Types

QWhat is the difference between AHB and APB buses?

ARM Cortex-M microcontrollers use a multi-layer bus architecture based on the AMBA (Advanced Microcontroller Bus Architecture) specification. The two primary buses — AHB (Advanced High-performance Bus) and APB (Advanced Peripheral Bus) — serve different classes of peripherals and operate at different speeds.

AHB runs at the full system clock frequency (SYSCLK) and connects high-bandwidth peripherals: SRAM, the Flash memory interface, DMA controllers, GPIO ports, and the bus matrix itself. AHB supports pipelined and burst transfers, enabling high throughput. A GPIO write on AHB completes in a single clock cycle — critical for bit-banging protocols or toggling pins at the maximum possible rate. AHB is also where the CPU's instruction and data buses connect, so all code and data fetches go through AHB.

APB is connected to AHB through a bridge (AHB-to-APB bridge) and runs at a prescaled fraction of SYSCLK — typically SYSCLK/2 or SYSCLK/4. It connects lower-bandwidth peripherals: UART, SPI, I2C, timers, ADC, DAC, and RTC. APB uses a simpler protocol with no burst support, reducing gate count and power consumption for peripherals that do not need high bandwidth. The trade-off is access latency: an APB register read takes 2-3 system clock cycles because of the bridge synchronization — the AHB-to-APB bridge must synchronize the request across clock domains and relay the response back.

This distinction matters for firmware optimization. If you are polling an APB peripheral's status register in a tight loop, each read costs 2-3 cycles instead of 1. For GPIO-heavy operations (bit-banging, LED matrix driving), STM32 places GPIO on AHB specifically to enable single-cycle access. When calculating peripheral clock rates for baud rate generators or timer prescalers, remember that APB peripherals are clocked at the APB bus frequency (PCLK1 or PCLK2), not SYSCLK — a common source of baud rate errors when configuring UART or SPI.

Q: What is TCM and when would you use it instead of regular SRAM?

TCM (Tightly Coupled Memory) is a dedicated SRAM connected directly to the CPU core through a private bus, bypassing the main bus matrix and cache entirely. On Cortex-M7, there are two types: ITCM (Instruction TCM) for code and DTCM (Data TCM) for data. The defining characteristic is zero wait-state, deterministic access — every read and write completes in exactly one clock cycle, regardless of what the DMA, other bus masters, or the cache are doing.

This determinism makes TCM ideal for interrupt service routines and real-time control loops. When an ISR is in ITCM, the instruction fetch latency is always one cycle — no cache misses, no flash wait states, no bus contention from concurrent DMA transfers. When the ISR's data (lookup tables, filter coefficients, control state) is in DTCM, every load and store is also one cycle. This eliminates the timing variability that makes worst-case execution time (WCET) analysis so difficult with cached memory.

Specific use cases include: (1) DMA descriptors and small DMA buffers — DTCM is not cached, so there are no cache coherency issues with DMA. Placing DMA buffers in DTCM eliminates the need for cache maintenance calls entirely. (2) Stack memory — placing the main stack (MSP) in DTCM ensures that context save/restore during ISR entry has deterministic timing. (3) High-frequency ISR code — a motor FOC (Field-Oriented Control) ISR running at 20 kHz with hard real-time deadlines benefits from ITCM placement because cache misses on the first invocation after a context switch are eliminated. (4) Audio processing buffers — real-time audio codecs that cannot tolerate jitter from cache misses or bus contention.
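Placement is typically done with section attributes that map onto linker-script output sections. A GCC sketch — the section names are assumptions modeled on common STM32H7 BSPs and must match your linker script:

```c
// Illustrative sketch: pinning hot code and data into TCM via sections.
// ".itcm_text" and ".dtcm_data" are hypothetical section names.
__attribute__((section(".itcm_text")))
void motor_foc_isr(void)
{
    // hard real-time control loop: single-cycle instruction fetches,
    // no flash wait states, no cache misses
}

__attribute__((section(".dtcm_data")))
static float filter_coeffs[32];   // single-cycle, DMA-contention-free data
```

Code placed in ITCM must also be copied there from Flash at startup (the linker script's load/run address split handles this, like `.data` initialization).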

The limitation is size: TCM is typically 64-128 KB on STM32H7, shared among all these use cases. You cannot put everything in TCM, so you must profile your application to identify the code and data that benefit most from deterministic access and place only those items in TCM via linker script sections.

Q: What is the difference between Flash, SRAM, TCM, and CCM on Cortex-M7?

These are the four primary memory types in a Cortex-M7 system, each with different characteristics that dictate how firmware should use them:

Flash (typically 512 KB to 2 MB) stores the program binary and constant data. It is non-volatile — contents survive power cycling. The key limitation is wait states: Flash access is slower than the CPU clock. At 400 MHz, a Cortex-M7 may need 4-7 wait states per Flash read, meaning an instruction fetch from Flash takes 5-8 clock cycles. The ART (Adaptive Real-Time) accelerator and instruction cache mitigate this by caching recently accessed Flash lines, but cache misses still incur the full wait-state penalty. Flash is also read-only during normal execution — writes require an erase-then-program sequence that takes milliseconds and blocks further reads.

SRAM (typically 256-512 KB, split across multiple banks) holds runtime data: global variables, heap, and general-purpose buffers. SRAM access through the bus matrix takes 1-2 clock cycles at full speed, but this can increase when the DMA is also accessing SRAM — bus contention adds wait cycles. On Cortex-M7, SRAM is cacheable by default, which improves average access time for frequently used data but introduces the cache coherency issues discussed earlier.

DTCM/ITCM (64-128 KB each) provide zero wait-state, deterministic access directly coupled to the CPU core. DTCM is for data — stack, ISR variables, DMA buffers that need cache-free access. ITCM is for code — critical ISRs, real-time control functions. TCM bypasses both the cache and the bus matrix, so access latency is always exactly one cycle regardless of DMA activity or cache state. The trade-off is size (much smaller than general SRAM) and DMA limitations (some STM32 variants cannot DMA directly to/from TCM, or have limited DMA connectivity).

CCM (Core-Coupled Memory) is the predecessor to TCM, found on Cortex-M4 parts like the STM32F4. It provides zero wait-state access on a dedicated bus (D-bus), bypassing the main bus matrix. Unlike TCM, CCM on F4 is not accessible by DMA — this is a hard hardware limitation. CCM is ideal for stack memory and CPU-only data structures (RTOS task stacks, computation buffers) but unusable for any buffer involved in DMA transfers. On M7 parts, CCM is replaced by DTCM/ITCM, which have broader DMA connectivity.

Q: Little-endian vs big-endian — when does it matter in embedded?

Endianness defines the byte order of multi-byte values in memory. Little-endian stores the least-significant byte at the lowest address — the 32-bit value 0x12345678 is stored as 78 56 34 12. Big-endian stores the most-significant byte first — the same value is stored as 12 34 56 78. ARM Cortex-M processors are little-endian by default (and most are little-endian only).

Endianness matters whenever firmware exchanges multi-byte data with an external system that uses a different byte order. The most common case is network protocols: TCP/IP, UDP, and virtually all Internet protocols define their headers in big-endian (called "network byte order"). When an embedded device constructs an IP packet, every 16-bit and 32-bit header field — source/destination port, sequence number, packet length — must be byte-swapped from the CPU's native little-endian to big-endian before transmission, and swapped back on reception. Similarly, many industrial protocols (Modbus TCP, CAN with J1939) define their payloads in big-endian.

ARM provides efficient hardware intrinsics for byte swapping: __REV(x) reverses the byte order of a 32-bit word (equivalent to htonl()), and __REV16(x) swaps bytes within each 16-bit halfword (equivalent to htons()). These compile to a single REV or REV16 instruction — one cycle on all Cortex-M cores. Using these intrinsics is far more efficient than the manual shift-and-OR approach that generic C implementations use. A common bug: forgetting to swap when parsing a network packet, leading to values that are off by orders of magnitude (e.g., a 16-bit port number 0x01BB (443) becomes 0xBB01 (47873)). Another trap: casting a uint8_t buffer pointer to a uint32_t* to read a multi-byte field — this both assumes alignment and ignores endianness, producing incorrect values and potential HardFaults on Cortex-M0 (which does not support unaligned access).