MCU & System Architecture
intermediate
Weight: 3/10

CPU fundamentals for embedded

Understand CPU pipeline, cache coherency with DMA, memory barriers, bus architecture (AHB/APB), and memory types (SRAM, TCM, CCM) — the computer architecture concepts embedded interviewers actually test.

cpu
cache
pipeline
bus
mmio
memory-barrier
tcm
ahb
apb

Quick Cap

Embedded interviewers test computer architecture concepts that directly impact firmware behavior: cache coherency with DMA (the #1 gotcha on Cortex-M7), pipeline stalls that affect interrupt latency, bus architecture that determines peripheral access speed, and memory types that control determinism. This page focuses on the architecture concepts that matter for embedded — not desktop CPU design topics like out-of-order execution or virtual memory.

Key Facts:

  • Pipeline: 3-stage (M0) to 6-stage (M7). A pipeline flush on interrupt or branch costs several cycles of latency.
  • Cache: I-cache and D-cache on Cortex-M7 only. DMA writes to memory that the CPU has cached = stale data unless you invalidate.
  • Memory barriers: DSB (complete all memory accesses), DMB (order memory accesses), ISB (flush pipeline). Critical after writing peripheral registers.
  • AHB vs APB: AHB is the high-speed bus (SRAM, DMA, Flash). APB is the low-speed bus (UART, SPI, I2C). APB accesses are slower.
  • TCM (Tightly Coupled Memory): Zero wait-state SRAM with guaranteed single-cycle access — no cache, no contention.
  • Endianness: ARM Cortex-M is little-endian by default. Network protocols are big-endian. You must byte-swap.

Deep Dive

At a Glance

Concept                 | Cortex-M0       | Cortex-M4          | Cortex-M7
Pipeline stages         | 3               | 3                  | 6 (dual-issue)
I-Cache                 | No              | No                 | Yes (4-16 KB)
D-Cache                 | No              | No                 | Yes (4-16 KB)
TCM                     | No              | No                 | Yes (ITCM + DTCM)
Bus matrix              | Simple AHB-Lite | AHB with bus matrix| Multi-layer AHB with arbitration
Cache coherency concern | N/A             | N/A                | DMA + cache = critical

CPU Pipeline

A pipeline allows the CPU to work on multiple instructions simultaneously. While one instruction executes, the next is being decoded, and the one after that is being fetched from memory:

```
         Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Instr 1: Fetch     Decode    Execute
Instr 2:           Fetch     Decode    Execute
Instr 3:                     Fetch     Decode    Execute
```

Why it matters for embedded:

  • Pipeline flush on interrupts: When an interrupt fires, the pipeline must be flushed (partially completed instructions are discarded) before the ISR can begin. A 3-stage pipeline loses up to 3 cycles; a 6-stage pipeline (M7) loses more. This is part of the 12-15 cycle interrupt entry latency.

  • Branch penalty: Conditional branches (if/else, loop endings) may cause a pipeline stall because the CPU does not know which instruction to fetch next until the branch is resolved. Cortex-M has limited branch prediction.

  • Multi-cycle instructions: Instructions like divide (UDIV/SDIV) or load-multiple (LDM) take multiple cycles and stall the pipeline. These can increase worst-case interrupt latency because the CPU must finish the current instruction before servicing the interrupt.

Cache Fundamentals

Cache is a small, fast memory between the CPU and main memory. The CPU reads/writes cache first; only on a miss does it access main SRAM or Flash.

Cache Term    | Meaning
Cache line    | Block of data loaded together (typically 32 bytes on M7)
Hit           | Requested data is in cache — fast (1 cycle)
Miss          | Data not in cache — must fetch from SRAM/Flash (3-10 cycles)
Write-back    | Writes go to cache only; main memory updated later
Write-through | Writes go to both cache and main memory simultaneously
Invalidate    | Mark cache lines as stale — next read fetches from main memory
Clean         | Write dirty cache lines back to main memory

Cache Coherency with DMA — The #1 Embedded Cache Gotcha

DMA transfers data between peripherals and memory without going through the CPU cache. This creates two coherency problems:

Problem 1: DMA writes, CPU reads stale cache

```
1. CPU reads buffer[0] → cached
2. DMA writes new data to buffer[0] in SRAM → cache still has old value
3. CPU reads buffer[0] → gets stale cached value!
```

Fix: Invalidate the D-cache for the buffer region before reading DMA results:

```c
SCB_InvalidateDCache_by_Addr(rx_buffer, sizeof(rx_buffer));
```

Problem 2: CPU writes to cache, DMA reads stale SRAM

```
1. CPU writes tx_data[0] = 0x42 → goes to cache (write-back mode)
2. DMA reads tx_data[0] from SRAM → gets old value (cache hasn't flushed!)
```

Fix: Clean (flush) the D-cache for the buffer region before starting DMA TX:

```c
SCB_CleanDCache_by_Addr(tx_buffer, sizeof(tx_buffer));
```

Alternative solutions:

  • Configure the DMA buffer region as non-cacheable via the MPU — simplest but sacrifices cache benefits for that region
  • Use DTCM for DMA buffers — no cache, zero wait states, no coherency issue
  • Use write-through cache policy — every CPU write goes to both cache and SRAM (slightly slower but always coherent for DMA reads)

⚠️ Cache Alignment Requirement

Cache invalidation operates on entire cache lines (32 bytes on M7). If your DMA buffer is not aligned to a cache line boundary, invalidating it may corrupt adjacent variables that share the same cache line. Always align DMA buffers: __attribute__((aligned(32))) uint8_t dma_buf[256];

Memory Barriers

Memory barriers are CPU instructions that enforce ordering of memory operations. On Cortex-M, there are three:

Barrier | Full Name                           | What It Does                                                            | When to Use
DSB     | Data Synchronization Barrier        | Ensures all memory accesses before DSB complete before any after it     | After writing to a peripheral register that triggers an action (e.g., enabling a peripheral clock, then immediately accessing that peripheral)
DMB     | Data Memory Barrier                 | Ensures ordering of memory accesses (but does not wait for completion)  | Before/after DMA buffer access to ensure writes are visible
ISB     | Instruction Synchronization Barrier | Flushes the pipeline — ensures subsequent instructions are fetched fresh| After modifying VTOR, enabling the MPU, or changing privilege level

The most common interview scenario: You enable a peripheral clock and immediately write to that peripheral's registers. Without a DSB between the clock enable and the register write, the write may execute before the clock is actually enabled (due to pipeline and bus buffering), and the write is silently lost.

```c
RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;  /* Enable GPIOA clock */
__DSB();                              /* Wait for clock to actually enable */
GPIOA->MODER = ...;                   /* Now safe to configure */
```

💡 Interview Insight

Most embedded engineers never explicitly use memory barriers because vendor HAL functions insert them internally, and most Cortex-M0/M3/M4 code works without them due to simpler bus architectures. On Cortex-M7 and multi-core systems, they become critical. Mentioning memory barriers in an interview shows deep architectural understanding.

Bus Architecture: AHB vs APB

The ARM AMBA bus system connects the CPU to memory and peripherals through a hierarchy:

```
            ┌──────────┐
            │   CPU    │
            └────┬─────┘
            ┌────┴─────┐
            │   AHB    │  High-speed bus (up to SYSCLK)
            │  Matrix  │
            └─┬───┬──┬─┘
              │   │  │
     ┌────────┘   │  └────────┐
     ▼            ▼           ▼
┌────────┐   ┌────────┐  ┌────────┐
│  SRAM  │   │  DMA   │  │ Flash  │
└────────┘   └────────┘  └────────┘

            ┌──────────┐
            │ AHB-APB  │  Bridge (prescaled clock)
            │  Bridge  │
            └──┬───┬───┘
               │   │
      ┌────────┘   └────────┐
      ▼                     ▼
 ┌────────┐            ┌────────┐
 │  APB1  │            │  APB2  │
 │(42 MHz)│            │(84 MHz)│
 └─┬───┬──┘            └─┬───┬──┘
   │   │                 │   │
 UART  I2C             SPI1  ADC
 SPI2  TIM2            TIM1  USART1
```

Bus  | Speed                       | Connected Peripherals             | Access Time
AHB  | Full SYSCLK (e.g., 168 MHz) | SRAM, DMA, GPIO, Flash controller | 1 cycle
APB1 | SYSCLK/4 (e.g., 42 MHz)     | UART2/3, SPI2/3, I2C, TIM2-7      | 2-3 cycles
APB2 | SYSCLK/2 (e.g., 84 MHz)     | UART1, SPI1, ADC, TIM1/8          | 2 cycles

Why it matters: Every APB register access crosses the AHB-APB bridge at a divided clock, costing extra cycles per transaction. In tight polling or bit-bang loops, that bus-speed difference can limit throughput. GPIO sits on AHB specifically because fast toggling matters.

Memory Types

Cortex-M7 has several memory types optimized for different use cases:

Memory        | Speed                           | Cached?             | DMA Access?                           | Best For
Flash         | Slow (wait states)              | I-cache helps       | Read-only by DMA                      | Code, const data
SRAM          | Fast (1-2 cycles)               | D-cache optional    | Yes                                   | General variables, heap
DTCM          | Fastest (0 wait, deterministic) | No (bypasses cache) | Device-specific (via AHBS slave port) | ISR data, stack, time-critical variables
ITCM          | Fastest (0 wait, deterministic) | No (bypasses cache) | Device-specific (via AHBS slave port) | ISR code, time-critical functions
CCM (STM32F4) | Fast                            | No                  | No                                    | Stack, ISR-local data
Backup SRAM   | Slow                            | No                  | No                                    | Data preserved across resets (with battery)

Key interview insight: TCM provides deterministic access time — no cache misses, no bus contention with DMA. This is why critical ISR code and real-time data structures should be placed in TCM. Use linker script sections to place specific functions in ITCM:

```c
__attribute__((section(".itcm_text")))
void Motor_Control_ISR(void) { /* runs from ITCM — zero wait state */ }
```
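
For the attribute above to work, the linker script must define a matching output section. This is a sketch of a GNU ld fragment; the region name, address, and size are illustrative (STM32H7-style) and must match your device's actual script, and startup code must copy the section from its Flash load address to ITCM at boot:

```ld
MEMORY
{
  ITCM (xrw) : ORIGIN = 0x00000000, LENGTH = 64K
}
SECTIONS
{
  .itcm_text :
  {
    *(.itcm_text)
  } > ITCM AT > FLASH  /* stored in Flash, executed from ITCM */
}
```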

Endianness

ARM Cortex-M is little-endian by default: the least significant byte is stored at the lowest address.

```
Value: 0x12345678

Little-endian (ARM):        Big-endian (network):
Addr+0: 0x78 (LSB)          Addr+0: 0x12 (MSB)
Addr+1: 0x56                Addr+1: 0x34
Addr+2: 0x34                Addr+2: 0x56
Addr+3: 0x12 (MSB)          Addr+3: 0x78 (LSB)
```

When it matters: Network protocols (TCP/IP, CAN, Modbus) typically use big-endian (network byte order). When parsing or building protocol packets, you must byte-swap multi-byte fields. ARM provides intrinsics: __REV() (reverse bytes in a word), __REV16() (reverse bytes in each halfword).

Debugging Story: Ethernet DMA Corruption on Cortex-M7

A team ported their Ethernet-based industrial controller from STM32F4 (no cache) to STM32H7 (Cortex-M7 with D-cache). The port compiled and ran, but Ethernet packets had random corruption — about 1 in 100 packets arrived with wrong data.

The root cause: the Ethernet DMA was writing received packet data to SRAM. The CPU then read the packet data — but from D-cache, which still held stale data from a previous packet. The fix was to place the Ethernet DMA buffers in a non-cacheable MPU region:

```c
/* MPU region for Ethernet DMA buffers — non-cacheable, non-bufferable */
MPU->RNR  = 0; /* Region 0 */
MPU->RBAR = ETH_DMA_BUF_ADDR;
MPU->RASR = MPU_RASR_ENABLE | MPU_REGION_SIZE_32KB |
            MPU_ACCESS_NOT_CACHEABLE | MPU_ACCESS_FULL;
```

Alternatively, the team could have used SCB_InvalidateDCache_by_Addr() before reading each packet, but the MPU approach is simpler and avoids the alignment constraints.

The lesson: When porting code from a cache-less MCU (M0/M3/M4) to a cached one (M7), every DMA buffer is a potential coherency bug. Audit all DMA paths during the port.

What Interviewers Want to Hear

  • You understand cache coherency with DMA and can describe both the problem and multiple solutions
  • You know what memory barriers do and can give a concrete example of when DSB is needed
  • You can explain the AHB/APB bus hierarchy and why it affects peripheral access speed
  • You understand TCM and when to use it over regular SRAM
  • You can discuss pipeline stalls and their impact on interrupt latency
  • You know about endianness and its relevance for protocol implementations

Interview Focus

Classic Interview Questions

Q1: "Explain cache coherency issues with DMA and how to solve them."

Model Answer Starter: "DMA transfers data between peripherals and main memory without going through the CPU cache. This creates two problems: when DMA writes to memory, the CPU may read stale cached data; when the CPU writes to cache, DMA may read stale data from main memory. Three solutions: invalidate the D-cache before reading DMA results, clean the D-cache before starting DMA writes, or configure the buffer region as non-cacheable via the MPU. TCM memory avoids the issue entirely because it is not cached."

Q2: "What's the difference between AHB and APB buses?"

Model Answer Starter: "AHB (Advanced High-performance Bus) runs at full system clock speed and connects high-bandwidth peripherals like SRAM, DMA, GPIO, and the Flash controller. APB (Advanced Peripheral Bus) runs at a divided clock and connects lower-bandwidth peripherals like UART, SPI, I2C, and timers. APB accesses take 2-3 cycles due to the AHB-APB bridge and prescaled clock. This matters when doing high-frequency register polling or bit-bang protocols — GPIO is on AHB specifically for fast toggling."

Q3: "What are memory barriers and when do you need them in embedded?"

Model Answer Starter: "Memory barriers enforce ordering of memory operations. DSB ensures all pending memory accesses complete before any subsequent ones start — needed after enabling a peripheral clock before accessing its registers. DMB orders memory accesses without waiting for completion — useful for DMA buffer handoffs. ISB flushes the instruction pipeline — needed after changing the vector table (VTOR) or MPU configuration. Most code works without explicit barriers on M0/M4, but they become critical on M7 and multi-core systems."

Q4: "What is TCM and when would you use it instead of regular SRAM?"

Model Answer Starter: "TCM (Tightly Coupled Memory) provides zero wait-state, deterministic access — every read and write takes exactly one cycle, regardless of what else is happening on the bus. Regular SRAM access can be delayed by DMA contention, cache misses, or Flash wait states. I use ITCM for time-critical ISR code (motor control, audio processing) and DTCM for real-time data structures (PID state, DMA descriptors). The key advantage is determinism — no jitter in access time."

Q5: "Why does pipeline flush matter for interrupt latency?"

Model Answer Starter: "When an interrupt fires, the CPU must discard partially-executed instructions in the pipeline before it can begin the ISR. On a 3-stage pipeline (M0/M4), this costs 1-3 cycles. On a 6-stage dual-issue pipeline (M7), the flush is larger. Additionally, if the CPU is executing a multi-cycle instruction (divide, load-multiple), it must wait for that instruction to complete before flushing. This is why worst-case interrupt latency depends on the longest instruction in your code, not just the pipeline depth."

Trap Alerts

  • Don't say: "Cache is always good for performance" — in embedded, cache introduces coherency bugs with DMA that can be worse than the performance gain
  • Don't forget: That DMA buffers must be cache-line aligned (32 bytes on M7) when using cache invalidation
  • Don't ignore: Memory barriers after clock enable — a classic source of "works in debug, fails in release" bugs

Follow-up Questions

  • "Write-through vs write-back cache — what are the tradeoffs in embedded?"
  • "How do you ensure a DMA buffer is cache-line aligned?"
  • "What is the difference between Flash, SRAM, TCM, and CCM on Cortex-M7?"
  • "How would you measure the performance impact of cache on your application?"
  • "When would you NOT want to use cache on a Cortex-M7?"

💡 Practice CPU Fundamentals Interview Questions

Ready to test yourself? Head over to the CPU Fundamentals Interview Questions page for a full set of Q&A with collapsible answers — perfect for self-study and mock interview practice.

Practice

DMA writes data to a buffer in SRAM. The CPU then reads the buffer. On Cortex-M7 with D-cache enabled, what is the risk?

You enable GPIOA clock via RCC->AHB1ENR, then immediately write to GPIOA->MODER. What could go wrong?

Why is TCM preferred over SRAM for time-critical ISR code on Cortex-M7?

ARM Cortex-M uses little-endian byte order. The 32-bit value 0xDEADBEEF is stored in memory starting at address 0x20000000. What byte is at address 0x20000000?

Real-World Tie-In

High-Speed Data Acquisition — An STM32H7-based oscilloscope captures ADC samples at 5 MSPS via DMA into DTCM. DTCM was chosen over regular SRAM to avoid cache coherency issues entirely — no invalidation needed, no alignment constraints, and deterministic access for the DSP processing that runs immediately after each DMA transfer completes.

Ethernet Gateway Migration — A factory automation gateway was ported from STM32F407 (Cortex-M4) to STM32H743 (Cortex-M7). The port worked immediately for UART and CAN, but Ethernet showed intermittent packet corruption. Root cause: DMA receive buffers in cached SRAM. The fix (applying an MPU non-cacheable region to the Ethernet buffer area) took 30 minutes once diagnosed, but finding the bug took 3 days. The team added "audit all DMA buffers for cache coherency" to their M4-to-M7 porting checklist.