
MCU Cores & Clocking — Interview Questions & Answers

Practice interview questions on ARM Cortex-M families, RISC-V, clock tree configuration, PLL, and power modes.


Core Architecture

Q: Compare Cortex-M4 vs Cortex-M7 — when would you choose each?

The Cortex-M4 and M7 are both high-performance Cortex-M cores with hardware floating-point units and DSP instructions (single-cycle MAC, SIMD), but they differ significantly in microarchitecture and target applications. The M4 is a single-issue, in-order pipeline (3-stage for most instructions) that achieves roughly 1.25 DMIPS/MHz. It has no instruction or data cache and no tightly coupled memory (TCM). Its simplicity makes it deterministic and predictable — interrupt latency is fixed at 12 cycles, and there are no cache-related timing variations. The M4 is the sweet spot for cost-sensitive applications that need moderate DSP capability: motor control, sensor fusion, audio processing, and industrial control. Typical M4 parts (STM32F4, STM32G4) run at 72-180 MHz with 128 KB to 1 MB Flash.

The M7 is a superscalar, dual-issue, 6-stage pipeline that achieves up to 2.14 DMIPS/MHz — nearly double the M4 per clock cycle. It includes instruction and data caches (typically 4-16 KB each), tightly coupled memories (TCM) with single-cycle deterministic access, branch prediction, and optional double-precision FPU. These features make the M7 suitable for workloads that the M4 cannot handle: running from external QSPI Flash or SDRAM (where caches mask the slow external memory latency), high-throughput signal processing, graphics rendering (with Chrom-ART DMA), and complex protocol stacks. Typical M7 parts (STM32F7, STM32H7) run at 216-550 MHz.

The tradeoff is cost, power, and determinism. M7 parts are 2-5x more expensive than comparable M4 parts, consume more dynamic power due to the deeper pipeline and caches, and have less deterministic interrupt latency because cache misses introduce variable delays. For real-time control loops where jitter matters (motor FOC at 20 kHz), the M4's fixed latency is actually an advantage — or on M7, you place the ISR code in TCM to bypass the cache entirely. Choose M4 when your workload fits in internal Flash/RAM and you need predictable timing at low cost. Choose M7 when you need external memory, high throughput, or the computational headroom that dual-issue provides.
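The "place the ISR code in TCM" trick can be sketched in a couple of lines. The section name `.itcm_text` is an assumption — it must match whatever output section your linker script actually places in the ITCM address range and arranges to copy from Flash at startup:

```c
/* Sketch: pinning a jitter-sensitive ISR into ITCM on a Cortex-M7 part.
   ".itcm_text" is a hypothetical section name — it only works if your
   linker script defines a matching output section in the ITCM region
   (0x00000000 on STM32F7/H7) and startup code copies it from Flash. */
__attribute__((section(".itcm_text")))
void TIM1_UP_TIM10_IRQHandler(void)
{
    /* motor FOC loop: fetched from single-cycle TCM, so no I-cache-miss jitter */
}
```

This is a toolchain/linker-script-dependent fragment, so the linker side (MEMORY region, section placement, load-time copy) is the part that varies between projects.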

Q: What is TrustZone on Cortex-M33, and when would you use it?

TrustZone for Cortex-M (ARMv8-M architecture) provides hardware-enforced isolation between a Secure world and a Non-Secure world within a single processor. Unlike software-only isolation (MPU regions, privilege levels), TrustZone partitions the entire memory map, peripherals, and interrupt system into secure and non-secure domains at the bus level. The Secure Attribution Unit (SAU) and Implementation-Defined Attribution Unit (IDAU) define which address ranges are secure, non-secure, or non-secure callable (NSC — the entry points for secure API calls). When non-secure code attempts to access a secure memory region, the hardware generates a SecureFault exception — no software check is needed, and no software bug can bypass it.

The security model works like this: secure firmware runs first after reset, configures the security boundaries, exposes a set of secure gateway functions (marked with the SG instruction in NSC regions), then boots the non-secure application. The non-secure application can call secure services only through these gateways — it cannot read secure memory, access secure peripherals, or modify the security configuration. Secure and non-secure code each have their own stack pointers, and the hardware automatically saves and clears secure registers during world transitions to prevent information leakage. Secure interrupts can preempt non-secure code, but non-secure interrupts cannot access secure state.

Use TrustZone when your product needs to protect cryptographic keys, firmware update verification, secure boot chains, or sensitive data processing from potentially untrusted application code. Concrete examples: payment terminals (PCI-DSS compliance requires key isolation), medical devices (FDA cybersecurity guidance recommends hardware-enforced partitioning), IoT devices with OTA updates (secure bootloader verifies firmware signatures in the secure world, preventing a compromised application from modifying the update process), and multi-tenant industrial controllers (different vendors' application code runs in the non-secure world while the platform owner controls the secure services). The M33 also includes an optional MPU in each world, so you can combine TrustZone (secure vs non-secure) with MPU (privilege levels within each world) for defense-in-depth.
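A secure gateway function from the model above can be sketched with the CMSE support in GCC/armclang (ARMv8-M targets compiled with `-mcmse`). The function name `secure_sign` is illustrative, not from any real SDK:

```c
#include <arm_cmse.h>   /* CMSE intrinsics; requires an ARMv8-M toolchain with -mcmse */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical secure service: signs a buffer with a key that never leaves
   the secure world. The cmse_nonsecure_entry attribute makes the compiler
   emit the SG-instruction veneer in the NSC region and clear secure register
   state on return, exactly the gateway mechanism described above. */
__attribute__((cmse_nonsecure_entry))
int32_t secure_sign(const uint8_t *buf, size_t len, uint8_t *sig_out)
{
    /* Verify the non-secure caller actually owns this buffer — otherwise it
       could pass a secure-world address and trick us into leaking it. */
    if (cmse_check_address_range((void *)buf, len, CMSE_NONSECURE) == NULL)
        return -1;
    /* ... compute the signature here using the secure-world key ... */
    (void)sig_out;
    return 0;
}
```

The pointer-validation step is the part interviewers usually probe: the hardware stops non-secure code from reading secure memory directly, but a secure service that dereferences caller-supplied pointers without `cmse_check_address_range` can be turned into a confused deputy.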

Q: RISC-V vs ARM in embedded — what are the key differences?

The most fundamental difference is the business model. ARM is a proprietary ISA — chip vendors pay licensing fees and royalties to ARM Holdings for the right to implement Cortex-M cores. RISC-V is an open-standard ISA maintained by RISC-V International — anyone can implement it without fees. This has led to a wave of RISC-V microcontrollers (ESP32-C3/C6, GD32VF103, BL602/BL616, Kendryte K210) from vendors seeking to avoid ARM licensing costs or gain design flexibility. For engineers, this means RISC-V parts are often cheaper at the same performance tier, and startups or academic projects can design custom RISC-V cores without legal barriers.

From a technical architecture standpoint, several differences matter for embedded firmware. RISC-V has no single standard interrupt controller equivalent to ARM's NVIC: larger platforms use the PLIC (Platform-Level Interrupt Controller) with its claim/complete programming model, while many MCU-class parts use vendor-specific controllers or the newer CLIC, so interrupt-handling code is less portable across RISC-V chips. RISC-V also has no equivalent to ARM's automatic context save on exception entry — the firmware or compiler must save/restore registers in ISR prologues/epilogues, which adds a few cycles of interrupt latency compared to Cortex-M's hardware stacking. RISC-V's modular ISA extensions (M for multiply, A for atomics, F/D for float, C for compressed instructions, V for vector) mean that two "RISC-V" chips may have very different instruction sets, whereas "Cortex-M4F" guarantees a specific feature set everywhere.

The ecosystem maturity gap is the most practical difference today. ARM Cortex-M has 15+ years of tooling: mature compilers (GCC, Clang, IAR, Keil), debuggers (J-Link, ST-Link, OpenOCD), RTOS ports (FreeRTOS, Zephyr, ThreadX), and middleware (lwIP, mbedTLS, USB stacks) that have been battle-tested in billions of deployed devices. RISC-V tooling is catching up rapidly — GCC and LLVM support is solid, OpenOCD works with RISC-V debug (via the RISC-V Debug Spec), and FreeRTOS/Zephyr have official RISC-V ports — but vendor-specific HALs, peripheral libraries, and IDE integrations are less mature. For a production project today, ARM is the safer choice for time-to-market; for a new product with a 2-3 year development horizon, RISC-V is increasingly viable and avoids vendor lock-in to ARM's licensing terms.

Clock Tree

Q: Walk me through configuring the clock tree from an external crystal to 168 MHz on an STM32F4.

The clock tree configuration follows a specific sequence where order matters — getting it wrong causes hard faults or peripheral misconfiguration. Start with the system running on the default internal oscillator (HSI at 16 MHz). Step 1: Enable the HSE (High-Speed External oscillator) by setting the HSEON bit in RCC_CR. Wait for the HSERDY flag to set, confirming the external crystal (typically 8 MHz or 25 MHz) has stabilized. This takes a few milliseconds. If the crystal is not populated or damaged, HSERDY never sets — robust firmware includes a timeout here that falls back to HSI.

Step 2: Configure Flash wait states for the target frequency before switching to the faster clock. At 168 MHz and 3.3V supply, the STM32F4 requires 5 wait states (WS5) in the FLASH_ACR register. Also enable the prefetch buffer and instruction/data caches (ART Accelerator) to mitigate the wait-state penalty. Configuring wait states after switching the clock would cause the CPU to attempt fetching instructions faster than Flash can deliver them, resulting in corrupt instruction reads and an immediate hard fault.

Step 3: Configure the PLL. The PLL takes the HSE frequency and transforms it through three dividers: VCO_input = HSE / M, VCO_output = VCO_input * N, SYSCLK = VCO_output / P. For an 8 MHz crystal targeting 168 MHz: M = 8 (VCO input = 1 MHz), N = 336 (VCO output = 336 MHz, must be within the 100-432 MHz valid range), P = 2 (SYSCLK = 168 MHz). Also set Q = 7 for the USB OTG FS clock (VCO/Q = 48 MHz required by USB). Write all these to RCC_PLLCFGR, set the PLL source to HSE, then enable the PLL via PLLON in RCC_CR and wait for PLLRDY.

Step 4: Configure bus prescalers — AHB prescaler = 1 (HCLK = 168 MHz), APB1 prescaler = 4 (APB1 = 42 MHz, must not exceed 42 MHz), APB2 prescaler = 2 (APB2 = 84 MHz, must not exceed 84 MHz). Step 5: Switch SYSCLK to PLL by setting SW bits in RCC_CFGR to PLL, then verify SWS bits confirm the switch. The entire sequence takes microseconds of CPU time plus millisecond-scale oscillator stabilization. If any step is done out of order — especially configuring Flash wait states after the clock switch — the system will fault immediately and unpredictably.
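The five steps above can be sketched in register-level C. The `RCC`/`FLASH` structs below are local mocks so the sequence can be traced and tested on a host — on real hardware you would include the CMSIS device header (`stm32f4xx.h`) instead, and the ready flags would be set by the hardware rather than by the code itself:

```c
#include <stdint.h>

/* Host-runnable sketch of the 168 MHz bring-up. Mocked registers; the mock
   "hardware" sets HSERDY/PLLRDY instantly, where real firmware must poll
   with a timeout (and fall back to HSI if HSE never becomes ready). */
typedef struct { uint32_t CR, PLLCFGR, CFGR; } RCC_t;
typedef struct { uint32_t ACR; } FLASH_t;
static RCC_t   RCC_mock;
static FLASH_t FLASH_mock;
#define RCC   (&RCC_mock)
#define FLASH (&FLASH_mock)

#define RCC_CR_HSEON  (1u << 16)
#define RCC_CR_HSERDY (1u << 17)
#define RCC_CR_PLLON  (1u << 24)
#define RCC_CR_PLLRDY (1u << 25)

void SystemClock_168MHz(void)
{
    /* Step 1: enable HSE and wait for the crystal to stabilise */
    RCC->CR |= RCC_CR_HSEON;
    RCC->CR |= RCC_CR_HSERDY;               /* mock only: real HW sets this itself */
    while (!(RCC->CR & RCC_CR_HSERDY)) { }  /* real code adds a timeout here */

    /* Step 2: 5 wait states + prefetch + I/D caches (ART) BEFORE going fast */
    FLASH->ACR = (1u << 8) | (1u << 9) | (1u << 10) | 5u; /* PRFTEN|ICEN|DCEN|WS5 */

    /* Step 3: PLL for an 8 MHz HSE: M=8, N=336, P=2 (encoded 00), Q=7, src=HSE */
    RCC->PLLCFGR = 8u | (336u << 6) | (0u << 16) | (1u << 22) | (7u << 24);
    RCC->CR |= RCC_CR_PLLON;
    RCC->CR |= RCC_CR_PLLRDY;               /* mock only */
    while (!(RCC->CR & RCC_CR_PLLRDY)) { }

    /* Steps 4-5: AHB /1, APB1 /4, APB2 /2, then SW = PLL (0b10).
       Real code then polls the SWS bits (CFGR[3:2]) until they read PLL. */
    RCC->CFGR = (0u << 4) | (5u << 10) | (4u << 13) | 2u;
}
```

The bit positions mirror the STM32F4 RCC/FLASH register layout (PLLM in PLLCFGR[5:0], PLLN in [14:6], PLLP in [17:16], PLLSRC at bit 22, PLLQ in [27:24]); verify them against your part's reference manual before reuse.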

Q: HSI vs HSE — what are the tradeoffs?

The HSI (High-Speed Internal) oscillator is an on-chip RC oscillator that requires no external components and starts within microseconds of being enabled. Its typical accuracy is ±1% at room temperature (25 °C), degrading to ±2-3% over the full temperature range (-40 to 85 °C). The HSI is always available, making it ideal as a safe fallback clock source — if the external crystal fails, the system can revert to HSI and continue operating in a degraded mode. For applications that do not require precise timing — GPIO toggling, ADC sampling with timer triggers, motor control where the control loop compensates for clock drift — HSI is perfectly adequate and saves BOM cost (no crystal, no load capacitors, no PCB routing constraints).

The HSE (High-Speed External) oscillator uses an external crystal or ceramic resonator, requiring two pins and two load capacitors (typically 10-20 pF). Its accuracy depends on the crystal: a standard 8 MHz crystal provides ±20 ppm (0.002%) accuracy — roughly 50x better than HSI. The HSE takes 2-10 ms to start and stabilize (the crystal needs time to build oscillation amplitude), which is a consideration for ultra-fast wake-up scenarios. The HSE is mandatory for certain peripherals: USB requires a 48 MHz clock derived from PLL with HSE accuracy (HSI drift exceeds the USB ±0.25% tolerance), CAN requires ±0.5% clock accuracy that HSI cannot guarantee across temperature, and Ethernet needs a 25 MHz reference that is typically derived from the HSE.

The decision comes down to the application's timing requirements: if you use USB, CAN, Ethernet, or any protocol with tight baud rate tolerance, you need HSE. If you use UART at moderate baud rates (up to 115200), HSI is usually acceptable — at 1% error, the worst-case bit timing error is within the UART's 3-5% tolerance at low baud rates but may fail at 1 Mbaud or higher. A practical middle ground on some STM32 families is the HSI with calibration trimming — the HSITRIM bits in RCC_CR allow fine-tuning the HSI frequency in approximately 40 kHz steps. Some applications measure HSI against an external timing reference (like a GPS 1PPS signal or an RTC with a 32.768 kHz crystal) and dynamically adjust HSITRIM for sub-0.1% accuracy without a high-speed crystal.

Q: What is the PLL and how do you calculate its output frequency?

The PLL (Phase-Locked Loop) is a clock multiplier circuit that takes a low-frequency input clock and generates a much higher-frequency output. On STM32F4, the main PLL has four programmable dividers/multipliers that form a frequency synthesis chain. The input source is either HSI (16 MHz) or HSE (typically 8 or 25 MHz), selected by the PLLSRC bit. The chain is:

VCO_input  = PLL_source / M   (must be 1-2 MHz, typically 1 MHz)
VCO_output = VCO_input * N    (must be 100-432 MHz)
SYSCLK     = VCO_output / P   (the main system clock)
USB_CLK    = VCO_output / Q   (must be exactly 48 MHz for USB)

The VCO (Voltage-Controlled Oscillator) is the core of the PLL and must operate within its valid frequency range (100-432 MHz on STM32F4). If VCO_output is configured outside this range, the PLL will not lock — meaning it either oscillates at an unpredictable frequency or fails to stabilize at all, which produces a clock that jitters wildly and causes intermittent hard faults or peripheral misbehavior. The M divider scales the input down to the VCO's input range (1-2 MHz recommended for best jitter performance), and N multiplies it up to the target VCO frequency. P divides the VCO output to produce SYSCLK, and Q divides it to produce the 48 MHz USB clock.

For a concrete example with an 8 MHz HSE crystal targeting 168 MHz SYSCLK and 48 MHz USB: M = 8 gives VCO_input = 1 MHz; N = 336 gives VCO_output = 336 MHz (within the 100-432 MHz range); P = 2 gives SYSCLK = 168 MHz; Q = 7 gives USB_CLK = 48 MHz. A common mistake is choosing values that produce the right SYSCLK while violating a constraint elsewhere in the chain. Note that P can only be 2, 4, 6, or 8 on the STM32F4, so you cannot compensate for a low VCO frequency with P = 1. For example, M = 8, N = 96, P = 2 yields a plausible-looking 48 MHz SYSCLK, but VCO_output = 96 MHz is below the 100 MHz minimum, so the PLL never locks. Similarly, choosing values where VCO_output / Q is not exactly 48 MHz silently breaks USB. Some STM32 families (H7) have additional PLL outputs (PLL1P, PLL1Q, PLL1R) and fractional-N dividers for even finer frequency control.

Q: What happens if you access a peripheral whose clock is disabled?

On STM32 (and most ARM Cortex-M MCUs), each peripheral is connected to the clock tree through individual clock gate bits in the RCC (Reset and Clock Control) registers. After a system reset, most peripheral clocks are disabled by default — the peripheral receives no clock signal, its registers do not function, and it consumes no dynamic power. If firmware attempts to read or write a peripheral's registers while its clock gate is disabled, the behavior is architecturally undefined on the AHB/APB bus and varies by implementation.

In practice on STM32, writes are silently ignored — the bus transaction completes without error, but the data is discarded because the peripheral's register logic is not clocked and cannot latch the value. Reads typically return zero (the bus returns a default value) or, on some families, may return stale data from the last time the peripheral was clocked. On some MCUs (particularly with stricter bus implementations), accessing an unclocked peripheral generates a bus fault (HardFault on Cortex-M0, or BusFault if enabled on M3+). The inconsistency across families makes this a particularly insidious bug — code that "works" on one MCU may fault on another.

This is one of the most common bugs in embedded development, especially for beginners. The typical symptom is: "I configured the peripheral registers but it does not work, and reading the registers back shows all zeros." The fix is always the same — enable the peripheral clock via the appropriate RCC enable register (RCC->AHB1ENR, RCC->APB1ENR, etc.) before any register access. A second subtle issue: the clock enable write and the first peripheral register access may execute back-to-back. On high-performance Cortex-M7 with write buffering, the clock enable may still be in the write buffer when the peripheral read executes, causing a read of the still-unclocked peripheral. The STM32 HAL inserts a read-back of the enable register after writing it (__IO uint32_t tmpreg = RCC->AHB1ENR; (void)tmpreg;) to force the write buffer to drain and ensure the clock is active before proceeding.
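The enable-then-read-back pattern described above looks like this in practice. The register is mocked here so the snippet runs on a host; on hardware it would be `RCC->AHB1ENR` from the CMSIS device header:

```c
#include <stdint.h>

/* Mock stand-in for RCC->AHB1ENR so this compiles and runs on a host. */
static volatile uint32_t AHB1ENR_mock;
#define RCC_AHB1ENR_GPIOAEN (1u << 0)

static void gpioa_clock_enable(void)
{
    AHB1ENR_mock |= RCC_AHB1ENR_GPIOAEN;

    /* Read-back barrier: the volatile read cannot complete until the write
       above has left the Cortex-M7 write buffer, so by the time we return,
       the peripheral clock is guaranteed active. This is the same idiom the
       STM32 HAL uses in its __HAL_RCC_..._CLK_ENABLE macros. */
    volatile uint32_t tmpreg = AHB1ENR_mock;
    (void)tmpreg;

    /* now safe to touch GPIOA registers */
}
```

A plain `__DSB()` after the write achieves the same ordering on real hardware; the read-back idiom has the advantage of working identically across compilers without CMSIS intrinsics.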

Power Modes

Q: How do low-power modes work on STM32? How do you wake from Stop mode?

STM32 MCUs provide a hierarchy of low-power modes with increasing power savings and increasing recovery complexity. Sleep mode stops only the CPU core — all peripherals, clocks, and SRAM contents remain active. Current drops from tens of mA to a few mA. Any interrupt wakes the CPU, and execution resumes immediately from where it stopped. Stop mode goes further: it disables the HSI, HSE, and PLL (the entire high-speed clock tree is shut down), but preserves SRAM and register contents. Current drops to the microamp range (typically 2-30 uA depending on the STM32 family and voltage regulator mode). All peripheral clocks stop, so peripherals freeze in their current state. Standby mode is the deepest: it disables the voltage regulator entirely, losing all SRAM and register contents (only backup domain registers and RTC survive). Current drops to sub-microamp levels (0.3-2 uA). Wake-up from Standby is effectively a system reset.

Waking from Stop mode requires a wake-up source that can operate without the main clock tree: EXTI interrupts (external pin edges routed through the EXTI controller, which runs on a separate asynchronous detection circuit), the RTC alarm or wakeup timer (clocked by the 32.768 kHz LSE or LSI, which remain active in Stop mode), or on some families, LPUART (Low-Power UART, clocked by LSE). The wake-up event triggers the CPU to restart, but critically, the clock configuration reverts to HSI at 16 MHz — the HSE and PLL are off and must be reconfigured. If your application immediately accesses peripherals whose baud rates or timing were calculated for 168 MHz, they will be wrong by a factor of 10x until you reconfigure the PLL.

The wake-up sequence is therefore: (1) wake event triggers exit from Stop; (2) CPU resumes executing on HSI at 16 MHz; (3) firmware re-enables HSE and waits for HSERDY; (4) reconfigures and enables the PLL, waits for PLLRDY; (5) switches SYSCLK back to PLL. Only then should the application resume normal peripheral operations. A well-designed low-power application encapsulates the clock restoration in a SystemClock_Config() function called immediately in the wake-up path. The total wake-up latency is dominated by the HSE startup time (2-10 ms) and PLL lock time (sub-millisecond), making Stop mode unsuitable for applications that need sub-millisecond wake-to-active response times — use Sleep mode instead for those cases.
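The wake-up path above reduces to a small, strictly ordered control flow. The HAL names follow ST's Cube conventions, but the bodies here are host-runnable stubs that only model the clock-frequency side effect; the argument values are stand-ins for the real `PWR_LOWPOWERREGULATOR_ON` / `PWR_STOPENTRY_WFI` macros:

```c
#include <stdint.h>

static uint32_t sysclk_hz = 168000000u;   /* models the current SYSCLK */

/* Stub for the ST HAL call of the same name: on real hardware this executes
   WFI and the CPU halts until an EXTI/RTC wake event. The one side effect we
   model is the important one — SYSCLK reverts to HSI (16 MHz) on exit. */
static void HAL_PWR_EnterSTOPMode(uint32_t regulator, uint8_t entry)
{
    (void)regulator; (void)entry;
    sysclk_hz = 16000000u;
}

/* Stub for the usual clock-restore routine: re-enable HSE, wait HSERDY,
   re-lock the PLL, wait PLLRDY, switch SYSCLK back to PLL. */
static void SystemClock_Config(void)
{
    sysclk_hz = 168000000u;
}

void low_power_wait_for_event(void)
{
    /* park pins / quiesce peripherals as needed, then: */
    HAL_PWR_EnterSTOPMode(0u /* regulator mode */, 1u /* WFI entry */);

    /* First thing after wake: restore clocks. Any peripheral access before
       this runs against a 16 MHz HSI with baud rates off by ~10x. */
    SystemClock_Config();
}
```

The essential property — and the thing to assert in a real port — is that `SystemClock_Config()` runs before any code that depends on bus-clock-derived timing.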

Q: What is clock gating and why does it matter?

Clock gating is the practice of disabling the clock signal to unused peripheral blocks so their internal flip-flops stop toggling and consume zero dynamic power. On STM32, every peripheral has an individual clock enable bit in the RCC (Reset and Clock Control) registers — for example, RCC->APB1ENR |= RCC_APB1ENR_USART2EN enables USART2's clock, and clearing that bit disables it. After reset, most peripheral clocks are disabled by default. Dynamic power in CMOS circuits is proportional to P = C * V^2 * f — by stopping the clock (f = 0), the dynamic power consumption of that peripheral block drops to essentially zero (only static leakage remains).

Clock gating is the simplest and most effective power optimization available to the embedded engineer, yet it is frequently overlooked. A typical STM32F4 application might enable clocks for GPIOA-E, USART1, SPI1, I2C1, TIM2, and DMA1 — but leave clocks enabled for DMA2, TIM3-TIM14, USART2-6, ADC1-3, and other unused peripherals that the HAL or BSP initialization code enabled "just in case." Each unused peripheral with an active clock wastes hundreds of microamps to several milliamps, depending on the peripheral complexity and the clock frequency. In a battery-powered design where the power budget is 50 mA active and 10 uA sleep, leaving unnecessary peripheral clocks running can waste 10-20% of the active power budget for zero benefit.

Best practices: (1) only enable peripheral clocks immediately before configuring the peripheral, and disable them when the peripheral is no longer needed (e.g., disable USART clock after a transmission is complete if communication is bursty); (2) audit your RCC enable register values — print them during development to identify unnecessarily enabled clocks; (3) use the MCU's power consumption calculator (STM32CubeMX has one) to estimate per-peripheral current draw; (4) during low-power mode entry, disable all peripheral clocks except those needed for wake-up detection. Clock gating has zero impact on functionality (a disabled peripheral simply freezes in its current state and its registers retain their values on most STM32 families) and zero code complexity — it is essentially free power savings.

Q: How does dynamic frequency scaling work on an MCU, and what are the pitfalls?

Dynamic frequency scaling (DFS) means changing the CPU clock frequency at runtime based on the current workload — running at full speed during computationally intensive phases (signal processing, protocol handling) and dropping to a lower frequency during idle or light-load phases to save power. Since dynamic power scales linearly with frequency (P = C * V^2 * f), halving the clock frequency roughly halves the dynamic power consumption. On STM32, DFS is implemented by changing the PLL multiplier/divider values or the AHB/APB prescalers while the system is running.

The implementation sequence is critical and mirrors the initial clock configuration in reverse. To reduce frequency: first change the PLL configuration or prescalers to the lower frequency, wait for the PLL to re-lock (if PLL settings changed), then reduce Flash wait states to match the new frequency (fewer wait states actually improves performance at lower speeds). To increase frequency: first increase Flash wait states to accommodate the higher speed, then reconfigure the PLL/prescalers, and wait for PLL lock. The ordering of Flash wait states relative to frequency change is essential — too few wait states at a higher frequency causes instruction fetch corruption and hard faults, while too many wait states at a lower frequency is safe but wastes performance.

The major pitfall is that every peripheral whose timing depends on the clock frequency must be reconfigured after a frequency change. UART baud rate is derived from the APB clock — if you halve PCLK1 without updating the USART BRR register, the baud rate halves and communication fails. SPI clock dividers, timer prescaler/ARR values, ADC sampling time, I2C timing registers, and SysTick reload values all depend on their bus clock frequency. Missing even one reconfiguration produces a subtle bug: the UART might work at the slightly-wrong baud rate with occasional framing errors, or a timer interrupt fires at half the expected rate. A robust DFS implementation maintains a table of peripheral reconfiguration callbacks that are invoked automatically whenever the clock changes. Alternatively, some STM32H7 designs avoid full PLL reconfiguration by keeping the PLL fixed and only changing the AHB prescaler, which is simpler but limits the frequency range to powers-of-two divisions.
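The callback-table approach from the last paragraph can be sketched in a few lines. All names here are illustrative (not from any real HAL), and the UART divisor calculation ignores oversampling and fractional baud fields for brevity:

```c
#include <stdint.h>
#include <stddef.h>

/* Registry of clock-change callbacks: every driver whose timing depends on
   a bus clock registers one, and the clock-change routine invokes them all
   with the new frequency — so no peripheral is silently left misconfigured. */
#define MAX_CLK_CALLBACKS 8
typedef void (*clk_cb_t)(uint32_t new_pclk_hz);

static clk_cb_t callbacks[MAX_CLK_CALLBACKS];
static size_t   n_callbacks;
static uint32_t uart_brr;   /* stand-in for a real USARTx->BRR register */

static void clk_register(clk_cb_t cb)
{
    if (n_callbacks < MAX_CLK_CALLBACKS)
        callbacks[n_callbacks++] = cb;
}

/* Example subscriber: the UART driver recomputes its 115200-baud divisor. */
static void uart_on_clock_change(uint32_t pclk_hz)
{
    uart_brr = pclk_hz / 115200u;
}

static void set_bus_clock(uint32_t pclk_hz)
{
    /* ... reprogram prescalers/PLL and Flash wait states (in the safe order
       described above) before this point ... */
    for (size_t i = 0; i < n_callbacks; i++)
        callbacks[i](pclk_hz);   /* every clock-dependent driver fixes itself up */
}
```

Usage: each driver calls `clk_register()` during init; after that, a single `set_bus_clock(21000000u)` halves the frequency and every subscriber's divisor tracks it automatically — which is exactly the property that prevents the half-rate-UART class of bug.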