HAL vs Bare Metal
Q: When would you use HAL vs bare-metal register access? Give a real example.
The decision is rarely all-or-nothing — most production firmware is a hybrid. HAL (Hardware Abstraction Layer) accelerates prototyping and improves portability within a vendor family. If you are evaluating three different STM32 boards for a new product, HAL lets you swap targets by changing a build configuration rather than rewriting every peripheral driver. It is also the right choice for peripherals that are complex to configure but not performance-critical — USB device stacks, Ethernet MAC setup, or RTC calendar initialization, where getting all 15 registers right on the first try matters more than saving a few microseconds.
Bare-metal register access is essential for timing-critical hot paths and size-constrained targets. Consider a motor control loop running at 20 kHz on a Cortex-M4: the ADC conversion-complete ISR must read the result, compute a PID update, and write a new PWM duty cycle — all within 50 microseconds. Calling HAL_ADC_GetValue() and HAL_TIM_PWM_ConfigChannel() adds function call overhead, parameter validation, and lock acquisition that you cannot afford in this ISR. Writing directly to ADC1->DR and TIM1->CCR1 eliminates that overhead entirely. Similarly, on an STM32F0 with 16 KB flash, the HAL UART driver alone can consume 2-4 KB — a significant fraction of the total.
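A minimal sketch of that hot-path ISR, to make the contrast concrete. The gains, clamp limits, and register names here are illustrative, and the registers are mocked as plain volatile variables so the logic compiles on a host — on target they would be the CMSIS definitions ADC1->DR and TIM1->CCR1.

```c
#include <stdint.h>
#include <assert.h>

/* Host-side stand-ins for memory-mapped registers (on target:
 * ADC1->DR and TIM1->CCR1 from the CMSIS device header). */
volatile uint32_t ADC1_DR;
volatile uint32_t TIM1_CCR1;

static int32_t integral; /* PI integrator state, persists across ISR calls */

/* Hypothetical 20 kHz control ISR: read result, compute PI update,
 * write duty cycle. No HAL calls, no locks, no parameter checks. */
void adc_isr(int32_t setpoint)
{
    int32_t error = setpoint - (int32_t)(ADC1_DR & 0x0FFF); /* 12-bit result */
    integral += error;
    /* Illustrative integer gains: Kp = 2, Ki = 1/16 */
    int32_t duty = 2 * error + (integral >> 4);
    if (duty < 0)    duty = 0;
    if (duty > 1000) duty = 1000;   /* PWM period assumed 1000 counts */
    TIM1_CCR1 = (uint32_t)duty;     /* single store, no function call */
}
```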
The practical pattern in production: use HAL for initial peripheral setup (clock configuration, GPIO alternate function mapping, DMA channel configuration) and bare-metal register access inside ISRs and real-time loops. This gives you the convenience of HAL where it matters and the performance of direct access where it counts.
Q: How do you access a hardware register in C? What is the role of volatile?
Hardware peripherals on ARM Cortex-M are memory-mapped — each peripheral register occupies a fixed address in the processor's memory space. To access a register in C, you cast its address to a pointer and dereference it. The standard pattern uses a struct that mirrors the register layout, cast from the peripheral's base address:
```c
// Typical CMSIS-style register access
#define GPIOA_BASE 0x40020000UL
#define GPIOA ((GPIO_TypeDef *)GPIOA_BASE)

GPIOA->ODR |= (1 << 5); // Set pin 5 high
```
The volatile qualifier is mandatory for all hardware register pointers. Without it, the compiler is free to optimize away accesses that appear redundant from a pure C perspective. Three specific optimizations break hardware interaction: (1) read elimination — if you read a status register twice in a loop, the compiler may cache the first read and never re-read, so you never see the flag change; (2) write elimination — writing the same value twice to a control register (e.g., to generate a pulse) may be collapsed into one write; (3) reordering — the compiler may move register accesses relative to other operations if it determines they are independent, but hardware often requires specific ordering (e.g., write control register before data register).
volatile tells the compiler: every read must actually go to memory, every write must actually be emitted, and volatile accesses must stay in source order relative to each other (volatile does not constrain reordering against ordinary non-volatile operations). Note also that volatile does not prevent the CPU's hardware reordering — on Cortex-M7, you may still need memory barriers (__DSB(), __DMB()) for that. But for most Cortex-M0/M3/M4 code, the processor's strongly-ordered peripheral memory region combined with volatile is sufficient to guarantee correct hardware register access.
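The classic case where volatile matters is the status-flag busy-wait. A sketch, with the "register" mocked as a host variable accessed through a volatile-qualified pointer so the loop shape is visible; the bound on the spin count is an addition for safety and testability, not part of the minimal idiom.

```c
#include <stdint.h>
#include <assert.h>

/* Host-safe mock of a status register. Because the access goes through
 * a volatile-qualified pointer, the compiler must re-read the value on
 * every loop iteration instead of caching the first read. */
uint32_t mock_status;
#define STATUS_REG  (*(volatile uint32_t *)&mock_status)
#define FLAG_READY  (1u << 0)

/* Bounded busy-wait: 0 when the flag appears, -1 if it never does. */
int wait_ready(uint32_t max_spins)
{
    while (max_spins--) {
        if (STATUS_REG & FLAG_READY)   /* fresh read each pass */
            return 0;
    }
    return -1;
}
```

Without the volatile qualifier, an optimizing compiler may hoist the read out of the loop and spin forever on a stale value.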
Q: What are the downsides of using STM32 HAL in production?
The most impactful downside is code size. The HAL is designed for generality — every function handles every possible configuration and checks for error conditions that may not apply to your use case. A simple HAL_UART_Transmit() call includes parameter validation, lock acquisition, timeout management, and status flag checking. On a flash-constrained part like the STM32F030 (16 KB flash), the HAL UART driver plus its dependencies can consume 3-4 KB. Multiply that across five peripherals and half your flash is gone before you write any application logic.
Hidden timeouts are a more subtle problem. Many HAL functions use polling loops with timeout parameters — HAL_SPI_Transmit(&hspi1, data, len, HAL_MAX_DELAY) will spin indefinitely if the SPI peripheral is misconfigured. Even with a finite timeout, a 100 ms HAL timeout inside an ISR or a tight control loop conflicts with watchdog timers and real-time deadlines. The HAL's timeout mechanism is based on HAL_GetTick(), which depends on the SysTick interrupt — if you call a HAL function from a higher-priority ISR, HAL_GetTick() never increments and the timeout never expires, creating a deadlock.
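One way around the HAL_GetTick() dependency is to key timeouts to a free-running counter that does not need interrupts — on Cortex-M3/M4/M7 the DWT cycle counter (DWT->CYCCNT) is a common choice. A sketch with the counter and the ready condition injected as function pointers so the logic is host-testable; the names below are illustrative.

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t (*cycle_source_t)(void);

/* Spin until ready_flag() is true or timeout_cycles elapse. The
 * unsigned subtraction stays correct across counter wraparound, and
 * nothing here depends on the SysTick interrupt firing. */
int wait_with_cycle_timeout(cycle_source_t cycles,
                            int (*ready_flag)(void),
                            uint32_t timeout_cycles)
{
    uint32_t start = cycles();
    while (!ready_flag()) {
        if ((uint32_t)(cycles() - start) > timeout_cycles)
            return -1;   /* timed out, even at the highest IRQ priority */
    }
    return 0;
}

/* --- host-side mocks (would not exist on target) --- */
static uint32_t fake_cycles;
uint32_t mock_cycles(void)     { return fake_cycles += 10; }
int mock_never_ready(void)     { return 0; }
int mock_ready(void)           { return 1; }
```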
Abstraction hiding timing bugs is the third concern. The HAL abstracts away the exact sequence and timing of register writes, making it difficult to reason about what happens at the hardware level. For example, HAL_ADC_Start_DMA() configures the DMA, enables the ADC, and starts conversion — but the exact ordering and any inserted delays are buried in the implementation. If a cache coherency issue or bus contention bug appears, debugging requires stepping through HAL source code that you did not write and may not fully understand. In a bare-metal implementation, every register write is visible in your code and the timing is explicit.
Driver Patterns
Q: Design a simple UART driver API — what functions would you expose?
A production UART driver API should provide five core functions that separate configuration from data flow and support both blocking and non-blocking operation:
```c
typedef struct {
    uint32_t baudrate;
    uint8_t  word_length;  // 8 or 9
    uint8_t  parity;       // NONE, EVEN, ODD
    uint8_t  stop_bits;    // 1 or 2
} uart_config_t;

int  uart_init(uint8_t port, const uart_config_t *cfg);
int  uart_send(uint8_t port, const uint8_t *data, uint16_t len);
void uart_register_rx_callback(uint8_t port,
                               void (*cb)(const uint8_t *data, uint16_t len));
int  uart_get_error(uint8_t port);
void uart_deinit(uint8_t port);
```
uart_init() configures the peripheral, sets up DMA channels and interrupts, and allocates internal ring buffers. It returns an error code if the port is invalid or already initialized. uart_send() is non-blocking — it copies data into a transmit ring buffer and starts the DMA or interrupt-driven transfer. It returns immediately, allowing the caller to continue processing. If the TX buffer is full, it returns an error rather than blocking. uart_register_rx_callback() lets the application provide a function pointer that is called from the RX half-transfer or idle-line ISR with a pointer to the received data and its length.
uart_get_error() returns a bitmask of error flags (framing error, overrun, parity error, buffer overflow) accumulated since the last call, and clears them. This is better than returning errors from individual send/receive calls because errors often occur asynchronously. uart_deinit() disables the peripheral, releases DMA channels, and frees resources — essential for low-power modes where you need to shut down unused peripherals. The key design principle is that the API header contains no hardware-specific types — no UART_HandleTypeDef, no register addresses, no STM32-specific enums. This makes the API portable: swapping MCU families requires only a new implementation file, not changes to every file that calls the driver.
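The ring-buffer half of uart_send() could be sketched as follows. The buffer size is arbitrary, and the "start the hardware" step is reduced to setting a flag — in a real implementation that line would enable the TXE interrupt or arm a DMA transfer.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define TX_BUF_SIZE 64u   /* power of two so masking replaces modulo */

typedef struct {
    uint8_t  buf[TX_BUF_SIZE];
    volatile uint16_t head;   /* written by uart_send (producer)   */
    volatile uint16_t tail;   /* advanced by the TX ISR (consumer) */
} tx_ring_t;

static tx_ring_t tx;
static int tx_started;        /* stand-in for "enable TXE IRQ / start DMA" */

static uint16_t ring_free(const tx_ring_t *r)
{
    return (uint16_t)(TX_BUF_SIZE - 1
                      - ((r->head - r->tail) & (TX_BUF_SIZE - 1)));
}

/* Non-blocking: copy into the ring and return; -1 if it won't fit. */
int uart_send(uint8_t port, const uint8_t *data, uint16_t len)
{
    (void)port;
    if (len > ring_free(&tx))
        return -1;                       /* caller retries later — never blocks */
    for (uint16_t i = 0; i < len; i++) {
        tx.buf[tx.head] = data[i];
        tx.head = (tx.head + 1) & (TX_BUF_SIZE - 1);
    }
    tx_started = 1;                      /* real code: kick the hardware here */
    return 0;
}
```

The power-of-two size keeps the producer and consumer index arithmetic to a single AND, which matters when the consumer side runs in an ISR.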
Q: Blocking vs non-blocking driver design — what are the tradeoffs?
A blocking driver holds the CPU in a polling loop until the operation completes. For UART transmission, this means spinning inside uart_send() until the last byte has shifted out of the transmit data register. The advantage is simplicity — the function returns only after the data is fully sent, so the caller knows the operation is done and can immediately reuse the buffer. No callbacks, no state machines, no concurrency concerns.
The downsides are severe in production systems. First, the CPU is 100% idle during the transfer — a 100-byte UART transmission at 9600 baud takes roughly 100 ms, during which a 72 MHz Cortex-M4 wastes approximately 7.2 million clock cycles doing nothing. Second, blocking inside an ISR is catastrophic — it prevents lower-priority interrupts from being serviced and can trigger a watchdog reset if the watchdog timeout is shorter than the transfer time. Third, blocking breaks real-time guarantees — if a motor control loop must execute every 50 microseconds but a blocking SPI transfer takes 200 microseconds, you miss three control deadlines.
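The blocking approach described above looks like this in miniature. The status register and TXE bit are mocked as host variables (the mock keeps TXE permanently set so the sketch terminates); on an STM32 the loop would poll USARTx->SR for the transmit-empty flag.

```c
#include <stdint.h>
#include <assert.h>

/* Mocked registers: on real hardware, writing the data register
 * clears TXE until the byte moves to the shift register. */
volatile uint32_t mock_sr = 1u;   /* TXE permanently set in this mock */
volatile uint32_t mock_dr;
#define SR_TXE 1u

void uart_send_blocking(const uint8_t *data, uint16_t len)
{
    for (uint16_t i = 0; i < len; i++) {
        while (!(mock_sr & SR_TXE))   /* CPU spins here, doing nothing */
            ;
        mock_dr = data[i];
    }
}
```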
Non-blocking drivers use interrupts or DMA to perform the transfer in the background. uart_send() copies data into a ring buffer, starts the DMA, and returns immediately. The completion is signaled via a callback, a flag, or an RTOS semaphore. This is the production-grade approach because it keeps the CPU available for computation and maintains deterministic timing for real-time tasks. The cost is complexity: you need buffer management, ISR-safe data structures, and careful handling of concurrent access between the ISR and the main application. You also need to track whether a previous transfer is still in progress before starting a new one. Despite this complexity, non-blocking design is the standard for any system that does more than blink an LED.
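The "is a previous transfer still in progress?" bookkeeping can be as small as one volatile flag shared between the API call and the completion ISR. A sketch with illustrative names; on real hardware spi_dma_complete_isr would be wired to the DMA transfer-complete vector.

```c
#include <stdint.h>
#include <assert.h>

static volatile uint8_t spi_busy;

int spi_start_transfer(void)
{
    if (spi_busy)
        return -1;        /* previous transfer in flight — caller decides */
    spi_busy = 1;
    /* real code: program DMA source/length, enable the stream */
    return 0;
}

/* Called from the DMA transfer-complete interrupt. */
void spi_dma_complete_isr(void)
{
    spi_busy = 0;         /* volatile write: visible to the main loop */
}
```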
Q: How would you test a hardware driver without the actual hardware?
The most effective technique is dependency injection through function pointers. Instead of having the driver directly access hardware registers, define a "hardware interface" struct with function pointers for each low-level operation:
```c
typedef struct {
    void     (*write_reg)(uint32_t addr, uint32_t val);
    uint32_t (*read_reg)(uint32_t addr);
    void     (*delay_us)(uint32_t us);
} hw_interface_t;
```
In production, these point to real register access functions. In unit tests, they point to mock functions that record what was written and return pre-programmed values. This lets you verify that the driver writes the correct values to the correct registers in the correct order, without touching real hardware. You can also inject error conditions — make read_reg() return a status register value indicating a timeout or NACK — to test error handling paths that are difficult to trigger on real hardware.
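A worked example of the mock side. The sensor, its register addresses, and the two-step reset sequence are all hypothetical — the point is that the mock records every write so a test can check both the values and the order.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

typedef struct {
    void     (*write_reg)(uint32_t addr, uint32_t val);
    uint32_t (*read_reg)(uint32_t addr);
    void     (*delay_us)(uint32_t us);
} hw_interface_t;

/* Driver code under test: a made-up two-step reset sequence. */
#define CTRL_REG 0x40u
#define STAT_REG 0x44u

int sensor_reset(const hw_interface_t *hw)
{
    hw->write_reg(CTRL_REG, 0x01);     /* assert reset bit        */
    hw->delay_us(100);                 /* datasheet-style wait    */
    hw->write_reg(CTRL_REG, 0x00);     /* release reset           */
    return (hw->read_reg(STAT_REG) & 0x1) ? 0 : -1;  /* ready bit */
}

/* --- mock implementation, test-only --- */
static uint32_t log_addr[8], log_val[8];
static size_t   log_n;
static uint32_t fake_status = 0x1;     /* pre-programmed "ready" response */

static void mock_write(uint32_t a, uint32_t v)
{
    log_addr[log_n] = a;
    log_val[log_n]  = v;
    log_n++;
}
static uint32_t mock_read(uint32_t a) { (void)a; return fake_status; }
static void mock_delay(uint32_t us)   { (void)us; }

const hw_interface_t mock_hw = { mock_write, mock_read, mock_delay };
```

Setting fake_status to 0 in a second test would exercise the error path — the kind of condition that is hard to trigger on real hardware.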
For integration-level testing, emulators like QEMU (for Cortex-M) or Renode provide a simulated MCU with modeled peripherals. You flash your actual firmware binary into the emulator and it executes on a virtual CPU with virtual UART, SPI, GPIO, and DMA controllers. Renode is particularly powerful because it models peripheral behavior accurately enough to test DMA transfers, interrupt timing, and multi-peripheral interactions. You can script test scenarios — "inject 50 bytes on the virtual UART RX at 115200 baud and verify the firmware echoes them back" — and run these tests in CI/CD pipelines on every commit.
A practical middle ground is hardware-in-the-loop (HIL) testing, where the MCU runs on a physical development board connected to a test host via a debug probe. The host scripts test sequences: send data over UART, toggle GPIO lines, capture logic analyzer traces, and verify the firmware's responses. This catches issues that pure software mocking misses — like DMA configuration errors, clock tree misconfiguration, or interrupt priority bugs — while still being automated and repeatable.
Portability & Abstraction
Q: What is CMSIS and how does it differ from vendor HALs?
CMSIS (Cortex Microcontroller Software Interface Standard) is an ARM-defined standard that provides two things: (1) a set of core access functions for Cortex-M processor features — NVIC configuration (NVIC_EnableIRQ(), NVIC_SetPriority()), SysTick setup, MPU/FPU control, and intrinsics like __DSB(), __WFI(), __REV(); and (2) device header files that define register struct typedefs and base addresses for a specific MCU (e.g., stm32f407xx.h). These headers are generated from SVD (System View Description) files that silicon vendors provide, ensuring every register, every bit field, and every peripheral instance is accurately mapped.
The critical distinction: CMSIS adds zero runtime overhead. It is purely a compile-time abstraction — #define macros, typedef structs, and static inline functions. When you write GPIOA->BSRR = (1 << 5), the compiler generates a single STR instruction to the absolute address. There is no function call, no parameter validation, no lock. CMSIS does not include any peripheral driver logic — it gives you named access to registers but does not tell you what sequence to write to them.
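The shape of a CMSIS device header, reduced to two registers. The struct here uses plain volatile where the real headers use the __IO macro, and the peripheral instance is a host-allocated mock rather than the fixed base address, so the sketch compiles anywhere.

```c
#include <stdint.h>
#include <assert.h>

/* Reduced GPIO_TypeDef; real STM32 headers declare members __IO
 * (volatile) and list every register in the block at its offset. */
typedef struct {
    volatile uint32_t ODR;    /* output data register                */
    volatile uint32_t BSRR;   /* bit set/reset register (write-only) */
} GPIO_TypeDef;

/* Host stand-in; on target this would be:
 *   #define GPIOA ((GPIO_TypeDef *)0x40020000UL) */
static GPIO_TypeDef gpioa_mock;
#define GPIOA (&gpioa_mock)

void led_on(void)
{
    /* Compiles to a single store on target — no call, no checks. */
    GPIOA->BSRR = (1u << 5);
}
```

(Reading BSRR back, as the test does, only works on the mock — on hardware it is write-only.)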
Vendor HALs (STM32 HAL, NXP MCUXpresso SDK, TI DriverLib) build on top of CMSIS. They add peripheral driver functions — HAL_UART_Transmit(), HAL_SPI_Init() — that encapsulate the register write sequences, handle error checking, manage state, and provide a higher-level API. This adds convenience but also runtime overhead: function calls, parameter validation, timeout loops, and internal state tracking. The result is larger code size and less predictable timing. In practice, almost every embedded project uses CMSIS (you cannot avoid it — it defines how you access the core and peripherals), but whether you also use the vendor HAL is a project-level decision based on code size, performance, and portability requirements.
Q: How do you make a driver portable across MCU families?
The standard pattern is a platform-independent API header paired with per-platform implementation files, selected at compile time by the build system. The API header declares the driver interface using only standard C types — no vendor-specific types, no register addresses, no #include of any hardware header:
```c
// gpio_driver.h — platform-independent
int  gpio_init(uint8_t port, uint8_t pin, gpio_mode_t mode);
void gpio_write(uint8_t port, uint8_t pin, uint8_t value);
int  gpio_read(uint8_t port, uint8_t pin);
```
Each supported platform has its own implementation file — gpio_stm32f4.c, gpio_nrf52.c, gpio_esp32.c — that implements these functions using the platform's registers or vendor SDK. The build system (CMake, Makefile) selects the correct file based on a target variable:
```make
ifeq ($(PLATFORM),stm32f4)
    SRCS += gpio_stm32f4.c
else ifeq ($(PLATFORM),nrf52)
    SRCS += gpio_nrf52.c
endif
```
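One platform implementation file might look like the sketch below. The register block is mocked as a host array (one output register per port); comments note what the real STM32F4 version would do with BSRR. The four-port limit and mode enum are illustrative.

```c
#include <stdint.h>
#include <assert.h>

typedef enum { GPIO_MODE_INPUT, GPIO_MODE_OUTPUT } gpio_mode_t;

static volatile uint32_t mock_odr[4];   /* host stand-in for per-port ODR */

int gpio_init(uint8_t port, uint8_t pin, gpio_mode_t mode)
{
    if (port >= 4 || pin >= 16)
        return -1;
    (void)mode;  /* real code: program MODER, OSPEEDR, PUPDR here */
    return 0;
}

void gpio_write(uint8_t port, uint8_t pin, uint8_t value)
{
    if (value)
        mock_odr[port] |=  (1u << pin);  /* real: GPIOx->BSRR = 1u << pin        */
    else
        mock_odr[port] &= ~(1u << pin);  /* real: GPIOx->BSRR = 1u << (pin + 16) */
}

int gpio_read(uint8_t port, uint8_t pin)
{
    return (int)((mock_odr[port] >> pin) & 1u);
}
```

Note that the header above never sees any of this — callers compile against gpio_driver.h alone, and the build system swaps this file for gpio_nrf52.c on a different target.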
The tricky part is defining the API at the right level of abstraction. Too low (exposing register-level details) and every platform implementation becomes trivial but the API is not truly portable. Too high (abstracting away all hardware differences) and you lose the ability to use platform-specific features or optimize for specific hardware. The sweet spot is an API that maps to common peripheral capabilities — init, read, write, configure interrupt — with a platform-specific "extras" extension point for features that only some hardware supports (e.g., STM32's GPIO BSRR atomic set/reset, or nRF52's GPIO SENSE for wake-from-sleep).
A common mistake is trying to abstract away differences that are fundamentally incompatible — for example, I2C on STM32 uses a peripheral with built-in state machine and interrupts, while bit-banged I2C on a tiny 8-bit MCU is purely software-driven. The driver API can be the same, but the implementations are so different that sharing code between them provides no real benefit. Portability is a spectrum: aim for API compatibility (same function signatures) rather than code reuse across wildly different platforms.