Quick Cap
Every embedded firmware must talk to hardware — but how you talk to it has major implications for portability, performance, code size, and debuggability. Bare-metal drivers manipulate registers directly for maximum control and minimum overhead. HAL (Hardware Abstraction Layer) drivers provide a high-level API that hides register details, making code portable across MCU families but adding abstraction overhead. Most real-world projects use a hybrid: HAL for standard peripherals (GPIO, UART, I2C) and bare-metal for time-critical or performance-sensitive code paths.
Key Facts:
- Bare-metal: Direct register access via volatile pointers. Maximum performance, zero overhead, but tied to one MCU family.
- HAL: Vendor-provided API (STM32 HAL, ESP-IDF, NRF SDK). Portable within a vendor family, slower, larger code size, can hide bugs.
- CMSIS: ARM's thin standardized layer — defines core access (NVIC, SysTick) and device register structs, but no peripheral drivers.
- Register access: Always through volatile pointers. Read-modify-write is NOT atomic and must be protected in concurrent contexts.
- Abstraction penalty: HAL function calls add 5-20 cycles overhead per call plus code size. For a GPIO toggle, HAL may be 10x slower than direct register write.
- The right answer is always "it depends": prototyping favors HAL, timing-critical production code favors bare-metal, multi-platform products favor custom thin abstractions.
Deep Dive
At a Glance
| Approach | Code Size | Performance | Portability | Debug Visibility | Development Speed |
|---|---|---|---|---|---|
| Bare-metal register | Minimal | Maximum | None (MCU-specific) | Full — you see every bit | Slow (must read datasheets) |
| CMSIS device headers | Minimal | Maximum | Within ARM Cortex-M | Full | Medium (struct access) |
| Vendor HAL (STM32 HAL) | Large (+50-200 KB) | Good (5-20% overhead) | Within vendor family | Limited (abstracted) | Fast (init structs) |
| Framework (Zephyr, ESP-IDF) | Large | Good | Cross-vendor | Framework-dependent | Fast |
| Custom thin HAL | Small | Near-maximum | Your platforms | Full | Medium |
The Abstraction Spectrum
Less abstraction and more control at the top; more abstraction and more portability at the bottom.
| Level | Set PA5 high |
|---|---|
| Register bit-bang | GPIOA->BSRR = (1<<5); |
| CMSIS device headers | GPIOA->BSRR = GPIO_BSRR_BS_5; |
| Vendor HAL (STM32 HAL, NRF SDK) | HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET); |
| Zephyr / ESP-IDF | gpio_pin_set_dt(&led_gpio, 1); |
| Arduino HAL | digitalWrite(LED_PIN, HIGH); |
All five lines do the same thing: set GPIO PA5 high. But they differ enormously in overhead, portability, and what the developer needs to know about the hardware.
Bare-Metal Register Access
At the lowest level, peripherals are memory-mapped registers. You access them through volatile pointers:
    /* Direct register access: sets PA5 in ~2 cycles */
    #include <stdint.h>

    /* GPIOA BSRR: port base 0x40020000 + offset 0x18 (e.g. on STM32F4) */
    #define GPIOA_BSRR (*(volatile uint32_t *)0x40020018)

    GPIOA_BSRR = (1u << 5);   /* Set PA5 high (atomic, single write) */
    GPIOA_BSRR = (1u << 21);  /* Set PA5 low (reset bit = pin + 16) */
Why volatile? Without volatile, the compiler may:
- Cache the register value in a CPU register and never re-read it (missing hardware changes)
- Optimize away "redundant" writes (but hardware needs every write)
- Reorder reads/writes (peripheral registers are order-sensitive)
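The first point is easiest to see in a polling loop. The following is a minimal sketch, not taken from any vendor library: `wait_for_bits` and `max_spins` are illustrative names, and the "register" can be any `volatile uint32_t`, which makes the helper easy to exercise on a host. Note that volatile only constrains the compiler; it does not make an access atomic and it is not a memory barrier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: spin until all bits in `mask` are set in *reg,
 * giving up after `max_spins` reads. Because `reg` points to a volatile
 * object, the compiler must perform a real load on every iteration
 * instead of hoisting the read out of the loop. */
static bool wait_for_bits(const volatile uint32_t *reg, uint32_t mask,
                          uint32_t max_spins)
{
    assert(reg != NULL);
    while (max_spins-- > 0u) {
        if ((*reg & mask) == mask) {
            return true;    /* hardware (or a test) set the bits */
        }
    }
    return false;           /* gave up: the bits never appeared */
}
```

Without the volatile qualifier on the pointee, an optimizing compiler would be allowed to read the register once and turn the loop into either an immediate return or a spin on a stale value.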
Read-modify-write hazard:
    /* This is NOT atomic: an ISR could modify the register between read and write */
    GPIOA->ODR |= (1 << 5);  /* Read ODR, OR with bit 5, write back */
If an interrupt fires between the read and write and modifies another bit in ODR, that change is lost. Use the BSRR register (which is a write-only atomic set/reset register) or wrap in a critical section.
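The lost update can be reproduced deterministically on a host. The sketch below is a simulation, not real hardware access: `odr` stands in for the ODR register, `fake_isr` plays the interrupt, and `bsrr_write` models the set/reset semantics of BSRR. All of these names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for GPIOx->ODR. */
static uint32_t odr;

/* Simulated ISR that sets bit 3 with its own read-modify-write. */
static void fake_isr(void) { odr |= (1u << 3); }

/* Main-line code sets bit 5 with a read-modify-write, and the
 * "interrupt" fires between the read and the write-back. */
static uint32_t rmw_with_interrupt(void)
{
    odr = 0u;
    uint32_t tmp = odr;     /* 1. read */
    fake_isr();             /*    ISR sets bit 3 in odr */
    tmp |= (1u << 5);       /* 2. modify the stale copy */
    odr = tmp;              /* 3. write back: bit 3 is overwritten */
    return odr;
}

/* Models a BSRR-style write: the high half clears bits, the low half
 * sets bits, and set wins if both are requested for the same pin.
 * On real hardware this is a single store, so there is no stale copy. */
static void bsrr_write(uint32_t value)
{
    odr &= ~(value >> 16);
    odr |= (value & 0xFFFFu);
}

static uint32_t bsrr_with_interrupt(void)
{
    odr = 0u;
    fake_isr();             /* the ISR can run at any point around the store */
    bsrr_write(1u << 5);
    return odr;
}

/* Usage: the RMW path loses the ISR's bit, the BSRR path keeps both. */
static void demo_lost_update(void)
{
    assert(rmw_with_interrupt() == 0x20u);
    assert(bsrr_with_interrupt() == 0x28u);
}
```

When a register has no set/reset companion, the usual alternative is to disable interrupts around the read-modify-write, for example with the CMSIS intrinsics `__disable_irq()` and `__enable_irq()`.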
CMSIS: The Middle Ground
CMSIS (Cortex Microcontroller Software Interface Standard) is ARM's standardized layer. It provides:
| CMSIS Component | What It Does | Example |
|---|---|---|
| CMSIS-Core | Core register access, NVIC, SysTick functions | NVIC_SetPriority(), __disable_irq() |
| Device headers | Peripheral register structs for a specific MCU | GPIOA->BSRR, USART1->DR |
| CMSIS-DSP | Optimized DSP functions (FFT, FIR, matrix) | arm_fir_f32() |
| CMSIS-RTOS | Standardized RTOS API | osThreadNew(), osMutexAcquire() |
CMSIS device headers give you named struct access to registers instead of raw addresses, which is much more readable and less error-prone than raw volatile pointers, while adding zero runtime overhead:
    /* CMSIS device header style: zero overhead, readable */
    RCC->AHB1ENR |= RCC_AHB1ENR_GPIOAEN;  /* Enable GPIOA clock */
    GPIOA->MODER |= GPIO_MODER_MODE5_0;   /* PA5 as output */
    GPIOA->BSRR = GPIO_BSRR_BS5;          /* Set PA5 high */
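To show why the struct style costs nothing at runtime, here is a rough sketch of how a device header lays out a register block. It is modelled on the STM32F4 GPIO register map; the real type is `GPIO_TypeDef` in the vendor's header, and the `demo_` and `DEMO_` names are made up so the sketch cannot be mistaken for the actual header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative register block following the STM32F4 GPIO layout. */
typedef struct {
    volatile uint32_t MODER;    /* 0x00 mode */
    volatile uint32_t OTYPER;   /* 0x04 output type */
    volatile uint32_t OSPEEDR;  /* 0x08 output speed */
    volatile uint32_t PUPDR;    /* 0x0C pull-up/pull-down */
    volatile uint32_t IDR;      /* 0x10 input data */
    volatile uint32_t ODR;      /* 0x14 output data */
    volatile uint32_t BSRR;     /* 0x18 bit set/reset */
    volatile uint32_t LCKR;     /* 0x1C configuration lock */
    volatile uint32_t AFR[2];   /* 0x20 alternate function low/high */
} demo_gpio_regs_t;

/* The struct layout must match the reference manual offsets exactly. */
static_assert(offsetof(demo_gpio_regs_t, BSRR) == 0x18, "BSRR offset");
static_assert(sizeof(demo_gpio_regs_t) == 0x28, "register block size");

/* The peripheral "object" is just a base address cast to the struct type. */
#define DEMO_GPIOA_BASE 0x40020000u
#define DEMO_GPIOA ((demo_gpio_regs_t *)(uintptr_t)DEMO_GPIOA_BASE)

/* On the target, DEMO_GPIOA->BSRR = (1u << 5); is a single store to
 * address 0x40020018, the same as the raw-pointer version. */
```

Because the base address and member offset are compile-time constants, the compiler folds them into one absolute address, so the generated code matches the raw volatile-pointer form.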
Vendor HAL (STM32 HAL Example)
Vendor HALs provide a high-level API with initialization structs:
    /* STM32 HAL style: portable within the STM32 family */
    __HAL_RCC_GPIOA_CLK_ENABLE();  /* HAL_GPIO_Init() does not enable the port clock */

    GPIO_InitTypeDef gpio = {
        .Pin   = GPIO_PIN_5,
        .Mode  = GPIO_MODE_OUTPUT_PP,
        .Pull  = GPIO_NOPULL,
        .Speed = GPIO_SPEED_FREQ_HIGH,
    };
    HAL_GPIO_Init(GPIOA, &gpio);
    HAL_GPIO_WritePin(GPIOA, GPIO_PIN_5, GPIO_PIN_SET);
Advantages:
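- Init structs cover configuration details you might miss (pin mode, pull, speed, alternate function mapping); peripheral clocks still have to be enabled explicitly, for example with __HAL_RCC_GPIOA_CLK_ENABLE()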
- Code works across STM32F1, F4, H7 with minimal changes
- CubeMX generates boilerplate automatically
Disadvantages:
- HAL_GPIO_WritePin() is ~20 cycles vs ~2 for direct BSRR write (10x overhead)
- HAL interrupt callbacks add layers that hide timing behavior
- Hard to debug — when something fails, you must trace through multiple HAL source files
- Code size: full STM32 HAL adds 50-200 KB of Flash depending on peripherals used
- Default timeout values may not suit your application (e.g., I2C default timeout too long for your watchdog)
Designing a Custom Thin HAL
For multi-platform products, a custom thin abstraction is often the best tradeoff:
    /* Platform-independent driver API (gpio.h) */
    #include <stdint.h>

    typedef enum { GPIO_LOW, GPIO_HIGH } gpio_state_t;
    typedef enum { GPIO_MODE_IN, GPIO_MODE_OUT } gpio_mode_t;  /* illustrative values */

    void gpio_init(uint8_t pin, gpio_mode_t mode);
    void gpio_write(uint8_t pin, gpio_state_t state);
    gpio_state_t gpio_read(uint8_t pin);
Each platform provides its own implementation file (gpio_stm32.c, gpio_nrf52.c) that maps to the hardware:
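    /* STM32 implementation (gpio_stm32.c).
       port_map[] and pin_map[] are board-specific lookup tables defined elsewhere:
       logical pin -> GPIO port pointer and pin bit mask. */
    void gpio_write(uint8_t pin, gpio_state_t state) {
        if (state == GPIO_HIGH)
            port_map[pin]->BSRR = pin_map[pin];                  /* set bit */
        else
            port_map[pin]->BSRR = (uint32_t)pin_map[pin] << 16;  /* reset bit = set bit + 16 */
    }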
This gives you near-bare-metal performance with compile-time platform selection.
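One side benefit of file-level platform selection is that a host-side mock can be just another backend. The sketch below assumes a hypothetical gpio_mock.c that implements the same three functions against plain arrays, so application code can be unit-tested without hardware. The `gpio_mode_t` values, `MOCK_PIN_COUNT`, and `led_blink_once` are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* API types repeated here so the sketch is self-contained.
 * The gpio_mode_t values are illustrative. */
typedef enum { GPIO_LOW, GPIO_HIGH } gpio_state_t;
typedef enum { GPIO_MODE_IN, GPIO_MODE_OUT } gpio_mode_t;

#define MOCK_PIN_COUNT 16u

/* Mock "hardware": one mode and one level per logical pin. */
static gpio_mode_t  mock_mode[MOCK_PIN_COUNT];
static gpio_state_t mock_level[MOCK_PIN_COUNT];

void gpio_init(uint8_t pin, gpio_mode_t mode)
{
    assert(pin < MOCK_PIN_COUNT);
    mock_mode[pin]  = mode;
    mock_level[pin] = GPIO_LOW;
}

void gpio_write(uint8_t pin, gpio_state_t state)
{
    assert(pin < MOCK_PIN_COUNT);
    if (mock_mode[pin] == GPIO_MODE_OUT) {  /* writes to inputs are ignored */
        mock_level[pin] = state;
    }
}

gpio_state_t gpio_read(uint8_t pin)
{
    assert(pin < MOCK_PIN_COUNT);
    return mock_level[pin];
}

/* Portable application code: compiles unchanged against any backend. */
void led_blink_once(uint8_t pin)
{
    gpio_write(pin, GPIO_HIGH);
    gpio_write(pin, GPIO_LOW);
}
```

The build system would link gpio_mock.c for host tests and gpio_stm32.c or gpio_nrf52.c for the target, with no #ifdef in the application code.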
Driver Design Patterns
| Pattern | How It Works | When to Use |
|---|---|---|
| Polling | ISR-free; main loop checks status registers | Simple, low-throughput, or debug-only |
| Interrupt-driven | ISR captures data, signals main loop | Most peripheral drivers (UART, SPI, timer) |
| DMA-driven | Hardware moves data without CPU involvement | High-throughput (ADC streaming, audio, bulk SPI) |
| Callback | Driver calls user-provided function on events | Vendor HALs (STM32 HAL_UART_RxCpltCallback) |
| Handle-based | Driver instance stored in a struct, passed to all calls | Multiple instances of same peripheral (UART1, UART2) |
Interview tip: When asked to "design a UART driver API", show an interrupt-driven design with init/deinit, send (non-blocking with buffer), receive (callback or ring buffer), and error handling. Avoid polling for production designs.
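The ring buffer mentioned in the tip can be sketched as a single-producer, single-consumer queue: the ISR calls `rb_put`, the main loop calls `rb_get`. This is a minimal sketch under the assumption of a single-core MCU where volatile indices are sufficient; multi-core or cached systems would also need memory barriers. The `rb_` names and `RB_SIZE` are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 16u  /* must be a power of two so index masking works */

typedef struct {
    volatile uint16_t head;  /* written only by the producer (ISR) */
    volatile uint16_t tail;  /* written only by the consumer (main loop) */
    uint8_t data[RB_SIZE];
} ring_buffer_t;

/* Producer side. One slot is kept empty to tell "full" from "empty",
 * so the usable capacity is RB_SIZE - 1 bytes. */
static bool rb_put(ring_buffer_t *rb, uint8_t byte)
{
    assert(rb != NULL);
    uint16_t next = (uint16_t)((rb->head + 1u) & (RB_SIZE - 1u));
    if (next == rb->tail) {
        return false;          /* full: caller should count an overrun */
    }
    rb->data[rb->head] = byte;
    rb->head = next;           /* publish only after the data is stored */
    return true;
}

/* Consumer side. */
static bool rb_get(ring_buffer_t *rb, uint8_t *out)
{
    assert(rb != NULL && out != NULL);
    if (rb->tail == rb->head) {
        return false;          /* empty */
    }
    *out = rb->data[rb->tail];
    rb->tail = (uint16_t)((rb->tail + 1u) & (RB_SIZE - 1u));
    return true;
}
```

Because each index has exactly one writer, neither side needs to disable interrupts for ordinary put and get operations.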
Popular Frameworks Compared
| Framework | Vendor | Approach | Portability | Code Size | Learning Curve |
|---|---|---|---|---|---|
| STM32 HAL | ST | Handle-based, callback | STM32 family only | Large | Medium |
| ESP-IDF | Espressif | Component-based, FreeRTOS | ESP32 family only | Large | Medium |
| Zephyr | Linux Foundation | Device tree, driver model | 400+ boards, cross-vendor | Large | Steep |
| CMSIS | ARM | Register structs, core API | Any Cortex-M | Minimal | Low |
| Arduino | Community | Simplified API | Many boards (with limitations) | Small-Medium | Very low |
Vendor HALs often use blocking waits with generous timeouts. A HAL I2C read with a 1-second timeout works fine in testing but may cause a watchdog reset in production if the bus hangs. Always check and set timeout values explicitly, and keep the worst-case blocking time below the watchdog period.
Debugging Story: HAL I2C Timeout Causing Watchdog Reset
A medical device used STM32 HAL for I2C communication with a pressure sensor. In the lab, it worked flawlessly for months. In the field, units would randomly reset every few days.
Investigation revealed that the STM32 HAL HAL_I2C_Mem_Read() has a default timeout of 25 ms. Normally the sensor responds in 1 ms. But occasionally, electrical noise caused the I2C bus to hang (SDA stuck low). The HAL function would block for 25 ms waiting for the transaction to complete. During this time, the main loop was not feeding the watchdog (16 ms timeout), triggering a reset.
The fix had two parts: (1) reduce the HAL timeout to 5 ms and add explicit bus recovery (SCL toggling) in the error path, and (2) for the next product revision, replace the HAL I2C driver with a bare-metal interrupt-driven implementation that never blocks — it sets up the transaction and returns immediately, with a callback on completion or error.
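The bus recovery step in part (1) is commonly done by clocking SCL up to nine times until the slave releases SDA, then generating a STOP. The sketch below is one way to write it, not the code from this project: the `i2c_pins_t` callbacks are hypothetical board-layer hooks (on real hardware SCL and SDA would be temporarily reconfigured as open-drain GPIOs), and the `fake_` functions are a host-side model of a stuck slave included only so the routine can be tried off-target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pin operations supplied by the board layer. */
typedef struct {
    void (*scl_write)(bool level);
    void (*sda_write)(bool level);
    bool (*sda_read)(void);
    void (*delay_half_period)(void);
} i2c_pins_t;

/* Clock SCL up to 9 times until SDA is released, then issue a STOP.
 * Returns true if SDA ends up high (bus idle). */
static bool i2c_bus_recover(const i2c_pins_t *p)
{
    assert(p != NULL);
    p->sda_write(true);                    /* release SDA on our side */
    for (int i = 0; i < 9 && !p->sda_read(); i++) {
        p->scl_write(false);
        p->delay_half_period();
        p->scl_write(true);
        p->delay_half_period();
    }
    if (!p->sda_read()) {
        return false;                      /* still stuck: escalate */
    }
    /* STOP condition: SDA goes low -> high while SCL is high. */
    p->scl_write(false); p->delay_half_period();
    p->sda_write(false); p->delay_half_period();
    p->scl_write(true);  p->delay_half_period();
    p->sda_write(true);  p->delay_half_period();
    return p->sda_read();
}

/* Host-side fake: a slave that holds SDA low until it has seen
 * `fake_edges_left` rising edges on SCL. */
static int  fake_edges_left = 3;
static bool fake_scl = true;
static bool fake_sda_master = true;

static void fake_scl_write(bool level)
{
    if (level && !fake_scl && fake_edges_left > 0) {
        fake_edges_left--;                 /* count a rising edge */
    }
    fake_scl = level;
}
static void fake_sda_write(bool level) { fake_sda_master = level; }
static bool fake_sda_read(void) { return fake_sda_master && fake_edges_left == 0; }
static void fake_delay(void) { /* no-op on the host */ }

static const i2c_pins_t fake_pins = {
    .scl_write = fake_scl_write,
    .sda_write = fake_sda_write,
    .sda_read = fake_sda_read,
    .delay_half_period = fake_delay,
};
```

If recovery fails after nine clocks, the error path still has to return promptly, for example by reporting the fault and letting a higher layer power-cycle the sensor.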
The lesson: HAL abstractions can hide critical timing behavior. For safety-critical or timing-sensitive applications, understand what the HAL does internally — especially how long it can block.
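One way to keep that blocking time visible is to encode the budget in the code itself. The sketch below uses the figures from the story (16 ms watchdog, 5 ms timeout) and a hypothetical `wait_done` helper; the tick source and completion predicate are passed in as function pointers so the helper does not depend on any particular HAL. The `fake_` functions are host-side stand-ins for trying it off-target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical millisecond tick source, e.g. a wrapper around a SysTick counter. */
typedef uint32_t (*tick_fn_t)(void);
/* Hypothetical predicate that reports whether the transfer has finished. */
typedef bool (*done_fn_t)(void);

#define WDT_PERIOD_MS  16u
#define I2C_TIMEOUT_MS  5u

/* Fail the build if the worst-case blocking time can starve the watchdog. */
static_assert(I2C_TIMEOUT_MS < WDT_PERIOD_MS,
              "blocking timeout must be shorter than the watchdog period");

/* Poll `done` until it returns true or `timeout_ms` ticks have elapsed. */
static bool wait_done(done_fn_t done, tick_fn_t now_ms, uint32_t timeout_ms)
{
    uint32_t start = now_ms();
    while (!done()) {
        /* Unsigned subtraction stays correct across tick wrap-around. */
        if ((uint32_t)(now_ms() - start) >= timeout_ms) {
            return false;   /* caller runs its error path */
        }
    }
    return true;
}

/* Host-side fakes: a tick counter that advances on every read. */
static uint32_t fake_ticks;
static uint32_t fake_now(void) { return fake_ticks++; }
static bool never_done(void) { return false; }
static bool always_done(void) { return true; }
```

A single timeout check is not the whole budget: if several blocking calls can run back to back between watchdog feeds, it is their sum that has to stay under the watchdog period.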
What Interviewers Want to Hear
- You can articulate the tradeoffs between HAL and bare-metal (not just "HAL is easier")
- You understand register access fundamentals: volatile, MMIO, read-modify-write hazards
- You know what CMSIS provides (and does not provide)
- You can design a clean driver API with init/deinit, blocking vs non-blocking, error handling
- You have an opinion on when to use which approach — and can justify it with concrete examples
- You understand that the right choice depends on project constraints (timeline, portability, performance)
Interview Focus
Classic Interview Questions
Q1: "When would you use HAL vs bare-metal? Give a real example."
Model Answer Starter: "I use HAL for rapid prototyping and when portability within a vendor family matters — for example, an STM32 product line where the same firmware targets F4 and H7 variants. I switch to bare-metal for time-critical paths: motor control PWM updates that must happen in under 1 us, or high-frequency interrupt handlers where HAL's function call overhead is unacceptable. In practice, most production projects are hybrid: HAL for initialization and standard I/O, bare-metal for the hot path."
Q2: "How do you access a hardware register in C? What's the role of volatile?"
Model Answer Starter: "Hardware registers are memory-mapped: they appear at fixed addresses in the MCU's memory map. You access them through volatile pointers: *(volatile uint32_t *)0x40020018 = value. Volatile tells the compiler three things: do not cache the value in a CPU register, do not optimize away reads or writes, and do not reorder volatile accesses relative to each other. Without volatile, the compiler may eliminate what it sees as 'dead' writes to a register or read a stale cached value instead of the current hardware state. Volatile does not make an access atomic and is not a memory barrier, so shared registers still need protection."
Q3: "Design a simple UART driver API — what functions would you expose?"
Model Answer Starter: "I would expose: uart_init(config) to set baud rate, parity, stop bits, and enable clocks; uart_send(data, len) as a non-blocking function that copies data to a transmit ring buffer and starts the TX interrupt; uart_receive(callback) to register a receive callback that fires when bytes arrive; uart_deinit() to disable the peripheral and release resources; and uart_get_error() to check for framing, overrun, or parity errors. The ISR handles the actual byte-level transfer, keeping it short. For DMA-capable UARTs, I would add a uart_send_dma() variant."
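On a whiteboard, that answer might translate into a header along these lines. This is only a sketch of one possible shape: the function names follow the answer above, while the handle type, the `instance` parameter, the callback signature, and the config macro are assumptions added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { UART_PARITY_NONE, UART_PARITY_EVEN, UART_PARITY_ODD } uart_parity_t;

typedef struct {
    uint32_t      baud;
    uart_parity_t parity;
    uint8_t       stop_bits;
} uart_config_t;

/* Error flags form a bit mask so several can be reported at once. */
typedef enum {
    UART_ERR_NONE    = 0,
    UART_ERR_FRAMING = 1 << 0,
    UART_ERR_OVERRUN = 1 << 1,
    UART_ERR_PARITY  = 1 << 2,
} uart_error_t;
static_assert((UART_ERR_FRAMING & UART_ERR_OVERRUN) == 0, "flags must not overlap");

typedef struct uart uart_t;                           /* opaque per-instance handle */
typedef void (*uart_rx_cb_t)(uint8_t byte, void *ctx);

/* Returns NULL on failure (unknown instance or invalid config). */
uart_t  *uart_init(uint8_t instance, const uart_config_t *cfg);
/* Non-blocking: copies into the TX ring buffer, returns bytes accepted. */
size_t   uart_send(uart_t *u, const uint8_t *data, size_t len);
/* Registers a callback invoked from the RX path when a byte arrives. */
void     uart_receive(uart_t *u, uart_rx_cb_t cb, void *ctx);
/* Returns the accumulated uart_error_t flags and clears them. */
uint32_t uart_get_error(uart_t *u);
void     uart_deinit(uart_t *u);

/* Convenience initializer for the common 115200 8N1 setup. */
#define UART_CONFIG_8N1_115200 \
    { .baud = 115200u, .parity = UART_PARITY_NONE, .stop_bits = 1u }
```

Returning the number of bytes accepted from `uart_send` lets the caller handle a full transmit buffer without blocking, which is the main point to stress in the interview.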
Q4: "What is CMSIS and how does it differ from vendor HALs?"
Model Answer Starter: "CMSIS is ARM's standardized interface for Cortex-M cores. It provides three things: core access functions (NVIC_SetPriority, __disable_irq, SysTick_Config), device-specific register struct definitions (GPIOA->BSRR, USART1->DR), and optional libraries (DSP, RTOS API). It does NOT provide peripheral drivers: no UART_Init() or SPI_Transmit(). Vendor HALs like STM32 HAL are built on top of CMSIS and add those high-level peripheral APIs. CMSIS-Core and the device headers add essentially zero runtime overhead (they are header files containing struct definitions and inline functions), while HALs add significant code."
Q5: "How do you make a driver portable across MCU families?"
Model Answer Starter: "I define a platform-independent API in a header file with abstract types and function signatures. Each target platform provides its own implementation file. The build system selects the right file at compile time. For example: gpio.h defines gpio_write(pin, state), gpio_stm32.c implements it with BSRR writes, gpio_nrf52.c implements it with NRF GPIO registers. I avoid #ifdef forests by keeping platform selection at the file level, not inside functions. This gives near-bare-metal performance with clean portability."
Trap Alerts
- Don't say: "Always use HAL" or "Always use bare-metal" — the right answer is always context-dependent
- Don't forget: That read-modify-write on shared registers is not atomic and needs protection in concurrent code
- Don't ignore: HAL timeout values — default timeouts can conflict with watchdog timers or real-time deadlines
Follow-up Questions
- "What are the downsides of using STM32 HAL in production?"
- "How would you test a hardware driver without the actual hardware?"
- "What is the difference between blocking and non-blocking driver design?"
- "How do you handle error recovery in a bare-metal driver?"
- "What role does DMA play in driver design?"
Practice
❓ What does volatile prevent the compiler from doing with a hardware register access?
❓ A HAL_GPIO_WritePin() call takes ~20 cycles while a direct BSRR register write takes ~2 cycles. When does this difference matter?
❓ What does CMSIS provide that vendor HALs do NOT?
❓ Why is 'GPIOA->ODR |= (1 << 5)' potentially dangerous in a system with interrupts?
Real-World Tie-In
Multi-Platform Sensor Product — A company ships the same sensor product on three MCU platforms (STM32L4, nRF52840, ESP32-S3). A custom thin HAL with platform-specific driver files gives consistent behavior across all three. The SPI driver API is identical; only the underlying register access differs. New platform ports take 2-3 days instead of weeks because the application code does not change.
High-Performance Motor Controller — A BLDC motor controller uses STM32 HAL for CAN bus, UART logging, and configuration. But the field-oriented control (FOC) loop runs in a timer ISR at 20 kHz and uses direct register writes to update PWM duty cycles — the 10x overhead of HAL_TIM_PWM_ConfigChannel() would push ISR execution time past the 50 us deadline. The hybrid approach keeps development fast for non-critical paths while meeting hard real-time requirements.