Boot Process & U-Boot
Q: Walk through the embedded Linux boot sequence from power-on to application.
The embedded Linux boot sequence has four main stages, each handing off to the next. First, the ROM bootloader (burned into the SoC at the factory) executes from internal ROM on power-on. It initializes the most basic hardware — typically a small amount of internal SRAM — and loads the next-stage bootloader from a configured boot medium (eMMC, SD card, SPI NOR flash, or UART/USB in recovery mode). The ROM bootloader is fixed by the silicon vendor and cannot be modified.
Second, U-Boot (or another second-stage bootloader) takes over. It initializes external DRAM, sets up clocks and PLLs, configures the boot medium controller, and provides a command-line environment for development. U-Boot loads the kernel image, the device tree blob (DTB), and optionally an initramfs into DRAM, then transfers control to the kernel's entry point.
Third, the Linux kernel decompresses itself (if compressed), initializes the memory management unit (MMU), sets up virtual memory, brings up the scheduler, and probes hardware drivers guided by the device tree. The kernel mounts the root filesystem — either from initramfs or directly from a block device — and launches the init process (PID 1).
Finally, the init system (systemd, BusyBox init, or a custom init script) starts userspace services in the configured order. On an embedded product, this typically ends with launching the main application daemon. Understanding this chain matters because boot-time bugs can hide at any stage, and optimizing boot time requires knowing which stage dominates.
Q: What is U-Boot and what does it do?
U-Boot (Das U-Boot) is the most widely used open-source bootloader for embedded Linux systems. Its primary job is to initialize hardware that the kernel cannot initialize itself — most critically external DRAM — and then load and launch the Linux kernel. Without a bootloader, the kernel has no memory to run in and no way to find itself on the boot medium.
U-Boot provides several essential functions: it configures clocks, PLLs, and pin muxing; initializes the DRAM controller with the correct timing parameters for the specific memory chips on the board; sets up boot-medium drivers (eMMC, NAND, SPI NOR, Ethernet for TFTP boot); loads the kernel image, device tree blob, and optional initramfs into specific DRAM addresses; and passes the device tree pointer and kernel command-line arguments to the kernel before jumping to its entry point.
During development, U-Boot's interactive console is invaluable — you can read and write memory, test peripherals, load images over TFTP or USB, and modify boot parameters without reflashing. In production, U-Boot typically runs silently with a zero-second boot delay. U-Boot also supports secure boot (verified boot) by checking cryptographic signatures on the kernel and DTB before loading them, which is critical for products that must resist firmware tampering.
Q: What is initramfs and when do you need it?
initramfs (initial RAM filesystem) is a small temporary root filesystem that the kernel loads into RAM before mounting the real root filesystem. It is a cpio archive (typically gzip- or LZ4-compressed) that the bootloader places in memory alongside the kernel, or that is linked directly into the kernel image. The kernel unpacks it, mounts it as the initial root, and runs its /init script.
You need initramfs when the real root filesystem cannot be mounted directly by the kernel at boot time. Common scenarios include: the root filesystem is on an encrypted partition and the kernel must run userspace tools to unlock it; the root is on a network filesystem (NFS) and networking must be configured first; the root is on a complex storage setup (LVM, software RAID) requiring userspace utilities; or the root filesystem type requires a kernel module that is not built-in.
In many embedded systems, initramfs is unnecessary because the root filesystem is on a simple block device (eMMC partition, SD card) with a built-in filesystem driver. Skipping initramfs saves boot time — typically 100-300 ms — and reduces image size. However, some embedded products use initramfs as the only root filesystem (never switching to a block device), keeping the entire system in RAM for reliability. This approach avoids flash wear and filesystem corruption from unexpected power loss, at the cost of consuming DRAM for the entire filesystem.
Q: How do you optimize embedded Linux boot time?
Boot-time optimization starts with measurement — you cannot optimize what you have not profiled. Use grabserial to timestamp the serial console output, bootchart or systemd-analyze for userspace, and kernel printk.time=1 to add timestamps to kernel messages. Identify which stage dominates: bootloader, kernel, or userspace.
Bootloader optimizations: reduce or eliminate the U-Boot autoboot delay (bootdelay=0), skip unnecessary hardware initialization (disable USB, network if not needed for boot), use Falcon mode to bypass U-Boot entirely and have the SPL load the kernel directly, and precompute DRAM timing instead of running calibration at every boot.
Kernel optimizations: build a minimal kernel with only the drivers you need (remove all unused subsystems), use kernel XIP (execute in place) from NOR flash to eliminate decompression time, defer non-critical driver probing (for example, by building those drivers as modules loaded after the application starts), compile critical drivers as built-in rather than modules to avoid module loading overhead, and use a compressed kernel format like LZ4, which decompresses faster than gzip at the cost of slightly larger images.
Userspace optimizations: replace systemd with a simpler init (BusyBox init or a custom init script), start only essential services, parallelize independent service startup, use readahead to preload files from storage, and move the application launch as early as possible — ideally as the init process itself. With aggressive optimization, sub-one-second boot from power-on to application is achievable on modern SoCs.
Device Tree & Drivers
Q: What is device tree and why does Linux use it?
A device tree is a data structure (a tree of nodes and properties) that describes the hardware on a board — which peripherals exist, what addresses and interrupts they use, how they are connected, and what compatible drivers should handle them. The device tree source (.dts) is compiled into a device tree blob (.dtb) that the bootloader passes to the kernel at boot.
Linux uses device tree because embedded hardware is not self-describing. On a PC, buses like PCI and USB support enumeration — the OS can query each device for its identity. But on an embedded SoC, most peripherals sit on non-enumerable buses (the platform bus, memory-mapped I/O). Without device tree, the kernel would need hard-coded C structures describing every board variant — the old ARM "board file" approach, which became unmaintainable as hundreds of SoCs and boards proliferated.
Device tree separates hardware description from driver code. The same UART driver works across dozens of SoCs because each board's .dts specifies the UART's register address, interrupt number, and clock. This also means a single kernel binary can boot on multiple boards by simply swapping the DTB. Device tree overlays extend this further, allowing runtime hardware description changes — for example, describing an add-on board (like a Raspberry Pi HAT) without recompiling the base DTB.
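As a sketch, a node for a memory-mapped UART might look like this. All addresses, interrupt numbers, and values here are hypothetical; real bindings are documented under Documentation/devicetree/bindings/ in the kernel tree.

```dts
/* Hypothetical UART node for illustration only. */
uart0: serial@44e09000 {
    compatible = "ti,omap3-uart", "ns16550a"; /* specific first, generic last */
    reg = <0x44e09000 0x1000>;                /* register base and size */
    interrupts = <72>;                        /* interrupt line number */
    clock-frequency = <48000000>;             /* input clock in Hz */
    status = "okay";                          /* "disabled" skips the node */
};
```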
Q: How does a compatible string bind a device to a kernel driver?
The compatible string is the primary mechanism that connects a device tree node to its kernel driver. Each device tree node has a compatible property — a list of one or more strings, ordered from most specific to most generic. For example: compatible = "ti,am335x-uart", "ti,omap3-uart", "ns16550a";. Each kernel driver has a match table (of_match_table) listing the compatible strings it supports.
When the kernel walks the device tree at boot, it creates a platform_device for each node and tries to find a matching driver. The matching process checks the device's compatible strings against every registered driver's match table. The first compatible string that matches wins. This is why the list is ordered from specific to generic — the most specific driver gets priority, but if no specific driver exists, a generic one (like ns16550a for basic 16550 UARTs) can take over.
A common interview debugging scenario: your driver's probe() function is never called. The most frequent cause is a mismatch between the compatible string in the device tree and the string in the driver's of_match_table. Even a single character difference (a typo, a missing vendor prefix) will prevent binding. Always verify with cat /sys/bus/platform/devices/*/of_node/compatible and compare against the driver source. Another cause is forgetting to set the device tree node's status property to "okay" — nodes with status = "disabled" are skipped.
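The driver side of the binding can be sketched as a minimal platform driver for a hypothetical "acme,uart" device. This is kernel-module code, so it only builds inside a kernel source tree; the names are invented for illustration.

```c
/* Minimal platform driver sketch for a hypothetical "acme,uart" node. */
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/of.h>

static int acme_uart_probe(struct platform_device *pdev)
{
        dev_info(&pdev->dev, "bound\n");
        return 0;
}

static const struct of_device_id acme_uart_of_match[] = {
        { .compatible = "acme,uart" },  /* must match the .dts exactly */
        { /* sentinel — omitting this empty entry breaks matching */ }
};
MODULE_DEVICE_TABLE(of, acme_uart_of_match);

static struct platform_driver acme_uart_driver = {
        .probe = acme_uart_probe,
        .driver = {
                .name = "acme-uart",
                /* the classic copy-paste bug: defining the table
                 * above but forgetting this assignment */
                .of_match_table = acme_uart_of_match,
        },
};
module_platform_driver(acme_uart_driver);

MODULE_LICENSE("GPL");
```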
Q: What is the difference between a platform driver and an I2C/SPI driver?
The difference lies in which bus subsystem manages the driver. A platform driver handles devices that are directly memory-mapped to the SoC's address space — they sit on an internal bus with no standard discovery protocol. UARTs, timers, GPIO controllers, and DMA engines are typical platform devices. The kernel creates platform_device objects from device tree nodes and matches them to platform_driver structures. The driver accesses hardware through ioremap() to map physical register addresses into kernel virtual memory.
An I2C or SPI driver handles devices attached to an external I2C or SPI bus — sensors, EEPROMs, display controllers, ADCs. These drivers register with the I2C or SPI subsystem (i2c_driver, spi_driver) and receive a handle to the bus controller. They access hardware through bus-specific transfer functions (i2c_smbus_read_byte(), spi_sync()) rather than direct memory mapping. The bus controller itself is typically a platform driver.
The key architectural insight is that this is a layered model: the SoC's I2C controller is a platform device with memory-mapped registers, managed by a platform driver. That driver registers itself as an I2C adapter. External chips on the I2C bus are then managed by I2C client drivers that communicate through the adapter. This layered approach means you can swap the SoC (and its I2C controller driver) without changing the external device drivers, and vice versa.
Q: Your driver's probe() is never called — how do you debug it?
This is one of the most common embedded Linux driver development issues, and the debugging approach is systematic. Start by verifying that the device tree node exists and is enabled: check ls /proc/device-tree/ (or /sys/firmware/devicetree/base/) to confirm your node appears, and verify that its status property is "okay", not "disabled". If the node is missing, the DTB was not updated — rebuild and reflash it.
Next, check the compatible string match. Print the device tree node's compatible string with cat /sys/firmware/devicetree/base/your-node/compatible and compare it character-by-character against your driver's of_match_table. Common mistakes include typos, missing vendor prefixes (e.g., "uart" instead of "vendor,uart"), and forgetting to terminate the match table with an empty entry {}. Also verify that the driver's of_match_table is actually referenced in the platform_driver structure — a common copy-paste bug is defining the table but not assigning it to .driver.of_match_table.
Then verify the driver is loaded: check lsmod if it is a module, or confirm it is built-in (=y in .config, not =m). If it is a module, is it present in the rootfs and loaded? Check dmesg for any errors during module loading. Finally, check for deferred probing: if your driver depends on a resource (clock, regulator, GPIO) provided by another driver that has not probed yet, the kernel returns -EPROBE_DEFER and retries later. Check cat /sys/kernel/debug/devices_deferred to see if your device is stuck in deferral — this usually means a dependency is misconfigured in the device tree.
Userspace & Build Systems
Q: How does an embedded Linux rootfs differ from a desktop Linux installation?
An embedded Linux root filesystem is radically stripped down compared to a desktop distribution. A desktop Ubuntu installation might occupy 5-10 GB with thousands of packages; an embedded rootfs can be as small as 2-8 MB. The goal is to include only what the product needs — the init system, core utilities, shared libraries, and the application — nothing more. Every unnecessary binary wastes flash space, increases the attack surface, and slows boot time.
The core utilities (ls, cp, mount, ifconfig) are typically provided by BusyBox, a single statically-linked binary that implements hundreds of standard Unix commands in roughly 1 MB. The C library is often musl or uClibc-ng instead of glibc, saving several megabytes of flash and RAM. Many embedded rootfs images have no package manager, no man pages, no compiler toolchain, and no graphical desktop — they are headless systems controlled via serial console, SSH, or a custom protocol.
The filesystem layout still follows the Filesystem Hierarchy Standard (FHS) with /bin, /etc, /lib, /dev, /proc, /sys, /tmp, but many directories are nearly empty. /dev is populated dynamically by mdev (BusyBox) or udev. /proc and /sys are virtual filesystems mounted at boot. Configuration files in /etc are minimal. The entire rootfs may be mounted read-only to prevent corruption from unexpected power loss, with a small writable partition (or tmpfs) for runtime data like logs and configuration changes.
Q: Compare Yocto and Buildroot — when would you choose each?
Buildroot is a simple, Makefile-based build system that generates a complete embedded Linux image (toolchain, kernel, bootloader, and rootfs) from source. It prioritizes simplicity and fast builds. Configuration is a single .config file managed through make menuconfig. A typical Buildroot image builds in 15-30 minutes on a modern machine. Buildroot produces a monolithic image — if you change one package, you rebuild the entire rootfs.
Yocto (via the OpenEmbedded build framework and BitBake) is a far more powerful and complex system. It uses a layered architecture of recipes (.bb files), classes, and configuration files. Yocto builds individual packages (.ipk, .deb, or .rpm), supports incremental builds (changing one package rebuilds only that package and its dependents), generates an SDK for application developers, and supports multiple image variants from the same configuration. A first Yocto build can take 1-4 hours and requires 50+ GB of disk space.
Choose Buildroot when: your team is small, the product is simple (single application, few packages), you want fast iteration, and you do not need a package manager on the target. Buildroot has a gentle learning curve and is excellent for products like IP cameras, IoT sensors, or industrial controllers. Choose Yocto when: you need long-term maintenance (Yocto's layered structure makes it easier to track upstream updates), your product requires multiple image variants, you have a large team where application developers need an SDK, you need a package manager for field updates, or your SoC vendor provides a Yocto BSP layer (most do — NXP, TI, STMicro, Qualcomm). Most commercial embedded Linux products use Yocto because its upfront complexity pays off over a multi-year product lifecycle.
Q: Which filesystem would you use on raw NAND flash, and why not ext4?
For raw NAND flash, you should use UBIFS (on top of the UBI layer) or JFFS2. UBIFS is the modern choice — it provides wear leveling, transparent compression, write-back caching, and scales well to large flash sizes (multiple gigabytes). UBI (Unsorted Block Images) sits between the MTD layer and UBIFS, handling wear leveling, bad block management, and logical-to-physical block mapping across the entire flash chip.
ext4 cannot be used on raw NAND flash because ext4 was designed for block devices — storage where every block is identical, rewritable, and reliable. Raw NAND flash has fundamentally different characteristics: blocks must be erased before rewriting (and erasure is a slow, whole-block operation), each block has a limited erase cycle count (1K-100K depending on the technology), and blocks can go bad during the lifetime of the device. ext4 has no mechanism for wear leveling (it would repeatedly rewrite the same blocks — the superblock, journal, and metadata — wearing them out rapidly) and no bad block management.
The exception is eMMC and SD cards, which are NAND flash with a built-in Flash Translation Layer (FTL) that handles wear leveling and bad block management internally, presenting a standard block device interface. On eMMC, ext4 is perfectly appropriate — and commonly used. The key distinction in an interview is recognizing that "NAND flash" can mean raw NAND (requiring UBI/UBIFS or JFFS2) or managed NAND with an FTL (eMMC/SD, where ext4 works fine). JFFS2 is simpler than UBIFS but has long mount times on large partitions because it scans the entire flash at mount, making it suitable only for small partitions (configuration storage, for example).
Q: An application works on your dev host but crashes on the ARM target — how do you debug?
Start by getting a useful crash report. If the application segfaults, enable core dumps on the target (ulimit -c unlimited, ensure /proc/sys/kernel/core_pattern points to a writable location). Copy the core dump and the unstripped binary (with debug symbols) to your host and analyze with arm-linux-gnueabihf-gdb binary core — the backtrace will show exactly where it crashed. If core dumps are not practical, run the application under gdbserver on the target and connect remotely from your host GDB.
The most common causes of host-works-target-crashes fall into a few categories. Endianness or alignment: if you cast pointers (e.g., uint32_t *p = (uint32_t *)byte_buffer), x86 tolerates unaligned access but ARM may not — Cortex-A will handle it with a performance penalty, but certain modes and older cores will fault. Toolchain mismatch: the target uses a different C library (musl vs glibc) or an older version with different behavior for edge cases. Memory constraints: the target has far less RAM — mmap or large allocations succeed on the host but fail on the target. Floating point: the binary was compiled for hard-float but the target uses soft-float, or vice versa.
Use sanitizers during cross-compilation: -fsanitize=address catches out-of-bounds accesses and use-after-free, -fsanitize=undefined catches undefined behavior like signed integer overflow and unaligned access. You can also use strace on the target to trace system calls and see where the application fails — this is often the fastest way to identify missing files, permission errors, or failed system calls. For memory issues specifically, valgrind works on ARM Linux targets (though it is slow) and catches leaks, invalid reads/writes, and uninitialized memory usage.
IPC
Q: Compare shared memory and Unix domain sockets for IPC.
Shared memory (shmget/shmat or POSIX shm_open/mmap) provides the highest-throughput IPC mechanism because data is never copied — both processes map the same physical memory pages into their virtual address spaces. A producer writes data directly into the shared region, and the consumer reads it with zero copies. This makes shared memory ideal for large data transfers: video frames, audio buffers, sensor data streams, or any scenario where copying megabytes per second through a pipe or socket would waste CPU cycles and memory bandwidth.
Unix domain sockets (AF_UNIX) provide a byte-stream or datagram interface similar to network sockets but confined to the local machine. Data is copied from the sender's buffer into the kernel and then into the receiver's buffer — two copies per message. This is slower than shared memory for bulk data but far simpler to program correctly. Sockets provide built-in synchronization (blocking read/write), flow control (the kernel buffers data and applies backpressure when the receiver is slow), and clean error handling (the sender gets EPIPE or ECONNRESET if the peer crashes).
The fundamental tradeoff is performance vs. safety and simplicity. Shared memory requires you to implement your own synchronization (semaphores, mutexes, or lock-free data structures) and your own signaling mechanism (often a Unix domain socket or eventfd just to notify the consumer that data is available). If you get the synchronization wrong, you have race conditions that are extremely difficult to debug. Unix domain sockets handle all of this for you. In embedded Linux systems, use shared memory only when profiling proves that the copy overhead of sockets is a bottleneck — for most IPC (commands, status updates, configuration), Unix domain sockets are simpler and sufficient.
Q: When would you use D-Bus vs a raw Unix socket?
D-Bus is a message bus system — a daemon (dbus-daemon) that routes messages between processes using a publish-subscribe and method-call model. Processes connect to the bus, register named services (e.g., org.freedesktop.NetworkManager), and expose objects with interfaces. Other processes can call methods, read properties, and subscribe to signals without knowing the service's PID or socket path. D-Bus provides introspection (you can query what interfaces a service exposes), access control policies, and automatic service activation (launching a service on demand when someone calls it).
Raw Unix domain sockets give you a direct point-to-point channel with no intermediary. You define your own wire protocol (framing, serialization, message types), manage connections yourself, and handle all routing logic. This is lower overhead — no bus daemon, no XML introspection, no policy engine — and gives you full control over the data format and transport behavior.
Use D-Bus when you have multiple processes that need to discover and communicate with each other dynamically, especially if you are integrating with existing Linux system services (systemd, NetworkManager, BlueZ) that already expose D-Bus APIs. D-Bus is the standard IPC for Linux system services. Use raw sockets when you have only two or three processes with well-defined communication patterns, when you need minimal latency and overhead (D-Bus adds serialization and routing overhead), or when your embedded system is too resource-constrained for the D-Bus daemon (it consumes a few megabytes of RAM). Many embedded Linux products use raw Unix sockets for their own application IPC while still using D-Bus to interact with system services like BlueZ for Bluetooth.
Q: How do you synchronize access to shared memory between processes?
Shared memory provides no built-in synchronization — two processes can read and write the same memory simultaneously, causing data races. You must add an explicit synchronization mechanism. The three main approaches are POSIX named semaphores, pthread mutexes in shared memory, and lock-free data structures.
POSIX named semaphores (sem_open, sem_wait, sem_post) are the simplest approach. They exist in the filesystem namespace (/dev/shm/sem.name), so both processes can open the same semaphore by name. Use a semaphore as a mutex (initialized to 1) to protect critical sections. The downside is that if a process crashes while holding the semaphore, it stays locked — there is no automatic cleanup. You need a watchdog or timeout mechanism (sem_timedwait) to recover.
Pthread mutexes with the PTHREAD_PROCESS_SHARED attribute can be placed directly in the shared memory region. This is more efficient than named semaphores (no filesystem lookup) and supports features like priority inheritance (PTHREAD_PRIO_INHERIT) to prevent priority inversion. The mutex must be initialized with pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) and pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) — the robust attribute is critical because it allows a surviving process to detect and recover when the mutex holder crashes (the next pthread_mutex_lock returns EOWNERDEAD, and the process calls pthread_mutex_consistent to reclaim it).
For the highest performance, lock-free ring buffers eliminate synchronization overhead entirely. A single-producer, single-consumer ring buffer needs only atomic load/store on the read and write indices — no mutexes, no system calls, no possibility of deadlock. This is the preferred pattern for high-throughput data streaming (audio, video, sensor data) between two processes. Multi-producer or multi-consumer scenarios require compare-and-swap (CAS) operations and are significantly more complex to implement correctly.