
Boot, Startup & Linker Scripts — Interview Questions & Answers

Practice interview questions on the boot sequence, startup code, vector table, linker scripts, and memory layout.


Boot Sequence

Q: Describe what happens from power-on until main() is called on a Cortex-M microcontroller.

When a Cortex-M processor comes out of reset, the hardware performs exactly two memory reads before executing any instruction. First, it reads the 32-bit value at address 0x00000000 (the first entry of the vector table) and loads it into the Main Stack Pointer (MSP) — this is the initial stack pointer value. Second, it reads the 32-bit value at address 0x00000004 (the second entry) and loads it into the Program Counter (PC) — this is the address of the Reset_Handler. The CPU then begins executing instructions at the Reset_Handler address. No bootloader, no BIOS, no firmware is involved in these two reads — they are hardwired into the Cortex-M core logic itself.

The Reset_Handler is the first code that runs, and it is typically written in assembly (found in the startup file, e.g., startup_stm32f4xx.s). Its job is to prepare the C runtime environment. It copies the .data section from its load address in Flash (where the initialized global variable values are stored by the linker) to its runtime address in RAM, using symbols like _sidata, _sdata, and _edata provided by the linker script. It then zeroes the .bss section in RAM (from _sbss to _ebss), since uninitialized global variables must start as zero per the C standard. Without these two steps, global variables would contain garbage values and any C code relying on them would produce undefined behavior.

After .data and .bss are initialized, the startup code calls SystemInit() (which configures clocks, Flash wait states, and optionally the FPU), then branches to main(). Some toolchains also call C library initialization (__libc_init_array) for static constructors before main(). The entire sequence — hardware vector fetch, Reset_Handler, memory initialization, SystemInit, main — is deterministic and happens in microseconds. Understanding this chain is essential for debugging hard faults that occur before main(), which are almost always caused by incorrect linker scripts, corrupt vector tables, or misconfigured clock settings in SystemInit().
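The .data copy and .bss zeroing steps can be sketched in C. The fragment below is a host-runnable simulation: in a real startup file the five region boundaries come from linker-script symbols (_sidata, _sdata, _edata, _sbss, _ebss), and here ordinary arrays stand in for the Flash load image and the RAM regions so the loops can be exercised anywhere.

```c
#include <stdint.h>

/* Stand-ins for the linker-provided regions (illustrative only). */
uint32_t flash_data[4] = { 1, 2, 3, 4 };             /* .data load image (LMA) */
uint32_t ram_data[4]   = { 0xAA, 0xAA, 0xAA, 0xAA }; /* .data at runtime (VMA) */
uint32_t ram_bss[4]    = { 9, 9, 9, 9 };             /* .bss before zeroing    */

void startup_init_memory(const uint32_t *src,
                         uint32_t *data, uint32_t *data_end,
                         uint32_t *bss,  uint32_t *bss_end)
{
    while (data < data_end)   /* copy .data initial values: Flash -> RAM */
        *data++ = *src++;
    while (bss < bss_end)     /* zero .bss so globals start at 0         */
        *bss++ = 0;
}
```

Calling startup_init_memory(flash_data, ram_data, ram_data + 4, ram_bss, ram_bss + 4) leaves ram_data holding the Flash image and ram_bss all zeros — the same effect the real Reset_Handler achieves using the linker symbols.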

Q: What is the vector table and where does it live?

The vector table is an array of 32-bit entries stored at the base of Flash memory (address 0x00000000 on most Cortex-M devices, though on many parts address 0 is an alias of the real Flash region and the mapping can be remapped). Each entry is the address of a handler function for a specific exception or interrupt, with one exception: the first entry is not a function pointer but the initial value for the Main Stack Pointer (MSP). The second entry is the address of the Reset_Handler. The remaining system entries (exception numbers 2 through 15) hold the ARM-defined system exceptions — NMI, HardFault, MemManage, BusFault, UsageFault, SVCall, PendSV, and SysTick — plus several reserved slots. From exception number 16 onward (external interrupt 0), the table contains vendor-specific peripheral interrupt handlers (WWDG, EXTI, USART, TIM, DMA, etc.), and the total count depends on the specific MCU — an STM32F407 has 82 IRQ entries, while a simpler Cortex-M0 design supports at most 32.

The vector table is defined in the startup assembly file and placed in a dedicated linker section (typically .isr_vector) that the linker script maps to the very beginning of Flash. Each unused handler is typically aliased to a Default_Handler — an infinite loop — so that an unexpected interrupt traps visibly rather than running wild through memory. The KEEP() directive in the linker script prevents the linker's garbage collector from discarding the vector table even though nothing in the code explicitly references it (the hardware references it, not the software).
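In C syntax (some toolchains ship C startup files; the layout is identical to the assembly version), the table and its Default_Handler aliasing look roughly like the sketch below. Handler names follow common STM32 conventions, and _estack — normally a linker-script symbol — is defined locally here so the fragment is self-contained.

```c
#include <stdint.h>

uint32_t _estack;   /* stand-in for the linker symbol (top of RAM) */

void Default_Handler(void) { for (;;) { } }   /* trap unexpected IRQs       */
void Reset_Handler(void)   { for (;;) { } }   /* real one runs startup code */

/* Weak aliases: the application overrides only the handlers it implements. */
void NMI_Handler(void)       __attribute__((weak, alias("Default_Handler")));
void HardFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
void SysTick_Handler(void)   __attribute__((weak, alias("Default_Handler")));

/* KEEP(*(.isr_vector)) in the linker script pins this at the Flash base. */
__attribute__((section(".isr_vector"), used))
const void *vector_table[] = {
    &_estack,            /* entry 0: initial Main Stack Pointer value */
    Reset_Handler,       /* entry 1: first code executed after reset  */
    NMI_Handler,
    HardFault_Handler,
    /* ... remaining system exceptions, then device IRQ handlers ...  */
    SysTick_Handler,
};
```

The `used` attribute plus the linker-script KEEP() together ensure the table survives --gc-sections even though no code references it.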

A critical detail for bootloader designs: the vector table location can be changed at runtime using the VTOR register (Vector Table Offset Register). When a bootloader jumps to an application, it must set VTOR to point to the application's vector table in a different Flash region. Forgetting this step means the application's interrupts will still dispatch to the bootloader's handlers, causing crashes or silent misbehavior. On Cortex-M0 (which lacks VTOR), the vector table can be copied to RAM at the base address, and the memory map remapped — a more complex and error-prone process.

Q: What does SystemInit() typically do, and why does it run before main()?

SystemInit() is a C function (defined by CMSIS convention in system_stm32f4xx.c or the equivalent for your MCU family) that configures the core chip-level settings needed before any application code can run reliably. Its primary responsibilities are: (1) configuring the clock tree — enabling the HSE oscillator, configuring the PLL multipliers and dividers to achieve the target system clock frequency, switching SYSCLK to the PLL output, and setting AHB/APB prescalers; (2) setting Flash wait states to match the new clock speed — if the CPU runs at 168 MHz but Flash is still configured for 16 MHz (the reset default), instruction fetches will return corrupt data and the processor will hard fault; (3) optionally enabling the FPU (on Cortex-M4F/M7 with hardware floating point) by setting the CP10/CP11 bits in the CPACR register — without this, any floating-point instruction triggers a UsageFault.
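Two of these steps can be made concrete. The helper below encodes the STM32F4 wait-state rule at a 2.7-3.6 V supply — one additional wait state per 30 MHz of HCLK, per the reference manual — and the FPU enable writes the CP10/CP11 full-access bits into CPACR at 0xE000ED88. flash_latency_for is a hypothetical name, not a CMSIS API, and enable_fpu is not executed on a host build.

```c
#include <stdint.h>

/* STM32F4 Flash wait states at 2.7-3.6 V: one per 30 MHz of HCLK. */
uint32_t flash_latency_for(uint32_t hclk_hz)
{
    return (hclk_hz - 1u) / 30000000u;   /* e.g. 168 MHz -> 5 wait states */
}

/* Cortex-M4F FPU enable: grant full access to coprocessors CP10/CP11. */
#define CPACR (*(volatile uint32_t *)0xE000ED88UL)

void enable_fpu(void)
{
    CPACR |= (0xFu << 20);   /* CP10 and CP11: full access */
    /* A DSB + ISB barrier should follow before the first FP instruction. */
}
```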

SystemInit() runs after the .data copy and .bss zeroing in the startup code but before main(). This ordering matters because SystemInit() is a C function that may use global variables (which must be initialized first), and because main() typically expects the clock to be running at full speed before configuring peripherals — UART baud rate calculations, timer prescaler values, and SPI clock dividers all depend on knowing the actual system clock frequency. If SystemInit() ran after main() started configuring peripherals, all timing-sensitive configurations would be wrong.

A common debugging scenario: the MCU hard faults immediately after reset before reaching main(). The most likely cause is a SystemInit() that configures an HSE crystal oscillator that is not populated on the board, causing the PLL to never lock and the clock switch to fail. Another common issue is configuring the wrong Flash wait states for the target frequency — zero wait states at 168 MHz will corrupt every instruction fetch. Always verify that SystemInit() matches your actual hardware configuration, not just the evaluation board defaults.

Memory Sections

Q: Why does the .data section need to be copied from Flash to RAM at startup? Why doesn't .bss need copying?

The .data section contains initialized global and static variables — variables that have explicit initial values assigned in C code (e.g., int counter = 42; or static float gain = 1.5f;). These initial values must persist across power cycles, so they are stored in Flash by the linker at a load address (LMA). However, the C standard requires these variables to be writable at runtime (the program can modify counter after startup), and Flash memory cannot be written by normal store instructions — it requires a special erase-then-program sequence. Therefore, the startup code must copy the initial values from their Flash location to their runtime address (VMA) in RAM, where the CPU can read and write them normally. The linker script defines both addresses: the LMA (where the data lives in Flash) and the VMA (where the data lives during execution), and exports symbols (_sidata, _sdata, _edata) so the startup code knows the source, destination, and size of the copy.

The .bss section contains uninitialized global and static variables (e.g., int buffer[256]; or static uint32_t tick_count;). The C standard mandates that these start at zero. Since every byte is guaranteed to be zero, there is no unique initialization data to store in Flash — storing thousands of zeros in Flash would waste space for no benefit. Instead, the startup code simply writes zeros to the entire .bss region in RAM using a memset-style loop, guided by the _sbss and _ebss symbols from the linker script. This is both faster (a simple zero-fill loop) and smaller (no Flash storage needed) than a copy. On a large MCU with 128 KB of RAM where .bss might be 64 KB, the Flash savings are substantial.

A subtle but important detail: both steps must complete before any C code touches global variables. Most startup files handle .data first, then .bss, though the relative order does not actually matter as long as the two regions do not overlap. What does matter is that neither step is skipped and that the linker symbols are correct — otherwise global variables will contain garbage, a bug that manifests as "works in debug but not release" or "works after reset but not after power cycle," since the debugger may initialize RAM during download.

Q: Walk me through a linker script — what are the key sections and directives?

A linker script (.ld file) tells the linker how to map compiled object file sections into physical memory regions. It has two main blocks: MEMORY and SECTIONS. The MEMORY block defines the available memory regions with their start addresses, sizes, and attributes. For a typical STM32F407, this looks like FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K and RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 128K. The rx and rwx flags specify read/execute and read/write/execute permissions respectively. Additional regions might include CCM RAM, external SDRAM, or backup SRAM.

The SECTIONS block maps each input section to a memory region and defines the layout order. The .isr_vector section comes first (placed at the start of FLASH with KEEP() to prevent garbage collection). Next, .text holds all code and read-only data (const variables, string literals). Then .rodata for additional read-only data. The .data section is more complex because it has two addresses: a VMA (Virtual Memory Address) in RAM where the variables live at runtime, specified with > RAM, and an LMA (Load Memory Address) in Flash where the initial values are stored, specified with AT> FLASH. The linker exports symbols at the boundaries: _sidata (Flash source), _sdata (RAM start), _edata (RAM end). The .bss section is placed in RAM with _sbss and _ebss symbols. Finally, ._user_heap_stack typically reserves space for the heap (growing up) and stack (growing down) with PROVIDE symbols for _end (heap start) and _estack (initial stack pointer, usually the top of RAM).

Key directives to know: KEEP() prevents sections from being discarded by --gc-sections (used for the vector table and interrupt handlers that are referenced by hardware, not code). ALIGN() ensures sections start on word boundaries (critical for DMA buffers and memory-mapped peripherals). PROVIDE() creates symbols only if they are not already defined elsewhere, allowing user code to override default values. . (the location counter) tracks the current address during layout. Understanding VMA vs LMA is the most critical concept — it is what enables initialized data to be stored in Flash but used from RAM, which is the foundation of the .data copy mechanism.
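Assembled into one file, a trimmed STM32F4-style script combining these pieces might look like this (region sizes and symbol names follow the ST conventions described above):

```text
MEMORY
{
  FLASH (rx)  : ORIGIN = 0x08000000, LENGTH = 1024K
  RAM   (rwx) : ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
  .isr_vector : { KEEP(*(.isr_vector)) } > FLASH
  .text       : { *(.text*) *(.rodata*) } > FLASH

  .data :
  {
    _sdata = .;              /* VMA: where variables live at runtime */
    *(.data*)
    _edata = .;
  } > RAM AT> FLASH          /* LMA: where initial values are stored */
  _sidata = LOADADDR(.data);

  .bss :
  {
    _sbss = .;
    *(.bss*) *(COMMON)
    _ebss = .;
  } > RAM
}
```

The `> RAM AT> FLASH` pair on .data is the VMA/LMA split in action: symbols and addresses resolve to RAM, but the bytes are placed in the Flash image for the startup copy.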

Q: How do you place a function or variable at a fixed memory address?

Placing code or data at a specific address requires coordination between the C source file and the linker script. In C, you use __attribute__((section(".my_section"))) to assign a function or variable to a named section. For example: const uint32_t firmware_version __attribute__((section(".fw_version"))) = 0x01020003; places the version number in a section called .fw_version. For a function: void bootloader_entry(void) __attribute__((section(".bootloader")));. The attribute alone does not determine the address — it only ensures the symbol ends up in the named section rather than the default .text or .data.

In the linker script, you create a corresponding section entry and pin it to a fixed address. For example:

```text
.fw_version 0x0800FF00 :
{
    KEEP(*(.fw_version))
} > FLASH
```

This places the .fw_version section at exactly 0x0800FF00 in Flash. The KEEP() directive is essential — without it, the linker's --gc-sections optimization will discard the section if no code references the variable (it may be read by an external tool, a bootloader, or via JTAG). The fixed address must not overlap with any other section; if it falls in the middle of .text, the linker will either error out or silently corrupt the layout depending on the toolchain.

Common use cases include: firmware version or CRC fields at known offsets (so a bootloader can read them without parsing the firmware), calibration data stored in a dedicated Flash page (so it survives firmware updates), peripheral register overlays (mapping a struct to the peripheral's base address), and shared memory regions between a bootloader and application. A dangerous pitfall: using __attribute__((at(address))) (an ARM Compiler 5 extension) — this is not portable to GCC. The section attribute plus linker script approach works across all major embedded toolchains (GCC, Clang, IAR, ARM Compiler 6).
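The consumer side of a pinned symbol is just a pointer dereference at the agreed address. In the sketch below, FW_VERSION_ADDR and read_word_at are illustrative names; the address is passed as a parameter so the access pattern is visible (and testable off-target).

```c
#include <stdint.h>

/* Hypothetical fixed address agreed between the firmware's linker script
 * (the .fw_version output section) and the tool that reads it. */
#define FW_VERSION_ADDR 0x0800FF00UL

/* A bootloader or production tester needs no symbol table -- only the
 * agreed-upon address. */
uint32_t read_word_at(uintptr_t addr)
{
    return *(const volatile uint32_t *)addr;
}
```

On target, read_word_at(FW_VERSION_ADDR) returns the pinned version word regardless of which firmware build is flashed.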

Q: What is VTOR and when would you relocate the vector table?

The Vector Table Offset Register (VTOR) is a system control register (at address 0xE000ED08 on Cortex-M3/M4/M7) that tells the processor where to find the vector table in memory. After reset, VTOR defaults to 0x00000000, meaning the processor fetches exception and interrupt handler addresses from the base of the memory map (typically the start of Flash). By writing a new address to VTOR, you redirect all interrupt dispatching to a vector table at a different location — in a different region of Flash, or even in RAM.

The most common reason to relocate the vector table is in a bootloader architecture. A bootloader occupies the first portion of Flash (e.g., 0x08000000 to 0x08003FFF) with its own vector table at the base. The application firmware starts at a higher address (e.g., 0x08004000) with its own separate vector table. When the bootloader finishes and jumps to the application, it must set VTOR to 0x08004000 so that all subsequent interrupts dispatch to the application's handlers instead of the bootloader's. Without this step, a UART interrupt in the application would vector to the bootloader's UART handler (or worse, a Default_Handler infinite loop), causing a crash or hang. The jump sequence is: set VTOR, load the application's initial MSP from its vector table entry 0, then branch to the Reset_Handler address from entry 1.
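That jump sequence can be sketched as follows. SCB_VTOR is spelled as a raw register macro, APP_BASE is a hypothetical application start address, and the target-specific steps (interrupt masking, MSP load) are shown as comments alongside pure helpers that extract the two special table entries.

```c
#include <stdint.h>

#define SCB_VTOR  (*(volatile uint32_t *)0xE000ED08UL) /* Cortex-M3/M4/M7 */
#define APP_BASE  0x08004000UL   /* hypothetical application start address */

typedef void (*reset_fn)(void);

/* Pure helpers: read entry 0 (initial MSP) and entry 1 (Reset_Handler)
 * from a vector table image. */
uint32_t app_initial_sp(const uint32_t *vtab)    { return vtab[0]; }
reset_fn app_reset_handler(const uint32_t *vtab) { return (reset_fn)(uintptr_t)vtab[1]; }

void jump_to_application(void)
{
    const uint32_t *vtab = (const uint32_t *)APP_BASE;

    /* 1. Disable interrupts during the handover (e.g. __disable_irq()). */
    /* 2. Redirect all future exception dispatch to the app's table.     */
    SCB_VTOR = (uint32_t)(uintptr_t)vtab;
    /* 3. Load the app's initial MSP (e.g. __set_MSP(vtab[0])).          */
    /* 4. Branch to the app's Reset_Handler -- this call never returns.  */
    app_reset_handler(vtab)();
}
```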

A second use case is copying the vector table to RAM for runtime ISR modification. On Cortex-M, the vector table entries in Flash are read-only (without a Flash erase/program cycle). If the application needs to dynamically install interrupt handlers at runtime — for example, a plugin system or a DMA channel that changes its completion callback — the vector table can be copied to a RAM array, VTOR pointed to the RAM copy, and then entries modified freely. This is also useful for performance: RAM access is faster than Flash on some MCUs (especially at high clock speeds without Flash accelerator), so servicing interrupts from a RAM-based vector table can reduce interrupt latency by a few cycles. The RAM array must be aligned to a power-of-two boundary that is at least the size of the table (e.g., 256-byte or 512-byte alignment), otherwise VTOR ignores the low bits and the table is misaligned.

Stack & Heap

Q: How would you detect a stack overflow on a bare-metal system?

Stack overflow detection on bare-metal is challenging because most Cortex-M processors (M0, M3, M4) have no hardware stack limit checking — the stack pointer simply decrements past the allocated region and silently corrupts whatever memory lies below it (often the heap, .bss, or .data sections). The corruption may not cause an immediate fault, instead manifesting as intermittent data corruption, seemingly random variable changes, or crashes that occur long after the actual overflow. This makes stack overflows among the hardest embedded bugs to diagnose.

Stack painting (also called stack watermarking) is the simplest and most widely used technique. At startup, fill the entire stack region with a known sentinel pattern — typically 0xDEADBEEF or 0xCCCCCCCC. Periodically (or at a debug breakpoint), scan upward from the bottom of the stack region to find the first non-sentinel value — this is the high-water mark, showing the maximum stack depth ever reached. The difference between the high-water mark and the stack base is the remaining margin. If the margin is zero or negative, the stack overflowed. This technique has zero runtime overhead during normal operation (the check is only performed when you inspect it) but is retrospective — it tells you the stack overflowed after the fact, not at the instant it happens.
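A minimal paint-and-scan sketch (function names are illustrative; on a live system the painting happens at startup for the not-yet-used portion of the stack region):

```c
#include <stdint.h>
#include <stddef.h>

#define STACK_SENTINEL 0xDEADBEEFu

/* Fill the stack region with the sentinel before it is used. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++)
        base[i] = STACK_SENTINEL;
}

/* For a descending stack, `base` is the lowest address.  Scan upward:
 * the first word that lost the sentinel marks the deepest excursion.
 * The return value is the margin -- words that were never touched. */
size_t stack_margin_words(const uint32_t *base, size_t words)
{
    size_t i = 0;
    while (i < words && base[i] == STACK_SENTINEL)
        i++;
    return i;
}
```

A margin of zero means the stack reached (or overran) the bottom of its region at some point since the last paint.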

For real-time detection, use the MPU (Memory Protection Unit) on Cortex-M3 and above. Configure an MPU region at the bottom of the stack as a guard region — a small area (32 to 256 bytes) marked as no-access. When the stack grows into this guard region, the MPU immediately triggers a MemManage fault, catching the overflow at the exact instruction that caused it. This gives you a precise stack trace and program counter for debugging. On Cortex-M0 (which lacks an MPU), you can set a hardware data watchpoint via the debugger on an address near the stack limit — the CPU halts as soon as the program writes into the danger zone, but this only works during debug sessions. FreeRTOS uses stack painting for its uxTaskGetStackHighWaterMark() API, and its configCHECK_FOR_STACK_OVERFLOW option adds software checks at each context switch — comparing the task's stack pointer against its limit (method 1) or verifying sentinel bytes at the end of the stack (method 2).

Q: How do you determine the appropriate stack and heap sizes for your application?

Determining stack and heap sizes is one of the most underappreciated aspects of embedded development, and getting it wrong leads to either wasted RAM (oversized allocations) or silent memory corruption (undersized allocations). The stack size must accommodate the deepest possible call chain, including all local variables, saved registers, and interrupt context. The heap size must accommodate all dynamic allocations that can be live simultaneously. Both require a combination of static analysis and runtime measurement.

For stack sizing, start with static analysis using GCC's -fstack-usage flag. This generates .su files that report the stack consumption of each function. Tools like puncover or custom scripts can parse these files and the call graph to compute the worst-case stack depth — the sum of stack usage along the deepest call path. However, static analysis has blind spots: function pointers, recursive calls, and interrupt preemption are difficult to analyze automatically. For interrupt stacking, each nested interrupt adds a fixed context frame (8 words on Cortex-M3/M4, more with FPU state), and you must account for the worst-case nesting depth. Follow static analysis with runtime measurement: paint the stack with a sentinel pattern, run the system through all stress test scenarios (maximum interrupt load, deepest function paths, worst-case input data), then check the high-water mark. Add a safety margin of 20-30% above the measured high-water mark to cover paths not exercised during testing.

For heap sizing, track peak allocation using a custom malloc wrapper that records the current and maximum heap usage. Better yet, avoid the heap entirely in safety-critical bare-metal systems — use static allocation and fixed-size pools instead. If dynamic allocation is unavoidable (e.g., for a TCP/IP stack like lwIP), the pool sizes and maximum connections define the upper bound. In the linker script, the stack and heap sizes are typically defined as symbols (_Min_Heap_Size and _Min_Stack_Size) and the linker verifies that RAM is large enough to contain .data + .bss + heap + stack. If the total exceeds available RAM, the linker produces a "region overflow" error — but this only checks the minimums, not the runtime maximums.
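A minimal peak-tracking wrapper might look like the sketch below. traced_malloc/traced_free are hypothetical names; a size header is hidden in front of each block so that free can subtract the right amount (the returned pointer is aligned only to sizeof(size_t), which is adequate for most uses).

```c
#include <stddef.h>
#include <stdlib.h>

static size_t heap_current, heap_peak;

void *traced_malloc(size_t n)
{
    size_t *p = malloc(n + sizeof(size_t));
    if (p == NULL)
        return NULL;
    *p = n;                        /* remember the payload size        */
    heap_current += n;
    if (heap_current > heap_peak)
        heap_peak = heap_current;  /* record the high-water mark       */
    return p + 1;                  /* hand out memory past the header  */
}

void traced_free(void *ptr)
{
    if (ptr == NULL)
        return;
    size_t *p = (size_t *)ptr - 1; /* step back to the hidden header   */
    heap_current -= *p;
    free(p);
}

size_t heap_peak_bytes(void) { return heap_peak; }
```

After a full stress run, heap_peak_bytes() gives the measured upper bound to size the heap against (plus a safety margin).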

Q: What happens if the stack and heap collide, and how can you prevent it?

On most bare-metal Cortex-M systems, the stack grows downward from the top of RAM and the heap grows upward from the end of .bss. If neither has a hard boundary enforced in hardware, they grow toward each other, and when they meet, they silently overwrite each other's data. The stack corrupts heap metadata (free-list pointers, block headers), causing malloc to return invalid pointers or crash on the next allocation. The heap corrupts stack frames (return addresses, saved registers, local variables), causing functions to return to random addresses or operate on garbage data. There is no automatic fault or exception — on Cortex-M0 and M3 without MPU configuration, the corruption is completely silent until it causes a secondary failure, often far removed in time and code location from the actual collision.

The debugging difficulty is extreme because the symptoms appear random: a function returns to a wrong address (hard fault with a nonsensical PC value), a local variable changes value between two sequential reads, or malloc returns a pointer inside the stack region. These symptoms may only appear under specific load conditions (maximum interrupt nesting + maximum heap allocation simultaneously) and may vanish when debugging because the debugger's halt-and-inspect cycle changes the timing.

Prevention strategies, in order of effectiveness: (1) MPU guard regions — on Cortex-M3/M4/M7, configure MPU regions as no-access barriers between the stack bottom and heap top. A collision immediately triggers a MemManage fault with a precise address, making the bug trivially diagnosable. (2) Static allocation — eliminate the heap entirely by using fixed-size arrays and memory pools. If heap size is zero, collision is impossible and the only risk is stack overflow (which is easier to detect alone). (3) Linker script enforcement — define explicit _Min_Heap_Size and _Min_Stack_Size in the linker script so the linker fails the build if the sum of all sections exceeds RAM. This catches the problem at build time, though it relies on the engineer choosing correct size values. (4) Runtime monitoring — periodically check the stack pointer against a minimum threshold and the heap high-water mark against a maximum threshold, asserting or logging if either crosses the safety boundary.
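Strategy (4) reduces to a couple of pointer comparisons each time the check runs. In this sketch the thresholds come from the linker layout and the names are illustrative; on target the live stack pointer can be read with a CMSIS intrinsic such as __get_MSP(), but it is passed as a parameter here so the predicate runs anywhere.

```c
#include <stdint.h>
#include <stdbool.h>

/* True while both margins are intact: the (descending) stack pointer must
 * stay above its low-water limit, and the (ascending) heap break must stay
 * below its cap. */
bool memory_margins_ok(uintptr_t sp, uintptr_t stack_limit,
                       uintptr_t heap_break, uintptr_t heap_limit)
{
    return sp > stack_limit && heap_break < heap_limit;
}
```

Called from a periodic task or the SysTick handler, a false result should trigger an assert, a log entry, or a controlled reset before the corruption spreads.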