Pipeline & Compilation
Q: Walk me through what happens when you press "Build" on a typical embedded C project.
Four programs run in sequence — the first three once per source file. First the preprocessor expands #include directives and #define macros, producing a single .i text file with everything substituted. Then the compiler translates that to target assembly — this is where optimization flags like -Os do their work, producing a .s file. The assembler turns assembly into a relocatable object file (.o) with sections, a symbol table, and relocation entries — but no final addresses yet.
After all source files have been compiled, the linker does three things: resolves symbols across all .o files and libraries (every "undefined" symbol must match a "defined" symbol somewhere), applies relocations with the chosen final addresses, and places sections per the linker script. The output is the ELF binary.
For embedded targets, objcopy then converts ELF to a flat binary or Intel HEX file for flashing. The whole chain — preprocess, compile, assemble, link — is what gcc invokes when you give it .c files and an -o output.
The interview-relevant lesson: compilation is per-translation-unit (one .c at a time, in isolation), but linking is whole-program. Most "weird build errors" are at the linker step because that's where cross-file consistency is checked.
Q: What's the difference between .data and .bss?
Both hold global variables, but they serve different purposes and have different storage costs.
.data holds globals with non-zero initial values: int x = 42;. Those values must exist somewhere persistent (Flash) so they survive power-off, but the variables themselves must be writable (RAM). So .data has both a Load Memory Address in Flash (where the initial values are stored in the binary) and a Virtual Memory Address in RAM (where the variables live at runtime). The startup code copies the initial values from Flash to RAM as part of the boot chain. Cost: Flash for the initial values + RAM for the variables.
.bss holds uninitialized (or zero-initialized) globals: int y; or int z = 0;. The C standard guarantees these start as zero. Since zero is a single value, you don't need to store anything in Flash — the startup code just memsets the .bss region to zero in RAM. Cost: RAM only; zero Flash.
The practical implication for embedded: a 100 KB uninitialized buffer (uint8_t buf[102400];) costs zero Flash because it lives in .bss. The same buffer with = {0} may go in .data and cost 100 KB Flash for all-zero initial values. Always declare large buffers without initializers and let .bss zero-fill.
A related distinction is .rodata — const-qualified globals like string literals and lookup tables — which lives in Flash and is read directly by the CPU at runtime. Cost: Flash only, no RAM. Always const your large lookup tables; without it, they double-cost (Flash + RAM).
Q: How do you place an array or function at an exact address (e.g., a DMA buffer or a firmware version block)?
Two complementary parts: a C attribute and a linker-script placement.
In C, tag the symbol with __attribute__((section(".my_section"))) to put it in a custom section, and optionally __attribute__((aligned(N))) for alignment:
```c
__attribute__((section(".dma_buffer"), aligned(32)))
uint8_t dma_buf[1024];
```
In the linker script, add a SECTIONS block for that custom section, choosing where it goes:
```
.dma_buffer : {
    . = ALIGN(32);
    KEEP(*(.dma_buffer))
} >RAM
```
Or for a fixed address:
```
.fw_metadata 0x0801FF00 : {
    KEEP(*(.fw_metadata))
} >FLASH
```
Two things to remember: use KEEP() so --gc-sections doesn't drop the section if nothing in C references it (a DMA peripheral reads the buffer by physical address, not by symbol name), and on cores with a data cache (e.g., Cortex-M7, where DMA is not coherent with the cache) also configure the MPU to mark DMA regions as non-cacheable.
The fixed-address pattern is canonical for firmware version blobs (so a bootloader at a known offset can read them), DMA descriptor rings, dual-bank firmware metadata, and any "the hardware expects this at exactly this address" requirement.
Q: What flags optimize for size, and what are the trade-offs?
The primary size-optimization flag is -Os, which is roughly -O2 minus optimizations that grow code (loop unrolling, function inlining beyond a small threshold). For most embedded firmware, -Os is the default. There's also -Oz in newer GCCs for "even more aggressive size optimization at the expense of speed."
Two extremely effective additions:
-ffunction-sections -fdata-sections + -Wl,--gc-sections: this trio puts each function and each global in its own section, then lets the linker garbage-collect any section not reachable from the entry point. Saves 10-30% Flash on typical firmware with zero source changes. The trade-off is slightly larger object files and a marginally slower link.
--specs=nano.specs (newlib-nano): switches to a stripped-down libc with smaller printf, simpler malloc, etc. Typically saves 15-30 KB on hello-world-class firmware. Trade-off: float printf is off by default (re-enable with -u _printf_float if needed).
-flto (Link-Time Optimization): defers final code generation until link, when the compiler can see all translation units together. Enables cross-TU inlining and dead-code elimination. Often saves 5-15% Flash AND improves runtime performance. Trade-offs: longer link times, occasionally incompatible with hand-tuned inline assembly, and harder-to-debug stack traces because functions get inlined across files.
The general principle: optimize for size first (-Os), then add section-level GC, then evaluate LTO. Don't reach for -O3 on Flash-constrained targets — it grows code aggressively for marginal speed gains.
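Pulled together as an illustrative Makefile fragment (the Cortex-M4 target and the flag grouping are placeholders to adapt):

```make
# Illustrative flags for a Flash-constrained Cortex-M4 build
CFLAGS  = -mcpu=cortex-m4 -mthumb
CFLAGS += -Os -ffunction-sections -fdata-sections   # size-opt, one section per symbol
LDFLAGS = -Wl,--gc-sections                         # drop unreferenced sections
LDFLAGS += --specs=nano.specs                       # newlib-nano libc
# LDFLAGS += -u _printf_float                       # only if float printf is needed
# Evaluate last; measure both size and link time:
# CFLAGS  += -flto
# LDFLAGS += -flto
```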
Sections & Memory Layout
Q: .text vs .rodata — what goes in each, and where do they live at runtime?
Both live in Flash on a typical embedded target.
.text holds executable code — the machine instructions for every function in your program. The CPU's program counter (PC) points into .text to fetch and execute instructions. On Cortex-M, code in .text is read directly from Flash (with appropriate wait states for the CPU clock speed). It's marked read-execute (rx) in the MEMORY block.
.rodata holds read-only data — const-qualified globals, string literals (every "hello" your program contains), lookup tables marked const, and switch-statement jump tables. The CPU reads .rodata from Flash via the regular bus, not as instructions. It's effectively part of the loadable Flash image but has no executable permission.
Both share Flash because:
- They're both immutable
- They're both accessible from the CPU bus (with the right MPU configuration)
- Putting them in RAM would waste it for no benefit
Some toolchains and linker scripts even merge .rodata into .text for simplicity (since they have the same lifecycle and permissions). The distinction matters when you want different MPU permissions (e.g., execute on .text, read-only on .rodata) or when stripping debug builds.
The practical implication: declaring const uint8_t lookup[1024] = {...}; puts it in .rodata (1024 B Flash, zero RAM). Without const, the same array goes in .data (1024 B Flash for init values + 1024 B RAM for runtime copy). Always const lookup tables.
Q: How do you read a linker map file to find what's bloating your binary?
The linker map (generated via -Wl,-Map=firmware.map) is verbose but structured. Three sections matter:
- Memory configuration — the top of the file shows your `MEMORY` regions (origin, length, attributes). For a quick per-region usage summary, add `-Wl,--print-memory-usage`, which prints something like `FLASH: 458752 B / 512 KB (87.50%)` and tells you how close to the limit you are.
- Section/symbol map — the body of the file lists every output section, every input section it contains, and every linker-defined symbol with its address and size. This is where bloat investigation happens. Search for the section name (typically `.text` or `.rodata`) and look for the largest entries:

```
.text           0x08000000     0x6e80
 .text.main     0x080012a0      0x180  build/main.o
 .text.printf   0x08001420     0x4a00  /opt/.../newlib/libc.a(printf.o)   ← here it is
```
The rightmost column tells you which .o file or library member brought the symbol in. So `libc.a(printf.o)` means newlib's printf — likely pulled in by a `printf("%f", ...)` somewhere that's dragging in float-formatting code.
- Cross-references — at the end, the linker lists which `.o` files reference which symbols. Useful for "where is this symbol used?" investigations.
The faster alternative for most bloat hunts is arm-none-eabi-nm -S --size-sort -r firmware.elf | head -20. The nm output shows the largest symbols sorted descending, with their location codes (T=text, D=data, etc.). 80% of bloat investigations are solved in 30 seconds with this command. The map file is the deeper-dive tool when you need to know which library or .o introduced a symbol.
For tracking size over time, modern teams put arm-none-eabi-size firmware.elf into CI and alert on regression — typically failing the build if text grows more than 1% or bss grows more than 5%.
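A minimal version of such a CI gate, sketched in shell (the 256 KB budget is a placeholder, and a host-built stand-in replaces the real firmware ELF):

```shell
# Stand-in firmware (real flow: arm-none-eabi-gcc produces firmware.elf)
cat > main.c <<'EOF'
int main(void) { return 0; }
EOF
gcc main.c -o firmware.elf

FLASH_BUDGET=262144   # 256 KB budget, placeholder
# Berkeley `size` format: text data bss dec hex filename
read TEXT DATA _ <<EOF
$(size firmware.elf | awk 'NR==2 {print $1, $2}')
EOF

FLASH_USED=$((TEXT + DATA))   # Flash cost = code + initial values
echo "Flash: $FLASH_USED / $FLASH_BUDGET bytes"
[ "$FLASH_USED" -le "$FLASH_BUDGET" ] || { echo "Flash budget exceeded"; exit 1; }
```

Regression alerting (fail on >1% growth) is the same pattern with the previous build's number stored as the baseline.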
Q: Static library (.a) vs relocatable object (.o) vs shared library (.so) — when does each apply in embedded?
Three different things at the binary level:
.o — Relocatable object file. The output of compiling and assembling one source file. Contains compiled code, data, a symbol table, and relocation entries. Not yet linked, no final addresses. The linker takes one or more .o files (plus libraries) and produces an executable. Per-source compilation produces .o; almost every build goes through .o files as the intermediate step.
.a — Static library (archive). A collection of .o files bundled together with an index. When you pass -lfoo, the linker searches for libfoo.a and pulls in only the .o members that satisfy unresolved symbols — selectively, not the whole archive. Used everywhere in embedded: libc.a, libm.a, libgcc.a, vendor SDK static libs. The result is statically linked into the final ELF; no runtime dependency.
.so (or .dll) — Shared library. Dynamically linked at runtime. The executable contains references to symbols, and the OS loader resolves them against the shared library when the program starts. Almost never used in bare-metal embedded because there's no OS loader. Common on embedded Linux for the same reasons as desktop Linux: less duplication, easier patches, smaller per-application footprint. But on a Cortex-M MCU running bare metal or RTOS, you statically link everything into one ELF.
So for an MCU embedded engineer: .o is the per-file unit, .a is the way you bundle libraries (vendor SDKs, libc, third-party drivers), .so is something you'd see only if you drift into embedded Linux. Linker errors like "cannot find -lc" mean a .a (static library) is missing from the link command.
Q: What does Reset_Handler do before main() is called?
Reset_Handler is the first code that runs after the hardware loads SP and PC from the vector table. Its job is to prepare the C runtime so that main() can use globals safely. Three core steps, in order:
- Copy `.data` from Flash (LMA) to RAM (VMA). Initialized globals like `int x = 42;` have their initial values stored in Flash (as part of the binary image), but the variables themselves live in RAM. The linker provides three symbols — `_sidata` (Flash source), `_sdata` (RAM destination start), `_edata` (RAM destination end) — and Reset_Handler runs a tight `*dst++ = *src++` loop to bridge them.
- Zero `.bss`. Uninitialized globals live in `.bss`; the C standard says they start as zero. Reset_Handler memsets the region from `_sbss` to `_ebss` to zero.
- Call `SystemInit()` (typically from CMSIS) to bring up the clock tree (switch from HSI to PLL, raise Flash wait states, configure prescalers) and any other early hardware init.
Then it calls main(). If main returns, Reset_Handler typically loops in while(1) {} — main returning is unusual on bare metal and usually indicates a bug.
A subtle but important point: before step 1 completes, no C global is valid. Anything called before that — including SystemInit on some platforms — must work purely with register-level operations and not rely on globals. This is why startup sequences are sometimes written in assembly: to control exactly what code runs before the C runtime is set up.
Some toolchains also call __libc_init_array() between SystemInit and main to run C++ static constructors. Bare-metal C without C++ skips this.
The hardware side of boot (reset sources, vector table, clock-tree bring-up) lives separately on the Boot, vector table, and clock bring-up page; this page covered the C-runtime side.
Linker Errors & Diagnostics
Q: How do you debug a "multiple definition" or "undefined reference" linker error?
Both are about cross-file consistency, but they have different fixes.
undefined reference to 'foo' — the linker couldn't find a definition for foo in any of the .o files or libraries you linked. Common causes:
- Missing library. For `printf` and friends, make sure libc is being linked — check for `--specs=nano.specs` or an explicit `-lc` in the link command. Vendor SDKs often need an explicit `-lvendor_lib`.
- Source file not in the build. The file containing `foo`'s definition isn't being compiled, so no `.o` exists with the symbol. Check the build system's source list.
- Wrong symbol name (C/C++ boundary). C++ does name mangling. If a C++ caller references a C function declared without `extern "C"`, the C++ side calls `_Z3foov` (mangled) but the C definition provides `foo` (unmangled). Wrap the declaration in `extern "C" { ... }`.
- Typo or wrong prototype. Check spelling and signature.
Diagnostic: arm-none-eabi-nm on each candidate .o shows whether foo is defined (T) or only referenced (U).
multiple definition of 'foo' — the same symbol is defined in two .o files. Common causes:
- Non-static function or global in a header. Every `.c` that includes the header gets its own copy. Fix: mark it `static inline` (a per-TU copy is allowed) or move the body to a single `.c` file with only a declaration in the header.
- A `.c` file is being compiled and linked twice. Check the source list for duplicates.
- Two libraries provide the same symbol. Less common; usually the fix is to remove one or use `--allow-multiple-definition` (not recommended — it usually masks a real problem).
Diagnostic: arm-none-eabi-nm on each .o to find where the symbol is defined.
The general workflow: run the build, copy the failing symbol name, run nm on every .o in the link to see who defines it (or doesn't). Both errors are 1-minute fixes once you know which .o is at fault.
Q: What is Link-Time Optimization (LTO) and what are its embedded trade-offs?
LTO (-flto flag) defers final code generation until the link step. Normally, the compiler processes one translation unit at a time and emits final machine code per .o. With LTO, each .o contains an intermediate representation (essentially compiler IR) instead, and the linker invokes the compiler again at link time to do final code generation across the whole program.
The wins:
- Cross-TU inlining: a small accessor function defined in `foo.c` can be inlined into a caller in `bar.c` — impossible without LTO because the compiler couldn't see across files
- Whole-program dead-code elimination: functions that nothing actually calls (transitively from the entry point) get dropped, even if they're not `static`
- Better register allocation across function boundaries: the compiler can see how a callee uses its arguments and optimize the caller accordingly
Typical wins for embedded: 5-15% Flash reduction, 2-5% performance improvement on hot paths.
The trade-offs:
- Longer link times: linking is now also compiling. Can be 2-5x slower for full builds. Incremental builds may not benefit.
- Harder debugging: heavy inlining means stack traces have fewer named frames. `addr2line` may report a function that doesn't exist in the source (because it was inlined).
- Compatibility issues: hand-written inline assembly may not optimize correctly across LTO boundaries; some linker scripts that depend on specific section layout may need adjustment.
- All `.o` files in the link must be LTO-enabled for full benefit — vendor prebuilt libraries that are not LTO-built won't participate.
Recommendation for embedded: try LTO on a release build, measure, and decide. If the size win is meaningful and the debug experience is acceptable, ship with LTO enabled. Don't enable in debug builds (slower link, harder to step through code).
A related flag is -fwhole-program for single-.o programs — same idea but doesn't require LTO infrastructure.
Toolchains & Build Systems
Q: What's the difference between Make and CMake, and when do you use each?
Make is a build executor. It reads a Makefile that lists rules of the form target: dependencies; recipe, and runs the recipe whenever a dependency is newer than the target. Make is one layer — you write Make rules and Make compiles your code. Strengths: transparent (you see exactly what runs), no abstraction layers, ubiquitous. Weaknesses: header dependency tracking is manual (need -MMD -MP flags); cross-platform support is poor; doesn't generate IDE projects.
CMake is a meta-build system. It reads CMakeLists.txt (a portable, higher-level project description) and generates native build files: Makefiles, Ninja files, Visual Studio solutions, Xcode projects. You then run the generated build. CMake is two layers — you write CMakeLists.txt, CMake generates build files, the generated build executes them. Strengths: cross-platform, generates IDE projects, supports cross-compilation cleanly via toolchain files, target-based modern style scales well. Weaknesses: more abstraction (errors can be confusing); steeper learning curve; the generated files are not human-friendly.
When to use each:
- Pick Make for small projects (< 20 source files), single-platform, when you want maximum transparency, or when the existing build is a working Makefile and migration isn't justified
- Pick CMake for larger projects, multi-platform, when you need IDE integration (CLion, VS Code, Visual Studio), or when greenfielding a project with multiple developers
- Use IDE-generated builds (STM32CubeIDE, MCUXpresso) when working in a vendor-supported flow where the IDE is non-negotiable
For embedded specifically, CMake + Ninja is the modern default. The CMake toolchain file handles cross-compilation cleanly: a small arm-none-eabi.cmake script sets CMAKE_SYSTEM_NAME Generic, the cross-compiler paths, and a few build options. Pass -DCMAKE_TOOLCHAIN_FILE=arm-none-eabi.cmake to switch from native to cross builds without touching CMakeLists.txt.
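A minimal sketch of such a toolchain file, assuming the cross tools are on PATH (real projects add CPU flags, specs options, and the linker script):

```cmake
# arm-none-eabi.cmake — illustrative cross-compilation toolchain file
set(CMAKE_SYSTEM_NAME Generic)            # bare metal: no OS, skip OS-specific checks
set(CMAKE_SYSTEM_PROCESSOR arm)

set(CMAKE_C_COMPILER   arm-none-eabi-gcc)
set(CMAKE_ASM_COMPILER arm-none-eabi-gcc)
set(CMAKE_OBJCOPY      arm-none-eabi-objcopy)

# Compiler sanity checks build a static library instead of trying to run a binary
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
```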
Q: Why don't embedded projects use glibc?
glibc (GNU C library) is the standard libc on desktop Linux. It's huge — a hello-world program statically linked against glibc is 100+ KB minimum. It assumes a full POSIX environment: filesystems, processes, dynamic linking, threads, locales, internationalization. None of that exists on bare-metal MCUs.
Embedded targets use one of three smaller alternatives:
- newlib — ~25-40 KB hello-world. The default in `arm-none-eabi-gcc`. Reasonably full-featured.
- newlib-nano — ~5-10 KB hello-world. A stripped-down newlib variant: smaller printf (no float by default), simpler malloc, less locale support. Most embedded projects use this. Switched in via `--specs=nano.specs`.
- picolibc — ~3-8 KB hello-world. A modern fork; even smaller, more modular, doesn't ship default syscall stubs (you provide them).
For embedded Linux systems (e.g., Yocto-built distros on a Cortex-A), glibc is fine because the OS provides the runtime environment. There you might also see musl as an alternative — smaller than glibc, MIT-licensed, more standards-compliant.
The interview-relevant lesson: don't confuse bare-metal embedded with embedded Linux. The libc choice is one of the clearest indicators of which environment you're in. If someone says "I just call printf and floats work" on a 256 KB Cortex-M, they probably haven't measured Flash usage and are unaware they've doubled their binary size by pulling in newlib's IEEE 754 conversion routines.
