What is Link-Time Optimization (LTO) and what are its embedded trade-offs?

Question

Accepted Answer

LTO (the -flto flag) defers final code generation until the link step. Normally, the compiler processes one translation unit (TU) at a time and emits final machine code into each .o. With LTO, each .o instead carries the compiler's intermediate representation (GIMPLE bytecode in GCC, LLVM bitcode in Clang), and the linker invokes the compiler again at link time to optimize and generate code across the whole program.

The wins:

- Cross-TU inlining: a small accessor function defined in foo.c can be inlined into a caller in bar.c. Without LTO this is not possible, because the compiler only sees one file at a time (short of moving the definition into a header).
- Whole-program dead-code elimination: functions that nothing reaches, transitively from the entry point, get dropped even if they are not static.
- Better interprocedural optimization, including register allocation across function boundaries: the compiler can see how a callee uses its arguments and registers and optimize the caller accordingly.

Typical wins for embedded: 5-15% Flash reduction and 2-5% performance improvement on hot paths. Results vary by codebase, so treat these as rough expectations rather than guarantees.

The trade-offs:

- Longer link times: linking is now also compiling. Full builds can be 2-5x slower. Incremental builds benefit less than usual, because changing one file still reruns optimization and code generation at link time.
- Harder debugging: heavy inlining means stack traces have fewer named frames. An address may resolve to the caller a function was inlined into rather than the function you expected (addr2line -i prints the inline chain when debug info is present).
- Compatibility issues: inline or top-level assembly that refers to C symbols by name can break, because LTO may rename, localize, or remove symbols it believes are unreferenced (marking them __attribute__((used)) is the usual fix). Linker scripts that depend on a specific section layout may also need adjustment.
- All .o files in the link must be LTO-enabled for full benefit. Vendor prebuilt libraries that were not built with LTO still link, but they do not participate in cross-module optimization.

Recommendation for embedded: try LTO on a release build, measure, and decide. If the size win is meaningful and the debug experience is acceptable, ship with LTO enabled. Don't enable it in debug builds (slower link, harder to step through code).

A related GCC flag is -fwhole-program, for programs compiled as a single translation unit. It tells the compiler that the current unit is the whole program, so everything except main can be treated as if it were static. Same idea, but it doesn't require the LTO infrastructure.
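To make the cross-TU inlining and dead-code cases concrete, here is a minimal sketch. It is written as one file so it compiles on its own; the comments mark where a real project would split it into foo.c and bar.c. All function and variable names are hypothetical, and the build commands in the header comment assume a GCC-style driver (for a cross toolchain, substitute the prefixed driver, e.g. arm-none-eabi-gcc). Whether a given call is actually inlined or a function actually dropped depends on the compiler version and optimization level.

```c
/*
 * Sketch only: names are hypothetical, not from any real project.
 *
 * Assumed build once split into foo.c / bar.c (plus a main.c of your own):
 *   gcc -Os -flto -c foo.c -o foo.o
 *   gcc -Os -flto -c bar.c -o bar.o
 *   gcc -Os -flto foo.o bar.o main.o -o app
 * Note that -flto and the optimization level are passed at link time too,
 * because that is where code generation now happens.
 */
#include <stdint.h>

/* ---- would live in foo.c ---- */

static uint32_t tick_count = 0;   /* private to this translation unit */

/* Small accessor with external linkage. Without LTO, a caller in another
 * file must emit a real call; with LTO it is a candidate for inlining. */
uint32_t ticks_get(void)
{
    return tick_count;
}

void ticks_advance(uint32_t n)
{
    tick_count += n;
}

/* Not static, and never called. A per-file compile must keep it because
 * another file might use it; whole-program analysis can prove it is
 * unreachable and drop it. */
uint32_t ticks_unused_helper(void)
{
    return tick_count * 2u;
}

/* ---- would live in bar.c ---- */

/* Caller in a different translation unit. Unsigned subtraction wraps,
 * so this stays correct across counter overflow. */
uint32_t elapsed_since(uint32_t start)
{
    return ticks_get() - start;
}
```

One way to check what LTO actually did is to build once with and once without -flto and compare the symbol tables (for example with nm) and the section sizes (for example with size): in the LTO build, the accessor call and the unused helper may no longer appear as separate symbols.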