Do you know any assembler code? For which processor? Which assembler code is your favorite, and which do you hate? Have you ever written an assembler from scratch?
(Technical core — what an assembler is and how one is built.)
Assembly is the human-readable, processor-specific representation of machine instructions; common targets in embedded work include ARM (Cortex-M Thumb/Thumb-2, A-profile AArch64), x86/x86-64, AVR, PIC, MIPS, and RISC-V. Each has its own register set, addressing modes, and instruction encoding, so assembly is inherently non-portable.
What an assembler does: it translates mnemonics (e.g., MOV, ADD, LDR, BL) and their operands into the corresponding binary machine-code encodings for the target ISA. Key elements it handles:
- Mnemonics → opcodes/encodings, including operand and addressing-mode encoding.
- Labels and symbols, resolved to addresses; the assembler maintains a symbol table.
- Directives (pseudo-ops) that aren't instructions but control assembly, e.g., .text/.data section selection, .org, .equ/.set constants, .byte/.word data definitions, alignment, and macros.
- Relocation: emitting an object file with a symbol table and relocation entries for addresses that aren't known until link time, so the linker can patch them.
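To make "mnemonics to encodings" concrete, here is a minimal sketch that packs one RISC-V instruction, addi, into its 32-bit I-type word. The function name encode_addi and the constant names are illustrative choices, not part of any real assembler. Only this one instruction is covered.

```python
import struct

# RV32I ADDI uses the I-type layout (most significant bits first):
#   imm[11:0] | rs1 | funct3 | rd | opcode
OPCODE_OP_IMM = 0b0010011  # opcode shared by register-immediate ALU ops
FUNCT3_ADDI = 0b000        # funct3 value that selects ADDI


def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Return the 32-bit encoding of `addi rd, rs1, imm` (illustrative helper)."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register index out of range")
    if not (-2048 <= imm <= 2047):
        raise ValueError("immediate does not fit in 12 signed bits")
    imm12 = imm & 0xFFF  # two's-complement truncation to 12 bits
    return (imm12 << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12) | (rd << 7) | OPCODE_OP_IMM


word = encode_addi(rd=1, rs1=0, imm=5)  # addi x1, x0, 5
print(f"{word:#010x}")                  # 0x00500093
print(struct.pack("<I", word).hex())    # little-endian bytes: 93005000
```

A real encoder would be table-driven across all instruction formats. The point here is only that each operand lands in a fixed bit field.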
How you'd write one from scratch (conceptually):
- Lexer/tokenizer to split each source line into labels, mnemonics, operands, and directives.
- Two-pass design — the classic approach. Pass 1 scans the source, assigns addresses, and builds the symbol table (resolving label definitions and tracking the location counter) so that forward references are known. Pass 2 generates the actual machine-code bytes, now that all symbols have addresses, and emits relocation entries for anything the linker must fix up.
- Encoder that maps each (mnemonic, operands) pair to the precise bit fields of the target ISA's instruction format.
- Output in an object format (e.g., ELF) with sections, symbol table, and relocations, ready for the linker.
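The steps above can be sketched end to end for a made-up machine. Everything in this example is an assumption for illustration: the toy ISA (one opcode byte plus an optional operand byte), the opcode values, and the function names tokenize and assemble. It emits a flat binary image rather than a relocatable object file, so the relocation and ELF steps are left out.

```python
# Hypothetical toy ISA: mnemonic -> (opcode byte, number of one-byte operands).
OPCODES = {"NOP": (0x00, 0), "LDI": (0x01, 1), "ADD": (0x02, 1),
           "JMP": (0x03, 1), "HALT": (0xFF, 0)}


def tokenize(line):
    """Split one source line into (label, mnemonic, operands)."""
    line = line.split(";", 1)[0].strip()  # drop the comment, if any
    label = None
    if ":" in line:
        label, line = (part.strip() for part in line.split(":", 1))
    if not line:
        return label, None, []
    parts = line.split(None, 1)
    operands = [p.strip() for p in parts[1].split(",")] if len(parts) > 1 else []
    return label, parts[0].upper(), operands


def size_of(mnemonic, operands):
    """Bytes this statement occupies; pass 1 needs this before symbols resolve."""
    if mnemonic == ".BYTE":
        return len(operands)
    if mnemonic not in OPCODES:
        raise ValueError(f"unknown mnemonic {mnemonic}")
    return 1 + OPCODES[mnemonic][1]


def assemble(source):
    statements = [tokenize(line) for line in source.splitlines()]

    # Pass 1: advance the location counter and record label addresses.
    symbols, location = {}, 0
    for label, mnemonic, operands in statements:
        if label:
            if label in symbols:
                raise ValueError(f"duplicate label {label}")
            symbols[label] = location
        if mnemonic:
            location += size_of(mnemonic, operands)

    # Pass 2: every symbol now has an address, so emit the bytes.
    def value(token):
        return symbols[token] if token in symbols else int(token, 0)

    image = bytearray()
    for _, mnemonic, operands in statements:
        if mnemonic is None:
            continue
        if mnemonic != ".BYTE":
            opcode, count = OPCODES[mnemonic]
            if len(operands) != count:
                raise ValueError(f"{mnemonic} expects {count} operand(s)")
            image.append(opcode)
        image.extend(value(tok) & 0xFF for tok in operands)
    return bytes(image), symbols


SOURCE = """
start:  LDI 5        ; address 0
        JMP end      ; address 2, forward reference
        ADD 1        ; address 4
end:    HALT         ; address 6
"""

image, symbols = assemble(SOURCE)
print(image.hex())  # 010503060201ff
print(symbols)      # {'start': 0, 'end': 6}
```

The forward reference in JMP end resolves to 6 only because pass 1 has already walked the whole file and sized every statement.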
The two-pass structure exists precisely to handle forward references (a branch to a label defined later in the file) cleanly; single-pass assemblers can work but must backpatch unresolved references at the end.
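For comparison, a single-pass assembler with backpatching might look like the sketch below. The toy opcodes, the pre-tokenized input format, and the name assemble_single_pass are all assumptions made for illustration. Unknown symbols get a placeholder byte plus a fixup record, and the placeholders are patched once the whole input has been read.

```python
# Hypothetical toy ISA: every instruction is one opcode byte plus one operand byte.
OPCODES = {"LDI": 0x01, "JMP": 0x03}


def assemble_single_pass(program):
    """Assemble a list of ("label", name) or (mnemonic, operand) tuples.

    An operand is either an int or the name of a label.
    """
    image = bytearray()
    symbols = {}   # label name -> address
    fixups = []    # (offset of placeholder byte, label name)

    for kind, arg in program:
        if kind == "label":
            symbols[arg] = len(image)
            continue
        image.append(OPCODES[kind])
        if isinstance(arg, int):
            image.append(arg & 0xFF)
        elif arg in symbols:
            image.append(symbols[arg] & 0xFF)  # backward reference: already known
        else:
            fixups.append((len(image), arg))   # forward reference: patch later
            image.append(0x00)                 # placeholder byte

    # Backpatch every placeholder now that all labels have been seen.
    for offset, name in fixups:
        if name not in symbols:
            raise ValueError(f"undefined label {name}")
        image[offset] = symbols[name] & 0xFF
    return bytes(image), fixups


program = [
    ("JMP", "end"),    # forward reference, needs a fixup
    ("LDI", 5),
    ("label", "end"),  # defined at address 4
    ("JMP", "end"),    # backward reference, resolved immediately
]
image, fixups = assemble_single_pass(program)
print(image.hex())  # 030401050304
print(fixups)       # [(1, 'end')]
```

The trade-off is visible in the data structures: the two-pass design reads the source twice but keeps no fixup list, while this one reads it once and must keep the output in memory until patching is done.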
