Do you know any assembly language? For which processor? Which assembly language is your favorite, and which do you dislike most? Have you ever written an assembler from scratch?

(Technical core — what an assembler is and how one is built.)

Assembly is the human-readable, processor-specific representation of machine instructions; common targets in embedded work include ARM (Cortex-M Thumb/Thumb-2, A-profile AArch64), x86/x86-64, AVR, PIC, MIPS, and RISC-V. Each has its own register set, addressing modes, and instruction encoding, so assembly is inherently non-portable.

What an assembler does: it translates mnemonics (e.g., MOV, ADD, LDR, BL) and their operands into the corresponding binary machine-code encodings for the target ISA. Key elements it handles:

  • Mnemonics → opcodes/encodings, including operand and addressing-mode encoding.
  • Labels and symbols, resolved to addresses; the assembler maintains a symbol table.
  • Directives (pseudo-ops) that aren't instructions but control assembly — e.g., .text/.data section selection, .org, .equ/.set constants, .byte/.word data definitions, alignment, and macros.
  • Relocation: emitting an object file with a symbol table and relocation entries for addresses that aren't known until link time, so the linker can patch them.
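To make the first bullet concrete, here is a minimal sketch of mnemonic-to-encoding translation for a single instruction, RISC-V RV32I `addi`, which uses the I-type layout (immediate, rs1, funct3, rd, opcode). The function name `encode_addi` is made up for this illustration; a real assembler would drive this from an instruction table rather than one function per mnemonic.

```python
# RV32I I-type layout, most significant bits first:
#   imm[11:0] (12 bits) | rs1 (5) | funct3 (3) | rd (5) | opcode (7)
OPCODE_OP_IMM = 0b0010011  # opcode shared by ADDI and the other register-immediate ops
FUNCT3_ADDI = 0b000


def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Return the 32-bit encoding of `addi rd, rs1, imm` (hypothetical helper)."""
    if not (0 <= rd < 32 and 0 <= rs1 < 32):
        raise ValueError("register number out of range")
    if not (-2048 <= imm <= 2047):
        raise ValueError("immediate does not fit in 12 signed bits")
    imm12 = imm & 0xFFF  # two's-complement truncation to 12 bits
    return (imm12 << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12) | (rd << 7) | OPCODE_OP_IMM


word = encode_addi(rd=1, rs1=0, imm=5)    # addi x1, x0, 5
print(f"{word:08x}")                      # 00500093
print(word.to_bytes(4, "little").hex())   # 93005000 (byte order as stored in memory)
```

The range checks matter as much as the bit shifting: an operand that does not fit its field has to be rejected (or handled by a different encoding), not silently truncated.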

How you'd write one from scratch (conceptually):

  1. Lexer/tokenizer to split each source line into labels, mnemonics, operands, and directives.
  2. Two-pass design, the classic approach. Pass 1 scans the source, tracks the location counter (which requires knowing the size of every instruction and data directive), and records each label's address in the symbol table, so that every label, including one referenced before it is defined, has a known address. Pass 2 generates the actual machine-code bytes, now that all symbols defined in the file have addresses, and emits relocation entries for anything the linker must fix up.
  3. Encoder that maps each (mnemonic, operands) pair to the precise bit fields of the target ISA's instruction format.
  4. Output in an object format (e.g., ELF) with sections, symbol table, and relocations, ready for the linker.
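The steps above can be sketched as a toy two-pass assembler. Everything about the target is an assumption made for illustration: the ISA is hypothetical (four mnemonics, each instruction is one opcode byte plus one operand byte), the only directive is `.byte`, and the output is a flat byte string with absolute addresses, so there is no object file and no relocation step.

```python
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "JMP": 0x03, "HALT": 0x00}  # hypothetical toy ISA
INSTR_SIZE = 2  # every toy instruction is an opcode byte plus an operand byte


def tokenize(source):
    """Yield (label, mnemonic, operand) per non-empty line; any field may be None."""
    for raw in source.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments
        if not line:
            continue
        label = None
        if ":" in line:
            label, line = line.split(":", 1)
            label, line = label.strip(), line.strip()
        parts = line.split()
        mnemonic = parts[0].upper() if parts else None
        operand = parts[1] if len(parts) > 1 else None
        yield label, mnemonic, operand


def pass1(lines):
    """Pass 1: advance the location counter and build the symbol table."""
    symbols, location = {}, 0
    for label, mnemonic, _ in lines:
        if label is not None:
            if label in symbols:
                raise ValueError(f"duplicate label: {label}")
            symbols[label] = location
        if mnemonic == ".BYTE":
            location += 1
        elif mnemonic is not None:
            location += INSTR_SIZE
    return symbols


def pass2(lines, symbols):
    """Pass 2: emit bytes, substituting label addresses from the symbol table."""
    out = bytearray()
    for _, mnemonic, operand in lines:
        if mnemonic is None:
            continue  # line held only a label
        if operand is None:
            value = 0
        elif operand in symbols:
            value = symbols[operand]
        else:
            value = int(operand, 0)  # accepts decimal or 0x-prefixed literals
        if mnemonic == ".BYTE":
            out.append(value & 0xFF)
        elif mnemonic in OPCODES:
            out += bytes([OPCODES[mnemonic], value & 0xFF])
        else:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
    return bytes(out)


def assemble(source):
    lines = list(tokenize(source))
    return pass2(lines, pass1(lines))


PROGRAM = """
start:  LOAD 5
        JMP done      ; forward reference, resolved by pass 1
        ADD 1
done:   HALT
"""
print(assemble(PROGRAM).hex())  # 0105030602010000
```

The `JMP done` line is the interesting one: `done` is defined two lines later, but by the time pass 2 runs the symbol table already maps it to address 6.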

The two-pass structure exists precisely to handle forward references (a branch to a label defined later in the file) cleanly; single-pass assemblers can work, but they must record each unresolved reference as they go and backpatch it once the label's address is known, typically at the end of the file.
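For comparison, a minimal sketch of the single-pass alternative with backpatching. It assumes a hypothetical toy ISA (opcode byte plus operand byte) and takes pre-tokenized tuples instead of source text to stay short; `assemble_one_pass` and the `fixups` list are names invented for this example.

```python
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "JMP": 0x03, "HALT": 0x00}  # hypothetical toy ISA


def assemble_one_pass(lines):
    """lines: iterable of (label_or_None, mnemonic, operand) tuples.

    operand is an int literal, a label name (str), or None.
    """
    out = bytearray()
    symbols = {}  # label -> address
    fixups = []   # (offset of operand byte, label name) still waiting for an address
    for label, mnemonic, operand in lines:
        if label is not None:
            symbols[label] = len(out)
        out.append(OPCODES[mnemonic])
        if operand is None:
            out.append(0)
        elif isinstance(operand, int):
            out.append(operand & 0xFF)
        elif operand in symbols:
            out.append(symbols[operand] & 0xFF)  # backward reference: already known
        else:
            fixups.append((len(out), operand))   # forward reference: remember the hole
            out.append(0)                        # placeholder byte
    # Backpatch: fill every hole now that all labels have been seen.
    for offset, name in fixups:
        if name not in symbols:
            raise ValueError(f"undefined label: {name}")
        out[offset] = symbols[name] & 0xFF
    return bytes(out)


program = [
    ("start", "LOAD", 5),
    (None, "JMP", "done"),     # forward reference, patched at the end
    (None, "ADD", 1),
    ("done", "JMP", "start"),  # backward reference, resolved immediately
]
print(assemble_one_pass(program).hex())  # 0105030602010300
```

The trade-off is visible in the code: the source is read once, but the assembler has to keep the emitted bytes and the fixup list in memory until the end.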