Describe how to multiply two 256-bit numbers using any 32-bit processor without FPU or special instructions. Two or more methods?

Question

Accepted Answer

Represent each 256-bit operand as an array of eight 32-bit "limbs" (base B = 2^32): A = Σ a[i]·B^i, B = Σ b[j]·B^j, for i, j ∈ [0,7]. The full product is up to 512 bits = sixteen 32-bit limbs. The core trick on a 32-bit CPU is that the product of two 32-bit values fits in 64 bits, so we accumulate partial products in a 64-bit accumulator (uint64_t) and propagate carries explicitly.

Method 1 — Schoolbook (long multiplication), O(n²) limb multiplies. For each pair (i, j), compute a[i] * b[j] (a 64-bit result), add it at limb position i + j, and ripple the carry upward. Using a 64-bit running accumulator per output column keeps carry handling clean:

c

#include <stdint.h>

#define N 8   /* 8 * 32 = 256-bit operands */

/* r[] has 2*N = 16 limbs (512-bit result). */
void mul256(uint32_t r[2 * N], const uint32_t a[N], const uint32_t b[N]) {
    for (int k = 0; k < 2 * N; k++) {
        r[k] = 0u;
    }
    for (int i = 0; i < N; i++) {
        uint64_t carry = 0u;
        for (int j = 0; j < N; j++) {
            /* product + existing column + incoming carry, all fit in 64 bits... */
            uint64_t cur = (uint64_t)a[i] * (uint64_t)b[j]
                         + (uint64_t)r[i + j]
                         + carry;
            r[i + j] = (uint32_t)cur;        /* low 32 bits stay */
            carry    = cur >> 32;            /* high 32 bits carry on */
        }
        r[i + N] += (uint32_t)carry;         /* deposit final carry */
    }
}

Carry safety: the maximum column value is (2^32 − 1)² + (2^32 − 1) + (2^32 − 1) = 2^64 − 1, which exactly fits in uint64_t with no overflow — that's why the 64-bit accumulator is sufficient. For 256-bit × 256-bit this is 64 limb-multiplies.

If only 32-bit arithmetic is available (no uint64_t), split each 32-bit limb into two 16-bit halves so each partial product is ≤ (2^16−1)² and fits in 32 bits, then combine — the same algorithm, finer granularity.

Method 2 — Karatsuba (divide and conquer), O(n^1.585). Split each 256-bit number into high/low 128-bit halves: A = A_hi·2^128 + A_lo, B = B_hi·2^128 + B_lo. Naively that's four 128-bit multiplies; Karatsuba reduces it to three:

text

z0 = A_lo * B_lo
z2 = A_hi * B_hi
z1 = (A_lo + A_hi) * (B_lo + B_hi) - z0 - z2     /* = A_lo*B_hi + A_hi*B_lo */

product = z2 * 2^256 + z1 * 2^128 + z0

Each 128-bit multiply can itself be done by schoolbook (or recursively split into 64-bit halves). Trading one multiply for several add/subtracts wins as operand size grows. Carry handling is the subtle part: A_lo + A_hi can overflow into a 129th bit, and the subtraction of z0 and z2 can borrow, so the implementation must use big-integer add/sub with carry/borrow propagation across limbs and allocate enough limbs for the intermediate sums.

Other approaches worth mentioning:

Comba / column (product-scanning) multiplication: like schoolbook but iterate over each output column k and sum all a[i]*b[j] with i+j == k, keeping a wider (e.g. 96-bit, modeled as two uint64_t) accumulator. It minimizes memory traffic (each result limb is written once) and is the common choice in crypto libraries for fixed sizes like 256-bit.
Toom-Cook (generalization of Karatsuba) and FFT-based multiplication — only beneficial for much larger numbers than 256 bits; for 256-bit, schoolbook/Comba (often with one Karatsuba split) is typically fastest in practice.

For 256-bit specifically, plain schoolbook/Comba with a 64-bit accumulator is usually the pragmatic choice; Karatsuba starts to pay off at larger widths or when limb-multiplies are very expensive relative to adds.