Implement a Count Leading Zero (CLZ) bit algorithm, but don't use the assembler instruction. What optimizations to make it faster? What are some uses of CLZ?

Question

Accepted Answer

CLZ counts the number of consecutive zero bits starting from the most significant bit (MSB) of a word until the first set bit. For a 32-bit input, CLZ(0x80000000) = 0, CLZ(1) = 31, and CLZ(0x00FF0000) = 8.

Edge case — input 0: There is no set bit, so the result is undefined for most hardware CLZ instructions (e.g. ARM CLZ actually returns 32, but x86 BSR is undefined for 0). A portable software implementation must explicitly decide what to return for 0; below it returns 32.

Binary-search implementation (the standard portable, branch-light approach). Instead of shifting one bit at a time (32 iterations), we narrow the position of the highest set bit in log2(32) = 5 steps. Each step asks "is the highest set bit in the top half of the remaining window?":

c

#include <stdint.h>

/* Returns 32 for n == 0; otherwise the count of leading zero bits. */
int clz32(uint32_t n) {
    if (n == 0u) {
        return 32;
    }
    int count = 0;
    if ((n & 0xFFFF0000u) == 0u) { count += 16; n <<= 16; }
    if ((n & 0xFF000000u) == 0u) { count += 8;  n <<= 8;  }
    if ((n & 0xF0000000u) == 0u) { count += 4;  n <<= 4;  }
    if ((n & 0xC0000000u) == 0u) { count += 2;  n <<= 2;  }
    if ((n & 0x80000000u) == 0u) { count += 1;             }
    return count;
}

This is O(log N) — five comparisons — versus the naive O(N) bit-by-bit loop.

De Bruijn sequence implementation (table-based, no branches in the hot path). First "smear" all bits below the highest set bit down to fill a contiguous run of 1s (producing 2^(k+1) - 1), then use a perfect-hash multiply by a de Bruijn constant to index a 32-entry table giving log2. CLZ is then 31 - log2:

c

#include <stdint.h>

int clz32_debruijn(uint32_t n) {
    if (n == 0u) {
        return 32;
    }
    /* Smear: set every bit below the MSB. */
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    /* n is now 2^(k+1) - 1; the de Bruijn constant/table map this smeared
       value directly to log2 -- do NOT isolate the top bit first. */

    static const uint8_t debruijn_log2[32] = {
        0,  9,  1, 10, 13, 21,  2, 29,
        11, 14, 16, 18, 22, 25,  3, 30,
        8, 12, 20, 28, 15, 17, 24,  7,
        19, 27, 23,  6, 26,  5,  4, 31
    };
    int log2 = debruijn_log2[(uint32_t)(n * 0x07C4ACDDu) >> 27];
    return 31 - log2;
}

Optimizations:

Prefer the binary-search or de Bruijn form over the bit-by-bit loop: O(log N) / O(1) instead of O(N).
The de Bruijn table approach is fully branch-free after the zero check, which helps on pipelines without good branch prediction (typical of small MCUs).
If the compiler/target supports it, use __builtin_clz (GCC/Clang) which maps to a single hardware instruction; just remember __builtin_clz(0) is undefined.
Specialize the width — don't run a 32-bit algorithm on data you know is 8- or 16-bit.

Uses of CLZ:

Floating-point normalization: finding how far to left-shift a mantissa so the leading 1 sits in the implicit/MSB position (essential in soft-float and fixed→float conversion).
Fast integer log2 / MSB position: log2(n) = 31 - clz(n), used for bucket sizing, allocator size classes, and tree depth.
Round up to next power of two: 1u << (32 - clz(n - 1)).
Priority encoders / interrupt arbitration: find the highest-priority pending bit in a bitmask in O(1).
Fixed-point math: computing scale/shift amounts for renormalization and reciprocal/division algorithms (Newton-Raphson seeding).
Compression and bit-stream parsing: counting run lengths and Golomb/Elias code lengths.