Search topics...

Implement a Count Leading Zero (CLZ) bit algorithm, but don't use the assembler instruction. What optimizations to make it faster? What are some uses of CLZ?

0 upvotes
Practice with AISoon

CLZ counts the number of consecutive zero bits starting from the most significant bit (MSB) of a word until the first set bit. For a 32-bit input, CLZ(0x80000000) = 0, CLZ(1) = 31, and CLZ(0x00FF0000) = 8.

Edge case — input 0: There is no set bit, so the result is undefined for most hardware CLZ instructions (e.g. ARM CLZ actually returns 32, but x86 BSR is undefined for 0). A portable software implementation must explicitly decide what to return for 0; below it returns 32.

Binary-search implementation (the standard portable, branch-light approach). Instead of shifting one bit at a time (32 iterations), we narrow the position of the highest set bit in log2(32) = 5 steps. Each step asks "is the highest set bit in the top half of the remaining window?":

c
#include <stdint.h>
/* Returns 32 for n == 0; otherwise the count of leading zero bits. */
int clz32(uint32_t n) {
if (n == 0u) {
return 32;
}
int count = 0;
if ((n & 0xFFFF0000u) == 0u) { count += 16; n <<= 16; }
if ((n & 0xFF000000u) == 0u) { count += 8; n <<= 8; }
if ((n & 0xF0000000u) == 0u) { count += 4; n <<= 4; }
if ((n & 0xC0000000u) == 0u) { count += 2; n <<= 2; }
if ((n & 0x80000000u) == 0u) { count += 1; }
return count;
}

This is O(log N) — five comparisons — versus the naive O(N) bit-by-bit loop.

De Bruijn sequence implementation (table-based, no branches in the hot path). First "smear" all bits below the highest set bit down to fill a contiguous run of 1s (producing 2^(k+1) - 1), then use a perfect-hash multiply by a de Bruijn constant to index a 32-entry table giving log2. CLZ is then 31 - log2:

c
#include <stdint.h>
int clz32_debruijn(uint32_t n) {
if (n == 0u) {
return 32;
}
/* Smear: set every bit below the MSB. */
n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
/* n is now 2^(k+1) - 1; the de Bruijn constant/table map this smeared
value directly to log2 -- do NOT isolate the top bit first. */
static const uint8_t debruijn_log2[32] = {
0, 9, 1, 10, 13, 21, 2, 29,
11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7,
19, 27, 23, 6, 26, 5, 4, 31
};
int log2 = debruijn_log2[(uint32_t)(n * 0x07C4ACDDu) >> 27];
return 31 - log2;
}

Optimizations:

  • Prefer the binary-search or de Bruijn form over the bit-by-bit loop: O(log N) / O(1) instead of O(N).
  • The de Bruijn table approach is fully branch-free after the zero check, which helps on pipelines without good branch prediction (typical of small MCUs).
  • If the compiler/target supports it, use __builtin_clz (GCC/Clang) which maps to a single hardware instruction; just remember __builtin_clz(0) is undefined.
  • Specialize the width — don't run a 32-bit algorithm on data you know is 8- or 16-bit.

Uses of CLZ:

  • Floating-point normalization: finding how far to left-shift a mantissa so the leading 1 sits in the implicit/MSB position (essential in soft-float and fixed→float conversion).
  • Fast integer log2 / MSB position: log2(n) = 31 - clz(n), used for bucket sizing, allocator size classes, and tree depth.
  • Round up to next power of two: 1u << (32 - clz(n - 1)).
  • Priority encoders / interrupt arbitration: find the highest-priority pending bit in a bitmask in O(1).
  • Fixed-point math: computing scale/shift amounts for renormalization and reciprocal/division algorithms (Newton-Raphson seeding).
  • Compression and bit-stream parsing: counting run lengths and Golomb/Elias code lengths.