Explain processor pipelines, and the pros/cons of shorter or longer pipelines.

Question

Accepted Answer

What a pipeline is

A CPU pipeline breaks instruction execution into stages so that multiple instructions are processed concurrently, like an assembly line. A classic RISC 5-stage pipeline:

IF (instruction fetch) → ID (decode/register read) → EX (execute/ALU) → MEM (memory access) → WB (write-back)

With one instruction in each stage, the CPU can (ideally) retire one instruction per clock even though each instruction takes several cycles end-to-end. Pipelining increases instruction throughput and lets the chip run at a higher clock frequency, because each stage does less work per cycle (shorter critical path).

Hazards limit the ideal one-per-cycle rate:

- Structural hazards: two stages need the same resource.
- Data hazards: an instruction needs a result not yet written back (mitigated by forwarding/bypassing and, failing that, stalls/bubbles).
- Control hazards: branches change the PC, so instructions fetched after a branch may be wrong; mitigated by branch prediction, branch delay slots (some ISAs), and speculative execution.

A mispredict forces a pipeline flush.

Shorter pipelines: pros/cons

Pros:
- Lower branch mispredict / flush penalty: fewer stages to flush, so wrong-path recovery costs fewer cycles, which makes timing more deterministic and suits hard real-time work.
- Lower latency per instruction (in cycles), simpler hazard logic, smaller area/power, easier to verify.
- More predictable timing, which is valued in microcontrollers (e.g., Cortex-M0+ has a 2-stage pipeline; M3/M4 a 3-stage).

Cons:
- Lower maximum clock frequency (each stage does more work, so the critical path is longer), so potentially lower peak throughput.

Longer (deeper) pipelines: pros/cons

Pros:
- Higher achievable clock frequency (less logic per stage), enabling higher peak throughput on well-predicted, streaming code.
- Leaves room for the extra stages (such as register rename and issue/scheduling) that aggressive superscalar/out-of-order designs need.

Cons:
- Higher mispredict penalty: a flush throws away many in-flight instructions (a deep pipeline like the old Pentium 4 "NetBurst," ~20–31 stages, paid a heavy branch-mispredict cost), so it requires very good branch prediction.
- More complex forwarding, hazard, and speculation logic; more power; harder to make timing-deterministic (bad for hard real-time).
- Diminishing returns: beyond a point, added pipelining yields little net performance because stall/flush overhead grows.

Bottom line: deeper pipelines trade single-instruction latency and predictability for higher clock speed and throughput; shorter pipelines trade peak speed for low latency, low power, and determinism, which is why MCUs favor short pipelines and high-performance application/desktop CPUs use deeper ones (balanced with strong branch prediction).
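
To put rough numbers on the depth trade-off, here is a toy model in Python. It is only a sketch under stated assumptions, not a measurement of any real CPU: the function names, the 10 ns of total logic delay, the 0.5 ns per-stage latch overhead, and the branch statistics are all made up for illustration, and a mispredict is assumed to cost a full flush of `depth - 1` cycles.

```python
def time_per_instruction_ns(depth, logic_ns=10.0, latch_ns=0.5,
                            branch_frac=0.2, mispredict_rate=0.1):
    """Average time per instruction for a `depth`-stage pipeline (toy model)."""
    # Assumption: total logic delay splits evenly across stages, and every
    # stage adds a fixed latch/register overhead to the clock period.
    cycle_ns = logic_ns / depth + latch_ns
    # Assumption: a mispredicted branch flushes the whole pipeline, wasting
    # (depth - 1) cycles. Data and structural hazards are ignored.
    flush_cycles = depth - 1
    cpi = 1.0 + branch_frac * mispredict_rate * flush_cycles
    return cpi * cycle_ns


def best_depth(max_depth=100, **params):
    """Depth in 1..max_depth that minimizes time per instruction in this model."""
    return min(range(1, max_depth + 1),
               key=lambda d: time_per_instruction_ns(d, **params))


if __name__ == "__main__":
    for depth in (1, 2, 5, 10, 20, 40, 80):
        ns = time_per_instruction_ns(depth)
        print(f"{depth:3d} stages: {ns:6.3f} ns/instruction")
    print("best depth, 10% mispredicts:", best_depth())
    print("best depth,  2% mispredicts:", best_depth(mispredict_rate=0.02))
```

In this model the time per instruction drops steeply over the first few stages, flattens out, and eventually rises again once flush cycles outweigh the shorter clock period. Lowering `mispredict_rate` pushes the best depth deeper, which is the same point as above: deep pipelines only pay off with very good branch prediction.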