Why does pipeline flush matter for interrupt latency?
When an interrupt is taken, the CPU abandons the instructions currently in flight and begins executing the ISR. Instructions that have been fetched or decoded but not yet executed are discarded (flushed). The CPU then stacks the context (registers R0-R3, R12, LR, PC, and xPSR on Cortex-M), fetches the ISR address from the vector table, and begins refilling the pipeline with ISR instructions. On Cortex-M3/M4 the vector fetch overlaps with stacking rather than following it. Interrupt latency is the total time from interrupt assertion to execution of the first ISR instruction.
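The stacked context described above can be pictured as a C struct. This is a minimal sketch, assuming the basic eight-word frame (no floating-point state) with R0 at the lowest address; the type and function names are illustrative and do not come from any vendor header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic Cortex-M exception frame, lowest address first.
 * Assumption: no FPU context is stacked. Names are illustrative. */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;    /* link register of the interrupted code */
    uint32_t pc;    /* address the interrupted code resumes at */
    uint32_t xpsr;  /* program status register */
} exception_frame_t;

/* Eight 32-bit words are pushed on every exception entry. */
static_assert(sizeof(exception_frame_t) == 32, "frame is eight words");
static_assert(offsetof(exception_frame_t, pc) == 24, "PC is the seventh word");

/* Return the address at which the interrupted code will resume. */
static inline uint32_t frame_return_address(const exception_frame_t *frame)
{
    return frame->pc;
}
```

A fault handler that receives the stack pointer value at entry could cast it to `const exception_frame_t *` and read `pc` to find out which instruction was interrupted.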
A deeper pipeline means more stages to flush and refill. The Cortex-M0 has a 3-stage pipeline (fetch, decode, execute), so a flush discards at most 2 instructions and the pipeline refills in 3 cycles. The Cortex-M7 has a 6-stage pipeline (plus branch prediction and dual-issue), so a flush is more expensive, though the M7 compensates with higher clock speeds and more sophisticated prefetch logic. Pipeline depth is only part of the total, because stacking eight registers accounts for most of the entry cycles. With zero-wait-state memory, ARM quotes a fixed 16-cycle interrupt latency for the Cortex-M0 and 12 cycles for the Cortex-M3/M4. The Cortex-M7 also has a 12-cycle minimum, but with more variability depending on the instruction being interrupted.
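Cycle counts only become response times once a clock frequency is fixed, which is how a faster core can offset a more expensive flush. The helper below is a small sketch of that conversion; the function name and the clock frequencies in the note that follows are assumptions chosen for illustration, not figures from a datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a latency in core clock cycles to nanoseconds.
 * Rounds up so the result is never optimistic. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return ((uint64_t)cycles * 1000000000ull + clock_hz - 1u) / clock_hz;
}
```

For example, `cycles_to_ns(12, 72000000)` gives 167 ns, while the same 12 cycles at an assumed 400 MHz, `cycles_to_ns(12, 400000000)`, gives 30 ns.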
Multi-cycle instructions amplify the problem. A long-running instruction such as a 64-bit multiply (UMULL), a load-multiple (LDM with 8 registers), or a divide (UDIV, which takes 2-12 cycles on M3/M4) can hold up interrupt entry, because an instruction that is already executing must either complete or be abandoned cleanly before the pipeline is flushed. Where the core has to wait, the delay depends on which instruction happens to be executing and on its operands, so it adds unpredictable time on top of the base latency. Cortex-M3 and later reduce the worst case by making load/store-multiple instructions interruptible: an LDM or STM can be stopped partway and then restarted or continued after the ISR returns. This is why interrupt latency specifications typically list both minimum and maximum values, and why real-time systems must budget for the worst case.
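The worst-case reasoning above can be written as a simple budget: base entry latency plus the longest instruction that cannot be abandoned or suspended once it has started. This is a rough, conservative sketch rather than a cycle-accurate model; the function name is hypothetical, and which instructions actually block interrupt entry on a given core is an assumption to check against that core's reference manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conservative upper bound on interrupt entry latency, in cycles.
 * base_cycles:     entry latency with nothing blocking the interrupt
 * blocking_cycles: cycle counts of instructions assumed to run to
 *                  completion before the interrupt can be taken
 * count:           number of entries in blocking_cycles */
static uint32_t worst_case_latency_cycles(uint32_t base_cycles,
                                          const uint32_t *blocking_cycles,
                                          size_t count)
{
    assert(blocking_cycles != NULL || count == 0);

    uint32_t longest = 0;
    for (size_t i = 0; i < count; ++i) {
        if (blocking_cycles[i] > longest) {
            longest = blocking_cycles[i];
        }
    }
    /* Assume the interrupt arrives just as the longest instruction starts. */
    return base_cycles + longest;
}
```

With a 12-cycle base and the 12-cycle upper figure for UDIV treated as blocking, the bound is 24 cycles. With an empty list it falls back to the base latency.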
Source: CPU Fundamentals Q&A
