Fundamentals
Q: Mealy vs Moore — which is more common in firmware?
In a Moore machine, the output depends only on the current state. In a Mealy machine, the output depends on both the state and the most recent event (so the same state can produce different outputs based on which event arrived).
In practice, embedded firmware FSMs are mostly Moore-style, often with Mealy-style accents.
The Moore aspect is dominant because most embedded behavior is mode-driven: in CONNECTED, the radio stays on; in IDLE, it stays off. The state implies long-running behavior — peripheral configuration, LED color, sensor sampling rate — and that's set by entry actions when you transition into the state. As long as you remain in the state, the behavior is constant.
The Mealy aspect comes in for one-shot actions tied to specific transitions. A UART parser in state WAITING_FOR_LENGTH_BYTE does different things on BYTE_RX(0x00) (zero-length frame: emit empty message and return to idle) vs BYTE_RX(other) (start collecting payload). The action depends on which byte arrived, not just on the state. That's a Mealy transition.
So in real code you'll see: state-level entry/exit actions for Moore-style invariants, plus per-transition actions for Mealy-style event-specific work. The classification is mostly an analysis tool — you don't typically declare "this is a Moore FSM" — but knowing the distinction helps you think about where actions belong.
The interview-relevant test is: when asked, give a concrete example. "In a charger controller, CHARGING_FAST is Moore — the PMIC is configured for high current as long as we're in this state. But the transition from CHARGING_FAST to CHARGING_TRICKLE on THRESHOLD_REACHED(voltage) is Mealy — the action depends on the voltage value carried in the event."
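The charger example can be sketched in a few lines. This is a hypothetical illustration (the names `CHARGING_FAST`, `pmic_current_ma`, `on_threshold_reached` are invented for this sketch): the entry-style action encodes the Moore invariant, while the transition handler uses the event's payload, which is the Mealy part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical charger-controller sketch: Moore-style state behavior
 * vs a Mealy-style transition that depends on the event's payload. */
typedef enum { CHARGING_FAST, CHARGING_TRICKLE } ChgState;

static ChgState chg_state = CHARGING_FAST;
static uint32_t pmic_current_ma = 0;   /* stand-in for PMIC configuration */

/* Moore: entering a state sets its long-running behavior. */
static void enter_state(ChgState s) {
    chg_state = s;
    pmic_current_ma = (s == CHARGING_FAST) ? 2000 : 100;
}

/* Mealy: the transition's action depends on the voltage carried
 * in the event, not just on the current state. */
void on_threshold_reached(uint16_t millivolts) {
    if (chg_state == CHARGING_FAST && millivolts >= 4100) {
        enter_state(CHARGING_TRICKLE);
    }
}
```

The same state (`CHARGING_FAST`) reacts differently to `on_threshold_reached(4000)` and `on_threshold_reached(4200)` — that event-dependence is the Mealy accent on an otherwise Moore-style design.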
Q: How do you test a state machine effectively?
The single best property of FSMs for testing is determinism: given the same starting state and event sequence, the FSM produces the same final state and same actions. No timing dependencies, no peripheral mocking, no flakiness.
The pattern is straightforward:
- Reset the FSM to a known starting state (via `fsm_init` or a test-only `fsm_force_state` helper)
- Inject a sequence of events by calling the dispatch function directly: `fsm_handle_event(E_BUTTON)`
- Assert the final state and that the right actions fired (ideally via mocks that record calls)
For coverage, aim for transition coverage: every (state, event) pair in your transition table should be exercised by tests at least once. With a 2D transition table, this is a generated loop:
```c
for (int s = 0; s < NUM_STATES; s++) {
    for (int e = 0; e < NUM_EVENTS; e++) {
        fsm_init();
        fsm_force_state(s);
        State expected = fsm_table[s][e].next;
        fsm_handle_event(e);
        assert(fsm_get_state() == expected);
    }
}
```
For switch-case or function-pointer FSMs, the same idea applies — iterate over enum values.
Beyond transition coverage, two more properties to check:
- Invariants: for safety-critical FSMs, assert "whenever in state X, condition Y holds." Run the property check after every event injection.
- Sequence behavior: longer event sequences exercise paths through the FSM. Keep a handful of "happy path" and "error recovery" sequences as separate test cases.
The reason this works is the separation of event source from FSM logic. If your dispatch is a single function fsm_handle_event(Event e) with no peripheral side effects in the FSM itself (actions go through mockable interfaces), tests are pure functions of event sequences. The whole test suite runs in microseconds.
A common testing failure mode is hidden state: if the FSM reads a global variable or a peripheral register that the tests can't easily set up, you've broken the deterministic-input property. The fix is to pass that data as an event parameter or refactor to make it injectable.
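The "pass hidden data as an event parameter" fix can be sketched as follows. This is an illustrative example (the `Event` struct, `EV_ADC_SAMPLE`, and the 4200 mV threshold are invented for the sketch): instead of the handler reading an ADC register directly, the sampled value travels inside the event, so the FSM stays a pure function of (state, event).

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: removing hidden state by carrying data in the event.
 * All names and thresholds here are illustrative. */
typedef enum { EV_ADC_SAMPLE } EventType;

typedef struct {
    EventType type;
    uint16_t adc_mv;   /* the data the FSM needs, injected by the caller */
} Event;

typedef enum { S_NORMAL, S_OVERVOLT } State;
static State state = S_NORMAL;

void fsm_handle_event(const Event *e) {
    /* Deterministic: no peripheral reads, no globals — the output
     * depends only on the current state and the event contents. */
    if (state == S_NORMAL && e->type == EV_ADC_SAMPLE && e->adc_mv > 4200) {
        state = S_OVERVOLT;
    }
}
```

A test now constructs the event directly with whatever value it wants to exercise — no register mocking needed.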
Q: When would you choose table-driven over switch-case?
Table-driven becomes the better choice when an FSM grows past about 5 states or has significant transition uniformity. Three concrete reasons:
1. The transition table is the spec. A 2D matrix with rows = states, columns = events, cells = (next state, action) is reviewable as a literal table. You can verify completeness by inspection ("every cell has something defined"). Adding a new event adds a column without touching existing cells. Adding a new state adds a row. The structure stays linear; the dispatch logic is unchanged.
2. Testing is straightforward. As discussed in the testing question, transition coverage is a generated loop iterating rows × columns. Switch-case FSMs require manual enumeration of each (state, event) combo.
3. Code generation is natural. If you ever want to generate the FSM from a spec file (YAML, JSON, Yakindu, Stateflow), the natural output is a transition table. Switch-case generation is awkward.
Switch-case is better when:
- The FSM is small (≤ 5 states), so the whole thing fits in one screen
- Performance matters at the dispatch level — switch-case typically compiles to a tight jump table with no indirect call overhead
- You want each case's logic visible in line with control flow rather than encoded as data
The breakdown threshold is roughly when adding a new event becomes painful: if you're touching 8 different case blocks to add one event, switch-case is past its sweet spot. Refactor to table-driven or function-pointer.
A common middle ground in practice: switch-case for the outer state dispatch, but with shared helper functions for common actions. This avoids code duplication without going fully to a table.
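For concreteness, here is a minimal table-driven skeleton. It is a sketch, not a production implementation — the state/event names and `fsm_table` layout are invented for illustration — but it shows the structure the three arguments above rely on: rows are states, columns are events, and the dispatch function never changes as the table grows.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal table-driven FSM sketch; all names are illustrative. */
typedef enum { S_IDLE, S_ACTIVE, NUM_STATES } State;
typedef enum { E_START, E_STOP, NUM_EVENTS } Event;

typedef void (*Action)(void);
typedef struct { State next; Action action; } Transition;

static int start_count = 0;
static void do_start(void) { start_count++; }

/* Rows = states, columns = events. Every cell is defined, which
 * makes completeness reviewable by inspection. */
static const Transition fsm_table[NUM_STATES][NUM_EVENTS] = {
    [S_IDLE]   = { [E_START] = { S_ACTIVE, do_start },
                   [E_STOP]  = { S_IDLE,   NULL     } },
    [S_ACTIVE] = { [E_START] = { S_ACTIVE, NULL     },
                   [E_STOP]  = { S_IDLE,   NULL     } },
};

static State state = S_IDLE;

/* The dispatch logic is fixed; growth happens only in the table. */
void fsm_handle_event(Event e) {
    const Transition *t = &fsm_table[state][e];
    if (t->action) t->action();   /* per-transition (Mealy-style) action */
    state = t->next;
}

State fsm_get_state(void) { return state; }
```

Adding a state or event touches only the enums and the table; the transition-coverage test loop from the testing question iterates this table directly.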
Q: What is the state-explosion problem and how do HSMs solve it?
State explosion is what happens when you try to model a system with multiple independent dimensions of state as a flat FSM. The state count is the product of the values across dimensions — a small number of dimensions explodes into an unmanageable cross-product.
Example: a phone with three dimensions:
- Power: ON / OFF
- Connection: CONNECTED / DISCONNECTED / CONNECTING
- Audio: MUTED / UNMUTED
A flat FSM needs 2 × 3 × 2 = 12 states. Add a fourth dimension and you might be at 24. Worse, every event has to be handled in every state — POWER_BUTTON needs 12 transitions, all doing roughly the same thing. Forgetting to handle POWER_BUTTON in one state silently breaks the device for that state combination.
A hierarchical state machine (HSM) solves this two ways:
Nested states with behavioral inheritance: a parent state contains child states, and a child state inherits all of its parent's transitions unless it explicitly overrides them. So POWER_BUTTON → POWERED_OFF is defined once at the parent POWERED_ON level and applies automatically to every child state nested inside. Adding new states later inherits the transition without changes. Like method inheritance in OOP.
Orthogonal regions: a composite state can contain multiple independent concurrent sub-FSMs. The phone's connection FSM and audio FSM both live inside POWERED_ON as orthogonal regions, each with its own current state, processing events independently. The total state count becomes additive across dimensions (3 connection + 2 audio = 5) instead of multiplicative (3 × 2 = 6) — and the savings grow rapidly with more dimensions.
Combined: a flat FSM that would need 24 states becomes an HSM with maybe 8 states organized into a clear hierarchy. The state diagram becomes readable. Adding a new dimension is bounded — define a new orthogonal region with its own states. Behavior that should be uniform across many sub-states is uniform by construction.
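At the implementation level, orthogonal regions can be as simple as independent sub-state variables that are only active while the parent state holds. A minimal sketch of the phone example (the event names and reset-on-entry policy here are illustrative assumptions, not from any framework):

```c
#include <assert.h>

/* Sketch: orthogonal regions as independent sub-state variables
 * nested under a parent POWERED_ON state. Names are illustrative. */
typedef enum { POWERED_OFF, POWERED_ON } PowerState;
typedef enum { DISCONNECTED, CONNECTING, CONNECTED } ConnState;
typedef enum { UNMUTED, MUTED } AudioState;

static PowerState power = POWERED_OFF;
static ConnState  conn  = DISCONNECTED;   /* region 1: connection */
static AudioState audio = UNMUTED;        /* region 2: audio */

typedef enum { EV_POWER_BUTTON, EV_LINK_UP, EV_MUTE_TOGGLE } Event;

void phone_handle_event(Event e) {
    if (e == EV_POWER_BUTTON) {      /* handled once, at the parent level */
        power = (power == POWERED_ON) ? POWERED_OFF : POWERED_ON;
        conn  = DISCONNECTED;        /* regions reset on parent entry */
        audio = UNMUTED;
        return;
    }
    if (power != POWERED_ON) return; /* regions inactive while parent is off */
    switch (e) {                     /* each region consumes its own events */
    case EV_LINK_UP:     conn  = CONNECTED; break;
    case EV_MUTE_TOGGLE: audio = (audio == MUTED) ? UNMUTED : MUTED; break;
    default: break;
    }
}
```

Note that `EV_POWER_BUTTON` is handled in exactly one place regardless of which (connection, audio) combination is current — the behavioral-inheritance benefit — and the total state count is additive across the two regions.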
The cost of HSMs is runtime overhead (event propagation traverses the hierarchy) and conceptual complexity (must reason about hierarchical entry/exit cascades). They're justified when a flat FSM gives you 15+ states with significant duplication, or when independent dimensions of state are obvious. For a 4-state debouncer, an HSM would be overkill — a flat FSM is simpler.
The Miro Samek QP framework is the canonical embedded HSM implementation, widely used in safety-critical (automotive ISO 26262, medical IEC 62304) systems.
Implementation & Design
Q: How do you handle events that arrive during a state transition?
This is the run-to-completion problem. The danger: an action inside a transition synchronously dispatches another event, which mutates state mid-transition. When the outer transition's state assignment runs, the inner update is lost — or worse, both updates conflict in undefined order.
```c
static void on_button_in_idle(void) {
    led_on();
    fsm_handle_event(E_TIMER_EXPIRY); /* re-entry — bug! */
    state = S_ACTIVE;
}
```
The standard discipline is run-to-completion: an event handler must finish entirely (actions executed, state updated) before the next event is dispatched. Two common ways to enforce it:
1. Event queue / message queue. The FSM task drains a queue, processing one event at a time:
```c
while (1) {
    xQueueReceive(event_q, &e, portMAX_DELAY);
    fsm_handle_event(e);
    /* loop back, get next event */
}
```
If an action needs to fire a follow-up event, it enqueues it (xQueueSend) rather than dispatching synchronously. The current event runs to completion; the follow-up gets processed cleanly on the next iteration. This is the canonical pattern in event-driven FSMs and is the basis for the active-object pattern.
2. Re-entry detection (less common). A boolean flag indicates whether dispatch is in progress; nested calls assert or buffer the event. Brittle and harder to reason about than queues; use only when you can't have a queue (extremely tight bare-metal).
In real-world embedded code, the queue-based pattern is overwhelmingly preferred. Most RTOS-based projects naturally arrive at it: each FSM is its own task with its own event queue (active object), and cross-FSM communication is queue posting. The pattern eliminates a whole class of subtle race bugs.
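On bare metal without an RTOS, the same discipline can be enforced with a small event ring buffer drained from the main loop. A minimal sketch (names `fsm_post`, `fsm_drain`, and the toy three-state FSM are invented for illustration) — the key point is that the action posts its follow-up event instead of dispatching it synchronously:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bare-metal run-to-completion sketch: a tiny event ring buffer.
 * All names and the toy FSM are illustrative. */
typedef enum { E_BUTTON, E_TIMER_EXPIRY } Event;
typedef enum { S_IDLE, S_ACTIVE, S_DONE } State;

#define QLEN 8u
static Event q[QLEN];
static volatile uint8_t head = 0, tail = 0;

bool fsm_post(Event e) {                   /* enqueue; never dispatches */
    uint8_t next = (uint8_t)((head + 1u) % QLEN);
    if (next == tail) return false;        /* queue full: drop or assert */
    q[head] = e;
    head = next;
    return true;
}

static State state = S_IDLE;

static void fsm_handle_event(Event e) {
    switch (state) {
    case S_IDLE:
        if (e == E_BUTTON) {
            fsm_post(E_TIMER_EXPIRY);      /* follow-up is queued, not nested */
            state = S_ACTIVE;              /* this assignment cannot be lost */
        }
        break;
    case S_ACTIVE:
        if (e == E_TIMER_EXPIRY) state = S_DONE;
        break;
    default:
        break;
    }
}

void fsm_drain(void) {                     /* the main-loop event pump */
    while (tail != head) {
        Event e = q[tail];
        tail = (uint8_t)((tail + 1u) % QLEN);
        fsm_handle_event(e);               /* one event runs to completion */
    }
}

State fsm_get_state(void) { return state; }
```

Each event finishes entirely before the next is dequeued, so the synchronous-re-entry bug from the earlier snippet is impossible by construction.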
The interview-relevant lesson: name run-to-completion explicitly, identify synchronous re-entry as the failure mode, and propose the queue-based fix. Bonus points for mentioning that ISRs should also enqueue (using *FromISR API variants) rather than calling the FSM directly.
Q: Walk me through implementing a UART byte-stream parser as an FSM.
A binary protocol parser is the textbook FSM use case. Suppose the protocol has the frame structure:
[SOF][LEN][PAYLOAD bytes...][CRC16]
Where SOF is a fixed start-of-frame byte (e.g., 0x7E), LEN is the payload length (1 byte, max 255), payload is LEN bytes, and CRC16 is two trailing bytes.
The FSM has four states:
| State | Meaning | Expects |
|---|---|---|
| WAIT_SOF | Idle / between frames | SOF byte (0x7E) |
| READ_LEN | Just got SOF | One length byte |
| READ_PAYLOAD | Reading payload bytes | LEN bytes |
| READ_CRC | Payload complete | Two CRC bytes |
The single event is BYTE_RX(value). The transition logic:
```c
typedef enum { WAIT_SOF, READ_LEN, READ_PAYLOAD, READ_CRC } State;

static State    state = WAIT_SOF;
static uint8_t  payload[256];
static uint8_t  payload_len;
static uint16_t bytes_read;    /* uint16_t: payload_len + 2 can reach 257 */
static uint16_t crc_received;

void on_byte_rx(uint8_t b) {
    switch (state) {
    case WAIT_SOF:
        if (b == 0x7E) state = READ_LEN;
        break;
    case READ_LEN:
        payload_len = b;
        bytes_read = 0;
        state = (b == 0) ? READ_CRC : READ_PAYLOAD;
        break;
    case READ_PAYLOAD:
        payload[bytes_read++] = b;
        if (bytes_read == payload_len) state = READ_CRC;
        break;
    case READ_CRC:
        crc_received = (uint16_t)((crc_received << 8) | b);
        if (++bytes_read == (uint16_t)(payload_len + 2)) {
            if (crc_received == compute_crc16(payload, payload_len)) {
                deliver_message(payload, payload_len);
            } else {
                /* CRC mismatch — drop frame, log error */
            }
            state = WAIT_SOF;
        }
        break;
    }
}
```
Things to discuss in an interview:
- Run-to-completion: each byte is a discrete event, processed to completion before the next byte arrives
- Buffer ownership: the `payload[]` buffer is owned by the FSM; `deliver_message` must copy the data if it needs to retain it, since the next frame will overwrite the buffer
- Error recovery: on CRC mismatch, return to `WAIT_SOF` — the next 0x7E byte (whether it's a real SOF or a payload byte coincidentally equal to 0x7E in a corrupted frame) restarts parsing
- Length validation: a malicious or corrupted LEN byte could be 255; the buffer must accommodate it or you must clamp/reject early
- Testing: drive the FSM with synthetic byte sequences (good frames, frames with bit flips, frames with wrong length, garbage between frames) and assert correct delivery / drop behavior
For a real protocol you'd often refactor to a table-driven FSM as it grows (escape characters, multi-byte length fields, variable header formats). The structure stays the same; only the dispatch encoding changes.
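The byte-sequence test drive mentioned above might look like this. This is a self-contained sketch: the parser is a condensed restatement of the one above, and the CRC16-CCITT routine is a stand-in (any function used consistently on both the sender and receiver side exercises the FSM the same way).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Condensed parser restatement for test-driving; CRC16-CCITT used
 * as a stand-in checksum. */
typedef enum { WAIT_SOF, READ_LEN, READ_PAYLOAD, READ_CRC } PState;
static PState pstate = WAIT_SOF;
static uint8_t payload[256], payload_len;
static uint16_t bytes_read, crc_received;

static int delivered = 0;
static uint8_t delivered_buf[256], delivered_len;
static void deliver_message(const uint8_t *p, uint8_t n) {
    memcpy(delivered_buf, p, n);   /* copy: the FSM owns payload[] */
    delivered_len = n;
    delivered++;
}

static uint16_t compute_crc16(const uint8_t *p, uint8_t n) {
    uint16_t crc = 0xFFFF;         /* CRC16-CCITT, bitwise */
    for (uint8_t i = 0; i < n; i++) {
        crc ^= (uint16_t)((uint16_t)p[i] << 8);
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

void on_byte_rx(uint8_t b) {
    switch (pstate) {
    case WAIT_SOF:
        if (b == 0x7E) pstate = READ_LEN;
        break;
    case READ_LEN:
        payload_len = b; bytes_read = 0;
        pstate = (b == 0) ? READ_CRC : READ_PAYLOAD;
        break;
    case READ_PAYLOAD:
        payload[bytes_read++] = b;
        if (bytes_read == payload_len) pstate = READ_CRC;
        break;
    case READ_CRC:
        crc_received = (uint16_t)((crc_received << 8) | b);
        if (++bytes_read == (uint16_t)(payload_len + 2)) {
            if (crc_received == compute_crc16(payload, payload_len))
                deliver_message(payload, payload_len);
            pstate = WAIT_SOF;
        }
        break;
    }
}
```

The test below feeds leading garbage, then a good 2-byte frame whose payload contains 0x7E — verifying both SOF hunting and that a payload byte equal to SOF is handled correctly mid-frame.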
Q: What are entry/exit actions and why do they matter for safety-critical code?
Entry and exit actions are side effects attached to a state, not to a specific transition. The entry action runs every time the system enters the state, regardless of which transition led there. The exit action runs every time the system leaves the state, regardless of which transition leads out.
For safety, this matters because entry/exit actions encode invariants — properties that must hold structurally, not by convention.
Example: a motor controller with a MOTOR_RUNNING state. Safety requires that whenever the motor is running, the brake is released, and whenever it's not running, the brake is engaged. Two ways to encode this:
Wrong: put release_brake() on every transition that enters MOTOR_RUNNING (from IDLE, from RECOVERY, from MAINTENANCE...) and engage_brake() on every transition that leaves it. Six months later, someone adds an EMERGENCY → MOTOR_RUNNING transition and forgets the release. Silent safety violation.
Right: put release_brake() in the entry action of MOTOR_RUNNING and engage_brake() in the exit action. Now no matter how the state is entered or left — by user command, by sensor trigger, by recovery from fault, by future code added years later — the brake is released on entry and engaged on exit. The invariant is structural; the framework guarantees it.
For safety certification (ISO 26262, IEC 62304, DO-178C), this kind of structural guarantee is much easier to argue. You can point to the state diagram and say "the brake is engaged whenever we're not in MOTOR_RUNNING because of the exit action — and we never bypass the entry/exit framework." A reviewer can verify this by checking that all transitions go through the proper dispatch path, not by reading every if.
The same argument applies to:
- Watchdog management: `WORKING` state's entry starts the watchdog; exit stops it. Any state where the watchdog isn't being kicked is an exit-from-WORKING.
- Resource locks: `EXCLUSIVE_ACCESS` state's entry acquires the lock; exit releases it. No code path can leave with the lock held.
- LED indicators: the visual state matches the FSM state by construction, not by convention.
- Logging: entry to `FAULT` triggers diagnostic capture; exit triggers diagnostic flush.
The interview answer is: entry/exit actions encode invariants structurally; per-transition actions encode opt-in side effects. For anything safety-related, prefer the structural guarantee — the bug surface is much smaller.
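The structural guarantee comes from routing every transition through one dispatch path that runs exit and entry actions itself. A minimal sketch of the motor example (the names `motor_transition`, `M_RUNNING`, and the function-pointer tables are illustrative, not from any particular framework):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: entry/exit actions attached to states and run by the
 * dispatch path itself, so no transition can bypass them.
 * All names here are illustrative. */
typedef enum { M_IDLE, M_RUNNING, M_FAULT, M_NUM_STATES } MotorState;

static int brake_engaged = 1;    /* stand-in for the brake actuator */

static void engage_brake(void)  { brake_engaged = 1; }
static void release_brake(void) { brake_engaged = 0; }

typedef void (*Action)(void);
static const Action entry_action[M_NUM_STATES] = {
    [M_RUNNING] = release_brake,  /* invariant: running => brake released */
};
static const Action exit_action[M_NUM_STATES] = {
    [M_RUNNING] = engage_brake,   /* invariant: not running => brake engaged */
};

static MotorState mstate = M_IDLE;

/* The only way to change state — every path gets the invariants. */
void motor_transition(MotorState next) {
    if (next == mstate) return;
    if (exit_action[mstate]) exit_action[mstate]();
    mstate = next;
    if (entry_action[mstate]) entry_action[mstate]();
}
```

A transition added years later — say, into `M_FAULT` from anywhere — still engages the brake, because the exit action fires in `motor_transition`, not in the caller.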
Q: How would you debug a state machine that's stuck in the wrong state?
The challenge: the system "feels stuck," but where? A state machine has many possible failure modes. A systematic approach:
1. Confirm what state the FSM is actually in. This sounds trivial but is often where investigations stall. Add logging at every state transition (or use a debugger to inspect the state variable). The actual state may not match what the developer assumed.
2. Verify events are being delivered. The next hypothesis: the FSM is in the right state, but the event you expect isn't arriving. Log every event posted to the queue (or every call to the dispatch function). If E_LINK_LOST is never logged when you yank the cable, the event source is broken — investigate the ISR or the source layer.
3. Verify transitions are firing as expected. Log not just the state, but the (state, event) pair on every dispatch. If the event arrives in the right state but no transition fires, the transition table or switch-case is missing the case.
4. Check guards. A guard returning false silently blocks a transition. Log guard evaluations: "transition (S_RUNNING, E_PAUSE) considered, guard returned false." If guards are failing, audit their logic — guards that depend on global state are often the culprit.
5. Check for re-entry. If an action dispatches another event synchronously, the inner call mutates state mid-transition and the outer call's state assignment then overwrites it. Look for action functions that call back into fsm_handle_event. Convert them to enqueueing.
6. Check event ordering. With a queue, events are delivered in order. If an unexpected event arrives first (e.g., E_TIMER_EXPIRY before E_BUTTON_RELEASE), the FSM may transition along an unexpected path. Inspect the queue history.
7. Check entry/exit invariants. If the state is right but a side effect is wrong (LED off in CONNECTED state), the entry action may have failed silently — perhaps a peripheral wasn't yet initialized. Re-order init or add an entry-action error check.
A useful tool is a state-history ring buffer that records every (state_before, event, state_after) tuple in a circular buffer in RAM. When the bug fires, dump the buffer and read the last 100 transitions in chronological order — the path to the wrong state is right there. This costs a few hundred bytes of RAM and pays for itself the first time you debug a sporadic FSM bug.
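A minimal version of that ring buffer fits in a few dozen bytes. This is a sketch with illustrative names and sizes (`TraceEntry`, `TRACE_LEN`, `trace_record`); the dispatcher would call `trace_record` around each transition, and a fault handler would dump the entries oldest-first.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: state-history ring buffer recording
 * (state_before, event, state_after) per dispatch.
 * Names and sizes are illustrative. */
typedef struct {
    uint8_t state_before;
    uint8_t event;
    uint8_t state_after;
} TraceEntry;

#define TRACE_LEN 100u
static TraceEntry trace[TRACE_LEN];
static uint32_t trace_idx = 0;   /* total entries ever; wraps modulo TRACE_LEN */

/* Call from the dispatcher:
 *   uint8_t before = state;
 *   run_transition(e);
 *   trace_record(before, e, state);
 */
void trace_record(uint8_t before, uint8_t ev, uint8_t after) {
    TraceEntry *t = &trace[trace_idx % TRACE_LEN];
    t->state_before = before;
    t->event        = ev;
    t->state_after  = after;
    trace_idx++;
}

uint32_t trace_count(void) { return trace_idx; }

/* steps_back = 0 is the most recent entry. Caller must ensure
 * steps_back < trace_count() (and < TRACE_LEN). */
const TraceEntry *trace_at(uint32_t steps_back) {
    return &trace[(trace_idx - 1u - steps_back) % TRACE_LEN];
}
```

Three bytes per entry keeps the whole 100-entry history at 300 bytes of RAM, cheap enough to leave compiled into release builds.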
For event-driven FSMs in active-object systems, the same principle extends: each active object's queue and dispatch can be traced. Tools like SEGGER SystemView visualize cross-task event flow.
