
How would you debug a state machine that's stuck in the wrong state?


A system that "feels stuck" tells you nothing about where the fault is, and a state machine can fail in many ways. Work through the possibilities systematically:

1. Confirm what state the FSM is actually in. This sounds trivial but is often where investigations stall. Add logging at every state transition (or use a debugger to inspect the state variable). The actual state may not match what the developer assumed.

2. Verify events are being delivered. The next hypothesis: the FSM is in the right state, but the event you expect isn't arriving. Log every event posted to the queue (or every call to the dispatch function). If E_LINK_LOST is never logged when you yank the cable, the event source is broken — investigate the ISR or the source layer.

3. Verify transitions are firing as expected. Log not just the state, but the (state, event) pair on every dispatch. If the event arrives in the right state but no transition fires, the transition table or switch-case has no entry for that pair.

4. Check guards. A guard returning false silently blocks a transition. Log guard evaluations: "transition (S_RUNNING, E_PAUSE) considered, guard returned false." If guards are failing, audit their logic — guards that depend on global state are often the culprit.

5. Check for re-entry. If an action dispatches another event synchronously, the inner call changes the state mid-transition, and the outer call's state assignment then overwrites that change. Look for action functions that call back into fsm_handle_event, and convert them to enqueue the event instead.

6. Check event ordering. A queue delivers events in the order they were posted, which is not necessarily the order you assumed they would occur. If an unexpected event arrives first (e.g., E_TIMER_EXPIRY before E_BUTTON_RELEASE), the FSM may transition along an unexpected path. Inspect the queue history.

7. Check entry/exit invariants. If the state is right but a side effect is wrong (LED off in CONNECTED state), the entry action may have failed silently — perhaps a peripheral wasn't yet initialized. Re-order init or add an entry-action error check.
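Steps 2 through 5 can be combined in one traced dispatcher. The following is a minimal sketch, not a reference implementation: the states, events, table layout, and the names fsm_post, fsm_run, fsm_state, guard_pause_allowed, and action_request_pause are all hypothetical, and printf stands in for whatever logging channel the target actually has. It logs every (state, event) pair, reports a blocked guard or a missing table entry, and has actions enqueue follow-up events rather than call fsm_handle_event recursively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical states and events, for illustration only. */
typedef enum { S_IDLE, S_RUNNING, S_PAUSED, S_COUNT } state_t;
typedef enum { E_START, E_PAUSE, E_RESUME, E_COUNT } event_t;

static const char *const state_names[S_COUNT] = { "S_IDLE", "S_RUNNING", "S_PAUSED" };
static const char *const event_names[E_COUNT] = { "E_START", "E_PAUSE", "E_RESUME" };

static const char *state_name(state_t s) { assert(s < S_COUNT); return state_names[s]; }
static const char *event_name(event_t e) { assert(e < E_COUNT); return event_names[e]; }

typedef struct {
    state_t from;
    event_t event;
    bool  (*guard)(void);   /* NULL: transition is always allowed */
    void  (*action)(void);  /* NULL: no action */
    state_t to;
} transition_t;

bool fsm_post(event_t e);

/* A guard that depends on global state: a common culprit (step 4). */
bool pause_allowed = false;
static bool guard_pause_allowed(void) { return pause_allowed; }

/* The action enqueues its follow-up event instead of dispatching it (step 5). */
static void action_request_pause(void) { (void)fsm_post(E_PAUSE); }

static const transition_t table[] = {
    { S_IDLE,    E_START,  NULL,                action_request_pause, S_RUNNING },
    { S_RUNNING, E_PAUSE,  guard_pause_allowed, NULL,                 S_PAUSED  },
    { S_PAUSED,  E_RESUME, NULL,                NULL,                 S_RUNNING },
};

static state_t current = S_IDLE;
state_t fsm_state(void) { return current; }

/* Small FIFO event queue. */
#define QUEUE_LEN 8u
static event_t queue[QUEUE_LEN];
static unsigned q_head, q_count;

bool fsm_post(event_t e) {
    printf("post %s\n", event_name(e));             /* step 2: log every event */
    if (q_count == QUEUE_LEN) { printf("  queue full, event dropped\n"); return false; }
    queue[(q_head + q_count) % QUEUE_LEN] = e;
    q_count++;
    return true;
}

void fsm_handle_event(event_t e) {
    printf("dispatch (%s, %s)\n", state_name(current), event_name(e)); /* step 3 */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        const transition_t *t = &table[i];
        if (t->from != current || t->event != e) continue;
        if (t->guard != NULL && !t->guard()) {
            printf("  considered, guard returned false\n");            /* step 4 */
            return;
        }
        if (t->action != NULL) t->action();
        current = t->to;
        printf("  -> %s\n", state_name(current));
        return;
    }
    printf("  no transition defined for this pair\n");
}

/* Drain the queue one event at a time; each event runs to completion. */
void fsm_run(void) {
    while (q_count > 0) {
        event_t e = queue[q_head];
        q_head = (q_head + 1u) % QUEUE_LEN;
        q_count--;
        fsm_handle_event(e);
    }
}
```

Posting E_START and calling fsm_run() dispatches (S_IDLE, E_START), then (S_RUNNING, E_PAUSE), and logs that the guard returned false. Had action_request_pause called fsm_handle_event directly, E_PAUSE would have been evaluated while the state variable still held S_IDLE, and the trace would have reported a missing transition instead.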

A useful tool is a state-history ring buffer: a circular buffer in RAM that records every (state_before, event, state_after) tuple. When the bug fires, dump the buffer and read the most recent transitions (the last 100, say) in chronological order. The path to the wrong state is right there. This costs a few hundred bytes of RAM and pays for itself the first time you debug a sporadic FSM bug.
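A state-history ring buffer of this kind might look like the sketch below. The names hist_record, hist_count, hist_get, and hist_dump are hypothetical, and the sketch assumes states and events each fit in one byte, so 100 entries occupy about 300 bytes. It also assumes a single writer; if both an ISR and a task can record transitions, hist_record would need a critical section.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HIST_LEN 100u   /* number of transitions retained */

typedef struct {
    uint8_t state_before;
    uint8_t event;
    uint8_t state_after;
} hist_entry_t;

static hist_entry_t hist[HIST_LEN];
static uint16_t hist_head;   /* next slot to write */
static uint16_t hist_used;   /* valid entries, saturates at HIST_LEN */

/* Call once per dispatch, after the new state is known. */
void hist_record(uint8_t before, uint8_t event, uint8_t after) {
    hist[hist_head] = (hist_entry_t){ before, event, after };
    hist_head = (uint16_t)((hist_head + 1u) % HIST_LEN);
    if (hist_used < HIST_LEN) hist_used++;
}

uint16_t hist_count(void) { return hist_used; }

/* Entry i in chronological order: i == 0 is the oldest retained entry. */
hist_entry_t hist_get(uint16_t i) {
    assert(i < hist_used);
    /* Once the buffer is full, the oldest entry sits at hist_head;
       before that, it sits in slot 0. */
    uint16_t oldest = (hist_used == HIST_LEN) ? hist_head : 0u;
    return hist[(oldest + i) % HIST_LEN];
}

/* Print the retained history, oldest first. */
void hist_dump(void) {
    for (uint16_t i = 0; i < hist_used; i++) {
        hist_entry_t e = hist_get(i);
        printf("%3u: state %u, event %u -> state %u\n",
               (unsigned)i, (unsigned)e.state_before,
               (unsigned)e.event, (unsigned)e.state_after);
    }
}
```

Storing raw numeric codes keeps each record small; the names can be looked up on the host when the dump is read. On a target without printf, the same array can be inspected from a debugger or copied out over whatever channel is available.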

For event-driven FSMs in active-object systems, the same principle extends: each active object's queue and dispatch can be traced. Tools like SEGGER SystemView visualize cross-task event flow.

Source: State Machines Q&A