Design a Multi-MCU System

The Prompt

"Design a distributed sensor system for a manufacturing line. A main controller MCU coordinates 3 sensor MCUs, each monitoring different parameters (vibration, temperature, pressure). The main MCU aggregates data, runs analysis, and reports to a supervisory system via Ethernet."

Requirements Clarification

Category	Requirement	Detail
Functional	Sensor types	Vibration (accelerometer), temperature (thermocouple), pressure (piezo)
Functional	Aggregation	Main MCU collects all sensor data each cycle
Functional	Reporting	Main MCU sends aggregated data to supervisory system via Ethernet
Functional	Firmware update	Firmware updates pushed from main MCU to sensor MCUs
Non-functional	Synchronization	All sensor MCUs sample within 100 us of each other
Non-functional	Fault tolerance	System continues operating if one sensor MCU fails
Non-functional	Replaceability	Sensor modules field-replaceable without powering down main MCU
Non-functional	Latency	Under 10 ms from sample acquisition to supervisory report

Architecture Overview

┌───────────────────┐
│ Sensor MCU 1      │
│ (Vibration)       │──────┐
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │  CAN Bus
┌───────────────────┐      │  (1 Mbit/s)
│ Sensor MCU 2      │      │
│ (Temperature)     │──────┤
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │
┌───────────────────┐      │
│ Sensor MCU 3      │      │
│ (Pressure)        │──────┤
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │
                    ┌──────┴──────────┐
                    │    Main MCU     │
                    │   (Cortex-M7)   │
                    │                 │
                    │  Aggregation    │
                    │  Analysis       │
                    │  Ethernet       │
                    └────────┬────────┘
                             │
                      ┌──────┴──────┐
                      │ Supervisory │
                      │   System    │
                      └─────────────┘

Component Deep Dive

Inter-MCU Communication Protocol Selection

Protocol	Speed	Topology	Addressing	Cable Length	Fault Tolerance	Complexity
SPI	Up to 50 MHz	Master-slave	CS lines (1 per slave)	Under 30 cm	Poor (shared bus)	Low
I2C	Up to 3.4 Mbit/s (High-speed mode)	Multi-master possible	7/10-bit address	Under 1 m	Moderate	Low
CAN	Up to 1 Mbit/s	Multi-master	11/29-bit message ID	Up to 40 m at 1 Mbit/s	Excellent (differential, error frames)	Medium
UART	Up to 3 Mbit/s	Point-to-point	None (1:1 only)	Under 15 m (RS-232 levels)	Poor	Low

Decision: CAN bus at 1 Mbit/s.

Rationale:

Multi-drop: All 3 sensor MCUs plus the main MCU share one bus. Adding a 4th sensor MCU means just connecting it — no extra CS lines or UART ports.
Fault tolerant: Differential signaling rejects noise. Built-in error detection (CRC, bit stuffing, ACK). A faulty node auto-isolates via bus-off state.
Message-based with priority: CAN arbitration gives lower message IDs higher priority, so emergency messages win arbitration against all routine traffic.
Distance: Manufacturing lines can span meters. CAN supports a bus of roughly 40 m at 1 Mbit/s, and longer runs at lower bit rates.

Message Protocol Design

Message Types

Message Type	Direction	Frequency	CAN ID Range	Payload
Emergency alert	Sensor to Main	Async	0x010 - 0x01F	Error code (2 B) + value (4 B)
Command	Main to Sensor	On-demand	0x100 - 0x1FF	Command ID (1 B) + params (7 B)
Sensor data	Sensor to Main	Every 1 ms	0x200 - 0x2FF	Sensor readings (6-8 B)
Heartbeat	Sensor to Main	Every 100 ms	0x700 - 0x7FF	Status byte (1 B) + uptime (4 B)

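Lower CAN ID = higher priority. Emergency messages (0x01x) win arbitration against commands (0x1xx), sensor data (0x2xx), and heartbeats (0x7xx). Arbitration does not interrupt a frame already on the bus, so an emergency message waits at most one frame time (roughly 130 us at 1 Mbit/s) before it can claim the bus.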

CAN Frame Layout (Standard 11-bit ID)

┌─────────┬────────┬──────────────────────────────┐
│ CAN ID  │  DLC   │         Payload (0-8 B)      │
│ (11 bit)│ (4 bit)│                               │
└─────────┴────────┴──────────────────────────────┘

Sensor data example (Vibration, MCU 1):
  ID:  0x201
  DLC: 6
  Payload: [Ax_H][Ax_L][Ay_H][Ay_L][Az_H][Az_L]
           (3 axes x 16-bit signed, big-endian)
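
The payload layout above can be sketched in C as a pair of pack and unpack helpers. This is a minimal illustration, not a definitive implementation: the `vib_sample_t` type and the function names are hypothetical, and only the byte order (big-endian, 3 axes of signed 16-bit values, DLC 6) comes from the layout shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIB_PAYLOAD_LEN 6u  /* 3 axes x 2 bytes, matching DLC 6 */

/* Hypothetical sample type; field names are illustrative. */
typedef struct {
    int16_t ax, ay, az;
} vib_sample_t;

/* Write one signed 16-bit value, high byte first (big-endian). */
static void put_be16(uint8_t *dst, int16_t v)
{
    uint16_t u = (uint16_t)v;
    dst[0] = (uint8_t)(u >> 8);
    dst[1] = (uint8_t)(u & 0xFFu);
}

/* Read one signed 16-bit big-endian value. */
static int16_t get_be16(const uint8_t *src)
{
    uint16_t u = (uint16_t)(((uint16_t)src[0] << 8) | src[1]);
    /* Convert to signed without relying on an implementation-defined cast. */
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Sensor side: fill the 6-byte CAN payload from one sample. */
void vib_pack(const vib_sample_t *s, uint8_t payload[VIB_PAYLOAD_LEN])
{
    assert(s != NULL && payload != NULL);
    put_be16(&payload[0], s->ax);
    put_be16(&payload[2], s->ay);
    put_be16(&payload[4], s->az);
}

/* Main MCU side: recover the sample from a received payload. */
void vib_unpack(const uint8_t payload[VIB_PAYLOAD_LEN], vib_sample_t *s)
{
    assert(s != NULL && payload != NULL);
    s->ax = get_be16(&payload[0]);
    s->ay = get_be16(&payload[2]);
    s->az = get_be16(&payload[4]);
}
```

Packing byte by byte rather than copying the struct keeps the wire format independent of the endianness and struct padding of either MCU.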

ID Assignment Per Sensor MCU

Sensor MCU	Data ID	Heartbeat ID	Emergency ID
1 (Vibration)	0x201	0x701	0x011
2 (Temperature)	0x202	0x702	0x012
3 (Pressure)	0x203	0x703	0x013

Adding a 4th sensor MCU: assign IDs 0x204, 0x704, 0x014. No bus reconfiguration needed.
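
The ID scheme above amounts to "range base plus node number". A minimal C sketch, assuming nodes are numbered from 1 and using hypothetical names (`can_id_make`, `can_id_type`, `can_id_node`), could look like this. The base values and ranges are taken from the message-type table; the 15-node cap follows from the 16-ID emergency range.

```c
#include <assert.h>
#include <stdint.h>

/* Range bases from the message-type table; the node number is the offset. */
#define CAN_ID_EMERGENCY_BASE 0x010u  /* 0x010 - 0x01F */
#define CAN_ID_COMMAND_BASE   0x100u  /* 0x100 - 0x1FF */
#define CAN_ID_DATA_BASE      0x200u  /* 0x200 - 0x2FF */
#define CAN_ID_HEARTBEAT_BASE 0x700u  /* 0x700 - 0x7FF */

/* The emergency range holds only 16 IDs, so it limits the node count. */
#define CAN_NODE_MAX 15u

typedef enum {
    MSG_EMERGENCY, MSG_COMMAND, MSG_DATA, MSG_HEARTBEAT, MSG_UNKNOWN
} msg_type_t;

/* Classify a received 11-bit ID by the range it falls in. */
msg_type_t can_id_type(uint16_t id)
{
    if (id >= 0x010u && id <= 0x01Fu) return MSG_EMERGENCY;
    if (id >= 0x100u && id <= 0x1FFu) return MSG_COMMAND;
    if (id >= 0x200u && id <= 0x2FFu) return MSG_DATA;
    if (id >= 0x700u && id <= 0x7FFu) return MSG_HEARTBEAT;
    return MSG_UNKNOWN;
}

/* Build the ID for a given message type and node number (1..15). */
uint16_t can_id_make(msg_type_t type, uint8_t node)
{
    assert(node >= 1u && node <= CAN_NODE_MAX);
    switch (type) {
    case MSG_EMERGENCY: return (uint16_t)(CAN_ID_EMERGENCY_BASE + node);
    case MSG_COMMAND:   return (uint16_t)(CAN_ID_COMMAND_BASE + node);
    case MSG_DATA:      return (uint16_t)(CAN_ID_DATA_BASE + node);
    case MSG_HEARTBEAT: return (uint16_t)(CAN_ID_HEARTBEAT_BASE + node);
    default:            return 0u;  /* no valid ID for MSG_UNKNOWN */
    }
}

/* Extract the node number from an ID; returns 0 for unknown ranges. */
uint8_t can_id_node(uint16_t id)
{
    switch (can_id_type(id)) {
    case MSG_EMERGENCY: return (uint8_t)(id - CAN_ID_EMERGENCY_BASE);
    case MSG_COMMAND:   return (uint8_t)(id - CAN_ID_COMMAND_BASE);
    case MSG_DATA:      return (uint8_t)(id - CAN_ID_DATA_BASE);
    case MSG_HEARTBEAT: return (uint8_t)(id - CAN_ID_HEARTBEAT_BASE);
    default:            return 0u;
    }
}
```

With this layout, a fourth node needs no new constants: `can_id_make(MSG_DATA, 4)` yields 0x204, and the main MCU can route any frame with `can_id_type` and `can_id_node` alone.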

Synchronization

All sensor MCUs must sample within 100 us of each other. Two approaches:

Option A: Hardware Sync Pulse (chosen)

Main MCU GPIO ──────────┬──────────┬──────────┐
                        │          │          │
              Sensor 1 IRQ  Sensor 2 IRQ  Sensor 3 IRQ
              (sample)      (sample)      (sample)

Timing:
  Main MCU asserts sync pin ──▶ propagation delay (~10 ns)
  ──▶ IRQ latency on Cortex-M0+ (~15 cycles = ~0.31 us at 48 MHz)
  ──▶ ADC start (~1 us)
  Total jitter: under 5 us (well within 100 us budget)

The main MCU pulses a dedicated GPIO line high at the start of each 1 ms measurement cycle (1 kHz). Each sensor MCU has this line connected to an external interrupt pin. On the rising edge, the sensor MCU's interrupt handler immediately triggers its ADC.

Worst-case synchronization error: interrupt latency variation across MCUs is under 5 us, far within the 100 us requirement.
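
The timing chain for the sync pulse is simple arithmetic, and a small helper makes it easy to recheck for a different clock or core. The sketch below uses hypothetical function names; the cycle count is passed in as a parameter because the real entry latency depends on the core, flash wait states, and any higher-priority interrupts.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CPU cycle count to nanoseconds, rounding down. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000000ull) / cpu_hz);
}

/* Delay from the sync edge to ADC start on one sensor MCU, in ns:
 * wire propagation + interrupt entry latency + ADC start time. */
uint32_t sync_to_adc_ns(uint32_t prop_ns, uint32_t irq_cycles,
                        uint32_t cpu_hz, uint32_t adc_start_ns)
{
    return prop_ns + cycles_to_ns(irq_cycles, cpu_hz) + adc_start_ns;
}
```

At 48 MHz with a 10 ns propagation delay and a 1 us ADC start, an entry latency of 12 to 16 cycles gives roughly 1.3 us in total. The skew between sensor MCUs is the difference between their individual delays, so the full delay is a safe upper bound on it.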

Option B: CAN Sync Message

A broadcast CAN message serves as the sync trigger. This adds 50-130 us of CAN frame transmission time, a variable arbitration delay if the bus is busy, and receive ISR latency. That is marginal against the 100 us requirement, so hardware sync is preferred.
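
The frame-time range quoted for a CAN sync message can be checked with the commonly used worst-case bit-count formula for standard (11-bit ID) data frames. The function names below are hypothetical. The formula counts 44 fixed frame bits, 3 interframe-space bits, 8 bits per data byte, and the worst-case number of stuff bits over the 34 + 8n bits that are subject to stuffing.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case length in bits of a standard-ID data frame with n data bytes,
 * including the 3-bit interframe space and worst-case bit stuffing. */
uint32_t can_std_frame_bits_max(uint32_t n)
{
    assert(n <= 8u);
    uint32_t unstuffed = 47u + 8u * n;          /* 44 frame bits + 3 IFS + data */
    uint32_t stuff = (34u + 8u * n - 1u) / 4u;  /* SOF through CRC is stuffed */
    return unstuffed + stuff;
}

/* Time on the wire in microseconds for a given bit count, rounded up. */
uint32_t can_frame_time_us(uint32_t bits, uint32_t bitrate_bps)
{
    assert(bitrate_bps > 0u);
    return (uint32_t)(((uint64_t)bits * 1000000ull + bitrate_bps - 1u)
                      / bitrate_bps);
}
```

At 1 Mbit/s this gives 55 us for an empty frame and 135 us for a full 8-byte frame in the worst case, which is in line with the 50-130 us estimate above. Typical frames carry fewer stuff bits and are somewhat shorter.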

Fault Isolation and Recovery

Heartbeat Monitoring

Each sensor MCU sends a heartbeat CAN message every 100 ms containing its status byte and uptime counter.

Main MCU heartbeat monitor (per sensor MCU):

  ┌─────────────┐     heartbeat       ┌──────────┐
  │   HEALTHY   │◀────received────────│ Waiting  │
  │             │────────────────────▶│ (100 ms  │
  │  Reset miss │    timeout          │  timer)  │
  │  counter    │                     └──────────┘
  └──────┬──────┘
         │ 3 consecutive misses (300 ms)
         ▼
  ┌─────────────┐
  │   FAILED    │
  │             │──▶ Log fault
  │             │──▶ Alert supervisory system
  │             │──▶ Enter degraded mode
  └──────┬──────┘
         │ heartbeat resumes
         ▼
  ┌─────────────┐
  │ RECOVERING  │──▶ Re-sync, validate data
  │             │──▶ Return to HEALTHY after 10 good heartbeats
  └─────────────┘

3 consecutive misses = 300 ms. This avoids false positives from a single delayed message while detecting real failures within half a second.
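
The monitor above maps naturally onto a small state machine per sensor MCU. The following is a minimal sketch with hypothetical names. Two behaviors are assumptions not stated in the diagram: the heartbeat that moves a node from FAILED to RECOVERING counts as the first of the 10 good heartbeats, and 3 consecutive misses while RECOVERING send the node back to FAILED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HB_MISS_LIMIT     3u   /* 3 consecutive misses = 300 ms */
#define HB_RECOVERY_GOOD 10u   /* good heartbeats needed to leave RECOVERING */

typedef enum { HB_HEALTHY, HB_FAILED, HB_RECOVERING } hb_state_t;

typedef struct {
    hb_state_t state;
    uint8_t misses;  /* consecutive 100 ms windows without a heartbeat */
    uint8_t good;    /* consecutive heartbeats seen while recovering */
} hb_monitor_t;

void hb_init(hb_monitor_t *m)
{
    assert(m != NULL);
    m->state = HB_HEALTHY;
    m->misses = 0u;
    m->good = 0u;
}

/* Call when a heartbeat frame from this node arrives. */
void hb_on_heartbeat(hb_monitor_t *m)
{
    assert(m != NULL);
    m->misses = 0u;
    switch (m->state) {
    case HB_HEALTHY:
        break;
    case HB_FAILED:
        m->state = HB_RECOVERING;  /* assumption: this beat counts as #1 */
        m->good = 1u;
        break;
    case HB_RECOVERING:
        if (++m->good >= HB_RECOVERY_GOOD) {
            m->state = HB_HEALTHY;
            m->good = 0u;
        }
        break;
    }
}

/* Call when the 100 ms timer expires with no heartbeat received. */
void hb_on_timeout(hb_monitor_t *m)
{
    assert(m != NULL);
    m->good = 0u;  /* a miss restarts the recovery count */
    if (m->state == HB_FAILED) {
        return;
    }
    if (++m->misses >= HB_MISS_LIMIT) {
        m->state = HB_FAILED;  /* caller logs, alerts, enters degraded mode */
        m->misses = 0u;
    }
}
```

The main MCU would keep one `hb_monitor_t` per sensor node and act on state changes, for example by comparing the state before and after each call.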

Degraded Mode

When a sensor MCU is declared failed:

Main MCU continues collecting data from the remaining 2 sensors.
The supervisory report includes a flag indicating which sensor is offline.
Analysis algorithms adapt (e.g., skip vibration correlation if vibration MCU is down).
No cascade: the CAN bus remains operational for healthy nodes.

Per-MCU Watchdog

Each sensor MCU runs an independent watchdog timer (IWDG, 500 ms timeout). If firmware hangs, the watchdog resets the MCU. The bootloader re-initializes, and the MCU rejoins the CAN bus automatically.

Hot-Swap Support

The CAN protocol tolerates nodes joining and leaving a live bus, which makes hot-swap straightforward at the protocol level:

New sensor MCU powers up, initializes its CAN peripheral, and starts sending heartbeats.
Main MCU detects the new heartbeat ID and begins accepting data from it.
No bus interruption for the other nodes.

Firmware Update Strategy

Updating sensor MCUs in the field without taking the entire system offline.

Update Sequence

1. Main MCU receives new firmware image via Ethernet
2. Main MCU selects ONE sensor MCU for update
3. Main MCU sends UPDATE_START command (CAN ID 0x100 + node)
4. Sensor MCU enters bootloader mode, stops sending sensor data
5. Main MCU streams firmware in 8-byte CAN frames
   (64 KB image / 8 bytes per frame = 8,192 frames)
   At 1 Mbit/s, ~130 us per frame on the wire = ~1.1 seconds minimum;
   budgeting ~1 ms per frame for handshaking and flash writes = ~8.5 seconds total
6. Sensor MCU verifies CRC-32 of entire image
7. If CRC passes: write to flash, reboot into new firmware
8. If CRC fails: remain in bootloader, request retransmission
9. Main MCU confirms sensor MCU is back online (heartbeat resumes)
10. Repeat for next sensor MCU

Key rule: update one MCU at a time. The other two continue operating normally. If the update fails, the sensor MCU stays in the bootloader and the update can be retried, so a bad transfer never bricks the node.
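
Two pieces of the update sequence lend themselves to a short sketch: the frame count and the image check. The code below is illustrative only. The source does not say which CRC-32 variant is used, so the common reflected polynomial 0xEDB88320 (the one used by Ethernet and zlib) is an assumption, as are the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FW_CHUNK_BYTES 8u  /* one classic CAN frame payload */

/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant).
 * Pass 0 as the initial crc, or a previous result to continue a stream,
 * so the bootloader can update the CRC as each chunk arrives. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0u);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            /* Shift right; XOR the polynomial in when the low bit was set. */
            uint32_t mask = (uint32_t)0 - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

/* Number of CAN frames needed for an image, rounding up a partial chunk. */
uint32_t fw_frame_count(uint32_t image_bytes)
{
    return (image_bytes + FW_CHUNK_BYTES - 1u) / FW_CHUNK_BYTES;
}
```

Because `crc32_update` can be chained, the sensor MCU does not need the whole image in RAM to verify it. A practical transfer would likely also reserve payload bytes for a sequence number, which the frame count above does not account for.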

Rollback

Each sensor MCU has a dual-bank flash layout:

┌────────────────┐
│  Bootloader    │  (16 KB, never overwritten)
├────────────────┤
│  Bank A (app)  │  (64 KB, active firmware)
├────────────────┤
│  Bank B (app)  │  (64 KB, previous firmware)
└────────────────┘

If the new firmware crashes within 10 seconds of boot (watchdog reset detected by bootloader), the bootloader swaps back to Bank B automatically.
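
The rollback rule can be expressed as a small decision function in the bootloader plus a confirmation step in the application. This is a sketch under stated assumptions: the `boot_record_t` layout, the idea of a probation flag stored in persistent memory, and all names are hypothetical. Only the 10-second window and the watchdog-triggered swap come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FW_PROBATION_MS 10000u  /* 10 s window after booting new firmware */

typedef enum { BANK_A = 0, BANK_B = 1 } bank_t;

/* Hypothetical record kept in persistent storage (flash or backup RAM). */
typedef struct {
    bank_t active;        /* bank the bootloader should start */
    bool   on_probation;  /* set after an update, cleared once stable */
} boot_record_t;

/* Bootloader: pick the bank to boot. A watchdog reset while the new image
 * is still on probation swaps back to the other bank. */
bank_t boot_select_bank(boot_record_t *rec, bool watchdog_reset)
{
    assert(rec != NULL);
    if (rec->on_probation && watchdog_reset) {
        rec->active = (rec->active == BANK_A) ? BANK_B : BANK_A;
        rec->on_probation = false;
    }
    return rec->active;
}

/* Application: call periodically. Confirms the image once it has run
 * for the full probation window without a reset. */
void app_confirm_if_stable(boot_record_t *rec, uint32_t uptime_ms)
{
    assert(rec != NULL);
    if (rec->on_probation && uptime_ms >= FW_PROBATION_MS) {
        rec->on_probation = false;
    }
}
```

Clearing the flag from the application, rather than from the bootloader, means a watchdog reset after the window closes is treated as an ordinary fault and does not trigger a rollback.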

Data Flow

Per-Cycle Timing Budget (1 kHz cycle = 1 ms)

Time (us)   Event
─────────   ──────────────────────────────
0           Main MCU asserts sync pulse
5           Sensor MCUs trigger ADC (within 5 us)
50          ADC conversions complete (~45 us)
100         Sensor MCUs transmit CAN frames
            Frame 1: ~130 us (vibration)
            Frame 2: ~130 us (temperature)
            Frame 3: ~130 us (pressure)
490         All 3 frames received by main MCU
500-800     Main MCU runs analysis algorithm
900         Main MCU sends Ethernet report (UDP)
1000        Next cycle begins
─────────   ──────────────────────────────
Total:      under 1 ms end-to-end

The 10 ms latency requirement is easily met — the actual pipeline completes in under 1 ms.

Buffering on Main MCU

The main MCU maintains a per-sensor ring buffer of 100 samples (100 ms of data). This handles:

CAN bus jitter (a frame arriving a few hundred microseconds late)
Brief retransmission delays from CAN error recovery
Analysis algorithms that need a sliding window of recent data

Per-sensor ring buffer (100 entries):
┌────┬────┬────┬────┬────┬─ ─ ─┬────┐
│ S0 │ S1 │ S2 │... │S98 │     │S99 │
└────┴────┴────┴────┴────┴─ ─ ─┴────┘
  ▲                                ▲
  │                                │
 tail                            head
(oldest)                       (newest)
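
A fixed-size ring buffer like the one pictured can be sketched as follows. The `sample_t` fields and function names are hypothetical, and the overwrite-oldest policy when the buffer is full is an assumption that suits a sliding analysis window. The code is not interrupt-safe as written; a real system would need to guard pushes from the CAN receive ISR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 100u  /* 100 samples = 100 ms at 1 kHz */

/* Hypothetical stored sample: raw CAN payload plus a receive timestamp. */
typedef struct {
    uint32_t timestamp_us;
    uint8_t  payload[8];
    uint8_t  len;
} sample_t;

typedef struct {
    sample_t slots[RING_CAPACITY];
    uint32_t head;   /* index where the next sample will be written */
    uint32_t count;  /* number of valid samples, up to RING_CAPACITY */
} ring_t;

void ring_init(ring_t *r)
{
    assert(r != NULL);
    r->head = 0u;
    r->count = 0u;
}

/* Store the newest sample, overwriting the oldest once the buffer is full. */
void ring_push(ring_t *r, const sample_t *s)
{
    assert(r != NULL && s != NULL);
    r->slots[r->head] = *s;
    r->head = (r->head + 1u) % RING_CAPACITY;
    if (r->count < RING_CAPACITY) {
        r->count++;
    }
}

/* Read a sample by age: 0 is the newest, count - 1 the oldest.
 * Returns false if fewer than age + 1 samples are stored. */
bool ring_peek(const ring_t *r, uint32_t age, sample_t *out)
{
    assert(r != NULL && out != NULL);
    if (age >= r->count) {
        return false;
    }
    uint32_t idx = (r->head + RING_CAPACITY - 1u - age) % RING_CAPACITY;
    *out = r->slots[idx];
    return true;
}
```

Indexing by age lets an analysis routine walk backwards through the last 100 ms without caring where the head currently sits.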

Key Design Decisions

Decision	Options Considered	Choice	Rationale
Inter-MCU bus	SPI, I2C, CAN, UART	CAN (1 Mbit/s)	Multi-drop, fault-tolerant, prioritized, long cable support
Sync method	CAN message, GPIO pulse, free-running	GPIO sync pulse	Under 5 us jitter vs. 100 us requirement
Failure detection	Heartbeat, polling, watchdog-only	Heartbeat + watchdog	Heartbeat for remote detection, watchdog for self-recovery
Failure threshold	1 miss, 3 misses, 5 misses	3 consecutive misses (300 ms)	Balances false-positive avoidance vs. detection speed
FW update transport	CAN, UART, SWD	CAN (bootloader)	Reuses existing bus, no extra wiring
FW rollback	Single bank, dual bank	Dual bank with auto-rollback	Prevents bricking on bad update
Sensor MCU class	Cortex-M0+, M4, M7	Cortex-M0+	Sufficient for sensor acquisition, low cost (~$1), low power

What Interviewers Evaluate

Protocol selection with justification — not just "use CAN" but explaining why CAN beats SPI, I2C, and UART for this specific scenario.
Fault tolerance thinking — what happens when sensor MCU 2 dies? Does the whole system halt, or does it degrade gracefully? How do you detect it? How do you recover?
Synchronization with concrete timing — showing the math: 5 us jitter is well within 100 us, CAN message sync would be marginal at 130 us frame time.
Scalability — can you add a 4th sensor MCU? With CAN, yes: assign new message IDs, connect to the bus, no hardware changes to existing nodes.
Update safety — updating one MCU at a time, CRC verification, dual-bank rollback. These details separate senior candidates from junior ones.