Design a Multi-MCU System

System design walkthrough: main controller with 3 peripheral MCUs for distributed sensor acquisition, inter-MCU communication, synchronization, and fault tolerance.

The Prompt

"Design a distributed sensor system for a manufacturing line. A main controller MCU coordinates 3 sensor MCUs, each monitoring different parameters (vibration, temperature, pressure). The main MCU aggregates data, runs analysis, and reports to a supervisory system via Ethernet."


Requirements Clarification

| Category | Requirement | Detail |
|---|---|---|
| Functional | Sensor types | Vibration (accelerometer), temperature (thermocouple), pressure (piezo) |
| Functional | Aggregation | Main MCU collects all sensor data each cycle |
| Functional | Reporting | Main MCU sends aggregated data to supervisory system via Ethernet |
| Functional | Update | Firmware updates pushed from main MCU to sensor MCUs |
| Non-functional | Synchronization | All sensor MCUs sample within 100 us of each other |
| Non-functional | Fault tolerance | System continues operating if one sensor MCU fails |
| Non-functional | Replaceability | Sensor modules field-replaceable without powering down main MCU |
| Non-functional | Latency | Under 10 ms from sample acquisition to supervisory report |

Architecture Overview

┌───────────────────┐
│   Sensor MCU 1    │
│   (Vibration)     │──────┐
│   Cortex-M0+      │      │
└───────────────────┘      │
                           │  CAN Bus
┌───────────────────┐      │  (1 Mbit/s)
│   Sensor MCU 2    │      │
│   (Temperature)   │──────┤
│   Cortex-M0+      │      │
└───────────────────┘      │
┌───────────────────┐      │
│   Sensor MCU 3    │      │
│   (Pressure)      │──────┤
│   Cortex-M0+      │      │
└───────────────────┘      │
                    ┌──────┴──────────┐
                    │    Main MCU     │
                    │   (Cortex-M7)   │
                    │                 │
                    │   Aggregation   │
                    │   Analysis      │
                    │   Ethernet      │
                    └────────┬────────┘
                      ┌──────┴──────┐
                      │ Supervisory │
                      │   System    │
                      └─────────────┘

Component Deep Dive

Inter-MCU Communication Protocol Selection

| Protocol | Speed | Topology | Addressing | Cable Length | Fault Tolerance | Complexity |
|---|---|---|---|---|---|---|
| SPI | Up to 50 MHz | Master-slave | CS lines (1 per slave) | Under 30 cm | Poor (shared bus) | Low |
| I2C | Up to 3.4 MHz | Multi-master possible | 7/10-bit address | Under 1 m | Moderate | Low |
| CAN | Up to 1 Mbit/s | Multi-master | 11/29-bit message ID | Up to 40 m at 1 Mbit/s | Excellent (differential, error frames) | Medium |
| UART | Up to 3 Mbit/s | Point-to-point | None (1:1 only) | Under 15 m (RS-232) | Poor | Low |

Decision: CAN bus at 1 Mbit/s.

Rationale:

  • Multi-drop: All 3 sensor MCUs plus the main MCU share one bus. Adding a 4th sensor MCU means just connecting it — no extra CS lines or UART ports.
  • Fault tolerant: Differential signaling rejects noise. Built-in error detection (CRC, bit stuffing, ACK). A faulty node auto-isolates via bus-off state.
  • Message-based with priority: CAN arbitration gives lower message IDs higher priority. Emergency messages always win.
  • Distance: Manufacturing lines can span meters. CAN supports bus lengths up to about 40 m at 1 Mbit/s, and longer at lower bit rates.

Message Protocol Design

Message Types

| Message Type | Direction | Frequency | CAN ID Range | Payload |
|---|---|---|---|---|
| Emergency alert | Sensor to Main | Async | 0x010 - 0x01F | Error code (2 B) + value (4 B) |
| Command | Main to Sensor | On-demand | 0x100 - 0x1FF | Command ID (1 B) + params (7 B) |
| Sensor data | Sensor to Main | Every 1 ms | 0x200 - 0x2FF | Sensor readings (6-8 B) |
| Heartbeat | Sensor to Main | Every 100 ms | 0x700 - 0x7FF | Status byte (1 B) + uptime (4 B) |

Lower CAN ID = higher priority. Emergency messages (0x01x) always preempt sensor data (0x2xx) and heartbeats (0x7xx).

CAN Frame Layout (Standard 11-bit ID)

┌──────────┬─────────┬──────────────────────────────┐
│ CAN ID   │ DLC     │ Payload (0-8 B)              │
│ (11 bit) │ (4 bit) │                              │
└──────────┴─────────┴──────────────────────────────┘

Sensor data example (Vibration, MCU 1):
  ID:      0x201
  DLC:     6
  Payload: [Ax_H][Ax_L][Ay_H][Ay_L][Az_H][Az_L]
           (3 axes x 16-bit signed, big-endian)
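
The vibration payload layout above can be sketched in C as a pair of pack/unpack helpers. The `vib_sample_t` type and the function names are hypothetical, chosen only for illustration; the byte order (high byte first) follows the layout shown.

```c
#include <stdint.h>

/* Hypothetical container for one 3-axis sample (not part of the original design). */
typedef struct {
    int16_t ax, ay, az;
} vib_sample_t;

/* Pack three signed 16-bit axes into a 6-byte big-endian CAN payload. */
void vib_pack(const vib_sample_t *s, uint8_t out[6])
{
    const int16_t axes[3] = { s->ax, s->ay, s->az };
    for (int i = 0; i < 3; i++) {
        uint16_t u = (uint16_t)axes[i];          /* two's-complement bit pattern */
        out[2 * i]     = (uint8_t)(u >> 8);      /* high byte first */
        out[2 * i + 1] = (uint8_t)(u & 0xFFu);   /* then low byte */
    }
}

/* Inverse of vib_pack: rebuild the three axes from a 6-byte payload. */
void vib_unpack(const uint8_t in[6], vib_sample_t *s)
{
    int16_t *axes[3] = { &s->ax, &s->ay, &s->az };
    for (int i = 0; i < 3; i++) {
        uint16_t u = (uint16_t)(((uint16_t)in[2 * i] << 8) | in[2 * i + 1]);
        *axes[i] = (int16_t)u;                   /* assumes two's-complement targets */
    }
}
```

Packing byte by byte rather than copying the struct keeps the wire format independent of the endianness and struct padding of either MCU.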

ID Assignment Per Sensor MCU

| Sensor MCU | Data ID | Heartbeat ID | Emergency ID |
|---|---|---|---|
| 1 (Vibration) | 0x201 | 0x701 | 0x011 |
| 2 (Temperature) | 0x202 | 0x702 | 0x012 |
| 3 (Pressure) | 0x203 | 0x703 | 0x013 |

Adding a 4th sensor MCU: assign IDs 0x204, 0x704, 0x014. No bus reconfiguration needed.
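
One way to express this ID scheme in code is a base-plus-node-number mapping with a range-based classifier. This is a minimal sketch: the macro, enum, and function names are assumptions made for illustration, and only the ID ranges come from the tables above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Base IDs taken from the message-type table; names are illustrative. */
#define ID_BASE_EMERGENCY 0x010u
#define ID_BASE_COMMAND   0x100u
#define ID_BASE_DATA      0x200u
#define ID_BASE_HEARTBEAT 0x700u

typedef enum {
    MSG_EMERGENCY,
    MSG_COMMAND,
    MSG_SENSOR_DATA,
    MSG_HEARTBEAT,
    MSG_UNKNOWN
} msg_class_t;

/* Per-node IDs: base + node number (node 1 = vibration, 2 = temperature, ...). */
uint16_t can_id_emergency(uint8_t node) { return (uint16_t)(ID_BASE_EMERGENCY + node); }
uint16_t can_id_command(uint8_t node)   { return (uint16_t)(ID_BASE_COMMAND + node); }
uint16_t can_id_data(uint8_t node)      { return (uint16_t)(ID_BASE_DATA + node); }
uint16_t can_id_heartbeat(uint8_t node) { return (uint16_t)(ID_BASE_HEARTBEAT + node); }

/* Classify a received 11-bit ID by the range it falls in. */
msg_class_t can_id_class(uint16_t id)
{
    if (id >= 0x010u && id <= 0x01Fu) return MSG_EMERGENCY;
    if (id >= 0x100u && id <= 0x1FFu) return MSG_COMMAND;
    if (id >= 0x200u && id <= 0x2FFu) return MSG_SENSOR_DATA;
    if (id >= 0x700u && id <= 0x7FFu) return MSG_HEARTBEAT;
    return MSG_UNKNOWN;
}

/* CAN arbitration: the numerically lower ID wins the bus. */
bool can_id_wins(uint16_t a, uint16_t b) { return a < b; }
```

With this mapping, a 4th sensor MCU only needs its node number set to 4; the IDs 0x204, 0x704, and 0x014 follow from the bases.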


Synchronization

All sensor MCUs must sample within 100 us of each other. Two approaches:

Option A: Hardware Sync Pulse (chosen)

Main MCU GPIO ──────┬───────────────┬───────────────┐
                    │               │               │
              Sensor 1 IRQ    Sensor 2 IRQ    Sensor 3 IRQ
                (sample)        (sample)        (sample)

Timing:
  Main MCU asserts sync pin
    ──▶ propagation delay (~10 ns)
    ──▶ IRQ latency on Cortex-M0+ (~15 cycles = ~0.3 us at 48 MHz)
    ──▶ ADC start (~1 us)
  Total jitter: under 5 us (well within 100 us budget)

The main MCU pulses a dedicated GPIO line at the start of each measurement cycle (1 kHz). Each sensor MCU has this line connected to an external interrupt pin. On the rising edge, the sensor MCU immediately triggers its ADC.

Worst-case synchronization error: interrupt latency variation across MCUs is under 5 us, far within the 100 us requirement.
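
The delay chain from sync edge to ADC start can be checked with a few lines of integer arithmetic. The sketch below is illustrative only: the function names are made up, and the interrupt latency is passed in as a cycle count so that the figure for a specific core and flash wait-state configuration can be substituted. Because every sensor MCU sees the same edge, the node-to-node skew cannot exceed the largest single-node delay.

```c
#include <stdint.h>

/* Convert a core cycle count to nanoseconds (rounded down). */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ull) / core_hz);
}

/* Delay from sync edge to ADC start on one sensor MCU:
 * trace propagation + interrupt entry + ADC trigger overhead. */
uint32_t sync_to_sample_ns(uint32_t prop_ns, uint32_t irq_cycles,
                           uint32_t core_hz, uint32_t adc_start_ns)
{
    return prop_ns + cycles_to_ns(irq_cycles, core_hz) + adc_start_ns;
}
```

For example, `sync_to_sample_ns(10, 15, 48000000, 1000)` evaluates to 1322 ns, roughly 1.3 us, which leaves a wide margin against both the 5 us estimate and the 100 us requirement.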

Option B: CAN Sync Message

Send a broadcast CAN message as the sync trigger. Adds 50-130 us of CAN frame transmission time plus receive ISR latency. Marginal for the 100 us requirement, so hardware sync is preferred.
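
The frame-time range quoted above can be reproduced from the classic CAN frame format. The sketch below uses the commonly cited worst-case bit count for a standard (11-bit ID) data frame: 47 fixed bits including the 3-bit interframe space, 8 bits per payload byte, and one stuff bit per 4 bits of the stuffed region. Function names are illustrative.

```c
#include <stdint.h>

/* Worst-case length in bits of a classic CAN data frame with an 11-bit ID,
 * including worst-case bit stuffing and the 3-bit interframe space. */
uint32_t can_std_frame_bits_max(uint32_t dlc)
{
    uint32_t fixed = 47u;                     /* SOF, ID, control, CRC, ACK, EOF, IFS */
    uint32_t data  = 8u * dlc;                /* payload bits */
    uint32_t stuff = (34u + data - 1u) / 4u;  /* only SOF..CRC is subject to stuffing */
    return fixed + data + stuff;
}

/* Transmission time in microseconds at the given bit rate (rounded up). */
uint32_t can_frame_time_us(uint32_t bits, uint32_t bitrate)
{
    return (uint32_t)(((uint64_t)bits * 1000000ull + bitrate - 1u) / bitrate);
}
```

At 1 Mbit/s this gives about 55 us for an empty frame and 135 us for a full 8-byte frame in the worst case, in line with the rounded 50-130 us figure.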


Fault Isolation and Recovery

Heartbeat Monitoring

Each sensor MCU sends a heartbeat CAN message every 100 ms containing its status byte and uptime counter.

Main MCU heartbeat monitor (per sensor MCU):

┌─────────────┐      heartbeat      ┌──────────┐
│   HEALTHY   │◀─────received───────│ Waiting  │
│             │────────────────────▶│ (100 ms  │
│ Reset miss  │       timeout       │  timer)  │
│ counter     │                     └──────────┘
└──────┬──────┘
       │ 3 consecutive misses (300 ms)
┌─────────────┐
│   FAILED    │
│             │──▶ Log fault
│             │──▶ Alert supervisory system
│             │──▶ Enter degraded mode
└──────┬──────┘
       │ heartbeat resumes
┌─────────────┐
│ RECOVERING  │──▶ Re-sync, validate data
│             │──▶ Return to HEALTHY after 10 good heartbeats
└─────────────┘

3 consecutive misses = 300 ms. This avoids false positives from a single delayed message while detecting real failures within half a second.
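
The monitor can be sketched as a small per-node state machine driven by two events: a heartbeat arriving, and the 100 ms timer expiring without one. Type and function names are hypothetical. Two behaviors are assumptions not stated in the diagram: the heartbeat that ends the FAILED state counts as the first of the 10 good heartbeats, and a miss during RECOVERING drops the node back to FAILED.

```c
#include <stdint.h>

#define HB_MISS_LIMIT 3u    /* consecutive misses before FAILED */
#define HB_GOOD_LIMIT 10u   /* good heartbeats before returning to HEALTHY */

typedef enum { HB_HEALTHY, HB_FAILED, HB_RECOVERING } hb_state_t;

typedef struct {
    hb_state_t state;
    uint8_t misses;   /* consecutive missed heartbeats while HEALTHY */
    uint8_t good;     /* heartbeats seen while RECOVERING */
} hb_monitor_t;

void hb_init(hb_monitor_t *m)
{
    m->state = HB_HEALTHY;
    m->misses = 0;
    m->good = 0;
}

/* Call when a heartbeat frame from this node is received. */
void hb_on_heartbeat(hb_monitor_t *m)
{
    switch (m->state) {
    case HB_HEALTHY:
        m->misses = 0;
        break;
    case HB_FAILED:
        m->state = HB_RECOVERING;   /* assumption: this heartbeat counts as #1 */
        m->good = 1;
        break;
    case HB_RECOVERING:
        if (++m->good >= HB_GOOD_LIMIT) {
            m->state = HB_HEALTHY;
            m->misses = 0;
        }
        break;
    }
}

/* Call when the 100 ms window elapses with no heartbeat from this node. */
void hb_on_timeout(hb_monitor_t *m)
{
    switch (m->state) {
    case HB_HEALTHY:
        if (++m->misses >= HB_MISS_LIMIT)
            m->state = HB_FAILED;
        break;
    case HB_RECOVERING:
        m->state = HB_FAILED;       /* assumption: any miss restarts recovery */
        m->good = 0;
        break;
    case HB_FAILED:
        break;
    }
}
```

The main MCU would keep one `hb_monitor_t` per sensor node and trigger the log, alert, and degraded-mode actions on the transition into `HB_FAILED`.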

Degraded Mode

When a sensor MCU is declared failed:

  • Main MCU continues collecting data from the remaining 2 sensors.
  • The supervisory report includes a flag indicating which sensor is offline.
  • Analysis algorithms adapt (e.g., skip vibration correlation if vibration MCU is down).
  • No cascade: the CAN bus remains operational for healthy nodes.
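
The degraded-mode behavior above can be represented with a per-sensor bitmask carried in each supervisory report. This is a sketch under assumed names: the bit assignments, `report_status_t`, and both functions are hypothetical and not part of any defined report format.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per sensor MCU (bit positions are an assumption). */
#define SENSOR_VIBRATION   (1u << 0)
#define SENSOR_TEMPERATURE (1u << 1)
#define SENSOR_PRESSURE    (1u << 2)
#define SENSOR_ALL         (SENSOR_VIBRATION | SENSOR_TEMPERATURE | SENSOR_PRESSURE)

typedef struct {
    uint8_t online_mask;    /* sensors currently delivering data */
    uint8_t offline_mask;   /* sensors declared failed */
    bool    degraded;       /* true if any sensor is offline */
} report_status_t;

/* Build the status block for the supervisory report from the online set. */
report_status_t report_status(uint8_t online_mask)
{
    report_status_t st;
    st.online_mask  = (uint8_t)(online_mask & SENSOR_ALL);
    st.offline_mask = (uint8_t)(SENSOR_ALL & ~online_mask);
    st.degraded     = (st.offline_mask != 0);
    return st;
}

/* Analysis step that needs vibration data is skipped when that MCU is down. */
bool can_run_vibration_correlation(uint8_t online_mask)
{
    return (online_mask & SENSOR_VIBRATION) != 0;
}
```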

Per-MCU Watchdog

Each sensor MCU runs an independent watchdog timer (IWDG, 500 ms timeout). If firmware hangs, the watchdog resets the MCU. The bootloader re-initializes, and the MCU rejoins the CAN bus automatically.

Hot-Swap Support

CAN tolerates nodes joining and leaving a live bus at the protocol level, which makes hot-swap straightforward:

  • New sensor MCU powers up, initializes its CAN peripheral, and starts sending heartbeats.
  • Main MCU detects the new heartbeat ID and begins accepting data from it.
  • No bus interruption for the other nodes.

Firmware Update Strategy

Updating sensor MCUs in the field without taking the entire system offline.

Update Sequence

1. Main MCU receives new firmware image via Ethernet
2. Main MCU selects ONE sensor MCU for update
3. Main MCU sends UPDATE_START command (CAN ID 0x100 + node)
4. Sensor MCU enters bootloader mode, stops sending sensor data
5. Main MCU streams firmware in 8-byte CAN frames
   (64 KB image / 8 bytes per frame = 8,192 frames)
   At 1 Mbit/s each frame takes ~130 us on the wire; budgeting ~1 ms per
   frame with overhead gives ~8.5 seconds total
6. Sensor MCU verifies CRC-32 of entire image
7. If CRC passes: write to flash, reboot into new firmware
8. If CRC fails: remain in bootloader, request retransmission
9. Main MCU confirms sensor MCU is back online (heartbeat resumes)
10. Repeat for next sensor MCU
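
The frame count and the CRC check in the sequence above can be sketched as follows. The text only says "CRC-32"; this example assumes the common IEEE 802.3 / zlib variant (reflected polynomial 0xEDB88320), and the function names are illustrative. The update function is incremental, so the bootloader can feed it one 8-byte CAN payload at a time instead of buffering the whole image.

```c
#include <stddef.h>
#include <stdint.h>

/* Incremental CRC-32 (IEEE 802.3 / zlib variant, assumed here).
 * Start with crc = 0 and pass the previous return value for each new chunk. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial if the dropped bit was 1. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Number of 8-byte CAN frames needed to carry an image (last frame may be short). */
uint32_t fw_frame_count(uint32_t image_bytes)
{
    return (image_bytes + 7u) / 8u;
}
```

A bitwise CRC is slow but needs no lookup table, which may suit a small bootloader; a table-driven version trades 1 KB of flash for speed.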

Key rule: update one MCU at a time. The other two continue operating normally. If an update fails, the sensor MCU stays in the bootloader, which is never overwritten, so the transfer can be retried.

Rollback

Each sensor MCU has a dual-bank flash layout:

┌────────────────┐
│ Bootloader │ (16 KB, never overwritten)
├────────────────┤
│ Bank A (app) │ (64 KB, active firmware)
├────────────────┤
│ Bank B (app) │ (64 KB, previous firmware)
└────────────────┘

If the new firmware crashes within 10 seconds of boot (watchdog reset detected by bootloader), the bootloader swaps back to Bank B automatically.
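
The rollback rule can be sketched as a bank-selection function run by the bootloader plus a confirmation call made by the application. Everything here is hypothetical: the `boot_flags_t` structure, the idea of a "trial" flag that stays set until the new image has run for 10 seconds, and the assumption that the flags live in storage that survives a reset.

```c
#include <stdbool.h>
#include <stdint.h>

#define CONFIRM_WINDOW_MS 10000u   /* new firmware must survive this long */

typedef enum { BANK_A, BANK_B } bank_t;

/* Assumed to be kept in non-volatile or reset-retained memory. */
typedef struct {
    bank_t active;   /* bank the bootloader will jump to */
    bool   trial;    /* true while the active image is still unconfirmed */
} boot_flags_t;

/* Bootloader: decide which bank to run after a reset. */
bank_t boot_select_bank(boot_flags_t *f, bool reset_was_watchdog)
{
    if (f->trial && reset_was_watchdog) {
        /* Unconfirmed image crashed: fall back to the other bank. */
        f->active = (f->active == BANK_A) ? BANK_B : BANK_A;
        f->trial = false;
    }
    return f->active;
}

/* Application: call periodically; confirms the image once it has run long enough. */
void app_confirm_if_stable(boot_flags_t *f, uint32_t uptime_ms)
{
    if (uptime_ms >= CONFIRM_WINDOW_MS)
        f->trial = false;
}
```

Once the image is confirmed, a later watchdog reset no longer triggers a bank swap, so an unrelated hang weeks after the update does not roll the firmware back.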


Data Flow

Per-Cycle Timing Budget (1 kHz cycle = 1 ms)

Time (us)   Event
─────────   ──────────────────────────────────────
0           Main MCU asserts sync pulse
5           Sensor MCUs trigger ADC (within 5 us)
50          ADC conversions complete (~45 us)
100         Sensor MCUs transmit CAN frames
              Frame 1: ~130 us (vibration)
              Frame 2: ~130 us (temperature)
              Frame 3: ~130 us (pressure)
490         All 3 frames received by main MCU
500-800     Main MCU runs analysis algorithm
900         Main MCU sends Ethernet report (UDP)
1000        Next cycle begins
─────────   ──────────────────────────────────────
Total: under 1 ms end-to-end

The 10 ms latency requirement is easily met — the actual pipeline completes in under 1 ms.
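
The CAN portion of this budget is simple enough to check in code, which also makes it easy to re-run when the node count changes. The helper names below are illustrative; the inputs (100 us start, ~130 us per frame, 1000 us cycle) come from the timing table.

```c
#include <stdint.h>

/* Time at which the last of `frames` back-to-back CAN frames has been received. */
uint32_t can_phase_end_us(uint32_t start_us, uint32_t frames, uint32_t frame_us)
{
    return start_us + frames * frame_us;
}

/* Share of each cycle the bus spends carrying sensor data, in whole percent. */
uint32_t bus_load_percent(uint32_t frames, uint32_t frame_us, uint32_t cycle_us)
{
    return (frames * frame_us * 100u) / cycle_us;
}
```

With three sensors, `can_phase_end_us(100, 3, 130)` gives 490 us and the bus load is 39%. With a 4th sensor the same call gives 620 us, which would run past the 500 us analysis start, so the analysis window would have to shift later in the cycle.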

Buffering on Main MCU

The main MCU maintains a per-sensor ring buffer of 100 samples (100 ms of data). This handles:

  • CAN bus jitter (a frame arriving a few hundred microseconds late)
  • Brief retransmission delays from CAN error recovery
  • Analysis algorithms that need a sliding window of recent data

Per-sensor ring buffer (100 entries):
┌────┬────┬────┬─ ─ ─┬────┬────┐
│ S0 │ S1 │ S2 │ ... │S98 │S99 │
└────┴────┴────┴─ ─ ─┴────┴────┘
  ▲                         ▲
  │                         │
 tail                      head
(oldest)                 (newest)
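
A minimal C sketch of such a buffer, assuming an overwrite-oldest policy once all 100 slots are full. The `sample_t` fields and the function names are hypothetical. On a real target the push would typically run in the CAN receive ISR, so access from the analysis task would need appropriate protection, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_CAPACITY 100u   /* 100 samples = 100 ms at 1 kHz */

/* Hypothetical sample record: raw CAN payload plus the cycle it belongs to. */
typedef struct {
    uint32_t cycle;
    uint8_t  len;
    uint8_t  data[8];
} sample_t;

typedef struct {
    sample_t slots[RING_CAPACITY];
    uint32_t head;    /* index where the next sample will be written */
    uint32_t count;   /* number of valid samples, at most RING_CAPACITY */
} ring_t;

void ring_init(ring_t *r)
{
    r->head = 0;
    r->count = 0;
}

/* Store a sample, overwriting the oldest one when the buffer is full. */
void ring_push(ring_t *r, const sample_t *s)
{
    r->slots[r->head] = *s;
    r->head = (r->head + 1u) % RING_CAPACITY;
    if (r->count < RING_CAPACITY)
        r->count++;
}

/* Read a sample by age: 0 = newest, count-1 = oldest. Returns false if too old. */
bool ring_peek(const ring_t *r, uint32_t age, sample_t *out)
{
    if (age >= r->count)
        return false;
    uint32_t idx = (r->head + RING_CAPACITY - 1u - age) % RING_CAPACITY;
    *out = r->slots[idx];
    return true;
}
```

Indexing by age rather than by absolute position gives the analysis code a sliding window directly: ages 0 through N-1 are always the N most recent samples.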

Key Design Decisions

| Decision | Options Considered | Choice | Rationale |
|---|---|---|---|
| Inter-MCU bus | SPI, I2C, CAN, UART | CAN (1 Mbit/s) | Multi-drop, fault-tolerant, prioritized, long cable support |
| Sync method | CAN message, GPIO pulse, free-running | GPIO sync pulse | Under 5 us jitter vs. 100 us requirement |
| Failure detection | Heartbeat, polling, watchdog-only | Heartbeat + watchdog | Heartbeat for remote detection, watchdog for self-recovery |
| Failure threshold | 1 miss, 3 misses, 5 misses | 3 consecutive misses (300 ms) | Balances false-positive avoidance vs. detection speed |
| FW update transport | CAN, UART, SWD | CAN (bootloader) | Reuses existing bus, no extra wiring |
| FW rollback | Single bank, dual bank | Dual bank with auto-rollback | Prevents bricking on bad update |
| Sensor MCU class | Cortex-M0+, M4, M7 | Cortex-M0+ | Sufficient for sensor acquisition, low cost (~$1), low power |

What Interviewers Evaluate

  • Protocol selection with justification — not just "use CAN" but explaining why CAN beats SPI, I2C, and UART for this specific scenario.
  • Fault tolerance thinking — what happens when sensor MCU 2 dies? Does the whole system halt, or does it degrade gracefully? How do you detect it? How do you recover?
  • Synchronization with concrete timing — showing the math: 5 us jitter is well within 100 us, CAN message sync would be marginal at 130 us frame time.
  • Scalability — can you add a 4th sensor MCU? With CAN, yes: assign new message IDs, connect to the bus, no hardware changes to existing nodes.
  • Update safety — updating one MCU at a time, CRC verification, dual-bank rollback. These details separate senior candidates from junior ones.