The Prompt
"Design a distributed sensor system for a manufacturing line. A main controller MCU coordinates 3 sensor MCUs, each monitoring different parameters (vibration, temperature, pressure). The main MCU aggregates data, runs analysis, and reports to a supervisory system via Ethernet."
Requirements Clarification
| Category | Requirement | Detail |
|---|---|---|
| Functional | Sensor types | Vibration (accelerometer), temperature (thermocouple), pressure (piezo) |
| Functional | Aggregation | Main MCU collects all sensor data each cycle |
| Functional | Reporting | Main MCU sends aggregated data to supervisory system via Ethernet |
| Functional | Update | Firmware updates pushed from main MCU to sensor MCUs |
| Non-functional | Synchronization | All sensor MCUs sample within 100 us of each other |
| Non-functional | Fault tolerance | System continues operating if one sensor MCU fails |
| Non-functional | Replaceability | Sensor modules field-replaceable without powering down main MCU |
| Non-functional | Latency | Under 10 ms from sample acquisition to supervisory report |
Architecture Overview
┌───────────────────┐
│ Sensor MCU 1      │
│ (Vibration)       │──────┐
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │ CAN Bus
┌───────────────────┐      │ (1 Mbit/s)
│ Sensor MCU 2      │      │
│ (Temperature)     │──────┤
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │
┌───────────────────┐      │
│ Sensor MCU 3      │      │
│ (Pressure)        │──────┤
│ Cortex-M0+        │      │
└───────────────────┘      │
                           │
                  ┌────────┴────────┐
                  │ Main MCU        │
                  │ (Cortex-M7)     │
                  │                 │
                  │ Aggregation     │
                  │ Analysis        │
                  │ Ethernet        │
                  └────────┬────────┘
                           │
                    ┌──────┴──────┐
                    │ Supervisory │
                    │ System      │
                    └─────────────┘
Component Deep Dive
Inter-MCU Communication Protocol Selection
| Protocol | Speed | Topology | Addressing | Cable Length | Fault Tolerance | Complexity |
|---|---|---|---|---|---|---|
| SPI | Up to 50 MHz | Master-slave | CS lines (1 per slave) | Under 30 cm | Poor (shared bus) | Low |
| I2C | Up to 3.4 MHz | Multi-master possible | 7/10-bit address | Under 1 m | Moderate | Low |
| CAN | Up to 1 Mbit/s | Multi-master | 11/29-bit message ID | Up to 40 m at 1 Mbit/s | Excellent (differential, error frames) | Medium |
| UART | Up to 3 Mbit/s | Point-to-point | None (1:1 only) | Under 15 m (RS-232) | Poor | Low |
Decision: CAN bus at 1 Mbit/s.
Rationale:
- Multi-drop: All 3 sensor MCUs plus the main MCU share one bus. Adding a 4th sensor MCU means just connecting it — no extra CS lines or UART ports.
- Fault tolerant: Differential signaling rejects noise. Built-in error detection (CRC, bit stuffing, ACK). A faulty node auto-isolates via bus-off state.
- Message-based with priority: CAN arbitration gives lower message IDs higher priority. Emergency messages always win.
- Distance: Manufacturing lines can span meters. CAN supports bus lengths up to about 40 m at 1 Mbit/s, and longer runs at lower bit rates.
Message Protocol Design
Message Types
| Message Type | Direction | Frequency | CAN ID Range | Payload |
|---|---|---|---|---|
| Emergency alert | Sensor to Main | Async | 0x010 - 0x01F | Error code (2 B) + value (4 B) |
| Command | Main to Sensor | On-demand | 0x100 - 0x1FF | Command ID (1 B) + params (7 B) |
| Sensor data | Sensor to Main | Every 1 ms | 0x200 - 0x2FF | Sensor readings (6-8 B) |
| Heartbeat | Sensor to Main | Every 100 ms | 0x700 - 0x7FF | Status byte (1 B) + uptime (4 B) |
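Lower CAN ID = higher priority. Emergency messages (0x01x) always win arbitration against sensor data (0x2xx) and heartbeats (0x7xx). Arbitration is not preemptive: a frame already on the bus finishes first, so an emergency frame can be delayed by roughly one frame time (about 130 us at 1 Mbit/s).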
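The ID ranges above can be turned into a small classifier on the main MCU. The following is a minimal sketch; the type and function names (`msg_class_t`, `msg_classify`, `can_arbitration_winner`) are illustrative and not part of any vendor CAN driver.

```c
#include <assert.h>
#include <stdint.h>

/* Message classes, one per ID range in the table above. */
typedef enum {
    MSG_EMERGENCY,
    MSG_COMMAND,
    MSG_SENSOR_DATA,
    MSG_HEARTBEAT,
    MSG_UNKNOWN
} msg_class_t;

/* Map an 11-bit standard CAN ID to its message class. */
msg_class_t msg_classify(uint16_t can_id)
{
    assert(can_id <= 0x7FFu); /* standard IDs are 11 bits wide */
    if (can_id >= 0x010u && can_id <= 0x01Fu) return MSG_EMERGENCY;
    if (can_id >= 0x100u && can_id <= 0x1FFu) return MSG_COMMAND;
    if (can_id >= 0x200u && can_id <= 0x2FFu) return MSG_SENSOR_DATA;
    if (can_id >= 0x700u) return MSG_HEARTBEAT;
    return MSG_UNKNOWN;
}

/* When two nodes start transmitting at the same time, the numerically
 * lower ID wins arbitration (a dominant 0 bit overrides a recessive 1). */
uint16_t can_arbitration_winner(uint16_t id_a, uint16_t id_b)
{
    return (id_a < id_b) ? id_a : id_b;
}
```

For example, if sensor 1 starts a data frame (0x201) at the same instant another node raises an emergency (0x011), `can_arbitration_winner` returns 0x011 and the data frame is retried automatically afterwards.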
CAN Frame Layout (Standard 11-bit ID)
┌──────────┬─────────┬──────────────────────────────┐
│ CAN ID   │ DLC     │ Payload (0-8 B)              │
│ (11 bit) │ (4 bit) │                              │
└──────────┴─────────┴──────────────────────────────┘
Sensor data example (Vibration, MCU 1):
ID: 0x201
DLC: 6
Payload: [Ax_H][Ax_L][Ay_H][Ay_L][Az_H][Az_L]
(3 axes x 16-bit signed, big-endian)
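The vibration payload layout can be sketched as a pack/unpack pair. The `can_frame_t` struct and the function names are assumptions made for illustration; a real driver would supply its own frame type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame container; real CAN drivers define their own. */
typedef struct {
    uint16_t id;      /* 11-bit standard identifier */
    uint8_t  dlc;     /* data length code, 0..8 */
    uint8_t  data[8];
} can_frame_t;

/* Store a signed 16-bit value high byte first (big-endian). */
static void put_be16(uint8_t *dst, int16_t value)
{
    uint16_t u = (uint16_t)value;
    dst[0] = (uint8_t)(u >> 8);
    dst[1] = (uint8_t)(u & 0xFFu);
}

/* Read a big-endian signed 16-bit value. */
static int16_t get_be16(const uint8_t *src)
{
    uint16_t u = (uint16_t)(((uint16_t)src[0] << 8) | src[1]);
    /* Sign-convert without relying on implementation-defined narrowing. */
    return (u & 0x8000u) ? (int16_t)((int32_t)u - 65536) : (int16_t)u;
}

/* Build the vibration data frame for sensor MCU 1 (ID 0x201, DLC 6). */
can_frame_t vibration_frame_pack(int16_t ax, int16_t ay, int16_t az)
{
    can_frame_t f = { .id = 0x201u, .dlc = 6u, .data = {0} };
    put_be16(&f.data[0], ax);
    put_be16(&f.data[2], ay);
    put_be16(&f.data[4], az);
    return f;
}

/* Recover the three axis values on the main MCU. */
void vibration_frame_unpack(const can_frame_t *f,
                            int16_t *ax, int16_t *ay, int16_t *az)
{
    assert(f != NULL && f->dlc == 6u);
    *ax = get_be16(&f->data[0]);
    *ay = get_be16(&f->data[2]);
    *az = get_be16(&f->data[4]);
}
```

Fixing the byte order explicitly keeps the payload independent of the endianness of either MCU.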
ID Assignment Per Sensor MCU
| Sensor MCU | Data ID | Heartbeat ID | Emergency ID |
|---|---|---|---|
| 1 (Vibration) | 0x201 | 0x701 | 0x011 |
| 2 (Temperature) | 0x202 | 0x702 | 0x012 |
| 3 (Pressure) | 0x203 | 0x703 | 0x013 |
Adding a 4th sensor MCU: assign IDs 0x204, 0x704, 0x014. No bus reconfiguration needed.
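The ID scheme is simply a base value per message type plus the node number, which can be sketched as below. The names are illustrative, and the limit of 15 nodes is an assumption derived from the narrowest range in the table (emergency, 0x010 to 0x01F).

```c
#include <assert.h>
#include <stdint.h>

#define CAN_BASE_EMERGENCY 0x010u
#define CAN_BASE_COMMAND   0x100u
#define CAN_BASE_DATA      0x200u
#define CAN_BASE_HEARTBEAT 0x700u
#define CAN_MAX_NODE       15u   /* assumed limit: emergency range holds 0x010-0x01F */

typedef struct {
    uint16_t emergency_id;
    uint16_t command_id;
    uint16_t data_id;
    uint16_t heartbeat_id;
} node_ids_t;

/* Derive all CAN IDs for a sensor node from its node number (1-based). */
node_ids_t node_ids_for(uint8_t node)
{
    node_ids_t ids;
    assert(node >= 1u && node <= CAN_MAX_NODE);
    ids.emergency_id = (uint16_t)(CAN_BASE_EMERGENCY + node);
    ids.command_id   = (uint16_t)(CAN_BASE_COMMAND + node);
    ids.data_id      = (uint16_t)(CAN_BASE_DATA + node);
    ids.heartbeat_id = (uint16_t)(CAN_BASE_HEARTBEAT + node);
    return ids;
}

/* Recover the node number from any ID built by node_ids_for(). */
uint8_t node_from_can_id(uint16_t can_id)
{
    return (uint8_t)(can_id & 0x00Fu);
}
```

With this scheme, `node_ids_for(4)` yields 0x014, 0x104, 0x204, and 0x704, matching the IDs quoted for a fourth sensor.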
Synchronization
All sensor MCUs must sample within 100 us of each other. Two approaches:
Option A: Hardware Sync Pulse (chosen)
Main MCU GPIO ──────┬───────────────┬───────────────┐
                    │               │               │
              Sensor 1 IRQ    Sensor 2 IRQ    Sensor 3 IRQ
                (sample)        (sample)        (sample)
Timing:
Main MCU asserts sync pin
──▶ propagation delay (~10 ns)
──▶ IRQ latency on Cortex-M0+ (~15 cycles = ~0.31 us at 48 MHz)
──▶ ADC start (~1 us)
Total jitter: under 5 us (well within 100 us budget)
The main MCU pulses a dedicated GPIO line at the start of each measurement cycle (1 kHz). Each sensor MCU has this line connected to an external interrupt pin. On the rising edge, the sensor MCU immediately triggers its ADC.
Worst-case synchronization error: interrupt latency variation across MCUs is under 5 us, far within the 100 us requirement.
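The trigger-path arithmetic can be checked with a few lines of integer math. This is a sketch only; the function names are made up for illustration, and the stage values passed in should come from the target MCU's datasheet rather than from this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a CPU cycle count to nanoseconds at the given core clock. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    return (uint32_t)(((uint64_t)cycles * 1000000000ull) / clock_hz);
}

/* Delay from sync edge to ADC start: wire propagation, interrupt entry,
 * then the time the ISR needs to start the conversion. */
uint32_t sync_path_ns(uint32_t propagation_ns, uint32_t irq_cycles,
                      uint32_t clock_hz, uint32_t adc_start_ns)
{
    return propagation_ns + cycles_to_ns(irq_cycles, clock_hz) + adc_start_ns;
}

/* True when the worst-case skew between nodes fits the requirement. */
int sync_within_budget(uint32_t worst_skew_ns, uint32_t budget_ns)
{
    return worst_skew_ns <= budget_ns;
}
```

Because every sensor MCU sees the same edge, the skew between nodes is bounded by the spread in this path delay, which stays in the low microseconds even if one node is briefly delayed by a higher-priority interrupt.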
Option B: CAN Sync Message
Send a broadcast CAN message as the sync trigger. Adds 50-130 us of CAN frame transmission time plus receive ISR latency. Marginal for the 100 us requirement, so hardware sync is preferred.
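The frame-time range can be estimated from the frame structure. The sketch below uses the commonly quoted bit counts for a standard 11-bit data frame: 47 bits of fixed overhead (including the 3-bit interframe space) plus 8 bits per data byte, with a worst-case allowance for stuff bits over the 34 + 8n stuffable bits. Function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Bits on the wire for a standard data frame with no stuff bits. */
uint32_t can_frame_bits_nominal(uint8_t data_bytes)
{
    assert(data_bytes <= 8u);
    return 47u + 8u * data_bytes;
}

/* Worst case: one stuff bit per four bits in the stuffable region. */
uint32_t can_frame_bits_worst(uint8_t data_bytes)
{
    assert(data_bytes <= 8u);
    return 47u + 8u * data_bytes + (34u + 8u * data_bytes - 1u) / 4u;
}

/* Transmission time in microseconds, rounded up. */
uint32_t can_frame_time_us(uint32_t bits, uint32_t bitrate_bps)
{
    assert(bitrate_bps > 0u);
    return (uint32_t)(((uint64_t)bits * 1000000ull + bitrate_bps - 1u)
                      / bitrate_bps);
}
```

At 1 Mbit/s this gives 47 us for an empty frame without stuffing and 135 us for a worst-case 8-byte frame, in line with the 50-130 us range quoted above.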
Fault Isolation and Recovery
Heartbeat Monitoring
Each sensor MCU sends a heartbeat CAN message every 100 ms containing its status byte and uptime counter.
Main MCU heartbeat monitor (per sensor MCU):
┌─────────────┐      heartbeat      ┌──────────┐
│ HEALTHY     │◀──────received──────│ Waiting  │
│             │────────────────────▶│ (100 ms  │
│ Reset miss  │       timeout       │ timer)   │
│ counter     │                     └──────────┘
└──────┬──────┘
       │ 3 consecutive misses (300 ms)
       ▼
┌─────────────┐
│ FAILED      │
│             │──▶ Log fault
│             │──▶ Alert supervisory system
│             │──▶ Enter degraded mode
└──────┬──────┘
       │ heartbeat resumes
       ▼
┌─────────────┐
│ RECOVERING  │──▶ Re-sync, validate data
│             │──▶ Return to HEALTHY after 10 good heartbeats
└─────────────┘
3 consecutive misses = 300 ms. This avoids false positives from a single delayed message while detecting real failures within half a second.
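The monitor can be sketched as a small event-driven state machine, one instance per sensor MCU. All names are illustrative. Two behaviors are assumptions not stated above: the heartbeat that ends a failure counts as the first of the 10 good heartbeats, and a missed heartbeat during RECOVERING sends the node back to FAILED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HB_MISS_LIMIT    3u   /* consecutive misses before FAILED */
#define HB_RECOVERY_GOOD 10u  /* good heartbeats before HEALTHY again */

typedef enum { HB_HEALTHY, HB_FAILED, HB_RECOVERING } hb_state_t;

typedef struct {
    hb_state_t state;
    uint8_t    misses; /* consecutive missed heartbeats */
    uint8_t    good;   /* good heartbeats seen while recovering */
} hb_monitor_t;

void hb_init(hb_monitor_t *m)
{
    assert(m != NULL);
    m->state = HB_HEALTHY;
    m->misses = 0u;
    m->good = 0u;
}

/* Call when a heartbeat frame arrives from this node. */
void hb_on_heartbeat(hb_monitor_t *m)
{
    assert(m != NULL);
    switch (m->state) {
    case HB_HEALTHY:
        m->misses = 0u;
        break;
    case HB_FAILED:
        m->state = HB_RECOVERING;
        m->good = 1u; /* assumption: resuming heartbeat counts as #1 */
        break;
    case HB_RECOVERING:
        m->good++;
        if (m->good >= HB_RECOVERY_GOOD) {
            m->state = HB_HEALTHY;
            m->misses = 0u;
        }
        break;
    }
}

/* Call when the 100 ms timer expires with no heartbeat received. */
void hb_on_timeout(hb_monitor_t *m)
{
    assert(m != NULL);
    switch (m->state) {
    case HB_HEALTHY:
        m->misses++;
        if (m->misses >= HB_MISS_LIMIT) {
            m->state = HB_FAILED;
        }
        break;
    case HB_RECOVERING:
        m->state = HB_FAILED; /* assumption: any miss restarts recovery */
        m->good = 0u;
        break;
    case HB_FAILED:
        break; /* already failed; nothing to do */
    }
}
```

The logging, alerting, and degraded-mode actions from the diagram would hang off the transition into `HB_FAILED`; they are left out here to keep the state logic visible.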
Degraded Mode
When a sensor MCU is declared failed:
- Main MCU continues collecting data from the remaining 2 sensors.
- The supervisory report includes a flag indicating which sensor is offline.
- Analysis algorithms adapt (e.g., skip vibration correlation if vibration MCU is down).
- No cascade: the CAN bus remains operational for healthy nodes.
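One simple way to carry the offline flag and gate the analysis is a per-node bitmask. This is a sketch under assumed names; the bit layout (bit 0 = node 1) is a choice made for this example, not something the design above prescribes.

```c
#include <assert.h>
#include <stdint.h>

#define NODE_VIBRATION   1u
#define NODE_TEMPERATURE 2u
#define NODE_PRESSURE    3u
#define ALL_NODES_MASK   0x07u /* bit 0 = node 1, bit 1 = node 2, bit 2 = node 3 */

static uint8_t node_bit(uint8_t node)
{
    assert(node >= 1u && node <= 3u);
    return (uint8_t)(1u << (node - 1u));
}

/* Clear a node's bit when its heartbeat monitor declares it failed. */
uint8_t mask_set_offline(uint8_t online_mask, uint8_t node)
{
    return (uint8_t)(online_mask & (uint8_t)~node_bit(node));
}

/* Set a node's bit again once it has recovered. */
uint8_t mask_set_online(uint8_t online_mask, uint8_t node)
{
    return (uint8_t)(online_mask | node_bit(node));
}

/* The system is degraded when any expected node is missing. */
int system_degraded(uint8_t online_mask)
{
    return (online_mask & ALL_NODES_MASK) != ALL_NODES_MASK;
}

/* Vibration correlation only makes sense with the vibration node present. */
int vibration_correlation_enabled(uint8_t online_mask)
{
    return (online_mask & node_bit(NODE_VIBRATION)) != 0u;
}
```

The mask can be copied directly into the supervisory report so the receiving side knows which readings in that cycle are valid.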
Per-MCU Watchdog
Each sensor MCU runs an independent watchdog timer (IWDG, 500 ms timeout). If firmware hangs, the watchdog resets the MCU. The bootloader re-initializes, and the MCU rejoins the CAN bus automatically.
Hot-Swap Support
CAN supports hot-swap at the protocol level:
- New sensor MCU powers up, initializes its CAN peripheral, and starts sending heartbeats.
- Main MCU detects the new heartbeat ID and begins accepting data from it.
- No bus interruption for the other nodes.
Firmware Update Strategy
Updating sensor MCUs in the field without taking the entire system offline.
Update Sequence
1. Main MCU receives new firmware image via Ethernet
2. Main MCU selects ONE sensor MCU for update
3. Main MCU sends UPDATE_START command (CAN ID 0x100 + node)
4. Sensor MCU enters bootloader mode, stops sending sensor data
5. Main MCU streams firmware in 8-byte CAN frames
   (64 KB image / 8 bytes per frame = 8,192 frames)
   At 1 Mbit/s, about 1 ms per frame with overhead = ~8.5 seconds total
6. Sensor MCU verifies CRC-32 of entire image
7. If CRC passes: write to flash, reboot into new firmware
8. If CRC fails: remain in bootloader, request retransmission
9. Main MCU confirms sensor MCU is back online (heartbeat resumes)
10. Repeat for next sensor MCU
Key rule: update one MCU at a time. The other two continue operating normally. If the update fails, the sensor MCU stays in the bootloader and the update can be retried.
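The frame count and the image check can be sketched as follows. The text only says "CRC-32", so the specific variant here (reflected polynomial 0xEDB88320, as used by zlib and Ethernet) is an assumption, as are the function names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAN_PAYLOAD_BYTES 8u

/* Number of CAN frames needed to carry an image, rounding up. */
uint32_t fw_frame_count(uint32_t image_bytes)
{
    return (image_bytes + CAN_PAYLOAD_BYTES - 1u) / CAN_PAYLOAD_BYTES;
}

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320).
 * Pass 0 for the first chunk, then feed the previous return value
 * back in for each following chunk, so the bootloader can update the
 * CRC frame by frame instead of buffering the whole image. */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0u);
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Compare the running CRC against the value sent by the main MCU. */
int fw_image_valid(uint32_t computed_crc, uint32_t expected_crc)
{
    return computed_crc == expected_crc;
}
```

A table-driven CRC or a hardware CRC unit would be faster, but the bitwise form costs almost no flash, which matters inside a 16 KB bootloader.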
Rollback
Each sensor MCU has a dual-bank flash layout:
┌────────────────┐
│ Bootloader     │ (16 KB, never overwritten)
├────────────────┤
│ Bank A (app)   │ (64 KB, active firmware)
├────────────────┤
│ Bank B (app)   │ (64 KB, previous firmware)
└────────────────┘
If the new firmware crashes within 10 seconds of boot (watchdog reset detected by bootloader), the bootloader swaps back to Bank B automatically.
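The rollback decision in the bootloader can be sketched with a "trial" flag. The mechanism is an assumption: the text does not say how the bootloader knows the crash happened within 10 seconds, so this example has the application clear the flag once it has run for 10 seconds, and treats a watchdog reset with the flag still set as a failed update. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { BANK_A = 0, BANK_B = 1 } bank_t;

/* Persistent boot record (would live in flash or backup RAM). */
typedef struct {
    bank_t  active; /* bank the bootloader jumps to */
    uint8_t trial;  /* 1 while newly installed firmware is unconfirmed */
} boot_record_t;

static bank_t other_bank(bank_t b)
{
    return (b == BANK_A) ? BANK_B : BANK_A;
}

/* Called by the bootloader after a successful update into new_bank. */
void boot_mark_updated(boot_record_t *r, bank_t new_bank)
{
    assert(r != NULL);
    r->active = new_bank;
    r->trial = 1u;
}

/* Called by the application after 10 s of healthy operation. */
void boot_confirm(boot_record_t *r)
{
    assert(r != NULL);
    r->trial = 0u;
}

/* Called by the bootloader on every reset to pick the bank to run. */
bank_t boot_select_bank(boot_record_t *r, int watchdog_reset)
{
    assert(r != NULL);
    if (r->trial && watchdog_reset) {
        /* New firmware crashed before confirming itself: roll back. */
        r->active = other_bank(r->active);
        r->trial = 0u;
    }
    return r->active;
}
```

A watchdog reset after `boot_confirm` does not trigger a rollback, so an unrelated crash hours later keeps the current firmware.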
Data Flow
Per-Cycle Timing Budget (1 kHz cycle = 1 ms)
Time (us)   Event
─────────   ──────────────────────────────
0           Main MCU asserts sync pulse
5           Sensor MCUs trigger ADC (within 5 us)
50          ADC conversions complete (~45 us)
100         Sensor MCUs transmit CAN frames
              Frame 1: ~130 us (vibration)
              Frame 2: ~130 us (temperature)
              Frame 3: ~130 us (pressure)
490         All 3 frames received by main MCU
500-800     Main MCU runs analysis algorithm
900         Main MCU sends Ethernet report (UDP)
1000        Next cycle begins
─────────   ──────────────────────────────
Total: under 1 ms end-to-end
The 10 ms latency requirement is met with a wide margin: the pipeline completes in under 1 ms.
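The CAN portion of the budget is worth checking explicitly, since the bus is the shared resource. A small sketch with illustrative function names, using the figures from the timing table:

```c
#include <assert.h>
#include <stdint.h>

/* Time at which the last of several back-to-back frames has arrived. */
uint32_t can_phase_end_us(uint32_t start_us, uint32_t frames,
                          uint32_t frame_us)
{
    return start_us + frames * frame_us;
}

/* Bus utilization in parts per thousand over one measurement cycle. */
uint32_t bus_load_permille(uint32_t frames, uint32_t frame_us,
                           uint32_t cycle_us)
{
    assert(cycle_us > 0u);
    return (frames * frame_us * 1000u) / cycle_us;
}

/* Time left in the cycle after a phase ends (0 if the phase overruns). */
uint32_t cycle_slack_us(uint32_t phase_end_us, uint32_t cycle_us)
{
    return (phase_end_us < cycle_us) ? (cycle_us - phase_end_us) : 0u;
}
```

Three 130 us frames starting at 100 us finish at 490 us and load the bus to about 39% per 1 ms cycle. That leaves room for heartbeats and the occasional retransmission, but a fourth and fifth sensor at the same rate would push utilization past 60%.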
Buffering on Main MCU
The main MCU maintains a per-sensor ring buffer of 100 samples (100 ms of data). This handles:
- CAN bus jitter (a frame arriving a few hundred microseconds late)
- Brief retransmission delays from CAN error recovery
- Analysis algorithms that need a sliding window of recent data
Per-sensor ring buffer (100 entries):
┌────┬────┬────┬────┬────┬─ ─ ─┬────┐
│ S0 │ S1 │ S2 │... │S98 │     │S99 │
└────┴────┴────┴────┴────┴─ ─ ─┴────┘
  ▲                               ▲
  │                               │
 tail                            head
(oldest)                       (newest)
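A fixed-size ring buffer that overwrites the oldest sample when full can be sketched as below. The `sample_t` layout and the function names are assumptions for illustration; in this version `head` is the next write position and the tail is implied by `count`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 100u /* 100 samples = 100 ms at 1 kHz */

/* Hypothetical sample record: cycle number plus up to three readings. */
typedef struct {
    uint32_t cycle;
    int16_t  value[3];
} sample_t;

typedef struct {
    sample_t buf[RING_CAPACITY];
    uint16_t head;  /* index of the next slot to write */
    uint16_t count; /* number of valid samples, at most RING_CAPACITY */
} ring_t;

void ring_init(ring_t *r)
{
    assert(r != NULL);
    r->head = 0u;
    r->count = 0u;
}

/* Append a sample; when full, the oldest sample is overwritten. */
void ring_push(ring_t *r, sample_t s)
{
    assert(r != NULL);
    r->buf[r->head] = s;
    r->head = (uint16_t)((r->head + 1u) % RING_CAPACITY);
    if (r->count < RING_CAPACITY) {
        r->count++;
    }
}

/* Fetch a sample by age: 0 = newest, count-1 = oldest.
 * Returns 1 on success, 0 if that sample is not in the buffer. */
int ring_get(const ring_t *r, uint16_t age, sample_t *out)
{
    assert(r != NULL && out != NULL);
    if (age >= r->count) {
        return 0;
    }
    *out = r->buf[(r->head + RING_CAPACITY - 1u - age) % RING_CAPACITY];
    return 1;
}
```

Indexing by age suits sliding-window analysis: the algorithm asks for the last N samples without caring where the buffer has wrapped. If `ring_push` runs in the CAN receive ISR and `ring_get` in the main loop, access to `head` and `count` needs to be protected, for example by briefly masking the interrupt.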
Key Design Decisions
| Decision | Options Considered | Choice | Rationale |
|---|---|---|---|
| Inter-MCU bus | SPI, I2C, CAN, UART | CAN (1 Mbit/s) | Multi-drop, fault-tolerant, prioritized, long cable support |
| Sync method | CAN message, GPIO pulse, free-running | GPIO sync pulse | Under 5 us jitter vs. 100 us requirement |
| Failure detection | Heartbeat, polling, watchdog-only | Heartbeat + watchdog | Heartbeat for remote detection, watchdog for self-recovery |
| Failure threshold | 1 miss, 3 misses, 5 misses | 3 consecutive misses (300 ms) | Balances false-positive avoidance vs. detection speed |
| FW update transport | CAN, UART, SWD | CAN (bootloader) | Reuses existing bus, no extra wiring |
| FW rollback | Single bank, dual bank | Dual bank with auto-rollback | Prevents bricking on bad update |
| Sensor MCU class | Cortex-M0+, M4, M7 | Cortex-M0+ | Sufficient for sensor acquisition, low cost (~$1), low power |
What Interviewers Evaluate
- Protocol selection with justification — not just "use CAN" but explaining why CAN beats SPI, I2C, and UART for this specific scenario.
- Fault tolerance thinking — what happens when sensor MCU 2 dies? Does the whole system halt, or does it degrade gracefully? How do you detect it? How do you recover?
- Synchronization with concrete timing — showing the math: 5 us jitter is well within 100 us, CAN message sync would be marginal at 130 us frame time.
- Scalability — can you add a 4th sensor MCU? With CAN, yes: assign new message IDs, connect to the bus, no hardware changes to existing nodes.
- Update safety — updating one MCU at a time, CRC verification, dual-bank rollback. These details separate senior candidates from junior ones.