Safety, Security & Reliability
intermediate
Weight: 3/10

OTA updates and rollback

Design reliable over-the-air firmware updates: A/B partitioning, delta updates, rollback strategies, update security, and power-fail safety.

ota
firmware-update
rollback
a-b-partitioning
delta-update
secure-update

Quick Cap

OTA (Over-The-Air) updates let you push firmware fixes and features to deployed devices without physical access -- essential for any connected product. The core engineering challenge is ensuring the device never gets bricked: if the download is interrupted, the image is corrupt, or the new firmware crashes, the device must recover to a known-good state automatically.

Interviewers test whether you can design a complete update pipeline -- from download through verification to atomic commit -- and whether you understand the trade-offs between A/B partitioning, single-bank recovery, and streaming approaches.

Key Facts:

  • A/B partitioning: Two firmware slots; write to inactive, swap on verified success, rollback by re-swapping
  • Delta updates: Transmit only changed bytes (e.g., bsdiff), reducing transfer size by 60-90%
  • Update security: Signed images (ECDSA/RSA), anti-rollback monotonic counters, secure boot chain
  • Power-fail safety: Atomic commit flag in flash -- the old image stays bootable as the fallback until the new image is confirmed
  • State machine: idle -> download -> verify -> apply -> reboot -> confirm (or rollback)
  • Health check: Post-boot self-test must pass before committing new image; failure triggers automatic rollback

Deep Dive

Update Architecture Overview

OTA Update Flow
===============

[Server]
    |
    | (1) Download image
    v
[Device: Idle] ---> [Download] ---> [Verify Signature]
                                            |
                                fail        |        pass
                              +-------------+-------------+
                              v                           v
                           [Abort]            [Write to Inactive Slot]
                                                          |
                                                          v
                                                 [Set Pending Flag]
                                                          |
                                                          v
                                                      [Reboot]
                                                          |
                                                          v
                                                [Boot from New Slot]
                                                          |
                                              pass        |   fail (health check)
                                            +-------------+-------------+
                                            v                           v
                                        [Confirm]            [Rollback to Old Slot]
                                            |
                                            v
                                    [Update Complete]
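
The flow above can be sketched as a small transition function. This is a minimal illustration, not a reference implementation: the state and event names (`ota_state_t`, `ota_event_t`, `ota_next_state`) are hypothetical, and a real device would also persist the state across resets.

```c
/* Hypothetical state and event names mirroring the OTA update flow. */
typedef enum {
    OTA_IDLE, OTA_DOWNLOAD, OTA_VERIFY, OTA_APPLY,
    OTA_REBOOT, OTA_CONFIRM, OTA_ROLLBACK
} ota_state_t;

typedef enum {
    EV_UPDATE_AVAILABLE, EV_DOWNLOAD_DONE, EV_DOWNLOAD_FAILED,
    EV_VERIFY_OK, EV_VERIFY_FAILED, EV_WRITE_DONE,
    EV_BOOTED_NEW, EV_HEALTH_OK, EV_HEALTH_FAILED, EV_BOOTED_OLD
} ota_event_t;

/* Returns the next state; events that do not apply leave the state unchanged. */
ota_state_t ota_next_state(ota_state_t s, ota_event_t ev) {
    switch (s) {
    case OTA_IDLE:
        return (ev == EV_UPDATE_AVAILABLE) ? OTA_DOWNLOAD : s;
    case OTA_DOWNLOAD:
        if (ev == EV_DOWNLOAD_DONE)   return OTA_VERIFY;
        if (ev == EV_DOWNLOAD_FAILED) return OTA_IDLE;      /* retry later */
        return s;
    case OTA_VERIFY:
        if (ev == EV_VERIFY_OK)     return OTA_APPLY;
        if (ev == EV_VERIFY_FAILED) return OTA_IDLE;        /* abort */
        return s;
    case OTA_APPLY:
        return (ev == EV_WRITE_DONE) ? OTA_REBOOT : s;      /* pending flag set */
    case OTA_REBOOT:
        return (ev == EV_BOOTED_NEW) ? OTA_CONFIRM : s;
    case OTA_CONFIRM:
        if (ev == EV_HEALTH_OK)     return OTA_IDLE;        /* update complete */
        if (ev == EV_HEALTH_FAILED) return OTA_ROLLBACK;
        return s;
    case OTA_ROLLBACK:
        return (ev == EV_BOOTED_OLD) ? OTA_IDLE : s;        /* old slot running */
    }
    return s;
}
```

Keeping the transitions in one pure function makes the state machine easy to unit-test on a host machine before it ever touches flash.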

A/B Partitioning (Dual-Bank)

A/B partitioning dedicates two equal-sized flash regions to firmware images. The bootloader maintains metadata indicating which slot is active. During an update, the new image is written to the inactive slot while the active slot continues running -- zero downtime during the write phase.

After writing and verifying the new image, the bootloader metadata is atomically updated to mark the new slot as "pending test." On reboot, the bootloader loads from the new slot. If the application confirms health within a timeout window, the slot is marked "confirmed." If the health check fails or the device crashes before confirmation, the bootloader reverts to the old slot on the next reset.
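
The slot-selection decision described above might look roughly like the following on the bootloader side. This is a sketch under stated assumptions: `boot_meta_t`, its `boot_attempts` field, and `MAX_BOOT_ATTEMPTS` are hypothetical names, and the caller is assumed to persist the modified record to flash before jumping so that a crash in the new image still counts as a used attempt.

```c
#include <stdint.h>

#define MAX_BOOT_ATTEMPTS 1u  /* assumption: one trial boot before rollback */

/* Hypothetical bootloader view of the update metadata. */
typedef struct {
    uint32_t active_slot;    /* last confirmed slot: 0 = Slot A, 1 = Slot B */
    uint32_t pending_slot;   /* slot under test */
    uint32_t confirmed;      /* 1 = no trial in progress */
    uint32_t boot_attempts;  /* trial boots already consumed */
} boot_meta_t;

/*
 * Decide which slot to boot. May modify *meta; the caller must write the
 * record back to flash before jumping to the selected image.
 */
uint32_t bootloader_select_slot(boot_meta_t *meta) {
    if (meta->confirmed) {
        return meta->active_slot;            /* normal boot */
    }
    if (meta->boot_attempts < MAX_BOOT_ATTEMPTS) {
        meta->boot_attempts++;
        return meta->pending_slot;           /* trial boot of the new image */
    }
    /* Trial budget used up without confirmation: roll back to the old slot. */
    meta->pending_slot = meta->active_slot;
    meta->confirmed = 1;
    meta->boot_attempts = 0;
    return meta->active_slot;
}
```

Counting attempts in the bootloader, rather than relying on the application to report failure, is what makes rollback work even when the new firmware crashes before it can run any code of its own.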

Flash layout (typical MCU with 512 KB flash):

Flash Memory Map
================
0x0800_0000 +-------------------+
            | Bootloader        |  16 KB
0x0800_4000 +-------------------+
            | Slot A (active)   | 240 KB
0x0804_0000 +-------------------+
            | Slot B (inactive) | 240 KB
0x0807_C000 +-------------------+
            | Update Metadata   |   4 KB (boot_slot, pending, confirmed, version, CRC)
0x0807_D000 +-------------------+
            | Config / NVS      |  12 KB
0x0808_0000 +-------------------+
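
One way to keep a layout like this honest is to derive every boundary from the region sizes and let the compiler check that they add up. The sketch below assumes the sizes listed in the map and a flash base of 0x0800_0000; all macro names are hypothetical, and real parts also require each region to start on an erase-sector boundary, which this check does not cover.

```c
#include <stdint.h>

#define KB(n)               ((uint32_t)(n) * 1024u)

#define OTA_FLASH_BASE      0x08000000u
#define OTA_FLASH_SIZE      KB(512)

/* Region sizes, taken from the memory map. */
#define OTA_BOOTLOADER_SIZE KB(16)
#define OTA_SLOT_SIZE       KB(240)
#define OTA_META_SIZE       KB(4)
#define OTA_NVS_SIZE        KB(12)

/* Each start address is the previous start plus the previous size. */
#define OTA_BOOTLOADER_ADDR OTA_FLASH_BASE
#define OTA_SLOT_A_ADDR     (OTA_BOOTLOADER_ADDR + OTA_BOOTLOADER_SIZE)
#define OTA_SLOT_B_ADDR     (OTA_SLOT_A_ADDR + OTA_SLOT_SIZE)
#define OTA_META_ADDR       (OTA_SLOT_B_ADDR + OTA_SLOT_SIZE)
#define OTA_NVS_ADDR        (OTA_META_ADDR + OTA_META_SIZE)
#define OTA_FLASH_END       (OTA_NVS_ADDR + OTA_NVS_SIZE)

/* Compile-time check (C11): the regions must fill the part exactly. */
_Static_assert(OTA_FLASH_END - OTA_FLASH_BASE == OTA_FLASH_SIZE,
               "flash regions must add up to the total flash size");

/* Base address of a firmware slot: 0 = Slot A, anything else = Slot B. */
static inline uint32_t ota_slot_base(uint32_t slot) {
    return (slot == 0u) ? OTA_SLOT_A_ADDR : OTA_SLOT_B_ADDR;
}
```

If a region is resized later, every downstream address moves automatically and the static assertion fails the build when the total no longer matches the part.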

Update Strategies Compared

Strategy               | Flash Overhead                | Transfer Size      | Rollback                     | Power-Fail Safe                 | Complexity
A/B (dual-bank)        | 2x firmware size              | Full image         | Instant (swap back)          | Yes -- old slot intact          | Low
Single-bank + recovery | 1x + small recovery partition | Full image         | Slow (re-flash from recovery)| Partial -- recovery must survive| Medium
Streaming (in-place)   | 0 extra                       | Full image         | None -- destructive          | No -- brick risk on power loss  | Low flash, high risk
Delta (with A/B)       | 2x firmware size              | Patch only (10-40%)| Instant                      | Yes                             | High (patch engine needed)

Delta / Differential Updates

Delta updates transmit only the binary difference between old and new firmware, dramatically reducing transfer size. Tools like bsdiff or detools generate compact patch files. The device applies the patch against its current firmware to reconstruct the new image.

Trade-offs: delta updates require the device to know its exact current version (patches are version-specific), need enough RAM or scratch flash to hold the patch and reconstruct the image, and add complexity to the update pipeline. They shine for bandwidth-constrained links (cellular IoT, LoRa) where transferring a full 256 KB image is expensive or slow.
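
To make the "apply a patch against the current firmware" step concrete, here is a deliberately simplified patch engine. It is not the bsdiff or detools format: it assumes a hypothetical patch made of two operation types, COPY (reuse a byte range from the old image) and INSERT (emit literal bytes from the patch). Real formats add compression and diff encoding, but the bounds checking shown here applies to any of them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { DELTA_COPY, DELTA_INSERT } delta_op_kind_t;

/* One operation of the hypothetical patch format. */
typedef struct {
    delta_op_kind_t kind;
    size_t          offset;  /* COPY: start offset in the old image */
    size_t          len;     /* number of bytes to produce */
    const uint8_t  *data;    /* INSERT: literal bytes carried in the patch */
} delta_op_t;

/*
 * Rebuild the new image in `out` from the old image plus the patch ops.
 * Returns the number of bytes written, or -1 if any op is out of bounds.
 */
long delta_apply(const uint8_t *old_img, size_t old_len,
                 const delta_op_t *ops, size_t n_ops,
                 uint8_t *out, size_t out_cap) {
    size_t pos = 0;
    for (size_t i = 0; i < n_ops; i++) {
        const delta_op_t *op = &ops[i];
        if (op->len > out_cap - pos) {
            return -1;                               /* would overflow output */
        }
        if (op->kind == DELTA_COPY) {
            if (op->offset > old_len || op->len > old_len - op->offset) {
                return -1;                           /* reads past old image */
            }
            memcpy(out + pos, old_img + op->offset, op->len);
        } else if (op->kind == DELTA_INSERT) {
            if (op->data == NULL) {
                return -1;
            }
            memcpy(out + pos, op->data, op->len);
        } else {
            return -1;                               /* unknown op */
        }
        pos += op->len;
    }
    return (long)pos;
}
```

On a real device `out` would be the inactive slot, written through the flash driver rather than `memcpy`, and the reconstructed image would still go through full hash and signature verification before being marked pending.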

Update Security

Every OTA image must be verified before execution:

  1. Signature verification: The image is signed with the vendor's private key (ECDSA P-256 or RSA-2048). The device verifies using the corresponding public key baked into the bootloader or OTP fuses.
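  2. Anti-rollback counter: A monotonic counter stored in OTP fuses or protected flash. Each signed image carries a security version, and the bootloader refuses to boot any image whose version is below the stored minimum, preventing attackers from downgrading to a known-vulnerable version. Because fuse bits are a finite resource, the counter is usually advanced only for security-relevant releases, and only after the new image has been confirmed.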
  3. Secure boot chain: The ROM bootloader verifies the first-stage bootloader, which verifies the firmware image. If any link in the chain fails, boot halts.
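
The ordering of the first two checks can be sketched as below. Signature verification itself is left to whatever crypto library the platform provides and is passed in as a boolean. `image_header_t`, `boot_authorize`, and `counter_after_confirm` are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical header fields covered by the image signature. */
typedef struct {
    uint32_t security_version;  /* compared against the stored counter */
} image_header_t;

typedef enum {
    BOOT_OK,
    BOOT_REJECT_SIGNATURE,
    BOOT_REJECT_ROLLBACK
} boot_decision_t;

/*
 * Decide whether an image may boot.
 * signature_valid: result of the platform's signature check over the image
 *                  and header (assumed to be computed by the caller).
 * stored_min:      current value of the monotonic anti-rollback counter.
 */
boot_decision_t boot_authorize(const image_header_t *hdr,
                               bool signature_valid,
                               uint32_t stored_min) {
    /* Authenticity first: header fields are untrusted until the signature passes. */
    if (!signature_valid) {
        return BOOT_REJECT_SIGNATURE;
    }
    if (hdr->security_version < stored_min) {
        return BOOT_REJECT_ROLLBACK;
    }
    return BOOT_OK;
}

/* Counter value to store after an image is confirmed; it never decreases. */
uint32_t counter_after_confirm(uint32_t stored_min, uint32_t confirmed_version) {
    return (confirmed_version > stored_min) ? confirmed_version : stored_min;
}
```

Checking the signature before reading the version matters: an attacker who can edit an unsigned header could otherwise claim any version number.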

Power-Fail Safety

The critical invariant: the device must always be able to boot a working image, even if power is lost at any point during the update. This is achieved by:

  • Writing the new image to the inactive slot -- the active slot remains untouched until verification succeeds
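  • Atomic commit: The "boot from new slot" flag is a single word write to flash, which is as close to atomic as flash allows. If the write completes, the bootloader boots the new slot; if it never starts, the bootloader boots the old one. A write torn by power loss should be caught by the metadata CRC, and the bootloader should treat invalid metadata as "boot old."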
  • CRC/hash verification at boot: The bootloader checks the image integrity before jumping. A partially written image fails the check, triggering fallback.
c
/* Simplified update metadata and confirmation logic */
#include <stdint.h>

typedef struct {
    uint32_t active_slot;   /* 0 = Slot A, 1 = Slot B */
    uint32_t pending_slot;  /* Slot awaiting confirmation */
    uint32_t confirmed;     /* 1 = confirmed, 0 = pending */
    uint32_t version;
    uint32_t crc32;
} update_meta_t;

/* Platform-specific: persists the metadata record to flash (defined elsewhere). */
void flash_write_meta(const update_meta_t *meta);

void app_confirm_update(update_meta_t *meta) {
    /*
     * Called after post-boot health check passes.
     * Marks the new slot as confirmed so the bootloader
     * will not roll back on the next reset.
     */
    meta->confirmed = 1;
    flash_write_meta(meta);
}
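
A metadata record is only trustworthy if the bootloader can tell a complete write from a torn one. The sketch below shows one common approach, offered as an assumption rather than the method used above: protect the record with a CRC-32 and keep two copies with a sequence number, so that if the newest copy is corrupt the bootloader falls back to the previous one. `meta_record_t`, its `sequence` field, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-flash record; the CRC covers every field before crc32. */
typedef struct {
    uint32_t sequence;      /* incremented on every metadata write */
    uint32_t active_slot;
    uint32_t pending_slot;
    uint32_t confirmed;
    uint32_t version;
    uint32_t crc32;
} meta_record_t;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib). */
uint32_t crc32_compute(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

uint32_t meta_crc(const meta_record_t *m) {
    return crc32_compute((const uint8_t *)m, offsetof(meta_record_t, crc32));
}

bool meta_is_valid(const meta_record_t *m) {
    return m->crc32 == meta_crc(m);
}

/* Pick the newest valid copy; NULL means both are corrupt (boot a default slot). */
const meta_record_t *meta_select(const meta_record_t *a, const meta_record_t *b) {
    bool va = meta_is_valid(a);
    bool vb = meta_is_valid(b);
    if (va && vb) {
        return (b->sequence > a->sequence) ? b : a;
    }
    if (va) {
        return a;
    }
    return vb ? b : NULL;
}
```

A writer would alternate between the two copies, always overwriting the older one, so a power loss mid-write can damage at most the copy being replaced.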

Debugging Story

A consumer IoT company shipped 50,000 smart locks with a single-bank OTA scheme and no rollback mechanism. A firmware update introduced a Wi-Fi driver regression that caused a crash loop during boot. Because the update wrote directly over the only firmware slot, the devices could not recover -- they were bricked. The company had to physically recall and re-flash every unit at a cost exceeding $2 million. The fix for future products: A/B partitioning with mandatory health-check confirmation before committing the new slot. The lesson: never deploy an update architecture that lacks an automatic rollback path.

What interviewers want to hear: that you design for the worst case (power loss mid-write, corrupt download, crashing new firmware), use A/B partitioning as the default architecture, verify image signatures before writing to flash, and implement a health-check confirmation window so rollback is automatic. Bonus points for discussing delta updates, anti-rollback counters, and bandwidth trade-offs.

Interview Focus

Classic Interview Questions

Q1: "How does A/B partitioning work and why is it preferred for OTA updates?"

Model Answer Starter: "A/B partitioning maintains two firmware slots in flash. The device runs from the active slot while writing the new image to the inactive slot. After writing, the bootloader metadata is updated to mark the new slot as 'pending test,' and the device reboots into it. If the application passes its health check and calls a confirmation function, the new slot becomes active permanently. If the health check fails or the device crashes, the bootloader reverts to the old slot on the next reset. This guarantees the device always has a known-good image to fall back to, even if the new firmware is fatally flawed."

Q2: "How do you make OTA updates power-fail safe?"

Model Answer Starter: "The key principle is that the old firmware must remain bootable until the new firmware is fully written and verified. I achieve this by writing the new image to a separate flash region (inactive slot), verifying its CRC and signature in place, and then atomically updating a single metadata flag to switch boot targets. If power is lost during the image write, the active slot is untouched. If power is lost after the metadata write but before health confirmation, the bootloader will attempt the new slot once and roll back if it fails. The metadata write itself is designed as a single flash word write to be as close to atomic as possible."

Q3: "What is a delta update and when would you use one?"

Model Answer Starter: "A delta update transmits only the binary difference between the old and new firmware images, typically generated by tools like bsdiff. On the device, a patch engine applies the delta to the current image to reconstruct the new image in the inactive slot. I use delta updates when bandwidth is expensive or limited -- for example, cellular IoT devices where each kilobyte of transfer has a cost, or LoRa devices with very low data rates. The trade-off is complexity: you need version-specific patches, enough RAM or scratch flash for reconstruction, and a fallback to full-image update if the delta fails."

Q4: "How do you prevent firmware downgrade attacks?"

Model Answer Starter: "I use a monotonic anti-rollback counter stored in OTP fuses or a protected flash region. Each firmware release increments this counter, and the bootloader refuses to boot any image whose counter value is lower than the stored minimum. This prevents attackers from flashing an older firmware version with known vulnerabilities. The counter is one-way -- once incremented, it cannot be decreased. Combined with signature verification, this ensures only authentic, current-or-newer firmware can execute."

Q5: "Walk me through the complete OTA update state machine."

Model Answer Starter: "The state machine has six states: idle, download, verify, apply, reboot, and confirm. In idle, the device polls or receives a push notification for available updates. In download, it fetches the image in chunks, storing to a staging area or directly to the inactive slot, with resume-on-disconnect support. In verify, it checks the CRC, validates the cryptographic signature, and compares the version against the anti-rollback counter. In apply, it writes the image to the inactive slot if not already there and updates bootloader metadata to 'pending.' Reboot transitions boot to the new slot. In confirm, the application runs its health-check suite and either confirms the update or the watchdog triggers rollback."

Trap Alerts

  • Don't say: "Just overwrite the current firmware in place" -- this is a bricking risk with no rollback
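  • Don't forget: Signature verification must pass before the new slot is marked bootable, and the bootloader should check it again before jumping -- never defer it until after the new firmware is already running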
  • Don't ignore: Anti-rollback counters -- without them, signed images of old vulnerable versions can be replayed

Follow-up Questions

  • "How would you handle OTA updates for a device with only 128 KB of flash?"
  • "What happens if the health check itself has a bug and never confirms?"
  • "How do you coordinate OTA updates across a fleet of 100,000 devices?"

Practice

What is the primary advantage of A/B partitioning over single-bank updates?

When is the new firmware 'confirmed' in a typical A/B update scheme?

What does an anti-rollback counter prevent?

Why are delta updates useful for cellular IoT devices?

Real-World Tie-In

Smart Home Hub (A/B + Delta) -- Managed OTA for 200,000 deployed smart home hubs over cellular backhaul. Implemented A/B partitioning with bsdiff delta updates that reduced average transfer size from 1.2 MB to 180 KB, saving over $40,000/year in cellular data costs. Health-check confirmation required successful MQTT broker reconnection and sensor polling before committing the new slot. Rollback rate across 12 firmware releases was 0.3%, all caught automatically by the health-check timeout.

Automotive Telematics ECU (ISO 26262 + OTA) -- Designed the OTA subsystem for an ASIL B telematics controller. Updates were dual-signed (ECDSA by engineering, RSA by the OEM), with anti-rollback counters in OTP fuses. The update state machine was itself a safety element -- a windowed watchdog monitored the health-check confirmation phase, and three consecutive failed boot attempts triggered a permanent rollback with a diagnostic trouble code (DTC) logged for dealer diagnosis. The system survived 15,000+ field update cycles with zero bricks.