How do you make OTA updates power-fail safe?
Power-fail safety during OTA updates means that the device must recover to a working state regardless of when power is lost — during download, during flash erase/write, during metadata update, or during reboot. This is not a theoretical concern: battery-powered IoT devices lose power unpredictably, and even line-powered devices experience brownouts. A bricked device in the field that requires physical intervention to recover is an unacceptable failure mode for any commercial product.
The A/B (dual-slot) scheme provides inherent power-fail safety during the download and flash write phases because the active firmware is never modified. If power is lost while writing the new image to the inactive slot, the partially written slot is simply invalid — the device reboots into the existing active firmware and re-requests the update. The image integrity check (CRC or signature verification) that runs before marking the new slot as bootable ensures that a partially written image is never activated. The only critical window is the metadata update — the moment when the bootloader's boot selection flag changes from "boot Slot A" to "boot Slot B." If power is lost during this write, the metadata could be corrupted.
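The A/B flow described above can be modelled on a host to see why an interrupted write is harmless. The following Python sketch is a toy simulation, not firmware: the `Device` class, the 64-byte slot size, the byte-at-a-time write, and the `PowerLoss` exception are hypothetical names invented for illustration, and the boot selection is assumed to switch atomically (the metadata window is a separate problem). It uses CRC32 from the standard `zlib` module as the integrity check; a real product would likely verify a signature as well.

```python
import zlib

SLOT_SIZE = 64   # bytes per slot in this toy model (assumption)
ERASED = 0xFF    # value of erased flash


class PowerLoss(Exception):
    """Raised by the simulated flash when power is cut mid-write."""


class Device:
    def __init__(self, firmware: bytes):
        self.slots = [bytearray([ERASED] * SLOT_SIZE) for _ in range(2)]
        self.slots[0][:len(firmware)] = firmware
        self.active = 0                      # assumed to change atomically
        self.writes_until_power_loss = None  # None means power never fails

    def _write_byte(self, slot: int, offset: int, value: int) -> None:
        if self.writes_until_power_loss is not None:
            if self.writes_until_power_loss == 0:
                raise PowerLoss()
            self.writes_until_power_loss -= 1
        self.slots[slot][offset] = value

    def install(self, image: bytes, expected_crc: int) -> bool:
        """Write image to the inactive slot; activate it only if it verifies."""
        if len(image) > SLOT_SIZE:
            return False
        inactive = 1 - self.active
        # Erase the inactive slot (modelled as instantaneous here).
        self.slots[inactive][:] = bytes([ERASED]) * SLOT_SIZE
        # The active slot is never touched while the new image is written.
        for offset, value in enumerate(image):
            self._write_byte(inactive, offset, value)
        # Integrity check runs before the slot can become bootable.
        written = bytes(self.slots[inactive][:len(image)])
        if zlib.crc32(written) != expected_crc:
            return False
        self.active = inactive
        return True

    def running_firmware(self, length: int) -> bytes:
        return bytes(self.slots[self.active][:length])
```

Setting `writes_until_power_loss = 5` before calling `install` cuts power after five bytes. Afterwards `active` still points at the old slot, and calling `install` again with the limit removed completes the update.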
Robust implementations protect the metadata update with one of several techniques: (1) Atomic flag with redundancy: store the boot selection in two independent flash locations, each with its own validity check such as a CRC and a sequence number; the bootloader reads both copies and trusts the newest one that validates, so a torn write to one copy never destroys the last good selection. (2) Transaction log: write a "pending" flag first, then update the boot target, then clear the pending flag. On boot, if the pending flag is still set, the bootloader knows the update was interrupted and falls back to the known-good slot. (3) Hardware-supported atomic writes: some MCUs guarantee that a single-word write to flash is atomic (it either completes fully or does not happen), so the boot selection flag is kept to a single word. STM32's option bytes and ESP32's otadata partition use variations of these patterns.
The broader principle is that every step in the update process must be idempotent: repeating it after a power cycle produces the same correct result. Downloads can be resumed using HTTP range requests or block-level tracking. Flash writes can be repeated by erasing the sector again and reprogramming it, because re-erasing already-erased flash (all 0xFF) is harmless. Programming the same data twice without an erase is tolerated by many flash technologies but not all (some ECC-protected flash forbids it), so erase-then-write is the safer default. The entire pipeline must be designed so that power loss at any point leads to exactly one outcome: the device boots working firmware.
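The redundant-metadata technique can be checked on a host by cutting power at every possible byte of the metadata write. The Python sketch below is a simulation under stated assumptions, not any vendor's actual layout: the 9-byte record (sequence number, slot, CRC32), the `MetaFlash` class, and the helper names are all hypothetical, erase is modelled as instantaneous, and sequence-number wraparound is ignored.

```python
import struct
import zlib

RECORD = struct.Struct("<IBI")  # seq (4 bytes), slot (1 byte), crc32 (4 bytes)
BODY_LEN = 5                    # bytes covered by the CRC: seq + slot
ERASED = 0xFF


class PowerLoss(Exception):
    """Raised by the simulated flash when power is cut mid-write."""


class MetaFlash:
    """Two independent metadata copies, programmed one byte at a time."""

    def __init__(self):
        self.copies = [bytearray([ERASED] * RECORD.size) for _ in range(2)]
        self.budget = None  # bytes that may be written before power is cut

    def program(self, index: int, data: bytes) -> None:
        self.copies[index][:] = bytes([ERASED]) * RECORD.size  # erase first
        for i, value in enumerate(data):
            if self.budget is not None:
                if self.budget == 0:
                    raise PowerLoss()
                self.budget -= 1
            self.copies[index][i] = value


def encode(seq: int, slot: int) -> bytes:
    body = struct.pack("<IB", seq, slot)
    return body + struct.pack("<I", zlib.crc32(body))


def decode(raw):
    """Return (seq, slot) if the copy validates, else None."""
    seq, slot, crc = RECORD.unpack(bytes(raw))
    if slot not in (0, 1) or zlib.crc32(bytes(raw[:BODY_LEN])) != crc:
        return None
    return seq, slot


def select_slot(flash: MetaFlash, default: int = 0) -> int:
    """Bootloader side: the newest copy that validates wins."""
    valid = [r for r in (decode(c) for c in flash.copies) if r is not None]
    if not valid:
        return default
    return max(valid, key=lambda r: r[0])[1]


def set_boot_slot(flash: MetaFlash, slot: int) -> None:
    """Updater side: overwrite the older or invalid copy, never the newest."""
    seqs = []
    for copy in flash.copies:
        record = decode(copy)
        seqs.append(record[0] if record is not None else -1)
    target = seqs.index(min(seqs))
    flash.program(target, encode(max(seqs) + 1, slot))
```

Because the newest valid copy is never the one being rewritten, an interrupted `set_boot_slot` leaves the bootloader with the previous selection, and simply calling it again after reboot completes the switch. That makes the metadata step idempotent in the sense described above.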
Source: Safety & Security Q&A
