Standards
Q: What is ISO 26262 and what are ASIL levels?
ISO 26262 is the international standard for functional safety of electrical and electronic systems in road vehicles. It is an adaptation of the broader IEC 61508 standard, tailored specifically for the automotive domain. ISO 26262 defines a safety lifecycle that spans the entire product development process — from concept and system-level design through hardware and software development, production, operation, and decommissioning. The standard requires a systematic identification of hazards through Hazard Analysis and Risk Assessment (HARA), decomposition of safety requirements, verification and validation at every stage, and documented evidence that the system achieves an acceptable level of residual risk.
ASIL (Automotive Safety Integrity Level) is the risk classification scheme defined by ISO 26262, with levels from QM (Quality Management — no special safety requirements) through ASIL A (lowest safety criticality) to ASIL D (highest, most stringent). The ASIL level is determined by three factors evaluated during HARA: Severity (how bad is the harm if the hazard occurs — S0 to S3), Exposure (how likely is the operational situation where the hazard can occur — E0 to E4), and Controllability (how likely is the driver to avoid the harm — C0 to C3). For example, an electric power steering failure at highway speed would be rated S3 (life-threatening), E4 (highly probable scenario), C3 (difficult to control), resulting in ASIL D. A heated seat malfunction might be S1, E4, C3, resulting in ASIL B.
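The S/E/C-to-ASIL mapping can be captured in a few lines of C. This sketch relies on the well-known shortcut that, for nonzero parameters, the classification table in ISO 26262-3 is equivalent to rating the sum S+E+C; treat it as a study aid, not a substitute for the normative table.

```c
#include <assert.h>

/* ASIL determination from HARA parameters (S: 0-3, E: 0-4, C: 0-3).
 * Shortcut consistent with ISO 26262-3 Table 4: for S,E,C >= 1 the table
 * equals rating S+E+C (10 -> D, 9 -> C, 8 -> B, 7 -> A, otherwise QM). */
typedef enum { ASIL_QM, ASIL_A, ASIL_B, ASIL_C, ASIL_D } asil_t;

asil_t classify_asil(int s, int e, int c)
{
    if (s == 0 || e == 0 || c == 0) {
        return ASIL_QM;              /* any zero parameter -> QM */
    }
    switch (s + e + c) {
    case 10: return ASIL_D;
    case 9:  return ASIL_C;
    case 8:  return ASIL_B;
    case 7:  return ASIL_A;
    default: return ASIL_QM;
    }
}
```

The power steering example (S3, E4, C3) sums to 10 and yields ASIL D; the heated seat example (S1, E4, C3) sums to 8 and yields ASIL B.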
Each ASIL level imposes progressively stricter requirements on development processes, design techniques, and verification methods. ASIL D requires the most rigorous techniques: formal verification, comprehensive code coverage (MC/DC — Modified Condition/Decision Coverage), redundant hardware architectures, extensive safety analysis (FMEA, FTA, FMEDA), and independent assessment by a third party. ASIL A might only require basic testing and code review. A crucial concept is ASIL decomposition: a safety requirement at ASIL D can be decomposed into two independent requirements at ASIL B(D) implemented on separate hardware, reducing the development cost while maintaining the overall safety integrity. Interviewers often ask about ASIL to test whether you understand not just what the levels mean, but how they drive architectural decisions — redundancy, independence, and the cost tradeoff between a single ASIL D component versus two ASIL B(D) components.
Q: Compare IEC 61508, ISO 26262, and IEC 62304 — which applies when?
These three standards all address functional safety but target different industries and have distinct scopes. IEC 61508 is the "parent" standard — a generic framework for functional safety of electrical, electronic, and programmable electronic (E/E/PE) safety-related systems. It applies across all industries (industrial machinery, process control, railways, nuclear) and defines Safety Integrity Levels from SIL 1 (lowest) to SIL 4 (highest). IEC 61508 covers the entire safety lifecycle and provides general requirements for hardware reliability (random failure rates measured in PFH — Probability of Dangerous Failure per Hour) and systematic software development processes.
ISO 26262 is the automotive-specific adaptation of IEC 61508. It replaces the SIL classification with ASIL (A through D), adds automotive-specific concepts like HARA tailored to vehicle hazard scenarios, and provides more detailed guidance on automotive hardware metrics (SPFM — Single Point Fault Metric, LFM — Latent Fault Metric) and software development practices suited to automotive ECU development. The first edition (2011) applied only to passenger cars up to 3.5 tons; the second edition (2018) broadened the scope to other road vehicles, including trucks, buses, and motorcycles (motorcycles are addressed in ISO 26262-12), while agricultural and forestry machinery falls under ISO 25119 instead. The key insight is that ISO 26262 is not simply a relabeling of IEC 61508 — it adds automotive domain knowledge, different hardware metric calculations, and a production-focused process model.
IEC 62304 covers medical device software lifecycle processes; it applies the same functional safety principles as IEC 61508 but is tailored to the regulatory requirements of the medical industry. It defines three software safety classes: Class A (no injury possible), Class B (non-serious injury possible), and Class C (serious injury or death possible). IEC 62304 focuses heavily on software development processes — requirements management, architecture documentation, unit testing, integration testing, and risk management (linked to ISO 14971 for medical device risk management). In an interview, the answer to "which applies when" is: IEC 61508 for general industrial safety systems, ISO 26262 for road vehicles, and IEC 62304 for medical device software. Many automotive companies also reference ASPICE (Automotive SPICE) for process maturity alongside ISO 26262 for safety, and medical device companies pair IEC 62304 with FDA 21 CFR Part 820 for quality system requirements.
Q: What is MISRA C and why does it matter for embedded development?
MISRA C is a set of coding guidelines for the C language developed by the Motor Industry Software Reliability Association (originally for automotive, now widely adopted across safety-critical industries). MISRA C:2012 originally defined 143 rules and 16 directives; subsequent amendments, consolidated into the current edition MISRA C:2023, added more. Each guideline is categorized as Mandatory (must be followed, no deviation permitted), Required (must be followed unless a formal deviation is documented and approved), or Advisory (recommended best practice). The rules target the dangerous and undefined behaviors in the C language that cause subtle, hard-to-detect bugs — exactly the kind of bugs that crash safety-critical systems.
MISRA C matters for embedded development because the C language, while powerful and hardware-close, is riddled with undefined behavior, implementation-defined behavior, and easily misused constructs. Examples of MISRA rules: Rule 10.3 prohibits implicit narrowing conversions (assigning a uint32_t to a uint16_t without an explicit cast — silently truncating data), Rule 17.2 prohibits recursive functions (because stack depth becomes unpredictable, critical for systems with limited RAM), Rule 11.3 restricts casting between pointer types (a common source of alignment faults on ARM), and Rule 13.5 prohibits persistent side effects in the right-hand operand of the logical && and || operators (because short-circuit evaluation means the side effect may never execute). Each rule exists because the prohibited pattern has caused real failures in production systems.
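Two of the rules above can be illustrated directly. These are simplified sketches with invented helper names; the rule text is paraphrased, not quoted from the standard.

```c
#include <assert.h>
#include <stdint.h>

static uint16_t sensor_raw;

/* Rule 10.3: no implicit narrowing conversions. */
void store_reading(uint32_t filtered)
{
    /* sensor_raw = filtered;  <- non-compliant: silent truncation */
    sensor_raw = (uint16_t)(filtered & 0xFFFFu);  /* compliant: intent explicit */
}

/* Rule 13.5: no persistent side effects on the RHS of && or ||. */
static uint8_t error_count;
static uint8_t read_status(void) { error_count++; return 1u; }

uint8_t check(uint8_t enabled)
{
    /* if (enabled && read_status()) ...  <- non-compliant: read_status()
     * and its error_count++ never run when enabled is 0               */
    uint8_t status = read_status();   /* compliant: side effect always runs */
    return (uint8_t)(enabled && status);
}
```

In the non-compliant form of `check`, whether `error_count` increments depends on the left operand — exactly the hidden data-flow surprise the rule exists to prevent.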
In practice, MISRA compliance is enforced through static analysis tools like Polyspace, PC-lint, LDRA, Parasoft, or Cppcheck configured with MISRA rule sets. These tools flag violations during compilation, and the development team must either fix the violation or document a formal deviation with justification. Full MISRA compliance on a large codebase is expensive — legacy code almost always has hundreds of violations, and third-party libraries (including standard C library implementations) often violate MISRA rules. The pragmatic approach is to achieve MISRA compliance for new code, document deviations for legacy and library code, and use static analysis in CI/CD to prevent regression. For an interview, emphasize that MISRA C is not about coding style — it is about eliminating categories of bugs that are catastrophic in safety-critical systems (undefined behavior, data loss through implicit conversion, stack overflow from recursion, aliasing violations).
Secure Boot & Crypto
Q: Explain the secure boot chain from ROM to application
A secure boot chain establishes a root of trust anchored in immutable hardware and extends that trust through each stage of the boot process using cryptographic signature verification. The chain begins in ROM (Read-Only Memory) — code burned into the silicon during manufacturing that cannot be modified. This Boot ROM contains the first-stage bootloader and, critically, the public key hash (or the public key itself) used to verify the next boot stage. Because the ROM is immutable, an attacker cannot replace the root of trust without physically replacing the chip. The Boot ROM loads the second-stage bootloader (often called SPL, U-Boot, or a manufacturer-specific bootloader) from flash, computes a cryptographic hash (SHA-256) of the binary, verifies the hash against a digital signature using the root public key (RSA-2048/4096 or ECDSA-P256), and only transfers execution if verification passes.
The second-stage bootloader repeats this process for the next stage — typically the application firmware or an operating system kernel. It loads the image from flash, verifies its signature against a key that was itself authenticated in the previous step (this key may be embedded in the signed bootloader image or stored in a protected key store), and jumps to the application only on success. This creates a chain of trust: ROM verifies bootloader, bootloader verifies firmware, and optionally firmware verifies application-level components. If any stage fails verification, the boot process halts or enters a recovery mode. On more complex systems (embedded Linux), the chain might be: ROM, SPL, U-Boot, kernel, and root filesystem, with each stage verifying the next.
The security of the entire chain depends on several implementation details that interviewers probe: (1) Key storage — the root public key hash must be stored in OTP (One-Time Programmable) fuses, not in writable flash, otherwise an attacker can replace the key and sign their own malicious firmware. (2) Rollback protection — without an anti-rollback counter (also in OTP fuses), an attacker can flash an older, signed firmware version that has known vulnerabilities. (3) JTAG/debug port lockdown — if the debug port is open, an attacker can bypass secure boot entirely by loading code directly into RAM. (4) Fault injection resistance — glitching the power supply or clock during signature verification can cause the check to pass incorrectly; hardened implementations verify the signature twice or use constant-time comparison. Platforms like ARM TrustZone, NXP HAB (High Assurance Boot), STM32 SBSFU, and TI TIFS implement these concepts with varying levels of sophistication.
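The per-stage verification flow can be sketched as below. All the names here are invented, and the hash, signature, and fuse primitives are stubs standing in for real SHA-256/ECDSA/OTP drivers; the point is the control flow — rollback check first, then a doubled verification as a cheap fault-injection countermeasure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one boot stage deciding whether to run the next. */
typedef struct {
    uint32_t version;      /* from the signed image header */
    bool     hash_valid;   /* stub: would be SHA-256 over the image */
    bool     sig_valid;    /* stub: would be ECDSA/RSA over the hash */
} image_t;

static uint32_t fuse_min_version = 2u;   /* stub anti-rollback OTP counter */

static bool verify_once(const image_t *img)
{
    return img->hash_valid && img->sig_valid;
}

bool authorize_boot(const image_t *img)
{
    if (img->version < fuse_min_version) {
        return false;                    /* rollback: older than fuse counter */
    }
    /* Verify twice: a single voltage/clock glitch must now corrupt two
     * independent checks to bypass verification. */
    if (!verify_once(img)) { return false; }
    if (!verify_once(img)) { return false; }
    return true;                         /* caller may now jump to the image */
}
```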
Q: How do you manage cryptographic keys in an embedded device?
Key management is arguably the hardest problem in embedded security — the cryptographic algorithms themselves are well-understood, but protecting the keys on a device with limited resources, physically accessible hardware, and a multi-year field lifetime requires careful architectural decisions. The fundamental principle is that private keys and symmetric keys must never exist in readable flash memory. If a key is stored as a constant in firmware, anyone with a JTAG debugger, a flash reader, or even a firmware binary dump can extract it. Once one device is compromised, every device using the same key is compromised — a catastrophic failure mode for IoT fleets.
The preferred approach uses hardware security: a Secure Element (SE, such as ATECC608A/B, OPTIGA Trust M, or SE050) or a Trusted Platform Module (TPM) stores keys in tamper-resistant silicon where they can be used for cryptographic operations but never exported. The key is generated inside the SE and never leaves it — signing and decryption happen within the secure element, and only the result is returned to the MCU. For devices without a dedicated SE, many modern MCUs provide a hardware key store or OTP fuses for storing a limited number of keys (STM32's OTP area, NXP's PUF-based key storage, ESP32's eFuse key blocks). ARM TrustZone provides a software isolation mechanism where keys are stored in the Secure World and crypto operations are performed via secure function calls from the Non-Secure World.
A complete key management strategy for an embedded product addresses the full lifecycle: Provisioning (how keys are injected during manufacturing — ideally in a secure facility using a Hardware Security Module, or generated on-device and the public key exported for registration), Rotation (how keys are updated if compromised — challenging for embedded devices; a key hierarchy with a long-lived root key and short-lived session keys limits the blast radius), Revocation (how the backend stops trusting a compromised device — certificate revocation lists or short-lived certificates), and Attestation (how the device proves its identity to the cloud — using device-unique certificates signed by a manufacturer CA). Common interview mistakes include: storing symmetric keys in source code, using the same key for all devices in a product line, hardcoding API tokens, and failing to protect the debug interface that could be used to dump key material.
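The key-hierarchy idea — a long-lived root key producing short-lived session keys so that one leaked session key does not expose the root — can be sketched as below. The `kdf32()` mixer is a toy FNV-1a stand-in for a real KDF such as HKDF-SHA256 and must never be used for actual keys.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 32-bit mixer (FNV-1a) standing in for a real KDF. Illustrative only. */
static uint32_t kdf32(uint32_t root_key, uint32_t nonce)
{
    uint32_t h = 2166136261u;            /* FNV-1a offset basis */
    uint32_t data[2] = { root_key, nonce };
    for (int i = 0; i < 2; i++) {
        for (int b = 0; b < 4; b++) {
            h ^= (data[i] >> (8 * b)) & 0xFFu;
            h *= 16777619u;              /* FNV-1a prime */
        }
    }
    return h;
}

/* A fresh nonce per session yields a fresh key; the root key itself never
 * travels over the wire and, ideally, never leaves the secure element. */
uint32_t derive_session_key(uint32_t root_key, uint32_t session_nonce)
{
    return kdf32(root_key, session_nonce);
}
```

The structural point survives the toy crypto: compromise of one derived key limits the blast radius to one session, because the derivation is one-way from the root.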
Q: Symmetric vs asymmetric cryptography — when to use each in embedded?
Symmetric cryptography (AES-128/256, ChaCha20) uses the same key for encryption and decryption. It is fast — hardware AES accelerators on modern MCUs (STM32, nRF52, ESP32) can encrypt at hundreds of Mbps with negligible CPU load and minimal power consumption. AES-128-GCM provides both confidentiality and authentication (AEAD — Authenticated Encryption with Associated Data) in a single operation. The limitation is key distribution: both parties must possess the same secret key, which means you need a secure channel to share it initially. In embedded systems, symmetric crypto is used for bulk data encryption (sensor data, firmware images), session communication after a key has been established, and secure storage of data at rest on flash.
Asymmetric cryptography (RSA-2048/4096, ECDSA-P256, Ed25519, X25519) uses a key pair — a public key that can be freely distributed and a private key that must be protected. It solves the key distribution problem: you can verify a signature or encrypt data using only the public key, without ever sharing the private key. However, asymmetric operations are computationally expensive — an ECDSA-P256 signature verification on a Cortex-M4 without hardware acceleration takes 50-100 ms and consumes significant energy. RSA-2048 is even slower. In embedded systems, asymmetric crypto is used for secure boot (verifying firmware signatures), TLS/DTLS handshakes (establishing a session key), device authentication (proving identity to a server using a device certificate), and firmware update signature verification.
The practical pattern in embedded systems is to combine both: use asymmetric crypto for authentication and key exchange (infrequent, latency-tolerant operations), and symmetric crypto for bulk data protection (frequent, performance-sensitive operations). A TLS handshake exemplifies this: the client and server use ECDHE (Elliptic Curve Diffie-Hellman Ephemeral) to negotiate a shared secret, derive symmetric session keys from it, and then use AES-GCM for all subsequent communication. Secure boot uses asymmetric crypto (ECDSA signature verification) once at boot time, while encrypted flash storage uses AES continuously. The interview-ready insight is understanding the tradeoff: asymmetric solves trust establishment but is too slow for data throughput; symmetric is fast but requires a pre-shared key. The combination gives you both trust and performance. Always prefer algorithms with hardware accelerator support on your target MCU to minimize the power and latency impact.
OTA Updates
Q: How do you implement A/B firmware updates with rollback?
The A/B update scheme (also called ping-pong or dual-bank) maintains two complete firmware slots in flash memory — Slot A and Slot B. At any given time, one slot contains the active (running) firmware and the other contains either the previous version or a newly downloaded update. When an OTA update is available, the new firmware image is downloaded and written to the inactive slot while the current firmware continues running uninterrupted. Once the download is complete and the image is verified (CRC or cryptographic signature), the bootloader metadata is updated to mark the inactive slot as the "pending" boot target, and the device reboots.
The bootloader is the critical component that enables rollback. On reboot, the bootloader checks the metadata (stored in a dedicated flash sector or OTP area) and boots from the newly written slot. But it does not immediately mark the new firmware as "confirmed." Instead, the new firmware must explicitly call a confirmation function (e.g., boot_set_confirmed() in MCUboot) after it has successfully initialized, passed self-tests, and established connectivity. If the new firmware crashes before confirming — due to a bug, incompatible configuration, or corrupted image — the watchdog timer fires, the device resets, and the bootloader sees that the pending slot was never confirmed. It then rolls back to the previous slot, which is known-good. This mechanism ensures that a bad update can never permanently brick a device, even if the failure occurs during the first few seconds of boot.
The tradeoff of A/B updates is flash usage: you need at least 2x the firmware size in flash (plus bootloader and metadata sectors). For a 256 KB firmware image on a 1 MB flash, this is manageable. For a 2 MB firmware on a 4 MB flash, it consumes most of the available space. Alternatives include differential updates (delta patches that modify the active slot in-place, with a recovery partition for rollback) and single-slot with external flash (download to external SPI flash, verify, then copy to internal flash). MCUboot (used with Zephyr), ESP-IDF's OTA library, and STM32's SBSFU all implement A/B update schemes with minor variations. Key implementation details interviewers probe: how do you handle power loss during the flash write (answer: the inactive slot is being written, so the active slot remains intact), how do you prevent downgrade attacks (answer: include a version counter checked by the bootloader), and how do you verify integrity before rebooting (answer: compute SHA-256 over the entire image and verify the signature before marking it as pending).
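The confirm-or-rollback logic can be sketched as a small state machine. This is a simplified model with invented state names; in MCUboot the equivalent state lives in flash image trailers.

```c
#include <assert.h>

typedef enum { SLOT_EMPTY, SLOT_PENDING, SLOT_CONFIRMED } slot_state_t;

typedef struct {
    slot_state_t state[2];
    int          active;     /* slot index the bootloader last booted */
} boot_meta_t;

/* Bootloader decision at every reset. */
int select_boot_slot(boot_meta_t *m)
{
    int other = 1 - m->active;
    if (m->state[other] == SLOT_PENDING) {
        /* Pending slot gets exactly one trial boot: demote it now, so a
         * crash before confirmation leads straight back to the old slot. */
        m->state[other] = SLOT_EMPTY;
        m->active = other;
        return other;
    }
    if (m->state[m->active] != SLOT_CONFIRMED &&
        m->state[other] == SLOT_CONFIRMED) {
        m->active = other;   /* trial slot never confirmed: roll back */
    }
    return m->active;
}

/* Called by new firmware after self-tests pass (cf. MCUboot's
 * boot_set_confirmed()). */
void confirm_active_slot(boot_meta_t *m)
{
    m->state[m->active] = SLOT_CONFIRMED;
}
```

Note the ordering: the pending flag is cleared before the trial boot, so the watchdog-reset path needs no extra bookkeeping — an unconfirmed active slot is itself the evidence that the update failed.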
Q: How do you make OTA updates power-fail safe?
Power-fail safety during OTA updates means that the device must recover to a working state regardless of when power is lost — during download, during flash erase/write, during metadata update, or during reboot. This is not a theoretical concern: battery-powered IoT devices lose power unpredictably, and even line-powered devices experience brownouts. A bricked device in the field that requires physical intervention to recover is an unacceptable failure mode for any commercial product.
The A/B (dual-slot) scheme provides inherent power-fail safety during the download and flash write phases because the active firmware is never modified. If power is lost while writing the new image to the inactive slot, the partially written slot is simply invalid — the device reboots into the existing active firmware and re-requests the update. The image integrity check (CRC or signature verification) that runs before marking the new slot as bootable ensures that a partially written image is never activated. The only critical window is the metadata update — the moment when the bootloader's boot selection flag changes from "boot Slot A" to "boot Slot B." If power is lost during this write, the metadata could be corrupted.
Robust implementations handle the metadata update using one of several techniques: (1) Atomic flag with redundancy — store the boot selection in two independent flash locations; the bootloader reads both and uses a voting scheme to determine the valid selection. (2) Transaction log — write a "pending" flag first, then update the boot target, then clear the pending flag. On boot, if the pending flag is set, the bootloader knows the update was interrupted and falls back to the known-good slot. (3) Hardware-supported atomic writes — some MCUs guarantee that a single word write to flash is atomic (either completes fully or does not happen); the boot selection flag is a single word. STM32's option bytes and ESP32's otadata partition use variations of these patterns. The broader principle is that every step in the update process must be idempotent — repeating it after a power cycle produces the same correct result. Download can be resumed using range requests (HTTP) or block-level tracking. Flash operations can be made repeatable: re-erasing an already erased sector (all 0xFF) is harmless, and block-level progress tracking lets completed writes be skipped on retry (rewriting an already programmed flash word is not universally safe; parts with on-chip flash ECC typically forbid it). The entire pipeline must be designed so that power loss at any point leads to exactly one outcome: the device boots working firmware.
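Technique (1) — the boot flag stored redundantly — can be sketched with an inverted shadow word per copy, so a torn write is detectable. This is a simplified in-RAM model with invented names; real code would place the two copies in separate flash sectors and write them one after the other.

```c
#include <assert.h>
#include <stdint.h>

/* One copy of the boot-target flag: the value plus its bitwise inverse.
 * A write interrupted between the two words leaves an inconsistent copy. */
typedef struct { uint32_t value; uint32_t inv; } flag_copy_t;

static int copy_valid(const flag_copy_t *c)
{
    return c->value == ~c->inv;
}

/* Returns the boot target, or fallback if both copies are torn. */
uint32_t read_boot_target(const flag_copy_t c[2], uint32_t fallback)
{
    if (copy_valid(&c[0])) { return c[0].value; }
    if (copy_valid(&c[1])) { return c[1].value; }
    return fallback;
}

/* Update copy 0 fully, then copy 1: a power cut between the two writes
 * always leaves at least one self-consistent copy. */
void write_boot_target(flag_copy_t c[2], uint32_t target)
{
    c[0].value = target; c[0].inv = ~target;
    c[1].value = target; c[1].inv = ~target;
}
```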
Q: What is anti-rollback protection and why is it important?
Anti-rollback protection prevents an attacker (or a faulty update mechanism) from installing an older, previously valid firmware version that contains known security vulnerabilities. Without anti-rollback, an attacker who gains physical or remote access to the update mechanism can downgrade the firmware to version 1.0 — which was legitimately signed by the manufacturer — and exploit a vulnerability that was patched in version 2.0. The old firmware passes signature verification because it was signed with the correct key, so secure boot alone does not prevent this attack. Anti-rollback adds a version monotonicity check: the bootloader refuses to boot any firmware with a version number lower than the minimum acceptable version.
The implementation uses a monotonic counter stored in tamper-resistant, non-volatile storage — typically OTP (One-Time Programmable) fuses or a protected flash region. Each firmware image includes a version number in its signed header. When the bootloader verifies a new firmware image, it compares the image's version against the counter stored in OTP. If the image version is lower than the counter, the bootloader rejects it regardless of the signature. When a firmware update is successfully confirmed (post-self-test), the bootloader increments the OTP counter to match the new version, permanently preventing rollback to any earlier version. OTP fuses are ideal because they can only be written once per bit — setting a fuse is irreversible, providing hardware-guaranteed monotonicity.
The tradeoff is that OTP fuses are a limited resource — a typical MCU might have 32-128 fuse bits available for version counting, supporting that many version increments over the product lifetime. Strategies to conserve fuses include: using epoch-based versioning where the fuse counter represents a "security epoch" that only increments when a critical security fix is deployed (not for every feature release), or using thermometer coding where each fuse bit represents one increment (32 fuses = 32 increments). Some platforms (e.g., TrustZone-M with secure storage, or a secure element) provide a larger monotonic counter in protected flash instead of fuses. An important nuance: anti-rollback must be applied carefully during development and manufacturing — if you increment the counter prematurely during factory testing, you might lock out legitimate firmware versions. Production provisioning workflows must account for this by setting the counter only at the final firmware programming stage.
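Thermometer coding maps naturally onto a fuse word. This sketch simulates a 32-bit OTP word in which each set bit is one irreversible increment; on real hardware the word would live in fuses that only permit 0-to-1 transitions.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t popcount32(uint32_t w)
{
    uint32_t n = 0;
    while (w) { w &= w - 1u; n++; }   /* clear lowest set bit */
    return n;
}

/* Minimum accepted version = number of burned fuses. */
uint32_t min_accepted_version(uint32_t fuses)
{
    return popcount32(fuses);
}

/* Burn the next fuse in thermometer order. On real OTP hardware this is
 * irreversible; here we just set the next bit. */
uint32_t burn_next_fuse(uint32_t fuses)
{
    uint32_t n = popcount32(fuses);
    if (n >= 32u) { return fuses; }    /* counter exhausted */
    return fuses | (1u << n);
}

/* Bootloader check: reject any image older than the fuse counter. */
int image_version_ok(uint32_t fuses, uint32_t image_version)
{
    return image_version >= min_accepted_version(fuses);
}
```

Using population count rather than the raw word value also gives some tolerance: a single stuck bit shifts the count by at most one increment rather than corrupting the whole number.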
Safety Mechanisms
Q: What safety mechanisms would you implement for a safety-critical embedded system?
Safety mechanisms in embedded systems are designed to detect, contain, and respond to hardware faults and software errors before they can cause harm. The selection of mechanisms depends on the required safety integrity level (ASIL or SIL), but a comprehensive safety-critical system typically implements multiple layers of defense. Hardware-level mechanisms include: ECC (Error Correcting Code) on RAM and flash — single-bit errors are corrected transparently, multi-bit errors are detected and flagged; lockstep CPU cores — two identical cores execute the same code in lockstep, and a comparator checks that their outputs match on every cycle (any divergence indicates a fault and triggers a safe state); voltage and clock monitors — dedicated supervisory circuits detect brownouts, overvoltage, and clock frequency deviations that could cause erratic behavior; and hardware watchdog timers — independent oscillator, cannot be disabled by software, resets the system if not serviced within the timeout window.
Software-level mechanisms include: stack overflow detection using MPU (Memory Protection Unit) guard regions or canary values — the MPU is configured to make the memory region immediately below the stack non-accessible, so a stack overflow triggers a MemManage fault instead of silently corrupting adjacent data; control flow monitoring — inserting sequence counters at critical decision points to verify that the code executed the expected path (not just that it reached the right destination); redundant computation — performing safety-critical calculations twice (ideally with a diverse algorithm or on different data representations) and comparing results; plausibility checks on sensor data — range validation (is the temperature reading physically possible?), rate-of-change limits (did the reading jump impossibly fast?), and cross-sensor comparison (do the redundant sensors agree within tolerance?); and periodic RAM and ROM tests using CRC or March test algorithms to detect memory corruption.
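A minimal plausibility check combining range validation, rate-of-change limiting, and cross-sensor comparison might look like the sketch below; the limits are illustrative, not from any specification.

```c
#include <assert.h>
#include <stdlib.h>

#define TEMP_MIN_C      (-40)
#define TEMP_MAX_C      150
#define MAX_STEP_C      10      /* max plausible change per sample period */
#define MAX_DISAGREE_C  5       /* tolerance between redundant sensors */

typedef enum { SENSOR_OK, SENSOR_RANGE, SENSOR_RATE, SENSOR_DISAGREE } sensor_fault_t;

sensor_fault_t check_temperature(int reading, int previous, int redundant)
{
    if (reading < TEMP_MIN_C || reading > TEMP_MAX_C) {
        return SENSOR_RANGE;        /* physically impossible value */
    }
    if (abs(reading - previous) > MAX_STEP_C) {
        return SENSOR_RATE;         /* jumped faster than physics allows */
    }
    if (abs(reading - redundant) > MAX_DISAGREE_C) {
        return SENSOR_DISAGREE;     /* redundant sensors disagree */
    }
    return SENSOR_OK;
}
```

In a real system each fault code would feed a debounce counter and, on confirmation, the defined fault reaction (substitute value, degraded mode, or safe state).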
System-level mechanisms include: safe state definition — every safety analysis must identify what "safe" means when a fault is detected (for a motor controller, safe might be cutting power; for a ventilator, safe might be maintaining basic airflow); fault reaction time — the maximum time from fault detection to reaching the safe state, which must be shorter than the time for the fault to cause harm; diagnostic coverage — the percentage of dangerous faults detected by the safety mechanisms, a key metric in ISO 26262 hardware metric calculations (SPFM and LFM require 90-99% diagnostic coverage depending on ASIL level); and independent monitoring — a separate, simpler watchdog processor that monitors the main processor's outputs and can independently activate the safe state if the main processor fails. The interview-ready insight is that no single mechanism is sufficient — safety-critical systems require defense in depth, where each mechanism covers failure modes that others might miss.
Q: Explain fail-safe vs fail-operational with examples
Fail-safe means that when a fault is detected, the system transitions to a safe state that eliminates the hazard, even if it means the system stops providing its intended function. The key assumption is that a safe state exists and is acceptable. Examples: an industrial robot detects an encoder fault and stops all motion — a stationary robot cannot injure workers. An electric heater detects a temperature sensor failure and shuts off the heating element — a cold room is uncomfortable but not dangerous. An automotive ABS system detects an internal fault and disables ABS, reverting to conventional braking — the driver loses anti-lock capability but retains basic braking. In all these cases, the loss of function is acceptable because the alternative (continued operation with a potential fault) could cause harm.
Fail-operational means that when a fault is detected, the system continues to provide its intended function (or a degraded but still functional version of it), because the loss of function itself is dangerous. Fail-operational requires redundancy — duplicate or diverse hardware and software that can take over when the primary system fails. Examples: a fly-by-wire aircraft flight control system cannot simply "stop" when a computer fails — it must continue providing flight control. Typically this is achieved with triple-redundant computers using majority voting. A self-driving vehicle (SAE Level 4+) must remain controllable after a fault because there may be no human driver to take over — it uses redundant steering actuators, redundant power supplies, and redundant compute platforms to continue operating in a degraded mode long enough to execute a minimum risk maneuver (pulling over safely). A medical ventilator cannot simply stop — it uses redundant pressure sensors, backup motor controllers, and a separate safety-critical monitoring processor to ensure airflow continues.
The tradeoff between fail-safe and fail-operational is cost and complexity versus availability. Fail-safe systems are dramatically simpler and cheaper — detecting the fault and shutting down is straightforward. Fail-operational systems require redundant hardware (2x or 3x cost), diverse software (to avoid common-mode failures), fault isolation mechanisms (ensuring one failing component does not corrupt the redundant one), and switchover logic that itself must be verified for safety. ISO 26262 captures this distinction in the ASIL decomposition concept: a fail-operational architecture decomposes a high-ASIL requirement across redundant, independent channels. The interview question tests whether you understand that "safe" does not always mean "off" — in systems where loss of function is itself hazardous, the safety strategy must maintain operation, which fundamentally changes the architecture, cost, and development effort.
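The heart of a fail-operational channel is the 2-out-of-3 voter. A minimal sketch follows; real voters also latch which channel disagreed and trigger the safe state when all three channels differ (not modeled here).

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int32_t value;      /* voted output */
    int     agreed;     /* 1 if all three channels matched */
} vote_result_t;

/* Triple-modular-redundancy voter: any single faulty channel is outvoted,
 * and the disagreement is reported so the channel can be flagged. */
vote_result_t vote_2oo3(int32_t a, int32_t b, int32_t c)
{
    vote_result_t r;
    r.agreed = (a == b) && (b == c);
    if (a == b || a == c) {
        r.value = a;
    } else {
        r.value = b;    /* b == c; if all three differ, a real system
                           enters its safe state instead (not modeled) */
    }
    return r;
}
```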
Q: How do you detect and handle memory corruption at runtime?
Memory corruption in embedded systems — caused by hardware faults (single-event upsets from radiation, aging SRAM cells, marginal voltage levels), software bugs (buffer overflows, wild pointers, stack overflow, DMA misconfiguration), or environmental stress (EMI, temperature extremes) — is insidious because it can silently alter data or code without triggering an immediate crash. Detection requires proactive runtime checks at multiple levels.
RAM integrity testing uses algorithms like the March test (March C-, March B) that write and read test patterns to every cell in a systematic order designed to detect stuck-at faults, coupling faults, and address decoder faults. Running a full March test on the entire RAM takes time, so safety-critical systems typically test a rotating subset of RAM in each main loop iteration (a "walking" test), completing full coverage over many cycles. The key challenge is testing RAM that is in use — you must save the contents, run the test, and restore the contents, all within a critical section. For flash/ROM integrity, a CRC-32 or SHA-256 is computed over the entire code region and compared against a stored reference value. This check runs periodically (every second or every main loop iteration for high-integrity systems) and detects bit-flips in the stored program. Most MCUs with ECC on flash detect and correct single-bit errors in hardware, but the software CRC catches multi-bit errors and corruption from failed flash writes.
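The ROM check reduces to a CRC-32 over the code region compared against a build-time reference. A bitwise sketch is below; production firmware would use a lookup table or the MCU's CRC peripheral and walk the region incrementally across main-loop iterations.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320). */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
    }
    return ~crc;
}

/* Periodic check: compare the code region against a reference value
 * computed and stored at build time (e.g. appended after linking). */
int rom_check_ok(const uint8_t *rom, size_t len, uint32_t expected_crc)
{
    return crc32_calc(rom, len) == expected_crc;
}
```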
Stack overflow detection is critical because stack corruption produces some of the hardest-to-diagnose failures — variables on the stack are silently overwritten, causing incorrect behavior that may not manifest until much later. The MPU (Memory Protection Unit) is the most reliable mechanism: configure a guard region (typically 32-256 bytes) immediately below the stack as "no access." Any stack growth into this region triggers an immediate MemManage fault, catching the overflow at the exact instruction that caused it. Without an MPU, the stack canary pattern places a known value (like 0xDEADBEEF) at the bottom of the stack and periodically checks it — if the value has changed, the stack has overflowed. This is less reliable because the corruption might skip over the canary, but it catches most cases. For global data integrity, redundant storage with voting stores critical variables in three copies and uses majority voting on read — a single corrupted copy is outvoted. Combined with ECC RAM, MPU stack guards, periodic ROM CRC checks, and plausibility validation on all sensor inputs, these mechanisms provide defense-in-depth against memory corruption in safety-critical systems.
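The canary variant can be sketched in a few lines; the guard size and pattern here are illustrative. An RTOS fills the bottom of each task stack with the pattern at task creation and checks it periodically (FreeRTOS's stack high-water-mark checking works on the same principle).

```c
#include <assert.h>
#include <stdint.h>

#define CANARY_WORDS 4u
#define CANARY_VALUE 0xDEADBEEFu

/* Fill the guard words at the bottom of a (descending) stack. */
void canary_init(uint32_t *stack_bottom)
{
    for (uint32_t i = 0; i < CANARY_WORDS; i++) {
        stack_bottom[i] = CANARY_VALUE;
    }
}

/* Returns 1 while the canary is intact; 0 means the stack has grown into
 * the guard words and the task must be treated as failed. */
int canary_intact(const uint32_t *stack_bottom)
{
    for (uint32_t i = 0; i < CANARY_WORDS; i++) {
        if (stack_bottom[i] != CANARY_VALUE) { return 0; }
    }
    return 1;
}
```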