
Networking Protocols — Interview Questions & Answers

Practice interview questions on TCP/IP, sockets, MQTT, TLS, and HTTP/REST for embedded IoT systems.


TCP/IP & Sockets

Q: Explain the TCP/IP layer model and how it applies to embedded systems.

The TCP/IP model has four layers, each with a distinct responsibility. The Link layer handles physical transmission over a specific medium — Ethernet, Wi-Fi, or cellular. The Internet layer (IP) provides addressing and routing so packets can traverse multiple networks. The Transport layer (TCP or UDP) provides end-to-end communication between applications — TCP adds reliability with retransmissions and flow control, while UDP provides raw datagrams. The Application layer implements the protocol your application actually speaks — HTTP, MQTT, CoAP, DNS, or a custom binary protocol.

In embedded systems, you interact with this model differently depending on your platform. On embedded Linux, the kernel implements the full TCP/IP stack, and your application uses the POSIX sockets API — the same API as a desktop application. You rarely think about the lower layers. On a bare-metal or RTOS system, you typically use a lightweight IP stack like lwIP or Zephyr's native stack. These stacks implement the same layers but with much smaller memory footprints (lwIP can run with as little as 40 KB of RAM) by using memory pools instead of malloc, supporting fewer simultaneous connections, and offering a "raw" API that avoids data copies.

The key embedded-specific considerations at each layer are: the link layer may be constrained (a cellular modem with high latency, a low-power radio with tiny MTU), the IP layer must handle intermittent connectivity gracefully, the transport layer choice (TCP vs UDP) directly impacts RAM usage (each TCP connection requires send/receive buffers), and the application protocol must be bandwidth-efficient (binary formats like CBOR or Protobuf instead of JSON).

Q: TCP vs UDP — when to use each in embedded systems?

TCP provides reliable, ordered, byte-stream delivery. The kernel (or your embedded IP stack) handles retransmissions, duplicate detection, flow control, and congestion control. This reliability comes at a cost: each TCP connection maintains state (sequence numbers, timers, send/receive buffers consuming 2-16 KB of RAM each), and the three-way handshake adds latency to connection setup. TCP is the right choice when you need guaranteed delivery — firmware updates, configuration changes, log uploads, or any command-and-control channel where a lost message would cause incorrect behavior.

UDP provides unreliable, unordered datagrams — fire and forget. There is no connection state, no handshake, and minimal overhead (8-byte header vs TCP's 20-byte minimum). UDP is the right choice for time-sensitive data where freshness matters more than completeness — periodic sensor readings (if you miss one, the next reading arrives in a second), audio/video streaming (retransmitting a stale frame is worse than skipping it), and local service discovery (mDNS, SSDP). If you need reliability on top of UDP, you implement it yourself — protocols like DTLS, CoAP, and QUIC do exactly this.

In embedded systems, the tradeoff often comes down to RAM. A device with 64 KB of RAM supporting 50 simultaneous TCP connections would need 50 sets of send/receive buffers — potentially exceeding available memory. UDP connections are stateless and cost almost nothing. Many IoT protocols (CoAP, LwM2M) use UDP for this reason. However, MQTT — the most popular IoT protocol — runs over TCP because its QoS mechanisms are simpler when the transport already guarantees delivery. The choice depends on your specific constraints: available RAM, network reliability, latency requirements, and whether the application protocol already handles reliability.

Q: What is the sockets API and how does it map to embedded networking?

The sockets API (Berkeley sockets, POSIX sockets) is the standard programming interface for network communication. A socket is a file-descriptor-like handle representing one endpoint of a communication channel. The basic flow for a TCP server is: socket() to create the endpoint, bind() to assign a local address and port, listen() to mark it as a server socket, accept() to wait for incoming connections (returning a new socket for each client), and send()/recv() to exchange data. A TCP client calls socket(), then connect() to reach the server, then send()/recv(). For UDP, there is no connection — you use sendto() and recvfrom() with explicit addresses.

On embedded Linux, you use the exact same POSIX sockets API as on a desktop — #include <sys/socket.h> and link against libc. The kernel handles the TCP/IP stack. On bare-metal/RTOS with lwIP, you have two options: the "netconn" or "socket" API that closely mirrors POSIX sockets (with blocking calls in a threaded environment), or the "raw" API that uses callbacks (your function is called when data arrives, with no threads required). The raw API is more efficient but harder to program.

The critical embedded consideration is non-blocking I/O and multiplexing. A blocking recv() ties up a thread waiting for data. On a desktop system with plenty of RAM, you might spawn a thread per connection. On an embedded system with 10 KB stacks and limited threads, you use non-blocking sockets with select(), poll(), or (on Linux) epoll() to handle multiple connections in a single thread. Frameworks like libevent or libev abstract this for embedded Linux. On RTOS systems, lwIP's raw API inherently works this way — callbacks are invoked from the network thread without blocking.

Q: How do you handle socket errors and implement reconnection logic?

Socket operations can fail at any time — the network is unreliable, the remote server can crash, and wireless links can drop. Robust embedded code must handle every error return and implement automatic reconnection. The key errors to handle: connect() can return ECONNREFUSED (server not running), ETIMEDOUT (unreachable), or ENETUNREACH (no route). send() can return EPIPE or ECONNRESET (peer closed the connection). recv() returning 0 means the peer performed an orderly shutdown. recv() returning -1 with EAGAIN or EWOULDBLOCK on a non-blocking socket means no data is available yet — this is not an error.

A robust reconnection pattern uses exponential backoff with jitter. When a connection fails, wait 1 second, then 2 seconds, then 4, 8, 16, up to a maximum (e.g., 60 seconds). Add random jitter (plus or minus 20%) to prevent thousands of IoT devices from reconnecting to the server at the exact same instant after a cloud outage (the "thundering herd" problem). Reset the backoff timer to the minimum after a successful connection that has been stable for a configurable duration.

For embedded systems, also consider: TCP keepalive to detect dead connections (the default 2-hour timeout is far too long — configure TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT for faster detection, e.g., probe after 30 seconds of idle, every 10 seconds, 3 probes). Application-level heartbeat if the protocol supports it (MQTT PINGREQ, WebSocket ping/pong) — this catches cases where the TCP connection is technically open but data is not flowing. Persistent state across reconnections: buffer unsent messages in flash or a circular RAM buffer so they can be transmitted after reconnecting. MQTT's clean session flag and message persistence mechanisms are designed exactly for this scenario.

MQTT & Application Protocols

Q: What is MQTT and why is it popular for IoT?

MQTT (Message Queuing Telemetry Transport) is a lightweight publish-subscribe messaging protocol designed for constrained devices and unreliable networks. A device (client) connects to a central broker (server), subscribes to topics it is interested in, and publishes messages to topics. The broker handles all routing — publishers and subscribers are fully decoupled and do not need to know about each other. A minimal MQTT CONNECT packet is only 12-14 bytes, and a minimal PUBLISH with a short topic and small payload can be under 20 bytes.

MQTT is popular for IoT for several practical reasons. Low overhead: the protocol header is as small as 2 bytes, compared to HTTP headers that typically consume 200-800 bytes per request — this matters on metered cellular connections where you pay per byte. Push model: the broker pushes messages to subscribed devices instantly, whereas HTTP requires the device to poll the server repeatedly (wasting bandwidth and battery). Built-in reliability: three QoS levels let you trade off between delivery guarantee and overhead. Last Will and Testament (LWT): the broker can automatically publish a message when a device disconnects unexpectedly, enabling presence detection without polling. Retained messages: the broker stores the last message on each topic and delivers it immediately to new subscribers, so a device that just connected can get the current state without requesting it.

The ecosystem is another factor: MQTT brokers are well-supported (Mosquitto, EMQX, HiveMQ, AWS IoT Core, Azure IoT Hub), client libraries exist for every language and platform (including lightweight C clients like Eclipse Paho Embedded C for bare-metal and RTOS targets, and libmosquitto on embedded Linux), and the protocol is an OASIS and ISO standard (ISO/IEC 20922). For most IoT use cases — telemetry upload, remote command delivery, device status monitoring — MQTT is the default choice.

Q: Explain MQTT QoS levels 0, 1, and 2 with tradeoffs.

MQTT defines three Quality of Service levels that control message delivery guarantees between a client and the broker.

QoS 0 ("at most once") — fire and forget. The sender publishes the message and does not wait for acknowledgment. The message may be lost if the network drops the packet. This is the fastest and lowest-overhead option: one packet, no state, no retransmission. Use QoS 0 for high-frequency sensor telemetry where losing an occasional reading is acceptable (e.g., temperature every 5 seconds — a missing sample is irrelevant because the next one arrives shortly).

QoS 1 ("at least once") — the sender stores the message and retransmits until the broker acknowledges receipt with a PUBACK. This guarantees delivery but may deliver the message more than once if the PUBACK is lost (the sender retransmits, the broker processes the duplicate). This is the most commonly used level — it provides reliable delivery with a simple two-packet handshake (PUBLISH + PUBACK). Your application must be idempotent (handle duplicate messages gracefully) or deduplicate using message IDs. Use QoS 1 for alerts, commands, and state changes.

QoS 2 ("exactly once") — a four-packet handshake (PUBLISH, PUBREC, PUBREL, PUBCOMP) guarantees the message is delivered exactly once with no duplicates. This is the most reliable but also the most expensive: it requires four packets, the broker must maintain state for each in-flight message, and both sides must store messages until the handshake completes. Use QoS 2 only when duplicates would cause real problems — financial transactions, billing events, or firmware update commands. In practice, most IoT systems use QoS 0 or QoS 1. Designing your application to tolerate duplicates (idempotency) is usually simpler and more efficient than using QoS 2.

Q: Compare MQTT vs HTTP/REST vs CoAP for embedded IoT.

MQTT is a publish-subscribe protocol over TCP. Its strengths are bidirectional push (the server can send data to the device without polling), low per-message overhead (2-byte minimum header), built-in reliability (QoS levels), and excellent support for one-to-many distribution (one publish reaches all subscribers). Its weaknesses: it requires a persistent TCP connection (which consumes RAM and keeps the radio active on battery devices), needs a broker infrastructure, and is not request-response by nature — implementing RPC patterns (send a command, get a result) requires pairing topics and correlation IDs.

HTTP/REST is the most ubiquitous protocol on the internet. Every developer knows it, every cloud platform supports it, and it works through proxies and firewalls without special configuration. For embedded systems, the downsides are significant: HTTP headers add 200-800 bytes of overhead per request, the request-response model requires the device to poll for commands (wasting bandwidth and battery), TLS handshake is expensive (HTTP without TLS is insecure for IoT), and each request is stateless — there is no persistent session. HTTP is appropriate when the device communicates infrequently (once per hour or less), only needs to push data to the cloud (not receive commands in real-time), or must integrate with existing REST APIs.

CoAP (Constrained Application Protocol) is specifically designed for constrained devices. It runs over UDP (not TCP), uses a compact binary header (4 bytes), supports request-response like HTTP but with much lower overhead, and includes built-in discovery and observe (subscribe to resource changes). CoAP shines on severely constrained devices (8-bit MCUs, 6LoWPAN networks, NB-IoT with tiny MTU) where even a TCP connection is too expensive. DTLS provides security without TCP overhead. The downside is a smaller ecosystem compared to MQTT and HTTP — fewer cloud platforms natively support CoAP, and proxy translation (CoAP-to-HTTP) adds complexity. For most IoT products with Wi-Fi or cellular connectivity and reasonable RAM (64 KB or more), MQTT is the pragmatic default. CoAP is the better choice for truly constrained devices on constrained networks.

Q: How do you handle MQTT broker disconnection and message persistence?

Broker disconnections are inevitable in embedded IoT — networks fail, brokers restart, and wireless links drop. A robust MQTT client must handle disconnections transparently. The MQTT protocol provides two key mechanisms: clean session and persistent session.

With cleanSession = true (MQTT 3.1.1) or sessionExpiryInterval = 0 (MQTT 5.0), the broker discards all session state when the client disconnects — subscriptions are lost, and any QoS 1/2 messages published while the client was offline are dropped. The client must resubscribe after reconnecting. This is simpler but means you lose messages during downtime. With cleanSession = false, the broker retains the session — subscriptions persist, and QoS 1/2 messages published while the client was offline are queued and delivered when it reconnects. This is essential for command delivery to intermittently connected devices.

On the client side, you need local message persistence. When the device publishes a QoS 1/2 message but the connection is down, the message must be stored locally — in a circular buffer in RAM, or in flash/EEPROM for persistence across reboots. The client library retransmits these messages after reconnecting. Paho Embedded C supports pluggable persistence backends for this purpose. Size the buffer appropriately: on a cellular IoT device that transmits sensor data every 30 seconds and may be offline for up to 1 hour, you need buffer space for 120 messages.

Also implement Last Will and Testament (LWT): register a will message (e.g., {"status": "offline"} on topic devices/{id}/status) during CONNECT. If the client disconnects ungracefully (crash, network failure), the broker automatically publishes the LWT, notifying other subscribers. Pair this with a retained message published on connect ({"status": "online"}) so the device's online/offline status is always current. This pattern provides reliable presence detection without polling.

TLS & Security

Q: How does TLS work on a resource-constrained embedded device?

TLS (Transport Layer Security) encrypts and authenticates network connections. On a resource-constrained device, the same TLS handshake and record protocol run as on a desktop — the protocol itself does not change — but the implementation and configuration choices differ significantly to accommodate limited CPU, RAM, and flash.

Embedded devices use lightweight TLS libraries instead of OpenSSL (which requires 1-2 MB of flash and is complex to configure). The primary options are mbedTLS (ARM, widely used, ~100 KB flash), wolfSSL (~100 KB flash, FIPS 140-2 certified variant), and BearSSL (extremely small, ~25 KB flash, designed for minimal RAM). These libraries support the same cipher suites and protocol versions but are optimized for static memory allocation, small code size, and hardware crypto acceleration.

The TLS handshake is the most expensive operation — it involves asymmetric cryptography (RSA or ECDHE key exchange) that can take 1-5 seconds on a low-end Cortex-M4 without hardware acceleration. To mitigate this: use ECDHE with ECDSA certificates instead of RSA (256-bit ECC is equivalent to 3072-bit RSA but much faster on ARM), enable TLS session resumption (the client caches session parameters and resumes without a full handshake on reconnection — reducing the handshake to one round-trip with symmetric crypto only), and use hardware crypto accelerators if available (many SoCs include AES, SHA-256, and sometimes ECC engines). After the handshake, the record protocol uses symmetric encryption (AES-128-GCM is the standard choice) which is fast even on small MCUs, especially with hardware AES.

Q: Certificate-based vs PSK authentication — tradeoffs for embedded?

Certificate-based authentication uses X.509 certificates and public-key cryptography. The server presents its certificate, the client verifies it against a trusted CA certificate, and optionally the client presents its own certificate (mutual TLS / mTLS). This is the standard approach for internet-facing devices and cloud services. Tradeoffs: parsing and verifying certificates requires code space (~20-40 KB for X.509 parsing) and RAM (~5-10 KB for certificate chains). The asymmetric crypto is CPU-intensive. You need to manage certificates — provision a unique client certificate per device during manufacturing, bundle the server's CA certificate in firmware, and handle certificate rotation and expiration.

Pre-Shared Key (PSK) authentication uses a symmetric key shared between client and server during provisioning. The TLS handshake is dramatically simpler: no certificates to parse, no asymmetric crypto, no CA chain validation. A PSK handshake on a Cortex-M0 completes in milliseconds rather than seconds, uses 1-2 KB of RAM instead of 10+ KB, and requires minimal code. PSK is defined in RFC 4279 and supported by mbedTLS, wolfSSL, and BearSSL.

The tradeoffs: PSK has significant security and operational limitations. Each device must have a unique PSK provisioned during manufacturing — if a single key is shared across all devices, compromising one device compromises the entire fleet. PSK does not support server identity verification the way certificates do (you trust whoever knows the PSK). PSK does not provide forward secrecy unless combined with DHE (DHE-PSK cipher suites). And PSK key rotation requires touching every device. For small fleets with controlled provisioning (factory equipment, lab instruments), PSK is practical and efficient. For large-scale IoT deployments (thousands of consumer devices connecting to a cloud service), certificate-based mTLS with ECDSA is the industry standard — the upfront complexity pays off in scalable identity management, integration with cloud IoT platforms (AWS IoT, Azure IoT Hub), and proper forward secrecy.

Q: What is the RAM cost of TLS and how do you minimize it?

TLS RAM consumption comes from three main sources. The handshake requires buffers for incoming and outgoing handshake messages — the largest being the server's certificate chain, which can be 2-4 KB per certificate (a chain of 2-3 certificates means 4-12 KB just for reassembly). The record layer needs send and receive buffers — by default 16 KB each (the maximum TLS record size), totaling 32 KB. The session state includes the symmetric keys, sequence numbers, and cipher context — typically 1-2 KB. Total baseline: 10-50 KB per TLS connection, depending on configuration.

To minimize RAM, apply these techniques. Reduce the maximum fragment length using the TLS Maximum Fragment Length (MFL) extension (RFC 6066): negotiate 1 KB or 2 KB fragments instead of 16 KB, reducing the record buffers from 32 KB to 2-4 KB. Both client and server must support this extension. Use ECDSA certificates instead of RSA — ECDSA certificates are much smaller (a 256-bit ECDSA certificate is ~500 bytes vs ~1.5 KB for a 2048-bit RSA certificate), reducing the certificate reassembly buffer. Enable session resumption to avoid the full handshake (and its peak RAM usage) on reconnections. Limit cipher suites to a single suite (e.g., TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256) to avoid code and RAM for unused algorithms.

With aggressive optimization in mbedTLS, you can achieve a per-connection RAM footprint of 10-15 KB — achievable on Cortex-M4 devices with 64-128 KB of RAM. BearSSL is even more aggressive, achieving under 25 KB total (including code) with careful configuration. The key insight is that TLS RAM cost is not fixed — it is highly configurable, and embedded engineers must tune it rather than accepting the defaults that were designed for desktop machines.

Q: How do you handle certificate expiration for devices deployed in the field?

Certificate expiration is one of the most challenging operational problems for IoT devices. Unlike servers that are actively managed, embedded devices may be deployed in remote locations for 10-15 years with limited connectivity. If a device's client certificate or its stored CA certificate expires, TLS connections fail and the device goes offline — potentially with no way to recover remotely.

Prevention starts at provisioning. Issue device certificates with long validity periods (10-20 years) matching the expected product lifetime. Use intermediate CA certificates (not root CAs) for device issuance so you can rotate the intermediate without touching the root. Store the root CA certificate on the device with a very long validity (25+ years). Some deployments use certificates with no expiration at all, relying on revocation lists instead — but this requires the device to check CRL or OCSP, which adds complexity.

Certificate rotation (renewing certificates before they expire) requires a secure update mechanism. The standard approach is EST (Enrollment over Secure Transport, RFC 7030) or a proprietary cloud-based protocol: the device uses its current (still-valid) certificate to authenticate to a renewal service, generates a new key pair, sends a Certificate Signing Request (CSR), receives a new certificate, and stores it in flash. This must happen well before expiration — build in alerts when a certificate is within 10-20% of its lifetime. For the CA certificate, include it in firmware updates (OTA updates carry the new CA certificate signed by the old one).

Recovery from expiration is the hardest case. If the device certificate has already expired and the server rejects the connection, you need a fallback authentication mechanism — PSK-based TLS using a factory-provisioned key, or a special bootstrap endpoint that accepts expired certificates for the sole purpose of certificate renewal. Without such a fallback, the only recovery is physical access to the device for re-provisioning — which may be impractical for devices deployed on cell towers, in pipelines, or inside sealed enclosures.