Quick Recap
MQTT (originally "MQ Telemetry Transport"; the name is no longer an acronym in the OASIS spec) is a lightweight publish/subscribe messaging protocol designed for constrained devices and unreliable networks. It runs over TCP, uses a central broker to decouple publishers from subscribers, and provides three quality-of-service levels for delivery guarantees. MQTT is the de facto standard for IoT telemetry -- understanding its architecture, QoS trade-offs, and session management is one of the most common IoT interview topics.
Key Facts:
- Publish/subscribe: Clients never communicate directly -- all messages flow through a broker that routes by topic
- Topics are hierarchical: `sensors/building1/floor2/temperature` -- levels separated by `/`, with `+` (single-level) and `#` (multi-level) wildcards for subscriptions
- Three QoS levels: 0 (at most once), 1 (at least once), 2 (exactly once) -- each adds protocol overhead and RAM cost
- Retained messages: The broker stores the last message on a topic so new subscribers immediately receive current state
- Last Will and Testament (LWT): A message the broker publishes on behalf of a client that disconnects ungracefully
- Minimal overhead: Fixed header is only 2 bytes; a complete PUBLISH packet can be under 20 bytes total
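The "under 20 bytes" claim is easy to check with a short sketch. A minimal MQTT 3.1.1 QoS 0 PUBLISH is just the 2-byte fixed header, a 2-byte topic-length prefix, the UTF-8 topic, and the raw payload; the topic and payload below are made-up examples:

```python
import struct

def build_publish_qos0(topic: str, payload: bytes) -> bytes:
    """Build a minimal MQTT 3.1.1 PUBLISH packet (QoS 0, no DUP/RETAIN).

    Layout: fixed header (packet type 0x30, remaining length),
    then 2-byte topic length, UTF-8 topic, raw payload.
    """
    topic_bytes = topic.encode("utf-8")
    variable = struct.pack("!H", len(topic_bytes)) + topic_bytes + payload
    # Sketch only handles the single-byte remaining-length case (< 128 bytes)
    assert len(variable) < 128
    return bytes([0x30, len(variable)]) + variable

pkt = build_publish_qos0("s/t1", b"21.5")
# 2 (fixed header) + 2 (topic length) + 4 (topic) + 4 (payload) = 12 bytes
print(len(pkt))  # 12
```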
Deep Dive
At a Glance
| Concept | Detail |
|---|---|
| Transport | TCP (port 1883), TLS over TCP (port 8883) |
| Architecture | Broker-centric publish/subscribe; clients are publishers, subscribers, or both |
| Topic format | UTF-8 hierarchical string separated by /; max 65,535 bytes |
| QoS levels | 0 (fire-and-forget), 1 (ACK-based), 2 (four-step handshake) |
| Session state | Clean session flag controls whether broker persists subscriptions and queued messages |
| Keep-alive | PINGREQ/PINGRESP heartbeat; broker disconnects client after 1.5x keep-alive with no activity |
| Packet size | Fixed header 2 bytes + variable header + payload; remaining length uses 1-4 byte encoding (max 256 MB) |
| Versions | 3.1.1 (most deployed), 5.0 (adds reason codes, shared subscriptions, topic aliases) |
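The 1-4 byte remaining-length field in the table is a base-128 varint: 7 data bits per byte, with the high bit set when another byte follows. A sketch of both directions, following the algorithm given in the spec:

```python
def encode_remaining_length(n: int) -> bytes:
    """Encode MQTT's remaining-length field (1-4 bytes, max 268,435,455)."""
    if not 0 <= n <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        out.append(digit | 0x80 if n else digit)  # high bit = continuation
        if n == 0:
            return bytes(out)

def decode_remaining_length(data: bytes) -> int:
    """Decode a remaining-length field from the start of `data`."""
    value, multiplier = 0, 1
    for byte in data:
        value += (byte & 0x7F) * multiplier
        if not byte & 0x80:       # high bit clear: last byte
            return value
        multiplier *= 128
    raise ValueError("malformed remaining length")
```

The 4-byte ceiling (`0xFF 0xFF 0xFF 0x7F`) is exactly where the "max 256 MB" figure comes from: 268,435,455 bytes.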
The Publish/Subscribe Model
MQTT decouples message producers from consumers through a broker. Publishers send messages to a topic (a UTF-8 string). Subscribers register interest in one or more topics. The broker matches published messages to subscriptions and forwards accordingly. Neither side knows the other exists.
This decoupling is powerful in IoT because it solves three problems: (1) space decoupling -- publisher and subscriber do not need to know each other's IP address; (2) time decoupling -- they do not need to be online simultaneously (with persistent sessions); (3) synchronization decoupling -- publishing and receiving are asynchronous, so neither side blocks waiting for the other.
```
Sensor A ──PUBLISH──▶ ┌──────────┐ ──▶ Dashboard App
(temp data)           │  BROKER  │ ──▶ Logging Service
Sensor B ──PUBLISH──▶ │ (routes  │ ──▶ Alert Engine
(humidity)            │ by topic)│
                      └──────────┘
                           ▲
                 Subscribers register
                 topic filters once
```
Why a broker, not direct connections? In a system with 1,000 sensors and 10 consumers, direct connections would require 10,000 TCP sockets. With MQTT, each sensor maintains one connection to the broker, and the broker fans out to consumers. The broker is the single point of coordination, which simplifies device firmware but makes broker availability critical.
Topics: Hierarchy and Wildcards
Topics are hierarchical strings separated by /. A well-designed topic hierarchy acts like a filesystem for your data:
```
factory/line1/press3/temperature
factory/line1/press3/pressure
factory/line1/press3/status
factory/line2/conveyor1/speed
```
Wildcard subscriptions let subscribers match multiple topics:
| Wildcard | Meaning | Example | Matches |
|---|---|---|---|
| `+` | Single level -- matches exactly one level | `factory/+/+/temperature` | `factory/line1/press3/temperature`, `factory/line2/oven1/temperature` |
| `#` | Multi-level -- matches zero or more levels; must be last | `factory/line1/#` | `factory/line1/press3/temperature`, `factory/line1/press3/status`, `factory/line1/anything/else` |
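The wildcard semantics above can be captured in a small matcher. This is a simplified illustration: it assumes the filter itself is valid and skips the spec rule that wildcards must not match topics beginning with `$`:

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Match an MQTT topic against a subscription filter with + and #."""
    flevels = filter_.split("/")
    tlevels = topic.split("/")
    for i, f in enumerate(flevels):
        if f == "#":                      # multi-level: matches the rest (or nothing)
            return True
        if i >= len(tlevels):             # filter is deeper than the topic
            return False
        if f != "+" and f != tlevels[i]:  # + matches exactly one level
            return False
    return len(flevels) == len(tlevels)

topic_matches("factory/+/+/temperature", "factory/line1/press3/temperature")  # True
topic_matches("factory/+", "factory/line1/press3")                            # False
```

Note that `factory/line1/#` also matches `factory/line1` itself -- `#` covers "zero or more" levels, which surprises many candidates.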
Design rules for topics:
- Use meaningful hierarchy: `{domain}/{location}/{device}/{measurement}`
- Never start with `/` (creates an empty leading level)
- Never use `#` in production subscriber code without rate limiting -- it matches every topic on the broker and can flood a constrained device
- Topics starting with `$` are reserved for broker system topics (e.g., `$SYS/broker/clients/connected`)
Subscribing to `#` on an MCU-based client is a frequent source of crashes. The broker forwards every message published by every client. On a device with 32 KB RAM, a burst of messages can overflow the receive buffer, exhaust the heap, and trigger a hard fault. Always subscribe to the narrowest topic filter possible.
QoS Levels: The Core Trade-Off
QoS governs how hard the protocol works to ensure delivery. Higher QoS means more protocol messages, more RAM for in-flight tracking, and higher latency.
| QoS | Name | Protocol Flow | Guarantee | When to Use |
|---|---|---|---|---|
| 0 | At most once | PUBLISH only (fire-and-forget) | Message may be lost | Periodic sensor telemetry where the next reading replaces the last |
| 1 | At least once | PUBLISH, PUBACK (two-step) | Message delivered at least once; duplicates possible | Alerts, commands -- where missing a message is worse than receiving it twice |
| 2 | Exactly once | PUBLISH, PUBREC, PUBREL, PUBCOMP (four-step) | No loss, no duplicates | Financial transactions, firmware update triggers, billing events |
Embedded trade-offs by QoS:
| Resource | QoS 0 | QoS 1 | QoS 2 |
|---|---|---|---|
| RAM | None beyond TX buffer | Must store message until PUBACK (per message ID) | Must store message + track state across 4 steps |
| Bandwidth | 1 packet | 2 packets | 4 packets |
| Latency | Lowest | Moderate | Highest |
| Battery | Best | Moderate | Worst (radio on for 4 exchanges) |
Important nuance: QoS is agreed per hop -- publisher-to-broker and broker-to-subscriber negotiate independently. If a sensor publishes at QoS 1 but a subscriber subscribes at QoS 0, the broker downgrades delivery to the subscriber. The broker-to-subscriber QoS is the minimum of the published QoS and the subscribed QoS.
Interviewers often ask "how do you handle QoS 1 duplicates?" The answer is idempotent message handling. Design your subscriber so processing the same message twice produces the same result -- for example, storing the latest sensor value (idempotent) rather than incrementing a counter (not idempotent). If you must avoid duplicates but cannot use QoS 2, use application-level deduplication with a message ID or timestamp.
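One way to implement that application-level deduplication is a bounded seen-set keyed on message identity. The `(sensor_id, timestamp)` key below is a hypothetical scheme, and the fixed LRU window keeps memory constant, which matters on a small device:

```python
from collections import OrderedDict

class Deduplicator:
    """Drop QoS 1 redeliveries already processed, within a bounded window."""

    def __init__(self, window: int = 256):
        self.window = window
        self.seen = OrderedDict()   # insertion-ordered keys, oldest first

    def is_duplicate(self, key) -> bool:
        if key in self.seen:
            return True
        self.seen[key] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)   # evict the oldest key
        return False

dedup = Deduplicator()
dedup.is_duplicate(("sensor42", 1700000000))   # False: first delivery, process it
dedup.is_duplicate(("sensor42", 1700000000))   # True: QoS 1 redelivery, drop it
```

The window is a trade-off: too small and an old duplicate slips through after eviction; too large and the set eats RAM. Idempotent processing remains the safer default.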
Retained Messages
When a client publishes with the retain flag set, the broker stores that message. Any future client that subscribes to the topic immediately receives the retained message -- it does not have to wait for the next publish. Only one retained message is stored per topic (the latest one). Publishing an empty payload with the retain flag clears the retained message.
Why this matters for IoT: A device that reports its status (online/offline) once every 10 minutes can retain the message. A monitoring dashboard that connects 5 minutes later immediately knows the device's current status without waiting for the next report. Without retained messages, new subscribers have no context until the next publish event.
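The retain semantics can be sketched as a toy broker-side store. This is an illustration only: it handles exact topic matches and ignores the wildcard-subscription lookup a real broker performs:

```python
class RetainedStore:
    """Toy model of broker retained-message behavior (latest wins per topic)."""

    def __init__(self):
        self.retained = {}

    def publish(self, topic: str, payload: bytes, retain: bool):
        if retain:
            if payload:
                self.retained[topic] = payload     # replace any previous retained msg
            else:
                self.retained.pop(topic, None)     # empty retained payload clears it

    def on_subscribe(self, topic: str):
        """New subscribers immediately receive the retained message, if any."""
        return self.retained.get(topic)

store = RetainedStore()
store.publish("devices/sensor42/status", b"online", retain=True)
store.on_subscribe("devices/sensor42/status")   # b"online", no waiting for next publish
```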
Last Will and Testament (LWT)
A client specifies a will message (topic, payload, QoS, retain flag) during the CONNECT handshake. If the client disconnects ungracefully -- network failure, crash, keep-alive timeout -- the broker publishes the will message on the client's behalf. If the client disconnects cleanly with a DISCONNECT packet, the will is discarded.
Common LWT pattern: A sensor publishes a retained message "online" to devices/sensor42/status on connect, and sets its LWT to publish "offline" (retained) to the same topic. Monitoring systems subscribe to devices/+/status and always know which devices are alive. This is the standard presence detection pattern in IoT.
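The broker-side rule is simple enough to state as code. This toy function only encodes when the will fires; the topic and payload are the presence-pattern values from above:

```python
def on_client_gone(clean_disconnect: bool, will):
    """Broker-side LWT rule: publish the will only on an ungraceful
    disconnect (network drop, crash, keep-alive timeout). A clean
    DISCONNECT discards it. `will` is a (topic, payload) tuple or None.
    Returns the message to publish, or None."""
    if clean_disconnect or will is None:
        return None
    return will

will = ("devices/sensor42/status", b"offline")
on_client_gone(False, will)   # will published: monitoring sees "offline"
on_client_gone(True, will)    # None: clean disconnect, will discarded
```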
Persistent Sessions (Clean Session Flag)
The clean session flag in the CONNECT packet controls session persistence:
| Clean Session | Behavior | Use Case |
|---|---|---|
| true (1) | Broker discards any previous session state; starts fresh | Stateless devices, testing, devices with ample connectivity |
| false (0) | Broker preserves subscriptions and queues QoS 1/2 messages while client is offline | Battery-powered devices that sleep and reconnect periodically |
With clean session = false, when the device reconnects (using the same client ID), the broker replays any QoS 1/2 messages that arrived while it was offline. This is critical for devices that sleep for minutes or hours between transmissions -- they will not miss important commands.
If a device with clean session = false goes offline for days while messages accumulate, the broker queues them all. When the device reconnects, the broker floods it with potentially thousands of messages. On a constrained device, this can exhaust RAM and crash the client. Configure the broker's max_queued_messages and message_expiry_interval (MQTT 5.0) to bound queue growth.
MQTT vs HTTP for IoT
This comparison comes up in nearly every IoT interview.
| Criteria | MQTT | HTTP |
|---|---|---|
| Architecture | Publish/subscribe via broker | Request/response (client-server) |
| Transport | TCP (persistent connection) | TCP (typically new connection per request) |
| Direction | Bidirectional (server can push to client) | Client-initiated only (server cannot push without polling or WebSocket) |
| Header overhead | 2 bytes fixed header | Hundreds of bytes (method, URL, headers, cookies) |
| State | Stateful (persistent connection, session) | Stateless (each request is independent) |
| Data format | Binary payload (any format) | Typically JSON/XML with content-type headers |
| Best for | Telemetry, events, commands, presence | Configuration APIs, OTA metadata, dashboards |
| Power | Low (one connection, small packets) | High (connection setup per request, large headers) |
The embedded rule of thumb: Use MQTT for continuous data flow (sensor readings, device commands, status updates) where low overhead and server push matter. Use HTTP/REST for occasional, on-demand operations (device provisioning, OTA update URLs, configuration retrieval) where ubiquitous tooling and caching are more important than efficiency.
MQTT over TLS
MQTT itself has no encryption. For security, MQTT runs over TLS (port 8883). On constrained devices, TLS adds significant overhead:
- RAM: 20-40 KB for TLS session state and certificate buffers (mbedTLS, wolfSSL)
- Flash: 50-100 KB for the TLS library
- CPU: TLS handshake takes hundreds of milliseconds on a Cortex-M4 at 80 MHz
- Mutual TLS (mTLS): Each device has its own X.509 certificate -- the broker authenticates the device, and the device authenticates the broker. This is the standard for AWS IoT Core and Azure IoT Hub.
For devices too constrained for TLS, alternatives include pre-shared keys (PSK), which avoid certificate overhead, or encrypting payloads at the application layer while running MQTT over plaintext TCP on a physically secured network.
MQTT 5.0: Key New Features
MQTT 5.0 (ratified 2019) adds features that solve real operational pain points from 3.1.1:
| Feature | What It Does | Why It Matters |
|---|---|---|
| Reason codes | Every ACK includes a reason code (success, quota exceeded, topic invalid, etc.) | Debugging -- 3.1.1 silently dropped invalid operations |
| Shared subscriptions | Multiple subscribers share a topic; broker distributes messages round-robin | Load balancing across worker instances |
| Topic aliases | Map a long topic string to a short integer after first use | Reduces per-message overhead for repeated topics |
| Message expiry | Publisher sets TTL on a message; broker discards it after expiry | Prevents stale messages from flooding devices that reconnect |
| User properties | Key-value metadata attached to any packet | Application-level routing, tracing, correlation IDs |
| Flow control | Receive maximum limits in-flight messages per direction | Prevents broker from overwhelming constrained clients |
| Session expiry interval | Explicit time after which the broker discards a persistent session | Replaces 3.1.1's "session lives forever" behavior |
Mentioning MQTT 5.0 features -- especially shared subscriptions and message expiry -- signals that you have worked with MQTT at scale, not just followed a tutorial. Most interview candidates only know 3.1.1.
Broker Selection for Embedded Projects
| Broker | Type | Best For | Notes |
|---|---|---|---|
| Mosquitto | Open-source, self-hosted | Development, small deployments | Lightweight C implementation; MQTT 5.0 support; easy to run on a Raspberry Pi |
| AWS IoT Core | Managed cloud | Production AWS-based IoT | Mutual TLS required; integrates with Lambda, DynamoDB, S3; per-message pricing |
| HiveMQ | Commercial / cloud | Enterprise, high-throughput | MQTT 5.0 full support; clustering for millions of connections; has a free community edition |
| EMQX | Open-source / commercial | High-scale self-hosted | Erlang-based; handles millions of concurrent connections; good Kubernetes support |
| Azure IoT Hub | Managed cloud | Production Azure-based IoT | MQTT 3.1.1 subset (no wildcards, no retained messages); device twin abstraction |
Embedded MQTT Clients
On constrained devices, the MQTT client library must be small, non-blocking, and integrate with the RTOS or bare-metal scheduler.
| Library | Target | RAM | Notes |
|---|---|---|---|
| lwMQTT | Cortex-M / bare-metal | 2-4 KB | Minimal, no dynamic allocation, Arduino-compatible |
| Paho Embedded C | Cortex-M / RTOS | 5-10 KB | Eclipse foundation; supports QoS 0/1/2; FreeRTOS and lwIP integration |
| coreMQTT (AWS) | Cortex-M / FreeRTOS | 3-6 KB | Part of FreeRTOS libraries; designed for AWS IoT Core; no dynamic allocation |
| Paho C | Embedded Linux | 50+ KB | Full-featured; async API; good for Yocto/Buildroot Linux devices |
Memory footprint considerations: On an MCU with 64 KB RAM, a typical MQTT stack breakdown is: lwIP (15-20 KB) + TLS (20-30 KB) + MQTT client (3-5 KB) + application buffers (5-10 KB) = 43-65 KB. This leaves almost no room for application logic. Strategies to reduce footprint: disable QoS 2 if not needed, limit topic string lengths, reduce the receive buffer to match your maximum expected message size, and use QoS 0 for high-frequency telemetry.
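The budget arithmetic can be made explicit. The component figures below are mid-range picks from the estimates above -- assumptions for illustration, not measurements from any particular board:

```python
def mqtt_ram_budget(total_kb: int, components: dict) -> int:
    """Return the RAM (KB) left for application logic after the stack."""
    return total_kb - sum(components.values())

# Mid-range estimates: lwIP 18, TLS 25, MQTT client 4, app buffers 8
left = mqtt_ram_budget(64, {"lwIP": 18, "TLS": 25, "MQTT client": 4, "buffers": 8})
print(left)  # 9 KB left on a 64 KB part -- very little headroom
```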
Debugging Story: The Phantom Duplicate Messages
A fleet of 500 agricultural soil-moisture sensors reported data via MQTT QoS 1 to an AWS IoT Core broker. The cloud backend began logging duplicate readings -- the same sensor value appeared twice with identical timestamps. The data pipeline used a simple counter to track total readings, so duplicates inflated the count and corrupted daily averages.
Initial suspicion was network instability causing retransmissions, but packet captures showed clean TCP connections with no retransmits. The root cause was the keep-alive interval. The sensors used a 60-second keep-alive, but they entered deep sleep for 90 seconds between readings. The broker declared the client dead after 90 seconds (1.5x the 60-second keep-alive), published the LWT, and cleaned up the session. When the sensor woke and reconnected, it still had the last QoS 1 PUBLISH in its outgoing queue (the PUBACK was lost when the broker killed the session). The sensor retransmitted the message, and the broker treated it as a new message because the old session was gone.
The fix was twofold: (1) increase the keep-alive to 180 seconds so the broker tolerates the 90-second sleep cycle (1.5x 180 = 270 seconds, well above 90); (2) make the backend idempotent by deduplicating on (sensor_id, timestamp) pairs before inserting into the database.
Lesson: MQTT keep-alive must account for the device's maximum sleep duration. If the device sleeps longer than 1.5x the keep-alive, the broker will kill the connection, and reconnection with QoS 1 can produce duplicates. Always design backend consumers to be idempotent -- QoS 1 means "at least once," and "at least" includes two or more.
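That sizing rule can be written down directly: pick the smallest keep-alive whose 1.5x grace period covers the device's longest sleep plus a reconnect margin. The 30-second margin here is an arbitrary illustration; the story's fix (180 s) is simply a more conservative choice:

```python
import math

def min_keepalive(max_sleep_s: int, margin_s: int = 30) -> int:
    """Smallest keep-alive (seconds) such that the broker's 1.5x
    grace period exceeds the longest sleep plus a safety margin."""
    return math.ceil((max_sleep_s + margin_s) / 1.5)

min_keepalive(90)   # 80 -> broker grace = 1.5 * 80 = 120 s, covers a 90 s sleep
```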
What interviewers want to hear: You understand the publish/subscribe model and why a broker-based architecture scales better than point-to-point for IoT. You can explain QoS 0/1/2 with specific embedded trade-offs (RAM, battery, bandwidth) rather than just reciting definitions. You know about retained messages, LWT, and persistent sessions -- and their pitfalls on constrained devices. You can compare MQTT to HTTP with concrete criteria (header size, connection model, push vs poll). You have opinions on broker selection and client library choice for resource-constrained targets, and you understand that MQTT 5.0 exists and solves real problems.
Interview Focus
Classic MQTT Interview Questions
Q1: "What are the three MQTT QoS levels, and when would you use each in an embedded system?"
Model Answer Starter: "QoS 0 is fire-and-forget -- the publisher sends the message and moves on. I use it for high-frequency sensor telemetry where the next reading replaces the last, because it has zero protocol overhead beyond the PUBLISH packet and uses the least battery. QoS 1 adds a PUBACK acknowledgment -- the publisher retransmits until it gets an ACK, guaranteeing at least one delivery but allowing duplicates. I use it for alerts and commands where missing a message is worse than processing it twice, and I make my subscribers idempotent. QoS 2 uses a four-step handshake (PUBLISH, PUBREC, PUBREL, PUBCOMP) to guarantee exactly-once delivery. I reserve it for critical operations like firmware update triggers or billing events, because the four-packet exchange doubles bandwidth and keeps the radio on longer, which is costly on battery-powered devices."
Q2: "How does MQTT's Last Will and Testament work, and why is it useful?"
Model Answer Starter: "When a client connects, it can register a will message -- a topic, payload, QoS, and retain flag -- with the broker. If the client disconnects ungracefully (network failure, crash, keep-alive timeout), the broker publishes the will message on the client's behalf. If the client sends a clean DISCONNECT, the will is discarded. The classic use case is presence detection: a sensor publishes a retained 'online' message when it connects and sets its LWT to publish a retained 'offline' message on the same status topic. Any monitoring dashboard subscribed to the status topic always knows which devices are alive."
Q3: "Compare MQTT and HTTP for IoT. When would you choose one over the other?"
Model Answer Starter: "MQTT uses a persistent TCP connection with a 2-byte fixed header and supports server-initiated push via the pub/sub model. HTTP uses a new TCP connection per request (unless keep-alive is used), has hundreds of bytes of headers, and is strictly client-initiated -- the server cannot push data without polling or WebSocket. I choose MQTT for continuous telemetry, real-time commands, and status updates because the persistent connection and small headers save battery on constrained devices. I choose HTTP for on-demand operations like device provisioning, OTA update metadata retrieval, and configuration APIs where existing web infrastructure, caching, and tooling are more valuable than protocol efficiency."
Q4: "What is the clean session flag, and what happens if a device with persistent sessions reconnects after a long sleep?"
Model Answer Starter: "The clean session flag in the CONNECT packet controls whether the broker preserves session state. With clean session false, the broker stores the client's subscriptions and queues any QoS 1/2 messages that arrive while the client is offline. When the device reconnects with the same client ID, the broker replays all queued messages. This is useful for battery-powered devices that sleep between transmissions. The danger is queue buildup -- if the device is offline for hours and thousands of messages queue up, reconnection can flood the device and crash it. I mitigate this by configuring the broker's max queued messages and using MQTT 5.0's message expiry interval."
Q5: "What are the key differences between MQTT 3.1.1 and MQTT 5.0?"
Model Answer Starter: "MQTT 5.0 adds several features that solve real operational problems. Reason codes on every ACK make debugging much easier -- in 3.1.1, if a subscribe fails, the client gets no error. Shared subscriptions let multiple worker instances load-balance a topic, which is essential for scalable backends. Topic aliases reduce per-message overhead by mapping long topic strings to short integers. Message expiry prevents stale messages from flooding devices that reconnect after a long sleep. Session expiry interval lets the broker discard abandoned sessions instead of keeping them forever. Flow control with receive-maximum prevents the broker from overwhelming constrained clients with too many in-flight messages."
Trap Alerts
- Don't say: "QoS 2 is always better because it's more reliable." It is 4x the packets and dramatically increases latency and power consumption. Most IoT telemetry uses QoS 0 or 1.
- Don't forget: MQTT runs over TCP, not UDP. This means TCP's overhead (handshake, per-connection buffers, retransmissions) applies in addition to MQTT's own QoS mechanisms. On constrained devices, the TCP stack often uses more RAM than the MQTT client.
- Don't ignore: The keep-alive and clean session interaction. If a device sleeps longer than 1.5x the keep-alive, the broker kills the connection. If the session is not persistent, subscriptions are lost and must be re-established on reconnect.
Follow-up Questions
- "How would you design an MQTT topic hierarchy for a smart factory with 10,000 sensors?"
- "What happens if two MQTT clients connect with the same client ID?"
- "How does MQTT handle large payloads that exceed the receive buffer on a constrained device?"
- "What is the difference between MQTT over WebSocket and standard MQTT, and when would you use each?"
Practice
❓ What does MQTT QoS 1 guarantee?
❓ What is the purpose of a retained message in MQTT?
❓ Which MQTT wildcard matches exactly one topic level?
❓ What happens when an MQTT client disconnects ungracefully (e.g., network failure)?
❓ On a Cortex-M device with 64 KB RAM, what is the approximate total RAM needed for an MQTT client with TLS?
Real-World Tie-In
Smart Agriculture Fleet -- A network of 2,000 soil-moisture sensors across a farm used MQTT QoS 0 for 15-minute telemetry readings (losing one reading is acceptable) and QoS 1 for daily battery-level reports (must not miss a low-battery alert). The sensors slept between readings with persistent sessions (clean session false) and a keep-alive of 1800 seconds. The backend subscribed to farm/+/+/moisture with a shared subscription (MQTT 5.0) across three processing instances for load balancing.
Industrial Predictive Maintenance -- A motor vibration monitoring system used MQTT with retained messages to publish each motor's health status (healthy, degraded, critical) to plant/line3/motor/{id}/health. When a maintenance dashboard connected -- even hours after the last status update -- it immediately received the current health of every motor via retained messages, without waiting for the next analysis cycle. LWT was set to publish unknown so that a sensor failure was visible within the keep-alive timeout.
Connected Vehicle Telemetry -- An automotive telematics unit published GPS position, speed, and diagnostic codes over MQTT QoS 1 to a cloud broker. The unit operated on cellular with frequent connectivity drops. MQTT's persistent session ensured that commands sent from the fleet management server (lock doors, start diagnostics) were queued at the broker and delivered when the vehicle reconnected, even after being in a tunnel for 30 minutes. HTTP polling would have required the vehicle to check for commands repeatedly, wasting cellular data and battery.