How do you handle socket errors and implement reconnection logic?

Question

Accepted Answer

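Socket operations can fail at any time: the network is unreliable, the remote server can crash, and wireless links can drop. Robust embedded code must check every return value and implement automatic reconnection.

The key errors to handle are these. connect() returns -1 and sets errno to ECONNREFUSED (nothing is listening on the port, typically because the server is not running), ETIMEDOUT (no response before the connection attempt timed out), or ENETUNREACH (no route to the network). send() returns -1 with EPIPE or ECONNRESET when the peer has closed or reset the connection; on POSIX systems, writing to a closed connection also raises SIGPIPE, which terminates the process by default unless the signal is ignored or MSG_NOSIGNAL is passed to send(). recv() returning 0 means the peer performed an orderly shutdown. recv() returning -1 with EAGAIN or EWOULDBLOCK on a non-blocking socket means no data is available yet; this is not an error, so retry later.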
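The error cases above can be folded into a small classifier, so the calling code only has to decide between "carry on", "retry later", and "tear down and reconnect". This is a minimal sketch assuming a POSIX/Linux socket API; the io_result_t type and the classify_recv, classify_send, and try_send names are hypothetical and exist only for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical result codes used only in this sketch. */
typedef enum {
    IO_OK,          /* bytes were transferred */
    IO_RETRY,       /* nothing to do right now; try again later */
    IO_PEER_CLOSED, /* recv() returned 0: orderly shutdown by the peer */
    IO_RECONNECT    /* connection is unusable; close it and reconnect */
} io_result_t;

/* Map a recv() return value and the errno saved right after the call. */
io_result_t classify_recv(ssize_t n, int err)
{
    if (n > 0)
        return IO_OK;
    if (n == 0)
        return IO_PEER_CLOSED;
    /* EAGAIN/EWOULDBLOCK: no data yet on a non-blocking socket.
     * EINTR: the call was interrupted by a signal. Neither is fatal. */
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return IO_RETRY;
    return IO_RECONNECT; /* ECONNRESET, ETIMEDOUT, ... */
}

/* Map a send() return value and the errno saved right after the call. */
io_result_t classify_send(ssize_t n, int err)
{
    if (n >= 0)
        return IO_OK; /* may be a partial write; the caller checks n */
    if (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)
        return IO_RETRY;
    return IO_RECONNECT; /* EPIPE, ECONNRESET, ... */
}

/* Send without raising SIGPIPE. MSG_NOSIGNAL is assumed to be available
 * (it is on Linux); other stacks may need SIGPIPE ignored instead. */
io_result_t try_send(int fd, const void *buf, size_t len, ssize_t *sent)
{
    ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
    int err = errno; /* save before any other call can overwrite it */

    if (sent != NULL)
        *sent = n;
    return classify_send(n, err);
}
```

On IO_PEER_CLOSED or IO_RECONNECT the caller would close() the descriptor and hand control to the reconnection logic; on IO_RETRY it would simply wait for the socket to become ready again.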

A robust reconnection pattern uses exponential backoff with jitter. When a connection fails, wait 1 second, then 2 seconds, then 4, 8, 16, up to a maximum (e.g., 60 seconds). Add random jitter (plus or minus 20%) to prevent thousands of IoT devices from reconnecting to the server at the exact same instant after a cloud outage (the "thundering herd" problem). Reset the backoff timer to the minimum after a successful connection that has been stable for a configurable duration.
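The backoff schedule described above might look like the following sketch. The backoff_t type, the function names, and the constants are illustrative assumptions, and the random value is passed in by the caller so the code does not depend on any particular random number source (rand(), a hardware RNG, and so on).

```c
#include <stdint.h>

#define BACKOFF_MIN_MS   1000u  /* first retry after about 1 second */
#define BACKOFF_MAX_MS  60000u  /* base delay is capped at 60 seconds */
#define JITTER_PERCENT     20u  /* plus or minus 20% around the base */

/* Hypothetical state for this sketch: the current un-jittered delay. */
typedef struct {
    uint32_t base_ms;
} backoff_t;

/* Call at startup and once a connection has proven stable. */
void backoff_reset(backoff_t *b)
{
    b->base_ms = BACKOFF_MIN_MS;
}

/* Return the delay to wait before the next connection attempt, then
 * double the base delay for the following attempt (capped at the max).
 * rnd is any random 32-bit value supplied by the caller. */
uint32_t backoff_next_ms(backoff_t *b, uint32_t rnd)
{
    uint32_t base = b->base_ms;
    uint32_t span = base * JITTER_PERCENT / 100u;

    /* rnd % (2*span + 1) lies in [0, 2*span]; subtracting span from the
     * base first shifts the result into [base - span, base + span]. */
    uint32_t delay = base - span + (rnd % (2u * span + 1u));

    /* Double without overshooting the cap. */
    b->base_ms = (base > BACKOFF_MAX_MS / 2u) ? BACKOFF_MAX_MS : base * 2u;
    return delay;
}
```

A reconnect loop would call backoff_next_ms(&b, (uint32_t)rand()) after each failed connect(), sleep for the returned number of milliseconds, and call backoff_reset(&b) only after the new connection has stayed up for the chosen stability period. Note that in this sketch the jitter is applied on top of the cap, so the longest wait can reach 72 seconds; clamp the result if a hard upper bound is required.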

For embedded systems, also consider three further measures. First, enable TCP keepalive to detect dead connections. The usual default of 2 hours of idle time before the first probe is far too long, so configure TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT for faster detection (e.g., start probing after 30 seconds of idle, probe every 10 seconds, and give up after 3 unanswered probes). Second, use an application-level heartbeat if the protocol supports one (MQTT PINGREQ, WebSocket ping/pong); this catches cases where the TCP connection is technically open but data is not flowing. Third, keep state across reconnections: buffer unsent messages in flash or a circular RAM buffer so they can be transmitted after reconnecting. MQTT's clean session flag and message persistence mechanisms are designed for exactly this scenario.
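The keepalive settings mentioned above can be applied with setsockopt(). This is a sketch assuming a Linux-style stack that exposes TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT (other stacks may use different option names or not support per-socket tuning); the enable_keepalive name is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keepalive on fd: first probe after 30 s of idle, then one
 * probe every 10 s, and drop the connection after 3 unanswered probes.
 * Returns 0 on success, or -1 with errno set by setsockopt(). */
int enable_keepalive(int fd)
{
    int on    = 1;
    int idle  = 30; /* seconds of idle before the first probe */
    int intvl = 10; /* seconds between probes */
    int cnt   = 3;  /* unanswered probes before giving up */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt) < 0)
        return -1;
    return 0;
}
```

With these values a silent peer is detected after roughly one minute (30 seconds of idle plus three probes 10 seconds apart), at which point a blocked or subsequent recv() fails with ETIMEDOUT and the reconnection logic can take over.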