Memory Alignment & Endianness

Quick Cap

Memory alignment dictates where the CPU can efficiently read and write multi-byte values, and endianness dictates the order those bytes are stored. Getting either wrong causes silent data corruption, performance penalties, or hard faults -- exactly the kind of low-level bug that embedded interviews are designed to surface.

Interviewers test whether you can look at a struct and predict its padded size, reorder its fields to save RAM, and write serialization code that works regardless of which end of a multi-byte value comes first in memory.

Key Facts:

Natural alignment: An N-byte type must sit at an address divisible by N. uint32_t needs a 4-byte-aligned address; uint16_t needs 2-byte.
Unaligned access on Cortex-M: Cortex-M3/M4/M7 allow unaligned single loads and stores to normal memory but take a multi-cycle penalty; multi-word instructions such as LDM, STM, LDRD, and STRD still fault. Cortex-M0/M0+ generate a HardFault on any unaligned access.
Struct padding: The compiler inserts padding bytes between fields (and at the end) to satisfy alignment. A poorly ordered struct can waste 50%+ of its size.
Reorder rule of thumb: Sort fields from largest to smallest alignment requirement to minimize padding.
Little-endian: LSB at lowest address (ARM default, x86, RISC-V). Big-endian: MSB at lowest address (network byte order, PowerPC, some ARM configs).
htonl/ntohl: The portable way to convert between host byte order and network (big-endian) byte order. Always use them in protocol code.

Deep Dive

At a Glance

Concept	Detail
Natural alignment	N-byte type at address divisible by N
Struct padding	Compiler-inserted bytes to maintain field alignment
Struct tail padding	Extra bytes at the end so arrays of structs stay aligned
`__attribute__((packed))`	Removes all padding -- use only for wire formats, never for performance-critical structs
Little-endian	LSB stored first (ARM Cortex-M default, x86)
Big-endian	MSB stored first (network byte order, PowerPC)
`htonl` / `ntohl`	Host-to-network / network-to-host for 32-bit values
`htons` / `ntohs`	Same, for 16-bit values

Why CPUs Prefer Aligned Access

A 32-bit processor reads memory in 4-byte-wide chunks on its data bus. When a uint32_t sits at an aligned address (divisible by 4), the CPU fetches it in a single bus transaction. When it sits at an unaligned address, the hardware must perform two bus reads and stitch the bytes together -- doubling the access time at best, triggering a HardFault at worst.
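The divisibility rule can be checked directly in code. The following is a minimal sketch, assuming a hypothetical helper named is_aligned (it is not part of any standard library) and a flat address space where converting a pointer to uintptr_t yields the numeric address:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if ptr sits on an n-byte boundary.
 * Assumes n is a nonzero power of two (1, 2, 4, 8, ...). */
static inline bool is_aligned(const void *ptr, size_t n) {
    return ((uintptr_t)ptr & (n - 1)) == 0;
}
```

A check like is_aligned(p, 4) in a debug assertion before casting a byte pointer to uint32_t * is one way to catch a bad cast on the host before it faults on the target.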

Behavior	Architectures
Unaligned access generates a fault	Cortex-M0, Cortex-M0+, some RISC-V cores
Unaligned access silently returns rotated data (no fault)	Older ARM7TDMI
Unaligned access works but with performance penalty	Cortex-M3, Cortex-M4, Cortex-M7, x86
Configurable via control register	Cortex-M3, Cortex-M4, Cortex-M7 (SCB->CCR `UNALIGN_TRP` bit)

On Cortex-M3/M4, you can enable the unaligned-access trap by setting UNALIGN_TRP in the Configuration and Control Register. This is useful during development to catch accidental unaligned accesses before they become performance problems in production.
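As a rough sketch of what setting that bit looks like: UNALIGN_TRP is bit 3 of the CCR. The helper name enable_unaligned_trap and the macro CCR_UNALIGN_TRP below are hypothetical; the register is passed in as a pointer so the same code can be exercised on a host with a stand-in variable.

```c
#include <stdint.h>

/* UNALIGN_TRP is bit 3 of the Configuration and Control Register (CCR). */
#define CCR_UNALIGN_TRP (1u << 3)

/* Hypothetical helper: set UNALIGN_TRP, leaving all other CCR bits unchanged. */
static inline void enable_unaligned_trap(volatile uint32_t *ccr) {
    *ccr |= CCR_UNALIGN_TRP;
}
```

On a target built with the CMSIS headers, the call would be enable_unaligned_trap(&SCB->CCR), or equivalently SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk. With the trap enabled, an unaligned access raises a UsageFault (escalated to HardFault if UsageFault is not enabled).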

Struct Padding in Practice

The compiler follows two rules: (1) each field must be naturally aligned, and (2) the struct's total size must be a multiple of its largest field's alignment so that arrays of structs work correctly.

/* Poorly ordered -- 12 bytes, 6 of them padding */
typedef struct {
    uint8_t  flags;     /* offset 0, size 1 */
                        /* 3 bytes padding to align next field */
    uint32_t timestamp; /* offset 4, size 4 */
    uint8_t  sensor_id; /* offset 8, size 1 */
                        /* 3 bytes tail padding (struct must be 4-byte multiple) */
} bad_order_t;          /* sizeof = 12 */

/* Reordered -- 8 bytes, only 2 bytes of tail padding */
typedef struct {
    uint32_t timestamp; /* offset 0, size 4 */
    uint8_t  flags;     /* offset 4, size 1 */
    uint8_t  sensor_id; /* offset 5, size 1 */
                        /* 2 bytes tail padding */
} good_order_t;         /* sizeof = 8 */

The rule is simple: sort fields from largest alignment to smallest. This pushes all the small fields together at the end, where they can share padding slots.
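Rather than trusting mental arithmetic, the predicted layout can be pinned down at compile time. This sketch repeats the two structs from above and assumes an ABI where uint32_t has 4-byte alignment (true on Cortex-M and x86, not on some 8-bit targets):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  flags;
    uint32_t timestamp;
    uint8_t  sensor_id;
} bad_order_t;

typedef struct {
    uint32_t timestamp;
    uint8_t  flags;
    uint8_t  sensor_id;
} good_order_t;

/* The build fails if the compiler's layout differs from the prediction. */
_Static_assert(offsetof(bad_order_t, timestamp) == 4, "3 bytes of padding after flags");
_Static_assert(offsetof(bad_order_t, sensor_id) == 8, "sensor_id follows timestamp");
_Static_assert(sizeof(bad_order_t) == 12, "3 bytes of tail padding");
_Static_assert(offsetof(good_order_t, sensor_id) == 5, "small fields packed together");
_Static_assert(sizeof(good_order_t) == 8, "2 bytes of tail padding");
```

Keeping assertions like these next to a struct definition turns an accidental field reorder into a build error instead of a silent RAM increase.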

⚠️Common Trap: __attribute__((packed))

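Packed structs eliminate padding, but every access to a multi-byte field in a packed struct may be unaligned. When you access the field through the packed struct type, the compiler knows this and generates safe code: byte-by-byte loads on Cortex-M0, which are slow, or unaligned loads on Cortex-M3/M4, which cost extra bus cycles. The real danger is taking the address of a packed field and dereferencing it through an ordinary uint32_t pointer: the compiler then assumes alignment, and on Cortex-M0 the access causes a HardFault. Use packed structs only for matching exact wire or register layouts, and copy fields to a properly aligned struct for processing.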
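The recommended pattern (a packed struct for the exact layout, an aligned struct for processing) could look like the sketch below. The type names wire_reading_t and reading_t, the field layout, and the function unpack_reading are all hypothetical, the attribute syntax is GCC/Clang-specific, and byte-order conversion is deliberately left out to keep the focus on alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 7-byte wire layout; packed so no padding is inserted. */
typedef struct __attribute__((packed)) {
    uint8_t  sensor_id;   /* offset 0 */
    uint32_t timestamp;   /* offset 1: unaligned */
    uint16_t value;       /* offset 5: unaligned */
} wire_reading_t;

/* Naturally aligned working copy used by the rest of the code. */
typedef struct {
    uint32_t timestamp;
    uint16_t value;
    uint8_t  sensor_id;
} reading_t;

/* Copy the raw bytes into a packed struct, then move each field into the
 * aligned struct. Fields are read through the packed type, so the compiler
 * generates accesses that are safe for unaligned addresses. */
static inline reading_t unpack_reading(const uint8_t *raw) {
    wire_reading_t w;
    reading_t r;
    memcpy(&w, raw, sizeof w);   /* byte copy: raw needs no alignment */
    r.timestamp = w.timestamp;
    r.value     = w.value;
    r.sensor_id = w.sensor_id;
    return r;
}
```

After this point, code works only with reading_t, so no pointer to a packed field ever escapes.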

Endianness: Byte Order in Multi-Byte Values

Endianness defines which byte of a multi-byte value is stored at the lowest memory address. For the 32-bit value 0x12345678:

Address:        0x00   0x01   0x02   0x03
Little-endian:  0x78   0x56   0x34   0x12   (LSB first)
Big-endian:     0x12   0x34   0x56   0x78   (MSB first)

Property	Little-Endian	Big-Endian
Byte at lowest address	Least significant	Most significant
Common architectures	ARM (default), x86, RISC-V	PowerPC, Motorola 68k, network protocols
Casting advantage	`*(uint8_t *)&val` gives LSB directly	Human-readable in memory dumps
Network byte order?	No	Yes

ARM Cortex-M cores default to little-endian but the architecture is bi-endian -- some configurations (and some Cortex-A/Cortex-R cores) can run in big-endian mode.

Detecting Endianness at Runtime

#include <stdbool.h>
#include <stdint.h>

/* Union method -- clean, and well-defined in C (C++ treats this type punning as undefined) */
static inline bool is_little_endian(void) {
    union { uint32_t word; uint8_t byte; } probe = { .word = 1 };
    return probe.byte == 1;  /* LSB == 1 means little-endian */
}

In practice, endianness is almost always known at compile time for a given target. Runtime detection is mainly useful in portable library code or unit tests that run on the host.
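For the compile-time case, GCC and Clang predefine the macros __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__, and __ORDER_BIG_ENDIAN__. The sketch below assumes one of those compilers; the macro HOST_IS_LITTLE_ENDIAN and the function probe_little_endian are hypothetical names, and the runtime probe is included only as a cross-check for host-side unit tests.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Resolve byte order at compile time from compiler-predefined macros. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define HOST_IS_LITTLE_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
#define HOST_IS_LITTLE_ENDIAN 0
#else
#error "Unknown byte order: define HOST_IS_LITTLE_ENDIAN for this toolchain"
#endif

/* Runtime cross-check: copy out the byte at the lowest address of the value 1. */
static inline bool probe_little_endian(void) {
    uint32_t one = 1;
    uint8_t first;
    memcpy(&first, &one, 1);
    return first == 1;
}
```

Because HOST_IS_LITTLE_ENDIAN is a preprocessor constant, the unused branch of any #if that tests it is removed entirely rather than left for the optimizer.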

Network Byte Order and htonl/ntohl

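Network protocols such as TCP/IP and Modbus TCP define a canonical byte order so that different architectures can exchange data. That order is big-endian, called network byte order. (CAN itself does not mandate a payload byte order; the higher-layer protocol or signal database defines it.) The POSIX functions handle conversion: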

#include <arpa/inet.h>  /* or hand-roll on bare metal */

uint32_t host_val = 0x12345678;
uint32_t net_val  = htonl(host_val);  /* to big-endian */
uint32_t back     = ntohl(net_val);   /* back to host */

On a big-endian host, htonl is a no-op. On a little-endian host, it swaps bytes. The key point is that your code doesn't need to know which case applies -- htonl/ntohl abstract that away.

On bare-metal targets without <arpa/inet.h>, implement byte swaps manually:

static inline uint16_t swap16(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));  /* cast back: operands promote to int */
}

static inline uint32_t swap32(uint32_t v) {
    return ((v >> 24) & 0xFF)
         | ((v >>  8) & 0xFF00)
         | ((v <<  8) & 0xFF0000)
         | ((v << 24) & 0xFF000000);
}

With optimization enabled, GCC and Clang typically recognize these patterns and emit a single REV instruction on ARM (REV16 for the 16-bit case), so there is no need to use inline assembly.
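Putting the pieces together, a bare-metal stand-in for htonl/ntohl can be built from a byte swap plus a compile-time byte-order test. This is a sketch under assumptions: the names my_htonl, my_ntohl, and bswap32_sketch are hypothetical (chosen to avoid clashing with a platform's real functions), and the __BYTE_ORDER__ macros are GCC/Clang predefines.

```c
#include <stdint.h>

#if !defined(__BYTE_ORDER__) || !defined(__ORDER_LITTLE_ENDIAN__)
#error "This sketch relies on GCC/Clang byte-order macros"
#endif

/* Reverse the four bytes of a 32-bit value. */
static inline uint32_t bswap32_sketch(uint32_t v) {
    return ((v >> 24) & 0x000000FFu)
         | ((v >>  8) & 0x0000FF00u)
         | ((v <<  8) & 0x00FF0000u)
         | ((v << 24) & 0xFF000000u);
}

/* Host to network (big-endian): swap on little-endian hosts, no-op otherwise. */
static inline uint32_t my_htonl(uint32_t host) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return bswap32_sketch(host);
#else
    return host;
#endif
}

/* Network to host is the same operation: a swap is its own inverse. */
static inline uint32_t my_ntohl(uint32_t net) {
    return my_htonl(net);
}
```

Whatever the host byte order, the value returned by my_htonl(0x12345678) is laid out in memory as the bytes 12 34 56 78, which is what goes on the wire.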

Portable Serialization and Deserialization

When sending a struct over a wire (UART, SPI, Ethernet, CAN), never memcpy the raw struct. Different compilers, architectures, and padding rules will produce different binary layouts. Instead, serialize field-by-field into a byte buffer in a defined byte order:

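#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* or hand-rolled htonl/htons on bare metal */

/* Assumed layout; the original listing does not define sensor_reading_t */
typedef struct { uint32_t timestamp; uint16_t value; uint8_t sensor_id; } sensor_reading_t;

/* Serialize a sensor reading into a network-order byte buffer (buf must hold at least 7 bytes) */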
void serialize(const sensor_reading_t *r, uint8_t *buf) {
    uint32_t ts_n  = htonl(r->timestamp);
    uint16_t val_n = htons(r->value);
    memcpy(&buf[0], &ts_n,  4);  /* bytes 0-3: timestamp */
    memcpy(&buf[4], &val_n, 2);  /* bytes 4-5: value     */
    buf[6] = r->sensor_id;       /* byte  6:  id (1 byte, no swap) */
}

/* Deserialize from network-order byte buffer */
void deserialize(const uint8_t *buf, sensor_reading_t *r) {
    uint32_t ts_n;
    uint16_t val_n;
    memcpy(&ts_n,  &buf[0], 4);
    memcpy(&val_n, &buf[4], 2);
    r->timestamp = ntohl(ts_n);
    r->value     = ntohs(val_n);
    r->sensor_id = buf[6];
}

Using memcpy for multi-byte loads avoids unaligned-access traps even if buf is not 4-byte aligned. The compiler optimizes these memcpy calls into efficient load/store instructions when it can prove alignment or when the target tolerates unaligned access.
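An alternative that needs neither htonl nor any knowledge of the host byte order is to build the wire bytes with shifts. The sketch below uses hypothetical helper names (put_be32, get_be32, put_be16, get_be16); because it touches the buffer one byte at a time, it is also safe at any offset, including odd ones.

```c
#include <stdint.h>

/* Write v into buf[0..3], most significant byte first (big-endian). */
static inline void put_be32(uint8_t *buf, uint32_t v) {
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}

/* Read a big-endian 32-bit value from buf[0..3]. */
static inline uint32_t get_be32(const uint8_t *buf) {
    return ((uint32_t)buf[0] << 24)
         | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)
         |  (uint32_t)buf[3];
}

/* Write v into buf[0..1], most significant byte first. */
static inline void put_be16(uint8_t *buf, uint16_t v) {
    buf[0] = (uint8_t)(v >> 8);
    buf[1] = (uint8_t)v;
}

/* Read a big-endian 16-bit value from buf[0..1]. */
static inline uint16_t get_be16(const uint8_t *buf) {
    return (uint16_t)(((uint16_t)buf[0] << 8) | buf[1]);
}
```

The shifts operate on values, not on memory, so the same source produces the same wire bytes on little-endian and big-endian hosts. A little-endian wire format only needs the byte indices reversed.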

💡Interview Tip: Never memcpy a Whole Struct Over a Wire

If an interviewer asks how you'd send data between two different MCUs, immediately say "field-by-field serialization with explicit byte order." Saying "just memcpy the struct" is a red flag -- it breaks on different compilers, different padding, and different endianness.

Debugging Story: The Every-Other-Sensor Bug

A team was integrating a Cortex-M4 gateway with a legacy big-endian PowerPC sensor node over a custom UART protocol. Odd-numbered sensor IDs reported reasonable values, but even-numbered sensor IDs showed wildly wrong readings -- temperature values like 14,000 degrees.

The root cause: the protocol struct was defined identically on both sides, and the developer used memcpy to transfer raw struct bytes. The uint8_t sensor_id field happened to be followed by 3 bytes of padding on the Cortex-M4 side (little-endian, different struct layout) but only 1 byte of padding on the PowerPC side. The mismatched padding shifted the uint32_t sensor_value field by 2 bytes, and the endianness difference flipped the remaining bytes. For odd sensor IDs, a coincidental alignment masked the bug; for even IDs, it produced garbage.

The fix: replace the raw struct transfer with field-by-field serialization using htonl/htons, and define a wire format that is independent of either side's struct layout.

Lesson: Never assume two compilers produce the same struct layout, even for identical source code. Always define an explicit wire format and serialize field-by-field.

What interviewers want to hear: You understand that alignment is a hardware constraint, not a compiler preference -- misaligned access causes either faults or performance hits depending on the core. You can look at a struct and mentally calculate its padded size, then reorder fields to eliminate waste. You know that endianness determines byte order and that network byte order is big-endian. You always serialize field-by-field with htonl/ntohl instead of sending raw structs. You know that __attribute__((packed)) trades correctness and performance for exact layout control, and you use it sparingly.

Interview Focus

Classic Memory Alignment & Endianness Interview Questions

Q1: "What is memory alignment and what happens if you violate it?"

Model Answer Starter: "Natural alignment means storing an N-byte type at an address divisible by N -- a uint32_t at a multiple of 4, a uint16_t at a multiple of 2. If you violate this, the result depends on the CPU. On Cortex-M0 and M0+, an unaligned 32-bit access triggers a HardFault. On Cortex-M3/M4/M7, the hardware handles it transparently but takes extra bus cycles. On x86 it works with a small penalty. The compiler normally ensures alignment through padding, but you can break it with packed structs or by casting arbitrary byte pointers to wider types."

Q2: "Given a struct with fields uint8_t, uint32_t, uint8_t -- what is sizeof and how would you reduce it?"

Model Answer Starter: "The compiler pads it to 12 bytes: 1 byte for the first uint8_t, 3 bytes padding to align the uint32_t at offset 4, 4 bytes for the uint32_t, 1 byte for the second uint8_t, then 3 bytes tail padding so the struct size is a multiple of 4. To reduce it, reorder the fields: put the uint32_t first, then both uint8_t fields. That gives 4 + 1 + 1 + 2 bytes tail padding = 8 bytes total. The general rule is to sort fields from largest alignment requirement to smallest."

Q3: "Explain big-endian vs little-endian. How do you handle byte order in protocol code?"

Model Answer Starter: "Endianness is the byte order of multi-byte values. Little-endian stores the least significant byte first -- ARM and x86 are little-endian by default. Big-endian stores the most significant byte first -- network protocols use big-endian as the standard, called network byte order. In protocol code I use htonl/htons before sending and ntohl/ntohs after receiving. On bare metal without a POSIX library, I implement byte-swap functions using shifts and masks. I never send raw structs because padding differences compound the endianness problem."

Q4: "How would you detect endianness at runtime?"

Model Answer Starter: "The simplest method is to store a known multi-byte value and read its first byte. For example, store uint32_t val = 1 and check if *(uint8_t*)&val is 1 -- if so, the system is little-endian because the least significant byte is at the lowest address. A cleaner approach uses a union to avoid the pointer cast. In practice, endianness is usually known at compile time for a given target, so runtime detection is mainly needed in portable library code."

Q5: "When should you use __attribute__((packed)) and what are the risks?"

Model Answer Starter: "I use packed structs only when I need the struct layout to match an exact binary format -- a protocol header, a hardware register block, or a file format. The risk is that packed structs may contain unaligned fields. As long as I access a field through the packed struct type, the compiler knows it may be unaligned and generates safe code: byte-by-byte loads on Cortex-M0, which are significantly slower, or unaligned loads on Cortex-M3/M4, which cost extra bus cycles. The real hazard is taking a pointer to a packed field and dereferencing it as an ordinary uint32_t*; on Cortex-M0 that causes a HardFault. My approach is to define the packed struct for the wire format, then immediately copy its fields into a properly aligned struct for processing."

Trap Alerts

Don't say: "Endianness doesn't matter on ARM because it's always little-endian" -- ARM is bi-endian, and you constantly deal with big-endian network protocols and peripherals.
Don't forget: Struct tail padding exists to ensure arrays of structs stay aligned. sizeof(struct) is not just the sum of field sizes plus inter-field padding.
Don't ignore: The performance cost of packed structs. Saying "just pack everything" shows you don't understand the alignment trade-offs.

Follow-up Questions

"How does #pragma pack differ from __attribute__((packed))?"
"What is the alignof operator and when would you use _Alignas?"
"If you have a uint8_t* buffer from a DMA transfer and need to read a uint32_t at offset 3, how do you do it safely?"
"How does the linker ensure that global variables meet their alignment requirements?"

Practice

❓ What is the sizeof a struct with fields: uint8_t, uint32_t, uint16_t (in that order, no packing)?

❓ On a little-endian system, the 32-bit value 0xDEADBEEF is stored in memory as:

❓ What happens when a Cortex-M0 executes an unaligned 32-bit load?

❓ What does htonl() do on a big-endian host?

❓ Which struct field order minimizes padding for fields uint64_t, uint8_t, uint32_t, uint8_t?

Real-World Tie-In

Automotive CAN Gateway -- A CAN-to-Ethernet gateway bridging a Cortex-M4 and a PowerPC ECU was corrupting 32-bit diagnostic counters. The team had serialized CAN frames by casting struct pointers to byte arrays. Switching to field-by-field serialization with htonl fixed the corruption and reduced the gateway's error rate from 0.3% to zero.

IoT Sensor Node Memory Savings -- A battery-powered LoRa sensor node had 48 statically allocated event structs consuming 2,304 bytes of its 8 KB RAM. Each struct had fields ordered by logical grouping (ID, flags, timestamp, payload). Reordering fields by alignment reduced the struct from 48 bytes to 32 bytes, saving 768 bytes (33%) -- enough to add a second event buffer for redundancy.

Medical Device Certification Audit -- During IEC 62304 review, an auditor flagged all uses of __attribute__((packed)) in a patient monitor's firmware. The team had to demonstrate that every packed struct was accessed only through memcpy-based deserialization functions, never by direct field access, to prove freedom from unaligned-access faults on the Cortex-M0+ target.