
Writing memory efficient C structs


102 comments

· July 30, 2025

_rend

For completeness, this description of alignment is misleading:

> Well, dear reader, this padding is added because the CPU needs memory to be aligned in sets of 4 bytes because it’s optimized in that fashion.

> ...

> Remember: since structs are aligned to 4 bytes, any padding is therefore unnecessary if the size of the struct is a multiple of 4 without the padding.

Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

In this article, `struct Monster` has an alignment of 4 because `int` and `float` have an alignment of 4 for the author's configuration. Expanding one of the `int`s to a `long` could increase the alignment to 8 on some CPUs, and removing the `int` and `float` fields would decrease the alignment to 1 for most CPUs.
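The rule described above can be checked directly with C11's `_Alignof`. The structs below are hypothetical, and the exact values are implementation-defined; the comments give typical x86-64 results:

```c
#include <stdalign.h>

/* Hypothetical structs illustrating that a struct's alignment defaults
   to the maximum alignment of its members. Exact values vary by platform. */
struct just_chars { char a; char b; };       /* alignment usually 1 */
struct with_int   { char a; int b; };        /* alignment usually 4 */
struct with_long  { char a; long long b; };  /* alignment usually 8 */

/* The compiler can confirm the rule at build time: */
_Static_assert(alignof(struct with_int) == alignof(int),
               "struct alignment follows its widest member");
_Static_assert(sizeof(struct with_int) % alignof(struct with_int) == 0,
               "struct size is a multiple of its alignment");
```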

sumtechguy

Also keep in mind that this is all very CPU- and compiler-specific. I had one compiler that packed everything at 4 or 8 bytes, usually 8, not the 1/2/4/8 you would expect. That was because the CPU would just segfault if you didn't play nice with data access. The compiler could hide a lot of that if you set the packing, using offsets, memory moves, and shifting. It was clever but slow. So by default they picked a packing wide enough to remove the extra instructions, at the cost of using more memory. x86 was by far the most forgiving at the time I was doing this; ARM was the least forgiving (at least on the platform I was using), with MIPS being OK in some cases but not others.

jandrewrogers

Some of the Cray hardware was basically pure 64-bit. The systems largely didn’t recognize smaller granularity. I learned a lot of lessons about writing portable C by writing code for Cray systems.

gavinsyancey

Um ... isn't alignment generally dictated by the platform ABI so that programs compiled by different compilers can be linked together?

plorkyeran

The widely used platforms with multiple compilers generally have one or more written down ABIs that the compilers all follow, but more niche platforms frequently have exactly one compiler (often a very out of date fork of gcc) that just does whatever they felt like implementing and may not even support linking together things built by different versions of that one compiler.

bobmcnamara

Ideally yes, but in practice there are at least a dozen just for x86 (with maybe 3 big ones).


flohofwoe

AFAIK alignment doesn't even matter anymore (for CPU data at least) since the 'word size' of a modern CPU is the size of a cache line (32 or 64 bytes?), e.g. unaligned accesses within a 32- or 64-byte block are no different from aligned accesses.

(technically there is still an advantage of items aligned to their size in that such an item can never straddle adjacent cache lines though)

And there's also still tons of different alignment requirements when working with GPU data - and interestingly those alignment requirements may differ from C's alignment rules, so you may need to explicitly use packed structs (which are still not a standard C feature!) with manual padding.

birn559

My understanding is that C++ compilers still add padding by default for performance reasons. CPU will have to spend a few cycles to reorganize data that is not aligned in chunks of 4 bytes.

xxs

> CPU will have to spend a few cycles to reorganize data that is not aligned in chunks of 4 bytes.

That's not true for quite a lot of CPUs; pretty much all x64 CPUs don't care.

flohofwoe

Daniel Lemire did some measuring ~~recently~~ (oops, in 2012):

https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

TL;DR: 10% difference on what in 2012 was a low-end CPU, no difference on "new in 2012" CPUs. So my guess is that by now it really doesn't matter anymore :)

boomlinde

> Individual data types have their own alignment (e.g., `bool`/`char` may be 1, `short` may be 2, `int` may be 4, `long` may be 8, etc.), and the alignment of a compound type (like a struct) defaults to the maximum alignment of its constituent types.

I will add that this is implementation defined. IIRC the only restriction the standard imposes on the alignment of a struct is that a pointer to it is also a pointer to its first member when converted, meaning its alignment must practically be a multiple of that of its first field.

NooneAtAll3

implementation-defined means your specialized platform can be supported without needing to conform - it does not mean that common knowledge is false for common users

boomlinde

"Implementation-defined" means that there is nothing to conform to as far as the standard is concerned. I have not claimed that "common knowledge is false for common users" or anything to that effect. My comment is additive, which should have been clear to anyone reading the first three words of it.

kazinator

I.e. the author is wrong; struct s { char x; }; is not required to be four-byte-aligned. It can have a 1 byte size, and thus alignment.

bobmcnamara

And bool may be 4, and char may be 2 (m68k)!

flohofwoe

Not unless you're stuck in the 90s ;) These sizes have all been standardized to 1 byte since C99.

bobmcnamara

Not the sizes, the alignments.

david2ndaccount

The author relies on the compiler fitting the bitfield in unused padding bytes after “speed”, but that is implementation-defined behavior (almost everything about bitfields is implementation defined). MSVC will align the unsigned bitfield to alignof(unsigned), whereas GCC and clang will use the padding bytes. To portably pack the struct, use unsigned integers and use flags (monster.flags & CAN_FLY instead of monster.can_fly).

See https://c.godbolt.org/z/K7oY77KGj
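A minimal sketch of the flags approach suggested above. The struct and flag names are hypothetical stand-ins for the article's `Monster`; the point is that the layout of a plain `uint32_t` field is pinned down by the ABI, while bitfield packing is implementation-defined:

```c
#include <stdint.h>

/* Hypothetical flag constants replacing one-bit bitfields. */
enum monster_flags {
    MONSTER_CAN_FLY  = 1u << 0,
    MONSTER_CAN_SWIM = 1u << 1,
    MONSTER_IS_BOSS  = 1u << 2
};

struct Monster {
    float    x, y;
    float    speed;
    uint32_t flags;
};

/* monster.can_fly = 1  becomes: */
static inline void set_can_fly(struct Monster *m) {
    m->flags |= MONSTER_CAN_FLY;
}

/* if (monster.can_fly)  becomes: */
static inline int can_fly(const struct Monster *m) {
    return (m->flags & MONSTER_CAN_FLY) != 0;
}
```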

cryptonector

Bitfields have other problems. Say you have two bitfields each of `uint8_t` type and totaling 16 bits: well, you might think that's _two_ fields, but the compiler is allowed to treat them as _one_ whenever it wants to, and _that_ can be a problem if you're accessing one with atomics and the other without because the other can yield unsynchronized accesses to the one that needs synchronization.

Bitfields in C leave way too much behavior to the compiler or undefined. It's really intolerable.

atiedebee

Even worse: bit fields can only be applied to int, signed int, and unsigned int (maybe char as well, but I don't remember).

Even crazier is the fact that a plain int bitfield's signedness is implementation-defined.

arjvik

Why can't there be a standard defined for bitfields in future C releases? This is a long-discussed drawback of a feature that I really really want to be able to use in production code.

petermclean

I've been working on an open source library that, while it doesn't solve bitfields, can provide convenient ergonomic access to bit-granular types:

https://github.com/PeterCDMcLean/BitLib

You can inherit from bit_array and create a pseudo bitfield:

  class mybits_t : bit::bit_array<17> {
    public:
      // slice a 1 bit sized field
      // returns a proxy reference to the single bit
      auto field1() {
        return (*this)(0, 1);
      }

      // slice a 3 bit sized field 
      // returns a proxy reference to the 3 bits
      auto field2() {
        return (*this)(1, 4);
      } 
  };
  mybits_t mybits(7);
  assert(mybits.field1() == 1);
  mybits.field1() = 0;
  assert(static_cast<int>(mybits) == 6); //no implicit cast to integers, but explicit supported

There is minimal overhead, depending on how aggressively the compiler inlines or optimizes.

PaulDavisThe1st

different/more than std::bitset<SIZE> ?

gizmo686

There is. I'm not sure when it was added, but looking at https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf, page 106, note 13

> An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.

In practice, I've been using this feature for ages (along with __attribute__((packed))). There comes a point where you can start depending on compiler specific features instead of relying solely on the standard.

wahern

The portion you quoted derives word-for-word from the original ANSI standard (C89/C90): https://port70.net/~nsz/c/c89/c89-draft.html#3.5.2.1

nomel

Can this be standardized? It seems it would have to accommodate the most limited architectures, in terms of memory access, probably making it practically useless. It seems best left to the compiler.

ajross

There is, but it's part of the platform ABI and not the C language standard. The latter specifies syntax and behavior, the former is what's concerned with interoperability details like memory layout.

I happen to have a copy of AAPCS64 in front of me, and you can find the specification of bit packing in section 10.1.8. The sysv ABI for x86/x86_64 has its own wording (but I'm pretty sure is compatible). MSVC does something else I believe, etc...

munificent

This is a good article, but it bugs me that it sort of only tells half the story.

Why are you making your structs smaller? Probably for "performance". If your goal is literally just reducing memory usage, then this is all fine. Maybe you're running on a small microcontroller and you just straight up don't have much RAM.

But these days, in most areas, memory is cheap and plentiful. What's expensive is CPU cache. If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

In that case, the advice here is only part of the story. Using bitfields and smaller integers saves memory. But in order to access those fields and do anything with them, they need to get loaded into registers. I'm not an expert here, but my understanding is that loading a bitfield or small int into a word-sized register can sometimes have some bit of overhead, so you have to pay attention to the trade-off.

If I was optimizing for overall runtime speed and considering packing structs to do that, I'd really want to have good profiling and benchmarks in place first. Otherwise it's easy to think you're making progress when you're actually going backwards.

ack_complete

> If you're optimizing memory use for that, it's really about runtime speed and less about total memory usage. You're trying to make things small so you have fewer cache misses and the CPU doesn't get stuck waiting as much.

The complication is that cache is a global resource. Code that has larger data structures, even if it runs faster, can contribute to a larger working set and a slightly higher cache miss rate in other code. This can lead to a death-by-thousand cuts scenario and it's hard to get solid data on this when profiling.

You're right, though, that there are a number of ways that smaller structs can be slower, either directly by execution time or indirectly by larger code causing more misses in the instruction cache. Some other factors include whether the compiler can coalesce accesses to multiple fields grouped together, or whether the struct hits a magic power of two size for faster array indexing or vectorization. Which means you can end up speeding up code occasionally by making the struct bigger.

ljchen

I mostly agree, except that perf is determined by a myriad of factors. Even if this piece of data fits into a cache line, it may have no effect, or a negative one, on the following code. For complex software, even if it's faster after the change, it's hard to tell what the contributing factor really is. I once read a systems paper (I forget the name) from a recent systems research conference on how changing the code layout, e.g. moving a block of code randomly, may have unexpected impacts on perf.

Personally, I only pay attention to the cache line and layout when the structure is a concurrent data structure and needs to sync across multiple cores.

bluGill

I pack structs because the packed layout matches some hardware or network structure, and thus I get direct access. This isn't a common case, but it exists.

yellowapple

That case is much more common in embedded/bare-metal programming. A decent chunk of the code I've written so far for getting Zig running on a Nintendo 64 is a bunch of packed structs representing various registers (and a bunch of methods on those structs representing common uses of those registers).

It's also a reasonably common case for parsing binary data formats (e.g. a file's header has fields at exact byte offsets, and it's much faster and easier to just grab all the bytes of that header and interpret those bytes as a packed struct).
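A sketch of that header-parsing pattern, with a hypothetical format. Note that `__attribute__((packed))` is a GCC/Clang extension, not standard C, and that reading the fields as-is assumes the file's byte order matches the host's:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical file header with fields at exact byte offsets. */
struct __attribute__((packed)) file_header {
    uint8_t  magic[4];  /* bytes 0..3 */
    uint16_t version;   /* bytes 4..5 */
    uint32_t length;    /* bytes 6..9; only reachable without padding */
};

/* memcpy from the raw buffer sidesteps the unaligned-access concerns
   that a direct pointer cast into the file's bytes can raise. */
static struct file_header parse_header(const uint8_t *bytes) {
    struct file_header h;
    memcpy(&h, bytes, sizeof h);
    return h;
}
```

Without the packed attribute, `length` would normally sit at offset 8 and the struct would no longer mirror the on-disk layout.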

voidnap

> memory is cheap and plentiful. What's expensive is CPU cache.

CPU cache is memory. Also memory isn't cheap, it is relatively expensive to access. CPU cache is way cheaper. You have it backwards.

mort96

The obvious and charitable read of their post is that they use "memory" to mean main memory/RAM, "CPU cache" to mean CPU cache, and "memory is cheap and plentiful" to mean "memory has a low dollar cost and most systems have a whole lot of it".

Your read -- that they use "memory" to mean "anything which stores bytes, including caches", and that they use "memory is cheap" to mean "accessing RAM has a low latency" -- is uncharitable, contrived and even literally self-contradictory ("CPU cache is memory" + "memory is expensive to access while CPU cache is cheap to access" is a contradiction which can only be resolved through using different definitions of "memory" in those two sentences).

writebetterc

They're talking about dollars

ryao

If you are using an intrusive data structure, keeping the key on the same cache line as the pointers can make traversals faster. Reducing padding can make this easier to do.

In general, keeping commonly accessed together things on the same cache line is a good idea and the work needed to remove excessive padding overlaps with the work needed to do this.

zeusk

The real issue with packed structs and bitfields happens when concurrency gets involved. The majority of modern CPU caches are private and only allow one core to hold a cache line for writing, so packing creates more false dependencies when cores try to alter information that was compacted into a cache line held by another core.

charleslmunger

Avoiding false sharing is a separate problem best solved by explicitly aligning the struct or relevant members to std::hardware_destructive_interference_size.

jcelerier

> But these days, in most areas, memory is cheap and plentiful

and that's why I am contemplating freaking 128G of ram for my next machine, I have 64 right now and OOM regularly, 32 is unuseable

zahlman

What software are you running? I'm humming along without a problem on 8.

jcelerier

just an instance of GCC on a larger file with some complex C++ will easily eat multiple gigabytes of RAM

wbl

Everything is microarch dependant but typically the memory subsystem can handle subword loads and writes. What it usually can't do is bits.

o11c

Note that although alignment and size are platform-specific, in practice padding can be minimized simply by using a fixed order:

  (various vector types)
  int128_t
  long double
  long long, int64_t, int_fast64_t, double
  off_t
  time_t (can be 64 bits on 32-bit platforms only if off_t also is)
  register_t, int_fast16_t, int_fast32_t (64-bit on e.g. x32)
  void *, void(*)(), intptr_t
  ptrdiff_t (theoretically wrong if fully aligned and bigger than pointer)
  size_t, ssize_t
  long (32-bit on win64; generally has 16-bit alignment if bigger than pointer)
  int32_t, float
  int (can be 16-bit)
  short, int16_t
  char, bool, int8_t, int_fast8_t
  (a trailing array member must come last; note when calculating combined size you should use offsetof instead of sizeof to not waste padding)
Besides this, `int_leastN_t` goes with `intN_t`, and all unsigned types go with their signed types. Note that types like `id_t` are hard to place however.

Although there do exist many platforms where this is not in order of size, the bigger types generally have smaller alignment on such platforms. Even so, I have occasionally split out rows without knowing of any such platform.
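The offsetof-instead-of-sizeof note for trailing array members can be sketched with a flexible array member (the struct here is hypothetical):

```c
#include <stddef.h>
#include <stdlib.h>

/* With a trailing (flexible) array member, allocating via offsetof lets
   the payload begin inside what would otherwise be tail padding. */
struct packet {
    size_t len;
    short  flags;
    char   data[];  /* trailing array member; must come last */
};

static struct packet *packet_new(size_t n) {
    /* offsetof(struct packet, data) can be smaller than
       sizeof(struct packet), which is rounded up to the alignment. */
    struct packet *p = malloc(offsetof(struct packet, data) + n);
    if (p) p->len = n;
    return p;
}
```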

kazinator

Generally speaking, sort the members in descending order by alignment, if you want the tightest packing without resorting to nonportable compiler extensions. This descending order eliminates padding between members producing the smallest offset for the last byte of the last member, and that then minimizes the padding required at the end of the struct.

In C, alignment for basic types is linked to size. A type may have an alignment requirement that is as much as its size: e.g. an 8 byte data type might be aligned as strictly as addresses divisible by 8.

The alignment of an array aggregate is that of its element type.

The alignment of a struct is that of its most strictly aligned member.

In portable coding, we don't exactly know, but we can make the worst-case assumption that every basic type has the strictest possible alignment, i.e. its size. We also don't know the sizes of all the types, but we know that integers of higher rank are at least as large as those of lower rank; i.e. sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long), etc.

Things are tricky when you have a mixture of floating-point, pointers and integers. E.g. pointer could be the same size as a float, or bigger.

Here is a trick you use: combine together clumps of like types, and try to make combinations that are powers of two. For instance, if your structure has two or four pointers to things, try to make them adjacent. E.g.:

  struct s {
    void *p1, *p2, *p3, *p4;
    short s1, s2, s3, s4;
    float f1, f2, f3, f4;
  };

Even though float is almost certainly larger than short and may have stricter alignment, it's moot because I have four of everything. Even if this is on a tiny system where pointers are 16 bits wide, four of them will make 8 bytes. So when the f1 member is being placed into the struct, the offset is at least 8-byte aligned: we don't expect padding.

If you aggregate enough members of the same type in a power-of-two sized clump, you can easily create alignment which meets or exceeds the strictest alignment in the system.

So those are the main principles: group together members of the same type. Then keeping those groups together, sort by descending alignment strictness.
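The descending-alignment rule can be seen by comparing two layouts of the same fields. Exact sizes are implementation-defined; the comments assume a typical LP64 platform:

```c
/* Interleaving small and large members pads after every small one. */
struct interleaved {
    char   a;  /* + 7 bytes padding before d (typical) */
    double d;
    char   b;  /* + 7 bytes padding before p (typical) */
    void  *p;
    char   c;  /* + 7 bytes tail padding (typical) */
};             /* typically 40 bytes */

/* Same fields sorted by descending alignment: only tail padding. */
struct sorted {
    double d;
    void  *p;
    char   a, b, c;  /* + 5 bytes tail padding (typical) */
};             /* typically 24 bytes */
```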

rwmj

No mention of pahole / dwarves? https://lwn.net/Articles/335942/ It's the standard tool used by the Linux kernel developers.

petergeoghegan

Recent versions of clangd show pahole-like struct size/alignment padding overhead annotations. I find this built-in support a lot more convenient than using pahole.

uticus

for somebody starting out, where do you go for a list of tools like this?

ryao

pahole is an awesome tool.

variadix

Some other things you can do to improve packing are using integer indices/offsets/relative pointers instead of absolute pointers (especially on 64-bit machines), overlapping fields for unions (as an example you may have a tag that can share some bits with another field in each member of the union), using a structs position in memory to encode information (for example, if you have a binary tree of array allocated nodes, you can specify that the right child always follows the parent in memory, then you only need to store the index of the left child), and of course structure packing pragmas/attributes.
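The index-instead-of-pointer idea can be sketched for an array-allocated tree; the node layouts are hypothetical, and the byte counts assume a typical 64-bit platform:

```c
#include <stdint.h>

/* Pointer links: each link costs 8 bytes on a 64-bit machine, and the
   pointers force 8-byte alignment on the whole struct. */
struct node_ptr {
    int key;
    struct node_ptr *left, *right;  /* typically 24 bytes total */
};

/* Index links into the backing node array: each link costs 4 bytes,
   and the struct's alignment relaxes to 4 as well. */
#define NO_CHILD UINT32_MAX

struct node_idx {
    int      key;
    uint32_t left, right;  /* typically 12 bytes total */
};
```

The trade-off is an upper bound of `UINT32_MAX` nodes and one extra add when dereferencing (base pointer plus index).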

foota

Some related thoughts (more for C++ than C).

You can eliminate the cost of a wrapping class with no members by taking advantage of the empty base class optimization (or rather: by avoiding members in a base class you will be taking advantage of EBCO).

You can use an untagged union in C++ (taking care to avoid undefined behavior) to implement small object optimization. In its general form, the idea is that you can store some smaller struct in a union (often as a char[] buffer, since C++ unions can't easily hold non-trivial types) with a larger structure, and use a sentinel in the larger structure to determine which it is. E.g., if the large struct has some field that must not be zero, then make it zero in the smaller struct and you can distinguish between the two without a tag (like would be required with std::variant).

I'm also a huge fan in general of "make invalid states unrepresentable" and following this can often reduce redundancy and eliminate extra fields.

bhawks

Optimizing struct layouts is cool, but if your motivation is to pack more monsters into your game, definitely consider transposing your row-oriented monster struct into a column-oriented entity using an Entity Component System.

Example framework:

https://github.com/skypjack/entt?tab=readme-ov-file

It has benefits besides memory/cache optimizations. It organizes logic in a much more composable way that is friendly for reuse.

cogman10

Nice little article.

The next thing to really explore is how the struct is being used. Smaller is always nice, but sometimes you need it to be fast. To do that, you'll want to group together fields that are commonly read. For example, x, y, and speed are likely always read together. So, it'd make sense to group those fields together to optimize for cache reads.

The next thing to question is if your Monster is being interacted with as an individual thing or as a group of things. If it's as a group of things then it might make sense to create a "Struct of arrays" of monsters. Imagine, for example, you have a system which removes all the dead monsters. It'd be faster to iterate through an `isAlive` array rather than an array of `Monster` structs.

And now you are on your way to reinventing an ECS :)
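The struct-of-arrays transform described above can be sketched like this (the `Monsters` layout and `count_alive` helper are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

enum { MAX_MONSTERS = 1024 };

/* Array-of-structs: a liveness scan drags every field through cache. */
struct Monster { float x, y, speed; int health; bool is_alive; };

/* Struct-of-arrays: the same scan touches only the is_alive bytes,
   so far more entries fit per cache line. */
struct Monsters {
    float x[MAX_MONSTERS];
    float y[MAX_MONSTERS];
    float speed[MAX_MONSTERS];
    int   health[MAX_MONSTERS];
    bool  is_alive[MAX_MONSTERS];
};

static size_t count_alive(const struct Monsters *m, size_t n) {
    size_t alive = 0;
    for (size_t i = 0; i < n; i++)
        alive += m->is_alive[i] ? 1 : 0;
    return alive;
}
```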

miguel_martin

>“… Note that I'm not an expert, in anything, so take everything I write about with a grain of salt I guess :).”

>”I’m by far an expert and I could be wrong in some specifications”

You’re missing a “not” on that 2nd sentence ;)

In all seriousness, you don’t need this disclaimer, you’re selling yourself short.

There’s nothing wrong with your post to make the disclaimer.

And if there is something wrong with the post, let the comments flood in and tell you that rather than yourself.

Retr0id

I don't think I've ever seen struct bitfields used in real code, it's almost always manual bit-masks against a numeric type, often via macros. (I assume due to portability)

ziml77

The Windows API uses bitfields like that in some places. I know because I was working with the display configuration API recently and found the bitfields very annoying to use through generated FFI bindings (especially because they were used in anonymous structs inside anonymous unions).

gizmo686

I see it decently often in embedded code for memory mapped IO registers. Other than that, I don't think I've ever seen it either.

I'm not sure portability is the reason, though. I've seen it in places that have no problem spamming #define _GNU_SOURCE

mort96

I would be scared of using it for IO... With a struct bit field, I don't know what bit offset a field ends up with, I don't know how it gets padded, and I don't know how large the generated load and store becomes. I'd rather do the bit operations and store myself.

nomel

This is the only place I've seen it used, and the only placed I've used it.

It's absolutely wonderful for IO.

dirt_like

I have seen bit fields used in real code, long ago. The reason I know it so well is that bit field updates are NOT atomic, and this caused a rare crashing bug under heavy load. Unless you read the header, the code looks atomic.

S->buffer_free = 1;

This is a very serious limitation if the structure/class is being updated in a multi-threaded environment.

tom_

In terms of the underlying bit assignments, bitfields have been pretty safe for a very long time now, as you can assume little-endian. It's just a question of bearing in mind the packing rules, which is not too difficult given stdint.h. I went full bitfield shortly after I went full size_t, about 15 years ago now.

Big-endian systems sort of still exist (I do indeed have an Amiga, an Atari STe, and an Xbox 360, and they all work, so they must count), but they're only slightly less weird from a practical C programming perspective than stuff like DSPs with CHAR_BIT == 24 and Harvard-architecture microcontrollers with 2-byte ints. And so you can ignore them. People who still actually use these systems will be used to it.

ack_complete

There are some other annoyances, like not being able to inline initialize a bitfield prior to C++20, and sometimes having to use unnatural typing to get optimal packing depending on the ABI. But I've seen them used, compilers have gotten pretty good at optimizing them and can coalesce writes or tests against adjacent bitfields.

abnercoimbre

Yep. If I need a boolean for a struct I know I’ll want more than one. So I’ll stick an integer type that lets me OR multiple enum values (and so on).

anonnon

I have, e.g., in Julia's UTF-8 library (utf8proc).