Skip to content(if available)orjump to list(if available)

The power of interning: making a time series database smaller

trevor-e

Great write-up!

Apologies if this side-tracks the conversation, but why do we as an industry complicate the naming of such techniques? I don't remember what "string interning" is off the top of my head, but a "string lookup table" is immediately obvious. Maybe I'm wrong though and this is a commonly understood name. I guess interning captures both creating the lookup table + using the index in the related data structures. We used this technique at KAYAK to help efficiently construct the list of possible flight combinations for N x M depart/return legs.

c-linkage

Interning strings saves a ton of space. I wish more programmers would use it.

Back in the 32-bit days I was working on a large (multi GB) distributed Oracle database system but I couldn't use transaction log shipping (don't ask). Database A was the "main" system and database B was the "replica". To keep them in sync I needed a program that would compare the two databases and then generate an update script that would make B look like A.

Complicating things was that, for some reason, floating point data in binary format would never match exactly between the two systems, so all floats had to be exported as text.

The first attempt by a junior dev was implemented in C#. Not only was it terribly slow, but it also ran out of memory.

I wrote a new version in C that interned all strings using a hash table and a custom bump-allocator. I also exported every field in the database as a string, so I didn't have to deal with native types. Using this technique meant that a database record could be represented as a plain array of pointers to the interned strings.

Since each string was only recorded once, and every field was a pointer to a string, should two database records have the same values then they must by definition point to the same string. Comparing database rows was as easy as doing a memcmp() on the two pointer arrays, one being a record from database A and the other being a the record from database B.

Not only was the system incredibly fast, but it never took more than 150MB of memory to run.

gopalv

This is mostly the real reason why interning gets used, to avoid long string comparisons over saving memory as such.

Interned strings tend to not have a good cleanup mechanism, in a system where a lot of them are churned through. So often they tend to actually use more memory as data patterns evolve in a system.

I use the same trick when parsing json, where a large set of rows tend to have the keys repeated & the conversion to columnar is easier if the keys are interned.

kazinator

[delayed]

kccqzy

Here's another perspective: I was once asked to improve a custom data caching system that took too much memory even though it had string interning. (At that time, eviction had to be used on factors other than memory used.) String interning certainly helped with memory use but it wasn't enough. Eventually my solution was to compress each record of the dataset in memory. I found that this saved more memory than interning individual strings within each record. At that time I picked Google's snappy compression algorithm since it was already a dependency of the project, but these days I might have picked zstd with a negative compression level or lz4.

This just goes to show that if your primary goal is to save memory, you should consider storing it compressed and then decompress on the fly when used. Modern compression algorithms are good and fast on moderately sized textual data. You might be surprised to learn how fast decompression is. There are of course other benefits to interning like using pointer equality as string equality, but these aren't a factor in my project.

kazinator

[delayed]

Someone

A problem with explicit interning is that libraries, when creating objects, cannot make an informed decision whether to offer some runtime performance in exchange for the expectation of a decrease in memory usage.

And it is even less than an expectation. Many interning libraries never free objects that they created, so they can keep lots of objects that never are needed again around, and thus increase memory usage.

I think the ideal API would somehow a) be simple and b) have the program communicate with the system about what they desire from the system. As a first step, the system has to know how much memory the program is willing to use, how much longer it expects to run, and for how long the program plans to retain data it is creating now.

Simple examples:

- if the program is shutting down, and there’s a megabyte of memory available, it’s likely detrimental to intern any new data.

- in a “json grep” tool, interning json keys likely isn’t worth it, as most data will be discarded soon.

That’s probably at least a dozen of ph.d’s of research, and likely not attainable, though.

Diggsey

I have a Rust crate (ijson) which implements string interning and a generally much more efficient in-memory JSON representation. It probably doesn't compete with the author's specialized implementation, but if you just want a plug and play solution it's pretty good.

saghm

I did something similar to reduce the memory overhead of our Rust backend at a recent job; I noticed that when dealing with certain requests, we were using quite a lot of memory on duplicates of the same strings when processing things, so I made something I called an "alias cache" type that could be used to obtain an alias of an existing string if needed, and then introduced those to several places in the codebase where it was most helpful. This strategy also allowed some flexibility on how long strings were "cached" for, which let me tweak which layers of processing should share a cache versus creating their own (and letting the old one drop). Our use case definitely didn't have room for interning to use 2000x less memory, but I think in some of the pathological cases it reduced the max memory usage to around 50% of what it was before without significantly affecting runtime, which I was pretty pleased with!

jessekv

If you are curious about how Python's string interning works:

https://github.com/python/cpython/blob/main/InternalDocs/str...

wdb

As its already in Rust. Next step, is leveraging Apache DataFusion?

3abiton

Very fun read guillaume! I never find the courage to work on my "this seems to be interesting weekend idea" due to my fear of time commitment, but I often find these HN write up somehow fulfilling.

hu3

From the looks of it it seems that interning here is creating references to strings that repeat. Strings are large so if you can store a reference to it, you save space.

null

[deleted]

valand

Fun fact, V8 interns strings! But it doesn't seem to intern big ones.

Horffupolde

What about loading it into ClickHouse?