Sick: Indexed deduplicated binary storage for JSON-like data structures
28 comments
October 28, 2025 · qixxiq
halayli
I don't know what kind of data you are dealing with, but it's illogical and against all best practices to have this many keys in a single object. It's equivalent to saying that tables with 65k columns are very common.
On the other hand, most database decisions are about finding the sweet-spot compromise tailored toward the common use case they are aiming for, but your comment sounds like you are expecting a magic trick.
jerf
Every pathological case you can imagine is something someone somewhere has done.
Sticking data into the keys is definitely a thing I've seen.
One I've done personally is dump large portions of a Redis DB into a JSON object. I could guarantee for my use case it would fit into the relevant memory and resource constraints but I would also have been able to guarantee it would exceed 64K keys by over an order of magnitude. "Best practices" didn't matter to me because this wasn't an API call result or something.
There are other things like this you'll find in the wild. Certainly some sort of "keyed by user" dump value is not unheard of and you can easily have more than 64K users, and there's nothing a priori wrong with that. It may be a bad solution for some specific reason, and I think it often is, but it is not automatically a priori wrong. I've written streaming support for both directions, so while JSON may not be optimal it is not necessarily a guarantee of badness. Plus with the computers we have nowadays sometimes "just deserialize the 1GB of JSON into RAM" is a perfectly valid solution for some case. You don't want to do that a thousand times per second, but not every problem is a "thousand times per second" problem.
Groxx
Redis is a good point. I've made MANY >64k-key maps there in the past, some up to half a million (and likely more if we hadn't rearchitected before we got bigger).
kevincox
You seem to be assuming that a JSON object is a "struct" with a fixed set of application-defined keys. Very often it can also be used as a "map". So the number of keys is essentially unbounded and just depends on the size of the data.
zarzavat
Let's say you have a localization map: the keys are the localization key and the values are the localized string. 65k is a lot but it's not out of the question.
You could store this as two columnar arrays but that is annoying and hardly anyone does that.
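For concreteness, a small sketch of the two shapes being compared (all names and strings here are made up):

```scala
// Hypothetical localization data in both shapes: one key per localization key
// (key count grows with the data) vs. two parallel "columnar" arrays.
object LocalizationShapes {
  // Object/map shape: {"home.title": "Welcome", "home.cta": "Sign up", ...}
  val asMap: Map[String, String] =
    Map("home.title" -> "Welcome", "home.cta" -> "Sign up")

  // Columnar shape: {"keys": [...], "values": [...]} with matching indices.
  val keys: Vector[String]   = Vector("home.title", "home.cta")
  val values: Vector[String] = Vector("Welcome", "Sign up")

  def main(args: Array[String]): Unit =
    println(asMap("home.cta") == values(keys.indexOf("home.cta"))) // true
}
```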
duped
A pattern I've seen is to take something like `{ "users": [{ "id": string, ... }]}` and flatten it into `{ "user_id": { ... } }` so you can deserialize directly into a hashmap. In that case I can see 65k+ keys easily, although for an individual query you would usually limit it.
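To make that shape concrete, here is a minimal sketch of the flattening (the `User` type and all names are hypothetical, not from any library discussed here):

```scala
// Hypothetical sketch: flattening a `users` array into an id-keyed object
// means the key count grows with the data, not the schema.
final case class User(id: String, name: String)

object FlattenUsers {
  def byId(users: List[User]): Map[String, User] =
    users.map(u => u.id -> u).toMap

  def main(args: Array[String]): Unit = {
    val users = List(User("u1", "Alice"), User("u2", "Bob"))
    // Serialized back out, this becomes { "u1": {...}, "u2": {...} },
    // i.e. one key per user, which easily passes 65k for a large user base.
    println(byId(users))
  }
}
```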
paulddraper
That's like saying it's illogical to have 65k elements in an array.
What is the difference?
pshirshov
If the limitation affects your usecase, you can chunk your structures.
The limitation comes with benefits.
xienze
> I don't know what kind of data you are dealing with but its illogical and against all best practices to have this many keys in a single object.
The whole point of this project is to handle efficiently parsing "huge" JSON documents. If 65K keys is considered outrageously large, surely you can make do with a regular JSON parser.
pshirshov
> If 65K keys is considered outrageously large
You can split it yourself. If you can't, replace Shorts with Ints in the implementation and it would just work, but I would be very happy to know your usecase.
Just bumping the pointer size to cover relatively rare usecases is wasteful. It can be partially mitigated with more tags and tricks, but it still would be wasteful. A tiny chunking layer is easy to implement and I don't see any downsides in that.
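As a rough illustration of the kind of chunking layer meant here (a hypothetical sketch, not SICK's actual API):

```scala
// Hypothetical external chunking layer: split an oversized map into
// sub-objects that each stay under the 2-byte (65535) key limit.
object Chunking {
  val MaxKeysPerChunk = 65535

  def chunk[V](big: Map[String, V]): Vector[Map[String, V]] =
    big.toVector.grouped(MaxKeysPerChunk).map(_.toMap).toVector

  def main(args: Array[String]): Unit = {
    val big    = (0 until 200000).map(i => s"k$i" -> i).toMap
    val chunks = chunk(big)
    println(chunks.map(_.size)) // e.g. Vector(65535, 65535, 65535, 3395)
  }
}
```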
nine_k
I'd say that it's generally unwise to use fixed-width integers in a data structure where this width can vary widely, and has no logical upper limit. Arbitrary-size integers are well known, used in practice, and not hard to implement.
pshirshov
Even if it's "generally unwise", it was a well-thought-out decision in this particular case. See my other comments. An array of elements of constant size is indexed for free; an array of elements of varying size needs a separate index. SICK's binary representation (EBA) was created with several particular usecases in mind. I needed the most compact representation and the fastest access (for very underpowered devices); large objects were not a big concern, as they can be chunked externally.
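To illustrate the "indexed for free" point, here is a small sketch using fixed-size records; this is illustrative only and is not SICK's actual EBA layout:

```scala
import java.nio.ByteBuffer

// Illustrative only: with fixed-size records, the i-th element sits at a
// computable offset, so no separate index structure is needed.
object FixedWidthAccess {
  val ElementSize = 6 // hypothetical: a 2-byte tag plus a 4-byte payload

  def write(buf: ByteBuffer, index: Int, tag: Int, payload: Int): Unit = {
    val offset = index * ElementSize
    buf.putShort(offset, tag.toShort)
    buf.putInt(offset + 2, payload)
  }

  def read(buf: ByteBuffer, index: Int): (Short, Int) = {
    val offset = index * ElementSize // O(1) access, no lookup table
    (buf.getShort(offset), buf.getInt(offset + 2))
  }

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(3 * ElementSize)
    write(buf, 0, 1, 100); write(buf, 1, 2, 200); write(buf, 2, 3, 300)
    println(read(buf, 2)) // (3,300)
  }
}
```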
pshirshov
In our usecase, for which we created the library, we made this tradeoff to save several bytes per pointer and keep the binary form more compact. The application splits large objects into smaller chunks. 99% of the structures there are relatively small, but there are tons of them. Most likely you can do the same: just split large structures into smaller ones.
If you need support for larger structures, you may create your own implementation or extend ours (and I would really like to hear about your usecase).
SICK as a concept is simple. SICK as a library was created to cover some particular usecases and may not be suitable for everyone. We would welcome any contributions.
duped
If you use varints for pointers you can have the best of both worlds, and achieve even better compaction for the smallest objects.
pshirshov
Yep, it can be done, but varints aren't free because of their variable size. With constant-size pointers I can access const-sized elements directly; with varints I would have to do some rituals.
I have another binary encoding for different purposes (https://github.com/7mind/baboon) which relies on varints; in the case of SICK I decided to go with pointers of constant size to save some pennies on access efficiency.
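For readers unfamiliar with the tradeoff: a varint spends fewer bytes on small values but makes element sizes variable, so the i-th pointer no longer sits at a fixed offset. A minimal LEB128-style sketch, not code from SICK or baboon:

```scala
// Minimal unsigned LEB128-style varint encoder, for illustration only.
object Varint {
  def encode(value: Long): Array[Byte] = {
    val out  = scala.collection.mutable.ArrayBuffer.empty[Byte]
    var v    = value
    var more = true
    while (more) {
      val low7 = (v & 0x7f).toByte
      v >>>= 7
      if (v == 0) { out += low7; more = false } // last byte: high bit clear
      else out += (low7 | 0x80).toByte          // more bytes follow
    }
    out.toArray
  }

  def main(args: Array[String]): Unit = {
    println(encode(5L).length)     // 1 byte
    println(encode(70000L).length) // 3 bytes: sizes vary, so the i-th value
                                   // is no longer at offset i * width
  }
}
```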
gethly
It is a bit confusing that JSON is being mentioned so much when in reality this has nothing to do with it - except to showcase that JSON is not suitable for streaming whereas this format is.
Secondly, I fail to see the advantages here, as the claim is that it allows streaming for partial processing, compared to JSON, which has to be fully loaded in order to be parseable. Mainly because the values must be streamed first, before their locations/pointers, in order for the structure to make sense and be usable for processing; but that also means we need all the parent pointers as well in order to know where to place the children in the root. So all in all, I just do not see why this format is advantageous over JSON (as that is its main complaint here), since you can stream JSON just as easily: you can detect the { and } and [ and ] and " and , delimiters and know when your token is complete so you can process it, without having to wait for the whole structure to finish being streamed or for the SICK pointers to arrive in full so you can build the structure.
Or, I am just not getting it at all...
pshirshov
> when in reality this has nothing to do with it
It's a specific representation of JSON-like data structures, with an indexed deduplicated binary format and JSON encoders and decoders. Why "nothing"? It's all about it.
Mostly it's not about streaming. More efficient streaming is a byproduct of the representation.
> because you can detect the { and } and [ and ] and " and , delimiters
You need a pushdown automaton for that (see the sketch at the end of this comment). In the case of SICK you don't need potentially unbounded accumulation for many (not all) usecases.
> the values must be streamed first, before their location/pointers
Only to avoid accumulation. If you are fine with (some) accumulation, you can reorder. Also think about the updates.
But again, streaming is a byproduct. This tool is an indexed binary deduplicating storage which does not require parsing and provides amortized O(1) access time.
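To illustrate the pushdown-automaton point: a delimiter counter has to carry nesting depth and in-string/escape state across chunk boundaries (and a full validator needs a stack of open brackets). A hypothetical sketch, not SICK code:

```scala
// Illustrative: to tell when a streamed JSON value is complete you must track
// nesting depth and whether you are inside a string (with escapes).
object JsonDepth {
  final case class State(depth: Int, inString: Boolean, escaped: Boolean)

  def feed(state: State, chunk: String): State =
    chunk.foldLeft(state) { (s, c) =>
      if (s.inString) {
        if (s.escaped) s.copy(escaped = false)
        else if (c == '\\') s.copy(escaped = true)
        else if (c == '"') s.copy(inString = false)
        else s
      } else c match {
        case '"'       => s.copy(inString = true)
        case '{' | '[' => s.copy(depth = s.depth + 1)
        case '}' | ']' => s.copy(depth = s.depth - 1)
        case _         => s
      }
    }

  def main(args: Array[String]): Unit = {
    val s = feed(State(0, inString = false, escaped = false), """{"a":[1,2,{"b":"}"}""")
    println(s) // State(2,false,false): the value is still open despite the final '}'
  }
}
```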
aaronblohowiak
I think this is a generational thing. To a bunch of people now practicing, JSON isn’t just an encoding but has come to mean “nested basic data types”.
8organicbits
I think this quote explains the efficient streaming:
> There is an interesting observation: when a stream does not contain removal entries it can be safely reordered.
So if I'm understanding, the example in the readme could be sent in reverse, allowing the client to immediately use root:0 and then string:2 while the rest streams in.
I was looking for something like this, but my use case exceeds the 65k key limit for objects.
pshirshov
> 65k key limit for objects
The limit comes from 2-byte element pointer size. That can be adjusted. We don't have an implementation with larger pointers but it can be done easily.
> while the rest streams in
Yes, there are many usecases where you can use some chunks of the data or rebuild some parts of the structures immediately, without any accumulation. The problem is that we don't have a nice streaming abstraction which would suit everyone for every usecase.
SICK as a library is an efficient indexed binary storage for JSON with listed limitations.
SICK as a concept is much more but you might need your own implementation tailored to your usecase.
gethly
If the stream contains removal entries, then that is not a data stream but a command stream, and, again, has nothing to do with JSON itself. We can extrapolate this to some kind of json-rpc-ish stream if we want to keep this aligned with JSON.
Overall, I think that mentioning JSON here at all is simply a mistake. It would be better to just introduce this as a streaming protocol/framework for data structures. But then we can do the same thing with literally any format and syntax.
pshirshov
> has nothing to do with JSON itself.
It's literally a deduplicated indexed binary storage for JSON (plus an approach to JSON representation more suitable for streaming than serialized JSON).
> we can do the same thing
I would highly encourage doing the same things! For some reason people love to fully parse a 50 MiB JSON document when they need just 0.1% of the data in it.
noobcoder
It could be really useful for cases where you're repeatedly processing similar JSON structures, like analytical events. Any plans for language bindings beyond the current implementation?
pshirshov
> any plans for language bindings beyond the current implementation?
If we need more bindings for the projects we work on, we will implement and open-source them. E.g., recently we added rudimentary JS support (no cursors, just an encoder/decoder).
For many reasons we avoid working on things we don't use ourselves and aren't paid for. But your contributions are very welcome. Also, we would be happy to have you as a paying client.