Begrudgingly Choosing CBOR over MessagePack

53 comments

·March 2, 2025

magicalhippo

The article links to this[1] older HN comment which argues against CBOR. I must admit this passage made me laugh out loud:

    A decoder that comes across a simple value (Section 2.3) that it does not
    recognize, such as a value that was added to the IANA registry after the
    decoder was deployed or a value that the decoder chose not to implement,
    might issue a warning, might stop processing altogether, might handle the
    error by making the unknown value available to the application as such (as
    is expected of generic decoders), or take some other type of action.

I choose not to implement `int`. I decide instead to fill up your home folder. I'm a compliant CBOR implementation.

Seems this part of the specification has been rewritten[2], so now a generic decoder is to pass on the tag and value to the user or return an error.
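A minimal sketch of the "pass the tag and value on to the user" behaviour described above. It assumes a deliberately tiny decoder (unsigned ints, byte/text strings, arrays and tags only); `Tagged`, `read_head` and `decode` are hypothetical names for illustration, not any real library's API.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class Tagged:
    """A tag the decoder does not interpret, handed to the application as-is."""
    tag: int
    value: Any


def read_head(buf: bytes, pos: int) -> Tuple[int, int, int]:
    """Return (major type, argument, next position). No bounds checking."""
    initial = buf[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        return major, info, pos
    if info > 27:
        raise ValueError("reserved or indefinite length: not handled in this sketch")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 argument bytes, big-endian
    return major, int.from_bytes(buf[pos:pos + size], "big"), pos + size


def decode(buf: bytes, pos: int = 0) -> Tuple[Any, int]:
    """Decode one data item; return (value, position after the item)."""
    major, arg, pos = read_head(buf, pos)
    if major == 0:                      # unsigned integer
        return arg, pos
    if major == 2:                      # byte string
        return buf[pos:pos + arg], pos + arg
    if major == 3:                      # text string
        return buf[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:                      # array of `arg` items
        items = []
        for _ in range(arg):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    if major == 6:                      # tag: decode the content, keep the tag number
        inner, pos = decode(buf, pos)
        return Tagged(arg, inner), pos
    raise ValueError(f"major type {major} not supported in this sketch")
```

Here the decoder never guesses at a tag's meaning: it either returns the pair to the caller or raises an error.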

[1]: https://news.ycombinator.com/item?id=14072598

[2]: https://www.rfc-editor.org/rfc/rfc8949.html#name-generic-enc...

camgunz

Disclaimer: I wrote and maintain a MessagePack implementation.

Hey that's me!

Yeah they fixed that, but there's other parts of the spec that are basically unworkable like indefinite length values, "canonicalization", and tags, making it essentially MP (MP does have extension types, I should say, the virtue of tossing out CBOR's tags is you then don't have to implement things like datetimes/timezones, bignums, etc), and indeed at least FIDO tosses this all out: https://fidoalliance.org/specs/fido-v2.0-ps-20190130/fido-cl...

Beyond that, CBOR is MessagePack. The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway. All this design by committee stuff is mostly wrong--though IETF has modified it by committee since.

There's no reason an MP implementation has to be slower than a CBOR implementation. If a given library wanted to be very fast it could be. If anything, the fact that CBOR more or less requires you to allocate should put a ceiling on how fast it can really be. Or, put another way, benchmarks of dynamic language implementations of a serialization format aren't a high signal indication of its speed ceiling. If you use a dynamic language and speed is a concern to this degree, you'd write an adapter yourself, probably building on one of the low level implementations.

That said, people are usually disappointed by MP's speed over JSON. A lot of engineering hours have gone into making JSON fast, to the point where I don't think it ever made sense to choose MP over it for speed reasons (there are other good reasons). Other posters here have pointed out that your metrics are usually dominated by something else.

But finally, CBOR is fine! The implementations are good and it's widely used. Users of CBOR and MP alike will probably have very similar experiences unless you have a niche use case (on an embedded device that can't allocate, you really need bignums, etc).

lifthrasiir

> there's other parts of the spec that are basically unworkable like indefinite length values, "canonicalization", and tags, making it essentially MP [...].

I'm not sure but your wording suggests that CBOR is equally unworkable as MP because they implement the same feature set...?

But anyway, those features are not always required but useful from time to time and any complete serialization format ought to include them in some way. Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property. Tag facilities are well thought out in my opinion, while specific tags are less so but implementations can choose to ignore them anyway---thankfully after the aforementioned revisions.

> The story is that Carsten Bormann wanted to create an IETF standardized MP version, the creators asked him not to (after he acted in pretty bad faith), he forked off a version, added the aforementioned ill-advised tweaks, named it after himself, and submitted it anyway.

To me it looks more like Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset in spite of huge needs because he somehow believes that standardization ruins simplicity. I don't really agree---simple but correct standard is possible, albeit with efforts. So people did their own standardization including CommonMark, which subtly differ from each other, but any further efforts would be inadvertently blocked by Gruber. MessagePack seems no different to me.

elcritch

> To me it looks more like Markdown vs. CommonMark disputes; John Gruber repeatedly refused to standardize Markdown or even any subset in spite of huge needs because he somehow believes that standardization ruins simplicity. I don't really agree---simple but correct standard is possible, albeit with efforts.

Right, that was my take after reading about it for a while. The way MessagePack and CBOR frame the problem is fairly different, with CBOR intentionally opting for an open tagging system.

Plus it feels a bit childish bringing up the circumstances of the fork (correct or not) when they've clearly diverged a bit in purpose and scope. The Markdown vs CommonMark is an apt comparison.

I've used both and both work very well. They're both stable, and can be parsed into native objects at a speed nearing that of memory copy with the right implementations.

camgunz

> I'm not sure but your wording suggests that CBOR is equally unworkable as MP because they implement the same feature set...?

That's fair; I've been a little confusing when I say things like "CBOR is MessagePack". To be clear I mean that CBOR's format is fundamentally MessagePack's, and my issues are with the stuff added beyond that.

> But anyway, those features are not always required but useful from time to time and any complete serialization format ought to include them in some way.

Totally! I think MP's extension types (CBOR's "tags") are pretty perfect for this. I mean, bignums or datetimes or whatever are often useful, and supporting extension types/tags in an implementation is really straightforward. I just don't think stuff like this belongs in a data representation format. There's a reason JSON, gRPC, Cap'n Proto, Thrift, etc. don't even support datetimes.

> Canonicalization for example is an absolute requirement for cryptographic applications; you know JWT got so cursed due to JSON's lack of this property.

This is the example I always have in my head too. But canonicalization puts some significant requirements on a serializer. Like, when do you enable canonicalization? CBOR limits the feature set when canonicalizing, so you can do it up front and error if someone tries to add an indefinite-length type, or you can do it at the end and error then, and this by itself is a big question. You also have to recursively descend through any potentially nested type and canonicalize it. What about duplicate keys? CBOR's description on how to handle them is pretty hands off [0], and canonicalization is silent on it [1].

But alright, you can make a reasonable library even aside from all this stuff. But are you really just trusting that things are canonicalized receiver side? Definitely not, so you do a lot of validation on your own which pretty much obviates any advantage you might get. JWT is a great use-case of people assuming the JSON was well-formed: canonicalized or not, you gotta validate. You're a lot better off just defining the format for JWT and validating receiver side; canonicalization is basically just extra work.
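To make the serializer-side requirements concrete, here is a rough sketch of deterministic encoding as I understand RFC 8949's core rules: shortest-form heads, definite lengths only, and map keys sorted by the bytes of their encoded form. `head` and `encode` are hypothetical helpers, only a few types are covered, and since a Python `dict` cannot hold duplicate keys, that question is sidestepped rather than answered.

```python
def head(major: int, arg: int) -> bytes:
    """Encode a CBOR head using the shortest argument width that fits."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode(obj) -> bytes:
    """Deterministically encode ints, bytes, str, lists and dicts (sketch only)."""
    if isinstance(obj, bool):
        raise TypeError("bool is left out of this sketch")
    if isinstance(obj, int):
        return head(0, obj) if obj >= 0 else head(1, -1 - obj)
    if isinstance(obj, bytes):
        return head(2, len(obj)) + obj
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return head(3, len(raw)) + raw
    if isinstance(obj, (list, tuple)):
        return head(4, len(obj)) + b"".join(encode(item) for item in obj)
    if isinstance(obj, dict):
        # Recursively encode first, then sort pairs by the encoded key bytes.
        pairs = sorted(((encode(k), encode(v)) for k, v in obj.items()),
                       key=lambda pair: pair[0])
        return head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"type {type(obj).__name__} not supported in this sketch")
```

Two maps with the same entries in a different insertion order produce identical bytes. A receiver still has to validate what it gets, as argued above.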

> To me it looks more like Markdown vs. CommonMark disputes

There was some of this because of the bytes vs. strings debate. Basically people were like, "hey wait, when I deserialize in a dynamic language that assumes strings are UTF-8, I get raw byte strings? I don't like that", but on the other hand Treasure Data (MP creators) had lots of data already stored in the existing format so they needed (well, wanted anyway) a solution that was backwards compatible, plus you want to consider languages that don't really know about things like UTF-8 or use something else internally (C/C++, Java, Python for a while). That's where MPv5 came from, and the solution is really elegant. If CBOR was MPv4 + strings I'd 100% agree with you, but it's a kitchen sink of stuff that's pretty poorly thought out. You can see this in the diversity of support in CBOR implementations. I'm not an expert so LMK if you know differently, but for example the "best" Rust lib for this doesn't support canonicalization [2]. Go's is really comprehensive [3] but the lengths it has to go to (internal buffer pools, etc) are pretty bananas and beyond what you'd expect for a data representation format, plus it has knobs like disabling indefinite-length encodings, limiting the sizes of them, limiting the stack depth for nesting, and so on, again really easy to get into trouble here.

[0]: https://datatracker.ietf.org/doc/html/rfc8949#name-specifyin...

[1]: https://datatracker.ietf.org/doc/html/rfc8949#name-serializa...

[2]: https://github.com/enarx/ciborium/issues/144

[3]: https://github.com/fxamacker/cbor

magicalhippo

> Yeah they fixed that, but there's other parts of the spec that are basically unworkable

Yeah it just made me chuckle cause it was such an obvious oversight and a fun way of pointing it out. That said I totally get that writing specs is hard, so not dissing the authors as such.

> There's no reason an MP implementation has to be slower than a CBOR implementation.

Yeah that also struck me. Like ok that CBOR library might be faster than that MP library, but could be either is just missing some optimizations. And it didn't look like there were orders-of-magnitude differences in either case.

Anyway I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, ie couldn't find a library suitably small, either compiled size or memory requirements or both. So I ended up with JSON for those due to that. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).

camgunz

> That said I totally get that writing specs is hard, so not dissing the authors as such.

Oh definitely. Yeah maybe I come off as anti-spec or something, but in this case I just think MP was really well thought out, and then Bormann hung a bunch of stuff on it that really wasn't, and I'm salty haha.

> Anyway I've only looked at CBOR and MessagePack when I dabbled with some microcontroller projects. I found both to be too big, ie couldn't find a library suitably small, either compiled size or memory requirements or both. So I ended up with JSON for those due to that. Using a SAX-like parser I could avoid dynamic allocations entirely (or close enough).

Whaa? I wrote an MP implementation specifically for this use case: https://github.com/camgunz/cmp. JSON parsing terrifies me; there was some table of tons of JSON (de)serializers with all their weirdo bugs that I never would've thought of. There are probably pretty good test suites now though? I've never looked.

Joker_vD

> parts of the spec that are basically unworkable like indefinite length values,

Is this really a problem in practice? Say, an HTTP/1.1 message also may have the body of indefinite length, and it usually works just fine.

camgunz

No in practice people just ignore the spec, but that's not really what you're hoping for when writing one.

maciejw

CBOR started as a complementary project to previous-decade IoT (Internet of Things) and WSN (Wireless Sensor Networks) initiatives. It was designed together with 6LoWPAN, CoAP, RPL and other standards. The main improvement over MessagePack was discriminating between byte strings and text strings - an important use case for firmware updates etc. The reasoning is probably available in the IETF mailing list archive somewhere.

All these standards were designed as research and seem rather slow to gain general popularity (6LoWPAN is used by Thread, but its uptake is also quite slow - e.g. Nanoleaf announced dropping support for it).

I would say if CBOR fits your purpose it's a good pick, and you shouldn't be worried by it being "not cool". Design by committee is how IETF works, and I wouldn't call it a weakness, although in DOGE times it might sound bloated and outdated.

lifthrasiir

To be fair, CBOR proper is amazingly well designed given its constraints and design-by-committee nature. It is not even hard to remember the whole specification in your head due to the regular design. Unfortunately though I can't say that for the rest of the CBOR ecosystem; many related specs do show varying levels of bloat. I recently heavily criticized the packed CBOR draft because I couldn't make any sense out of it [1], and Bormann seemed to have clearly missed most of my points.

[1] https://mailarchive.ietf.org/arch/msg/cbor/qdMZwu-CxHT5XP0nj...

camgunz

Disclaimer: I wrote and maintain a MessagePack implementation.

To be uncharitable, that's probably because CBOR's initial design was lifted from MP, and everything Bormann added to it was pretty bad. This snippet from your great post captures it pretty well I think:

---

CBOR records the number of nested items and thus has to maintain a stack to skip to a particular nested item.

Alternatively, we can define the "processability" to only include a particular set of operations. The statement 3c implies and 3d seems to confirm that it should include a space-constrained decoding, but even that is quite vague. For example,

- Can we assume we have enough memory to buffer the whole packed CBOR data item? If we can't, how many past bytes can we keep during the decoding process?

lifthrasiir

> To be uncharitable, that's probably because CBOR's initial design was lifted from MP, and everything Bormann added to it was pretty bad.

To be clear, I disagree and believe that Bormann did make a great addition by forking. I can explain this right away by how my point can be fixed entirely within CBOR itself.

CBOR tags are of course not required to be processed at all, but some common tags have functions useful enough that many implementations are expected to implement them. One example is tag 24 "Encoded CBOR data item" (Section 3.4.5.1), which indicates that the following byte string is encoded as CBOR. Since this string carries its size in bytes, every array or map can be embedded in such a tag to ensure easy skippability. [1] This can be made into a formal rule if the supposed processability is highly desirable. And given those tags were defined so early, my design sketch should have been already considered in advance, which is why I believe CBOR is indeed designed better.

[1] Alternatively RFC 8742 CBOR sequences (tag 63) can be used to emulate an array or map of an indeterminate size.
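The tag 24 idea can be sketched in a few lines. The sketch assumes the nested item has already been encoded to bytes; `head`, `read_head`, `wrap_tag24` and `skip_wrapped` are hypothetical helper names, not part of any real library.

```python
from typing import Tuple


def head(major: int, arg: int) -> bytes:
    """Encode a CBOR head using the shortest argument width that fits."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def read_head(buf: bytes, pos: int) -> Tuple[int, int, int]:
    """Return (major type, argument, next position). No bounds checking."""
    initial = buf[pos]
    major, info = initial >> 5, initial & 0x1F
    pos += 1
    if info < 24:
        return major, info, pos
    if info > 27:
        raise ValueError("reserved or indefinite length: not handled in this sketch")
    size = 1 << (info - 24)
    return major, int.from_bytes(buf[pos:pos + size], "big"), pos + size


def wrap_tag24(encoded_item: bytes) -> bytes:
    """Wrap an already-encoded item as tag 24 + byte string, so its size is known."""
    return head(6, 24) + head(2, len(encoded_item)) + encoded_item


def skip_wrapped(buf: bytes, pos: int) -> int:
    """Jump over a tag-24-wrapped item without parsing its contents."""
    major, tag, pos = read_head(buf, pos)
    if (major, tag) != (6, 24):
        raise ValueError("expected tag 24")
    major, length, pos = read_head(buf, pos)
    if major != 2:
        raise ValueError("tag 24 must wrap a byte string")
    return pos + length
```

Skipping now costs two head reads and an addition, regardless of how deeply nested the wrapped item is. The price is a few extra bytes per wrapped item.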

eqvinox

> Design by committee is how IETF works

If the IETF is design by committee, almost any collaboratively developed standard could be called designed by committee. And I'm rather confident in assuming you haven't seen ITU or IEEE in action, or you'd be singing angel's choir praises on the IETF process…

(The IETF really does not have a committee process by any reasonable definition.)

pwdisswordfishz

> Obviously MessagePack is what cool kids would use.

Why is that even a consideration?

> To measure complexity, you can often use documentation length as a proxy. MessagePack is just a markdown file. The CBOR spec has its own gravitational field.

That's a proxy for underspecification, not complexity.

relistan

>> Obviously MessagePack is what cool kids would use.

> Why is that even a consideration?

What’s “cool” is not important but being “cool” _can_ mean there is a larger ecosystem around it. It can also be a proxy for how well your stuff will interact with other systems and how many people you can hire that will just know how it works without having to learn another new thing.

On the other hand, none of that applies to this person’s personal project. And in that context, I think your comment stands.

dathinab

because he seems to be working on "fun hobby projects" not work

bythreads

Just a question: were you trying to optimize for speed or size?

I never tried CBOR when looking for a sub 10ms solution for websocket comms, however my use case was not bound by datasize but entirely by speed (network not inet).

However it all came down to a surprising realisation: "compression on both ends is the primary performance culprit"

Optimizing the hell out of the protocol over websockets got me to a fairly ok response time, just using string json and 0 compression blew it out of the water.

So the result was that data load was faster and easier to debug going with strings of JSON vs any other optimization (message sizes were in the 10-50 MB realm).

The amount of shoddy ws server implementations and gzip operations in the communications pipeline is mindblowing - would be interested in hearing how just pure JSON and zero compression/binary transforms performed :)

lifthrasiir

Virtually every use case of zlib (can be used to implement gzip) should be replaced with zlib-ng, in fact. The stock zlib is too slow for modern computers. If you have the right workload---no streaming, fits in memory etc.---, then libdeflate is even faster. The compression can't be a bottleneck when you've got a correct library.

nicoburns

zlib-rs (the Rust port, which exposes a zlib-compatible API) is now faster in most cases.

masklinn

> The compression can't be a bottleneck when you've got a correct library.

It absolutely can tho. You’re not going to do memory compression using zlib regardless of its flavour.

lifthrasiir

In this context, of course. It is not a general statement ;-)

dathinab

I think one important thing to realize is that using CBOR or MessagePack does not involve compression (except if you add it as another layer, in the same way you do for JSON).

CBOR and MessagePack are more compact, but they do not achieve this by compression but instead by adding less noise in between your data when placing it on the wire.

E.g. instead of (in JSON) outputting a ", then going through every UTF-8 code point, checking whether it needs escaping and escaping it, and then placing another ", they place a type tag plus a length hint and then just memcpy the UTF-8 to the wire (assuming they can rely on the input being valid UTF-8).

The only thing which goes a bit in the direction of compression is that you can encode an integer as a tiny, short or long field. But then that is still way faster than converting it to its decimal US-ASCII representation...
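A small sketch of the difference, using Python's standard `json` module on one side and hand-rolled CBOR heads on the other. `head`, `cbor_text` and `cbor_uint` are hypothetical helpers written for this illustration.

```python
import json


def head(major: int, arg: int) -> bytes:
    """Encode a CBOR head using the shortest argument width that fits."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def cbor_text(s: str) -> bytes:
    """Text string: type + length prefix, then the UTF-8 bytes copied verbatim."""
    raw = s.encode("utf-8")
    return head(3, len(raw)) + raw


def cbor_uint(n: int) -> bytes:
    """Unsigned integer: 1, 2, 3, 5 or 9 bytes depending on magnitude."""
    return head(0, n)


s = 'say "hi"\n'
print(json.dumps(s))          # every character inspected, quotes and newline escaped
print(cbor_text(s).hex())     # one prefix byte, then the unmodified payload
print([len(cbor_uint(n)) for n in (10, 1000, 1_000_000)])
```

The JSON path has to look at every character. The CBOR path only needs the byte length, and the integer "tiers" are just a choice of head width.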

Though that doesn't mean they are guaranteed to always be faster. There are some heavily, absurdly optimized JSON libraries using all kinds of trickery like SIMD, and many "straightforward" implementations of CBOR and MessagePack.

Similarly, your data might already be in JSON, in which case the cost of cross-encoding it is likely to outweigh any gains.

mrkeen

Is your data already JSON at rest? Because encoding/decoding CBOR should easily beat encoding/decoding JSON.

aeontech

Just curious if you considered Cap'n Proto as another option, or if it wasn't in the running?

[1] https://capnproto.org/

lifthrasiir

It covers a different use case. JSON, MessagePack and CBOR fall into the "schemaless" formats, which mandate a common but useful enough data model (CBOR is novel in that this data model can be somehow extensible too). Cap'n Proto and Protobuf fall into the "schematic" or "schemaful" formats, where you always need the correct schema to encode and decode the thing, but which are possibly more efficient in encoded size and general performance.

aeontech

Thanks for clarifying! I thought that Cap'n Proto allows for evolving schemas, but I guess it's true that if each of your messages is completely different, it's not going to benefit you as much perhaps.

eidorb

Yubico's python-fido2 library (https://github.com/Yubico/python-fido2) contains a (minimal) CBOR implementation too: https://developers.yubico.com/python-fido2/API_Documentation...

I found it wouldn't encode `None`s, but didn't dig at all, just worked around it.

Star count would place it about midway in the list.

randomtoast

> Everything about CBOR is uncool. It was designed by a committee. It reeks of RFCs. Acronyms are lame. Saying "SEE-BORE" is like licking a nickel. One of the authors is "Carsten Bormann", which makes the name feel masturbatory.

Carsten Bormann was my professor for Rechnernetze (computer networks). He is also one of the authors of GNU Screen. I mentioned to him that I use tmux, and he asked what was wrong with screen :). His wife is also in the same IT department at the university, where she was the dean. She helped me sort out a problem regarding my course selections, a very kind person. I think he is a decent teacher and knowledgeable in his field, but if you look at his work over the past decades, it's evident that he has a tendency to author RFCs that are rarely used.

IshKebab

I think the list of stars is probably not a good representation of popularity. Serde and JSON For Modern C++ have vastly more stars than any of those libraries and they both support CBOR and MessagePack.

I think CBOR is pretty decent though it is fairly inexplicable that a format designed in 2013 uses big endian.

js8

I don't understand why the author doesn't prefer CBOR, isn't doing things according to an RFC standard better? MsgPack and CBOR are pretty much comparable, feature-wise.

Anyway, I work on IBM mainframes, and big endian is so much easier to read in hex. Not sure why anybody would want little endian, honestly.
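For anyone who has not compared the two byte orders in a hex dump, a quick illustration with Python's standard `struct` module (the value `0x01020304` is just an arbitrary example):

```python
import struct

value = 0x01020304

big = struct.pack(">I", value)      # big-endian, the order CBOR and MessagePack use
little = struct.pack("<I", value)   # little-endian, the native order on most machines

print(big.hex())     # bytes appear in the same order as the written number
print(little.hex())  # bytes appear reversed
```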

IshKebab

> big endian is so much easier to read in hex. Not sure why anybody would want little endian, honestly.

Because you don't need to read these files in a hex editor and 99.999% of people aren't working on an IBM mainframe; they're working on a little endian machine.

otabdeveloper4

> MsgPack and CBOR are pretty much comparable, feature-wise.

They're pretty much exactly the same thing. IIRC the difference is that CBOR specifies how to handle custom types slightly more verbosely.

Philpax

serde doesn't support CBOR/MP, implementations of those support serde, and those implementations are listed in the table. You might have a point about JfMC++, though.

IshKebab

Good point and actually the MessagePack Serde library has way more stars than the CBOR one.

surfingdino

I wasn't aware of CBOR as I have yet to come across a project that needed an alternative to MessagePack, or even MessagePack itself (I only considered using it once in the past). However, based on my experience on various projects, if you have to get approvals and buy-ins from architects, legal, and security teams, using something that has an RFC helps you win those battles regardless of the technical merits of an RFC or non-RFC backed project/tool/protocol.

dathinab

TL;DR: CBOR is a bit more complex, but mainly due to additional features (tags, infinite/unknown length types) which if you need them will make using CBOR the simpler choice and which libraries can decide to support only in a "naive" but simple way (ignore most tags, directly collect unknown length types with some hard size limits).

---

One interesting aspect of CBOR vs. MessagePack is simplicity.

But it's not really that much of a win for MessagePack, especially if we ignore "tags".

Like sure, MessagePack splits its type/length markers at clean nibble boundaries, and as such it's easier if you need to read a MessagePack file through a hex editor.

But why would you do so without additional tooling????

Like, if we are realistic, most developers won't even read (non-trivially small) JSON without tooling which displays the data with nice formatting and syntax highlighting. Whether it's in your browser, editor or by piping it through jq on the command line, tooling is often just a click away.

And if you use tooling anyway, it doesn't matter whether type/size hints use a 4/4-bit split or a 3/5-bit split. Implementation-wise, outside of maybe some crazy SIMD tricks or similar, it also doesn't matter (too) much.

And once 4/4-bit and 3/5-bit are seen as similarly complex in practice, the two formats are really similar in complexity.

CBOR is a bit more consistent with its encodings than MessagePack (e.g. MessagePack has "fixint" special cases which don't follow the 4/4-bit split). But it's also lacking boolean and null in its core data model.

CBOR's encoding of true, false and null as "simple values" is ... iffy (though each is still a single byte). It also has both null and undefined for whatever cursed reasons.
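The two ways of splitting the first byte can be sketched as follows. The MessagePack ranges are taken from its format specification as I read it, and both function names are made up for this illustration.

```python
from typing import Optional, Tuple


def cbor_initial_byte(b: int) -> Tuple[int, int]:
    """CBOR: always a 3/5 split into (major type, additional information)."""
    return b >> 5, b & 0x1F


def msgpack_initial_byte(b: int) -> Tuple[str, Optional[int]]:
    """MessagePack: range checks, since the "fix" families use different splits."""
    if b <= 0x7F:
        return "positive fixint", b            # 1 marker bit, 7 value bits
    if b <= 0x8F:
        return "fixmap", b & 0x0F              # 4/4 split
    if b <= 0x9F:
        return "fixarray", b & 0x0F            # 4/4 split
    if b <= 0xBF:
        return "fixstr", b & 0x1F              # 3/5 split
    if b >= 0xE0:
        return "negative fixint", b - 0x100    # 3 marker bits, 5 value bits
    return "other", None                       # 0xc0-0xdf: nil, bools, sized types
```

Neither is hard to implement. CBOR needs a shift and a mask, MessagePack a short chain of comparisons.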

MessagePack has an extension system which associates a custom type [0; 127] with a byte array.

CBOR has a tagging system which associates a custom type [0; 2^64-1] with any kind of CBOR type.
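Side by side, the two mechanisms look roughly like this on the wire. This is a sketch under assumptions: only MessagePack's `ext 8` form is shown, and `msgpack_ext8`, `head` and `cbor_tag` are hypothetical helper names.

```python
def msgpack_ext8(ext_type: int, payload: bytes) -> bytes:
    """MessagePack ext 8: marker 0xc7, 1-byte length, 1-byte type, raw bytes."""
    if not 0 <= ext_type <= 127:
        raise ValueError("application extension types are 0..127")
    if len(payload) > 0xFF:
        raise ValueError("ext 8 holds at most 255 bytes")
    return b"\xc7" + bytes([len(payload), ext_type]) + payload


def head(major: int, arg: int) -> bytes:
    """Encode a CBOR head using the shortest argument width that fits."""
    if arg < 24:
        return bytes([(major << 5) | arg])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if arg < 1 << (8 * size):
            return bytes([(major << 5) | info]) + arg.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def cbor_tag(tag: int, encoded_item: bytes) -> bytes:
    """CBOR tag: a major type 6 head, followed by any already-encoded data item."""
    return head(6, tag) + encoded_item
```

The MessagePack payload is opaque bytes that the application must interpret, while the CBOR tag content is an ordinary data item that a generic decoder can still parse.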

In this aspect MessagePack is simpler but also so restrictive that running into collisions in large interconnected enterprise use cases is not just viable but expected if extension types are widely used (but then for most apps 0..127 is good enough).

On the other hand, if a CBOR library wants to support all kinds of complex custom tag-based extension use cases, it could add quite a bit to the library's complexity. But then ignoring tags when parsing is valid. One benefit of the large tag space is that there are some pretty useful predefined tags, e.g. for a Unix timestamp, for marking that a byte slice is itself valid CBOR, or for bignum encoding.

Lastly, CBOR supports "infinite"/"unknown when encoding starts" length byte strings, UTF-8 strings, lists and maps. That clearly adds complexity to any library fully implementing CBOR, but if you need it, it removes a lot of complexity for you, as you now don't need to come up with some JSONL-ish format for MessagePack (which is trivial but outside of the spec). Most importantly, you can stream the insides of nested items, which is not viable with MessagePack.
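A minimal sketch of what "unknown when encoding starts" buys you. It is restricted to arrays of small unsigned integers to keep it short, and `small_uint`, `definite_array` and `stream_array` are hypothetical names.

```python
from typing import Iterable, Iterator, List


def small_uint(n: int) -> bytes:
    """Encode 0..23 as a single byte (larger values omitted from this sketch)."""
    if not 0 <= n < 24:
        raise ValueError("only 0..23 supported in this sketch")
    return bytes([n])


def definite_array(items: List[int]) -> bytes:
    """Definite length: the count must be known before the first byte is written."""
    if len(items) >= 24:
        raise ValueError("only short arrays supported in this sketch")
    return bytes([0x80 | len(items)]) + b"".join(small_uint(n) for n in items)


def stream_array(items: Iterable[int]) -> Iterator[bytes]:
    """Indefinite length: emit items as they arrive, then a "break" byte."""
    yield b"\x9f"               # array, indefinite length
    for n in items:
        yield small_uint(n)
    yield b"\xff"               # "break" stop code
```

`stream_array` works on any iterator, including one whose length is not known until it is exhausted. The cost is that a decoder has to scan for the break byte instead of reading a count.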
