Parsing Protobuf like never before
91 comments
July 17, 2025 · nemo1618
There are two ways to look at this.
First is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be extremely cursed and you should take a step back and think about your life.
Second is that, if the parsing library for your codec includes a compiler, VM, and PGO, your codec must be wildly popular and adds enormous value.
VikingCoder
If you want to do something several hundreds of billions of times per day, you probably want to do it very efficiently.
benlivengood
Or, you know, several hundreds of billions of times per second...
cryptonector
I'm not sure which part you're objecting to.
If it's compilation at run-time, then I agree: it should be done at build time. But in this case it's really not a big deal.
If you're objecting to needing a compiler, then... you're not even wrong.
pantalaimon
You can also parse Protobuf on the very low end, if you don't need super high throughput.
skybrian
This is excellent: an in-depth description showing how the Go internals make writing fast interpreters difficult, by someone who is far more determined than I ever was to make it fast anyway.
I’ve assumed that writing fast interpreters wasn’t a use case the Go team cared much about, but if it makes protobuf parsing faster, maybe it will get some attention, and some of these low-level tricks will no longer be necessary?
RGBCube
> and some of these low-level tricks will no longer be necessary?
Don't count on it. This language is Golang.
anonymoushn
Hi to my favorite Turkish teenager. You really don't get to generate fast codecs from normal-looking code in any language :(
actionfromafar
Insofar as Wuffs looks normal, Wuffs?
AceJohnny2
Offtopic, but is anyone using CapnProto, the ProtoBuf former maintainer's (kentonv around here) subsequent project?
If so, how does it compare in practice?
(what does Cloudflare Workers use?)
necubi
Cloudflare is almost entirely run on Cap'n Proto, including the entire workers platform
kentonv
The Workers platform uses Cap'n Proto extensively, as one might expect (with me being the author of Cap'n Proto and the lead dev on Workers). Some other parts of Cloudflare use it (including the logging pipeline, which used it before I even joined), but there are also many services using gRPC or just JSON. Each team makes their own decisions.
I have it on my TODO list to write a blog post about Workers' use of Cap'n Proto. By far the biggest wins for us are from the RPC system -- the serialization honestly doesn't matter so much.
That said, the ecosystem around Cap'n Proto is obviously lacking compared to protobuf. For the Cloudflare Workers Runtime team specifically, the fact that we own it and can make any changes we need to balances this out. But I'm hesitant to recommend it to people at other companies, unless you are eager to jump into the source code whenever there's a problem.
anonymoushn
A while ago I talked to some team that was planning a migration to GraphQL, long after this was generally thought to be a bad idea. The lead seemed really attached to the "composable RPCs" aspect of the thing, and at the time it seemed like nothing else offered this. It would be quite cool if capnproto became a more credible option for this sort of situation. At the time users could read about the rpc composition/promise passing/"negative latency" stuff, but it was not quite implemented.
stouset
This makes me really sad. Protobufs are not all that great, but they were there first and “good enough”.
It’s frustrating when we can’t have nice things because a mediocre Google product has sucked all the air out of the room. I’m not only talking about Protobufs here either.
motorest
Thank you for posting here. Always insightful, always a treat.
k_bx
To piggyback on this, is anyone using flatbuffers? They solve the same problem as CapnProto and are the basis of Arrow.
I've used it but too long ago and don't know their current state.
plasticeagle
We use flatbuffers extensively in our message passing framework, and they are extremely fast if you take care with your implementation. They have a few features that make them especially useful for us:
1) The flatbuffer parser can be configured at runtime from a schema file, so our message passing runtime does not need to know about any schemas at build time. It reads the schema files at startup, and is henceforth capable of translating messages to and from JSON when required. It's also possible to determine that two schemas will be compatible at runtime.
2) Messages can be re-used. For our high-rate messages, we build a message and then modify it to send again, rather than building it from scratch each time.
3) Zero decode overhead - there is often no need to deserialise messages - so we can avoid copying the data therein.
The flatbuffer compiler is also extremely fast, which is nice at build time.
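For a sense of what (3) means in Go, here is a minimal sketch using the raw table API from github.com/google/flatbuffers/go rather than flatc-generated accessors (the vtable slot 4 is just the first field's slot; real code would use the generated getters):

    var tab flatbuffers.Table
    tab.Bytes = buf                        // buf is the received flatbuffer, used as-is
    tab.Pos = flatbuffers.GetUOffsetT(buf) // locate the root table
    if o := flatbuffers.UOffsetT(tab.Offset(4)); o != 0 {
        name := tab.ByteVector(o + tab.Pos) // slice into buf: no copy, no decode step
        fmt.Printf("%s\n", name)
    }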
elcritch
Used them before and they're OK. They were missing some important features like sum types. The code output was a pain, but targeted a few languages. My suspicion is that Cap'n Proto would be technically superior but less documented than flatbuffers.
However, my preference is to use something like MsgPack or CBOR with compile-time reflection to do serde directly into types. You can design the types to require minimal allocations and to parse messages within a few dozen nanoseconds. That means doing things like using static char arrays for strings. It wastes a bit of space, but it can be very fast. Also, the space used by 64-bit pointers could instead hold a lot of shorter text fields.
That said, I should wrap or port this UPB to Nim. It'd be a nice alternative if it's really as fast as claimed. Though how it handles allocating would be the clincher.
discreteevent
> They were missing some important features like sum types.
They support unions now. I haven't had any trouble representing anything including recursive structures.
> The code output was a pain, but targeted a few languages.
You do need a day or so to get used to the code. It's a pointer-based system with a 'flat memory'. Order of operations matters: if you have a parent and a child, you need to write the child first, obtain a pointer to it, and only then create/write the parent containing the pointer to the child. Once you get used to this it goes quickly. The advantage is that you don't have to create an intermediate copy in memory when reading (like protobuf), and you can read any particular part by traversing pointers without having to load the rest of the data into memory.
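For a hypothetical schema `table Child { name:string; } table Parent { child:Child; }`, the flatc-generated Go code makes that order concrete (ChildStart and friends are the generated helpers; a sketch, not verbatim output):

    b := flatbuffers.NewBuilder(1024)
    name := b.CreateString("leaf") // deepest data first
    ChildStart(b)
    ChildAddName(b, name)
    child := ChildEnd(b)           // offset ("pointer") to the finished child
    ParentStart(b)
    ParentAddChild(b, child)       // parent written last, holding the child's offset
    b.Finish(ParentEnd(b))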
pantalaimon
To piggyback on this, is anyone using CBOR? It solves the same problem as Protobuf, but is more like JSON in that a schema is not required.
vamega
Amazon uses CBOR extensively. Most AWS services by now should support being called using CBOR. The protocol they're using is publicly documented at: https://smithy.io/2.0/additional-specs/protocols/smithy-rpc-...
The services serve both CBOR and other protocols simultaneously.
crabbone
Neither of them is well designed or well thought through. They address some cosmetic issues of Protobuf, but don't really deal with the major ones. So you could say they are slightly better, but the original was bad enough that being slightly better still leaves both disqualified.
vvanders
Flatbuffers lets you directly mmap from disk; that trick alone makes it really good for use cases that can take advantage of it (fast access to read-only data). If you're clever enough to tune the ordering of fields, you can give it good cache locality and really make it fly.
We used to store animation data in mmaped flatbuffers at a previous gig and it worked really well. Kernel would happily prefetch on access and page out under pressure, we could have 10s of MBs of animation data and only pay a couple hundred kb based on access patterns.
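The trick itself is just a handful of lines on Linux/macOS; a sketch, where GetRootAsAnimation stands in for whatever flatc-generated root accessor your schema produces:

    f, err := os.Open("animations.fb")
    if err != nil {
        log.Fatal(err)
    }
    fi, err := f.Stat()
    if err != nil {
        log.Fatal(err)
    }
    data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
        syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        log.Fatal(err)
    }
    anim := GetRootAsAnimation(data, 0) // accessors chase offsets in place;
    _ = anim                            // pages fault in only as fields are touched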
karel-3d
Yes, I am. It's fast and great, but the UX in Go is a bit annoying, since you need to keep checking for errors on literally every set (to catch possible arena errors). So your code has even more `if err != nil {return nil, err}` than usual Go code.
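The pattern in question, as a sketch against capnproto.org/go/capnp/v3 with a hypothetical generated Person type:

    msg, seg, err := capnp.NewMessage(capnp.SingleSegment(nil))
    if err != nil {
        return nil, err
    }
    p, err := NewRootPerson(seg) // hypothetical generated constructor
    if err != nil {
        return nil, err
    }
    if err := p.SetName("alice"); err != nil { // every setter can fail on arena allocation
        return nil, err
    }
    return msg.Marshal()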
UncleEntity
> In other words, a UPB parser is actually configuration for an interpreter VM, which executes Protobuf messages as its bytecode.
This is kind of confusing, the VM is runtime crafted to parse a single protobuf message type and only this message type? The Second Futamura Projection, I suppose...
Or the VM is designed specifically around generic protobuf messages and it can parse any random message but only if it's a protobuf message?
I've been working on the design of a similar system, but for general binary parsing (think bison/yacc for binary data), and hadn't even considered data over a specialized VM vs. bytecode+data over a general VM. Honestly, since it's designed around 'maximum laziness' (it just parses/verifies and creates metadata over the input, so you only pay for decoding bytes you actually use) and I/O overhead is way greater than the VM dispatching, trying this out is probably one of those "premature optimization is the root of all evil" cases, but it's intriguing nonetheless.
haberman
I think I can shed some light on this, as the creator and lead of upb.
Calling a Protobuf Parser an "interpreter VM" is a little bit of rhetorical flourish. It comes from the observation that there are some deep structural similarities between the two, which I first observed in an article a few years back: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...
> It may seem odd to compare interpreter loops to protobuf parsers, but the nature of the protobuf wire format makes them more similar than you might expect. The protobuf wire format is a series of tag/value pairs, where the tag contains a field number and wire type. This tag acts similarly to an interpreter opcode: it tells us what operation we need to perform to parse this field’s data. Like interpreter opcodes, protobuf field numbers can come in any order, so we have to be prepared to dispatch to any part of the code at any time.
This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
The tricky part is that the set of "case" labels for a Protobuf parser is message-specific and defined by the fields in the schema. How do we accommodate that?
The traditional answer was to generate a function per message and use the schema's field numbers as the case labels. You can see an example of that here (in C++): https://github.com/protocolbuffers/protobuf/blob/f763a2a8608...
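In Go terms, that generated shape is roughly the following hand-written sketch, using the protowire helpers from google.golang.org/protobuf and a hypothetical two-field message (message Example { int64 id = 1; string name = 2; }):

    func parseExample(buf []byte) (id int64, name string, err error) {
        for len(buf) > 0 { // the interpreter loop
            num, typ, n := protowire.ConsumeTag(buf)
            if n < 0 {
                return 0, "", protowire.ParseError(n)
            }
            buf = buf[n:]
            switch num { // field-number dispatch, like an interpreter opcode
            case 1:
                v, n := protowire.ConsumeVarint(buf)
                if n < 0 {
                    return 0, "", protowire.ParseError(n)
                }
                id, buf = int64(v), buf[n:]
            case 2:
                s, n := protowire.ConsumeBytes(buf)
                if n < 0 {
                    return 0, "", protowire.ParseError(n)
                }
                name, buf = string(s), buf[n:]
            default: // unknown field: skip it, as the wire format requires
                n := protowire.ConsumeFieldValue(num, typ, buf)
                if n < 0 {
                    return 0, "", protowire.ParseError(n)
                }
                buf = buf[n:]
            }
        }
        return id, name, nil
    }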
More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
The trick then becomes how to make this table-driven dispatch as fast as possible, to simulate what the switch() statement would have done. That question is what I cover at length in the article mentioned above.
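A toy illustration of the difference (nothing like upb's actual tables, which are flat arrays indexed by the tag bytes themselves): the per-field logic moves out of a generated switch and into data handed to one generic parser function:

    type fieldOp struct {
        parse func(buf []byte) (rest []byte, err error)
    }

    // One generic parser for every message type, driven by a per-message table.
    func parseWithTable(table map[protowire.Number]fieldOp, buf []byte) error {
        for len(buf) > 0 {
            num, typ, n := protowire.ConsumeTag(buf)
            if n < 0 {
                return protowire.ParseError(n)
            }
            buf = buf[n:]
            op, ok := table[num]
            if !ok { // unknown field: skip
                if n = protowire.ConsumeFieldValue(num, typ, buf); n < 0 {
                    return protowire.ParseError(n)
                }
                buf = buf[n:]
                continue
            }
            var err error
            if buf, err = op.parse(buf); err != nil {
                return err
            }
        }
        return nil
    }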
anonymoushn
Really great. I wonder, for the "types encoded as code" approach, is there any benefit to fast paths for data with fields in ascending order? For some json parsers with types encoded as code I have observed some speedup from either hard-coding a known key order or assuming keys in some order and providing a fallback in case an unexpected key is encountered. For users who are stuck with protobuf forever because of various services using it and various data being encoded this way, the historical data could plausibly be canonicalized and written back in large chunks when it is accessed, so that one need not pay the entire cost of canonicalizing it all at once. But of course the icache concerns are still just as bad.
haberman
This kind of "expected next field" optimization has a long history in protobuf, but results are mixed.
The generated code in C++ used to check for the expected next field before falling back to the switch() (example here: https://github.com/protocolbuffers/protobuf/blob/460e7dd7c47...) but this was removed in 2016 when load tests found that it hurt more than it helped.
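The shape of that fast path, loosely transliterated from the old generated C++ into Go (a sketch; expectedNext is whatever field the generator guessed would follow):

    if v, n := protowire.ConsumeVarint(buf); n > 0 &&
        protowire.Number(v>>3) == expectedNext {
        buf = buf[n:]
        // fast path: parse expectedNext's value directly, no switch dispatch
    } else {
        // fall back to the general while/switch loop
    }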
One tricky part of making this optimization work is making a good guess about what the next field should be. Miguel's article alludes to this:
> Each field specifies which fields to try next. This allows the compiler to perform field scheduling, by carefully deciding which order to try fields in based both on their declaration order and a rough estimation of their “hotness”, much like branch scheduling happens in a program compiler. This avoids almost all of the work of looking up the next field in the common case, because we have already pre-loaded the correct guess.
> I haven’t managed to nail down a good algorithm for this yet, but I am working on a system for implementing a type of “branch prediction” for PGO, that tries to provide better predictions for the next fields to try based on what has been seen before.
One delightful thing about the tail-call parser design is that the CPU's branch predictor effectively takes over the job of guessing. With a tail call parser, the dispatch sequence ends up looking like this:
cmp QWORD PTR [rdi+0x8],rsi # Bounds check
jbe .fallback
movzx r10d,WORD PTR [rsi] # Load two bytes of tag
mov eax,r10d
and eax,0xf8
mov r9,QWORD PTR [rcx+rax*2] # Load table data
xor r9,r10
mov rax,QWORD PTR [rcx+rax*2+0x8] # Load field parser function
jmp rax # Tail call to field parser
That "jmp rax" instruction is an indirect branch that can be predicted by the CPU. The CPU effectively guesses for us!And unlike any kind of guess we might have performed ahead-of-time, the branch predictor will constantly adapt to whatever patterns it is seeing in the data. This is good, because statically guessing ahead-of-time is hard.
benreesman
Regular expression engines on this model are often called VMs; certainly that's the terminology and layout I used. A big switch/computed-goto loop thing with state codes? Yeah, that's a VM.
I think it's a very illuminating way to describe it. Nicely done with the implementation as well.
pjc50
> This means that the overall structure of a protobuf parser is conceptually a while() loop surrounding a switch() statement, just like a VM interpreter.
This is a very insightful design. Encoding the parser in a table is an old technique, it's what YACC uses. There's a tradeoff between using the CPU's stack+registers versus having your own stack+state in this kind of work, and people have found situations where a small VM is faster because it gets to stay in the instruction cache and benefit from branch prediction, while the less-predictable data stays in the d-cache.
mananaysiempre
> More recently, we've moved towards making Protobuf parsing more data-driven, where each field's schema is compiled into data that is passed as an argument to a generic Protobuf parser function. We call this "table-driven parsing", and from my read of the blog article, I believe this is what Miguel is doing with hyperpb.
Everything old is new again, I guess—one of the more advertised changes in Microsoft COM as it was maturing (circa 1995) was that you could use data-driven marshalling with “NDR format strings” (bytecode, essentially[1,2]) instead of generating C code. Shortly after there was typelib marshalling (format-compatible but limited), and much later also WinRT’s metadata-driven marshalling (widely advertised but basically completely undocumented).
Fabrice Bellard’s nonfree ASN.1 compiler[3] is also notable for converting schemas into data rather than code, unlike most of its open-source alternatives.
I still can’t help wondering what it is, really, that makes the bytecode-VM approach advantageous. In 1995, the answer seems simple: the inevitable binary bloat was unacceptable for a system that needs to fit into single-digit megabytes of RAM; and of course bytecode as a way to reduce code footprint has plenty of prior art (microcomputer BASICs, SWEET16, Forth, P-code, even as far back as the AGC).
Nowadays, the answer doesn’t seem as straightforward. Sure, the footprint is enormous if you’re Google, but you’re not Google (you’re probably not even Cloudflare), and besides, I hope you’re not Google and can design stuff that’s actually properly adapted to static linking. Sure, the I$ pressure is significant (thinking back to Mike Pall’s explanation[4] why he didn’t find a baseline JIT to be useful), but the bytecode interpreter isn’t going to be a speed demon either.
I don’t get it. I can believe it’s true, but I don’t really feel that I get it.
[1] https://learn.microsoft.com/en-us/windows/win32/rpc/rpc-ndr-...
[2] https://gary-nebbett.blogspot.com/2020/04/rpc-ndr-engine-dce...
anonymoushn
In some workloads, you can benefit from paying the latency cost of data dependencies instead of the latency cost of conditional control flow, but there's no general rule here. It's best to try out several options on the actual task and the actual production data distribution.
tonyarkles
Based on what I know about the structure of protobufs internally and without having looked deep into what UPB is doing... I'd guess it could probably be a stack machine that treats (byte)+ as opcodes. Most of the time I'd think of it as parser -> AST -> bytecode, but I think the "grammar" of protobufs would allow your parser to essentially emit terminals as they're parsed straight to the VM as instructions to execute.
UncleEntity
In the couple days since I posted my confusion (threads merged or something) I consulted the daffy robots and figured out how it all works. Also had them come up with a design document for "a specialized compiler and virtual machine architecture for parsing Protocol Buffer messages that achieves significant performance improvements through a novel compilation pipeline combining protobuf-specific AST optimization, continuation-passing style transformations, and tail call interpreter execution."
Interesting times we live in...
alexozer
So am I identifying the bottlenecks that motivate this design correctly?
1. Go FFI is slow
2. Per-proto generated code specialization is slow, because of icache pressure
I know there's more to the optimization story here, but I guess these are the primary motivations for the VM over just better code generation or implementing a parser in non-Go?
hinkley
I know that Java resisted improving their FFI for years because they preferred that the JIT get the extra resources, and that customers not bail out of Java every time they couldn't figure out how to make it faster. There's a case I recall from when HotSpot was still young, where the Java GUI team moved part of the graphics pipeline to the FFI in one release, HotSpot got faster in the next, and then they rolled back the changes because it was now faster without the FFI.
But eventually your compiler is good enough that the FFI is now your bottleneck, and you need to do something.
unbrice
3. The use case is dynamic schemas and access is through the reflection API. Thus PGO has to be done at runtime...
johnisgood
I keep hearing that Go's C FFI is slow, why is that? How much slower is it in comparison to other languages?
pornel
Go's goroutines aren't plain C threads (blocking syscalls are magically made async), and Go's stack isn't a normal C stack (it's tiny and grown dynamically).
A C function won't know how to behave in Go's runtime environment, so to call a C function, Go needs to make itself look more like a C program, call the C function, and then restore its magic state.
Other languages like C++, Rust, and Swift are similar enough to C that they can just call C functions directly. CPython is a C program, so it can too. Golang was brave enough to do fundamental things its own way, which isn't quite C-compatible.
9rx
> CPython is a C program
Go (gc) was also a C program originally. It still had the same overhead back then as it does now. The implementation language is immaterial. How things are implemented is what is significant. Go (tinygo), being a different implementation, can call C functions as fast as C can.
> ...so it can too.
In my experience, the C FFI overhead in CPython is significantly higher than Go (gc). How are you managing to avoid it?
hinkley
I wonder if they should be using something like libuv to handle this. Instead of flipping state back and forth, create a playground for the C code that looks more like what it expects.
johnisgood
What about languages like Java, or other popular languages with GC?
3836293648
Go's threading model involves a lot of tiny (but growable) stacks and calling C functions almost immediately stack overflows.
Calling C safely is then slow because you have to allocate a larger stack, copy data around and mess with the GC.
9rx
> How much slower is it in comparison to other languages?
It's about the same as most other languages that aren't specifically optimized for C calling. Considerably faster than Python.
Which is funny, as everyone on HN loves to extol the virtues of Python being a "C DSL" and never thinks twice about its overhead, but as soon as the word Go is mentioned, it's like your computer is going to catch fire if you even try.
Emotion-driven development is a bizarre world.
malkia
I've asked ChatGPT to summarize (granted my prompt might not be ideal), but some points to note, here just first in details others in the link at the bottom:
Calling C from Go (or vice versa) often requires switching from Go's lightweight goroutine model to a full OS thread model because:
- Go's scheduler manages goroutines on M:N threads, but C doesn't cooperate with Go's scheduler.
- If C code blocks (e.g., on I/O or mutex), Go must assume the worst and parks the thread, spawning another to keep Go alive.
- Cost: This means entering/exiting cgo is significantly more expensive than a normal Go call. There's a syscall-like overhead.
... This was only the first issue; it follows with "Go runtime can't see inside C to know whether it is allocating, blocking, spinning, etc.", then "Stack switching", "Thread Affinity and TLS", "Debug/Profiling support overhead", and "Memory Ownership and GC barriers".
All here: https://chatgpt.com/share/688172c3-9fa4-800a-9b8f-e1252b57d0...
johnisgood
Just to roll with your way: https://chatgpt.com/share/688177c9-ebc0-8011-88cc-9514d8e167...
Please do not take the numbers below at face value. I still expect an actual reply to my initial comment.
Per-call overhead:
C (baseline) - ~30 ns
Rust (unsafe) - ~30 ns
C# (P/Invoke) - ~30-50 ns
LuaJIT - ~30-50 ns
Go (cgo) - ~40-60 ns
Java (22, FFM) - ~40-70 ns
Java (JNI) - ~300-1000 ns
Perl (XS) - ~500-1000 ns
Python (ctypes) - ~10,000-30,000 ns
Common Lisp (SBCL) - ~500-1500 ns
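(If you want your own numbers rather than ChatGPT's, a minimal cgo micro-benchmark is only a few lines. cgo isn't allowed in _test.go files, so the wrapper lives in the package proper; identity is a stand-in C function.)

    // ffi.go
    package ffi

    /*
    static int identity(int x) { return x; }
    */
    import "C"

    func Identity(x int) int { return int(C.identity(C.int(x))) }

    // ffi_test.go
    package ffi

    import "testing"

    func BenchmarkCgoCall(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = Identity(i) // each iteration pays the full cgo boundary cost
        }
    }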
Seems like Go is still fast enough compared to other garbage-collected languages, so I am not sure the criticism is fair to Go.
adsharma
I see some discussion here about protobufs being widely used and preventing innovation because it's good enough and comes from Google.
I see a second effect that's probably more regressive: using protobufs (designed for RPC) as a database schema definition language or a disk data structure description language.
Much prefer something like TypeSpec to describe types, and then derive the RPC and disk schema languages from that programmatically.
My first attempt was to use the flatbuffer schema language as a starting point. But being attached to a serialization format can be a net negative. Did not find traction.
Motivation: can not stand field numbers in a database schema when other migration mechanisms exist.
ryukoposting
> Every type contributes to a cost on the instruction cache, meaning that if your program parses a lot of different types, it will essentially flush your instruction cache any time you enter a parser. Worse still, if a parse involves enough types, the parser itself will hit instruction decoding throughput issues.
Interesting. This makes me wonder how nanopb would benchmark against the parsers shown in the graphs. nanopb's whole schtick is that it's pure C99 and it doesn't generate separate parsing functions for each message. nanopb is what a lot of embedded code uses, due to the small footprint.
dang
Related ongoing thread:
Hyperpb: Faster dynamic Protobuf parsing - https://news.ycombinator.com/item?id=44661785
cryptonector
Does hyperpb compile PB types into something like a byte-compiled or AST description of them that it then interprets at run-time?
EDIT: Yes https://github.com/bufbuild/hyperpb-go/blob/main/internal/td...
This is very cool. The ASN.1 compiler I [sort of] maintain has an option to generate C code or a "template", where the "template" is just an AST that gets interpreted at run-time. I did not come up with that idea, but the people who did chose it because the AST-interpretation approach is faster than generating lots of object code -- the templates and the template interpreter are smaller than the object code for the alternative, and cache effects add up.
mdhb
I’d really love to see more work bringing the best parts of protobuf to a standardised serialization format like CBOR.
I’d make the same argument for gRPC-web to something like WHATWG streams and or WebTransport.
There are a lot of really cool and important learnings in both, but it's also so tied up in weird tooling and assumptions. Let's rebase on IETF and W3C standards.
youngtaff
Would be good to see support for encoding/decoding CBOR exposed as a browser API - they currently use CBOR internally for WebAuthn, so I'd hope it's not too hard.
cyberax
You can easily do this. Protobuf supports pluggable writers, and iterating over a schema is pretty easy. We do it for the JSONB.
I'm not sure the purpose, though. Protobuf is great for its inflexible schema, and CBOR is great for its flexible data representation.
A separate CBOR schema would be a better fit, there's CDDL but it has no traction.
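For what it's worth, the schema-iteration half really is a few lines with Go's protoreflect API (a sketch; a real CBOR writer would switch on fd.Kind() and emit bytes instead of printing):

    m := msg.ProtoReflect() // msg is any generated proto.Message
    m.Range(func(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
        fmt.Println(fd.Number(), fd.Name(), v.Interface())
        return true // continue over all populated fields
    })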
spenczar5
JITting bytecode VMs really seem fantastic at this. A similar trick works in Avro in Python (https://journal.spencerwnelson.com/entries/avro.html).
I suppose the insight is that schema documents are programs, just in a weird language, and compiled programs are fast.
tschellenbach
Last time I benchmarked msgpack and protobuf against each other, it was near flat for my use case. JSON was 2-3x slower, but msgpack and protobuf were near equal. Might be different now after this release, exciting :)
Analemma_
Keep in mind that some of the performance bottlenecks the author is talking about and optimizing for show up in large-scale uses and possibly not in benchmarks. In particular, the "parsing too many different types blows away your instruction cache" issue will only show up if you are actually parsing lots of types, otherwise UPB is not necessary.
See also https://buf.build/blog/hyperpb (via https://news.ycombinator.com/item?id=44661785)