I wasted weeks hand optimizing assembly because I benchmarked on random data
163 comments
· July 21, 2025
tasn
This is such a great anecdote, thanks for sharing!
Somewhat relatedly, I still remember the first time I heard about profile-guided optimization, which is essentially the same but for all of your code at once (well, same idea; not sure it's aggressive enough to reach the same result as the anecdote you shared).
scottlamb
> Somewhat relatedly, I still remember the first time I heard about profile-guided optimization, which is essentially the same but for all of your code at once (well, same idea; not sure it's aggressive enough to reach the same result as the anecdote you shared).
Exactly: profile-guided optimization is pretty awesome if you have the right infrastructure. You can get maybe 15% speedup on your program without going deep into all the code. But it has limits. IIUC, it gathers information about which code is hot, including which branches are taken, and enables certain compiler optimizations based on that. [1] But it's not going to make huge algorithm changes, it's not going to change data structure definitions [2], and it's not going to prepare good microbenchmark input for you to compare your new algorithm against the current one.
Also, this article is about Java, and profile-guided optimization is something Java just has baked into its JIT already.
[1] I don't know exactly which, and they likely vary by compiler version. But I think a variety of stuff to better use cache: breaking code apart into likely-"hot" vs likely-"cold" pages, deciding which loops are worth aligning, and where inlining is worthwhile. Also where branches should be optimized for likely vs unlikely vs unpredictable (conditional instructions). When it's worth autovectorizing. But it's not as good at SIMD as a human who puts two weeks into it, and SIMD usage is constrained by the inability to change data structures.
[2] In theory, in Rust's default #[repr(Rust)] it could reorder a struct's members, but it's not going to, say, change from array-of-struct to struct-of-array format or hashmap to btree.
RedNifre
Okay, but how about checking the 99.9% policy first and if it doesn't match, run Ken's solution?
fifilura
You'd be better off with some stupid code that junior devs (or Grug brained senior devs) could easily understand for the 0.1% cases.
smcin
(In that specific case, with that specific very skewed data distribution, yes. But not necessarily in the general case. So "they" would be better off with that solution, but not a generalized "you").
username135
Then how do they learn
TrapLord_Rhodo
That's what he said they did. That's the joke.
> so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.
But you've now spent all this time developing some beautiful solution for the 0.1% of the actual data flow. Not the best use of dev time, essentially.
lispitillo
If a case doesn't match, then speeding up the remaining 0.1% is not going to change the overall speed much. Hence, a non-optimized algorithm may be enough. But when speed and speed variance are critical, optimization is a must.
hinkley
The devil is in the details. A pair of my coworkers spent almost two years working on our page generation to split it into a static part and a dynamic part. Similar to Amazon’s landing pages: static skeleton, but lots of lazy-loaded blocks. When they finally finished it, the work was mediocre for two reasons. One, they ended up splitting one request into several (but usually two), and they showed average response times dropping 20% but neglected to show the overall request rate also went up by more than 10%. They changed the denominator. Rookie mistake.
Meanwhile, one of my favorite coworkers smelled a caching opportunity in one of our libraries that we use absolutely everywhere, and I discovered a whole can of worms around it. The call was too expensive, which was most of the reason a cache seemed desirable. My team had been practicing Libraries Over Frameworks, and we had been chipping away for years at an NIH framework that a few of them, and a lot of people who had since left, created in-house. At some point we lost sight of how we had sped up the “meat” of workloads at the expense of higher setup time, initializing these library functions over, and over, and over again. Example: the logging code depended on the feature toggle code, which depended on a smaller module that depended on a smaller one still. Most modules depended on the logging and the feature toggle modules. Those leaf node modules were being initialized more than fifty times per request.
Three months later I had legitimately cut page load time by 20% by running down bad code patterns. 1/3 of that gain was two dumb mistakes. In one we were reinitializing a third party library once per instantiation instead of once per process.
So on the day they shipped, they claimed a 20% response-time reduction (which was less than the proposal expected to achieve), but a lot of their upside came from avoiding work I’d already made cheaper. They missed that the last of my changes went out in the same release, so their real contribution was 11%, and because of the extra requests it only resulted in a 3% reduction in cluster size, after three man-years of work. Versus 20% and 7% respectively for my three months of work spread over five.
Always pay attention to the raw numbers and not just the percentages in benchmarks, kids.
snvzz
My take is that code is disposable, and often needs to be written to figure out what is really needed, then disposed of.
Learning this is worth writing the otherwise useless code.
BigJ1211
This is why I strongly adhere to WET¹: you discover so much about the domain the first time you write something. By the time you have it 'working' you realize you should've written vastly different code to solve or cut off edge cases.
That's why these days I just start writing code, get it to 'it works!' level. To then throw it all away and start again. Takes a little more effort, but the payoff is far less problematic code and tech debt.
¹: Write Everything Twice
wizzwizz4
Don't do WET with your entire system, though, or you're likely to end up with the Second System Effect.
5pl1n73r
My reaction to the article title was recalling how old assembly teachers would drill into our heads that we should have proof of 200% gains before implementing an optimization. However, the article shows how sometimes even such planning doesn't work.
Cthulhu_
Or one should spend a little more time up front to fully understand the problem; in both the mentioned case and the OP's, that means using realistic data.
TillE
Writing code is very very often a big part of understanding any new problem. It's just that you should be quite comfortable discarding that code. Refactor mercilessly and all that.
scott_w
Writing code that solves the problem is a huge way of proving you understand the problem.
yetihehe
Sometimes you can't have realistic data without writing some code first, that will allow you to gather that data.
ddalex
I came to the same conclusion months ago with the advent of coding LLMs... I used to treat code as precious, carefully saving, cataloguing and referencing git commits and SQL scripts and command line histories, because they were painful to write and thus valuable. I mean, they were, because when I had the same need again, I could find previous instances and reuse stuff.
Now? I just dump the problem into an LLM and it spits out a script that solves it, and then I throw away the script because it's horribly bugged for any case other than my particular problem. But hey, I just solved my problem.
underdeserver
This is a good place to mention profile-guided optimization.
In PGO you profile your application running in the real world (via automatic instrumentation) and the compiler uses the output to make optimization decisions.
I'm not sure what exactly they use it for, but I imagine it could be used for loop unrolling, switch case ordering, branch prediction optimization, memory ordering and more.
menaerus
What if your "application" is server-side code hosting a multitude of different types of workloads, e.g. databases? I was never really convinced by the PGO benefits for such open-ended and complex workloads.
ltbarcly3
PGO operates at the CPU branch level. The main benefit, if I understand correctly, is to reduce branch mispredictions, which lead to pipeline stalls.
There's absolutely no reason to think that, for the specific workload of a system, a random/arbitrary/heuristic guess is the best way to pick the most likely branch for a given branching instruction.
For example, when scanning data you might check if the current value is equal to a test value, and branch on that to either code that handles the match, or else code that moves to the next entry to test that. This may be extremely likely to match (95%) as in a hash table, or very likely to not match (string search). A compiler can't tell which is true, and your workload literally has no impact on this unless it is pathological.
ralegh
This is fine assuming the popular request types don’t change, but arguably if both new versions of matching are sufficiently fast then I would prefer Ken’s in the long term, as the other could become slow again if the distribution of request types changes.
sfilmeyer
As a counterpoint, what fraction of the future engineers who will touch the project are likely to be able to competently edit the finite automata based version without introducing bugs and what fraction will be able to competently edit the if statement that checks the particular policy?
mikepurvis
A further question mark is whether any of this has sufficient instrumentation to be able to notice and act on a change if and when it occurs.
scott_w
I think you missed the point. Ken's version wasn't removed, it was simply prepended with something like:
    if value in most_common_range:
        take_fast_path()
    else:
        use_kens_solution()
andrepd
Nonsense. The pre-check can literally be one line (if common_case {fast_path()} else {slow_path()}), and thus enabling or disabling it is dead simple and obvious if the problem changes in the future.
Lines of thinking like that are part of the reason most modern software is so sloooow :)
Rendello
This situation where two paths produce the same output but one is optimized is the easiest case in property-based testing, as the property is just:
normal(x) == optimized(x)
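A minimal sketch of that check as plain randomized testing in Java (no property-based-testing library; encodeReference/encodeFast are hypothetical stand-ins for the normal and optimized implementations, here a trivial LEB128 pair just to make it runnable):

    import java.util.Arrays;
    import java.util.SplittableRandom;

    public class EquivalenceCheck {
        // Reference implementation: unsigned LEB128, one loop iteration per 7 bits.
        static byte[] encodeReference(long v) {
            byte[] out = new byte[10];
            int i = 0;
            while (true) {
                byte b = (byte) (v & 0x7F);
                v >>>= 7;
                if (v == 0) { out[i++] = b; break; }
                out[i++] = (byte) (b | 0x80);
            }
            return Arrays.copyOf(out, i);
        }

        // "Optimized" variant with a one-byte fast path; must stay bit-for-bit identical.
        static byte[] encodeFast(long v) {
            if ((v & ~0x7FL) == 0) {               // fits in 7 bits: single byte
                return new byte[] { (byte) v };
            }
            return encodeReference(v);             // fall back for larger values
        }

        public static void main(String[] args) {
            SplittableRandom rng = new SplittableRandom(42);
            for (int i = 0; i < 10_000_000; i++) {
                // Mix small and large magnitudes so the fast path is actually exercised.
                long x = rng.nextLong() >>> rng.nextInt(64);
                if (!Arrays.equals(encodeReference(x), encodeFast(x))) {
                    throw new AssertionError("mismatch for " + Long.toUnsignedString(x));
                }
            }
            System.out.println("property held on all sampled inputs");
        }
    }

A real property-based-testing library (jqwik, QuickCheck, Hypothesis, proptest, ...) adds shrinking and smarter input generation on top of the same idea.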
vlovich123
Hyperoptimizing for the fast path today and ignoring that hardware and usage patterns change is the reason modern software is so slooow :)
A more robust strategy would at least be to check if the rule was the same as the previous one (or use a small hash table) so that the system is self-healing.
Ken’s solution is at least robust, and by that property I would prefer it: it's just as fast but doesn't have any weird tail latencies, since requests outside your expected distribution are as fast as the ones in it.
ants_everywhere
You can even track the request statistics live and disable the fast path if the distribution of requests changes significantly.
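A rough sketch of that idea (all names here are hypothetical, not from the anecdote): count how often the cheap pre-check actually hits, and have a periodic task turn it off when the distribution drifts.

    import java.util.concurrent.atomic.LongAdder;
    import java.util.function.Function;
    import java.util.function.Predicate;

    // Sketch of a self-correcting fast path. Hypothetical API: commonCase is the
    // cheap 99.9% pre-check, fullMatch is the general matcher (e.g. the automaton).
    final class AdaptiveFastPath<R, P> {
        private final Predicate<R> commonCase;
        private final P commonResult;
        private final Function<R, P> fullMatch;

        private final LongAdder attempts = new LongAdder();
        private final LongAdder hits = new LongAdder();
        private volatile boolean fastPathEnabled = true;

        AdaptiveFastPath(Predicate<R> commonCase, P commonResult, Function<R, P> fullMatch) {
            this.commonCase = commonCase;
            this.commonResult = commonResult;
            this.fullMatch = fullMatch;
        }

        P match(R request) {
            if (fastPathEnabled) {
                attempts.increment();
                if (commonCase.test(request)) {
                    hits.increment();
                    return commonResult;
                }
            }
            return fullMatch.apply(request);
        }

        // Call periodically (e.g. from a scheduled task). The 0.5 threshold and the
        // sample size are arbitrary; a real version would also re-probe occasionally
        // after disabling, since attempts stop accumulating once the fast path is off.
        void reevaluate() {
            long a = attempts.sumThenReset();
            long h = hits.sumThenReset();
            if (a > 10_000) {
                fastPathEnabled = (double) h / a > 0.5;
            }
        }
    }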
n3t
Does anyone have a link to the talk?
n3t
Found it! https://youtu.be/nXaxk27zwlk?t=393 (there is more context before the timestamp)
MathMonkeyMan
Neither I nor ChatGPT could find it again.
sylware
I am writing assembly, and often you have many code paths/data structures which "fit". Each combination of code paths/data structures will favor some specific usage/data over the others. And this is not real "optimization", since usually those code paths have roughly "the same" cost. The bottom line: it is what you think of the semantics and real-life usage of this code that will drive those "choices".
That said, you can still have generic optimizations which will benefit nearly all (for instance quick sort).
jasonthorsness
Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.
Even for seemingly trivial scenarios like searching an array, the contents and length of the array make a massive difference in results and how to optimize (as shown in the last section of this write-up where I tried to benchmark search algorithms correctly https://www.jasonthorsness.com/23).
I've not seen a perfect solution to this that isn't just "thinking carefully about the test setup" (profile-guided optimization/production profiles replayed for benchmarks seem like maybe it could be an alternative, but I haven't seen that used much).
bgirard
> Identifying a representative usage scenario to optimize towards and then implementing that scenario in a microbenchmark test driver are both massively difficult to get right, and a "failure" in this regard, as the author found, can be hard to detect before you sink a lot of time into it.
I can vouch for this. I've been doing web performance for nearly 15 years, and finding clear, representative problems is one of the hardest parts of my job. Once I understood this and worked on getting better at finding representative issues, it became the single thing that boosted my productivity the most.
jasonthorsness
What have you found are the most useful approaches to collecting the representative issues in a way you can reuse and test? I haven’t worked as much in the web space.
bgirard
A combination of: 1) RUM, 2) Field tracing, 3) Local tracing with throttling (CPU, Network), 4) Focusing either on known problems, or if I'm on less certain territory I will do more vetting to make sure I'm chasing something real. I'll minimize the time spent on issues that I can only catch in one trace, but sometimes it's also all you get so I'll time box that work carefully. It's more of an art than a science.
viraptor
Tracing and continuous profiling made this task significantly easier than it was in the past, fortunately.
bgirard
It's great. Much harder on Web because it's difficult to get rich information from the browser. This is why I contributed to the JS Self-Profiling API in the first place: https://developer.mozilla.org/en-US/docs/Web/API/JS_Self-Pro...
CodesInChaos
Can your recommend tools for continuous profiling? I'm mainly interested in Rust and C# on Linux.
I'm not sure if it has gotten easier. Async is harder to profile than classic threaded code using blocking IO. And for database bound code, I'm not sure which databases output detailed enough performance information.
sfink
It's a subset of profile-guided optimization, but I fairly frequently will gather summary histograms from more or less representative cases and then use those to validate my assumptions. (And once in a while, write a benchmark that produces a similar distribution, but honestly it usually comes down to "I think this case is really rare" and checking whether it is indeed rare.)
anonymoushn
I have asked for like the ability to apply a filter that maps real strings from real users to a very small number of character classes, so that I may run some codec on data that is equivalent to user data for the purposes of the codec. I have heard that this is not a good use of my time, and if I really care so much I should instead make various guesses about the production data distribution of user-created strings (that need to be json-escaped), and then we can deploy each of them and keep the best, if we can even tell the difference.
jasonthorsness
If the real data is sensitive, it's hard to distill test data from it that succeeds in fully removing the sensitivity but that is still useful. Depending on the domain even median string length could be sensitive.
kazinator
If they were forking that JVM in order to offer it as a development tool to a broad developer base, then such an optimization might be worth keeping.
Someone out there might have an application in which they are serializing large numbers of large integers.
The case for abandoning it is not as strong, since you don't know who is doing what.
It's a guessing game in programming language run-times and compilers: "will anyone need this improvement?" You have room to invent "yes" vibes. :)
grogers
If the distribution of numbers you are serializing is uniformly random, you generally wouldn't choose a variable length encoding. Sometimes the choice is made for you of course, but not usually.
necovek
The problem is that this is a complex assembly implementation that needs to be maintained over the course of decades in the future. Unless you have a good reason to keep it, you should drop it.
spott
Why is just capturing real values not the answer here?
Something like grab .01% of the inputs to that function for a day or something like that (maybe a lower fraction over a week or month).
Is the cost of grabbing this that high?
adrianN
That works for some cases, but if you have a service that processes customer data you probably don’t want to make copies without consulting with legal.
heipei
Counterpoint: I once wrote a paper on accelerating blockciphers (AES et al) using CUDA, and while doing so I realised that most (if not all) previous academic work which had claimed incredible speedups had done so by benchmarking exclusively on zero-bytes. Since these blockciphers are implemented using lookup tables, this meant perfect cache hits on every block to be encrypted. Benchmarking on random data painted a very different, and in my opinion more realistic, picture.
atiedebee
Why not use real world data instead? Grab a large pdf or archive and use that as the benchmark.
bluGill
A common case is to compress data before encrypting it, so random is realistic. A PDF might allow some optimization (how, I don't know) that is not representative.
jbaber
Or at least the canonical test vectors. All zeros is a choice.
rdc12
Do you have a link to that paper?
almostgotcaught
There are an enormous number of papers like this. I wrote a paper accelerating a small CNN classifier on an FPGA, and when I compared against the previous SOTA GPU implementation, the numbers were way off from the paper. I did a git blame on their repo and found that after the paper was published they deleted the lines that short-circuited eval if the sample was all 0 (which much of their synthetic data was ¯\_(ツ)_/¯).
Marazan
Back when Genetic Algorithms were a hot topic, I discovered that a large number of papers discussing optimal parameterisation of the approach (mutation rate, cross-over, populations, etc.) were written using '1-Max' as the "problem" to be solved by the GA. (1-Max being attempting to make every bit of the candidate string a 1 rather than 0.)
This literally made the GA encoding exactly the same as the solution and also very, very obviously favoured techniques that would MAKE ALL THE BITS 1!
imtringued
This reminds me of this paper:
https://link.springer.com/article/10.1007/s10710-021-09425-5
They do the convergence analysis on the linear system Ax = 0, which means any iteration matrix (including a zero matrix) that produces shrinking numbers will converge to the obvious solution x = 0 and the genetic program just happens to produce iteration matrices with lots of zeros.
scottlamb
> For example, there was minimal use of java.lang.String because ain’t nobody got time for a second allocation of a char[] with associated indirection and GC churn.
An example of why I'd prefer to avoid Java for something like that. Java has a reputation for being slow because of JIT compilation and GC, but I also think a lot of it comes down to all non-primitive types being boxed Objects (boxed = in a separate allocation). This means it works the garbage collector so much harder, there's poorer locality, and it fills RAM/cache with extra Object stuff like mutexes that no one ever uses. (Apparently now 8 bytes after a lot of effort to reduce it. https://openjdk.org/jeps/450)
deepsun
> it fills RAM/cache with extra Object stuff
But it has a much smaller impact than in, say, Rust, where it's really an allocation (asking the kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from the kernel. So in the JVM, boxing is really cheap, often just a pointer increment. Also, oftentimes the wrapper object and its value are located near each other in the same memory page, so we're not adding a RAM access; it's more like an L2 access.
PS: In some cases it can be even faster than Rust or C++, because it pre-allocates larger pages and drops them in chunks within generational GC (e.g. all values "allocated" to process one HTTP request can be GCed immediately after), while C++ is eager to destruct each object immediately. Also, a GC sweep can happen in a separate thread, not bothering the main thread the user is waiting for. One can do the same in Rust and C++ using arenas, of course.
scottlamb
> But it has much lesser impact than in, say, Rust, where it's really an allocation (asking kernel for more RAM). Java's Object "allocation" happens in its own heap, which is a big chunk of RAM already pre-allocated from kernel.
What? No. Rust doesn't ask the kernel for each allocation individually. That'd be insane; besides the crushing system call overhead, the kernel only does allocations in whole pages (4 KiB on x86-64) so it'd be incredibly wasteful of RAM.
Rust does the same thing as virtually every non-GCed language. It uses a memory allocator [1] that does bookkeeping in userspace and asks the kernel for big chunks via sbrk and/or mmap/munmap. Probably not the whole heap as a single chunk of virtual memory as in Java, but much closer to that than to the other extreme of a separate kernel call for each allocation.
[1] by default, just libc's malloc and free, although you can override this, and many people choose jemalloc or mimalloc instead. My high-level description applies equally well to any of the three.
gf000
While Java just does a thread local pointer bump, which will still be more efficient, and closer to stack allocation.
Of course you can optimize better with Rust/C++/etc., but it is not trivial, and you definitely don't get it out of the box for free. My point is, it's a bit overblown how much overhead Java has.
deepsun
Yes, my mistake, thanks for pointing out, upvoting. I meant asking memory allocator, not kernel.
I meant that Java usually already has that memory allocated; the JVM is a memory allocator of its own. It operates within its own heap. One can do that within Rust of course (or more easily in Zig, as it welcomes passing an allocator around); it's just built into the JVM already. The drawback is that it's more complex to release that memory back from the JVM process, so Java apps (not AOT-compiled) usually consume more RAM.
labadal
I'm glad that I'm over that phase I had in university where I wanted to write a custom memory allocator for everything because "I understand my usage better". I will admit that it was a good bit of fun though.
IshKebab
Aside from Rust not working like that (as scottlamb said), Rust is faster than Java even if Java has a faster allocator because Rust code usually does much less allocation in the first place.
jeroenhd
I don't know if Rust code allocates more or less in general. It really depends on what kind of code you write. Once Rust code reaches the complexity of the Java stacks it's replacing, you get a lot of wrapper objects, locks, and intermediates to cross thread boundaries and to prove soundness to the borrow checker.
I recently encountered an example of someone writing a Rust version of a popular Java library by just taking the Java code, commenting it out, and writing the Rust equivalent almost line for line. The approach works great (no need to reinvent the wheel and you can point to the existing documentation and code samples) but in terms of allocations, you're not going to find many improvements.
There's a type of Java code that looks more like C code than anything else that runs blazing fast with minimal overhead. It's not the type of Java code you'll probably encounter when writing Java applications, but if you use Java as a kind of cross-platform C target, you can get pretty close to Rust (and for some use cases even beat it). Java has a LOT of tricks up its sleeve (pointer compression, dynamic realignment) that Rust can't automatically take advantage of.
Your AbstractFunctorClassFactoryProducer isn't going to be very allocation efficient, but once you start seeing volatile ints all over the place, things quickly become a lot faster.
mcosta
An example of why I'd prefer to avoid Java for somethi [Core dumped]
pjc50
I like how dotnet has acknowledged this and provided a whole lot of machinery recently for things like value types on the stack and Span<T> to reduce array copying.
pjmlp
I would have liked both to have acknowledged this to the point of languages like Oberon (all variants predate them), Eiffel, Modula-3, CLU, Cedar, ...
While they were used as inspiration in their design, the AOT approach, with automatic memory management and programming language features for low-level coding, really took a while to come back from them; even NGEN/unsafe from C# v1.0, while welcome, wasn't that great.
I always think that in an alternate reality, had they had those features already in version 1.0, C and C++ would have lost even more mindshare at the turn of the century, and something like C++11 would not have been as impactful.
snickerdoodle12
> but I also think a lot of it comes down to all non-primitive types being boxed Objects
They're working on it: https://openjdk.org/projects/valhalla/
eulgro
Project Valhalla started over 10 years ago. It keeps being mentioned, but with seemingly no end in sight. Is there an ETA?
bsder
It's not like C++ users don't say the exact same thing about String, though.
The problem is that String really isn't a language primitive.
Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
scottlamb
> It's not like C++ users don't say the exact same thing about String, though.
If they do, they're wrong, as the two languages are quite different here. In Java, String requires two allocations: your variable is implicitly a pointer to String allocation, which in turn has a pointer to a char[] allocation. In C++, the std::string itself is a value type. The actual bytes might be inline (short string optimization) or behind a single allocation accessible from the std::string.
Rust's std::string::String is somewhere between: it's a value type but does not have a short string optimization (unless you count the empty string returned by String::new).
> Different people and programs always have a different notion of what a String should do (Should it be mutable? Should it always be valid UTF-8? Which operations should be O(1), O(n) or O(n log n)? etc.)
Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that you can't have a class or struct that bundles up the parts of your data structure (in the case of a mutable string, two fields: data pointer/capacity + the used length) without boxing. There's a heavy cost to any non-primitive data type.
josephg
> Sure, there can be call for writing your own String type. But what's unique about Java as compared to say C, C++, Go, Rust, even to some extent C# is that …
You also can’t make a first class string type in most of those language because “hi” is defined to be of a system specified class. You can make a different type to store strings but it’ll never be as ergonomic to use.
It’s even worse in JavaScript, where the standard library is written in a different language (usually C++). It’s impossible to write JavaScript code that matches the performance of the built in string primitive. At least outside of specific niche use cases.
Rust has lots of alternate string like libraries in cargo that are optimized better for different use cases - like smallstring or ropey. They’re fast, convenient and ergonomic. I imagine C++ is the same.
layer8
> In Java, String requires two allocations
That’s true, but thanks to GC an allocation also is just a pointer bump in Java, and the two allocations are likely to be close to each other. For short-lived strings, the GC cost is furthermore zero, because only the longer-lived objects need to be tracked and copied with generational GC. So, “heavy cost” is relative.
grogers
Also, you can't construct a Java String without copying the data into it, because String is immutable. In other languages like C++ or Rust the string type is mutable, so you don't need an extra copy. Java doesn't even special-case blessed APIs like StringBuilder to avoid the extra copy: since StringBuilder doesn't have a method to consume its inner buffer, you can only create a string from it by copying that buffer, even though creating multiple strings from a given StringBuilder isn't the normal usage.
pjmlp
Except many aren't aware that as long as std::string fulfills the ISO C++ complexity requirements for its implementation, anything goes.
This is less problematic nowadays, because for many folks there are only three compilers that still matter, or even a single one.
charleslmunger
I tried implementing an optimized varint encoder that did something similar, by populating an 8-byte value and then storing it to RAM, but unaligned overlapping stores caused big regressions. The approach that worked required a different technique. This one is for encoding backwards:
1. One branch for the one-byte case, always inlined, just store the byte
2. Calculate the size of the unencoded zero prefix with a branch-free construction: (((uint32_t)clz + 7) * 9) >> 6
3. Hand roll a jump table taking advantage of ARM's fixed instruction width to calculate the jump target with a shift.
https://github.com/protocolbuffers/protobuf/commit/b039dfe26...
This results in one branch for 1 byte varints, plus one additional branch for any larger size, and the branch predictor does not have to track a varying trip count through a loop. This approach resulted in a 2.64% speedup for overall encoding (which includes a lot more than just varints) on mid sized arm cores.
I think it's very difficult to beat a single comparison and branch for the 1 byte case for actual encoding forwards or backwards, unless you know there's going to be long runs of one-byte values.
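For what it's worth, here is my reading of the formula in step 2, translated to Java and checked against a naive loop (this is just the size computation, not the protobuf encoder itself): for a 64-bit input, ((clz + 7) * 9) >> 6 comes out to 10 minus the LEB128 byte count, i.e. the number of unused bytes of a maximal 10-byte varint.

    public class VarintSize {
        // Naive size computation for unsigned LEB128: one byte per 7 bits, minimum 1.
        static int sizeLoop(long v) {
            int n = 1;
            while ((v >>>= 7) != 0) n++;
            return n;
        }

        // Branch-free variant based on the formula quoted above.
        // numberOfLeadingZeros returns 64 for v == 0, so size 1 falls out naturally.
        static int sizeBranchFree(long v) {
            int clz = Long.numberOfLeadingZeros(v);
            return 10 - (((clz + 7) * 9) >>> 6);
        }

        public static void main(String[] args) {
            // Compare at every bit width, including the 0- and 64-bit edge cases.
            for (int bits = 0; bits <= 64; bits++) {
                long v = (bits == 0) ? 0 : (bits == 64 ? -1L : (1L << bits) - 1);
                if (sizeLoop(v) != sizeBranchFree(v)) {
                    throw new AssertionError("disagree at " + bits + " bits");
                }
            }
            System.out.println("branch-free size matches the loop for all widths");
        }
    }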
OhMeadhbh
<snark>Time spent optimizing assembly language is never wasted.</snark>
That's only half-snark. It was "wasted" in the sense that it wasn't as useful as the OP thought it would be. But diving deep into a technical problem? That left an indelible pattern on the OP's psyche. If they ever have to do something like that again, it will certainly be easier. And since AI is going to be doing all the simple coding tasks, humans will have to dive into the deep, hard tasks.
bluGill
Also, there is no way for the author to know that his optimizations wouldn't work without trying them. We are not even sure that his analysis of why they work is correct (though it sounds reasonable). There is also the possibility that the author could use the newfound information to fix his optimizations so they are better.
chaosmachine
This reminds me of Benford's law[1]
"Benford's law, also known as the Newcomb–Benford law, the law of anomalous numbers, or the first-digit law, is an observation that in many real-life sets of numerical data, the leading digit is likely to be small."
Const-me
LEB128 is slow and there’s no way around that. I don’t know why so many data formats are using that codec.
The good variable-length integer encoding is the one used in the MKV format: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si... The compression win, i.e. the encoded size, is equal to LEB128's; however, both encoding and decoding are way more efficient.
The encoder is a few fast instructions for all numbers, no matter how large or small: bit scan reverse (or leading zero count) to find the count of bits required to store the input number, division by 7 rounding up to compute the count of bytes (modern compilers reliably optimize this into integer multiplication and bit shifts), bitwise shift + OR to set the VINT_MARKER bit, then byte swap to store in big-endian format.
The decoder is equally efficient: compare the first byte with 0 to detect numbers longer than 8 bytes (which very rarely happens), bit scan reverse of either the first or second byte to find the encoded length, byte swap to load the big-endian format, then reset the most significant bit in the integer to get rid of the VINT_MARKER.
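A rough Java sketch of that scheme as I read RFC 8794 (limited to values below 2^56 so the encoding fits in 8 bytes and a single long; the all-ones VINT_DATA patterns that EBML reserves for "unknown size" are ignored here):

    public class Vint {
        // Width in bytes: ceil(bits/7), minimum 1 (this sketch assumes value < 2^56).
        static int width(long value) {
            int bits = 64 - Long.numberOfLeadingZeros(value);
            return Math.max(1, (bits + 6) / 7);
        }

        // Encode: set the VINT_MARKER bit just above the data bits, store big-endian.
        static int encode(long value, byte[] buf, int off) {
            int w = width(value);
            long encoded = value | (1L << (7 * w));
            for (int i = 0; i < w; i++) {
                buf[off + i] = (byte) (encoded >>> (8 * (w - 1 - i)));
            }
            return w;
        }

        // Decode: leading zeros of the first byte give the width, then clear the marker.
        static long decode(byte[] buf, int off) {
            int w = Integer.numberOfLeadingZeros(buf[off] & 0xFF) - 24 + 1;
            long encoded = 0;
            for (int i = 0; i < w; i++) {
                encoded = (encoded << 8) | (buf[off + i] & 0xFF);
            }
            return encoded ^ (1L << (7 * w));
        }

        public static void main(String[] args) {
            byte[] buf = new byte[8];
            for (long v : new long[] {0, 1, 127, 128, 300, 1L << 20, (1L << 56) - 1}) {
                int w = encode(v, buf, 0);
                long back = decode(buf, 0);
                System.out.println(v + " -> " + w + " byte(s) -> " + back);
                if (back != v) throw new AssertionError();
            }
        }
    }

The byte loops stand in for the byte swap; a production version would use a single bswap-style 64-bit store/load plus the zero-first-byte check for lengths above 8.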
corysama
Reminds me of https://ridiculousfish.com/blog/posts/old-age-and-treachery....
Data condition can definitely affect runtime performance. All the way down to the micro architecture level. Random, sorted, reverse-sorted, uniform data sets often lead to dramatic differences in perf. Then you get into “Ooops, my dataset had a surprisingly large number of NaNs or denormalized floats!”
zem
> We were in the early phases of forking the JVM
definitely made me go :-O - how big a company do you need to be for that to be a realistic thing to maintain in house?
cbzoiav
A fork can be many things.
I've maintained in house forks of major code bases with a handful of minor code changes while we've tried to convince upstream to accept them.
scheme271
Off-hand I think that there were at least 7 forks/implementations out there: Oracle/Sun, Oracle's Graal, IBM, OpenJDK, Amazon Corretto, Twitter's fork, and Azul. There's also a GNU effort, but I don't think that really went very far. I think there's probably a few others that aren't publicly well known, for trading firms, etc.
throwaway2037
JetBrains also has a fork that they use to run some (all?) of their IDEs, such as CLion (C/C++) and IntelliJ (Java). As I understand, it has many fixes for HiDPI environments when painting GUI widgets (Swing, etc.).
brabel
> there were at least 7 forks/implementations out there
There are 17 listed on SDKMAN's website: https://sdkman.io/jdks
> OpenJDK
As far as I know, all other distributions of the JDK are forks of OpenJDK. Would love to know if there's any that isn't.
pjmlp
Commercial embedded vendors with RTOS capabilities,
https://www.ptc.com/en/products/developer-tools/perc
https://www.aicas.com/products-services/jamaicavm/
JikesRVM (https://www.jikesrvm.org) no longer active, but it was its own thing.
Then you have Azul Zing and OpenJ9, which use parts of OpenJDK for the standard library part, but the foundations are their own thing.
https://www.azul.com/products/prime/
https://developer.ibm.com/components/open-j9/
Then you have that mobile OS, which isn't really JVM compliant implementation, but it is close enough for many workloads and libraries to work on top of it.
_glass
SAP has also a fork: https://github.com/SAP/SapMachine.
titzer
I'm surprised they didn't just add a fast path of 1 and 2 byte LEBs to the hard-won assembly they generated. These fast paths are absolutely worth it. E.g. Wizard's Wasm interpreter uses a 3 instruction fastpath to decode 1-byte LEBs and then jumps out-of-line to handle the general case. It is worth 5-10% for the interpreter performance.
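The shape of that in Java (just a sketch of the idea — a sign-bit test on the first byte as the inline fast path, with the loop out of line — not Wizard's actual code):

    public class LebDecode {
        static int pos;  // shared cursor, kept static only to keep the sketch short

        // Unsigned LEB128 read with a one-byte fast path: a byte with the high
        // bit clear is non-negative in Java, so one compare covers the common case.
        static long readVarint(byte[] buf) {
            byte b = buf[pos];
            if (b >= 0) {
                pos++;
                return b;
            }
            return readVarintSlow(buf);   // out-of-line general case
        }

        static long readVarintSlow(byte[] buf) {
            long result = 0;
            int shift = 0;
            while (true) {
                byte b = buf[pos++];
                result |= (long) (b & 0x7F) << shift;
                if (b >= 0) return result;   // last byte has the high bit clear
                shift += 7;
            }
        }

        public static void main(String[] args) {
            byte[] buf = {0x05, (byte) 0xAC, 0x02, 0x7F};  // encodes 5, 300, 127
            pos = 0;
            System.out.println(readVarint(buf));  // 5
            System.out.println(readVarint(buf));  // 300
            System.out.println(readVarint(buf));  // 127
        }
    }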
menaerus
The issue was that they were hand-optimizing the SIMD for a workload that doesn't even exist in their case so they ended up seeing one big nothing once they profiled the code with a new optimization - a classic trap with 95% "optimizations". It turned out that the distribution of their data is in those 1 and 2 byte LEBs, and in that case their SIMD approach doesn't give the gains as it does with 9 and 10 byte LEBs.
I am wondering more about the bit-by-bit compatibility part from "This was verified in a benchmark that ran both versions on billions of random numbers, confirming both the huge speedup and the bit-for-bit compatibility of the result.".
How does one test for 18.4 quintillion numbers? This isn't possible.
phire
The existing implementation was already "highly optimised", so it would have had a fast path for short ints (they probably profiled on realistic data sets).
Adding the fast path to the hard-won assembly is just optimising the slow path, and there is little point to optimising it. Complex optimisations are also a form of technical debt, so they are probably better off not integrating it.
gorset
In a previous project we used fixedint32/64 instead of varint values in the schema for messages where performance was critical.
That left only varint used for tags. For encoding we already know the size at compile time. We also don’t have to decode them since we can match on the raw bytes (again, known at compile time).
A final optimization is to make sure the message has a fixed size and then just mutate the bytes of a template directly before shipping it off. Hard to beat.
In a newer project we just use SBE, but it lacks some of the ergonomics of protobuf.
MathMonkeyMan
Chandler Carruth told a similar story in one of his talks.
He met Ken Thompson and saw beautiful C code for the first time because he had encountered a performance problem in a service. The service had to choose a policy to enforce (or something) based on the incoming request. It was taking too long to match the criteria of each potential policy against the request.
Ken wrote a finite automata based pattern matcher that would simultaneously advance through all of the policies' predicates. It was perfect, and it was much faster than the existing code.
Then somebody noticed that 99.9% of requests matched a particular policy, so they changed the existing code to just check that predicate first, and the code sped up a zillion times, much more than with Ken's solution.