Skip to content(if available)orjump to list(if available)

Strings Just Got Faster

Strings Just Got Faster

82 comments

·May 1, 2025

stevoski

I find the entire history of improvements to Java’s String class enjoyable to read about.

Over the years, the implementation of Java’s String class has been improved again and again, offering performance improvements and memory usage reduction. And us Java developers get these improvements with no work required other than updating the JRE we use.

All the low-hanging fruit was taken years ago, of course. These days, I’m sure most Java apps would barely get any noticeable improvement from further String improvements, such as the one in the article we’re discussing.

neuroelectron

When I started my career in software development, SDE, and soon advanced to SRE, I hated Java. The extreme OOP paradigm made enterprise class situations impossible to understand. But after a few short years, I began to appreciate it as a real, battle hardened ecology. Now, I consider it much better than modern trends such as Rust and Python.

These kinds of niche optimizations are still significant. The OOP model allows them to be implemented with much less fanfare. This is in the context of billion-dollar platforms. With some basic performance testing and API replays, we're saving thousands of dollars a day. Nobody gets a pat on the back. Maybe some pizza on Friday.

jimmaswell

I've seen Java described as made for companies to be able to rotate out mediocre programmers as efficiently as possible without letting them mess things up easily, and it makes a lot of sense from that perspective. Barebones semantics to the point of being Spartan (can't even define your own operator overloads), easy to take another class and copy it with small modifications but not mess it up for anyone else (inheritance)..

Then there's C# which most anyone who's enthusiastic about software dev will find far nicer to work with, but it's probably harder for bargain basement offshore sweatshops to bang their head against.

nradov

The lack of operator overloading is a bit annoying but in practice seldom a real problem. An operator is just a funny looking method. So what.

There are worse fundamental problems in Java. For example the lack of a proper numeric tower. Or the need to rely on annotations to indicate something as basic as nullabilty.

ivan_gammel

I remember the times on one of professional forums, where there was lots of questions about architecture in C# sections and almost none in Java section. Abundance of tools creates abundance of possibilities to get confused about what’s right. In Java many design decisions converged to some dominant design long time ago, so you no longer think about it and focus on business. It’s sometimes as bad as getter verbosity (thankfully record style is getting traction), but in most cases it’s just fine.

threeseed

You can use the JVM without needing to use OOP e.g. Scala, Clojure, Python, Javascript, Ruby etc.

Then you can get to benefit from Java's unparalleled ecosystem of enterprise hardened libraries, monitoring etc.

niuzeta

I love hearing more about this, especially the historical context, but don't have a good java writeups/articles on this. Would you mind sharing some suggestions/pointers? I'd very much appreciate it.

stevoski

A good starting point is Joshua Bloch’s Effective Java. He shares some stories there from Java’s early days, and - at least in passing - mentions some aspects of the String class’s history.

niuzeta

Ah, I certainly remember these anecdotes! What other resources would you recommend(even the tidbits) could there be for more modern Java? The original article like this one should be treasured.

DaiPlusPlus

> Would you mind sharing some suggestions/pointers?

I would, but unfortunately I got a NullPointerException.

I suggest you try Rust instead; its borrow checker will ensure you can't share pointers in an unsafe manner.

gavinray

This post makes mention of a new JEP I hadn't heard of before: "Stable Values"

https://openjdk.org/jeps/502

https://cr.openjdk.org/~pminborg/stable-values2/api/java.bas...

I don't understand the functional difference between the suggested StableValue and Records, or Value Classes.

They define a StableValue as:

  > "A stable value is a holder of contents that can be set at most once."
Records were defined as:

  > "...  classes that act as transparent carriers for immutable data. Records can be thought of as nominal tuples."
And Value Objects/Classes as:

  > "... value objects, class instances that have only final fields and lack object identity."
Both Records and Value Objects are immutable, and hence can only have their contents set upon creation or static initalization.

layer8

Record fields cannot be lazily initialized. The point of StableValue is lazy initialization, meaning that their value is stable if and only if they carry a non-default value (i.e. after initialization). If you don’t need lazy initialization, you can just use a regular final field. For a non-final field, without StableValue the JIT optimizer can’t tell if it is stable or not.

The implementation of a value object will be able to use StableValue internally for lazy computation and/or caching of derived values.

whartung

I don't know, these are mostly uninformed guesses, but the distinction between Records and Value objects is that the contents lack object identity.

Which, to me, means, potentially, two things.

One, that the JVM can de-dup "anything", like, in theory, it can with Strings now. VOs that are equal are the same, rather than relying on object identity.

But, also, two, it can copy the contents of the VO to consolidate them into a single unit.

Typically, Java Objects and records are blobs of pointers. Each field pointing to something else.

With Value Objects that may not be the case. Instead of acting as a collection of pointers, a VO with VOs in it may more be like a C struct containing structs itself -- a single, continuous block of memory.

So, an Object is a collection of pointers. A Record is a collection of immutable pointers. A Value Object is (may be) a cohesive, contiguous block of memory to represent its contents.

sagacity

Handwavy explanation: A stable value is used as a static constant, with the difference being that you can initialize it at runtime. Once initialized it is treated as fully constant by the JVM. It's similar to something like lateinit in Kotlin, except on the JVM level.

Records are also immutable, but you can create them and delete them throughout your application like you would a regular class.

w10-1

> used as a static constant

Yes, but remind people it's not static in the sense of being associated with the class, nor constant for compile-time purposes.

Perhaps better to say: A stable value is lazy, set on first use, resulting in pre- and post- initialization states. The data being set once means you cannot observe a data change (i.e., appears to be immutable), but you could observe reduction in resource utilization when comparing instances with pre-set or un-set values -- less memory or time or other side-effects of value initialization.

So even if data-immutable, a class with a stable value ends up with behavior combinations of two states for each stable value. Immutable records or classes without stable values have no such behavior changes.

But, writ large, we've always had this with the JVM's hotspot optimizations.

For String, it becomes significant whether hashcode is used when calculating equals (as a fast path to negative result). If not, one would have two equal instances that will behave differently (though producing the same data), at least for one hashcode-dependent operation.

pkulak

So every time you take the hash of a string you leak 4 bytes of memory???

I assume it's static in the context of it's containing object. So, it will be collected when it's string is collected.

gavinray

But this is also achievable with static init methods on records and value classes, right?

  record Rational(int num, int denom) {
      Rational {
          int gcd = gcd(num, denom);
          num /= gcd;
          denom /= gcd;
      }
  }

sagacity

How would you do the same thing with the late initialization of, say, a HashMap where you don't control the constructor?

leksak

> It's similar to something like lateinit in Kotlin, except on the JVM level.

What level are you suggesting lateinit happens at if not on the JVM?

Tmpod

I assume they mean this feature is built into the JVM itself, whereas Kotlin's lateinit more or less "just" desugars into code you could otherwise write yourself.

drob518

A stable value, as I understand it, is either not-yet-computed or is constant. In other words, once computed it’s constant and the JIT can therefore treat it as constant.

_old_dude_

In Java, constants are declared as static final.

  static final Complex CONSTANT = new Complex(1, 2);
If you want a lazy initialized constant, you want a stable value

  static final StableValue<Complex> STABLE_VALUE = StableValue.of();

  Complex getLazyConstant() {
    return STABLE_VALUE.orElseGet(() -> new Complex(1, 2))
  }
If you want the fields of a constant to be constant too, Complex has to be declared has a record.

null

[deleted]

blacklion

Records can be changed via reflection and thus doesn't participate in constant-folding in JIT phases, as this could break, for example, some serialization libraries and other nasty and dirty code. It will work in interpreter mode and eventually has heisenbugs after JIT. Not good.

`@Stable` annotation (only internal for now) and `StableValue<>` (for user code in future) says JIT that programmer guarantee (swear by his life!) that no dirty tricks are played with these values in whole codebase and JIT can constant-fold these values as soon as they are initialized.

_old_dude_

I've just tried to change a field of an instance of a record using reflection, and it's not possible, even with setAccessible(true).

blacklion

ooops, my bad. I remember, that Aleksey Shipilëv explained why even `final static` fields are not constant-folded by JIT, and thought that record classes which were introduced later, is the same.

It means, that `StableValue<>` can be used in simple classes (where `final` fields are still not constant-folded) and, additionally, supports late initialization.

int_19h

I'm rather surprised that string hashes aren't randomized in JVM - or at least that's what the article implies. These days, it's a fairly common mitigation for DDoS attacks based on hash collisions. Does Java not do it to preserve backwards compatibility because too much existing code relies on a specific hashing algorithm?

ncruces

The hashing algorithm is specified in the docs:

   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Per Hyrum's law, there's no changing it now.

https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...

ivan_gammel

Java hashCode contract is to optimize calculation of hash for performance not for collision search resistance. Its sole purpose is to use it in collections. It must not be used in situations where you need cryptographic properties.

tialaramex

So, the problem here is that you're thinking of "cryptographic properties" as only the "cryptographic hashes" such as those built with the Merkle–Damgård construction (SHA-512/256 being the one you might reasonably pick today)

But, it's actually desirable to have some cryptographic properties in a faster one way function for making hash tables. Read about SipHash to see why.

Because Java didn't (and as others have discussed, now can't) choose otherwise the hash table structures provided must resist sabotage via collision, which isn't necessary if your attacker can't collide the hash you use.

ivan_gammel

There’s no reason to add extra properties and associated complexity to hash used in collections because of one exotic use case. To be able to execute hash flooding, the server must accumulate arbitrary user inputs in a hash map, which is problematic design anyway. Even if that would make sense, what kind of function could work as 32 bit gash? You mention SipHash as an example: it is poor example anyway, because this function requires a secret key - meaning Java would have to do the key management behind the scene (just for strings? For Number subclasses?) or impose key management on API users. What’s the point? The case is so rare that it’s easier for developers of vulnerable applications to use a wrapper with desired hash properties.

jillyboel

java hashcodes are just 4 bytes, there will always be collisions

GolDDranks

The thing is, even it being used just in collections can lead to DOS, if the attacker can control the string contents, and selectively choose strings that your hash table stops being a hash table and turns into a linked list.

ivan_gammel

That’s clear and that is not a reason to have it in general purpose collections or String::hashCode. If your app is vulnerable to this sort of attack, just use a wrapper for keys and specialized collection (you may want to limit the maximum size of it too).

null

[deleted]

andrewaylett

Even in collections, an unstable hash is desirable -- to avoid denial of service attacks caused by attacker-controlled hash collisions.

For example, Python back in 2012: https://nvd.nist.gov/vuln/detail/cve-2012-1150.

ivan_gammel

There exists CVE-2012-5373 for Java and it is not fixed because it is not a risk worth taking care of.

pfdietz

Or the hash table implementation should be resistant to collisions, falling back to another data structure in that case (as described below, using trees instead of lists in buckets with sufficiently many occupants.)

sedatk

Java uses a tree instead of a linked list for collided items, so the search performance degrades more gracefully (e.g. O(N) vs O(logN)).

77

To be more precise, Java initially uses a linked list for nodes within a bin. If the number of items inside the bin crosses TREEIFY_THRESHOLD (which is 8), then that specific bin is converted into a RB tree.

This is detailed in implementation notes comment here: https://github.com/openjdk/jdk/blob/56468c42bef8524e53a929dc...

yxhuvud

Nothing stops the jvm from caching hashes even if the hashes are unique per process invocation.

PaulHoule

They aren’t and it’s quite unfortunate that the empty string hashes to 0 so it will have to recompute every time although presumably it is quick to compute the hash of the empty string.

daxfohl

Cool that we're still seeing perf improvements after all these years! I'm confused by some of the details in the example. Like, would we see similar 8x improvement in a simpler example like a string hashset lookup? Is there something special about MethodHandle or immutable maps here that accentuates the improvement?

> Computing the hash code of the String “malloc” (which is always -1081483544)

Makes sense. Very cool.

> Probing the immutable Map (i.e., compute the internal array index which is always the same for the malloc hashcode)

How would this work? "Compute" seems like something that would be unaffected by the new attribute. Unless it's stably memoizing, but then I don't quite see what it would be memoizing here: it's already a hash map.

> Retrieving the associated MethodHandle (which always resides on said computed index)

Has this changed? Returning the value in a hash map once you've identified the index has always been zero overhead, no?

> Resolving the actual native call (which is always the native malloc() call)

Was this previously "lazyinit" also? If so, makes sense, though would be nice if this was explained in the article.

twic

> How would this work? "Compute" seems like something that would be unaffected by the new attribute.

The index is computed from the hashcode and the size of the array. Now that the hash code can be treated as a constant, and the size of the array is already a constant, the index can be worked out at compile time. The JVM can basically inline all the methods involved in creating and probing the map, and eliminate it entirely.

daxfohl

The bucket is computed from the hash code and the size of the array, but that's not necessarily the index. If there are no bucket collisions, then index==bucket and this works out. But if there are bucket collisions then the index will be different from the bucket. So you still need some computation IIUC. And there's no way to memoize that result, since memoization would require a hashmap that has the exact same characteristics as the original hashmap.

I guess a @Stable attribute on the array underlying the map would allow for the elimination of one redirection: in a mutable map the underlying array can get resized so its pointer isn't stable. With an annotated immutable map it could be (though IDK whether that'd work with GC defrag etc). But that seems like relatively small potatoes? I don't see a way to "pretend the map isn't even there".

rf15

Strange to make an entire piece on String specifically - it seems like this is just based on the work done for the @Stable annotation and the many effects it can have on classes during runtime.

layer8

It’s a bit unfortunate that the user code equivalent (JEP 502) comes at the cost of an extra object per “stable” field. Lazy initialization is often motivated by avoiding creating an object up-front, but with this new pattern you’ll have to create one anyway.

ashvardanian

Has anyone done/shared a recent benchmark comparing JNI call latency across Java runtimes? I’m exploring the idea of bringing my strings library to the JVM ecosystem, but in the past, JNI overhead has made this impractical.

jbverschoor

So strings don’t get hash code at compile time?

At first I thought the article was describing something similar to Ruby’s symbols

vault

why is a website about programming not writing code blocks in ASCII? I'm referring to quotes and other undesirable symbols

VWWHFSfQ

Java supports non-ASCII source code since forever. A lot of languages do.

dullcrisp

But it doesn't support using fancy unicode quotes as string delimiters.

kazinator

I experimented with hash codes in strings in TXR Lisp, but took it out after several days:

  commit a136f37015cc2513878f75afcf8ba49fa61a88e5
  Author: Kaz Kylheku <kaz@kylheku.com>
  Date:   Sat Oct 8 20:54:05 2022 -0700

    strings: revert caching of hash value.

    Research indicates that this is something useful in
    languages that abuse strings for implementing symbols.
    We have interned symbols.

    * lib.h (struct string): Remove hash member.

    * lib.c (string_own, string, string_utf8, mkustring,
    string_extend, replace_str, chr_str_set): Remove
    all initializations and updates of the removed
    hash member.

    * hash.c (equal_hash): Do not cache string hash value.
> This improvement will benefit any immutable Map<String, V> with Strings as keys and where values (of arbitrary type V) are looked up via constant Strings.

Wait, what? But, that's inherently constant foldable without reasoning about string hash codes; we don't need them at all.

We examine the expression [h "a"]: lookup the key "a" in hash table h, where h is a hash literal object, that we write as #H(() ("a" "b)). It contains the key "a", mapping it to "b":

  1> (compile-toplevel '[#H(() ("a" "b")) "a"])
  #<sys:vm-desc: 8eaa130>
What's the code look like?

  2> (disassemble *1)
  data:
      0: "b"
  syms:
  code:
      0: 10000400 end d0
  instruction count:
      1
  #<sys:vm-desc: 8eaa130>
One instruction: just return "b" from the static data register d0. The hash table is completely gone.

The keys don't even have to be strings; that's a red herring.

esafak

Will Kotlin and Scala benefit?

daxfohl

They should. It's a JVM feature on core JDK classes.

taeric

This implies they have their own String implementations? I wouldn't be shocked to find out that is the case, but it would surprise me.

daxfohl

Why? IIUC it implies that they _don't_ have their own string implementations. They get the benefit of Java JDK's string implementation (and JVM optimization thereof) for free. If they had their own string implementations, they'd be unable to use this optimization until it's publicly available.

null

[deleted]