Why is hash(-1) == hash(-2) in Python?
43 comments
·January 7, 2025Terr_
nayuki
In other words, this is a case of https://en.wikipedia.org/wiki/In-band_signaling versus out-of-band signaling https://en.wikipedia.org/wiki/Signaling_(telecommunications)... .
EricRiese
Clears throat the precise wiki page you're looking for is the Semipredicate Problem
kevinventullo
We all know that types are not Pythonic.
(I’m only mostly joking)
kstrauser
Not really, I don’t think. Python code doesn’t ever really use hash() for anything where specific outputs are expected. Even if hash(a)==hash(b), it’s not implied that a==b or a is b.
tmvphil
But -2 seems like just a bad choice here. -1 and -2 are very "common" integers. Seems they could have just picked a quasi-random large integer to reduce the likelihood of collision. I expect hash functions to have collisions, but I expect it to be "unlikely" in a way that scales with the size of the hash.
kstrauser
I don't think that matters here, though. This specific hash function is literally used just for putting values in dict buckets. The docs say:
> Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
They're not to be used as a crypto digest. The downside here, then is that if -1 and -2 are used as dict keys, they'll end up bucketed together. Apparently that hasn't been enough of an issue over the years to bother changing it. One advantage is that those values are common enough that it might break code that incorrectly expects hash() to have predictable outputs. If it used 0xffffffff as the value, lots of busted tests may never stumble across it.
TRiG_Ireland
Yes, but having result types would avoid the need for this special casing.
ks2048
The next k for which hash(k) != k: 2**61-1
print(hash(2**61 - 2))
2305843009213693950
print(hash(2**61 - 1))
0
Galanwe
> So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value.
Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.
secondcoming
Out params are thing too.
kstrauser
Because the hash function is only reliably useful for laying out dict keys, and its outputs are totally opaque and coincidental.
The pigeonhole principle says there are an infinite number of inputs with the same hash output. Trying to figure out how that happens can be fun and enlightening, but why? “Because it just does, that’s all.”
Etheryte
I think this is not constructive criticism. There's a good, clear cut reason for this specific overlap (error handling), saying it collides just because is not really useful.
kstrauser
But even that’s an implementation detail that happens to be true today, but could easily disappear tomorrow. There’s nothing in the docs saying it has to be that way, or that you can infer anything from a hash output other than that it’s going to be consistent within a single execution of that Python program.
tooltower
This article is not about the API contract of the hash function, or the abstraction it provides. If you are just trying to hash things, you don't need any info here.
It's very much trying to go _under_ the abstraction layer to investigate its behavior. Because it's interesting.
This is very similar to how people investigate performance quirks or security issues.
sedatk
It could have been a bug. There's nothing wrong with the question and seeking answers to it. It was an interesting read too.
TRiG_Ireland
I enjoyed this, because it seems to be written by an interested and interesting human, at a level that I could actually follow.
snickerbockers
>CPython is written in C, and unlike Python, C doesn’t have exceptions. So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value
This is just wrong. C doesn't feature 'error handling' as a dedicated form of branching the way many higher level languages do and you're in no way required to use return codes as an error signal. This is a case of bad API design and is entirely python's fault.
jiggawatts
This seems like a loaded footgun: anyone writing a custome hash function has to know about -1 being treated like an error and handle it in a similar way.
I can’t think of any other language where this kind of thing happens, which means other developers won’t expect it either.
I can see the bug report now: “certain records cause errors, occurs only for about one in a few billion.”
dagw
If you're writing a custom hash function in python for a custom python class, then this won't affect you since it's only treated as an error in the underlying C code.
If you're adding a new type to the core python language then you have to be aware of this, but if you're hacking the C implementation to change the core language then you're probably pretty well versed in the cpython internals, or at least surrounded by people that are.
dagw
interestingly enough though, it doesn't seem to be possible to write a custom hash function that returns -1
>>> class Foo:
def __init__(self):
pass
def __hash__(self):
return -1
>>> f = Foo()
>>> print(hash(f))
-2
So if you're doing something where you have a custom __hash__ function that you're expecting to return -1 for a certain value and then are testing for value of the hash of an object rather than testing the property of the object directly, then this might bite you. But I cannot think of any reasonable case where you might want to do that.krab
I don't think so. This is true for people writing cpython. But if you implement custom `__hash__` method, you should be fine.
intalentive
>>> d = {-2: None}
>>> -1 in d
False
What gives?
hyperpape
Since hashes aren't guaranteed to be unique, a dictionary should use equality to actually confirm that the object with the given hash matches the key present in the dictionary.
nayuki
> The __hash__() method should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects ...
-- https://docs.python.org/3/reference/datamodel.html#object.__...
> The general contract of hashCode is:
> * If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
> * It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
-- https://docs.oracle.com/en/java/javase/23/docs/api/java.base...
krab
Because -1 is not equal to -2. They will fall into the same bucket of the hash map, but the hash collisions are expected and must be accounted for.
Arnavion
If hashmaps only looked at hash(k1) == hash(k2) and not also k1 == k2 they would be quite useless as maps.
null
pmarreck
A better question might be, why are people still using OO dynamic languages without any typing?
kstrauser
Not sure. Is anyone using a language like that? Python certainly isn’t one.
worik
> Python certainly isn’t one.
I am not a Python expert, but...
Python is described as OO and I thought is is dynamically typed
Not quite "without any typing" but close
kstrauser
Not really. Python's strongly typed, but dynamically. You can think of Python variables as untyped references to strongly typed values: the values are typed, but their names are not.
nayuki
> without any typing?
>>> "abc" + 321
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
I rest my case.homebrewer
TaPL should be required reading for self proclaimed "engineers". Do untyped languages even exist, or could they? maybe something purely theoretical for ivory tower academics? Even POSIX shell and assembler type their values.
nayuki
> Do untyped languages even exist, or could they? Even POSIX shell and assembler type their values.
I program in assembly language. Memory is just a big array of bytes. Values don't have types, but each instruction chooses what type to interpret memory as - e.g. uint8, int16, float32, x86 real-mode pointer, long-mode pointer, etc.
In a way this is an argument for languages where it's normal to have result-types [0] or exceptions, instead of -1 etc. It just feels wrong to have any potential collision or confusion between normal results and error results.
[0] https://en.wikipedia.org/wiki/Result_type