Undergraduate shows that searches within hash tables can be much faster

abetusk

Ok, big shout out to monort [0] for the link to the video [1].

This is just a quick overview from a single viewing of the video, but it's called "funnel hashing". The idea is to split the table into exponentially smaller subarrays, so the first chunk is n/m, the second is n/(m^2), etc., until you get down to a single element. Call them A0, A1, etc., so |A0| = n/m, |A1| = n/(m^2), etc., k levels in total.

Try inserting into A0 c times. If it fails, try inserting into A1 c times. If it fails, go down the "funnel" until you find a free slot.

Call δ the fraction of slots that are empty (I'm unclear whether this is a parameter fixed at hash table creation or one that's dynamically maintained). Setting c = log(1/δ) and k = log(1/δ) gives worst-case complexity O(log^2(1/δ)).
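
Here's a rough toy sketch in Python of that cascade as I understand it. To be clear, this is my own illustration, not the paper's construction: the level sizes, the use of Python's hash() over a (key, level, attempt) tuple as a stand-in for per-level hash functions, and the absence of resizing and deletions are all simplifying assumptions.

  import math

  class FunnelToy:
      """Toy illustration of the cascading insert described above (not the paper's exact scheme)."""

      def __init__(self, n, m=4, delta=0.125):
          self.c = max(1, math.ceil(math.log2(1 / delta)))   # attempts per level
          self.levels = []
          size = max(1, n // m)
          while size >= 1:
              self.levels.append([None] * size)              # A0, A1, ... each ~1/m the size of the last
              if size == 1:
                  break
              size //= m

      def _slot(self, key, level_idx, attempt):
          # stand-in for an independent hash function per (level, attempt)
          return hash((key, level_idx, attempt)) % len(self.levels[level_idx])

      def insert(self, key, value):
          for i, level in enumerate(self.levels):
              for attempt in range(self.c):                  # c probes here, then fall down the funnel
                  slot = self._slot(key, i, attempt)
                  if level[slot] is None or level[slot][0] == key:
                      level[slot] = (key, value)
                      return True
          return False                                       # every level exhausted

      def lookup(self, key):
          for i, level in enumerate(self.levels):            # replay the exact same probe order
              for attempt in range(self.c):
                  slot = self._slot(key, i, attempt)
                  entry = level[slot]
                  if entry is None:
                      return None                            # insert would have used this slot (no deletions)
                  if entry[0] == key:
                      return entry[1]
          return None

Real funnel hashing also needs an overflow strategy for the deepest level and careful sizing of the pieces, but hopefully this conveys the shape of the cascade.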

This circumvents Yao's result by not being greedy. Yao's result holds true for greedy insertion and search policies and the above is non-greedy, as it's cascading down the funnels.

There are probably many little hairy details to work out but that's the idea, as far as I've been able to understand it. People should let me know if I'm way off base.

This very much reminds me of the "Distinct Elements in Streams" idea by Chakraborty, Vinodchandran and Meel[2].

[0] https://news.ycombinator.com/item?id=43007860

[1] https://www.youtube.com/watch?v=ArQNyOU1hyE

[2] https://arxiv.org/pdf/2301.10191

brink

Krapivin made this breakthrough by being unaware of Yao's conjecture.

The developer of Balatro made an award winning deck builder game by not being aware of existing deck builders.

I'm beginning to think that the best way to approach a problem is by either not being aware of, or disregarding, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see such novelty; people tend to "fall into the rut of thought" of those who came before. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.

aidenn0

> I'm beginning to think that the best way to approach a problem is by either not being aware of, or disregarding, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see such novelty; people tend to "fall into the rut of thought" of those who came before. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.

I think this is true only if there is a novel solution that is in a drastically different direction than similar efforts that came before. Most of the time when you ignore previous successful efforts, you end up resowing non-fertile ground.

layman51

Right, the other side is when you end up with rediscoveries of the same ideas. The example that comes to my mind is when a medical researcher found the trapezoidal rule for integration again[1].

[1]: https://fliptomato.wordpress.com/2007/03/19/medical-research...

sitkack

I think that shows how great the trapezoidal rule is. I feel like this gets brought up so many times that it's now just used to make fun of people. It is 18 years old at this point.

gsf_emergency_2

This is the strongest argument for not shaming reinvention...

Unless the victims are world-class..? (Because it's not entirely not self-inflicted)

https://news.ycombinator.com/item?id=42981356

Shades of the strong-link weak-link dilemma too

phyzix5761

But how can you ever discover a novel solution without attempting to resow the ground?

Dansvidania

There might be a decent amount of "survivorship bias" too, meaning you only hear of the few events where someone starts from first principles and actually finds, be it by luck or smarts, a novel solution that improves on the status quo. I'd argue there are N other similar situations where you don't end up with a better solution.

That being said, I so disagree with just taking the "state of the art" as written in stone, and "we can't possibly do better than library x" etc.

smaudet

Plenty of "state of the art", at least a decade ago, that was not very state of anything.

I think bias is inherent in our literature and solutions. But also, I agree that the probability of a better solution degrades over time (assuming that the implementations themselves do not degrade - building a faster hash table does not matter if you have made all operations exponentially more expensive for stupid, non-computational, reasons)

mmusson

Or worse, pursuing something already proven not to work.

phyzix5761

That's how the best discoveries are made.

somenameforme

The problem is when the "proof" is wrong. In this case a related conjecture had held up for 40 years, which is not a proof per se, but still ostensibly an extremely high-reliability indicator of correctness.

Another example is when SpaceX was first experimenting with reusable self landing rockets. They were being actively mocked by Tory Bruno, who was the head of ULA (basically an anti-competitive shell-but-not-really-corp merger between Lockheed and Boeing), claiming essentially stuff along the lines of 'We've of course already thoroughly researched and experimented with these ideas years ago. The economics just don't work at all. Have fun learning we already did!'

Given that ULA made no efforts to compete with what SpaceX was doing it's likely that they did genuinely believe what they were saying. And that's a company with roots going all the way back to the Apollo program, with billions of dollars in revenue, and a massive number of aerospace engineers working for them. And the guy going against them was 'Silicon Valley guy with no aerospace experience who made some money selling a payment processing tool.' Yet somehow he knew better.

porkbrain

Viewed through the lens of personal development I suppose one could make an argument that there wasn't much difference between rediscovering an existing valid or invalid solution. Both lead to internalisation of a domain's constraints.

godelski

I'm not sure this is true, though I think it looks true.

I think the issue is that when a lot of people have put work into something, you think that your own chances of success are low. This is a pretty reasonable belief too. With the current publish-or-perish paradigm I think this discourages a lot of people from even attempting. You have evidence that the problem is hard and, even if solvable, probably time-consuming, so why risk your entire career? There are other interesting things that are less risky. In fact, I'd argue that this environment in and of itself results in far less risk being taken. (There are other issues too, and I laid out some in another comment.) But I think this would look identical to what we're seeing.

lo_zamoyski

Right. FWIW, Feynman predicted that physics would become rather boring in this regard, because physics education had become homogenized. This isn't to propose a relativism, but rather that top-down imposed curricula may do a good deal of damage to the creativity of science.

That being said, what we need is more rigorous thinking and more courage pursuing the truth where it leads. While advisors can be useful guides, and consensus can be a useful data point, there can also be an over-reliance on such opinions to guide and decide where to put one's research efforts, what to reevaluate, what to treat as basically certain knowledge, and so on. Frankly, moral virtue and wisdom are the most important. Otherwise, scientific praxis degenerates into popularity contest, fitting in, grants, and other incentives that vulgarize science.

MVissers

I think that's why most innovative science today happens at the intersection of two domains. That's where someone from a different field can have unique insights and try something new in an adjacent field. This is often hard to do when you're in the field yourself.

robotelvis

In my experience the best approach is to first try to solve the problem without having read the prior work, then read the prior work, then improve your approach based on the prior work.

If you read the prior work too early, you get locked into existing mindsets. If you never read it, then you miss important things you didn't think of.

Even if your approach is not as good as the prior work (the normal case), you gain important insights into why the state-of-the-art approach is better by comparing it with what you came up with.

cubefox

> If you read the prior work too early, you get locked into existing mindsets.

I agree, though in some cases coming up with your own ideas first can result in you becoming attached to them, because they are your own. It is unlikely for this to happen if you read the prior work first.

I think overall reading the prior work later is probably still a good idea, with the intention of not becoming too impressed with whatever you came up with before.

brookst

What if you’ve already read the prior work before trying to solve the problem?

kortilla

Then you’re very unlikely to come up with a novel approach. It’s very difficult to not let reading “state of the art” research put up big guardrails in your mind about what’s possible.

All of the impressive breakthroughs I saw in academia in the CS side were from people who bothered very little with reading everything related in literature. At most it would be some gut checks of abstracts or a poll of other researchers to make sure an approach wasn’t well explored but that’s about it.

The people who did mostly irrelevant incremental work were the ones who were literature experts in their field. Dedicating all of that time to reading others’ work puts blinders on both your possible approaches as well as how the problems are even defined.

helloplanets

> The developer of Balatro made an award winning deck builder game by not being aware of existing deck builders.

He was aware of deck builders and was directly inspired by Luck be a Landlord, but he was not aware of just how massive the genre is.

Direct quote from the developer:

> The one largest influence on Balatro was Luck Be a Landlord. I watched Northernlion play for a few videos and loved the concept of a non-fanatsy themed score attach roguelike a ton, so I modified the card game I was working on at the time into a roguelike.

> I cut myself off from the genre at that point intentionally, I wanted to make my own mistakes and explore the design space naively just because that process is so fun. I hear the comparison to Slay the Spire a lot but the truth is that I hadn't played that game or seen footage of it when I designed Balatro, not until much later.

https://www.reddit.com/r/Games/comments/1bdtmlg/comment/kup7...

huijzer

Walter Isaacson said something similar about Einstein and Steve Jobs. Sometimes you need to reject commonly held assumptions to make progress. Einstein rejected the idea of ether. According to Isaacson this was probably because Einstein was working outside of university. Inside university, professors would likely have pushed Einstein to stick to the idea of ether.

somenameforme

I think going one layer lower - the fundamental issue is that the internet drives people to unrealistic perceptions of the competence of others. Think about all of the undeniably brilliant people that have been involved in software over the past 40 years, and how many of them used hash tables in performance critical environments. Let alone mathematicians and others using them in applied domains. And you think there's something fundamental that all of these people just somehow missed?

The argument of 'if that's such a good idea, why wouldn't somebody have just done it already?' seems to have grown exponentially with the advent of the internet. And I think it's because the visibility of others' competence became so much clearer. For those who lived through e.g. Carmack's golden age, you knew you were never going to be half the coder he was, at least based on the image he successfully crafted. That 'slight' at the end is not to say he wasn't a brilliant developer, or even perhaps the best in the world at his peak, but rather that brilliance plus image crafting creates this gargantuan beast of infallibility and exceptionalism that just doesn't really exist in reality. I think it's from this exact phenomenon that you also get the practical fetishism of expertise.

chambers

  “They’re cheering for you,” she said with a smile. 
  “But I could never have done it,” [Milo] objected, “without everyone else’s help.”
  “That may be true,” said Reason gravely, “but you had the courage to try; 
     and what you can do is often simply a matter of what you will do.”
  “That’s why,” said King Azaz, “there was one very important thing about your quest 
     that we couldn’t discuss until you returned.”
  “I remember,” said Milo eagerly. “Tell me now.”
  “It was impossible,” said the king, looking at the Mathemagician.
  “Completely impossible,” said the Mathemagician, looking at the king.
  “Do you mean … ,” said the bug, who suddenly felt a bit faint.
  “Yes, indeed,” they repeated together, “but if we’d told you then, you might not have gone … 
    and, as you’ve discovered, so many things are possible just as long as you don’t know they’re impossible.”
- The Phantom Tollbooth (1961)

youniverse

I watched a casual YouTube video by a philosophy professor talking about the same thing: that great scholars are different from great thinkers. Many great thinkers came up with great philosophies because they misread past works.

If anyone wants to watch: https://youtu.be/4vou_dXuB8M?si=Wdr7q96MFULPAEc4

Definitely something we should all keep in mind: sometimes you just have to pave your own way and hope it is great on its own merits.

rincebrain

A professor I had in college, whose first published result came from a piece of homework he turned in where he incidentally solved an open question about a bound on a problem, had a curious habit.

I ended up failing and taking his course again (because I had A Lot going on in college), and thus, noticed something.

Each semester, on one of the assignments in the latter half of the class, he assigned one problem out of perhaps 30 in the problem set that, as written, was actually an open problem. Then, a day or two before the assignment was due, he'd send out an "oops, my bad" revised version.

I suspect that this was not an accident, given that it always happened only once.

monort

abetusk

Thanks so much for this link. I remain convinced that papers are so much more understandable with an accompanying talk by the creators. I wish papers would just come with a video talk included.

joaohaas

Thanks for the video, def a lot better than the article.

I do find it a bit weird that this is somehow better than just over-allocating (and thus reducing the chances of key collisions, which also makes worst case 'less worse') given his approach also allocates more memory through the aux arrays.

hydroreadsstuff

Overallocation has a limit. You only have so much RAM/storage; beyond that you start swapping. I could really use a hash table (or similar structure) that degrades less with higher occupancy.

orlp

Skimming the paper [1], the key difference they used is that their hash table insertion algorithm will probe further than the first empty slot, instead of greedily filling the first empty slot it finds. They combine this with a clever probing sequence which provably finds empty slots efficiently, even if the table is very full.

This means insertions when the hash table is less full are slower, but you avoid the worst-case scenario where you're probing for the last (few) remaining open slot(s) without any idea as to where they are.

[1]: https://arxiv.org/pdf/2501.02305

---

An interesting theoretical result but I would expect the current 'trick' of simply allocating a larger table than necessary to be the superior solution in practice. For example, Rust's hashbrown intentionally leaves 1/8th (12.5%) of the table empty, which does cost a bit more memory but makes insertions/lookups very fast with high probability.

elchananHaas

I might be misreading their algorithm, but from my look at the paper the key improvement is a non-uniform strategy where they divide the array into buckets and focus on different buckets as they fill the table. This increases the average number of locations to be probed even when the table is emptier. They still place the item in the first empty slot they see with this strategy.

The "skipping slots" has to do with jumping ahead in the hash sequence.

SiempreViernes

But you could do some hybrid, where you do greedy fill for a while and then switch to this fancier fill once your table is approaching full (using some heuristic)?

elchananHaas

No, it has to do with the coupon collector's problem. The key idea behind this algorithm is to do more looking for an empty spot up front.
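
For intuition, a back-of-the-envelope Python sketch of the coupon-collector flavour of the problem under uniform random probing (the numbers are illustrative, not from the paper):

  import math

  n = 1024                                     # table size
  for k in (512, 64, 8, 1):                    # number of still-empty slots
      print(k, n / k)                          # expected probes to hit one of them: 2, 16, 128, 1024

  # Filling the table to the brim this way costs roughly n * (1/n + ... + 1/1) = n * H(n) probes:
  total = sum(n / k for k in range(1, n + 1))
  print(round(total), round(n * (math.log(n) + 0.5772)))   # both print 7689

Doing more of the searching up front, while there are still plenty of empty slots, is what lets the new scheme avoid paying that whole tail at the end.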

layer8

I would assume that you need to start early to be able to reap the benefits once the table is almost full.

gizmo

Not really, because you have to use the same algorithm during subsequent lookups. Imagine you add a bunch of items with insert algorithm A, then you cross some threshold and insert some more items with algorithm B. Then you delete (tombstone) a bunch of items. Now you look up an item, the slot is filled, and you're stuck. You have to continue your lookup with both algorithms to find what you're looking for (thereby giving up any performance benefit), because the current occupancy doesn't tell you anything about the occupancy at the time the item you're looking for was inserted.
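
To make that concrete, here's a deterministic toy in Python. The specific probe functions are made up (integer keys stand in for their own hash values); the point is only that a lookup must replay the exact probe sequence the insert used:

  TABLE_SIZE = 8
  table = [None] * TABLE_SIZE

  def probe_linear(key, i):            # insertion-time probe sequence ("algorithm A")
      return (key + i) % TABLE_SIZE

  def probe_stride(key, i):            # a different probe sequence ("algorithm B")
      return (key + 3 * i) % TABLE_SIZE

  def insert(key, value, probe):
      for i in range(TABLE_SIZE):
          slot = probe(key, i)
          if table[slot] is None:
              table[slot] = (key, value)
              return True
      return False

  def lookup(key, probe):
      for i in range(TABLE_SIZE):
          slot = probe(key, i)
          entry = table[slot]
          if entry is None:
              return None              # an empty slot ends the probe chain: "not found"
          if entry[0] == key:
              return entry[1]
      return None

  for k in (0, 8, 16):                 # all three collide on slot 0, so they chain into slots 0, 1, 2
      insert(k, f"value-{k}", probe_linear)

  print(lookup(16, probe_linear))      # value-16: same probe sequence as the insert
  print(lookup(16, probe_stride))      # None: slot 0 holds key 0, slot 3 is empty, chain ends early

Key 16 is in the table, but the second lookup walks a different chain, hits an empty slot, and declares it missing, which is why you can't quietly switch insertion strategies partway through without lookups checking every strategy ever used.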

quantum2022

This is neat! I always wondered if there would be a way to 'containerize' tables like this. IE a regular table is like a bulk carrier ship, with everything stuffed into it. If you could better organize it like a container ship, you could carry much more stuff more efficiently (and offload it faster too!)

trebligdivad

Anyone got a simple implementation of 'Tiny pointers'? My mind prefers code/pseudo-code first rather than the proof.

_1tan

Neat, started on some implementation: https://kraftwerk.social/innovation-in-hash-tables/

default-kramer

> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)^2 — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?

pclmulqdq

I'm pretty sure nobody uses uniform probing hash tables in the wild. Every time I have wanted to have very high load factors (>90%), cuckoo hashing has been good enough, and below 70-80%, linear probing is blazing fast and absolutely good enough.

kragen

Linear probing also has better locality of reference.

frakt0x90

I haven't read the paper, but sometimes asymptotic improvements do not translate to real world improvements due to a large multiplicative factor in the complexity that gets factored out in the O() analysis. So the dataset required to see speed-up is impractically large.

orlp

In this case "x" is 1/d where d is the unused fraction of space.

So if you leave 0.1% of your hashtable unused your x is 1000 - quite problematic. However if you leave 12.5% of your hashtable unused your x is 8 - quite reasonable, and not something logarithmic behavior would necessarily speed up, for reasonable constants.
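
Back-of-the-envelope, ignoring all constant factors (which is exactly the catch), a quick Python comparison of the two quantities:

  import math

  for delta in (0.125, 0.01, 0.001):            # fraction of slots left empty
      x = 1 / delta                             # rough shape of the classic worst-case probe count
      polylog = math.log2(x) ** 2               # shape of the new worst-case bound, up to constants
      print(f"delta={delta}: x={x:.0f}, (log2 x)^2={polylog:.1f}")

  # delta=0.125: x=8, (log2 x)^2=9.0
  # delta=0.01: x=100, (log2 x)^2=44.1
  # delta=0.001: x=1000, (log2 x)^2=99.3

So the asymptotic win only really shows up when you insist on running the table extremely close to full.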

pclmulqdq

This is pretty much exactly the case for this algorithm. It is a very slow hash table due to the lack of locality. It seems to only have benefits at very large size and very high load factor.

At that scale, it may be practically better to use a B-tree or a radix tree over a 128-bit hash of the key.

MichaelDickens

I'm not up to date on the state of the art but I've implemented hash tables a few times and we would expand the hash table whenever it was 75% full, which means x is never greater than 4. Improving the runtime from O(x) to O((log x)^2) doesn't matter when x is so small.

I imagine there are some niche memory-constrained applications where you'd let x get larger but I never ran into them personally.

kevin_thibedeau

Robin Hood hashing works well at high load factors (~95%) because it evens out probe lengths across buckets, keeping the average lookup in check. That makes it ideal when you don't want to waste memory on bloated tables.
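
For anyone who hasn't seen it, a minimal Python sketch of the Robin Hood rule (linear probing, no deletions, and assuming a real table would resize before ever filling completely):

  class RobinHoodToy:
      def __init__(self, capacity=16):
          self.slots = [None] * capacity            # each slot holds (key, value, probe_distance)

      def insert(self, key, value):
          cap = len(self.slots)
          entry = (key, value, 0)
          for _ in range(cap):                      # each step advances exactly one slot
              idx = (hash(entry[0]) + entry[2]) % cap
              resident = self.slots[idx]
              if resident is None:
                  self.slots[idx] = entry
                  return True
              if resident[0] == entry[0]:
                  self.slots[idx] = entry           # overwrite existing key
                  return True
              if resident[2] < entry[2]:            # "steal from the rich": the resident is
                  self.slots[idx] = entry           # closer to home, so it gets bumped onward
                  entry = resident
              entry = (entry[0], entry[1], entry[2] + 1)
          return False                              # only reachable if every slot is occupied

      def lookup(self, key):
          cap = len(self.slots)
          for dist in range(cap):
              idx = (hash(key) + dist) % cap
              resident = self.slots[idx]
              if resident is None or resident[2] < dist:
                  return None                       # key would have displaced this resident
              if resident[0] == key:
                  return resident[1]
          return None

The displacement rule caps the variance of probe lengths, which is what keeps lookups reasonable at ~95% load where plain linear probing degrades badly.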

plasticchris

In memory constrained environments you just make the underlying array a sparse vector, then it can be mostly empty without using much memory at all.

danpalmer

How does this work? Naively I'd have expected that to implement a sparse vector with efficient random access, you'd use a hash table, but I assume this must depend on the sparse vector not being a hash table or otherwise there wouldn't be an improvement. Are there other ways of implementing the efficient random access, or do you sacrifice the performance of random access?

layer8

In practice the worst-case operations are avoided by reserving a little more space for the hash table. And the new results come at the cost of slower “good case” insertions.

ofirg

It improves the worst-case cost for a nearly full hash map; it raises the cost in other cases.

rajnathani

Also, correct me if I'm wrong, but isn't there a slight added memory overhead from storing these tiny pointers?

echoangle

Isn’t the problem that the scaling behavior only dominates with infinite n?

If you have a constant factor, it doesn't show up in the scaling rule, so something that scales as (log x)^2 but with a huge constant factor could still be more expensive than something that scales linearly with x for every x you'd ever encounter in practice.

thfuran

It's extremely unlikely that the constants would be nearly that large, but yes.

kragen

There are practically important algorithms like this. There are three linear-time algorithms for constructing a suffix array, but all of them have rather large constant factors. Packrat parsing is linear in the size of the input string, but can easily execute more than 100 instructions per byte. And there are numerous theoretically asymptotically superior algorithms whose constant factors are too high for them to be useful at all, the so-called "galactic algorithms", https://en.m.wikipedia.org/wiki/Galactic_algorithm.

There is nothing in the article suggesting that this is such a case, however.

oulipo

Most real-world hash table implementations are not "theoretical" but depend on "real" parameters like L2 cache, assembly instruction sizes, etc.

kazinator

Real-world hash tables can either be resized to avoid worst-case degradation or else be provisioned with a good amount of slack for when the worst expected case occurs. (E.g. a hash table used as a cache will apply pressure to evict old entries before it gets too full.)

Bucket hashing obviously has no problem finding a free spot. I deliberately didn't say "chained", because chains can be resizable vectors, so we load at most one pointer when going from table to bucket.

slt2021

It's more that real-world hash tables are tailored to the specific pattern of querying and inserting data.

If you know empirically that 80% of your requests go to a tiny subset of hot keys, you make a dedicated hashset just for those hot keys with constant access time, while keeping the rest of the table cold.
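
A hypothetical Python sketch of that split; the names, the fixed hot-key set, and the lack of any promotion policy are all made up, and in Python both tiers are just dicts, so the real benefit only appears when the hot tier is a small, cache-resident structure in a systems language:

  class TieredMap:
      def __init__(self, hot_keys):
          self.hot_keys = frozenset(hot_keys)    # tiny, empirically-known hot subset
          self.hot = {}                          # ideally small enough to stay in cache
          self.cold = {}                         # everything else

      def put(self, key, value):
          (self.hot if key in self.hot_keys else self.cold)[key] = value

      def get(self, key):
          if key in self.hot_keys:               # fast path for the hot majority of traffic
              return self.hot.get(key)
          return self.cold.get(key)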

Your L2 cache is an example of such an optimization: high-bandwidth, low-latency memory on the CPU die that is orders of magnitude faster than random DRAM access; a tiered memory model.

jeffbee

Complexity analysis and actual systems programming have been diverging for a while. I don't see anything in the paper that will inform practice.

naasking

How so?

zamadatix

Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with hardware-sized data" is more and more often the important bit, over "slightly better bounds at infinity". The latter probes the edge of mathematics while the former probes the edges of practical construction, so they rarely seem to align much anymore at these depths.

nijave

A couple things immediately come to mind

1) complexity analysis ignores coefficients which can make a huge difference, especially since computers usually have bounds

2) real life may influence the likelihood of best/worst case. I think data tends to be somewhat sorted in practice so algorithms with best case on sorted data perform better

throwme_123

Is someone aware of a GitHub repo with an implementation of this?

seinecle

Anyone competent enough here to venture a guess on the speed gain to expect under various scenarios?

ThinkBeat

Do we have some nice implementations yet? I do better reading code than math.

foota

I guess the most we could hope for here is that this leads to some other discovery down the road, either in hashtables or maybe one of the similar structures like bloom filters?