
Undergraduate shows that searches within hash tables can be much faster

brink

Krapivin made this breakthrough by being unaware of Yao's conjecture.

The developer of Balatro made an award winning deck builder game by not being aware of existing deck builders.

I'm beginning to think that the best way to approach a problem is by either not being aware of, or disregarding, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see such novelty; everyone tends to "fall into the rut of thought" of those that came before. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.

aidenn0

> I'm beginning to think that the best way to approach a problem is by either not being aware of, or disregarding, most of the similar efforts that came before. This makes me kind of sad, because the current world is so interconnected that we rarely see such novelty; everyone tends to "fall into the rut of thought" of those that came before. The internet is great, but it also homogenizes the world of thought, and that kind of sucks.

I think this is true only if there is a novel solution that is in a drastically different direction than similar efforts that came before. Most of the time when you ignore previous successful efforts, you end up resowing non-fertile ground.

layman51

Right, the other side is when you end up with rediscoveries of the same ideas. The example that comes to my mind is when a medical researcher found the trapezoidal rule for integration again[1].

[1]: https://fliptomato.wordpress.com/2007/03/19/medical-research...

Shorel

That's not really a problem.

On one hand, it shows the idea is really useful on its own.

And on the other hand, it shows that currently forgotten ideas have a chance of being rediscovered in the future.

djmips

I find a lot of the time concepts and patterns can be the same in different fields, just hidden by different nomenclature and constructed upon a different pyramid of knowledge.

It's nice we have a common language that is mathematics, the science of patterns, to unify such things but it's still going to be a challenge because not everyone is fluent in all the various branches of mathematics.

It's even mind-blowing how many ways you can approach the same problem with equivalencies between different types of mathematics itself.

sitkack

I think that shows how great the trapezoidal rule is. I feel like this story has been brought out so many times that it's now used to make fun of people. It is 18 years old at this point.

gsf_emergency_2

This is the strongest argument for not shaming reinvention...

Unless the victims are world-class...? (Because it's not entirely not self-inflicted.)

https://news.ycombinator.com/item?id=42981356

Shades of the strong-link weak-link dilemma too

Dansvidania

There might be a decent amount of "survivorship bias" too, meaning you only hear of the few events where someone starts from first principles and actually finds, be it by luck or smarts, a novel solution which improves on the status quo. But I'd argue there are N other similar situations where you don't end up with a better solution.

That being said, I do disagree with just taking the "state of the art" as written in stone, and with "we can't possibly do better than library x", etc.

smaudet

Plenty of "state of the art", at least a decade ago, that was not very state of anything.

I think bias is inherent in our literature and solutions. But also, I agree that the probability of a better solution degrades over time (assuming that the implementations themselves do not degrade - building a faster hash table does not matter if you have made all operations exponentially more expensive for stupid, non-computational, reasons)

karparov

[dead]

croo

In 1973 Clifford Cocks solved the problem of public-key encryption for the first time in history, a problem that no one at GCHQ had managed to solve in the preceding three years. He jotted down the solution within half an hour of hearing about it, then wondered why it was such a big deal for everyone else. A fresh view unclouded by prejudice can make all the difference.

mmusson

Or worse, pursuing something already proven not to work.

porkbrain

Viewed through the lens of personal development, I suppose one could argue that there isn't much difference between rediscovering an existing valid solution and rediscovering an invalid one. Both lead to internalisation of a domain's constraints.

SkyBelow

Outside of math and computational science, nothing is proven to not work, because scientific research doesn't work in proofs. Even in math and computational science, there are fields dedicated to studying ideas already proven wrong, because sometimes there are interesting findings, like hypercomputation.

phyzix5761

That's how the best discoveries are made.

somenameforme

The problem is when the "proof" is wrong. In this case a related conjecture had held up for 40 years, which is not a proof per se, but is still ostensibly an extremely reliable indicator of correctness.

Another example is when SpaceX was first experimenting with reusable self landing rockets. They were being actively mocked by Tory Bruno, who was the head of ULA (basically an anti-competitive shell-but-not-really-corp merger between Lockheed and Boeing), claiming essentially stuff along the lines of 'We've of course already thoroughly researched and experimented with these ideas years ago. The economics just don't work at all. Have fun learning we already did!'

Given that ULA made no efforts to compete with what SpaceX was doing it's likely that they did genuinely believe what they were saying. And that's a company with roots going all the way back to the Apollo program, with billions of dollars in revenue, and a massive number of aerospace engineers working for them. And the guy going against them was 'Silicon Valley guy with no aerospace experience who made some money selling a payment processing tool.' Yet somehow he knew better.

godelski

I'm not sure this is true, though I think it looks true.

I think the issue is that when a lot of people have already put work into something, you figure your own chances of success are low. This is a pretty reasonable belief, too. With the current publish-or-perish paradigm I think this discourages a lot of people from even attempting. You have evidence that the problem is hard, and even if it is solvable it will probably take a long time, so why risk your entire career? There are other interesting things that are less risky. In fact, I'd argue that this environment in and of itself results in far less risk being taken. (There are other issues too; I laid out some in another comment.) But I think this would look identical to what we're seeing.

lo_zamoyski

Right. FWIW, Feynman predicted that physics would become rather boring in this regard, because physics education had become homogenized. This isn't to propose a relativism, but rather that top-down imposed curricula may do a good deal of damage to the creativity of science.

That being said, what we need is more rigorous thinking and more courage pursuing the truth where it leads. While advisors can be useful guides, and consensus can be a useful data point, there can also be an over-reliance on such opinions to guide and decide where to put one's research efforts, what to reevaluate, what to treat as basically certain knowledge, and so on. Frankly, moral virtue and wisdom are the most important. Otherwise, scientific praxis degenerates into popularity contest, fitting in, grants, and other incentives that vulgarize science.

MVissers

I think that's why most innovative science today happens at the intersection of two domains– That's where someone from a different field can have unique insights and try something new in an adjacent field. This is often hard to do when you're in the field yourself.

phyzix5761

But how can you ever discover a novel solution without attempting to resow the ground?

bluedino

Run a mile in 4 minutes. Eat 83 hot dogs in 10 minutes.

Everything is impossible until someone comes along that's crazy enough to do it.

robotelvis

In my experience the best approach is to first try to solve the problem without having read the prior work, then read the prior work, then improve your approach based on the prior work.

If you read the prior work too early, you get locked into existing mindsets. If you never read it, then you miss important things you wouldn't have thought of.

Even if your approach is less good than the prior work (the normal case) you gain important insights into why the state of the art approach is better by comparing it with what you came up with.

dpatru

A decade ago I read this same advice in "The Curmudgeon's Guide to Practicing Law": spend at least a little time trying to solve the problem before you look at how others have solved it. One benefit is that occasionally you may stumble on a better method. But the more common benefit is that it helps develop your problem-solving skills and primes you to understand and appreciate existing solutions.

brookst

What if you’ve already read the prior work before trying to solve the problem?

kortilla

Then you’re very unlikely to come up with a novel approach. It’s very difficult to not let reading “state of the art” research put up big guardrails in your mind about what’s possible.

All of the impressive breakthroughs I saw in academia in the CS side were from people who bothered very little with reading everything related in literature. At most it would be some gut checks of abstracts or a poll of other researchers to make sure an approach wasn’t well explored but that’s about it.

The people who did mostly irrelevant incremental work were the ones who were literature experts in their field. Dedicating all of that time to reading others’ work puts blinders on both your possible approaches as well as how the problems are even defined.

HelloNurse

Worst case: you don't have a fresh perspective, but you have learned something and you can try plenty of other problems.

There's also a fair chance of finding possibilities that are "obviously" implicit in the prior work but haven't yet been pursued, or even noticed, by anyone.

ibejoeb

In all seriousness, if you're cool with it, LSD. Or anything else that can take you out of the ordinary course of thought.

cubefox

> If you read the prior work too early to you get locked into existing mindsets.

I agree, though in some cases coming up with your own ideas first can result in you becoming attached to them, because they are your own. It is unlikely for this to happen if you read the prior work first.

Overall, though, I think reading the prior work later is probably still a good idea, with the intention of not becoming too attached to whatever you came up with before.

helloplanets

> The developer of Balatro made an award winning deck builder game by not being aware of existing deck builders.

He was aware of deck builders and was directly inspired by Luck be a Landlord, but he was not aware of just how massive the genre is.

Direct quote from the developer:

> The one largest influence on Balatro was Luck Be a Landlord. I watched Northernlion play for a few videos and loved the concept of a non-fantasy-themed score-attack roguelike a ton, so I modified the card game I was working on at the time into a roguelike.

> I cut myself off from the genre at that point intentionally, I wanted to make my own mistakes and explore the design space naively just because that process is so fun. I hear the comparison to Slay the Spire a lot but the truth is that I hadn't played that game or seen footage of it when I designed Balatro, not until much later.

https://www.reddit.com/r/Games/comments/1bdtmlg/comment/kup7...

chambers

  “They’re cheering for you,” she said with a smile. 
  “But I could never have done it,” [Milo] objected, “without everyone else’s help.”
  “That may be true,” said Reason gravely, “but you had the courage to try; 
     and what you can do is often simply a matter of what you will do.”
  “That’s why,” said King Azaz, “there was one very important thing about your quest 
     that we couldn’t discuss until you returned.”
  “I remember,” said Milo eagerly. “Tell me now.”
  “It was impossible,” said the king, looking at the Mathemagician.
  “Completely impossible,” said the Mathemagician, looking at the king.
  “Do you mean … ,” said the bug, who suddenly felt a bit faint.
  “Yes, indeed,” they repeated together, “but if we’d told you then, you might not have gone … 
    and, as you’ve discovered, so many things are possible just as long as you don’t know they’re impossible.”
- The Phantom Tollbooth (1961)

rincebrain

A professor I had in college, whose first published result was from a piece of homework he turned in where he incidentally solved an open question about a bound on a problem, had a curious habit.

I ended up failing and taking his course again (because I had A Lot going on in college), and thus, noticed something.

Each semester, on one of the assignments in the latter half of the class, he assigned one problem out of, perhaps, 30 in the problem set which, as written, was actually an open problem, and then a day or two before they were due, he'd send out an "oops, my bad" revised version.

I suspect that this was not an accident, given that it always happened only once.

vitus

> A professor I had in college, whose first published result was from a piece of homework he turned in where he incidentally solved an open question about a bound on a problem, had a curious habit.

A related anecdote is about George Dantzig (perhaps best known for the simplex algorithm):

> During his study in 1939, Dantzig solved two unproven statistical theorems due to a misunderstanding. Near the beginning of a class, Professor Spława-Neyman wrote two problems on the blackboard. Dantzig arrived late and assumed that they were a homework assignment. According to Dantzig, they "seemed to be a little harder than usual", but a few days later he handed in completed solutions for both problems, still believing that they were an assignment that was overdue. Six weeks later, an excited Spława-Neyman eagerly told him that the "homework" problems he had solved were two of the most famous unsolved problems in statistics.

https://en.wikipedia.org/wiki/George_Dantzig#Education

somenameforme

I think going one layer lower - the fundamental issue is that the internet drives people to unrealistic perceptions of the competence of others. Think about all of the undeniably brilliant people that have been involved in software over the past 40 years, and how many of them used hash tables in performance critical environments. Let alone mathematicians and others using them in applied domains. And you think there's something fundamental that all of these people just somehow missed?

The argument of 'if that's such a good idea, why wouldn't somebody have done it already?' seems to have grown exponentially with the advent of the internet. And I think it's because the visibility of the competence of others became so much clearer. For those who lived through e.g. Carmack's golden age, you knew you were never going to be half the coder he was, at least based on the image he successfully crafted. That 'slight' at the end is not to say he wasn't a brilliant developer or even perhaps the best in the world at his peak, but rather that brilliance plus image crafting creates this gargantuan beast of infallibility and exceptionalism that just doesn't really exist in reality. I think it's from this exact phenomenon that you also get the practical fetishism of expertise.

SideQuark

Picking two examples out of all the people approaching problems, while ignoring the wasted effort and failures to make progress that come from not understanding current knowledge, is an absolutely terrible argument for approaching problems from ignorance.

The biggest gains in theory and in practice are far more often obtained by masters of craft, giving much more weight to attacking problems from a position of knowledge.

In fact, even in this case, this progress required that the author was aware of very recent results in computer science, was thinking deeply about them, and most likely was scouring the literature for pieces to help. The “Tiny Pointers” paper is mentioned directly.

mangodrunk

Well said. I really dislike the narrative here, that ignorance is something that leads to discovery. One, the poster gives two examples, as if there's something we should conclude from such a small sample. In addition to that, one of the examples isn't valid: the student's former professor is a co-author of the "Tiny Pointers" [1] paper that he was reading and working through. And it was a conjecture; I don't see why someone should think that means it's impossible.

Rather than thinking ignorance is a key ingredient of discovery, I'd say it's the willingness to try things.

[1] https://arxiv.org/abs/2111.12800

dataviz1000

A similar idea came up in Veritasium's latest video today. DeepMind's training of AI to predict protein folding improved greatly by withholding the most evident information about a protein's primary structure (its linear polypeptide chain) within the Structure Module step. [0]

After asking ChatGPT not to agree with me that your comment and these two different approaches to solving problems are alike, it concluded there still might be similarities between the two.

[0] https://youtu.be/P_fHJIYENdI?feature=shared&t=1030

[1] https://chatgpt.com/share/67aa8340-e540-8004-8438-3200e0d4e8...

abetusk

Ok, big shout out to monort [0] for the link to the video [1].

This is just a quick overview from a single viewing of the video, but it's called "funnel hashing". The idea is to split the table into exponentially smaller sub-arrays, so the first chunk is n/m, the second is n/(m^2), etc., until you get down to a single element. Call them A0, A1, etc., so |A0| = n/m, |A1| = n/(m^2), etc., k levels in total.

Try inserting into A0 c times. If that fails, try inserting into A1 c times. If that fails, keep going down the "funnel" until you find a free slot.

Call δ the fraction of slots that are empty (I'm unclear whether this is a parameter set at hash-table creation or one that's dynamically updated). Setting c = log(1/δ) and k = log(1/δ) gives worst-case complexity O(log^2(1/δ)).

This circumvents Yao's result by not being greedy. Yao's result holds true for greedy insertion and search policies and the above is non-greedy, as it's cascading down the funnels.

There are probably many little hairy details to work out but that's the idea, as far as I've been able to understand it. People should let me know if I'm way off base.
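
For anyone who wants it more concrete, here is a toy Python sketch of that funnel-style insert/search as I've described it. This is just my reading, not the paper's algorithm; the class name, shrink factor, and per-level probe count are made up for illustration.

  class FunnelSketch:
      """Toy sketch: geometrically shrinking levels A0, A1, ..., each
      probed a bounded number of times before falling to the next."""

      def __init__(self, n, shrink=2, probes_per_level=4):
          self.levels, size = [], n
          while size > 1:
              size = max(1, size // shrink)        # |A_i| shrinks geometrically
              self.levels.append([None] * size)
          self.probes_per_level = probes_per_level

      def _slot(self, key, level, attempt, size):
          # Stand-in for a proper hash family: one pseudo-independent slot per attempt.
          return hash((key, level, attempt)) % size

      def insert(self, key, value):
          for i, level in enumerate(self.levels):
              for attempt in range(self.probes_per_level):
                  j = self._slot(key, i, attempt, len(level))
                  if level[j] is None or level[j][0] == key:
                      level[j] = (key, value)
                      return True
          return False      # every level exhausted (the paper handles this tail specially)

      def get(self, key):
          # Search retraces the insertion's probe sequence across the levels.
          for i, level in enumerate(self.levels):
              for attempt in range(self.probes_per_level):
                  entry = level[self._slot(key, i, attempt, len(level))]
                  if entry is not None and entry[0] == key:
                      return entry[1]
          return None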

This very much reminds me of the "Distinct Elements in Streams" idea by Chakraborty, Vinodchandran and Meel[2].

[0] https://news.ycombinator.com/item?id=43007860

[1] https://www.youtube.com/watch?v=ArQNyOU1hyE

[2] https://arxiv.org/pdf/2301.10191

conaclos

Actually they propose two algorithms: Funnel Hashing and Elastic Hashing. Funnel Hashing is "greedy" and defeats Yao's conjecture, which concerns greedy hash mechanisms. Elastic Hashing is "non-greedy" and provides better amortized time than greedy algorithms.

golly_ned

That it circumvents Yao’s conjecture by being non-greedy contradicts the article. Is the article wrong or is your understanding of the paper? I don’t know, just want to see if you’re noticing something the article’s authors don’t know.

abetusk

Can you elaborate?

From the article:

> Farach-Colton, Krapivin and Kuszmaul wanted to see if that same limit also applied to non-greedy hash tables. They showed that it did not by providing a counterexample, a non-greedy hash table with an average query time that’s much, much better than log x.

cma

Sibling comment above says funnel hashing is greedy, elastic hashing was the non-greedy method that did even better.

bajsejohannes

One thing I don't understand from watching the video, is what happens in the (very rare) case that you get collisions all the way down the funnel. I assume this is related to the "One special final level to catch a few keys" (around 14:41 in the video), but given that it has to be fixed size, this can also get full. What do you do in that case?

kaathewise

The dereference table allows allocations to fail:

https://arxiv.org/pdf/2501.02305#:~:text=If%20both%20buckets...

(the text fragment doesn't seem to work in a PDF, it's the 12th page, first paragraph)

bajsejohannes

Thanks! So I guess the best recourse then is to resize the table? Seems like it should be part of the analysis, even if it's low probability of it happening. I haven't read the paper, though, so no strong opinion here...

(By the way, the text fragment does work somewhat in Firefox. Not on the first load, but if you load it, then focus the URL field and press enter.)

akatsarakis

Quite a neat idea that could be useful for memory-constrained environments.

[Shameless plug]:

If you are into hashtables, you might want to check out Dandelion Hashtable [0]. We use it in our next-generation databases, and it was published in HPDC'24. It is currently the fastest in-memory hashtable in practice.

It improves closed addressing via bounded-cacheline chaining to surpass 1B in-memory requests/second on a commodity server.

[0] https://dandelion-datastore.com/#dlht

edflsafoiewq

Funnel hashing is greedy.

abetusk

This seems overly pedantic. Here I think "greedy" means uniform probing.

The authors very clearly state "non greedy":

https://www.youtube.com/watch?v=ArQNyOU1hyE&t=1087s

edflsafoiewq

What the author says there is "What we just showed was that we can achieve a worst-case expected probe complexity of log squared one over delta with a greedy algorithm. And we don't have too much time to go over the non-greedy algorithm but...".

The funnel hashing described in the video is greedy. The video doesn't cover the non-greedy elastic hashing.

"Greedy" means that the search and insertion do the same probe sequence, and insertion just uses the first free slot in that sequence.

monort

kristopolous

This strikes me as something that many people probably figured out a non-rigorous version of and didn't think it was special.

It's kind of one of those resource management hacks you do when you're constrained and screwed by limitations. Splitting things up by priority is a common go-to for resource allocation. This is a spin on that.

I wonder how many other "in the trenches hacks" people have done that overturn widely accepted things the inventors didn't realize were a big deal: "well I usually have a bunch of deliveries to make and I've figured out a clever way to map out the quickest path... "

Don't get me wrong - recognizing it and then formalizing it, doing the work, publishing the paper - that's a lot of effort. I'm not taking that away.

vanderZwan

Also relevant: in this particular case the authors themselves note that the results give better theoretical behavior in the worst case, but have no practical uses yet. So I think any software engineer exploring this direction would have abandoned it pretty quickly, for the same reason that galactic algorithms aren't typically invented by them either (unless they also do compsci as a hobby, of course). In fact, the Wiki page for galactic algorithms mentions another optimal-but-impractical hash table as one of its examples[0][1].

[0] https://en.wikipedia.org/wiki/Galactic_algorithm

[1] https://www.quantamagazine.org/scientists-find-optimal-balan...

br1

Leapfrog Triejoin is an example of the trenches contributing to academia and academia valuing it: https://x.com/RelationalAI/status/1836115579133939752

vanderZwan

> I wonder how many other "in the trenches hacks" people have done that overturn widely accepted things the inventors didn't realize were a big deal: "well I usually have a bunch of deliveries to make and I've figured out a clever way to map out the quickest path... "

A lot of them. Having said that: yes, I can imagine that others would have thought up Dijkstra's shortest path algorithm, since he himself said it came to him while shopping, and that it only took him twenty minutes to reason through the original O(n²) algorithm. (edit: oh wait, that's what you're alluding to isn't it? Heh, that went straight over my head).

On the other hand, I don't think the faster versions of Dijkstra's algorithm would have been invented by anyone without at least some understanding of priority queues and big-O behavior. And at that point I hope people realize that they possess some specialized knowledge that might not be entirely common.

In fact, I'd argue that the true strength of Dijkstra's write-up is that it gives us a vocabulary to reason about it and come up with specialized data structures for particular situations.

Anyway, what you're touching on is the difference between engineering and science: engineering works with confidence built from tests, rules of thumb that reflect lessons learned from historical results, and (in modern times) verified predictions from science. Those rules of thumb might be used when lacking a deeper scientific understanding of why it works. The tests might exist to work around the limitations of scientific knowledge (e.g. modelling turbulence). Science creates insights and predictions through modelling of empirical results. At least that's the difference according to Bill Hammack[0].

In an ideal world the two professions work together and build on each other's results to propel each other forward of course.

[0] https://www.youtube.com/playlist?list=PL0INsTTU1k2X4kCPqmi1e...

chrisweekly

> "some specialized knowledge that might now be entirely common"

now -> not, right?

great comment

I'm not being pedantic about a typo, but it reverses the point I think you're making about UNcommon knowledge...

kristopolous

I was referring to the general TSP being solved.

abetusk

Thanks so much for this link. I remain convinced that papers are so much more understandable with an accompanying talk by the creators. I wish papers would just come with a video talk included.

elcritch

Exactly, the authors get to eschew the formalism required in papers. Often the core ideas of research are simple in themselves and the real complexity lies in formally proving the results.

Also, I'd not be surprised if someone already invented and used this funnel hashing technique in, say, the 80's in some game or whatnot, but just never realized what they had stumbled onto. Not to diminish the research; it's very ingenious.

forrestthewoods

Academic papers are terrible at knowledge transfer. A more casually spoken blog post is 100% more effective at communicating ideas imho.

Academia is a weird and broken place.

Disclaimer: work in a research lab full of awesome PhDs who largely agree with me!

abetusk

I think papers make good references. I think of it more like the equivalent of a "datasheet" for an electronic part, say. Once you understand the intricacies, it's a valuable reference, but more often than not it's not very good at conveying motivation or intuition.

amelius

> Academic papers are terrible at knowledge transfer.

Well, at least they are better than patents.

joaohaas

Thanks for the video, def a lot better than the article.

I do find it a bit weird that this is somehow better than just over-allocating (and thus reducing the chances of key collisions, which also makes worst case 'less worse') given his approach also allocates more memory through the aux arrays.

yencabulator

I don't think anybody is really saying it is. Academics treat big-Oh performance on very very full hash tables like a sport. Real world code on real world CPUs often has a more complex cost function than what the academics considered; cache sizes, fitting in a cacheline, memory bandwidth, icache pressure, ...

sigbottle

He's not allocating through aux arrays, he's splitting the already allocated memory into log(n) layers. You can just track those aux arrays with math in the implementation.

It's probably not better than over-allocating except in memory-constrained scenarios. But the overhead of funnel hashing is not high - it requires no extra memory.
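
Something like the following, I imagine (a sketch with an arbitrary split factor; the function name is made up): the levels are just (start, end) ranges carved out of the single allocation.

  def level_bounds(capacity, shrink=2):
      """Split one flat array of `capacity` slots into geometrically
      shrinking (start, end) index ranges -- no extra allocation needed."""
      bounds, start, size = [], 0, capacity // shrink
      while start < capacity:
          end = min(start + size, capacity)
          bounds.append((start, end))
          start, size = end, max(1, size // shrink)
      return bounds

  # level_bounds(64) == [(0, 32), (32, 48), (48, 56), (56, 60), (60, 62), (62, 63), (63, 64)]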

hydroreadsstuff

Overallocation has a limit. You only have so much RAM/storage. Beyond that you start swapping. I could really use a hash table (or similar structure) that degrades less with higher occupancy.

rocqua

Could it be that overallocation means you need a bigger search to find empty places or answer queries?

orlp

Skimming the paper [1], the key difference they used is that their hash table insertion algorithm will probe further than the first empty slot, instead of greedily filling the first empty slot it finds. They combine this with a clever probing sequence which provably finds empty slots efficiently, even if the table is very full.

This means insertions when the hash table is less full are slower, but you avoid the worst-case scenario where you're probing for the last (few) remaining open slot(s) without any idea as to where they are.

[1]: https://arxiv.org/pdf/2501.02305

---

An interesting theoretical result but I would expect the current 'trick' of simply allocating a larger table than necessary to be the superior solution in practice. For example, Rust's hashbrown intentionally leaves 1/8th (12.5%) of the table empty, which does cost a bit more memory but makes insertions/lookups very fast with high probability.

elchananHaas

I might be misreading their algorithm, but from my look at the paper the key improvement is a non-uniform strategy where they divide the array into buckets and focus on different buckets as they fill the table. This increases the average number of locations to be probed even when the table is emptier. They still place the item in the first empty slot they see with this strategy.

The "skipping slots" has to do with jumping ahead in the hash sequence.

SiempreViernes

But you could do some hybrid, where you do greedy fill for a while and then switch to this fancier fill once your table is approaching full (using some heuristic)?

elchananHaas

No, it has to do with the coupon collectors problem. The key idea behind this algorithm is to do more looking for an empty spot up front.

layer8

I would assume that you need to start early to be able to reap the benefits once the table is almost full.

gizmo

Not really, because you have to use the same algorithm during subsequent lookups. Imagine you add a bunch of items with insert algorithm A, then you cross some threshold and you insert some more items with algorithm B. Then you delete (tombstone) a bunch of items. Now you look up an item, the slot is filled, and you're stuck. You have to continue your lookup with both algorithms to find what you're looking for (thereby giving up any performance benefits), because the current occupancy doesn't tell you anything about the occupancy at the time when the item you were looking for was inserted.

trebligdivad

Anyone got a simple implementation of 'Tiny pointers'? My mind prefers code/pseudo-code first rather than the proof.

quantum2022

This is neat! I always wondered if there would be a way to 'containerize' tables like this. IE a regular table is like a bulk carrier ship, with everything stuffed into it. If you could better organize it like a container ship, you could carry much more stuff more efficiently (and offload it faster too!)

doublerabbit

Easy stuff.

Convert the table row to a string, JSON or whatever.

Apply base16 to that variable.

You've now got a base16 string of that data.

Create a hash table and set up a key/value entry for that base16 string.

You now have a container holding the data.

All you need to do is decode the hex string and you've got the original data back.

internetter

What you just described is orders of magnitude slower than even conventional hash tables.

doublerabbit

I'd like to see your solution.

null

[deleted]

joe_the_user

The theoretical properties of hash table always seemed so impressive to me that they bordered on magic (and this just extends them). What seemed crazy was how they could be so much better than trees, which to me were intuitively the most efficient way to store data.

What I realized is that the theory of hash tables involves a fixed-size collection of objects. For this fixed collection, you create a hash function, use it like a vector index, and store the collection in a (pre-allocated) vector. This gives a (fuzzy-lens'd) recipe for O(1)-time insert, deletion and look-up. (The various tree structures, in contrast, don't assume a particular size.)

The two problems are that you have to decide the size beforehand, and that if your vector gets close to full, your insert etc. processes might bog down. So scanning the article, it seems this is a solution to the bogging-down part - it allows quick insertion into a nearly-full table. It seems interesting and clever but actually not a great practical advance. In practice, rather than worrying about a clever way to fill the table, I'd assume you just increase your assumed size.

Edit: I'm posting partly to test my understanding, so feel free to correct me if I'm not getting something.

hcs

Proofs of constant-time operations include the time taken to resize the table. A resize takes much more time (linear in the size of the table), but that cost is amortized over all the insertions already done. It still works out to constant average time if you grow the table by enough each time (once it starts to get too full), so that resizes happen with decreasing frequency.
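
A sketch of the usual doubling scheme (the load-factor threshold and growth factor are illustrative; this assumes distinct keys and no deletions):

  class ResizingTable:
      """Open addressing with linear probing that doubles its capacity past a
      load-factor threshold, so the rare O(n) rebuild amortizes to O(1) per insert."""

      MAX_LOAD = 0.75

      def __init__(self, capacity=8):
          self.slots, self.count = [None] * capacity, 0

      @staticmethod
      def _place(slots, key, value):
          j = hash(key) % len(slots)
          while slots[j] is not None:
              j = (j + 1) % len(slots)            # linear probing
          slots[j] = (key, value)

      def insert(self, key, value):
          if (self.count + 1) > self.MAX_LOAD * len(self.slots):
              bigger = [None] * (2 * len(self.slots))
              for entry in self.slots:            # O(n) rehash, but increasingly rare
                  if entry is not None:
                      self._place(bigger, entry[0], entry[1])
              self.slots = bigger
          self._place(self.slots, key, value)
          self.count += 1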

nijave

>What seemed crazy was how they could be so much better than trees, which to me were intuitively the most efficient way to store data.

Trees (sorted) are good at finding subsets and ranges ("scanning" or "searching"), but hashmaps are better at "seeking", like a key-value lookup.

zelphirkalt

I think this is only true in the imperative world, where mutation is used. For the functional world it is probably still trees.

vessenes

It looks to me like the idea is, as you generally describe, that you segment your table into a 2d structure (well conceptually) and proceed to fill one ‘row’ at a time until it’s about 75% full, at which point you move on to the next one.

I don’t have time to fully grok the paper, but they claim this makes insertion consistently fast (I believe this until we’re at 75% of total capacity, but maybe they have some other mode for filling when they’re at 75% in every row?). They also claim retrieval is fast, and I didn’t read enough to understand how even retrieval works, or why it is faster.

I'll point out that there are a lot of times when it would be really nice to have a nearly full hash table still, you know, work. You can't always change the size of one during execution of a program. And, in some environments memory counts a lot. That said, I would like to see and play with an implementation — I'm not sure this is 'worth it' in the general case.

It is also probably cache inefficient, as are most things about hash tables, with the exception of linear probing for reading out of a fairly full one, in which case, you get to just keep pulling stuff directly out of memory to check it. So, it’s not clear to me that this is performance wise worth it. Anyway, I’d like to fully understand it, it seems like an interesting new idea.

default-kramer

> And for this new hash table, the time required for worst-case queries and insertions is proportional to (log x)² — far faster than x.

> The team’s results may not lead to any immediate applications

I don't understand why it wouldn't lead to immediate applications. Is this a situation where analysis of real-world use cases allows you to tune your hash implementation better than what a purely mathematical approach would get you?

frakt0x90

I haven't read the paper, but sometimes asymptotic improvements do not translate to real world improvements due to a large multiplicative factor in the complexity that gets factored out in the O() analysis. So the dataset required to see speed-up is impractically large.

orlp

In this case "x" is 1/d where d is the unused fraction of space.

So if you leave 0.1% of your hashtable unused your x is 1000 - quite problematic. However if you leave 12.5% of your hashtable unused your x is 8 - quite reasonable, and not something logarithmic behavior would necessarily speed up, for reasonable constants.
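
To put rough numbers on that (base-2 logs, constants ignored):

  from math import log2

  for free_fraction in (0.125, 0.001):
      x = 1 / free_fraction
      print(f"{free_fraction:.1%} free -> x = {x:.0f}, (log2 x)^2 = {log2(x) ** 2:.1f}")

  # 12.5% free -> x = 8, (log2 x)^2 = 9.0
  # 0.1% free -> x = 1000, (log2 x)^2 = 99.3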

pclmulqdq

This is pretty much exactly the case for this algorithm. It is a very slow hash table due to the lack of locality. It seems to only have benefits at very large size and very high load factor.

At that scale, it may be practically better to use a B-tree or a radix tree over a 128-bit hash of the key.

MichaelDickens

I'm not up to date on the state of the art but I've implemented hash tables a few times and we would expand the hash table whenever it was 75% full, which means x is never greater than 4. Improving the runtime from O(x) to O((log x)^2) doesn't matter when x is so small.

I imagine there are some niche memory-constrained applications where you'd let x get larger but I never ran into them personally.

null

[deleted]

kevin_thibedeau

Robin Hood hash works well with high load factors ~95% because it balances the average lookup with a fair distribution of buckets. That makes it ideal when you don't want to waste memory on bloated tables.

JacksonAllan

I think the 0.95 figure for Robin Hood hash tables is rather optimistic. Robin Hood helps in a few ways: it allows for early termination of failed lookups, and it reduces probe-length variance (thereby making the performance of any one hash table operation more predictable). However, it does not reduce the average number of probes needed to perform a successful lookup. The hash-table benchmarks I published last year (https://jacksonallan.github.io/c_cpp_hash_tables_benchmark/#...), which test various hash tables in the 0.44-0.875 load-factor range, show that the performance of Robin Hood tables ("ankerl" and "tsl" in the article) deteriorates rapidly as the load factor climbs high, just like conventional linear- and quadratic-probing tables. This is in contrast to the SIMD and hybrid open-addressing/separate-chaining tables, whose performance is much more stable across the whole load-factor range.
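
For reference, here is the Robin Hood insertion/lookup rule being discussed, as a toy Python sketch (a real table stores probe distances or hash fragments more compactly and handles deletions; this assumes distinct keys):

  class RobinHoodTable:
      """Toy Robin Hood hashing: an incoming entry steals the slot of any
      resident that sits closer to its own home bucket ("richer"), which
      caps probe-length variance and lets failed lookups stop early."""

      def __init__(self, capacity):
          self.slots = [None] * capacity   # entries are (key, value, distance_from_home)
          self.count = 0

      def insert(self, key, value):
          if self.count == len(self.slots):
              return False                 # full; a real table would grow here
          cap = len(self.slots)
          i, entry = hash(key) % cap, (key, value, 0)
          while self.slots[i] is not None:
              if self.slots[i][2] < entry[2]:               # resident is "richer": swap
                  self.slots[i], entry = entry, self.slots[i]
              i = (i + 1) % cap
              entry = (entry[0], entry[1], entry[2] + 1)    # one step further from home
          self.slots[i] = entry
          self.count += 1
          return True

      def get(self, key):
          cap = len(self.slots)
          i, dist = hash(key) % cap, 0
          while self.slots[i] is not None and self.slots[i][2] >= dist:
              if self.slots[i][0] == key:
                  return self.slots[i][1]
              i, dist = (i + 1) % cap, dist + 1
          return None                      # hit a hole or a richer resident: key absent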

plasticchris

In memory constrained environments you just make the underlying array a sparse vector, then it can be mostly empty without using much memory at all.

danpalmer

How does this work? Naively I'd have expected that to implement a sparse vector with efficient random access, you'd use a hash table, but I assume this must depend on the sparse vector not being a hash table or otherwise there wouldn't be an improvement. Are there other ways of implementing the efficient random access, or do you sacrifice the performance of random access?

swiftcoder

For this sort of high-load-factor application, it's not so much "memory constrained" in the sense of embedded hardware, as it is "memory constrained" in the sense of "my petabyte database table needs 1 TB just to store the index"...

pclmulqdq

I'm pretty sure nobody uses uniform probing hash tables in the wild. Every time I have wanted to have very high load factors (>90%), cuckoo hashing has been good enough, and below 70-80%, linear probing is blazing fast and absolutely good enough.
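
For anyone unfamiliar with it, the cuckoo idea as a toy Python sketch (two candidate slots per key; the hash functions here are ad hoc stand-ins, a real table rehashes or grows when a kick chain runs too long, and distinct keys are assumed):

  class CuckooTable:
      """Toy cuckoo hashing: every key has two candidate slots; inserting
      into an occupied slot evicts the resident to its other slot, up to a
      bounded number of kicks."""

      def __init__(self, capacity, max_kicks=32):
          self.slots = [None] * capacity
          self.max_kicks = max_kicks

      def _positions(self, key):
          cap = len(self.slots)
          return hash(("a", key)) % cap, hash(("b", key)) % cap

      def get(self, key):
          for pos in self._positions(key):
              entry = self.slots[pos]
              if entry is not None and entry[0] == key:
                  return entry[1]
          return None

      def insert(self, key, value):
          entry = (key, value)
          pos = self._positions(key)[0]
          for _ in range(self.max_kicks):
              if self.slots[pos] is None:
                  self.slots[pos] = entry
                  return True
              self.slots[pos], entry = entry, self.slots[pos]   # kick out the resident
              p1, p2 = self._positions(entry[0])
              pos = p2 if pos == p1 else p1                     # send it to its other slot
          return False    # kick chain too long; a real implementation rehashes here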

kragen

Linear probing also has better locality of reference.

mordae

Last time I needed high occupancy it was for a cache. So I stirred my 32-bit keys with a Phi-multiply-and-right-shift, and randomly (xorshift) displaced the picked slot's old value by log2(size)/2, with linear probing of up to log2(size).
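
The "Phi-mul-rshift" step is multiplicative (Fibonacci) hashing; roughly this, for a power-of-two-sized table (a sketch of just the key-stirring step, not the displacement scheme described above):

  PHI32 = 2654435769    # floor(2^32 / golden ratio), a.k.a. 0x9E3779B9

  def fib_hash_32(key, log2_size):
      """Map a 32-bit key to a slot in a table of 2**log2_size slots:
      multiply by ~2^32/phi and keep the top bits."""
      return ((key * PHI32) & 0xFFFFFFFF) >> (32 - log2_size)

  # fib_hash_32(12345, 10) -> some slot in [0, 1024)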

layer8

In practice the worst-case operations are avoided by reserving a little more space for the hash table. And the new results come at the cost of slower “good case” insertions.

jeffbee

Complexity analysis and actual systems programming have been diverging for a while. I don't see anything in the paper that will inform practice.

naasking

How so?

zamadatix

Even when something isn't a galactic algorithm, "how well it maps to being done efficiently on real hardware with hardware-sized data" is more and more often the important bit over "slightly better bounds at infinity". The latter is probing the edge of mathematics while the former is probing the edges of practical construction, so they rarely seem to align much anymore at these depths.

nijave

A couple things immediately come to mind

1) complexity analysis ignores coefficients which can make a huge difference, especially since computers usually have bounds

2) real life may influence the likelihood of best/worst case. I think data tends to be somewhat sorted in practice so algorithms with best case on sorted data perform better

oulipo

Most real-world hash table implementations are not "theoretical" but depend on "real" parameters like L2 cache, assembly instruction sizes, etc.

kazinator

Real-world hash tables can either be resized to avoid worst-case degradation or else be provisioned with a good amount of slack for when the worst expected case occurs. (E.g. a hash table used as a cache will apply pressure to evict old entries before it gets too full.)

Bucket hashing obviously has no problem finding a free spot. I didn't say "chained" on purpose, because chains can be resizable vectors, so we load at most one pointer when going from table to bucket.

xxs

A bucket can evolve into a red/black tree if there are too many entries, giving O(log K) lookups under collisions. Still, bucket-based tables tend to be outclassed (in both latency and memory footprint) by the dumbest linear probe.

null

[deleted]

slt2021

It's more like real-world hash tables are tailored to the specific pattern of querying and inserting data.

if you know empirically that 80% of your requests come to a tiny subset of hot keys - you make a specific hashset just for these hot keys with constant access time, while keeping the rest of the table cold.

your L2 cache is an example of such optimization - a high bandwidth and low latency memory on the CPU die, that is orders of magnitude faster than random DRAM access - a tiered memory model

echoangle

Isn’t the problem that the scaling behavior only dominates with infinite n?

If you have a constant factor, that doesn't go into the scaling rule, so having something scale (log x)² could still be 100 times more expensive than something that scales linearly with x for all x smaller than 2^100.

thfuran

It's extremely unlikely that the constants would be nearly that large, but yes.

kragen

There are practically important algorithms like this. There are three linear-time algorithms for constructing a suffix array, but all of them have rather large constant factors. Packrat parsing is linear in the size of the input string, but can easily execute more than 100 instructions per byte. And there are numerous theoretically asymptotically superior algorithms whose constant factors are too high for them to be useful at all, the so-called "galactic algorithms", https://en.m.wikipedia.org/wiki/Galactic_algorithm.

There is nothing in the article suggesting that this is such a case, however.

ofirg

It improves the worst-case cost for a nearly full hash map; it raises the cost in other cases.

rajnathani

Also, correct me if I'm wrong, but isn't there a slight added memory cost from these tiny pointers?

federiconafria

From what I understood, they are just "reserved" areas. E.g. if you have 200 slots, the first 100 are the first "area", then the next 50, then 25, etc.

dooglius

It looks like the result only matters in the case where the hash table is close to full. But couldn't one just deal with this case by making the table size 10% bigger? (Or, if it is resizeable, resizing earlier)

nhumrich

Yes, which is what most real-world hash tables do. They resize themselves once hash collisions become too probable.

xxs

In reality 75% is the standard fill factor for linear probing, which also exhibits the best locality (if the table gets too full it just allocates double (or x times) the memory and copies the existing entries over). Most non-linear-probe tables (e.g. cuckoo) suffer due to the fact that RAM is not 'random' at all.

throwme_123

Is someone aware of a GitHub repo with an implementation of this?

matsemann

The intro picture about pointers in a drawer immediately reminded me of a talk I saw at FUN with Algorithms 2018, called Mind the Gap, that gave me an aha moment about leaving space in data structures. Cool, then, to try to locate it and see that it was by the same professor as in the article, Martín Farach-Colton.

Not sure if it's viewable somewhere. But the conference itself was so fun. https://sites.google.com/view/fun2018/home

I'm not an academic and got my company to sponsor a trip to this Italian island to relax on the beach and watch fun talks, heh.

ThinkBeat

Do we have some nice implementations yet? I do better reading code than math.