New Book-Sorting Algorithm Almost Reaches Perfection
37 comments
· January 24, 2025 · ColinDabritz
pradn
That's a good insight. I had always thought the key to good algorithm / data structure design was to use all the information present in the data set. For example, if you know a list is sorted, you can use binary search. But perhaps choosing how much of it to omit is key, too. It comes up less often, however. I can't think of a simple example.
koolba
> But perhaps choosing how much of it to omit is key, too. It comes up less often, however. I can't think of a simple example.
An example of that is a linked list with no particular sort order. By not caring about maintaining an order, insertion just appends or prepends a node and is O(1).
As soon as you have to maintain any additional invariant, the insertion cost goes up, either directly or amortized.
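To make the contrast concrete, here is a minimal sketch (in Python, with illustrative names; nothing here is from the article): prepending to an unordered list touches only the head, while maintaining a sorted-order invariant forces a scan.

```python
# Minimal sketch: no invariant -> O(1) prepend; sorted-order invariant -> O(n) insert.

class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def prepend(head, value):
    """O(1): nothing to maintain, just attach a new head."""
    return Node(value, head)

def insert_sorted(head, value):
    """O(n): keeping the list sorted forces a walk to the right spot."""
    if head is None or value <= head.value:
        return Node(value, head)
    cur = head
    while cur.next is not None and cur.next.value < value:
        cur = cur.next
    cur.next = Node(value, cur.next)
    return head
```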
nickburns
So it's basically a matter of figuring out what problem context can and should be selectively hidden from the algorithm in order to make it work 'smarter' and not 'harder.' Weird.
zeroonetwothree
The actual better algorithm does use history dependence though. So I found this part of the article misleading.
acbart
I was actually just looking at this problem last week, when I realized I needed to keep items in a database table positioned arbitrarily, ideally without needing to manipulate the rest of the list. So if a user adds a new element after element 5, that element becomes 6, without needing to update the original item that came after element 5. There are indeed very sophisticated algorithms to manage this problem and minimize the theoretical bounds. But it also seemed like the simplest solution for this particular version of the problem is to just use fractional positions and occasionally pay a penalty to re-lay out the list.
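For illustration, a rough sketch of that "fractional positions plus an occasional re-layout" idea in Python; the function names and the precision threshold are made up, and a real version would likely live in SQL against a position column rather than in application code.

```python
# Sketch: store a float sort position per row, insert by averaging neighbors,
# and occasionally pay the penalty of renumbering everything.
from typing import Optional

def position_between(prev_pos: Optional[float], next_pos: Optional[float]) -> float:
    """Pick a position between two neighbors; None means start/end of the list."""
    if prev_pos is None and next_pos is None:
        return 1.0
    if prev_pos is None:
        return next_pos - 1.0
    if next_pos is None:
        return prev_pos + 1.0
    return (prev_pos + next_pos) / 2.0

def needs_relayout(prev_pos: float, next_pos: float, eps: float = 1e-9) -> bool:
    """Floats eventually run out of precision between two crowded neighbors."""
    return (next_pos - prev_pos) < eps

def relayout(positions: list) -> list:
    """The occasional penalty: relabel everything with evenly spaced values."""
    return [float(i + 1) for i in range(len(positions))]
```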
zeroonetwothree
Wikipedia has a section on this algorithm under exponential labels: https://en.m.wikipedia.org/wiki/List-labeling_problem
Basically it works as long as your label space is large compared to the number of items. The more sophisticated methods are necessary when that isn't the case: for example, say you have 4 bytes for the label and 1 billion items, so 2^32 ≈ 4.3 billion labels leave only about four labels of slack per item.
crdrost
This sort of problem also occurs when you're trying to do CRDTs, which can also be roughly described as "design something that does what Git does, but better."
So, to frame this, one approach to a CRDT is to just treat the document as a list of facts, "line 1 is 'foo', line 2 is 'bar'", where each fact has a number of times it has been asserted. To "merge" you just add together the assertion counts, and you can detect conflicts when a fact has been asserted more than once or fewer than zero times. So a patch that says "change line 2 to 'baz'" becomes "unassert that line 2 is 'bar', assert that line 2 is 'baz'", and it conflicts with a patch that says "change line 2 to 'quux'" because, after merging both, the fact "line 2 is 'bar'" has an assertion count of -1.
But anyway, in this context you might want to allow inserting lines, and then you have the list-labeling problem: you don't want the patch to unassert lines 4, 5, 6 just to insert a new line after line 3. So an obvious thing is to use a broader conception of numbers, say "line 3.5 is <X>" when you insert; we hide the line numbers from the user anyway, so they don't need to know that internally the line numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".
So then you need a relabeling step because you eventually have some line at 3.198246315 and you want to be able to say "yeah, that's actually line 27, let's have some sanity again in this thing."
This also maybe hints at the fun of adding randomization. Consider that one person might add line 3.5, then add line 3.75, and then remove line 3.5; meanwhile someone else might add a different line 3.5, add a line 3.25, and then remove their line 3.5. These patches would both amount to "assert line 3.25 is A, assert line 3.75 is B" and would merge without conflict. This means that, in general, if two people are messing with the same part of the same document asynchronously, this model is not able to consistently flag a merge failure; it will sometimes instead just randomly order the lines that were added.
We can then just make that a feature rather than a fault: you don't insert at 3.5, which is 3 + (4 - 3) / 2; rather you insert at 3 + (4 - 3) * rand(). And then when two people both try to insert 12 lines between lines 3 and 4 independently, when you merge them together you get all 24 lines, in their original orders but interleaved randomly, and that's not the end of the world; it might be legitimately better than just declaring a merge failure and harrumphing at the user.
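A toy sketch of that randomized insertion in Python; the Fraction labels, the granularity of the random choice, and the function names are all invented for illustration, not taken from any particular CRDT.

```python
# Toy sketch: random fractional labels, so independent inserts into the same
# gap interleave on merge instead of colliding.
import random
from fractions import Fraction

def label_between(lo: Fraction, hi: Fraction) -> Fraction:
    """Insert at lo + (hi - lo) * rand(), not at the midpoint."""
    r = Fraction(random.randrange(1, 1_000_000), 1_000_000)  # r in (0, 1)
    return lo + (hi - lo) * r

def merge(doc_a: dict, doc_b: dict) -> dict:
    """Documents as {label: line}. Merging is a union ordered by label, so
    concurrent insertions into the same gap come out randomly interleaved.
    (Conflict detection on duplicate labels is omitted in this toy.)"""
    merged = dict(doc_a)
    merged.update(doc_b)
    return dict(sorted(merged.items()))
```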
williamdclt
Ha, I had this exact problem asked as an interview question.
IIRC their real life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re-index when needed. Probably works well enough; what I came up with is (as you say) the idea of fractional indexing, but dealing with decimals is a pain, so you can instead represent them as vectors of digits, which you can just store as strings and sort lexicographically.
So an element inserted between elements 1 and 2 gets an index 11 (anything between 11-19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12-19). Note that these indexes are not numbers, they're strings that are compared lexicographically.
I'm sure there are downsides, e.g. it takes a lot more memory to sort these indexes (strings are much larger than numbers). It feels a bit too smart not to have unexpected downsides.
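For what it's worth, here is a small sketch of that scheme in Python under my own naming; it uses exact fractions internally to dodge the "decimals are a pain" problem, and real implementations of this idea (often called fractional indexing) tend to use the full byte range rather than just digits, as noted elsewhere in the thread.

```python
# Sketch: digit-string keys that sort lexicographically; new keys are minted
# by taking the numeric midpoint of the two neighbors.
from fractions import Fraction

def _to_fraction(key: str) -> Fraction:
    """Read a digit string like '12' as the decimal fraction 0.12."""
    return Fraction(int(key), 10 ** len(key))

def _to_key(f: Fraction) -> str:
    """Write a terminating fraction in (0, 1) as digits with no trailing zeros,
    so that lexicographic order on keys matches numeric order."""
    digits = []
    while f:
        f *= 10
        d = int(f)
        digits.append(str(d))
        f -= d
    return "".join(digits)

def key_between(a: str, b: str) -> str:
    """Return a key strictly between a and b (assumes a < b lexicographically)."""
    return _to_key((_to_fraction(a) + _to_fraction(b)) / 2)

# key_between("1", "2")  -> "15"   (anything in 11..19 would do)
# key_between("1", "11") -> "105"
```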
WalterBright
> solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2)
Reminds me of my days writing BASIC programs.
Etheryte
Very similar to my first intuition on how to approach this problem. An interesting question that comes to mind is how you should pick the gap sizes and how often you should rebalance the whole thing. Make the gaps too large and you're wasting a lot of label space you'll never need; make them too small and you're forced to rebalance too often. I'm guessing an ideal gap size is such that you rebalance at most once a night with a cron job and then avoid rebalances for the whole rest of the day.
To be fair, this sounds like one of those classic problems that someone for sure already figured out in the 50s or 60s, just under a different name and/or context. Hash chaining is a similar idea, but not quite the same.
macNchz
The "real life" solution of leaving gaps & reindexing is actually my earliest programming memory (as a teenager)/lesson in cleverness, of feeling like I should have been able to come up with a more clever/optimal/~~scalable~~ solution but settling for this awkward 100 200 300 thing. Nowadays I've used that approach like...dozens of times over the decades of real world projects since, and I'm still maintaining that web app I made in 2003, so I'm very glad I never came up with something unduly clever.
zeroonetwothree
That is how I implemented it at work around 9 years ago. Use strings for ordering and if you use the full byte range they end up fairly compact (rather than just 1-9 as in your example). There are some edge cases that can cause it to balloon in size so there is a separate reordering step but it almost never needs to be invoked.
thaumasiotes
> IIRC their real life solution was to leave gaps between elements (eg index 0, 100, 200… instead of 0, 1, 2) and re-index when needed. Probably works well enough; what I came up with is (as you say) the idea of fractional indexing
Those are the same thing. Leaving gaps is fractional indexing. It's just fixed-point rather than floating point.
> an element inserted between elements 1 and 2 gets an index 11 (anything between 11-19 is fine). Between 1 and 11 is 101. Between 11 and 2 is 12 (or anything between 12-19). Note that these indexes are not numbers, they're strings that are compared lexicographically.
This reminds me of the most interesting method of generating random integers in an arbitrary range from random bits: interpret the bitstream as a string representing a real number (in binary) between 0 and 1. If, for example, you want to use bits to generate a number between 0 and 5, you divide the unit interval into sixths, and examine bits until you've determined conclusively that your number lies within one of those intervals:
+----- 0 = 0.000000000... -----+
|  0.000    -> 0               |
|  0.00100  -> 0               |
+---- 1/6 = 0.001010101... ----+
|  0.0011   -> 1               |
|  0.0100   -> 1               |
+---- 2/6 = 0.010101010... ----+
|  0.011    -> 2               |
+---- 3/6 = 0.100000000... ----+
|  0.100    -> 3               |
+---- 4/6 = 0.101010101... ----+
|  0.1011   -> 4               |
|  0.1100   -> 4               |
+---- 5/6 = 0.110101010... ----+
|  0.11011  -> 5               |
|  0.111    -> 5               |
+------------------------------+
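A small sketch of that procedure in Python, using exact fractions for the interval endpoints; the function name and the bit-source parameter are my own, not a standard library API.

```python
# Sketch: read random bits as the binary expansion of a real in [0, 1) and
# stop once the interval the number must lie in fits inside one of the n
# equal buckets [k/n, (k+1)/n).
import random
from fractions import Fraction

def random_below(n: int, next_bit=lambda: random.getrandbits(1)) -> int:
    """Uniform integer in [0, n). Terminates with probability 1; only a bit
    stream that lands exactly on a bucket boundary would loop forever."""
    lo, hi = Fraction(0), Fraction(1)   # the number lies somewhere in [lo, hi)
    while True:
        mid = (lo + hi) / 2
        if next_bit():
            lo = mid                    # bit 1: upper half of the interval
        else:
            hi = mid                    # bit 0: lower half of the interval
        bucket = int(lo * n)            # bucket containing lo
        if hi * n <= bucket + 1:        # whole interval inside that bucket?
            return bucket

# random_below(6) with bits 0,0,0 returns 0, and with bits 0,0,1,1 returns 1,
# matching the rows in the table above.
```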
jes5199
like line numbers in an old BASIC program
ryao
Is there any reason to think that this algorithm will actually be faster than what is currently done in practice?
For arrays in B-tree nodes, which is the main place where I have encountered this problem, I doubt it will be faster than just using `memmove()`, and for truly large arrays, it would be easier to use a B-tree.
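For contrast, here is roughly what that "just shift the tail" baseline looks like, as a Python stand-in for the C `memmove()`; the fixed-capacity node layout and the names are illustrative, not from any real B-tree.

```python
# Sketch: insertion into a small, fixed-capacity node array by shifting the
# tail one slot to the right (what memmove does in C).
CAPACITY = 64  # node arrays are small, so an O(n) shift is cheap in practice

def node_insert(keys: list, count: int, pos: int, key) -> int:
    """Insert key at index pos in a node holding `count` live entries."""
    assert count < CAPACITY and 0 <= pos <= count
    keys[pos + 1:count + 1] = keys[pos:count]  # shift the tail right
    keys[pos] = key
    return count + 1

# keys = [10, 20, 30, 40] + [None] * (CAPACITY - 4)
# node_insert(keys, 4, 1, 15)  ->  keys now begins [10, 15, 20, 30, 40, ...]
```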
If that is the case, this joins a number of algorithms that are asymptotically faster than the algorithms in use, yet paradoxically slower in practice. Examples of such algorithms include all of the fast matrix multiplication algorithms, which are slower than good implementations of the textbook O(n^3) algorithm (GEMM).
melvyn2
These are sometimes called Galactic Algorithms: https://en.wikipedia.org/wiki/Galactic_algorithm
The very first example on the page has a pretty good quote describing their usefulness:
>>> An example of a galactic algorithm is the fastest known way to multiply two numbers, which is based on a 1729-dimensional Fourier transform. It needs O(n log n) bit operations, but as the constants hidden by the big O notation are large, it is never used in practice. However, it also shows why galactic algorithms may still be useful. The authors state: "we are hopeful that with further refinements, the algorithm might become practical for numbers with merely billions or trillions of digits."
dchftcs
One interesting thing about working on numbers so large is that it becomes harder to assume that memory access is constant time. It's already not that easy now (cache miss vs hit, and external memory), but when you operate on data sizes that large, memory accesses matter by at least the square or cube root of N. If you require largely sequential access it is possible to outperform an algorithm that requires conditional random access by some function of those factors, which can make some algorithms look much worse or better compared to the classical analysis. Of course, it still does not matter whether your algorithm is going to finish in a couple of trillion years or a trillion trillion years.
baruchel
I recall presenting a problem to my students a few years ago based on the 'Library Sort' algorithm. I still clearly remember the title of the original paper: 'Insertion Sort is O(n log n)'.
deathanatos
Presumably this: https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaMo06-l...
Kind of a click-baity title.
foota
It's sort of click bait, but also true, right? This is just insertion sort using non-fixed indexing?
zeroonetwothree
This is a different problem though, despite the similar name.
msmitha
I was shocked to discover how the British Library (millions of books, many many new ones every week) manages this. The first book to arrive earlier this year went on the shelf at 2025.0000001. The next went at 2025.0000002, right next to it. The electronic catalogue does the rest. No need to re-shuffle books around, but not a solution supportive of book-browsing.
crazygringo
Reminds me of how Amazon doesn't arrange its items by similarity like a store does, so a model of vacuum cleaner might be next to a set of kitchen plates. They actually intentionally avoid similarities so pickers won't accidentally grab a similar but wrong thing.
I lose track of so many infrequently used objects in my home -- which storage bin in which closet did I put the x-acto blade refills in? -- and one bin will overflow while another is half-empty because I try to keep similar items together. Sometimes I fantasize about tracking every possession in a spreadsheet that points to which bin, so I'd never lose anything and could always use storage space with maximum efficiency. But then I know I'd be lazy and skip updating the spreadsheet when I put something new away... plus it just seems so inhumanly weird, like something a robot would do rather than a person.
tetris11
Clean links for mobile users
Game: https://patorjk.com/games/subpixel-snake/ Code: https://github.com/patorjk/subpixel-snake
Resource for subpixel geometries: https://geometrian.com/resources/subpixelzoo/
9dev
This probably belongs to the post on subpixel snake, not this one on the library sort algorithm :)
TheRealNGenius
Wrong post
doodpants
Off topic, but now I want a screen saver of the animation at the top of the article...
thaumasiotes
> lowering the upper bound to (log n) times (log log n)^3 — equivalent to (log n)^(1.000...1)
This is true! One of the things I find cool about big-O complexity using polynomial reference classes is that logarithms give you an infinitesimal value. Take that, "infinitesimals don't really exist" people!
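(To spell that out, as a standard asymptotics fact rather than anything from the article: for every fixed eps > 0, (log log n)^3 grows more slowly than (log n)^eps, so the bound (log n) * (log log n)^3 is O((log n)^(1+eps)) for every eps > 0. Any exponent strictly above 1 works, but 1 itself does not, which is what the article's "(log n)^(1.000...1)" is gesturing at.)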
sfelicio
Wait. What? Do you have a reference where I can learn about this?
AstroJetson
Cool. This is Lisa and Laura; they are my two retired teachers who are now helping in the library to sort shelves.
Can you explain to them what you want them to do now?
I read the paper and I would like to say: it's confusing at best. But I'm hoping you can help out.
gessha
A lot of things are confusing at first but then we discover intuitive ways to explain and teach them to others.
Also, don’t discredit the intelligence of retirees, teachers and librarians in the same post. It’s bad for karma and makes you sound like a douche. I’m sure they can figure it out.
AstroJetson
Wasn't my intent. The article was "how to do this" with lots of great pictures. Linked page was full of high level math. I was mocking the "This one simple trick can make your shelves full" attitude.
So you answered and I'll ask, "what do they do, it does not seem simple" except for the Database people filling ....
Thanks (former teacher)
"Kuszmaul remembers being surprised that a tool normally used to ensure privacy could confer other benefits."
It was surprising to me too! But reflecting on it more closely, most performance isn't about "faster" in a literal sense of "more instructions run per time", but about carefully choosing how to do less work. The security property here being "history independence" is also in a way stating "we don't need to, and literally cannot, do any work that tracks history".
It's definitely an interesting approach to performance, essentially using cryptography as a constraint to prevent extra work. What properties do we need, and what properties can we ignore? The question becomes: if we MUST ignore this property cryptographically, how does that affect the process and the related performance?
It certainly feels like it may be a useful perspective, a rigorous approach to performance that may be a path to more improvements in key cases.