Skip to content(if available)orjump to list(if available)

Douglas McIlroy responds to Unix spell article with new implementation details

Douglas McIlroy responds to Unix spell article with new implementation details

14 comments

·February 6, 2025

I originally wrote this article explaining the data model as implemented in the original Unix "spell": https://blog.codingconfessions.com/p/how-unix-spell-ran-in-6....

This reached Doug, who then responded to it with the follow up work he did which wasn't published anywhere.

re

The text of the email (courtesy of Firefox/macOS Text recognition):

___________________

Subject: yes, it's me, the author of "spell"

You did a nice job of gently describing "spell"s data representation. But the formula x = 2^{log_2(m)}-m is a complicated way to say x=0. Did you mean that?

When memory got bigger, I kept the literal dictionary in memory, but compressed it on secondary memory for fast transfer. The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order.

We also added one byte per enry for affixing info, e.g. whether in- (or im- or ir-) is preferred over un-, whether a final consonant is doubled before adding a suffix that begins with a vowel, part of speech, etc. Although there were lots of such attributes, the number of distinct combinations of attributes that actually occurred was fewer than 256, so could be coded into one byte, and decoded by lookup in a 256-entry table. I automated the construction of that table, which would change if a dictionary revision created a new combination of attributes.

With affixing info we could at the same time be more aggressive about affixing, and make fewer mistakes like allowing -s on a word that cannot be used as a noun or a verb.

pronoiac

> The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order.

Oh, that looks familiar; the database for the locate command uses something similar - https://www.gnu.org/software/findutils/manual/html_node/find...

devin

There used to be a page of Doug McIlroy "facts". I managed to hunt this page down by searching around.

http://git.9front.org/plan9front/plan9front/c28f07b5840216c4...

One of the "facts" is: > Doug McIlroy can address 8 terabytes of RAM with only 32 bits.

cb321

Many of those are very funny. Where would we be without the Chuck Norris joke format?

Of course, paraphrasing @sitkack 's observation in the other comment thread (combining with my idea that streaming alone is faster than all that repeated hashing), "better address space is worse real-time perf". For the guy behind the famous 1986 good natured trash talking "industrial-strength Fabergé egg–..., a museum piece from the start", it made me wonder if between 1980 (when McIlroy1982 was likely written) and 1986 (when the famous Knuth-McIlroy "battle" happened) if Doug came to consider that v7spell was perhaps guilty of the very same "crime" he used to roast Knuth.

In truth, of course, as with most things, it's just a matter of scales - at some small document scale the v7 compression works better while at a closer to dictionary-size scale the merging does (https://news.ycombinator.com/item?id=42980934). I suppose it's also worth mentioning that the optimizations of that linked program automagically compresses prefixes and optimizes suffix byte-comparisons in a human language-neutral way (i.e., just from the sorted structure/vague concept, with no linguistic insight beyond "lexicographic" order). Esp. given the i18n push of the 90s, it's possible the POSIX guys might have included some prefix-suffix delta format merge utility had it existed and English-specificity might be what blocked `spell`. Of course, Seymour Cray's computers had already begun vector processing and those techniques might also not be very SIMD-friendly. { So..many..trade-off..dimensions! }

cb321

> The compression was trivial: store a suffix preceded by one byte that contained the length of the prefix that the word shared with its predecessor in dictionary order

This is more or less what was speculated upon as compatible with his stemming ideas [1], but I (and partly McIlroy1982) doubt you ever really needed the fancy hashing-compression in the first place (though I'm sure it was fun to figure out!). I got better perf (3x faster on a 5500 word document with an 82,000 word "unabridged" dictionary that prefix-compresses to only 278K which would fit on a 360K floppy of the era) than v7spell on v7 with just a merge algorithm somewhat more finely tuned against that format. Of course, on a shared computer getting either the streaming disk access (without seeks spoiling the party) or all that RAM to yourself (without swapping spoiling the party) is a guess. And, yeah, probably the background Doug mentions is all in the combined-unix-git-history by now { and it would not have been as attractive an academic paper :-) }.

[1] https://news.ycombinator.com/item?id=42931145

dang

Recent and related:

How Unix spell ran in 64kb RAM - https://news.ycombinator.com/item?id=42752604 - Jan 2025 (51 comments)

jll29

Slightly related:

In the paper Leidner and Plachouras (2017), we reported an observation initially due to Bentley, namely that the McIllroy spell(1) implementation emailed a list of unknown words when it was run over a document to the author. While technically, this is a neat way to increase dictionary size by "mining user data", and certain versions of Microsoft Word and Microsoft Edge (see https://news.ycombinator.com/item?id=35208333 ) had the same behavior, it is privacy-violating at least if users are not informed beforehand [1]. Of course this has to be seen in the cultural context of the UNIX community at the time (people were even weary of using passwords then to protect their accounts), but still harm could be done if an email tasked about "malinoma" when that term was still absent from the dictionary, possibly revealing a sensitive condition of the email author or their circle of friends and family to a third party, the software engineer of the speel checker.

[1] https://aclanthology.org/W17-1604.pdf

BarbaryCoast

I've done something similar at my job. I don't think it implicates privacy because 1) it's a list of misses from the (public) dictionary, and 2) the email is sent as the program, not as the user. So if the user misspells melinoma, you have no information on which user that was.

ggm

That's a very rude thing to say about the President's wife. Notice has been taken.

MonkeyClub

  > the speel checker
        ~~~~~

null

[deleted]

null

[deleted]

CodeCompost

Can we please stop posting links from far-right conspiracy websites?