The Bluesky Dictionary

28 comments

·August 6, 2025

wantlotsofcurry

I'm very curious as to how this works in the backend. I realize it uses Bluesky's firehose to get the posts, but I'm more curious about how it checks whether a post contains any of the available words. Any guesses?

avibagla1

Hey! This is my site. It's not all that complex: I'm just using a SQLite db with two tables, one for stats and one for all the words, where each row is just word | count | first use | last use | post.
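
For readers wondering what a two-table layout like that might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the choice to keep the first post that used a word are all assumptions for illustration; the comment above only gives the rough column list, and the site's real schema may differ.

```python
import sqlite3

# Hypothetical schema: names are guesses based on the
# "word | count | first use | last use | post" description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    word      TEXT PRIMARY KEY,
    count     INTEGER NOT NULL DEFAULT 0,
    first_use TEXT,
    last_use  TEXT,
    post      TEXT
);
CREATE TABLE IF NOT EXISTS stats (
    name  TEXT PRIMARY KEY,
    value INTEGER NOT NULL
);
"""

def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def record_word(conn, word, seen_at, post_uri):
    """Bump a word's count; first_use and post keep the first sighting (an assumption)."""
    conn.execute(
        """
        INSERT INTO words (word, count, first_use, last_use, post)
        VALUES (?, 1, ?, ?, ?)
        ON CONFLICT(word) DO UPDATE SET
            count = count + 1,
            last_use = excluded.last_use
        """,
        (word, seen_at, seen_at, post_uri),
    )
```

The upsert syntax needs SQLite 3.24 or newer, which recent Python builds generally bundle.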

I... did not expect this to be so popular

f311a

You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...

Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for retrieving items. You can hash the top 1k of the most common words and check the rest using a trie.

The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
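
The "hash the common words, trie for the rest" idea can be sketched roughly as follows. The class names are made up for illustration, and a trie built from nested Python dicts will not show the memory savings mentioned above (those come from compact implementations); this only shows the lookup structure.

```python
class Trie:
    """Character trie storing a set of words as nested dicts."""

    _END = None  # key marking "a word ends here"; never collides with a character

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


class TwoTierLexicon:
    """Hash set for the most common words, trie for everything else."""

    def __init__(self, common_words, all_words):
        self.hot = frozenset(common_words)
        self.trie = Trie(w for w in all_words if w not in self.hot)

    def __contains__(self, word):
        # Cheap hash lookup first; fall back to the trie walk.
        return word in self.hot or word in self.trie
```

Usage would be something like `"apple" in TwoTierLexicon(top_1k, full_word_list)`, where both lists are whatever word sources you have on hand.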

gpm

Probably just a big hashtable mapping word -> the number of times it's been seen, and another hashset of all the words it hasn't seen. When a post comes in you hash all the words in it and look them up in the hashtable, increment it, and if the old value was 0 remove it from the hash set.

250k words at a generous 100 bytes per word is only 25MB of memory...
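
A minimal sketch of that guess, assuming the post has already been split into lowercase words. `WordTracker` and its method names are hypothetical, and this is not confirmed to be how the site works.

```python
class WordTracker:
    """Count table plus a set of words that have not been seen yet."""

    def __init__(self, dictionary):
        words = list(dictionary)
        self.counts = {w: 0 for w in words}  # word -> times seen
        self.unseen = set(words)             # words still at zero

    def observe(self, words):
        """Count each dictionary word in a post; return words seen for the first time."""
        newly_seen = []
        for w in words:
            old = self.counts.get(w)
            if old is None:
                continue  # not a dictionary word
            self.counts[w] = old + 1
            if old == 0:
                self.unseen.discard(w)
                newly_seen.append(w)
        return newly_seen


# The back-of-the-envelope memory estimate from the comment above.
ESTIMATED_BYTES = 250_000 * 100  # 25,000,000 bytes, i.e. 25 MB
```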

stwrzn

I very much hope that the backend uses one of the Bluesky Jetstream endpoints. When you only subscribe to new posts, it provides a stream of around 20 Mbit/s last time I checked, while the firehose was ~200 Mbit/s.
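
For anyone who has not used Jetstream, here is a rough sketch of subscribing to posts only and pulling the text out of each event. The host name is one of the public instances as far as I know, and the event field names (`kind`, `commit`, `operation`, `collection`, `record`) follow my understanding of the Jetstream JSON format; treat both as assumptions and check the Jetstream README before relying on them. The WebSocket client itself is left out so the snippet stays stdlib-only.

```python
import json
from urllib.parse import urlencode

# Assumed public Jetstream instance; other hosts exist.
JETSTREAM_HOST = "jetstream2.us-east.bsky.network"

def subscribe_url(host=JETSTREAM_HOST):
    """Build a subscribe URL that asks for post records only."""
    query = urlencode({"wantedCollections": "app.bsky.feed.post"})
    return f"wss://{host}/subscribe?{query}"

def post_text(raw_event):
    """Return the text of a newly created post, or None for any other event."""
    event = json.loads(raw_event)
    if event.get("kind") != "commit":
        return None
    commit = event.get("commit", {})
    if commit.get("operation") != "create":
        return None
    if commit.get("collection") != "app.bsky.feed.post":
        return None
    return commit.get("record", {}).get("text")
```

Taking the figures above at face value, 20 Mbit/s is about 2.5 MB/s, or roughly 216 GB per day, versus roughly 2.16 TB per day at 200 Mbit/s.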

avibagla1

yes it does!

bangaladore

Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem. Ingest post, split by words, check each word via some db, hashmap, etc... and update metadata.
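
The "split by words, check each word" step could look roughly like this. The regexes are a simple assumption (ASCII letters, optional internal apostrophes, URLs dropped first) and not the site's actual tokenizer.

```python
import re

WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")  # letters, optional internal apostrophes
URL_RE = re.compile(r"https?://\S+")         # links are dropped before splitting

def tokenize(text):
    """Lowercase a post, drop URLs, and return its alphabetic tokens."""
    return WORD_RE.findall(URL_RE.sub(" ", text.lower()))

def dictionary_hits(text, dictionary):
    """Return the distinct dictionary words in a post, in first-seen order."""
    hits = dict.fromkeys(t for t in tokenize(text) if t in dictionary)
    return list(hits)
```

Each hit would then be handed to whatever store keeps the counts and timestamps (a db, a hashmap, etc.).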

neaden

Is this not working, or am I missing something? It just shows 0 words seen for me. Firefox on a PC.

accrual

You may need to allow scripts from the domain avibagla.com, it shows 0 when the scripts are blocked.

zem

ugh, it ought to be building the results on the server and serving up static pages.

rafram

But it updates live...

AgentME

For me it took a minute to start loading data and switch from just showing 0.

SirFatty

Same... maybe you need a Bluesky account, which I don't have.

gpm

It doesn't... I can open it in a private browsing window.

GalaxyNova

It's working fine for me on Firefox

refreeze654

I've wondered how Bluesky affords the bandwidth to let anyone stream the full firehose.

dgacmu

Not an answer to your question, but I suspect most people don't -- my bot (a pi searcher bot, of course) just runs on Jetstream, which is pretty lightweight and heavily compressed.

(The website in question also uses Jetstream.)

pona-a

For a moment I thought it would be an AT-Proto based Urban Dictionary clone.

spullara

I did this against a pretty large tweet archive and got hits on about 125k of the words in the unix dictionary.

GalaxyNova

Fascinating! I think it's really cool that this is possible, and at the same time kind of sad that the norm is slowly moving towards more locked-down APIs.

timeon

> slowly moving towards

Depends what we accept as norm.

75345d4c

I just saw it indexed "eluvium," but the post was referring to a band with that same name

Kye

GeologySky will get to it soon enough.

atlgator

I checked out the author's other projects and this is a common issue. For example, he has a "lean checker" for Bluesky that claims it is right-leaning simply because of all the people saying "That's right," "He was right," etc. None of the supposedly right-leaning posts were actually conservative in nature. They just used the word "right" to mean correct.
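
To make the failure mode concrete, here is a toy keyword scorer. The word list and function name are invented for this illustration and are not the lean checker's actual code; the point is only that bare word matching cannot tell "right" (correct) from "right" (political).

```python
import re

# Hypothetical keyword weights, made up for this example.
LEAN_WORDS = {"left": -1, "right": +1}

def naive_lean(posts):
    """Sum keyword weights over all posts; positive means 'right-leaning'."""
    score = 0
    for post in posts:
        for token in re.findall(r"[a-z]+", post.lower()):
            score += LEAN_WORDS.get(token, 0)
    return score
```

Feeding it `["That's right", "He was right.", "Turn left here"]` gives a positive score even though none of the posts are political.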

avibagla1

one, thank you for checking my website. two, that is the joke, 100% - at the time people kept talking about how "left leaning" bsky was and that idea came to mind

tough

Words We Haven't Seen

- Search unseen words

made me chuckle

crm9125

I've found content for all of my future skeets.