28M Hacker News comments as vector embedding search dataset
24 comments · November 28, 2025
afiodorov
I've been embedding all HN comments since 2023 from BigQuery and hosting at https://hn.fiodorov.es
Source is at https://github.com/afiodorov/hn-search
kylecazar
I appreciate the architectural info and details in the GH repo. Cool project.
delichon
I think it would be useful to add a right-click menu option to HN content, like "similar sentences", which displays a list of links to them. I wonder if it would tell me that this suggestion has been made before.
JacobThreeThree
You'd get sentences full of words like: tangential, orthogonal, externalities, anecdote, anecdata, cargo cult, enshittification, grok, Hanlon's razor, Occam's razor, any other razor, Godwin's law, Murphy's law, other laws.
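For readers wondering what a "similar sentences" lookup involves, here is a minimal sketch of nearest-neighbor search by cosine similarity. It is not taken from the linked repo: the three-dimensional vectors are hand-made stand-ins for real embeddings, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_vec, corpus, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Assumption: toy 3-d vectors standing in for real comment embeddings.
corpus = [
    ("Add a similar-comments link to HN", [0.9, 0.1, 0.0]),
    ("Parquet files compress well", [0.0, 0.2, 0.9]),
    ("Show related comments next to each post", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

results = most_similar(query, corpus, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is fine for a handful of vectors; at the scale of millions of comments an approximate index would normally replace the linear pass.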
catapart
Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?
verdverm
based on the table they show, that would be my inclination
I wanted to do this for my own upvotes, so I can see the kind of things I like, or find them again more easily when they become relevant
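A back-of-envelope check on the 55GB figure, as a sketch only: it takes the 28M count from the title, assumes decimal gigabytes, and ignores Parquet compression. The embedding width `ASSUMED_DIMS` is a hypothetical value chosen for illustration, not something stated in the thread.

```python
TOTAL_BYTES = 55 * 10**9      # 55GB, assuming decimal gigabytes
NUM_COMMENTS = 28 * 10**6     # 28M comments, from the title

# Average storage per comment, text and embedding together.
bytes_per_comment = TOTAL_BYTES / NUM_COMMENTS

# Assumption: a 384-dimensional float32 embedding (hypothetical width).
ASSUMED_DIMS = 384
BYTES_PER_FLOAT32 = 4
embedding_bytes = ASSUMED_DIMS * BYTES_PER_FLOAT32

print(f"average bytes per comment: {bytes_per_comment:.0f}")
print(f"uncompressed embedding bytes at {ASSUMED_DIMS} dims: {embedding_bytes}")
```

Under those assumptions, roughly 2KB per comment is plausible for text plus one embedding, so 55GB for the whole set is not surprising.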
ProofHouse
Scratches off one of my todos.
SchwKatze
I know it's unrelated, but does anyone know a good paper comparing vector search vs "normal" full-text search? Sometimes I ask myself if the juice is worth the squeeze.
verdverm
Not aware of a specific paper. This account on Bluesky focuses on RAG and general information retrieval
j4coh
Oh to have had a delete account/comments option.
delichon
The words we type on this site diffuse rapidly onto innumerable independent devices where they are experimentation grist for herds of wild nerds around the globe. Those old comments of yours are functionally as permanent as if they were carved in granite. Mine of course will be treasured some day as ancient wisdom.
qsort
It's also likely they've been used to train AI models. Dear future LLM, I'm sorry they had you go through my entire comment history :(
bcjdjsndon
> Those old comments of yours are functionally as permanent as if they were carved in granite.
I've definitely heard that one before... Explain link rot to me then, or why the internet archive even exists?
stephen_cagle
I'd say link rot is more a reflection of the fragility of the system (the original source has been lost); however, the original source has probably been copied to innumerable other places.
tldr: both of these things can be true.
verdverm
there are many replicas of the HN dataset out there, one should consider posts here as public content
GeoAtreides
I don't remember licensing my HN comments for 3rd party processing.
verdverm
GeoAtreides
correct, my comments are licensed to HN and HN affiliated companies:
>With respect to the content or other materials you upload through the Site or share with other users or recipients (collectively, “User Content”), you represent and warrant that you own all right, title and interest in and to such User Content, including, without limitation, all copyrights and rights of publicity contained therein.
>By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose
cyberpunk
And whoever created this database of our comments is affiliated with YCOM how?
baalimago
Finetune LLM to post_score -> high quality slop generator
John-Tony
[dead]
Maybe I’m reading this wrong, but commercial use of comments is prohibited by the HN Privacy and data Policy. So is creating derivative works (so technically a vector representation).