Reproducing Hacker News writing style fingerprinting
165 comments
· April 16, 2025 · mtlynch
leetrout
My favorite, which is also up to date, is the ClickHouse playground.
For example:
SELECT * FROM hackernews_history ORDER BY time DESC LIMIT 10;
https://gh-api.clickhouse.tech/play?user=play#U0VMRUNUICogRl...
I subscribe to this issue to keep up with updates:
https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
And ofc, for those who don't know, the official API: https://github.com/HackerNews/API
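For example, pulling the newest item is plain JSON over HTTP. A minimal sketch using Python's requests library against the documented maxitem.json and item/<id>.json endpoints:

    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    # Fetch the id of the newest item, then the item itself.
    max_id = requests.get(f"{BASE}/maxitem.json").json()
    item = requests.get(f"{BASE}/item/{max_id}.json").json()
    print(item.get("type"), item.get("by"), item.get("text", "")[:80])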
antirez
I didn't know there was an official API! This explains why the data is so readily available in many sources and formats. That's very cool.
laborcontract
…I can’t believe I’ve been running a script to ingest the data for the last six hours. Thank you.
Frieren
It works for me. The accounts I used a long time ago are there in high positions. I guess my style is very distinctive.
But I have also seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments and, at minimum, some of them seem to also be from the EU). So, I guess, it is also grouping people by their native language other than English.
So, maybe, it is grouping many accounts by the shared bias of a common native language. Probably, we make the same type of mistakes while using English.
My guess is that the accounts of native Indian or Chinese speakers will also be grouped together, for the same reason. Even more so, as those languages are more different from English and the bias is probably stronger.
It would be cool if Australians, Brits, and Canadians tried the tool. My guess is that the probability of them finding alt accounts is higher, as their populations are smaller and their writing more distinctive than Americans'.
Thanks for sharing the project. It is really interesting.
Also, do not trust the comments too much. There is an incentive to lie, so as not to acknowledge alt accounts if they were created to remain hidden.
gostsamo
I discovered 2 people in my top 20 who, I can bet, are from the same country as me, and it is not a big country.
Tade0
> Probably, we make the same type of mistakes while using English.
That is most likely the case. Case in point: My native language doesn't have articles, so locally they're a common source of mistakes in English.
laurentlb
It would be fun to have a tool try to guess your native language, based on your English writing.
Tade0
I noticed that it also depends on the vendor of the autocorrect/dictionary you're using.
The project referenced in the post put me next to Brits on the similarity list and indeed I am using an English(UK) dictionary. Meanwhile this iteration aligns me with Americans despite the only change being the vendor (formerly Samsung, now Google).
I guess the Samsung keyboard corrects to proper Bri'ish.
I picked up the language as a child from a collection of people, half of whom weren't native speakers, so I don't speak any specific dialect.
legohead
Didn't catch my original account when I tried it; it's not anywhere in the top 100.
But, if I do the reverse (search using my original account), this one shows up as #2.
The main difference between the accounts is this one has a lot more posts, and my original account was actively posting ~11 years ago.
6510
I never knew A can be like B without B being like A.
selcuka
The matching score is probably the same, or very close in both ways, but this fact does not necessarily help in a three-way scenario:
A <-> B: 80%
A <-> C: 90%
B <-> C: 70%
When you search for A, the best match will be C, but if you start with B it will be A. If one of the accounts has a smaller sample set, as in GP's case, the gap could be quite big.
hammock
The "analyze" feature works pretty well.
My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently that I would otherwise.
They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")
My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.
In case anyone cares.
throwanem
I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "syntactic" and the latter "semantic."
I suppose it's possible the "analyze"-reported proportions are a lot more precise and reliably diagnostic than I imagine. I haven't yet looked in detail at the statistical method.
Also, of course, it would require integration with NLP tooling such as WordNet (or whatever's SOTA there, something like a decade and a half on) and a bit of Porter stemming to do part-of-speech tagging. If one 0.7GB dataset is heavyweight where this is running, that could be a nonstarter; stemming is trivial, and I recall WordNet being acceptably fast if maybe memory-hungry on a decade ago's kinda crappy laptop, but I could see it requiring some expensive materialization just to get datasets to inspect. (How exactly do we define "more common" for e.g. "smooth"? Versus semantic words, all words, both, or some combination? Do we need another dataset filtered to semantic words? Etc.)
If we're dreaming and I can also have a pony, then it would be neat to see both the current flavor, one focused on semantics as above, and one focused specifically on syntax as this one coincidentally often seems to act like. I would be tempted to offer an implementation, but I'm allergic to Python this decade.
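For the record, a minimal sketch of that content-word filtering, assuming NLTK's stock tagger and Porter stemmer (none of this is in antirez's project):

    import nltk
    from nltk import pos_tag, word_tokenize
    from nltk.stem import PorterStemmer

    # One-time setup: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
    CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

    def content_words(text):
        # Keep only content words, stemmed; drop articles, prepositions, etc.
        stemmer = PorterStemmer()
        return [stemmer.stem(word.lower())
                for word, tag in pos_tag(word_tokenize(text))
                if tag.startswith(CONTENT_TAGS)]

    print(content_words("The quick brown fox jumps over the lazy dog."))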
throwanem
Of course, immediately after the edit window closes, I revisit this comment and discover that in the first paragraph I swapped my terms and made a hash of the rest of the thing. Please cut out and paste into your printouts the following corrected version. Thank you!
> I would prefer the "analyze" feature focus on content rather than structure words. I forget the specific linguistic terms but to a first approximation, nouns and verbs would be of interest, prepositions and articles not. Let's call the former "semantic" and the latter "syntactic."
milesrout
Should is a commonly used word and a fine one. You should feel free to use it. If someone gets hot under the collar because you said he should do something then he is an idiot.
"Ought to" is essentially a synonym. Anyone that gets upset when you said they should do something but is fine when you say that they ought to do something is truly a moron.
heresie-dabord
> If someone [test] then he is an idiot.
> Anyone [test] when [test] is truly a moron.
These structures are worse habits in communication than subtle, colloquially interchangeable word choices.
milesrout
This isn't a habit of communication. I honestly mean it: if you get upset that someone said you "should" do something, but you are fine when they say you "ought to" do it, then you must be stupid. They mean the same thing in modern English.
Loughla
The only time to avoid command words like should is when the person could conceivably see them as a command. Because then you're being a dick.
Otherwise, if someone wants to take the time to dissect meaning from add-on meaningless words like should in a sentence, they should find something better to do with their time. Or just ask instead of being a moron.
milesrout
How are you being a dick?! There are loads of reasons why you may want or need to instruct someone to do something. I prefer the imperative mood. It is more direct. "Sudo make me a cup of tea".
hammock
Most people are more moronic than one might think
jcims
I (also?) felt the 'words used less often' were much easier to connect to conscious effort. I pointed ChatGPT at the article, pasted in my results, and asked what it could surmise about my writing style. It probably connected about as well as the average horoscope, but it was still pretty interesting!
antirez
That's very interesting, as I noticed that certain outliers did indeed seem to be conscious attempts.
croemer
Since you seem to care about your writing, I'm wondering why you used "that" here?
> I use "this" less frequently that I would otherwise
Isn't it "less than" as opposed to "less that"?
hammock
Typo. Good catch
WhyNotHugo
I think “should” and “ought to” end up being equivalent.
I prefer to avoid such absolutes and portray causality instead.
For example, in place of “you should not do drugs at work” I prefer “if you take drugs at work you’ll get in trouble”.
hammock
They do, and your suggestion is a great alternative. I'll try to do more of that
6510
I oftentimes go back and replace the instances of "should" with "could", as I could not tell people what to do. “You could not do drugs at work.”
SoftTalker
Nothing wrong with implying that people ought to behave according to mainstream social norms.
brookst
Isn’t that the same as saying that counterculture, fringe culture, and subcultures ought not exist?
generalizations
Interestingly, when most people simply choose to do what most people choose to do, you get an emergent 'herd mentality' which can lead to some very strange places. It is also sensitive to very small perturbations, which in real terms means the one person who does manage to think for themselves may find they have an outsized effect on the direction of the crowd.
I think this mentality is also where the term 'sheeple' comes from.
Joker_vD
> I prefer "ought to"
I too like when others use it, since a very easy and pretty universal retort against "you ought to..." is "No, I don't owe you anything".
albumen
Are you saying there's a connection between "ought" and "owe"? All I see is "I don't want to hear any criticism".
Joker_vD
Yes, "ought" is the past tense of "owe". At some point, the second alternative spelling "owed" was introduced to better separate the two meanings (literal and figurative), but it's still the same word; a similar thing happened with "flower" and "flour", those used to be interchangeable spellings of the same word but then somebody decided that the two meanings of that word should be separated and given specific spellings.
And the construct "you owe it to <person> to <verb>" still exists even today, but it is not nearly as popular as "you should <verb>", precisely because it has to state to whom exactly you owe the duty; with "should" it sounds like an impersonal, quasi-objective statement of fact, which suits the manipulative uses much better.
oasisbob
The etymology makes a connection through Old English. The Oxford dictionary also contains this meaning:
> used to indicate duty or correctness
A duty to others is something you owe them; think, a duty of care and its lack, which is negligence.
xnorswap
I wonder how much accuracy would improve by expanding from single words to the most common pairs or n-tuples.
You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.
(Not that the accuracy isn't already good, of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote.)
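A plain-Python sketch of the pair/3-tuple counting (hypothetical whitespace tokenization; the project itself only uses single-word frequencies):

    from collections import Counter

    def top_ngrams(tokens, n, k):
        # Slide an n-wide window over the token stream and count each tuple.
        return Counter(zip(*(tokens[i:] for i in range(n)))).most_common(k)

    tokens = "i think i think therefore i am".split()
    print(top_ngrams(tokens, 2, 50))  # top 50 word pairs
    print(top_ngrams(tokens, 3, 20))  # top 20 most common 3-tuples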
antirez
Maybe there isn't enough data for each user for pairs, but I thought about mixing the two approaches (though I had no time to do it): that is, to have 350 components like now for the single-word frequencies, plus another 350 for the most common pair frequencies. In this way part of the vector would remain a high enough signal even for users with comparably less data.
xnorswap
I've been thinking some more about this, and it occurred to me that you'd want to encode sentence boundaries as a pseudo-word in the n-tuples.
I then realised that "[period] <word>" would likely dominate the most common pairs, and that a lot of time could be saved by simply recording the first words of sentences as their own vector set, in addition to, but separate from, the regular word vector.
Whether this would be a stronger or weaker signal per-vector-space than the tail of words in the regular common-words vector I don't know.
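A sketch of that sentence-starter idea, with naive punctuation-based sentence splitting (just an illustration, not part of the project):

    import re

    def sentence_starters(text):
        # Split on sentence-ending punctuation and record each first word.
        sentences = re.split(r"[.!?]+\s+", text)
        return [s.split()[0].lower() for s in sentences if s.split()]

    print(sentence_starters("I agree. But consider this! Then again, maybe not."))
    # ['i', 'but', 'then']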
jedberg
Maybe I talk too much on HN. :)
When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.
Probably would be good to exclude those most common words.
marginalia_nu
I had a similar result. 85%+ similar to a bunch of random accounts, and my perhaps most distinguishing feature is I don't use the word 'app' or 'company' a lot. The former because I dislike the word, and the latter maybe because I'm self-employed.
I figured it would maybe cluster me with other non-native speakers, but it doesn't appear to. Of the accounts where I could identify a country of origin, all were American.
OuterVale
Funnily enough, my top 10 words used less often are as follows:
you, are, have, they, at, an, we, if, do, to
I'm frankly not quite sure how I've avoided them given how common they are.
nomilk
For visibility, here's the tool where you can enter your hn username:
https://antirez.com/hnstyle?username=pg&threshold=20&action=...
keepamovin
This is a great example of what's possible, and of how true anonymity, even online, is only "technological threshold" anonymity. People obsessed with biometrics might not consider that this is another biometric.
Instead of just HN, now do it with the whole internet; imagine what you'd find. Then imagine that it's not being done already.
consp
None of my throwaways and not even my old account shows up. We are not at that level yet. ymmv.
tgv
This technique yields so many false positives and negatives, it's practically useless. Possibly it works reliably for mono-lingual, prolific writers. Someone like the Qanon shaman (or whatever the name was) might be picked up, if it doesn't happen to be a collective.
085744858
Except that technology is on the side of anonymity this time. LLMs can provide a pretty solid defense against such attacks — just ask ChatGPT to rewrite your message in a random writer's style. The issue is that you'll end up sounding like an LLM, but hey, tradeoffs.
Using throwaways whenever possible mitigates a lot of the risk, too.
keepamovin
That’s true. The old security versus convenience trade-off.
But if I were a government agency I would be pressing AI providers for data, or fingerprinting the output with punctuation/whitespace or something more subtle.
Though I guess with open models that people can run on device, that’s mitigated a lot.
paxys
It did find my "alt" (really an old account with a lost password), but the rest of the list – all users with very high match scores (0.8+) – is random.
Taking a look at comments from those users, I think the issue is that the algorithm focuses too much on the topic of discussion rather than style. If you are often in conversations about LLMs or Musk or self-driving cars, then you will inevitably end up using a lot of the same words as others in those discussions. There are only so many unique words you can use when talking about a technical topic.
I see in your post that you try to mitigate this by reducing the number of words compared, but I don't think that is enough to do the job.
lcnPylGDnU4H9OF
In case you haven't seen it, the author addressed this point of topic vs. style in a comment (albeit in a different context): https://news.ycombinator.com/item?id=43708474.
emporas
It did find an old account of mine that got banned, top of the list. I have to say, fingerprinting on 500 words, that's mind-blowing.
It focuses on topic a lot, that's true.
chrismorgan
I wonder how much curly quote usage influences things. I type things like curly quotes with my Compose key, and so do most of my top similars; and four or five words with straight quotes show up among the bottom ten in our analyses. (Also etc, because I like to write &c.)
I’m not going to try comparing it with normalising apostrophes, but I’d be interested how much of a difference it made. It could easily be just that the sorts of people who choose to write in curly quotes are more likely to choose words carefully and thus end up more similar.
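Anyone wanting to test that could normalise before fingerprinting; a one-liner sketch:

    # Map curly quotation marks onto their straight ASCII equivalents.
    STRAIGHTEN = str.maketrans({"\u2018": "'", "\u2019": "'",
                                "\u201c": '"', "\u201d": '"'})

    print("it\u2019s \u201ccurly\u201d".translate(STRAIGHTEN))  # it's "curly"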
dayvigo
Curly vs. straight quotes is mainly a mobile vs. desktop thing AFAIK. Not sure what Mac does by default, but Windows and Linux users almost exclusively use plain straight quotes everywhere.
chrismorgan
My impression is that iOS is the only major platform to even support automatically curlifying quotation marks. Maybe some Android keyboards are more sensible about it, but none that I’ve used make it anything but manual.
keepamovin
We can improve this. antirez has made a highly compelling PoC, but judging by the number of misses in the comments here, and by how this compares to the greater accuracy of the original post antirez refers to, it could be refined for authorship attribution. I’m no expert, but some ideas:
- remove super-high-frequency, non-specific words from the comparison bags, because they don’t distinguish much, have less semantic value, and may skew the data
- remove stop words (NLP definition of stop words)
- perform stemming/tokenization/depluralization etc (again, NLP standard)
- implement commutativity and transitivity in the similarity function
- consider words as hyperlinks to the sets of people who use them often enough, and do something Pageranky to refine similarity
- consider word bigrams, etc
- weight variations and misspellings higher as distinguishing signals
What are your ideas?
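As a sketch of the commutativity item above: one cheap approach is to keep only mutual matches, i.e. pairs where each account ranks the other in its own top-k. The `neighbors` structure here is hypothetical, not the project's actual data model:

    def mutual_matches(neighbors, k=20):
        # neighbors: dict mapping user -> list of most-similar users, best first.
        pairs = set()
        for a, near in neighbors.items():
            for b in near[:k]:
                if a in neighbors.get(b, [])[:k]:
                    pairs.add(tuple(sorted((a, b))))
        return pairs

    neighbors = {"alice": ["bob", "carol"], "bob": ["alice"], "carol": ["dave"]}
    print(mutual_matches(neighbors))  # {('alice', 'bob')}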
MivLives
Managed to find an alt I forgot I made and gave up using years ago. I do wonder about other high up people. Like what about our mutual histories makes us have similar word usage? Are we from the same areas or did we hang out in similar places online?
declan_roberts
This is exactly why HN needs to allow us to delete accounts.
gkbrk
It wouldn't change anything though. Unless you delete your comment / account a few minutes after you post, it's gonna get scraped and saved into a DB almost instantly. After that, the fact that HN deleted them won't save you from this.
>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.
This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.
[0] https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...
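The exact query isn't preserved above, but something along these lines works against the public mirror (assuming the `bigquery-public-data.hacker_news.full` dataset and the google-cloud-bigquery client):

    from google.cloud import bigquery

    client = bigquery.Client()  # needs a GCP project with BigQuery enabled
    query = """
        SELECT id, `by`, timestamp, text
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'comment'
        ORDER BY timestamp DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row["by"], row["timestamp"])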