Show HN: I modeled the Voynich Manuscript with SBERT to test for structure
141 comments · May 18, 2025
patcon
I see that you're looking for clusters within PCA projections -- you should look for deeper structure with hot new dimensionality reduction algorithms, like PaCMAP or LocalMAP!
I've been working on a project related to a sensemaking tool called Pol.is [1], reprojecting its wiki survey data with these new algorithms instead of PCA, and it's amazing what new insight they uncover!
https://patcon.github.io/polislike-opinion-map-painting/
Painted groups: https://t.co/734qNlMdeh
(Sorry, only really works on desktop)
[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...
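For anyone who wants to try it, a minimal sketch, assuming the pacmap package and an embedding matrix already on disk (the filename is a stand-in; recent pacmap releases also ship LocalMAP with the same interface):

    import numpy as np
    import pacmap  # pip install pacmap

    # embeddings: (n_samples, n_dims) array -- whatever you currently feed PCA
    embeddings = np.load("embeddings.npy")  # hypothetical file

    # Defaults are deliberately robust; n_neighbors=None lets PaCMAP pick a value
    reducer = pacmap.PaCMAP(n_components=2, n_neighbors=None, random_state=42)
    coords = reducer.fit_transform(embeddings)

    # LocalMAP variant (newer pacmap versions):
    # coords = pacmap.LocalMAP(n_components=2).fit_transform(embeddings)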
brig90
Thanks for pointing those out — I hadn’t seen PaCMAP or LocalMAP before, but that definitely looks like the kind of structure-preserving approach that would fit this data better than PCA. Appreciate the nudge — going to dig into those a bit more.
loxias
Try TDA ("mapper", or really, anything based on kernel density computed connectivity), it's a whole new world.
This ain't your parents' "factor analysis".
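A rough sketch with the kmapper (Kepler Mapper) package, which implements Mapper. Note it produces a graph of overlapping clusters rather than a 2-D embedding, so it's a different kind of object than a PCA/UMAP plot (filenames and parameters here are illustrative):

    import numpy as np
    import kmapper as km  # pip install kmapper
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import PCA

    X = np.load("embeddings.npy")  # hypothetical: the same vectors you'd hand to PCA

    mapper = km.KeplerMapper(verbose=0)
    lens = mapper.fit_transform(X, projection=PCA(n_components=2))  # the "filter"
    graph = mapper.map(lens, X,
                       cover=km.Cover(n_cubes=10, perc_overlap=0.5),
                       clusterer=DBSCAN(eps=0.5, min_samples=3))
    mapper.visualize(graph, path_html="mapper_graph.html")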
patcon
Ooooo I will definitely check it out! It's strangely hard to find any comparisons in YouTube videos -- it seems TDA isn't actually a dimensionality reduction algorithm, but something closely related, maybe?
khafra
LLM model interpretability also uses Sparse Autoencoders to find concept representations (https://openai.com/index/extracting-concepts-from-gpt-4/), and, more recently, linear probes.
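For intuition, a toy sparse autoencoder in PyTorch, in the spirit of that interpretability work but at toy scale (the dimensions and L1 coefficient are illustrative; 384 is just MiniLM's embedding width):

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        # Overcomplete autoencoder; the L1 penalty pushes the code to be
        # sparse, so individual hidden units tend toward interpretable concepts
        def __init__(self, d_model=384, d_hidden=4096):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model)

        def forward(self, x):
            code = torch.relu(self.encoder(x))
            return self.decoder(code), code

    def sae_loss(x, x_hat, code, l1_coeff=1e-3):
        return ((x - x_hat) ** 2).mean() + l1_coeff * code.abs().mean()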
staticautomatic
I’ve had much better luck with UMAP than PCA and t-SNE for reducing embeddings.
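For reference, the minimal UMAP call, assuming umap-learn and an (n, d) embedding array; the cosine metric is a common choice for sentence embeddings:

    import umap  # pip install umap-learn

    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
    coords = reducer.fit_transform(embeddings)  # embeddings: (n, d) numpy array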
patcon
PaCMAP (and its descendant LocalMAP) are comparable to t-SNE at preserving both local and global structure (but without requiring much fiddling with finicky hyperparameters).
minimaxir
A point of note is that the text embeddings model used here is paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-mult...), which is about 4 years old. In the NLP world that's effectively ancient: global LLM improvements have made even small embedding models dramatically more robust, both in how much information they represent and in how distinct texts are within the embedding space. Even modern text embedding models not explicitly trained for multilingual support still do extremely well on that type of data, so they may work better for the Voynich Manuscript, which is effectively an unknown language.
The traditional NLP techniques of stripping suffixes and identifying POS may actually harm embedding quality rather than improve it, since they remove relevant contextual data from the global embedding.
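If anyone wants to try swapping models, a sketch with sentence-transformers; the model named here is just one plausible modern multilingual choice, not a recommendation from the repo:

    from sentence_transformers import SentenceTransformer

    # Model choice is illustrative -- any recent embedding model slots in
    model = SentenceTransformer("intfloat/multilingual-e5-small")
    words = ["okeeodair", "qokedy", "chedy"]  # unstripped EVA tokens
    vectors = model.encode(words, normalize_embeddings=True)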
brig90
Totally fair — I defaulted to paraphrase-multilingual-MiniLM-L12-v2 mostly for speed and wide compatibility, but you’re right that it’s long in the tooth by today’s standards. I’d be really curious to see how something like all-mpnet-base-v2 or even text-embedding-ada-002 would behave, especially if we keep the suffixes in and lean into full contextual embeddings rather than reducing to root forms.
Appreciate you calling that out — that’s a great push toward iteration.
Ey7NFZ3P0nzAe
Be careful: they have super short context lengths AND silently crop the text if it's too long. To me there is really no reason to use them.
I recommend Ollama to run the arctic-embed-v2 model; it's also multilingual, and you can use --quantize when loading the Modelfile to get it even smaller.
thih9
(I know nothing about NLP)
Does it make sense to check the process with a control group?
E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?
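A sketch of what such a control might look like; the prefix/suffix inventory here is invented for illustration, not taken from the manuscript:

    import random

    # Hypothetical control: pseudo-words assembled from a small prefix/suffix
    # inventory, loosely mimicking the prefix+suffix shape ascribed to Voynichese
    prefixes = ["qo", "ch", "sh", "ok", "ot", "da"]
    suffixes = ["aiin", "dy", "ol", "or", "ey", "edy"]

    rng = random.Random(0)
    def fake_line(n_words=10):
        return " ".join(rng.choice(prefixes) + rng.choice(suffixes)
                        for _ in range(n_words))

    control_corpus = [fake_line() for _ in range(500)]
    # Run control_corpus through the identical pipeline (embed, cluster,
    # Markov matrix) and compare cluster separation and transition entropy
    # against the real text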
flir
I suppose if you've got a hypothesis about how it was written (eg the Cardan grille method) you could generate some texts via that method and see if they display the same characteristics?
awinter-py
yes exactly, why did we not simply ask 100 people to write voynich manuscripts and then train on that dataset
cedws
I had a look at the manuscript for a while and found it suspicious how tightly packed the writing was against the illustrations on some pages. In ordinary writing, words and letters vary in width, so as you approach the end of a line you naturally insert a break and begin the next word on a new line to avoid overrun. The manuscript is missing these kinds of breaks -- I saw many places where it looked like whatever letter might squeeze in had been written at the end of the line.
I wanted to do an analysis of what letters occur just before/after a line break to see if there is a difference from the rest of the text, but couldn't find a transcribed version.
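Transcriptions do exist (the EVA transliteration files at voynich.nu, linked downthread); given one as plain text, the analysis is a few lines (the filename is a stand-in):

    from collections import Counter

    # Assumes a plain-text EVA transliteration, one manuscript line per file line
    with open("voynich_eva.txt") as f:  # hypothetical filename
        lines = [ln.strip() for ln in f if ln.strip()]

    line_final = Counter(ln[-1] for ln in lines)
    overall = Counter(ch for ln in lines for ch in ln if ch != " ")

    n_final, n_all = sum(line_final.values()), sum(overall.values())
    for ch, n in line_final.most_common(10):
        print(ch, round(n / n_final, 3), round(overall[ch] / n_all, 3))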
My completely amateur take is that it's an elaborate piece of art or hoax.
IAmBroom
Some languages do this by split-
ting words at the end of lines.
tetris11
UMAP or t-SNE would be nice, even if PCA already shows nice separation.
Reference mapping each cluster to all the others would be a nice way to indicate that there's no variability left in your analysis
brig90
Great points — thank you. PCA gave me surprisingly clean separation early on, so I stuck with it for the initial run. But you’re right — throwing UMAP or t-SNE at it would definitely give a nonlinear perspective that could catch subtler patterns (or failure cases).
And yes to the cross-cluster reference idea — I didn’t build a similarity matrix between clusters, but now that you’ve said it, it feels like an obvious next step to test how much signal is really being captured.
Might spin those up as a follow-up. Appreciate the thoughtful nudge.
lukeinator42
Do you have examples of how this reference mapping is performed? I'm interested in this for embeddings in a different modality, but don't have as much experience on the NLP side of things
tetris11
Nothing concrete, but you essentially perform shared nearest neighbours using anchor points to each cluster you wish to map to. These form correction vectors you can then use to project from one dataset to another
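A sketch of one reading of that description, using per-cluster centroids as the anchor points (this is an interpretation of the idea, not a reference implementation):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Pair each cluster anchor (e.g., centroid) in dataset A with its nearest
    # anchor in dataset B; the offsets become per-cluster correction vectors
    # for projecting A into B's space.
    def correction_vectors(anchors_a, anchors_b):
        nn = NearestNeighbors(n_neighbors=1).fit(anchors_b)
        _, idx = nn.kneighbors(anchors_a)
        return anchors_b[idx[:, 0]] - anchors_a  # (k, d) offsets

    def project(points_a, labels_a, anchors_a, anchors_b):
        vecs = correction_vectors(anchors_a, anchors_b)
        return points_a + vecs[labels_a]  # shift each point by its cluster's offset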
jszymborski
When I get nice separation with PCA, I personally tend to eschew UMAP, since the relative distance of all the points to one another is easier to interpret. I avoid t-SNE at all costs, because distance in those plots are pretty much meaningless.
(Before I get yelled at, this isn't prescriptive, it's a personal preference.)
minimaxir
PCA having nice separation is extremely uncommon unless your data is unusually clean or has obvious patterns. Even for the comically-easy MNIST dataset, the PCA representation doesn't separate nicely: https://github.com/lmcinnes/umap_paper_notebooks/blob/master...
jszymborski
"extremely uncommon" is very much not my experience when dealing with well-trained embeddings.
I'd add that just because you can achieve separability from a method, the resulting visualization may not be super informative. The distance between clusters that appear in t-SNE projected space often have nothing to do with their distance in latent space, for example. So while you get nice separate clusters, it comes at the cost of the projected space greatly distorting/hiding the relationship between points across clusters.
tomrod
We are of a like mind.
DonaldFisk
This is very interesting. You should post a link to https://www.voynich.ninja/index.php
I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, so as prefixes are quite short, you've lost roughly half the information before commencing your analysis.
You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones to meaningless gibberish). Gordon Rugg, Torsten Timm, and I have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...
Avicebron
Maybe I missed it in the README, but how did you do the initial encoding for the "words"? So for example, if you have "okeeodair" as a word, where do you map that back to the original symbols?
brig90
Yep, that’s exactly right — the words like "okeeodair" come directly from the EVA transliteration files, which map the original Voynich glyphs to ASCII approximations. So I’m not working with the glyphs themselves, but rather the standardized transliterated words based on the EVA (European Voynich Alphabet) system. The transliterations I used can be found here: https://www.voynich.nu/
I didn’t re-map anything back to glyphs in this project — everything’s built off those EVA transliterations as a starting point. So if "okeeodair" exists in the dataset, that’s because someone much smarter than me saw a sequence of glyphs and agreed to call it that.
us-merul
I’ve found this to be one of the most interesting hypotheses: http://voynichproject.org/
The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.
I’ve also come across accounts that it might be an Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.
veqq
This thread discusses the many purported "solutions": https://www.voynich.ninja/thread-4341.html While Bernholz' site is nice, Child's work doesn't shed much light on actually deciphering the MS.
us-merul
Thanks for this! I had come across Child’s hypothesis after doing a search related to Old Prussian and Slavic languages, so I don’t have much context for this solution, and this is helpful to see.
philistine
With how undecipherable the manuscript is, my personal theory is that it's the work of a naive artist and that there's no language behind it; just someone aping language without knowing the rules of language: https://en.wikipedia.org/wiki/Naïve_art
It's not a mental issue, it's just a rare thing that happens. Voynich fits the bill for the work of a naive artist.
cronopios
And that naïve artist somehow managed to create a work that follows Zipf's law, 4 centuries before it was discovered?
DonaldFisk
Random Texts Exhibit Zipf’s-Law-Like Word Frequency Distribution: https://www.nslij-genetics.org/wp-content/uploads/2022/12/ie...
It also applies to a range of natural phenomena, e.g. lunar craters and earthquakes: https://www.cs.cornell.edu/courses/cs6241/2019sp/readings/Ne...
So the fact that word frequencies in the Voynich Manuscript follow Zipf's law doesn't prove it's written in a natural language.
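For anyone who wants to check this on their own corpus, a small sketch of the usual rank-frequency fit:

    import numpy as np
    from collections import Counter

    def zipf_slope(words):
        # Slope of log(frequency) vs log(rank); natural text typically sits
        # near -1, but per the papers above, so do many generated texts
        freqs = sorted(Counter(words).values(), reverse=True)
        ranks = np.arange(1, len(freqs) + 1)
        slope, _ = np.polyfit(np.log(ranks), np.log(np.array(freqs, dtype=float)), 1)
        return slope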
poulpy123
Why would it not?
riffraff
You're not alone. Many have hypothesized it's just made-up gibberish, given the unusual distribution of glyphs.
Not a recent hoax/scam, but an ancient one.
It's not like there weren't a ton of fake documents in the Middle Ages and Renaissance, from the Donation of Constantine to Prester John's letter.
philistine
The way you describe it is why it’s not readily accepted. It’s misunderstood. You called it a hoax/scam and a fake. It’s not!
Whoever made the document was sincere in making up something that doesn't exist. They had no intention to mislead. You wouldn't call a D&D campaign a hoax just because it features nonexistent things, would you?
GolfPopper
Edward Kelly[1] was in the right place at the right time, and I recall reading many years ago (though I cannot now find the source) some evidence that he was familiar with the Cardan grille[2], which was sufficient to convince me that he was most likely the author, and that the book was intended as a hoax or fraud.
renhanxue
These days the manuscript is quite conclusively dated to the first half of the 15th century; the parchment it's written on is definitely from that period, since it's been carbon dated to 1404–1438 with 95% confidence. The general style is also consistent with that dating. For example, medievalist Lisa Fagin Davis writes in a recent paper: "[t]he humanistic tendencies of the glyphset, the color palette, and style of the illustrations suggest an origin in the early fifteenth century" [0].
Edward Kelly was born over a hundred years later, so him "being at the right time" seems to be a bit of a stretch.
emmelaich
I think it's entirely possible the inks are much later. Possibly Kelly erased whatever was on the parchment previously. In fact the drawings might have made liberal use of the original, just to hide that fact.
Which is worse actually. Kelly may have semi-erased an existing valuable manuscript.
quantadev
Being from the 15th century, the obvious reason to encrypt text was to avoid religious persecution during the Inquisition (and other religion-motivated violence of that time). So it would be interesting to run the same NLP against the Gospels and look for correlations. You'd want to do a 'word'-based comparison first, then a 'character'-based one -- that is, compare the graphs from the Bible to the graphs from Voynich.
Also, there might be some characters that are in there just to confuse. For example, that bizarre capital "P"-like glyph with multiple variations sometimes seems to appear far too often to represent real language, so it might just be an obfuscator that's removed prior to decryption. There may be other abnormally frequent characters that are also unused dummy characters. But the "too many Ps" problem is consistent with pure fiction too, I realize.
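One hedged way to compare the "graphs" across texts that share no symbols is to compare rank-frequency profiles, which ignore the symbols themselves; the token lists below are stand-ins for real corpora:

    import numpy as np
    from collections import Counter
    from scipy.spatial.distance import jensenshannon

    def rank_freq_profile(tokens, k=500):
        # Top-k rank-frequency profile, normalized; comparable across scripts
        freqs = sorted(Counter(tokens).values(), reverse=True)[:k]
        v = np.array(freqs + [0] * (k - len(freqs)), dtype=float)
        return v / v.sum()

    voynich_tokens = "daiin okeeodair qokedy daiin chedy".split()  # stand-in data
    gospel_tokens = "in the beginning was the word".split()        # stand-in data

    # Near-zero divergence would mean similarly shaped frequency distributions
    d = jensenshannon(rank_freq_profile(voynich_tokens),
                      rank_freq_profile(gospel_tokens))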
codesnik
What I'd expect from a handwritten book like that, if it's just gibberish and not a cipher of any sort: the style, calligraphy, the words used, even the letterforms themselves should evolve from page 1 to the last page. Pages could be reordered, of course, but it should still be noticeable.
Unless the author had written tens of books exactly like it before that didn't survive, of course.
I don't think it's a very novel idea, but I wonder if there's analysis of patterns like that. I haven't seen mentions of page-to-page consistency anywhere.
veqq
> I haven't seen mentions of page to page consistency anywhere.
A lot of work's been done here. There are believed to have been 2 scribes (see Prescott Currier), although Lisa Fagin Davis posits 5. Here's a discussion of an experiment working off of Fagin Davis' position: https://www.voynich.ninja/thread-3783.html
empath75
My favorite part of this thread is like a dozen different people replying that it's already been deciphered and none of them posted the same one.
bunderbunder
> Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork.
I'd argue that these are just the camps that non-traditional, amateur analysis efforts fall into. I've only briefly skimmed Voynich work, but my impression is that, traditionally, more academic analyses rely on a combination of linguistic and cryptological analysis. This does happen to be informed by some statistical analysis, but goes way beyond that.
For example, as I recall the strongest argument that Voynichese probably isn't just an alternative alphabet for a well-known language relies on comparing Voynichese to the general patterns for how writing systems map symbols to sounds. That permits the development of more specific hypotheses about how it could possibly function, including how likely it is to be an alphabet or abjad, and, hypotheses about which characters could plausibly represent more than one sound, possible digraphs, etc. All of that work casts severe doubt on the likelihood of it representing a language from the area because it just can't plausibly represent a language with the kinds of phonological inventories we see in the language families that existed in that place and time.
There's also been some pretty interesting work on identifying individual scribes based on a confluence of factors including, but not limited to, analysis of the text itself. Some of the inferred scribes exclusively wrote in the A language (oh yeah, Voynichese seems to contain two distinct "languages"), some exclusively wrote in the B language, I think they've even hypothesized that there's one who actually used both languages.
There isn't a lot of popular awareness of this work because it's not terribly sexy to anyone but a linguistics nerd. But I'd guess that any attempt to poke at the Voynich manuscript that isn't informed by it is operating at a severe disadvantage. You want to be standing on the shoulders of the tallest giants, not the ones with the best social media presence.
I built this project as a way to learn more about NLP by applying it to something weird and unsolved.
The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?
I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption — I call it out directly in the repo — but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred POS-like roles based on position and frequency, and built a Markov transition matrix to visualize cluster-to-cluster flow.
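For readers who want the shape of the pipeline without opening the repo, a condensed sketch (the suffix list, cluster count, and token file are illustrative stand-ins; see the repo for the real values):

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    SUFFIXES = ("aiin", "dy")  # subset shown here; the repo lists the full set

    def strip_suffix(word):
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                return word[: -len(s)]
        return word

    words = open("voynich_eva_words.txt").read().split()  # hypothetical token file
    roots = [strip_suffix(w) for w in words]

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    X = model.encode(roots)

    k = 8  # illustrative; the repo's actual cluster count may differ
    labels = KMeans(n_clusters=k, n_init="auto", random_state=0).fit_predict(X)

    # Markov transition matrix over consecutive cluster labels (ignores line
    # boundaries for simplicity; the repo also breaks this out by section)
    T = np.zeros((k, k))
    for a, b in zip(labels, labels[1:]):
        T[a, b] += 1
    T = T / np.clip(T.sum(axis=1, keepdims=True), 1, None)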
It’s not translation. It’s not decryption. It’s structural modeling — and it revealed some surprisingly consistent syntax across the manuscript, especially when broken out by section (Botanical, Biological, etc.).
GitHub repo: https://github.com/brianmg/voynich-nlp-analysis
Write-up: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...
I’m new to the NLP space, so I’m sure there are things I got wrong — but I’d love feedback from people who’ve worked with structured language modeling or weird edge cases like this.