Using generative AI as part of historical research: three case studies
210 comments
January 22, 2025
eviks
carschno
I wanted to say this, but could not express it as well. I think what your points also reveal is the biggest success factor of ChatGPT: it can do many things that specialised tools have been doing (better), but many ChatGPT users had not known about those tools.
I do understand that a mere user of e.g. OCR tooling does not perform a systematic evaluation of the available tools, although that would be the scientific way to choose one. For a researcher, however, the lack of knowledge about the tooling ecosystem seems concerning.
z3c0
Knowing about tooling is what you hire people for, but that's the opposite of the value prop being put forward by OpenAI. Given that ChatGPT is rumored to be a Mixture-of-Experts model, it's clear that OpenAI recognized the impressively mediocre jack-of-all-trades-master-of-none that any LLM is destined to be on its own.
If they focused on making a high-performing system on a single task (the original approach to problems like OCR before all the GPTs came around), they would have a substantially smaller market to reach, which isn't interesting to investors.
simonw
Full quote:
> Granted, Monte had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.
He isn't talking about weeks of training to learn to use OCR software, he means weeks of training to learn to read that handwriting without any assistance from software at all.
eviks
And this would change how? If you needed to learn to read it before despite being able to use OCR, why would this new tool allow you to not learn anything?
Or, to get back to my original comment, if it's ok to be illiterate, why would you need weeks to learn to use an alternative OCR tool?
simonw
Why are you talking about spending weeks learning to use an OCR tool?
pjc50
Do you know any OCR tools that work on early modern English handwriting?
conjectures
I used to work for a historical records org. As of 10 years back, "OCR" meant getting humans to transcribe such work. So whatever the limitations of genai, my prior is against there being a perfectly good old-fashioned OCR solution to the 'obscure historical handwriting' problem.
simonw
I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.
dr_dshiv
I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).
Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.
I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time (see the sketch below). This can be contrasted with a perfect critical edition.
And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn’t obscure—he invented the median and created the field of “empirical aesthetics.” A quick translation of some of his work with Claude immediately revealed the concept I was looking for. Luckily, I had a German speaker around to validate the translation…
LLMs will have a huge impact on humanities scholarship; we need methods and evals.
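A minimal sketch of what that sampling-based eval might look like (the function names, default sample size, and normal-approximation interval are all assumptions here, not an existing pipeline):

```python
import math
import random

def estimate_error_rate(page_ids, expert_check, sample_size=30, seed=42):
    """Estimate a corpus-wide transcription error rate from an expert
    review of a random sample of pages.

    expert_check(page_id) -> (errors_found, words_checked) is supplied
    by the reviewing expert.
    """
    random.seed(seed)
    sample = random.sample(page_ids, min(sample_size, len(page_ids)))
    errors = words = 0
    for page in sample:
        e, w = expert_check(page)
        errors += e
        words += w
    rate = errors / words
    # 95% normal-approximation confidence interval on the error rate
    margin = 1.96 * math.sqrt(rate * (1 - rate) / words)
    return rate, (max(0.0, rate - margin), rate + margin)
```

As more pages get reviewed, the interval narrows, giving readers a quantified (if imperfect) trust level for the unreviewed bulk of the corpus.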
benbreen
Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.
grobbyy
A basic problem is they're trained on the Internet, and take on all its biases. Ask any of them who proposed edX to MIT or who wrote the platform. You'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get the factual story.
The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.
dmix
Couldn’t LLMs cite primary sources much the same way a textbook or Wikipedia does? That's how you circumvent the biases in textbook and Wikipedia summaries.
bandrami
They can, but they also hallucinate non-existent references.
simonw
A raw LLM is a bad tool for citations, because you can't guarantee that their model weights will contain accurate enough information to be citable.
Instead, you should find the primary sources through other means and then paste them into the LLMs to help translate/evaluate/etc, which is what this author is doing.
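A minimal sketch of that paste-the-source workflow, assuming the OpenAI Python client (the model name, filename, and prompt are illustrative, not the author's actual setup):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The researcher locates and verifies the primary source first; the
# model is only asked to work on the text, not to recall or cite it.
with open("transcribed_source.txt", encoding="utf-8") as f:  # hypothetical file
    source_text = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model
    messages=[
        {"role": "system", "content": (
            "You are assisting a historian. Translate the following early "
            "modern text into English, and flag any words you are unsure "
            "about rather than guessing silently."
        )},
        {"role": "user", "content": source_text},
    ],
)
print(response.choices[0].message.content)
```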
Almondsetat
Circumventing the bias would mean providing a uniform sampling of the primary sources, which is not guaranteed to happen
dartos
This is a showcase of exactly what LLMs are good at: handwriting recognition (a classic neural network application) and surfacing information and ideas, however flawed, that one may not have had oneself.
This is really cool. This is AI augmenting human capabilities.
BeefWellington
Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.
Then again, maybe not; OCR is one of the most worked-on problems, so the quality of parsing characters into text maybe shouldn't be as surprising.
Off topic: it's wild to me that in 2025 sites like Substack don't apply `prefers-color-scheme` logic to all their blogs.
throwup238
> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)
I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).
Everyone who develops a model has to jump through the benchmark hoop which we all use to measure progress but we don’t even have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks but it doesn’t feel like we’re getting any closer to true intelligence, just flattening its expression into next token prediction (aka everything is a vector).
voidhorse
Yeah precisely. Ever since the "brain as computer" metaphor was birthed in the 50s-60s the chief line of attack in the effort to make "intelligent" machines has been to continually narrow what we mean by intelligence further and further until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of our lowering the bar for what constitutes intelligence than by actually producing anything we'd consider remotely capable of the same ingenuity as the average human being.
afthonos
I find this take strange. My observation has been the opposite. We used to say it would take human intelligence to play chess. Then Deep Blue came up and we said, no, not like that. Then it was go. Then AlphaGo came up and we said no, not like that. Along the way, it was recognizing images. And then AlexNet came along, and we said no, not like that. Then it was creating art, and then LLMs came along, and we said no, not like that.
I agree a narrowing has happened. But the narrowing is to move us closer to saying "if it's not implemented in a brain, located inside a skull, in a body that was developed by DNA-coded cells replicating in a controlled manner over a period of years, it's not really AI."
There's an emotional attachment to intelligence being what makes us human that causes people to lose their minds when machines approach our intelligence. Machines aren't humans. If we value humanity, we should recognize that distinction—even as machines become intelligent and even sentient.
And we should definitely think twice, or, you know, many many many many more times, before building intelligent machines. But I don't think pretending we're not doing that right now is helpful.
simonw
This is called the "AI effect" - the constant shifting of goalposts when the term AI is used, which has been going on for 50+ years at this point: https://en.m.wikipedia.org/wiki/AI_effect
jolmg
> explicación poética
> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.
The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" is not even a thing in Spanish. "Physical" is "física", at least today, IDK about the 1700s. So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading rather than the writer "miswriting", I can see why it assumes it might say "poética", even though that makes less sense semantically.
benbreen
Author here, I agree that my read may not be correct either. It’s tough to make out. Although keep in mind that “ph” is used in Latin and Greek (or at least transliterations of Greek into the Roman alphabet), so in an early modern medical context (i.e. one in which it is assumed the reader knows Latin, regardless of the language being used) “ph” is still a plausible start to a word. Early modern spelling in general is famously variable; it's common to see an author spell the same word two different ways in the same text.
jolmg
> So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading
> I agree that my read may not be correct either
Just in case, by "you", I meant from the POV of the AI, not you the author.
That's interesting to know about "ph". I didn't know it was present in Latin, and I wonder if that's also the case with Spanish.
schoen
I just looked in the Corpus Diacrónico del Español
https://corpus.rae.es/cordenet.html
and it found 33 hits for "phisica" and 99 for "phisico", mostly from the 1490s. Now some of these can be deceptive, like a few are from a bilingual Spanish-Latin book and occur in the Latin portions rather than the Spanish portions, but it seems like some authors in the 1400s wrote "ph" in some Spanish words, at least when they knew the Latin or Greek etymologies.
I don't know when the Iberian languages first got their more phonetic orthographies, especially suppressing that h (that was originally in Latin digraphs used to transliterate Greek letters θ, φ, χ).
Edit: There are also about two dozen hits for physico/physica, interestingly more from the 1700s rather than 1400s.
Animats
"LLMs, which are exquisitely well-tuned machines for finding the median viewpoint on a given issue..."
That's an excellent way to put it. It's the default mode of an LLM. You can ask an LLM for biases, and get them, of course.
astrange
I don't think there is any reason to believe this except that everyone seems to want it to be true.
An easy way to make it not be true would be to emphasize some sources in pretraining by putting them in the corpus multiple times.
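For instance, a crude sketch of that kind of upweighting (purely illustrative; real pretraining pipelines weight sources in more deliberate ways):

```python
import random

def build_training_stream(docs_by_source, weights, n_samples, seed=0):
    """Sample pretraining documents with per-source weights.

    Raising a source's weight is equivalent to duplicating its
    documents in the corpus, pulling the model's 'median' viewpoint
    toward that source.
    """
    random.seed(seed)
    picks = random.choices(range(len(docs_by_source)), weights=weights, k=n_samples)
    return [random.choice(docs_by_source[i]) for i in picks]
```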
miki123211
A much better way is to RLHF the LLM until you get the behavior you want.
As far as I know, modern LLMs try to strike a balance: being somewhat neutral, while not being too neutral on topics outside of the Overton window. They'll give you a "both sides have their good points" argument on abortion, religion, guns or immigration, but won't do that for obvious racism or Nazi viewpoints.
Early LLMs had a problem with getting this balance right, I feel like many of them were a lot more left-leaning. I don't know how much of the change is caused by us understanding the technology better and how much is just the political winds shifting, though.
I felt like we had a moment there when some models were a bit too "well it depends", even on very uncontroversial subjects.
pjc50
> I feel like many of them were a lot more left-leaning
"Reality has a liberal bias"
dleeftink
Maybe not 'median' but rather 'sufficiently representative': as with all distributional semantics, given a large enough corpus we can approach the 'true' distribution of words/phrases in a given language.
krainboltgreene
Except the corpus itself is a fraction of all media. This is like saying Twitter is sufficiently representative of all human history.
zwischenzug
I wrote this piece in 2023, which argues similarly that LLMs are a boon, not a threat, to historians:
https://zwischenzugs.com/2023/12/27/what-i-learned-using-pri...
adamredwoods
>> One of the well-known limitations with ChatGPT is that it doesn’t tell you what the relevant sources are that it looked at to generate the text it gives you.
This isn't a limitation, this is critically dangerous. Commercial AI is a centralized, controlled, biased LLM. At what point will someone train it to say something they want people to believe? How can it be trusted?
Consensus based information is still best, and I don't feel LLMs will give us that.
thom
This is the thing I specifically use LLMs for when I’m doing history courses. I’ll remember some vague quote or event and ask for the primary sources, and the latest ChatGPTs are excellent at getting the right reference, which I can then look up and check myself. Maybe this works better for Latin and Greek texts since it’s gobbled up all the Loebs out there, but it works well for me.
lmm
Consensus based history has similar problems. It's extremely easy for the consensus to be distorted by contemporary politics.
delichon
On the contrary. The heart of an LLM is a next-word predictor, based on statistics. They do much the same with concepts, making them essentially consensus distillation devices. They are zeitgeisters. They get weird mainly when their training data is too sparse to find actual consensus, so instead tell you to stick cheese to your pizza with glue.
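A toy illustration of that "consensus distillation" (a bigram counter, nothing like a real transformer, but it shows how a statistical next-word predictor surfaces the most common continuation in its corpus):

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each word, how often each next word follows it."""
    model = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        model[current][nxt] += 1
    return model

def predict(model, word):
    """Return the corpus 'consensus': the most common continuation."""
    return model[word].most_common(1)[0][0]

corpus = "the cat sat on the mat because the cat was tired".split()
print(predict(train_bigram(corpus), "the"))  # 'cat': it follows 'the' most often
```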
astrange
> They get weird mainly when their training data is too sparse to find actual consensus, so instead tell you to stick cheese to your pizza with glue.
That's exactly not how that happened. That happened because Google's summaries are based on their search results and one of the search results contained that.
ericjmorey
This is only useful if you know what data was used to train the model.
dang
Discussed here!
What I learned using private LLMs to write an undergraduate history essay - https://news.ycombinator.com/item?id=38813297 - Dec 2023 (81 comments)
gcanyon
I wonder (and hope) that for any given issue, the majority of the internet/the training data, and therefore the model's output, will be fairly near the truth. Maybe not for every topic, but most.
E.g., the models won't report that unicorns are real because the majority of the internet doesn't report that unicorns are real. Of course, there may be issues (like ghosts?) where the majority of the internet isn't accurate?
DennisP
It was pretty neat seeing this because a recent paper found that AI models are bad historians: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-histo...
But the gist of its argument just seems to be that they don't know fine details of history, and make the same generalized assumptions that humans would make with only a cursory knowledge of a particular topic. This seems unavoidable for a model that compresses a broad swath of human knowledge down to a couple hundred gigabytes.
Using AI as a research tool instead of a fact database is of course a whole different thing.
trgn
One thing I'd love is if models could help me confirm a thing, or find the source of something I have a vague memory of and which may be right or wrong; I just don't know.
E.g. I have this recollection of a quote, slightly pithy, from around the 1900s about hobby clubs controlling social life, maybe from Mark Twain, maybe not.
I just cannot come up with the prompt that gets me the answer; instead I just get hallucination after hallucination, confirming whatever I put in, like a student who didn't study for the test and is just going along with whatever the professor asks at the oral exam.
For a case study, it would be nice if the case were actually studied…
> had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.
Why would you need weeks of training to use some OCR tool? There's no comparison to any alternatives actually in use in the article. And only using "unusually legible" handwriting isn't that relevant for the… usual cases
> This is basically perfect,
I’ve counted at least 5 errors in the first line; how is this anywhere close to perfection???
Same with translation: first, is this an obscure text with no existing translation to compare the accuracy against, instead of relying on your own poor knowledge? Second, what about existing tools?
> which I hadn’t considered as being relevant to understanding a specific early modern map, but which, on reflection, actually are (the Peter Burke book on the Renaissance sense of the past).
How?
> Does this replace the actual reading required? Not at all.
With seemingly irrelevant books like the previous one, yes, it does; the poor student has a rather limited time budget