
Using generative AI as part of historical research: three case studies

eviks

For a case study, it would be nice if the case were actually studied…

> had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

Why would you need weeks of training to use some OCR tool? There's no comparison to any alternatives in actual use in the article. And testing only "unusually legible" handwriting isn't that relevant for the… usual cases

> This is basically perfect,

I’ve counted at least 5 errors in the first line; how is this anywhere close to perfection???

Same with translation: first, is this an obscure text that has no existing translation to compare the accuracy to instead of relying on your own poor knowledge? Second, what about existing tools?

> which I hadn’t considered as being relevant to understanding a specific early modern map, but which, on reflection, actually are (the Peter Burke book on the Renaissance sense of the past).

How?

> Does this replace the actual reading required? Not at all.

With seemingly irrelevant books like the previous one, yes, it does; the poor student has a rather limited time budget

benbreen

I agree, I probably should've gone into more detail on the actual case studies and implications. I may write this up as a more academic article at some point so I have space to do that.

To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.

Speaking entirely for myself here, a pretty significant part of what professional historians do is to take a ton of photos of hard-to-read archival documents, then slowly puzzle them out after the fact - not by using any OCR tool (because none of them that I'm aware of are good enough to deal with difficult paleography) but the old fashioned way, by printing them out, finding individual letters or words that are readable, and then going from there. It's tedious work and it requires at least a few days of training to get the hang of.

If anyone wants to get a sense of what this paleography actually looks like, this is something I wrote about back in 2013 when I was in grad school - https://resobscura.blogspot.com/2013/07/why-does-s-look-like...

For those looking for a specific example of an intermediate-difficulty level manuscript in English, that post shows a manuscript of the John Donne poem "A Triple Fool" which gives a sense of a typical 17th century paleography challenge that GPT-4o is able to transcribe (and which, as far as I know, OCR tools can't handle - though please correct me if I'm wrong). The "Sea surgeon" manuscript below it is what I would consider advanced-intermediate and is around the point where GPT-4o, and probably most PhD students in history, gets completely lost.

re: basically perfect, the errors I see are entirely typos which don't change the meaning (descritto instead of descritta, and the like). So yes, not perfect, but not anything which would impact a historical researcher. In terms of existing tools for translation, the state of the art that I was aware of before LLMs is Google Translate, and I think anyone who tries both on the same text can see which works better there.

re: "irrelevant books," there's really no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary. For that reason, in my own work, this is very much about augmenting rather than replacing human labor. The main work begins after this sort of LLM-augmented research. It isn't replaced by it in any way.

eviks

> To your point about OCR: I think you'll find that the existing OCR tools will not know where to begin with the 18th century Mexican medical text in the second case study. If you can find one that is able to transcribe that lettering, please do let me know because it would be incredibly useful.

My point about OCR is that you haven't done any comparison, and you're now making the same mistake of claiming without any evidence. The most basic tool, from Google Translate, does "know where to begin"; it doesn't even make the "physical" mistake, though it makes others. It also knows where to begin with the image from your second post, although it seems worse there. And it's not the state of the art, and I don't know what that is for Spanish either, but again, that wasn't my point. You do not have a care-free option: to be able to understand that "physical" mistake you'd still need to read the source, which means you still need those days/weeks of training

> none of them that I'm aware of are good enough to deal with difficult paleography

And you haven't demonstrated anything re. difficult paleography for the LLMs in your article either!

> entirely typos which don't change the meaning

First, you'd need to actually demonstrate that, and that would require the full accounting which you haven't done (and no, I don't plan to do that either). A typo could be in a name or a year, which is bound to have some impact on a historical researcher: he'd try searching for a misspelled name and find nothing, while there could've been an interesting connection in some other text.

> translation, the state of the art that I was aware of before LLMs is Google Translate, and I think anyone who tries both on the same text can see which works better there.

Yes, do try it, for example, in DeepL, to see that it's not any worse

> no way to make an objective statement about what's relevant and what's not until you actually read something rather than an AI summary

Sure, but presumably you've done that before making the claim of relevance "on reflection"? So how is it relevant to demand this "human labor" of the students?

benbreen

Such a confusing comment, because when I enter the text from case study #1 into DeepL, it's very clearly much worse than what Claude or GPT-4o can come up with (the first few lines from DeepL are: "With his expositions to all the Tables, particularly of the quality of the countries, et of the most notable things, to be found in them. Which Tables, can be, and t are taught to reduce' together" and so on).

Likewise with using Google translate on both case studies #1 and #2 - the results are self-evidently far worse. In both cases there were multiple errors in each line and in case study #2 it was entirely unable to transcribe or translate the title line. If you see this, please email me at bebreen [at] ucsc dot edu to share the better results you are seeing because I genuinely am interested and open to using alternative tools - I just am not seeing what you are seeing, apparently.

In terms of typos not changing the meaning, yes, naturally a real human needs to double-check absolutely everything if it's being used in research. We agree on that - the point is simply that this significantly speeds up the initial research process, not that it replaces the expertise necessary to, for instance, double-check that a name or year is transcribed correctly. A huge amount of historical research is simply about skimming through documents looking for relevant info to zero in on - this is where LLMs can really help.

carschno

I wanted to say this, but could not express it as well. I think what your points also reveal is the biggest success factor of ChatGPT: it can do many things that specialised tools have been doing (better), but many ChatGPT users had not known about those tools.

I do understand that a mere user of e.g. OCR tooling does not perform a systematic evaluation with the available tools, although it would be the scientific way to decide for one. For a researcher, however, the lack of knowledge about the tooling ecosystem seems concerning.


simonw

Full quote:

> Granted, Monte had unusually legible handwriting, but even “easy” early modern paleography like this is still the sort of thing that requires days or weeks of training to get the hang of.

He isn't talking about weeks of training to learn to use OCR software, he means weeks of training to learn to read that handwriting without any assistance from software at all.

eviks

And how would this change anything? If you needed to learn to read it before, despite being able to use OCR, why would this new tool allow you to not learn anything?

Or, to get back to my original comment, if it's OK to be illiterate, why would you need weeks to learn to use an alternative OCR tool?

simonw

Why are you talking about spending weeks learning to use an OCR tool?

pjc50

Do you know any OCR tools that work on early modern English handwriting?

conjectures

I used to work for a historical records org. As of 10 years back, "OCR" for such material meant getting humans to transcribe it. So whatever the limitations of genai, my prior is against there being a perfectly good old-fashioned OCR solution to the 'obscure historical handwriting' problem.

carschno

I would start here: https://www.transkribus.org/

Experts in the field might know more specialized tools, or how to train an actually better Transkribus model without needing deep technical knowledge.

simonw

I'd love to read way more stuff like this. There are plenty of people writing about LLMs from a computer science point of view, but I'm much more interested in hearing from people in fields like this one (academic history) who are vigorously exploring applications of these tools.

dr_dshiv

I’m working with Neo-Latin texts at the Ritman Library of Hermetic Philosophy in Amsterdam (aka Embassy of the Free Mind).

Most of the library is untranslated Latin. I have a book that was recently professionally translated but it has not yet been published. I’d like to benchmark LLMs against this work by having experts rate preference for human translation vs LLM, at a paragraph level.

I’m also interested in a workflow that can enable much more rapid LLM transcriptions and translations — whereby experts might only need to evaluate randomized pages to create a known error rate that can be improved over time. This can be contrasted to a perfect critical edition.
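
A minimal sketch of that sampling step in Python (total pages, sample size, and error counts below are placeholder assumptions, not real figures):

```python
# Hypothetical sketch: pick random pages for expert review, then report an
# estimated error rate with a simple normal-approximation margin of error.
import math
import random

def sample_pages(total_pages: int, k: int, seed: int = 0) -> list[int]:
    """Pick k distinct pages uniformly at random for expert review."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_pages + 1), k))

def error_rate(errors_found: int, pages_checked: int) -> tuple[float, float]:
    """Observed error rate and a ~95% margin of error."""
    p = errors_found / pages_checked
    margin = 1.96 * math.sqrt(p * (1 - p) / pages_checked)
    return p, margin

pages = sample_pages(total_pages=400, k=20)
print("pages to review:", pages)

# Suppose the expert flags problems on 3 of the 20 sampled pages.
p, margin = error_rate(errors_found=3, pages_checked=20)
print(f"estimated error rate: {p:.0%} +/- {margin:.0%}")
```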

And, on this topic, just yesterday I tried and failed to find English translations of key works by Gustav Fechner, an early German psychologist. This isn't obscure: he invented the median and created the field of "empirical aesthetics." A quick translation of some of his work with Claude immediately revealed the concept I was looking for. Luckily, I had a German around to validate the translation…

LLMs will have a huge impact on humanities scholarship; we need methods and evals.

benbreen

Thank you! Have been a big fan of your writing on LLMs over the past couple years. One thing I have been encouraged by over this period is that there are some interesting interdisciplinary conversations starting to happen. Ethan Mollick has been doing a good job as a bridge between people working in different academic fields, IMO.

grobbyy

A basic problem is that they're trained on the Internet, and take on all of its biases. Ask any of them who proposed edX to MIT or who wrote the platform. You'll get back official PR. Look at a primary source (e.g. public git history or private email records) and you'll get a factual story.

The tendency to reaffirm popular beliefs would make current LLMs almost useless for actual historical work, which often involves sifting fact from fiction.

dmix

Couldn’t LLMs cite primary sources much the same way as a textbook or Wikipedia? Which is how you circumvent the biases in textbooks and Wikipedia summaries?

bandrami

They can, but they also hallucinate non-existent references:

https://journals.sagepub.com/doi/10.1177/05694345231218454

simonw

A raw LLM is a bad tool for citations, because you can't guarantee that its model weights will contain accurate enough information to be citable.

Instead, you should find the primary sources through other means and then paste them into the LLMs to help translate/evaluate/etc, which is what this author is doing.
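
As a minimal sketch of that workflow with the OpenAI Python SDK (the model, prompt, and file name here are illustrative assumptions, not the author's actual setup):

```python
# Hypothetical sketch: send a photo of a primary source to a multimodal LLM
# and ask for a transcription plus translation. Requires `pip install openai`
# and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

with open("folio_23r.jpg", "rb") as f:  # placeholder scan of a manuscript page
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Transcribe this early modern manuscript page, preserving "
                "the original spelling, then give an English translation. "
                "Mark any words you are unsure of with [?]."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The [?] instruction matters: you want the model to surface its uncertainty so the expert knows which readings to check against the original.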

Almondsetat

Circumventing the bias would mean providing a uniform sampling of the primary sources, which is not guaranteed to happen

dartos

This is a showcase of exactly what LLMs are good at.

Handwriting recognition, a classic neural network application, and surfacing information and ideas, however flawed, that one may not have had themselves.

This is really cool. This is AI augmenting human capabilities.

BeefWellington

Good read on what someone in a specific field considers to have been achieved (rightly or wrongly). It does lead me to wonder how many of these old manuscripts and their translations are in the training set. That may limit its abilities against any random sample that isn't included.

Then again, maybe not; OCR is one of the most worked-on problems, so the quality of parsing characters into text maybe shouldn't be so surprising.

Off topic: it's wild to me that in 2025 sites like substack don't apply `prefers-color-scheme` logic to all their blogs.

satisfice

The intractable problem here is that “LLMs are good historians” is a nearly useless heuristic.

I’m not a historian. I don’t speak old Spanish. I am not a domain expert at all. I can’t do what the author of this post can do: expertly review the work of an LLM in his field.

My expertise is in software testing, and I can report that LLMs sometimes have reasonable testing ideas, but that doesn’t mean they are safe and effective when used for that purpose by an amateur.

Despite what the author writes, I cannot use an LLM to get good information about history.

simonw

This is similar to the problem with some of the things people have been doing with o1 and o3. I've seen people share "PhD level" results from them... but if I don't have a PhD myself in that subject it's almost impossible for me to evaluate their output and spot if it makes sense or not.

I get a ton of value out of LLMs as a programmer partly because I have 20+ years of programming experience, so it's trivial for me to spot when they are doing "good" work as opposed to making dumb mistakes.

I can't credibly evaluate their higher level output in other disciplines at all.

xigency

This raises the question: is this wave of LLM AI anything more than a fancy mirror? They're certainly very good at agreeing with people and following along but, as many have noted, not really useful when acting on their own.

amelius

You __can__ get good information from an LLM; however, you just have to backtrack every once in a while because the information turned out to be false.

userbinator

however you just have to backtrack every once in a while because the information turned out to be false.

The problem is, how do you know? I've seen developers go completely off-course just from bad search engine results and one did admit he felt something wasn't right but kept going because he didn't know better; now imagine he's being told by a very confident but incorrect LLM, and you can see how hazardous that'll be.

"You don't know what you don't know."

sebmellen

Unless you have a very good understanding of the system you’re working on or the tools you’re using, it’s very possible to get knee deep in crap without knowing it. That’s one of the biggest risks of using LLMs as assistants.

simonw

You need to develop skills like critical thinking, metacognition, analytical reasoning - being able to get to a robust mental model from a bunch of different inputs, some of which may even contradict each other.


mvdtnz

And therein lies the problem - if you're not already an expert there's no way to tell when is the right moment to backtrack.

nithril

The exact definition of a useful heuristic: "good enough"

jolmg

> explicación poética

> There are, again, a couple errors here: it should be “explicación phisica” [physical explanation] not “poetic explanation” in the first line, for instance.

The image seems to say "phicica" (with a "c"), but that's not Spanish. "ph" is not even a thing in Spanish. "Physical" is "física", at least today; IDK about the 1700s. So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading rather than the writer "miswriting", I can see why it assumes it might say "poética", even though that makes less sense semantically.

benbreen

Author here, I agree that my read may not be correct either. It’s tough to make out. Although keep in mind that “ph” is used in Latin and Greek (or at least transliterations of Greek into the Roman alphabet), so in an early modern medical context (i.e. one in which it is assumed the reader knows Latin, regardless of the language being used) “ph” is still a plausible start to a word. Early modern spelling in general is famously variable - it's common to see an author spell the same word two different ways in the same text.

jolmg

> So, if you try to make sense of it in such a way that you assume a nonsense word is you misreading

> I agree that my read may not be correct either

Just in case, by "you", I meant from the POV of the AI, not you the author.

That's interesting to know about "ph". I didn't know it was present in Latin, and I wonder if that's also the case with Spanish.

schoen

I just looked in the Corpus Diacrónico del Español

https://corpus.rae.es/cordenet.html

and it found 33 hits for "phisica" and 99 for "phisico", mostly from the 1490s. Now some of these can be deceptive, like a few are from a bilingual Spanish-Latin book and occur in the Latin portions rather than the Spanish portions, but it seems like some authors in the 1400s wrote "ph" in some Spanish words, at least when they knew the Latin or Greek etymologies.

I don't know when the Iberian languages first got their more phonetic orthographies, especially suppressing that h (that was originally in Latin digraphs used to transliterate Greek letters θ, φ, χ).

Edit: There are also about two dozen hits for physico/physica, interestingly more from the 1700s than the 1400s.

throwup238

> After all (he said, pleadingly) consciousness really is an irreducible interior fortress that refuses to be pinned down by the numeric lens (really, it is!)

I love this line and the “flattening of human complexity into numbers” quote above it. It sums up perfectly how I feel about the whole LLM to AGI hype/debate (even though he’s talking about consciousness).

Everyone who develops a model has to jump through the benchmark hoop, which we all use to measure progress, but we don’t even have anything approaching a rigorous definition of intelligence. Researchers are chasing benchmarks, but it doesn’t feel like we’re getting any closer to true intelligence, just flattening its expression into next-token prediction (aka everything is a vector).

voidhorse

Yeah, precisely. Ever since the "brain as computer" metaphor was birthed in the 50s-60s, the chief line of attack in the effort to make "intelligent" machines has been to continually narrow what we mean by intelligence, further and further, until we can divest it of any dependence on humanist notions. We have "intelligent" machines today more as a byproduct of lowering the bar for what constitutes intelligence than of actually producing anything we'd consider remotely capable of the same ingenuity as the average human being.

afthonos

I find this take strange. My observation has been the opposite. We used to say it would take human intelligence to play chess. Then Deep Blue came up and we said, no, not like that. Then it was go. Then AlphaGo came up and we said no, not like that. Along the way, it was recognizing images. And then AlexNet came along, and we said no, not like that. Then it was creating art, and then LLMs came along, and we said no, not like that.

I agree a narrowing has happened. But the narrowing is to move us closer to saying "if it's not implemented in a brain, located inside a skull, in a body that was developed by DNA-coded cells replicating in a controlled manner over a period of years, it's not really AI."

There's an emotional attachment to intelligence being what makes us human that causes people to lose their minds when machines approach our intelligence. Machines aren't humans. If we value humanity, we should recognize that distinction—even as machines become intelligent and even sentient.

And we should definitely think twice, or, you know, many many many many more times, before building intelligent machines. But I don't think pretending we're not doing that right now is helpful.

voidhorse

I think that's a great take and though they appear contradictory, I actually think both perspectives are correct.

I think what both viewpoints show is that, at the end of the day, intelligence is a broad, fuzzily defined thing, and attempting to claim that a single capability is evidence of intelligence always seems to be insufficient (from either direction).

I also think your points about our own emotional attachment and thinking carefully about intelligent machines are superb. I see a lot of people chasing certain tech right now and I see a far smaller number asking whether or not this tech is something we need or want. I personally don't need to live in a world in which robots are 1:1 emulations of humans (or better). I'd be just as content to live in a world of highly specific and highly optimized collections of robots or "intelligences" only capable of doing one thing really well (a unix theory of "agents", as it were)

simonw

This is called the "AI effect" - the constant shifting of goalposts when the term AI is used, which has been going on for 50+ years at this point: https://en.m.wikipedia.org/wiki/AI_effect

zwischenzug

I wrote this piece in 2023, which argues similarly that LLMs are a boon, not a threat, to historians

https://zwischenzugs.com/2023/12/27/what-i-learned-using-pri...

adamredwoods

>> One of the well-known limitations with ChatGPT is that it doesn’t tell you what the relevant sources are that it looked at to generate the text it gives you.

This isn't a limitation, this is critically dangerous. Commercial AI is a centralized, controlled, biased LLM. At what point will someone train it to say something they want people to believe? How can it be trusted?

Consensus based information is still best, and I don't feel LLMs will give us that.

thom

This is the thing I specifically use LLMs for when I’m doing history courses. I’ll remember some vague quote or event and ask for the primary sources, and the latest ChatGPTs are excellent at getting the right reference, which I can then look up and check myself. Maybe this works better for Latin and Greek texts because it’s gobbled up all the Loebs out there, but it works well for me.

lmm

Consensus based history has similar problems. It's extremely easy for the consensus to be distorted by contemporary politics.

delichon

On the contrary. The heart of an LLM is a next word predictor, based on statistics. They do much the same with concepts, making them essentially consensus distillation devices. They are zeitgeisters. They get weird mainly when their training data is too sparse to find actual consensus, so instead tell you to stick cheese to your pizza with glue.
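
A toy illustration of that predictor (the vocabulary and scores are made up, not from a real model): with dense training data one candidate dominates, while sparse data flattens the distribution and odd continuations become far more likely to be sampled.

```python
# Toy next-token step: turn raw scores into probabilities, then decode.
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return [e / sum(exps) for e in exps]

# Made-up candidates for the token after "stick the cheese on with".
candidates = ["sauce", "toothpicks", "glue", "hope"]
dense = softmax([6.0, 4.5, -1.0, 0.5])   # strong consensus in training data
sparse = softmax([1.2, 1.0, 0.9, 0.8])   # sparse data: near-uniform

print("dense: ", dict(zip(candidates, [round(p, 3) for p in dense])))
print("sparse:", dict(zip(candidates, [round(p, 3) for p in sparse])))

# Sampling the sparse distribution picks "glue" almost a quarter of the time.
print("sampled:", random.choices(candidates, weights=sparse)[0])
```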

astrange

> They get weird mainly when their training data is too sparse to find actual consensus, so instead tell you to stick cheese to your pizza with glue.

That's exactly not how that happened. That happened because Google's summaries are based on their search results and one of the search results contained that.

ericjmorey

This is only useful if you know what data was used to train the model.

dang

Discussed here!

What I learned using private LLMs to write an undergraduate history essay - https://news.ycombinator.com/item?id=38813297 - Dec 2023 (81 comments)

Animats

"LLMs, which are exquisitely well-tuned machines for finding the median viewpoint on a given issue..."

That's an excellent way to put it. It's the default mode of an LLM. You can ask an LLM for biases, and get them, of course.

astrange

I don't think there is any reason to believe this except that everyone seems to want it to be true.

An easy way to make it not be true would be to emphasize some sources in pretraining by putting them in the corpus multiple times.
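
A toy sketch of that kind of emphasis (file names and weights are made up): upsampling is just weighted repetition when the pretraining corpus is assembled.

```python
# Toy corpus assembly: repeating a source gives it proportionally more
# weight in pretraining. Paths and weights are placeholders.
sources = {
    "common_crawl_sample.txt": 1,      # baseline weight
    "curated_primary_sources.txt": 5,  # emphasized: included five times
}

corpus = []
for path, copies in sources.items():
    text = f"<contents of {path}>"  # stand-in for actually reading the file
    corpus.extend([text] * copies)

print(len(corpus), "weighted documents in the training corpus")
```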

miki123211

A much better way is to RLHF the LLM until you get the behavior you want.

As far as I know, modern LLMs try to strike a balance, being somewhat neutral while not being too neutral on topics outside of the Overton window. They'll give you a "both sides have their good points" argument on abortion, religion, guns or immigration, but won't do that for obvious racism or Nazi viewpoints.

Early LLMs had a problem with getting this balance right, I feel like many of them were a lot more left-leaning. I don't know how much of the change is caused by us understanding the technology better and how much is just the political winds shifting, though.

I felt like we had a moment there when some models were a bit too "well it depends", even on very uncontroversial subjects.

pjc50

> I feel like many of them were a lot more left-leaning

"Reality has a liberal bias"

dleeftink

Maybe not 'median' but rather 'sufficiently representative': as with all distributional semantics, given a large enough corpus we can approach the 'true' distribution of words/phrases in a given language.

krainboltgreene

Except the corpus itself is a fraction of all media. This is like saying Twitter is sufficiently representative of all human history.

gcanyon

I wonder (hope) that for any given issue, the majority of the internet/the training data, and therefore the model's output, will be fairly near to the truth. Maybe not for every topic, but most.

E.g., the models won't report that unicorns are real because the majority of the internet doesn't report that unicorns are real. Of course, there may be issues (like ghosts?) where the majority of the internet isn't accurate?

DennisP

It was pretty neat seeing this because a recent paper found that AI models are bad historians: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-histo...

But the gist of its argument just seems to be that they don't know fine details of history, and make the same generalized assumptions that humans would make with only a cursory knowledge of a particular topic. This seems unavoidable for a model that compresses a broad swath of human knowledge down to a couple hundred gigabytes.

Using AI as a research tool instead of a fact database is of course a whole different thing.