Can you read this cursive handwriting? The National Archives wants your help

266 comments

January 18, 2025

geuis

It's a really interesting project. But boy do they make it hard to participate.

* Article doesn't provide a direct link to the topic mission

* Signup is pretty easy. Well organized and even gently requires you to have two forms of 2FA.

* Sign up complete. Go back to the primary page and try to find the mission. A little buried but not too deep.

* Notice I'm not signed in. Ok, let's do that. Now I'm back on the main page and navigate back. Find the first document and open it. Really interesting to scan through the doc and to read. People back then generally had really nice handwriting.

* Ok, what next, how do I transcribe? ... ? Oh it says I'm not logged in again. Fine, click the link and...

* I'm logged in and directed back to the main page, again.

Look, this is an interesting project and I'd love to spend my spare cycles to help out. But they really need to clean up this process.

Volunteers shouldn't have to jump through kinda poorly designed interfaces to help out.

rtkwe

The social post embedded in the page links directly to this page with all the instructions. Once I created an account and signed in I just selected a state in the original tab and was right there and could start transcribing.

Do you perhaps have uBlock Origin enabled or some other limitation on Javascript/cookies that might be messing with your login status?

Here is the direct link to the mission that was in the social post: https://www.archives.gov/citizen-archivist/missions/revoluti...

jcoby

I had the exact same experience when I tried to contribute last week. I had to jump between multiple sessions and browsers and eventually managed to log in after about 30 minutes of trying. There is no indication of what is going right or wrong. Once you're in the UI changes very little as well so it's quite easy to miss that you've managed to log in.

Once I was logged in I spent another 45 minutes trying to find a document to transcribe. Every single one I found or was given from a challenge had either already been transcribed or was a typewritten document or manifest that the OCR had already done an OK job with. I reviewed a few documents for accuracy, closed the browser, and never went back.

It's a shame it's so hard to use. I really was hoping for something I could pop open for 15-30 minutes a day as a break from work and contribute to instead of doing a crossword or watching a video.

DidYaWipe

"and even gently requires you to have two forms of 2FA"

WTF, why? I'm not putting my bank info in there. The whole process sounds like a PITA in several ways, but in general I'm getting fed up with no-importance sites requiring 2FA as if they're a brokerage.

demosthanos

Before commenting asking about why they don't just use LLMs, please note that the article specifically calls out that they do, but it's not always a viable solution:

> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.

The document at the top is likely an especially easy document to read precisely because it's meant to be the hook to get people to sign up and get started. It isn't going to be representative of the full breadth of documents that the National Archives want people to go through.

tptacek

OK, fair enough, but can you find one in this article that's hard for an LLM? The gnarliest one I saw, 4o handled instantly, and I went back and looked carefully at the image and the text and I'm sold.

Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

Later

I signed up, went to the current missions, and they all seem to be post-1900 and all typeset. They're blurry, but 4o cuts through them like a hot knife through butter.

defaultcompany

My parents have saved letters from their parents which are written in cursive but in two perpendicular layers. Meaning the writing goes horizontally in rows and then when they got to the end of the page it was turned 90 degrees and continued right on top of what was already there for the whole page. This was apparently to save paper and postage. It looks like an unintelligible jumble but my mother can actually decipher it. Maybe that’s what the LLMs are having trouble with?

Edit: apparently it’s called cross writing [1]

1: https://highshrink.com/2018/01/02/criss-cross-letters/

tptacek

Are they having trouble? You can sign up right now and get tasks from the archive that seem trivial for 4o (by which I mean: feed a screenshot to 4o, get a transcription, and spot check it).

anaisbetts

Did you actually check it? Sonnet 3.5 generates text that seems legitimate and generally correct, but misreads important details. LLMs are particularly deceptive because they will be internally consistent - they'll reuse the same incorrect name in both places and will hallucinate information that seems legit, but in fact is just made-up.

myth_drannon

You don't use an LLM but other transformer-based OCR models like TrOCR, which have very low CER and WER (character and word error rates).
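For anyone unfamiliar with the two metrics: CER and WER are usually computed as the edit distance between a reference transcription and the model output, divided by the reference length in characters or words. A minimal sketch in plain Python, assuming a trusted reference exists; the function names are illustrative and not part of TrOCR or any OCR library:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates depend on how the reference is normalized (case, punctuation, line breaks), so numbers from different projects are only comparable when the normalization matches.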

dr_dshiv

Just have version control, and allow randomized spot checks with experts to have a known error rate.
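A rough sketch of what the spot-check part could look like, assuming each sampled document is simply marked correct or incorrect by an expert. The helper names are hypothetical, and the Wilson score interval is one common choice for putting bounds on an error rate estimated from a sample, not something the project is known to use:

```python
import math
import random


def sample_for_review(doc_ids, k, seed=None):
    """Pick k distinct documents uniformly at random for expert review."""
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), k)


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% interval for the true error rate, given
    `errors` bad transcriptions found among `n` reviewed ones."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

The interval narrows as more documents are reviewed, which gives a concrete way to decide how many spot checks are enough.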

ellen364

> Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?

Possibly for the reason that came up in your other post: you mentioned that you spot checked the result.

Back when I was in historical research, and occasionally involved in transcription projects, the standard was 2-3 independent transcriptions per document.

Maybe the National Archives will pass documents to an LLM and use the output as 1 of their 2-3 transcriptions. It could reduce how many duplicate transcriptions are done by humans. But I'll be surprised if they jump to accepting spot checked LLM output anytime soon.
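One way the 2-3 independent transcriptions could be reconciled automatically is to score how closely each pair agrees and send low-agreement documents to a reviewer. This is a sketch under assumptions: the 0.95 threshold and the function names are made up for illustration, and whether one of the inputs is an LLM pass makes no difference to the code:

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and split on whitespace so layout differences are ignored."""
    return text.lower().split()


def agreement(a, b):
    """Word-level similarity between two transcriptions, from 0.0 to 1.0."""
    matcher = SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


def needs_review(transcriptions, threshold=0.95):
    """True if any pair of transcriptions agrees less than the threshold."""
    return any(agreement(a, b) < threshold
               for a, b in combinations(transcriptions, 2))
```

A consistently misread name would still pass if every transcriber made the same mistake, so this reduces duplicate effort rather than replacing review.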

tptacek

You get that I'm not saying they should just commit LLM outputs as transcriptions, right?

varenc

My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative. And maybe some AI-skeptic protectionist sentiments from the professional archivists. Seems like it could change with time though.

ugh123

> My guess is because it’s the Smithsonian, they’re just not willing to trust an LLM’s transcription enough to put their name on it. I imagine they’re rather conservative

I expect that's a common theme from companies like that, yet I don't think they understand the issue they think they have there.

Why not have the LLMs do as much work as possible and have humans review and put their own name on it? Do you think they need to just trust and publish the output of the LLM wholeheartedly?

I think too many people saw what a few idiot lawyers did last year and closed the book on LLM usage.

dfc

The article is from The Smithsonian. The actual project is with the National Archives.

doodlebugging

I'm doing some genealogy work right now on my family's old papers covering the time period from recent years back to the late 17th century. Handwriting styles changed a lot over the centuries and individuals can definitely be identified by their personal cursive style of writing and you can see their handwriting change as they aged.

Then you have the problem that some of these ancestors not only had terrible penmanship but also spelled multi-syllabic words phonetically since they likely were barely educated kids who spent more time when they were young working on the farm or ranch instead of attending school where they would've learned how to spell correctly.

I don't know whether your LLM can handle English words spelled phonetically written in cursive by an individual who had no consistency in forming letters in the words. It is clear after reading a lot of correspondence from this person that they ignored things that didn't seem important in the moment like dotting i's or crossing t's or forming tails on g's, p's, j's, or even beginning letters consistently since they switched between cursive and block letters within a sentence, maybe while they paused to clarify their thoughts. I don't know but it is fascinating to take a walk through life with someone you'll never meet and to discover that many of the things that seemed awesome to you as a kid were also awesome to them and that their life had so many challenges that our generations will never need to endure.

Some of my people have the most beautiful flowing cursive handwriting that looks like the cursive that I was taught in grade school. Others have the most beautiful flowing cursive with custom flourishes and adornments that make their handwriting instantly recognizable and easy to read once you understand their style.

I think there are plenty of edge cases where LLMs will take a drunkard's walk through the scribble and spit out gibberish.

I'm reminded of an old joke though.

Ronald Reagan woke up one snowy Washington, DC morning and took a look out of the window to admire the new-fallen snow. He enjoys the beautiful scene laid out before him until he sees tracks in the snow below his window and a message obviously written in piss that said - "Reagan sucks".

He dispatched the Secret Service to the site where samples were taken of the affected snow and photos of the tracks of two people were made.

After an investigation he receives a call from the Secret Service agent in charge who tells him he has some good news and some bad news for him.

The good news is that they know who pissed the message. It was George HW Bush, his Vice President. The bad news is that it was Nancy's handwriting.

vintermann

I don't know about this project, but I can easily find thousands of images that gpt-4o can't read, but a human expert can. It can do typed text excellently, antika-style cursive if it's very neat, and kurrent-style cursive never.

tptacek

For straightforward reasons, I am commenting on this project, not the space of all possible projects. I did try, once, to get 4o to decode the Zodiac Killer's message. It didn't work.

Avshalom

Real quick, how long do you think chatgpto4 has existed? How long do you think the National Archive has been archiving?

tptacek

It's 4o. The crowdsourced transcription project dates back to 2012. My comment is mostly on this article.

rtkwe

One that requires additional work beyond simply feeding the image into the model would be this example, which is a mix of barely legible handwritten cursive and easy to read typed form. [0] Initially 4o just transcribes (successfully) the bottom half of the text and has to be prompted to attempt the top half, at which point it seems to at best summarize the text instead of giving a direct transcription. [1] In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of its "transcription" about "reduced and indigent circumstances".

[0] https://catalog.archives.gov/id/54921817?objectPage=8&object...

[1] Reproducing here since I cannot share the chat since it has user uploaded images. " The text in the top half of the image is handwritten and partially difficult to read due to its cursive style and some smudging. Here's my best transcription attempt for the top section:

...resident within four? years, swears and says that the name of the John Hopper mentioned in the foregoing declaration is the same person, and he verily believes the facts as stated in the declaration are true.

He further swears that the said John Hopper is in reduced and indigent circumstances and requires the aid of his country.

The declarant further swears he has no evidence now in his power of service, except the statement of Capt. (illegible name), as to his reduced circumstances ...

Sworn to before me, this day...

Some parts remain unclear due to the handwriting, but let me know if you'd like me to attempt further clarification on specific sections!"

thaumasiotes

> this example, which is a mix of barely legible handwritten cursive and easy to read typed form.

> In fact it seems to mix up some portions of the latter half of the typed text with the written text in the portion of its "transcription" about "reduced and indigent circumstances".

What typed form? What typed text? That image is a single handwritten page, and the writing is quite clean, not "barely legible".† The file related to John Hopper appears to be 59 pages, and some of them are typed, but they're all separate images.

Are you trying to process all 59 pages at once? Why?

I should note that transcription is an excellent use of an LLM in the sense of a language model, as opposed to an "LLM" in the sense of several different pieces of software hooked together in cryptic ways. It would be a lot more useful, for this task, to have direct access to the language model backing 4o than to have access to a chatbot prompt that intermediates between you and the model.

† My biggest problems in reading the page: Cursive n and u are often identical glyphs (both written и), leading me to read "Ind." as "Jud."; and I had trouble with the "roster" at the bottom of the page. What felt weirdest about that was that the crossbar of the "t" is positioned well above the top of the stem, but that can't actually be what tripped me up, because on further review it's a common feature of the author's handwriting that I didn't even notice until I got to the very end of the letter. It's even true in the earlier instance of "Roster" higher up on the page. So my best guess is that the "os" doesn't look right to me.

I misread 1758 as 1958, too, but hopefully (a) that kind of thing wears off as you get used to reading documents about the Revolutionary War; and (b) it's a red flag when someone who died in 1838 was born in 1958 according to a letter written in 1935.

prng2021

Determining whether the latest off the shelf LLMs are good enough should be straight forward because of this:

“Some participants have dedicated years of their lives to the program—like Alex Smith, a retiree from Pennsylvania. Over nine years, he transcribed more than 100,000 documents”

Have different LLMs transcribe those same documents and compare to see whether the human or the machine is more accurate, and by how much.

sandworm101

This is not an LLM problem. It was solved years ago via OCR. Worldwide, postal services long ago deployed OCR to read handwritten addresses. And there was an entire industry of OCR-based data entry services, much of it translating the chicken scratch of doctors' handwriting on medical forms, long before LLMs were a thing.

prng2021

It was never “solved” unless you can point me to OCR software that is 100% accurate. You can take 5 seconds to google “ocr with llm” and find tons of articles explaining how LLMs can enhance OCR. Here’s an example:

https://trustdecision.com/resources/blog/revolutionizing-ocr...

lukeschlather

LLMs improve significantly on state of the art OCR. LLMs can do contextual analysis. If I were transcribing these by hand, I would probably feed them through OCR + an LLM, then ask an LLM to compare my transcription to its transcription and comment on any discrepancies. I wouldn't be surprised if I offered minimal improvement over just having the LLM do it though.
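The "compare my transcription to its transcription" step does not strictly need a model; a plain word-level diff already lists the spots where the two readings disagree. A small sketch using Python's standard `difflib`, with a hypothetical `discrepancies` helper:

```python
from difflib import SequenceMatcher


def discrepancies(human, machine):
    """Return (human_words, machine_words) pairs wherever the two differ."""
    a, b = human.split(), machine.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means words were only added or dropped.
            found.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return found


pairs = discrepancies("hearings on the Tenorio tract",
                      "hearings on the Teapot trial")
print(pairs)  # [('Tenorio tract', 'Teapot trial')]
```

Each pair can then be checked against the image, or handed to an LLM for comment as described above.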

dambi0

For the addresses it might be a bit easier because they are a lot more structured and in theory the vocabulary is a lot more limited. I’m less sure about medical notes, although I’d suspect that there are fairly common things they are likely to say.

The (admittedly single) example from the National Archives seems a bit more open-ended than the other two cases. It’s not impossible that LLMs could help with this.

WillAdams

Yes, but there was usually a fall-back mechanism where an unrecognized address would be shown on a screen to an employee who would type it so that it could then be inkjetted with a barcode.
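That fall-back pattern is easy to express in code: accept the machine reading when the recognizer is confident and queue everything else for a person. A sketch only; the `Recognition` type, the confidence score, and the 0.9 threshold are assumptions for illustration, not details of any postal system:

```python
from dataclasses import dataclass


@dataclass
class Recognition:
    text: str          # what the recognizer read
    confidence: float  # recognizer's own score, from 0.0 to 1.0


def route(result, threshold=0.9):
    """Send confident reads straight through, the rest to a human."""
    return "auto" if result.confidence >= threshold else "human"


def triage(results, threshold=0.9):
    """Split results into (accepted automatically, queued for a human)."""
    auto = [r for r in results if route(r, threshold) == "auto"]
    human = [r for r in results if route(r, threshold) == "human"]
    return auto, human
```

The threshold sets the trade-off: raising it sends more items to people and lowers the chance of a wrong automatic read.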

iandanforth

Fun fact, convolutional neural networks developed by Yann LeCun were instrumental in that rollout!

pinoy420

Agree. Sounds like not wanting to let go of a legacy

tedunangst

Something about extraordinary claims and extraordinary evidence? The evidence presented, a seemingly easily transcribed image, is hardly persuasive.

rtkwe

Some are significantly harder to read. I took the page below and tried to get GPT 4o to transcribe it and it basically couldn't do it. I'm not going to sit and prompt hack for ages to see if it can, but it seems unable to tackle the handwritten text at the top. When I first just fed it the image and asked for a transcription it only (but successfully) read the bottom portion; prompted for a transcription of the top, it dropped into more of a summary of the whole document, mainly pulling some phrases from the bottom text. (Sadly can't share it but I copied its reply out in a comment upthread) [0]

It was more successful at a few others I tried but it's still a task that requires manual processing like a lot of LLM output to check for accuracy and prompt modification to get it to output what you need for some documents.

https://catalog.archives.gov/id/54921817?objectPage=8&object...

[0] https://news.ycombinator.com/item?id=42746490

jeffbee

Drives me crazy that they are saying "AI and OCR". It sucks that charlatans have occupied the field of "AI" so thoroughly now that OCR is considered something separate.

interludead

Still, the fact that they’re combining AI and human effort makes sense

mkoubaa

High quality human transcriptions are the most valuable kind of training data

Unearned5161

Ok I did one letter, from a woman in 1814 writing to James Monroe (then Secretary of State) asking for a passport to go to Scotland to get her late brother's property. What a trip! So enjoyable to get into the flow once you've "synchronized" with the person's handwriting. Furthermore, due to the fact that you're reading and re-writing word for word of whatever you're transcribing, the stories you end up reading have tremendous memory-stick. This is not surprising, considering that you are dedicating an inordinate amount of time per page, but it's a welcome side effect when you try and recollect.

jhanschoo

> Furthermore, due to the fact that you're reading and re-writing word for word of whatever you're transcribing, the stories you end up reading have tremendous memory-stick. This is not surprising, considering that you are dedicating an inordinate amount of time per page, but it's a welcome side effect when you try and recollect.

This was something I enjoyed when I decided to learn a language by translating short stories. (Edit: Of course, you have to choose an author whose diction you respect. Your unfamiliarity with the target language encourages you to mull over the author's use of diction and the nuances the author is trying to convey, and then find appropriate diction in English. This means you spend a long time immersed in the imagery.)

Unearned5161

What a brilliant idea. I've had learning to read French on my list for a while now, I'm going to try transcription as another way at it.

Daneel_

I wish this technique worked for me. I can transcribe something verbatim and then have absolutely no idea what I've written - I have to go back and read it to actually parse the text.

dylan604

That’s not uncommon. I was the same way back when I took an actual typing class. The part of my brain used for storage/recall just seems to go to sleep when doing the whole transcription stage. Maybe it was a mental thing, realizing it was just a task and having no actual interest in the content other than accomplishing the task, vs. doing something I had a vested interest in?

makeitdouble

That's my whole school life. Bonus difficulty as it was pen and paper, my writing sucked enough that I couldn't read back a bunch of it. I also couldn't read half of the cursive in this project, I'm really bad at that.

It worked better when I realized I could stop taking most notes.

interludead

I love the idea of "synchronizing" with someone’s handwriting

seletskiy

To tptacek and other guys who seem to have unwavering trust in OCRs/LLMs, as well as to opposite party who think that technology is not there yet — you are all partially right, but somehow fail to hear each other while also spending time on baseless arguing instead of factual examples and attempts to find common truth.

Can it be used to greatly simplify efforts by getting through boilerplate? — Yes.

Should the result be reviewed and proof-read by human? — Also yes.

---

Here is a subtle one: https://catalog.archives.gov/id/34384201?objectPage=40

Here is one of the transcripts made by `o1-pro`:

  (2)

  …and I don’t know whether it can be reset for a
  date in December or not. Cornell seemed
  anxious that it should not come up too close to Christmas,
  and of course new suspicion [would be aroused?] [about?] him.
  I will take this up with the Judge as soon as I can get rid of the brief.
  Meanwhile I would like to know whether there is anything else
  in which I can be useful to you, since it behooves me
  in ways of uncomfortable relations with the present management.

  Are you going East in December?
  Has any word come from Hagerman?
  Were there any noteworthy developments at the hearings
  on the [Teapot?] trial?

  I have no inclination yet whether Wheeler will be wanted in
  Washington, but the chances are that he will not.

  With regards to all the brethren and [flock?], I am

  very sincerely yours,
  George A. H. Fraser

I'm not a native English speaker, but even I can see where it is wrong. I'll leave it as an exercise for the reader to find the mistakes, but it is certainly not a Teapot trial.

Somehow GPT-4o performs better on this example and fails only on "New Mexican practise" part.

wriggler

From https://www.handwritingocr.com - seemed to be more accurate, mostly getting the New Mexican and possibly other parts:

---

and I don't know whether it can be reset for a date in December or not. Cornell seemed anxious that it should not come off too close to Christmas, and of course New Mexican practice would support him. I will take this up with the Judge and with Hanna the moment I can get rid of the brief. Meanwhile I would like to know whether there is anything else in which I can be useful to you, since it behooves me to be diligent in view of uncomfortable relations with the present management.

Are you going East in December?

Has any word come from Hagerman?

Were there any noteworthy developments at the hearings on the Tenorio tract?

I have no intimation yet whether I will be wanted in Washington, but the chances are that I will not.

With regards to all the brethren and flock, Dan

Very sincerely yours, George H. H. Baser

dahart

Looks entirely accurate except for the end. It’s interesting it didn’t catch “I am” or George’s name correctly, given how difficult some of the text is on this page.

Edit: Oh I see from another thread this OCR site is your creation. Nice work!

dahart

Consider using the reply feature so that your comment appears in context.

Also your link goes to the wrong page. Here’s the right one: https://catalog.archives.gov/id/34384201?objectPage=190

teddyh

A “Teapot trial” is not actually that farfetched: <https://en.wikipedia.org/w/index.php?title=Teapot_Dome_scand...>

thaumasiotes

Unless you're looking at the writing, that is.

tptacek

I don't have "unwavering" trust in OCR and LLM.

Unearned5161

cheers! I was looking for something semi productive to sink a Friday night into

on a more serious note, working through a transcription project for letters and journals that nobody has touched since they've been archived is such a wonderful feeling. Aside from being in front of the physical document itself, your degree of separation from the writer and point in time is vanishingly small!

I always like to observe when they cross something out or make a mistake and think about what could have caused that. Did a friend pass by the door and scare them? Did they get distracted looking out the window? It's all so close and yet so far away :)

saagarjha

Seems like something that some of those big AI companies that are desperately starved of training material could chip in on, no? Actually do something for the public good, spend a few cents of that VC money, get some high-quality training data out of it?

ChrisMarshallNY

They should ask a medical school for help ;)

My family is Ivy-League, all the way, and has the worst goddamn cursive writing I've ever seen. It can take me an hour to read a Christmas card from my sister.

wkjagt

I've always wondered how pharmacists can read those prescriptions. There must be some kind of course in university that they followed.

ivanjermakov

I think with experience they know how each medicine is usually written? It's often easier to listen/read when you already know what it is about.

FireBeyond

A lot of it is understanding the abbreviations.

"2T BD IAF UF": 2 tablets, twice a day, immediately after food, until finished.

valiant55

Not really a problem anymore, it's all been digitized at least for the most part.

Decabytes

I’m interested to give this a go because I want to practice reading cursive. I do a lot of longhand writing including writing all my notes in cursive. It’s exciting to watch my binding fill up with all sorts of different subjects!

I like to write in cursive for a few reasons

1. I find it makes my hand cramp less

2. It offers some shallow privacy in public

3. I don’t want to lose the skill

4. It’s fun!

gabeio

All of the same reasons I love practicing a little calligraphy! I love how it looks as well. I don’t use a special pen but just add my own style to my cursive to make it look even nicer. But I used to write my notes in school with calligraphy (mostly because it gave me an excuse to not care about the subject) but it made the teachers hate me because I would never finish copying their scribbles fast enough.

iambateman

This is all very cool so I’m not trying to be dismissive. In a lot of ways, giving a hobby out as a way to participate in the national archives is an end in itself.

But…computers can definitely do this way better, right?

jonahx

I had the same thought, but maybe on old handwriting they can't?

EDIT:

I tried giving the sample to 4o and it gave:

The following is the declaration of James Lambert, a soldier of the Revolutionary War in North America.

The said James Lambert this day personally appeared in the Probate Court of the County of Dearborn in the State of Indiana and at the November Term of said Court (1841), it being a court of record created by the laws of Indiana and made oath that:

On the 25th day of March 1842, he will be eighty-five years old, that he was born in the State of Maryland, that he is now a resident of said county and has been for the 27 years last past; that he has lived in Virginia, Maryland, and Pennsylvania...

AdieuToLogic

> This is all very cool so I’m not trying to be dismissive. In a lot of ways, giving a hobby out as a way to participate in the national archives is an end in itself.

> But…computers can definitely do this way better, right?

No.

Cursive writing is analog and fluid, lacking consistency across authors and often inconsistent by an individual author as well. When done well, it could be classified as its own art form. When done poorly, it can resemble the path walked by a chicken on meth.

musicale

iPad seems to do OK, but it has more to go by since it has the timing and pressure as well as the written text.

sulam

Current LLMs can absolutely do this as well as you can, probably better.

AdieuToLogic

> Current LLMs can absolutely do this as well as you can, probably better.

This is obviously disprovable, in that if they could, they would, and this call to action would not exist.

zabzonk

After using a keyboard for circa 50 years, I can't read my own handwriting. I can't even give a reproducible signature.

dpb001

Same here. Old enough to remember when your signature on a credit card receipt would be given a quick look to compare it to the scrawl on the back of the card. If this was still being done I’d probably fail 50% of the transactions I attempt.

kmoser

Nobody has checked the back of my credit card for the presence of a signature in decades, let alone whether the signature matches. (I also haven't bothered to sign my credit card for this reason, but also because why would I want somebody to have my actual signature if my card is stolen?) These days my "signature" on a credit card purchase is usually a smiley face. Nobody has ever complained.

dpb001

Yup, it’s been decades - I remember it happening with the carbon copy imprinting devices and it may have been more common in the US rural South where I was working at the time. The squiggles I fingerpaint on checkout screens now are my version of your smiley face.

munchler

Me too, and I used to be proud of my handwriting back in the 90's. Definitely a loss in self-expression.

tptacek

Isn't this like a bread-and-butter AI task?

“The following is the declaration of James Lambert, a soldier of the Revolutionary War in North America.” “The said James Lambert, on this day personally appeared in the Probate Court of the County of Dearborn in the State of Indiana, at the November Term of said Court [1841], it being a court of record created by the laws of Indiana, and made oath that on the 25th day of March 1842 he will be eighty‐five years old; that he was born in the State of Maryland; that he is now a resident of [said] county and has been for the [27] years last past; that he has lived in Virginia, Maryland, [and Pennsylvania]; that…”

These kinds of problems, matching up cursive to actual text, would seem to play to the absolute best strengths of an LLM, given how much basic language structure the models encode.

edelbitter

I've seen people do that, and the results are.. just sad. These modern models insert their twitter-era "what grabs attention must be true" view into the very little authentic past we still possess.

tptacek

What did 4o get wrong about the title image in the transcription I just gave you?

jll29

Prompt:

  You are a paleologist specializing in analysis of cursive handwriting; tell me what the following text says: (pasting the picture).

Output:

  The following is the declaration of James Lambert, a soldier of the Revolutionary War in North America.

  The said James Lambert this day personally appeared in the Probate Court of the County of Dearborn in the state of Indiana and at the November Term of said court (1841), it being a court for claims created by the laws of Indiana and makes oath that:

  "On the 25th day of March 1842, he will be eighty-five years old; that he was born in the state of Maryland; that he is now a resident of said county and has been for the 27 years last past; that he has lived in Virginia, Maryland, Pennsylvania ..."

nejsjsjsbsb

Might be in the training data:

https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844...
