How OpenElections uses LLMs
26 comments · June 19, 2025
fasthands9
xp84
I could be wildly off-base, but I wonder if some of these systems are airgapped, and the only way the data comes off of the closed system is via printing, to avoid someone inserting a flash drive full of malware in the guise of "copying the CSV file." Obviously there are or should be technical ways to safely extract data in a digital format, but I can see a little value in the provable safety that airgapping gives you.
dwillis
In some cases that's true, but for many jurisdictions the results systems are third-party vendor platforms, too.
simonw
This is such an excellent example of a responsible and thorough application of vision LLMs to a gnarly data entry problem.
polskibus
It’s also an excellent example of how the lack of a mandated machine-readable format for gov publishing is a PITA.
Mtinie
If I was in power and wanted to continue said rule, I’d definitely discourage the adoption of any standardized formatting for election results.
Not, you know, for any nefarious purpose…but because what we’ve used forever was good enough for grandpappy, so it’s obviously good enough for us.
/cough
sitkack
json to qr code would be a good start. PRIOR ART inb4 a troll.
GardenLetter27
Why is the original source data not available anywhere digitally?
Since it's printed, it is clearly already in a database somewhere. Why can't that just be made public too?
Seems bizarre to OCR printed documents (although I am aware of many companies doing this to parse invoices, etc.)
simonw
Welcome to government data.
One key problem is that the US has tens of thousands of local governments, and each of them gets to solve problems in its own way.
Digital literacy of the kind that understands why releasing a CSV file is more valuable than a PDF is rare enough that most of them won't have someone with that level of thinking in a decision making role.
codingdave
> most of them won't have someone with that level of thinking
That is an unfair take on it. Come out to the midwest and talk to some of the clerks in the small townships and counties out here. They do know the value of improved data and tech. And they know that investing in better tech can result in a little less money in the bank, which results in less gas to plow the roads and less money to pay someone to mow the ditches, which means one more car wrecked by hitting a deer. So the question is often not about CSV vs. PDF. It is about the overall budget to do all the things that matter to the people of their town. Tech sometimes just doesn't make the cut.
Besides, elections tend to have their own tech provided by the county or state, so there is standardization and additional help on such critical processes.
People running the smallest of government entities in this country tend to have pretty good heads on their shoulders. They get voted out pdq when they don't.
simonw
I'm not convinced by that argument. The data is clearly already in a spreadsheet of some sort. I don't think "click export as CSV" vs. "print out as paper and scan as PDF" is a cost decision.
This isn't meant as shade! I have full respect for people working in those roles. Knowing the difference between a CSV file and a PDF file - and understanding why there are people out there who curse the existence of PDFs and celebrate CSVs - is pretty arcane knowledge.
Also note that I blamed people in "a decision making role" - changing procedures requires buy-in from management. People in management roles are less likely to be thinking about CSVs vs. PDFs than the people actually executing on the work.
As Derek pointed out in https://news.ycombinator.com/item?id=44320001#44322987 this may often be a vendor limitation - in which case there is a cost factor to consider, and the blame can also be shared between the vendor and the person who made the purchasing decision without understanding the difference between PDF and CSV export.
nxrabl
Very interesting! Is this the state of the art for accurate OCR of tabular PDFs, or is there other work in the space to compare against?
SnooSux
There are lots of posts on HN about developments and companies doing OCR and document extraction. It's a classic CV problem, but it has still come a long way in the past couple of years.
dwillis
Yeah, this is a very well-traveled road, but LLMs have made some big improvements. If you asked me (the guy who wrote the original piece linked above) what I'd use if accuracy alone was the goal, it would probably be AWS Textract. But accuracy and structure? Gemini.
benob
I wonder how difficult it would be to bias a model so that it subtly corrupts election results when performing OCR.
croemer
Surely not hard but why?
bilbo0s
Easier to steal elections?
Don't have to bother with gerrymandering, or slick legal ways to arrest people for voting with the wrong documents. Or just good old fashioned intimidation, like making the polling place the police station or the ICE detention facility.
It's just a lot smoother process when you can simply write some software to manipulate the count.
Who's gonna check?
(No, seriously, who's gonna check? Because you also need to lay off everyone in that department once you're in power.)
simonw
Corrupted OCR won't help you steal elections. The result counting is a different process, with well designed checks and safeguards.
The problem is that once the counts are done and have been reported, a lot of places then print those results out on paper and scan those papers into a PDF for anyone who asks for a copy!
dwillis
Many jurisdictions do risk-limiting audits using the original ballots, so futzing with the results wouldn't necessarily make that easier. Also, cast vote records are public in many states - those are records of each ballot cast. So people can check.
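For readers who haven't seen how a risk-limiting audit decides when to stop, here is a minimal sketch of one well-known ballot-polling method (BRAVO, a sequential probability ratio test) for a two-candidate contest. It is only an illustration, not a description of any particular jurisdiction's procedure. The function name `bravo_audit`, the `"winner"`/`"loser"` labels, and the 5% risk limit are assumptions made for this example; `reported_winner_share` is taken to be the winner's reported share of the votes cast for those two candidates.

```python
def bravo_audit(sampled_ballots, reported_winner_share, risk_limit=0.05):
    """Sketch of a BRAVO-style ballot-polling audit (hypothetical helper).

    sampled_ballots: iterable of "winner", "loser", or anything else,
        read from randomly drawn paper ballots.
    Returns (confirmed, ballots_examined).
    """
    if not 0.5 < reported_winner_share < 1.0:
        raise ValueError("reported winner share must be between 0.5 and 1")

    threshold = 1.0 / risk_limit  # stop once the evidence ratio reaches this
    ratio = 1.0
    examined = 0
    for ballot in sampled_ballots:
        examined += 1
        if ballot == "winner":
            # A ballot for the reported winner is evidence for the outcome.
            ratio *= reported_winner_share / 0.5
        elif ballot == "loser":
            # A ballot for the reported loser is evidence against it.
            ratio *= (1.0 - reported_winner_share) / 0.5
        # Ballots for neither candidate leave the ratio unchanged.
        if ratio >= threshold:
            return True, examined
    # Sample exhausted without enough evidence: escalate (e.g. full hand count).
    return False, examined
```

The point for this thread is that the audit reads the original paper ballots, so a corrupted PDF or OCR pass downstream of the count has no effect on it.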
philips
You may consider reading about risk limiting audits. https://www.voting.works/audits
curtisszmania
[dead]
In college (about 15 years ago) I worked for a professor who was compiling precinct-level results for old elections. My job was just to request the info and then do manual data entry. It was abysmally slow.
This application seems very good - but it's still a bit amazing that lawmakers haven't just required that all data be uploaded as CSV! Even if every CSV were in a slightly different format, it would be way easier for everyone (LLM or not).
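To illustrate why even inconsistent CSVs are manageable, here is a rough sketch that maps differing column headers onto one schema using only Python's standard `csv` module. The header names in `HEADER_ALIASES`, the function name `normalize_results`, and the sample rows are all made up for the example; real county files would need their own alias table.

```python
import csv
import io

# Hypothetical header variants mapped to one canonical column name.
HEADER_ALIASES = {
    "precinct": "precinct", "precinct name": "precinct", "pct": "precinct",
    "candidate": "candidate", "name": "candidate",
    "votes": "votes", "total votes": "votes", "vote total": "votes",
}

def normalize_results(csv_text):
    """Parse one county's CSV text into rows with canonical keys."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            if header is None:
                continue  # extra cells beyond the header row
            key = HEADER_ALIASES.get(header.strip().lower())
            if key is not None:
                row[key] = (value or "").strip()
        # Vote counts sometimes carry thousands separators, e.g. "1,204".
        row["votes"] = int(row["votes"].replace(",", ""))
        rows.append(row)
    return rows

county_a = 'Precinct,Candidate,Votes\nWard 1,Smith,"1,204"\n'
county_b = "PCT,Name,Total Votes\n2A,Jones,87\n"
combined = normalize_results(county_a) + normalize_results(county_b)
```

A few dozen lines of alias mapping like this is a very different amount of work from OCRing scanned printouts.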