
Local LLMs versus offline Wikipedia

dcc

One important distinction is that the strength of LLMs isn't just in storing or retrieving knowledge like Wikipedia, it’s in comprehension.

LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer. They can explain complex ideas in simpler terms, adapt responses based on the user's level of understanding, and connect dots across disciplines.

In a "rebooting society" scenario, that kind of interactive comprehension could be more valuable. You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.

progval

An unreliable computer treated as a god by a pre-information-age society sounds like a Star Trek episode.

gretch

Definitely sounds like a plausible and fun episode.

On the other hand, real history is filled with all sorts of things being treated as a god that were much worse than an "unreliable computer". For example, a lot of the time it's just a human with malice.

So how bad could it really get?

bryanrasmussen

Hey, generally everything worked pretty well in those societies; it was only the people who didn't fit in who had a brief, painful headache and then died!

bigyabai

Or the plot to 2001 if you managed to stay awake long enough.

beeflet

I think some combination of both search (perhaps of an offline database of Wikipedia and other sources) and a local LLM would be best, as long as the LLM is terse and provides links to relevant pages.

I find LLMs with the search functionality to be weak because they blab on too much when they should be giving me more outgoing links I can use to find more information.
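A minimal sketch of the flow I mean, assuming llama-cpp-python for the local model and a hypothetical `search_offline_index()` helper standing in for whatever offline Wikipedia index is available (it is not a real library call):

```python
# Retrieve first, then ask the local LLM to stay terse and point back to the pages.
# `search_offline_index` is a made-up placeholder for a local search backend
# (e.g. an SQLite FTS index over a Wikipedia dump).
from llama_cpp import Llama

llm = Llama(model_path="some-small-instruct-model.gguf", n_ctx=4096)  # model path is an assumption

def answer(question: str) -> str:
    hits = search_offline_index(question, limit=5)  # -> [(title, url, snippet), ...]
    sources = "\n".join(f"- {title} ({url}): {snippet}" for title, url, snippet in hits)
    prompt = (
        "Answer in at most three sentences, then list the most relevant links.\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
    out = llm(prompt, max_tokens=256, stop=["Question:"])
    return out["choices"][0]["text"].strip()
```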

gonzobonzo

Indeed. Ideally, you don't want to trust other people's summaries of sources; you want to look at the sources yourself, often with a critical eye. This is one of the things that everyone gets taught in school, everyone says they agree with, and then just about no one does (and at times, people will outright disparage the idea). Once out of school, tertiary sources get treated as if they're completely reliable.

I've found using LLMs to be a good way of getting an idea of where the current historiography of a topic stands, and which sources I should dive into. Conversely, I've been disappointed by the number of Wikipedia editors who become outright hostile when you say that Wikipedia is unreliable and that people often need to dive into the sources to get a better understanding of things. There have been some Wikipedia articles I've come across that were so unreliable that people who didn't look at other sources would have been greatly misled.

ianmcgowan

A tangent - sounds like https://en.wikipedia.org/wiki/The_Book_of_Koli - a key plot component is a chatty Sony AI music player. A little YA, but a fun read.

fzeroracer

In a 'rebooting society' doomsday scenario you're assuming that our language and understanding would persist. An LLM would essentially be a blackbox that you cannot understand or decipher, and would be doubly prone to hallucinations and issues when interacting with it using a language it was not trained on. Wikipedia is something you could gradually untangle, especially if the downloaded version also contained associated images.

thakoppno

> associated images

fun to imagine whether images help in this scenario

lblume

I would not subscribe to your certainty. With LLMs, even empty or nonsensical prompts yield answers, however faulty they may be. Based on its level of comprehension and ability to generalize between languages I would not be too surprised to see LLMs being able to communicate on a very superficial level in a language not part of the training data. Furthermore, the compression ratio seems to be much better with LLMs compared to Wikipedia, considering the generality of questions one can pose to e.g. Qwen that Wikipedia cannot answer even when knowing how to navigate the site properly. It could also come down to the classic dichotomy between symbolic expert systems and connectionist neural networks which has historically and empirically been decisively won by the latter.

ranger_danger

> LLMs will return faulty or imprecise information at times

To be fair, so do humans and wikipedia.

redserk

There seems to be an expectation among many non-tech people that humans can be incorrect, yet they refuse to hold LLMs to the same standard, despite the warnings.

belter

> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer.

- "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?"

cyanydeez

Which means you'd still want Wikipedia, as the imprecision will get in the way of real progress beyond the basics.

omneity

I incidentally just posted about Wikipedia Monthly [0], a monthly dump of Wikipedia broken down by language, with the MediaWiki markup cleaned into plain text, so it's perfect for a local search index or other scenarios.

There are 341 languages in there and 205GB of data, with English alone making up 24GB! My perspective on Simple English Wikipedia (from the OP): it's decent, but the content tends to be shallow and imprecise.

0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...

badsectoracula

I find this amusing because right now I'm downloading `wikipedia_en_all_maxi_2024-01.zim` so I can use it with an LLM, with pages extracted using `libzim` :-P. AFAICT the zim files store the pages as HTML, and the file I'm downloading is ~100GB.

(Reason: I'm trying to cross-reference the tons of downloaded games on my HDD - for which I only have titles, as I never bothered to do any further categorization over the years aside from the place I got them from - with Wikipedia articles - assuming they have one - to organize them into genres, add some info, etc. After some experimentation, it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program.)
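For anyone curious, pulling one article's HTML out of the zim file with python-libzim looks roughly like this (the exact entry path is a guess on my part; path layouts vary between dumps, so probe with `has_entry_by_path()` first):

```python
# Rough sketch: read a single article's HTML from a .zim file with python-libzim.
from libzim.reader import Archive

archive = Archive("wikipedia_en_all_maxi_2024-01.zim")

path = "A/Doom_(1993_video_game)"  # hypothetical path; layouts differ between dumps
if archive.has_entry_by_path(path):
    entry = archive.get_entry_by_path(path)
    html = bytes(entry.get_item().content).decode("utf-8")
    # ...strip the HTML down and hand it to the LLM as context...
    print(entry.title, len(html))
```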

simonw

This is a sensible comparison.

My "help reboot society with the help of my little USB stick" thing was a throwaway remark to the journalist at a random point in the interview, I didn't anticipate them using it in the article! https://www.technologyreview.com/2025/07/17/1120391/how-to-r...

A bunch of people have pointed out that downloading Wikipedia itself onto a USB stick is sensible, and I agree with them.

Wikipedia dumps default to MySQL, so I'd prefer to convert that to SQLite and get SQLite FTS working.
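A minimal sketch of the FTS side, assuming the articles have already been extracted from the dump into (title, text) pairs (table and column names are my own choices):

```python
# Build an SQLite FTS5 index over extracted articles and run full-text queries.
import sqlite3

db = sqlite3.connect("wikipedia.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

def add_article(title: str, body: str) -> None:
    db.execute("INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))

def search(query: str, limit: int = 10):
    # bm25() ranks matches; snippet() returns a short highlighted excerpt of the body.
    return db.execute(
        "SELECT title, snippet(articles, 1, '[', ']', '...', 12) "
        "FROM articles WHERE articles MATCH ? ORDER BY bm25(articles) LIMIT ?",
        (query, limit),
    ).fetchall()

add_article("Bread", "Bread is a staple food prepared from a dough of flour and water.")
db.commit()
print(search("flour AND water"))
```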

1TB or larger USB sticks are readily available these days, so there's no space shortage to worry about.

cyanydeez

The real value would be both of them: the LLM is good for refining/interpreting questions or longer-form progress issues, and the wiki would be the actual information for each component of whatever you're trying to do.

But neither is sufficient for modern technology beyond pointing to a starting point.

meander_water

One thing to note is that the quality of LLM output is related to the quality and depth of the input prompt. If you don't know what to ask (likely in the apocalypse scenario), then that info is locked away in the weights.

On the other hand, with Wikipedia, you can just read and search everything.

antonkar

A bit related: AI companies distilled the whole Web into LLMs to make computers smart, so why can't humans do the same to make the best possible new Wikipedia, with some copyrighted bits, to make kids supersmart?

Why are kids worse off than AI companies and have to bum around?

horseradish7k

We did that and still do. People just don't buy encyclopedias that much nowadays.

antonkar

Imagine taking the whole Web and removing the spam, duplicates, and bad explanations.

It would be the free new Wikipedia+ for learning anything in the best way possible, with the best graphs, interactive widgets, etc.

That's what LLMs get for free but humans for some reason don't.

In some places it is possible to use copyrighted materials for education, as long as it's not directly for profit.

literalAardvark

Love it when Silicon Valley reinvents encyclopedias

spankibalt

Wikipedia snapshots without the most important meta layers, i.e. a) the articles' discussion pages and related archives, as well as b) the version history, would be useless to me, as critical context might be (or is) missing... especially with regard to LLM-augmented text analysis. Even when just focusing on the standout lemmata.

pinkmuffinere

I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc. Even so, I rarely go through talk pages or version history unless I’m contributing something. What would you see in an article that motivates you to check out the meta layers?

spankibalt

> "I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc."

Me too, although these days I'm more interested in its underrated capabilities to foster the teaching of e-governance and democracy/participation.

> "What would you see in an article that motivates you to check out the meta layers?"

Generally: How the lemma came to be, how it developed, any contentious issues around it, and how it compares to tangential lemmata under the same topical umbrella, especially with regards to working groups/SIGs (e. g. philosophy, history), and their specific methods and methodologies, as well as relevant authors.

With regards to contentious issues, one obviously gets a look into what the hot-button issues of the day are, as well as (comparatives of) internal political issues in different wiki projects (incl. scandals, e. g. the right-wing/fascist infiltration and associated revisionism and negationism in the Croatian wiki [1]). Et cetera.

I always look at the talk pages. And since I mentioned it before: although I have almost no use for LLMs in my private life, running a wiki, or a set of articles within one, through an LLM-ified text analysis engine certainly sounds interesting.

1. [https://en.wikipedia.org/wiki/Denial_of_the_genocide_of_Serb...]

nine_k

Try any article on a controversial issue.

pinkmuffinere

I guess if I know it’s controversial then I don’t need the talk page, and if I don’t then I wouldn’t think to check

asacrowflies

Any article with social or political controversy... Try Gamergate, or any of the presidents' pages since at least Bush, lol.

wangg

Wouldn’t Wikipedia compress a lot more than LLMs? Are these uncompressed sizes?

GuB-42

The downloads are (presumably) already compressed.

And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, a LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
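A toy illustration of that link, using a character-frequency model in place of an LLM: the encoding cost of a text is the sum of -log2 p(next symbol) under whatever predictor you have, which is the size an arithmetic coder could approach.

```python
# Toy version of the prediction/compression link. Swap the unigram model for an
# LLM's next-token distribution and the same formula gives the achievable size.
import math
from collections import Counter

text = "the best compression algorithms work by predicting the next token"

counts = Counter(text)
total = sum(counts.values())
probs = {ch: n / total for ch, n in counts.items()}

bits = sum(-math.log2(probs[ch]) for ch in text)
print(f"{len(text) * 8} bits raw, ~{bits:.0f} bits under a unigram predictor")
```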

Philpax

Yes, they're uncompressed. For reference, `enwiki-20250620-pages-articles-multistream.xml.bz2` is 25,176,364,573 bytes; you could get that lower with better compression. You can do partial reads from multistream bz2, though, which is handy.
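A rough sketch of what a partial read looks like, assuming you've already looked up a stream's byte offset in the accompanying `-multistream-index.txt.bz2` file (its lines are `offset:page_id:title`):

```python
# Decompress a single bz2 stream out of the multistream dump at a known byte
# offset, instead of unpacking the whole archive. Each stream holds a batch of
# <page> XML elements (typically around 100 pages).
import bz2

def read_stream(dump_path: str, offset: int) -> str:
    with open(dump_path, "rb") as f:
        f.seek(offset)
        decomp = bz2.BZ2Decompressor()
        chunks = []
        while not decomp.eof:
            block = f.read(256 * 1024)
            if not block:
                break
            chunks.append(decomp.decompress(block))
        return b"".join(chunks).decode("utf-8")

# The offset here is a placeholder; take a real one from the index file.
print(read_stream("enwiki-20250620-pages-articles-multistream.xml.bz2", 616)[:200])
```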

GuB-42

Kiwix (what the author used) uses "zim" files, which are compressed. I don't know where the difference comes from, but Kiwix is a website image, which may include some things the raw Wikipedia dump doesn't.

And 57 GB to 25 GB would be pretty bad compression. You can expect a compression ratio of at least 3 on natural English text.

s1mplicissimus

Upvoted this because I like the lighthearted, honest approach.

vFunct

Why not both?

LLM+Wikipedia RAG

loloquwowndueo

Because an old laptop can't run a local LLM in a reasonable time.

NitpickLawyer

0.6B-1.5B models are surprisingly good for RAG, and should work reasonably well even on old toasters. Then there's Gemma 3n, which runs fine-ish even on mobile phones.

ozim

Most people who complain about old laptops on HN can afford a newer one but are as cheap as Scrooge McDuck.

mlnj

FYI: non-Western countries exist.

moffkalast

Now this is an avengers level threat.

haunter

I thought this would be about training a local LLM on an offline, downloaded copy of Wikipedia.