Indigenous engineers are using AI to preserve their culture
30 comments
·February 7, 2025
antics
I am one of these people! I am one of a handful of people who speak my ancestral language, Kiksht. I am lucky to be uniquely well-suited to this work, as I am (as far as I know) the lone person from my tribe whose academic research background is in linguistics, NLP, and ML. (We have, e.g., linguists, but very few computational linguists.)
So far I have not had much luck getting the models to learn Kiksht grammar and morphology via in-context learning; I think the model will have to be trained on the corpus to actually work. This mostly makes sense, since Kiksht has functionally nothing in common with Western languages.
To illustrate the point a bit: the bulk of training data is still English, and in English the semantics of a sentence are derived mainly from the specific order in which the words appear, mostly because English lost its cases some centuries ago. Its morphology is mainly "derivational" and mainly suffixal, meaning that words can be arbitrarily complicated by adding suffixes to them. Word order is so baked into English that we sometimes insert words into sentences simply to make the word order sensible; e.g., when we say "it's raining outside", the "it" refers to nothing at all. It is there entirely because the word order of English demands that it exists.
Kiksht, in contrast, is completely different. Its semantics are derived almost entirely from the triple-prefixal structure of (in particular) verbs. Word order almost does not matter. There are, like, 12 tenses, and some of them require both a prefix and a reflexive suffix. Verbs are often 1 or 2 characters, and with the prefix structure a single verb can often be a complete sentence. And so on.
I will continue working on this because I think it will eventually be of help. But right now the deep learning that has been most helpful to me has been for things like computational typology. For example, discovering the "vowel inventory" of a language is shockingly hard. Languages have somewhat consistent consonants, but discovering all the varieties of `a` that one can say in a language is very hard, and deep learning is strangely good at it.
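To make the typology point concrete, here is a toy sketch of the classical formant-clustering approach to vowel-inventory discovery: cluster measured formant pairs and let an information criterion pick the inventory size. The data file and cluster-size range are hypothetical placeholders, and a modern pipeline would likely use learned speech representations rather than raw formants.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical input: one (F1, F2) formant pair in Hz per vowel token,
# measured at the vowel midpoint in recorded speech.
formants = np.load("vowel_formants.npy")  # shape: (n_tokens, 2)

# Fit a Gaussian mixture for each candidate inventory size and keep the
# one preferred by the Bayesian information criterion (BIC).
best_k, best_bic, best_gmm = None, np.inf, None
for k in range(3, 13):  # candidate inventory sizes (placeholder range)
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(formants)
    bic = gmm.bic(formants)
    if bic < best_bic:
        best_k, best_bic, best_gmm = k, bic, gmm

print(f"Most plausible vowel inventory size: {best_k}")
print("Cluster centers (F1, F2):")
print(best_gmm.means_.round())
```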
ks2048
Awesome. Good luck to you!
I am also working on low-resource languages (in Central America, though not my heritage). From Wikipedia [0], it seems this is a case of revival. Are you collecting resources/data or using existing material? (I see some links on Wikipedia.)
antics
We are fortunate to have a (comparatively) large amount of written and recorded language artifacts. Kiksht (and Chinookan languages generally) were heavily studied in the early 1900s by linguists like Sapir.
Re: revival, the Wikipedia article is a little misleading: Gladys was the last person whose first language was Kiksht, not the last speaker. And, in any event, languages are constantly changing. If we had been left alone in 1804, the language would be different now than it was then. We will mold the language to our current context just like any other people.
amarant
Wow, Kiksht sounds like a pretty cool language! Are there any resources you'd recommend for the language itself? I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing; that sounds like a fascinating feature!
antics
So, bad news. Culturally, the Wasq'u consider Kiksht something that is for the community rather than outsiders. So unfortunately I think it will be extremely challenging to find someone to teach you, or resources to teach yourself.
thaumasiotes
> I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!
That's a fairly common language feature; such languages are generally called "agglutinating".
Prominent examples of agglutinating languages are the Eskimo languages, Turkic languages, and Finnish.
There should be no shortage of resources available if you want to learn Turkish or Finnish.
fnordpiglet
Good luck, I wish you the best. I think you will almost certainly need to create a LoRA and fine-tune an existing model. Is there enough written material available? I think this would be a valuable effort for humanity: the more languages we can model, the more powerful our models will become, because they embody different semantic structures with different strengths. (Beyond the obvious benefits of language preservation.)
antics
There is more material than you'd normally find, but it is very hard to even fine-tune at that volume, unfortunately. I think we might be able to bootstrap something like that with a shared corpus of related languages, though.
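For anyone wondering what the LoRA route would look like in practice, here is a minimal sketch using Hugging Face's `transformers` and `peft`. The base model name, corpus path, and hyperparameters are placeholders, and, as noted above, a corpus this small may well not be enough.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters to the attention projections; only these
# small matrices are trained, which suits a tiny corpus.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# One line of text per line in the (placeholder) corpus file.
data = load_dataset("text", data_files={"train": "corpus.txt"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```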
koolba
According to Wikipedia there were 69 fluent speakers of Kiksht in 1990, and the last one passed away in 2012. How did you learn the language?
antics
I learned it from my grandmother and from Gladys's grandkid. Gladys was the last person whose first language was Kiksht, not the last person who speaks it.
AnotherGoodName
Fwiw, this is the original use case for LLMs. The whole context-awareness thing came about for the purpose of translation, and the famous "Attention Is All You Need" paper was from the Google translation team.
Say what you will about LLMs being overhyped, but this is the original core use case.
thomasfromcdnjs
I'm an Indigenous Australian and have been slowly working on this problem in my own way for a few years.
https://github.com/australia/mobtranslate.com/
In its current iteration the homepage is just running dictionaries through OpenAI; my tribe's dictionary fits in a 100k context window. (A sketch of the approach is at the end of this comment.)
My old ambitions can be found somewhat here -> https://github.com/australia/mobtranslate-server
That being said, the OpenAI models do a fantastic job of translating sentences, so I've pushed my own model research to the back burner. (Will try to find some examples.)
I can't speak to true preservation (not many native speakers left), but in my mind that's not even all that important from a personal/cultural perspective.
If the youth who are interested in learning more about their language have a nice interface with 70-80% accurate results, and they enjoy doing/learning it, then that is a win to me. (And kind of how language evolves anyway.) (The noun replacement seems to work great, but the grammar is obviously wishy-washy.)
(At this point, I've just rushed to get my tribe's dictionary crawlable, so hopefully it will make it into a few models' next training runs.)
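For anyone who wants to try the same trick with their own word list, "running a dictionary through OpenAI" looks roughly like this. A minimal sketch: the model name and file path are placeholders, not how mobtranslate.com is actually wired up.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate(sentence: str, dictionary_text: str) -> str:
    """Stuff the whole dictionary into the context and ask for a translation."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any large-context model works
        messages=[
            {"role": "system", "content": (
                "You translate English into the target language using ONLY "
                "the dictionary below. If a word has no entry, keep it in "
                "English and mark it with [?].\n\n" + dictionary_text)},
            {"role": "user", "content": sentence},
        ],
    )
    return resp.choices[0].message.content

with open("dictionary.txt") as f:  # placeholder path to the word list
    print(translate("The river is cold today.", f.read()))
```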
prawn
I was looking to name a property recently and tried ChatGPT for suggestions on a theme of geographical features in various languages. I had it try Kaurna (Adelaide area) on a whim and was pleasantly surprised (given that Google Translate doesn't cover it, I guess) to find that it gave loads of relevant suggestions without any issue. Everything I picked out to spot-check verified OK against Kaurna dictionaries.
joshdavham
This is a fantastic use case for LLMs. Also, godspeed to these researchers! There's unfortunately not a lot of time left for many of these languages.
userbinator
s/preserve/hallucinate/
The next few decades are going to be really, really weird.
lolinder
This take is overly cynical in this case: the original and primary use case for LLMs is to model languages in a comprehensive way. It's literally in the name.
Hallucinations rarely produce invalid grammar or invented words; the concern is facts, and facts aren't relevant at all when the goal is language preservation.
userbinator
Depending on how powerful the language modeling is, I suspect it could lead to an LLM that will confidently and convincingly tell you how to say things like "floppy disk" and "SSD" in every extinct language, even those that went extinct long before computers existed, which is... interesting, but not exactly truth.
I've seen LLMs hallucinate nonexistent things in programming languages. It's hard to believe it won't do the same to human ones.
lolinder
Importantly, when a model hallucinates non-existent things in a programming language, it is still stringing together valid English words into something that looks like it ought to be a correct concept. It doesn't construct new words from scratch.
If a language model were asked what the word for "floppy disk" was in an extinct language and it invented a decent circumlocution, I don't think that would be a bad thing. People who are just engaging with the model as a way of connecting with their cultural heritage won't mind much if there is some creative application, and scholars are going to be aware of the limitations.
Again, the misapplication of language models as databases is why hallucinations are a problem. This use case isn't treating the model as a database; it's treating the model as a model, so the consequences of hallucination are much smaller, to the point of irrelevance.
idunnoman1222
…Hallucinates a non-existent setting that could easily be added in a PR and merged tomorrow
bitwize
This sounds like a grounded use of LLMs. Presumably they're feeding indigenous-language text into the NN to build a model of how the language works. From this model one may then derive starting points for grammar, morphology, vocabulary, and so on. Like, how would you say "large language model" in Navajo? If fed data on Navajo neologisms, an LLM might come up with some long word that means "the large thing by means of which one can teach metal to speak" or similar. And the tribal community can take, leave, or modify that suggestion, but it's based on patterns that are manifest in the language, patterns which AI statistical methods can elicit.
Machine learning techniques are really, really good at finding statistical patterns in data they're trained on. What they're not good at is making inferences on facts they haven't been specifically trained to accommodate.
sealeck
[flagged]
Mengkudulangsat
On a side note, can anyone recommend an AI tool I can use to learn a random niche language as a hobby (e.g., Toki Pona)?
tho23i423434
I wonder how useful this really is.
No doubt it's excellent for archiving, but that's not the same as "preserving" culture. If it's not alive and kicking, it's not a culture, IMO. You see this happen even with texts: once things start being written down, the actual knowledge tends to get lost (see India, for example).
This "AI to help low-resource languages" thing is a big deal in India too, but it just feels like another "jumla" for academics/techbros to make money. I mean, India has brutal/vicious policies that are out to destroy any and every language that's not English (since it's automatically a threat to central rule from Delhi), but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English... Not even the ones who go "ree bad British man destroyed India" on twitter all day.
Boldened15
> once things start being written down, the actual knowledge tends to get lost (see India for example)
Curious, what do you mean by this?
> pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English
Well, I've never heard of this, so lack of awareness would be an obvious cause if it's an issue; are there any orgs raising awareness of it? It also seems surprising to me: Bollywood movies are immensely popular and are all in Hindi. Is there a danger of English overtaking Indian society to the extent that Bollywood movies would mostly be made in English?
BurningFrog
It's really documenting the culture. A great thing in itself.
I think of cultures and languages as tools. When they don't serve a purpose anymore, they should be replaced by something more functional.
ks2048
> but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English
That's surprising, and it seems different from what I've seen for other languages in other parts of the world (even if it's a relatively new phenomenon).
aussieguy1234
Once an LLM knows an indigenous language, then even after the last speaker dies, future generations will be able to learn the language and use the LLM to converse in it.
Learning a new language is a good use case for LLMs in general: not just indigenous languages, but any language.
As for your comment "ree bad British man destroyed India": this sounds more like politics than anything of substance.
deadbabe
Finally, a good use case for LLMs that isn’t just trying to anthropomorphize some already solved automation problem.
throwaway970598
[dead]
highcountess
[dead]