Indigenous engineers are using AI to preserve their culture
30 comments
·February 7, 2025
antics
I am one of these people! I am one of a handful of people who speak my ancestral language, Kiksht. I am lucky to be uniquely well-suited to this work, as I am (as far as I know) the lone person from my tribe whose academic research background is in linguistics, NLP, and ML. (We have, e.g., linguists, but very few computational linguists.)
So far I have not had much luck getting the models to learn Kiksht grammar and morphology via in-context learning; I think the model will have to be trained on the corpus to actually work. This mostly makes sense, since Kiksht has functionally nothing in common with Western languages.
To illustrate the point a bit: the bulk of training data is still English, and in English the semantics of a sentence are derived mainly from the specific order in which the words appear, mostly because English lost its cases some centuries ago. Its morphology is mainly "derivational" and mainly suffixal, meaning that words can be arbitrarily complicated by adding suffixes to them. Word order is so baked into English that we sometimes insert words into sentences simply to make the word order sensible; e.g., when we say "it's raining outside", the "it" refers to nothing at all. It is there entirely because the word order of English demands that it exists.
Kiksht, in contrast, is completely different. Its semantics are derived almost entirely from the triple-prefixal structure of (in particular) verbs. Word order almost does not matter. There are, like, 12 tenses, and some of them require both a prefix and a reflexive suffix. Verbs are often 1 or 2 characters, and with the prefix structure a single verb can often be a complete sentence. And so on.
I will continue working on this because I think it will eventually be of help. But right now the deep learning that has been most helpful to me has been for things like computational typology. For example, discovering the "vowel inventory" of a language is shockingly hard. Languages have somewhat consistent consonants, but discovering all the varieties of `a` that one can say in a language is very hard, and deep learning is strangely good at it.
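To make the typology point concrete, here is a toy sketch of the classical formant-clustering approach to vowel-inventory discovery: cluster measured formant pairs and let an information criterion pick the inventory size. The data file and cluster-size range are hypothetical placeholders, and a modern pipeline would likely use learned speech representations rather than raw formants.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical input: one (F1, F2) formant pair in Hz per vowel token,
# measured at the vowel midpoint in recorded speech.
formants = np.load("vowel_formants.npy")  # shape: (n_tokens, 2)

# Fit a Gaussian mixture for each candidate inventory size and keep the
# one preferred by the Bayesian information criterion (BIC).
best_k, best_bic, best_gmm = None, np.inf, None
for k in range(3, 13):  # candidate inventory sizes (placeholder range)
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(formants)
    bic = gmm.bic(formants)
    if bic < best_bic:
        best_k, best_bic, best_gmm = k, bic, gmm

print(f"Most plausible vowel inventory size: {best_k}")
print("Cluster centers (F1, F2):")
print(best_gmm.means_.round())
```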
ks2048
Awesome. Good luck to you!
I am also working on low-resource languages (in Central America, though not my heritage). From Wikipedia [0], it seems this is a case of revival. Are you collecting resources/data or using existing material? (I see some links on Wikipedia.)
antics
We are fortunate to have a (comparatively) large amount of written and recorded language artifacts. Kiksht (and Chinookan languages generally) were heavily studied in the early 1900s by linguists like Sapir.
Re: revival, the Wikipedia article is a little misleading: Gladys was the last person whose first language was Kiksht, not the last speaker. And, in any event, languages are constantly changing. If we had been left alone in 1804, the language would be different now than it was then. We will mold the language to our current context just like any other people.
amarant
Wow, Kiksht sounds like a pretty cool language! Are there any resources you'd recommend for the language itself? I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing; that sounds like a fascinating feature!
antics
So, bad news. Culturally, the Wasq'u consider Kiksht something that is for the community rather than outsiders. So unfortunately I think it will be extremely challenging to find someone to teach you, or resources to teach yourself.
thaumasiotes
> I'm mostly curious about the whole "a verb with prefix structure can be a whole sentence" thing, that sounds like a pretty cool language feature!
That's a fairly common language feature; such languages are generally called "agglutinating".
Prominent examples of agglutinating languages are the Eskimo languages, Turkic languages, and Finnish.
There should be no shortage of resources available if you want to learn Turkish or Finnish.
fnordpiglet
Good luck, I wish you the best. I think you will almost certainly need to create a LoRA and fine-tune an existing model. Is there enough written material available? I think this would be a valuable effort for humanity: the more languages we can model, the more powerful our models will become, because they embody different semantic structures with different strengths. (Beyond the obvious benefits of language preservation.)
antics
There is more material than you'd normally find, but it is very hard to even fine-tune at that volume, unfortunately. I think we might be able to bootstrap something like that with a shared corpus of related languages, though.
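For anyone wondering what the LoRA route would look like in practice, here is a minimal sketch using Hugging Face's `transformers` and `peft`. The base model name, corpus path, and hyperparameters are placeholders, and, as noted above, a corpus this small may well not be enough.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Attach low-rank adapters to the attention projections; only these
# small matrices are trained, which suits a tiny corpus.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# One line of text per line in the (placeholder) corpus file.
data = load_dataset("text", data_files={"train": "corpus.txt"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", num_train_epochs=3,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```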
koolba
According to Wikipedia there were 69 fluent speakers of Kiksht in 1990, and the last one passed away in 2012. How did you learn the language?
antics
I learned it from my grandmother and from Gladys's grandkid. Gladys was the last person whose first language was Kiksht, not the last person who speaks it.
AnotherGoodName
Fwiw, this is the original use case for LLMs. The whole context-awareness thing came about for the purpose of translation, and the famous "Attention Is All You Need" paper was from the Google translation team.
Say what you will about LLMs being overhyped, but this is the original core use case.
thomasfromcdnjs
I'm an Indigenous Australian and have been slowly working on this problem in my own way for a few years.
https://github.com/australia/mobtranslate.com/
In its current iteration the homepage is just running dictionaries through OpenAI; my tribe's dictionary fits in a 100k context window. (A sketch of the approach is at the end of this comment.)
My old ambitions can be found somewhat here -> https://github.com/australia/mobtranslate-server
That being said, the OpenAI models do a fantastic job of translating sentences, so I've pushed my own model research to the back burner. (Will try to find some examples.)
I can't speak to true preservation (not many native speakers left), but in my mind that's not even all that important from a personal/cultural perspective.
If the youth who are interested in learning more about their language have a nice interface with 70-80% accurate results, and they enjoy doing/learning it, then that is a win to me. (And kind of how language evolves anyway.) (The noun replacement seems to work great, but the grammar is obviously wishy-washy.)
(At this point, I've just rushed to get my tribe's dictionary crawlable, so hopefully it will make it into a few models' next training runs.)
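For anyone who wants to try the same trick with their own word list, "running a dictionary through OpenAI" looks roughly like this. A minimal sketch: the model name and file path are placeholders, not how mobtranslate.com is actually wired up.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate(sentence: str, dictionary_text: str) -> str:
    """Stuff the whole dictionary into the context and ask for a translation."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any large-context model works
        messages=[
            {"role": "system", "content": (
                "You translate English into the target language using ONLY "
                "the dictionary below. If a word has no entry, keep it in "
                "English and mark it with [?].\n\n" + dictionary_text)},
            {"role": "user", "content": sentence},
        ],
    )
    return resp.choices[0].message.content

with open("dictionary.txt") as f:  # placeholder path to the word list
    print(translate("The river is cold today.", f.read()))
```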
prawn
I was looking to name a property recently and tried ChatGPT for suggestions on a theme of geographical features in various languages. I had it try Kaurna (Adelaide area) on a whim and was pleasantly surprised (given that Google Translate doesn't cover it, I guess) to find that it gave loads of relevant suggestions without any issue. Everything I picked out to spot-check verified OK against Kaurna dictionaries.
joshdavham
This is a fantastic use case for LLMs. Also, godspeed to these researchers! There's unfortunately not a lot of time left for many of these languages.
userbinator
s/preserve/hallucinate/
The next few decades are going to be really, really weird.
lolinder
This take is overly cynical in this case: the original and primary use case for LLMs is to model languages in a comprehensive way. It's literally in the name.
Hallucinations rarely produce invalid grammar or invented words; the concern is facts, and facts aren't relevant at all when the goal is language preservation.
userbinator
Depending on how powerful the language modeling is, I suspect it could lead to an LLM that will confidently and convincingly tell you how to say things like "floppy disk" and "SSD" in every extinct language, even those that went extinct long before computers existed, which is... interesting, but not exactly truth.
I've seen LLMs hallucinate nonexistent things in programming languages. It's hard to believe it won't do the same to human ones.
lolinder
Importantly, when a model hallucinates non-existent things in a programming language, it is still stringing together valid English words into something that looks like it ought to be a correct concept. It doesn't construct new words from scratch.
If a language model were asked what the word for "floppy disk" was in an extinct language and it invented a decent circumlocution, I don't think that would be a bad thing. People who are just engaging with the model as a way of connecting with their cultural heritage won't mind much if there is some creative application, and scholars are going to be aware of the limitations.
Again, the misapplication of language models as databases is why hallucinations are a problem. This use case isn't treating the model as a database; it's treating the model as a model, so the consequences of hallucination are much smaller, to the point of irrelevance.
idunnoman1222
…Hallucinates a non-existent setting that could easily be added in a PR and merged tomorrow
bitwize
This sounds like a grounded use of LLMs. Presumably they're feeding indigenous-language text into the NN to build a model of how the language works. From this model one may then derive starting points for grammar, morphology, vocabulary, and so on. Like, how would you say "large language model" in Navajo? If fed data on Navajo neologisms, an LLM might come up with some long word that means "the large thing by means of which one can teach metal to speak" or similar. And the tribal community can take, leave, or modify that suggestion, but it's based on patterns that are manifest in the language, patterns which AI statistical methods can elicit.
Machine learning techniques are really, really good at finding statistical patterns in data they're trained on. What they're not good at is making inferences on facts they haven't been specifically trained to accommodate.
sealeck
[flagged]
Mengkudulangsat
On a side note, can anyone recommend an AI tool I can use to learn a random niche language as a hobby (e.g., Toki Pona)?
tho23i423434
I wonder how useful this really is.
No doubt it's excellent for archiving, but that's not the same as "preserving" culture. If it's not alive and kicking, it's not a culture, IMO. You see this happen even with texts: once things start being written down, the actual knowledge tends to get lost (see India, for example).
This "AI to help low-resource languages" thing is a big deal in India too, but it just feels like another "jumla" for academics/techbros to make money. I mean, India has brutal/vicious policies that are out to destroy any and every language that's not English (since it's automatically a threat to central rule from Delhi), but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English... Not even the ones who go "ree bad British man destroyed India" on twitter all day.
Boldened15
> once things start being written down, the actual knowledge tends to get lost (see India for example)
Curious, what do you mean by this?
> pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English
Well, I've never heard of this, so lack of awareness would be an obvious cause if it's an issue; are there any orgs raising awareness of it? It also seems surprising to me: Bollywood movies are immensely popular and are all in Hindi. Is there a danger of English overtaking Indian society to the extent that Bollywood movies would mostly be made in English?
BurningFrog
It's really documenting the culture. A great thing in itself.
I think of cultures and languages as tools. When they don't serve a purpose anymore, they should be replaced by something more functional.
ks2048
> but pretty much no intellectual, either in India or the US, actually cares about the mass-wiping out of Indian languages by English
That's surprising, and it seems different from what I've seen for other languages in other parts of the world (even if it's a relatively new phenomenon).
aussieguy1234
Once an LLM knows an indigenous language, then even after the last speaker dies, future generations will be able to learn the language and use the LLM to converse in it.
Learning a new language is a good use case for LLMs in general: not just indigenous languages, but any language.
As for your comment "ree bad British man destroyed India": this sounds more like politics than anything of substance.
deadbabe
Finally, a good use case for LLMs that isn’t just trying to anthropomorphize some already solved automation problem.
throwaway970598
[dead]
highcountess
[dead]