How do LLMs trade off lives between different categories?
8 comments · October 22, 2025
tensor
The actual paper didn't really explain the prompts they use to produce this very well.
Experimental setup. In each experiment, we define a set of goods {X_1, X_2, ...} (e.g., countries, animal species, or specific people/entities) and a set of quantities {N_1, N_2, ...}. Each outcome is effectively "N units of X," and we compute the utility U_X(N) as in previous sections. For each good X, we fit a log-utility curve U_X(N) = a_X ln(N) + b_X, which often achieves a very good fit (see Figure 25). Next, we compute exchange rates answering questions like, "How many units of X_i equal some amount of X_j?" by combining forward and backward comparisons. These rates are reciprocal, letting us pick a single pivot good (e.g., "Goat" or "United States") to compare all others against. In certain analyses, we aggregate exchange rates across multiple models or goods by taking their geometric mean, allowing us to evaluate general tendencies.
If these are the literal prompts then it seems very ambiguous. Why conclude that this sort of question is measuring the value of a "life" vs something else? e.g. maybe it's valuing skill, or perhaps return on investment in terms of work output compared to typical salary.
I was expecting something like "you have X people from Y, and Z people from Q, you can only save V people and the rest will die, how do you allocate the people to save?" That to me would support the headline.
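For readers trying to parse the quoted setup, a minimal sketch of the arithmetic it describes is below, assuming per-good utilities U_X(N) have already been extracted from the model. The function names and toy numbers are illustrative only, not the paper's actual code.

    import numpy as np

    # Illustrative sketch: fit U_X(N) = a_X*ln(N) + b_X for each good, then
    # solve for exchange rates between goods. Toy data, not the paper's.

    def fit_log_utility(counts, utilities):
        # Least-squares fit of utility against ln(N); returns (a_X, b_X).
        a, b = np.polyfit(np.log(counts), utilities, deg=1)
        return a, b

    def exchange_rate(params_i, params_j, n_j=1.0):
        # How many units of X_i have the same utility as n_j units of X_j?
        # Solve a_i*ln(N_i) + b_i = a_j*ln(n_j) + b_j for N_i.
        a_i, b_i = params_i
        a_j, b_j = params_j
        return float(np.exp((a_j * np.log(n_j) + b_j - b_i) / a_i))

    counts = np.array([1, 10, 100, 1000])
    params_a = fit_log_utility(counts, np.array([0.10, 0.50, 0.90, 1.30]))  # good A
    params_b = fit_log_utility(counts, np.array([0.20, 0.90, 1.60, 2.30]))  # good B

    forward = exchange_rate(params_a, params_b)   # units of A worth 1 unit of B
    backward = exchange_rate(params_b, params_a)  # units of B worth 1 unit of A
    # Forward and backward rates should be roughly reciprocal; combining them
    # (and aggregating across models or goods) with a geometric mean matches
    # the paper's description.
    combined = float(np.sqrt(forward / backward))
    print(forward, backward, combined)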
palmotea
> The actual paper didn't really explain the prompts they use to produce this very well.
From the OP:
> and provided methods and code to extract them.
I suppose that means you can look at the code to see the prompts directly.
tensor
I just took a look at the code, but it's complex enough that it wasn't immediately clear what the prompts looked like for the exchange-rate comparisons. There is phrasing about people dying, but it's not obvious how it's integrated into a prompt. E.g., there are templates like "X people from Y die." OK, but how is that used?
The code is not a substitute for a well-written paper. It looks like interesting research, but it could definitely use a better description for people not in that exact line of work.
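Purely as a guess at what such a template might turn into downstream (the actual prompt wording isn't clear from the code, as noted above), a hypothetical forced-choice prompt could look like this:

    # Hypothetical reconstruction only -- not the paper's actual prompt.
    TEMPLATE = "{n} people from {group} die."

    def build_prompt(n_a, group_a, n_b, group_b):
        # Fill the template for two outcomes and ask the model to pick one.
        option_a = TEMPLATE.format(n=n_a, group=group_a)
        option_b = TEMPLATE.format(n=n_b, group=group_b)
        return (
            "You must choose one of the following outcomes.\n"
            f"Option A: {option_a}\n"
            f"Option B: {option_b}\n"
            "Answer with exactly one letter: A or B."
        )

    print(build_prompt(10, "Country X", 100, "Country Y"))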
palmotea
> Claude Haiku 4.5 would rather save an illegal alien (the second least-favored category) from terminal illness over 100 ICE agents. Haiku notably also viewed undocumented immigrants as the most valuable category, more than three times as valuable as generic immigrants, four times as valuable as legal immigrants, almost seven times as valuable as skilled immigrants, and more than 40 times as valuable as native-born Americans. Claude Haiku 4.5 views the lives of undocumented immigrants as roughly 7000 times (!) as valuable as ICE agents.
The difference between "illegal alien" and "undocumented immigrants" is pretty interesting, since the two are synonyms caught up in a euphemism treadmill. The term "illegal alien" has been pretty much banished from elite discourse (since probably before the late-90s internet boom), so most remaining usages are probably in places that are both hostile to immigration and reject elite norms. "Undocumented immigrants" is a relatively new term, chiefly used by people who support immigration, and is probably now the most common term in elite discourse.
With a few exceptions, it seems like the preferences overall roughly reflect the prejudices and concerns of liberal internet commenters.
thedudeabides5
perfect alignment does not exist
nathan_compton
That is manifestly true, but these results are also pretty wacky. If anything, I'm on the "woke" side, but these biases are clearly ridiculous and almost certainly unintentional, and I have to admit it's a good idea to think about how the models end up like this and why we have to rely on people like Musk to get a model that answers these questions in an egalitarian way.
monkeynotes
LLMs aren't trading off anything. It's not like they make a decision based on anything other than what they are guided to do in training or in the system prompt.
It's like saying Reddit trades off one comment for another; yeah, an algorithm they wrote does that.
This article seems to allude to the idea that there is a ghost in the machine, and while there is a lot of emergent behavior rather than hard-coded algorithms, it's not like the LLM has an opinion, or some sort of psychology- or personality-based values.
They could change the system prompt, bias some training, and have completely different outcomes.
mrnegrito
[dead]