Training LLMs for Honesty via Confessions
2 comments
December 12, 2025

manarth
Humans might well benefit from this style of reward-shaping too.
I couldn't tell whether this also carries over into the primary model answer, or if the "honesty" improvements are confined to the digital confession booth.
torginus
I think this article once again assumes LLMs work like humans. Anthropic showed that LLMs don't understand their own thought processes: measured neural-net activations don't correspond to what the models say about how they arrived at a conclusion.
I don't think this magically grants them that ability; they'll just be more convincing at faking honesty.