
Training LLMs for Honesty via Confessions

manarth

    > "dishonesty may arise due to the effects of reinforcement learning (RL), where challenges with reward shaping can result in a training process that inadvertently incentivizes the model to lie or misrepresent its actions"
    > "As long as the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than covering it up, this incentivizes models to be honest"
Humans might well benefit from this style of reward-shaping too.
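To make the incentive concrete, here's a minimal Python sketch of how a confession reward might be folded into the overall RL reward. Everything here (shaped_reward, grade_answer, grade_confession, LAMBDA) is an illustrative assumption, not the paper's actual implementation:

    from typing import Callable

    LAMBDA = 0.5  # assumed weight on the confession term

    def shaped_reward(
        answer: str,
        confession: str,
        grade_answer: Callable[[str], float],           # task success in [0, 1]
        grade_confession: Callable[[str, str], float],  # confession honesty in [0, 1]
    ) -> float:
        """Combine the task reward with a confession-honesty bonus."""
        task_reward = grade_answer(answer)
        honesty_reward = grade_confession(confession, answer)
        # The key property: honestly surfacing misbehavior must recover
        # more reward than a cover-up preserves, so confessing stays the
        # "path of least resistance".
        return task_reward + LAMBDA * honesty_reward

The design choice that matters is the weighting: if LAMBDA is too small relative to the task reward, covering up a shortcoming can still pay off on net.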

    > "We find that when the model lies or omits shortcomings in its "main" answer, it often confesses to these behaviors honestly, and this confession honesty modestly improves with training."
I couldn't tell whether this improvement also shows up in the primary model answer, or whether the "honesty" gains are confined to the digital confession booth?
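Roughly, answering that would mean grading the main answer and the confession with the same honesty judge across training checkpoints and checking whether the two rates move together. A hypothetical harness (the transcript format and the is_honest judge are assumed, not from the paper):

    def honesty_rates(transcripts, is_honest):
        # transcripts: iterable of (main_answer, confession) pairs
        # is_honest:   judge returning True if the text truthfully
        #              reports shortcomings (e.g. an LLM judge; assumed)
        answers, confessions = zip(*transcripts)
        main_rate = sum(map(is_honest, answers)) / len(answers)
        confession_rate = sum(map(is_honest, confessions)) / len(confessions)
        return main_rate, confession_rate

If confession_rate rises over training while main_rate stays flat, the honesty gains are confined to the confession channel.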

torginus

I think this article once again assumes LLMs work like humans. Anthropic showed that LLMs don't understand their own thought processes: the reasoning recovered from their neural-net activations doesn't correspond to what they say about how they arrived at a conclusion.

I don't think this training magically grants them that ability; they'll just become more convincing at faking honesty.