Training LLMs for Honesty via Confessions
2 comments
December 12, 2025

manarth
Humans might well benefit from this style of reward-shaping too.
I couldn't tell whether this also carries over into the primary model answer, or if the "honesty" improvements are confined to the digital confession booth.
torginus
I think this article once again assumes LLMs work like humans. Anthropic showed that LLMs don't understand their own thought processes: measured neural-net activations don't correspond to what the models say about how they arrived at a conclusion.
I don't think this magically grants them that ability; they'll just be more convincing at faking honesty.