
Evals in 2025: benchmarks to build models people can use

aplassard

I think cost should also be a direct consideration. Model performance on benchmarks varies wildly depending on the budget it's given. https://substack.com/@andrewplassard/note/p-173487568?r=2fqo...

elemeno

I've been building a tool to help with this - Safety Evals In-a-Box [https://github.com/elemeno/seibox]. It's a work in progress and not quite ready for public release, but it's a multi-model eval runner (primarily for safety-oriented evals, though there's no reason it can't run other types as well!) and it includes cost and latency in its reporting.
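For anyone curious what "including cost and latency in the reporting" can look like in practice, here's a minimal sketch. All names, the pricing table, and the fake model are illustrative assumptions, not seibox's actual API:

```python
import time

# Assumed per-model pricing (USD per 1K tokens) -- illustrative numbers only.
PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.010}

def run_eval(model_name, cases, call_model):
    """Run each eval case against one model, recording pass/fail,
    wall-clock latency, and estimated cost per case."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output, tokens_used = call_model(model_name, case["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "model": model_name,
            "case": case["id"],
            "passed": case["check"](output),
            "latency_s": latency,
            "cost_usd": tokens_used / 1000 * PRICE_PER_1K_TOKENS[model_name],
        })
    return results

def summarize(results):
    """Aggregate a result list into the headline report numbers."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
    }

# Stand-in model for demonstration: refuses prompts flagged unsafe.
def fake_model(name, prompt):
    reply = "I can't help with that." if "unsafe" in prompt else "Sure: ..."
    return reply, len(prompt.split()) + 8  # crude token estimate

cases = [
    {"id": "safe-1", "prompt": "summarize this text",
     "check": lambda o: o.startswith("Sure")},
    {"id": "unsafe-1", "prompt": "unsafe request",
     "check": lambda o: "can't" in o},
]

summary = summarize(run_eval("model-a", cases, fake_model))
```

The point is just that per-case timing and token accounting are cheap to collect at call time, so quality, cost, and latency can come out of the same run rather than separate benchmarks.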