
Evals in 2025: benchmarks to build models people can use

aplassard

I think cost should also be a direct consideration. Model performance on benchmarks varies wildly depending on the budget it's given. https://substack.com/@andrewplassard/note/p-173487568?r=2fqo...

elemeno

I've been building a tool to help with this - Safety Evals In-a-Box [https://github.com/elemeno/seibox]. It's a work in progress and not quite ready for public release, but it's a multi-model eval runner (primarily for safety-oriented evals, though there's no reason it can't run other types as well!) and it includes cost and latency in its reporting.
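For anyone curious what "including cost and latency in the reporting" can look like in practice, here's a minimal sketch. All names, the pricing table, and the fake model are illustrative assumptions, not seibox's actual API:

```python
import time

# Assumed per-model pricing (USD per 1K tokens) -- illustrative numbers only.
PRICE_PER_1K_TOKENS = {"model-a": 0.002, "model-b": 0.010}

def run_eval(model_name, cases, call_model):
    """Run each eval case against one model, recording pass/fail,
    wall-clock latency, and estimated cost per case."""
    results = []
    for case in cases:
        start = time.perf_counter()
        output, tokens_used = call_model(model_name, case["prompt"])
        latency = time.perf_counter() - start
        results.append({
            "model": model_name,
            "case": case["id"],
            "passed": case["check"](output),
            "latency_s": latency,
            "cost_usd": tokens_used / 1000 * PRICE_PER_1K_TOKENS[model_name],
        })
    return results

def summarize(results):
    """Aggregate a result list into the headline report numbers."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "mean_latency_s": sum(r["latency_s"] for r in results) / n,
    }

# Stand-in model for demonstration: refuses prompts flagged unsafe.
def fake_model(name, prompt):
    reply = "I can't help with that." if "unsafe" in prompt else "Sure: ..."
    return reply, len(prompt.split()) + 8  # crude token estimate

cases = [
    {"id": "safe-1", "prompt": "summarize this text",
     "check": lambda o: o.startswith("Sure")},
    {"id": "unsafe-1", "prompt": "unsafe request",
     "check": lambda o: "can't" in o},
]

summary = summarize(run_eval("model-a", cases, fake_model))
```

The point is just that per-case timing and token accounting are cheap to collect at call time, so quality, cost, and latency can come out of the same run rather than separate benchmarks.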