DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking


3 comments

·September 11, 2025

four_fifths

If you do a bit of digging into most of the popular benchmarks that all the big labs report on, you'll see pretty quickly that they have almost zero correlation with any real-world tasks.

The approach they're taking here, working backwards from an open-source repo pull request and reverse-engineering a question, is unusually well thought out for a benchmark.

I haven't dug into more of the dataset questions yet, but the example they give in the blog post, a question generated for Hugging Face's Transformers repo, gives me hope that this could actually be a solid benchmark:

> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?

qsort

I particularly like their usage of LLM-as-a-judge. They don't go "hey chatgpt, sort these from best to worst based on vibes"; rather, they extract a set of ground truths and check how the answer compares against them, a task that SOTA LLMs can do fairly reliably. It's a very smart way to circumvent the problems introduced by pure LLM-as-a-judge methods.
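
To make the idea above concrete, here is a minimal sketch of fact-based judging: each extracted ground truth is checked against the answer separately, and the score is the fraction that the answer supports. Everything here is an assumption for illustration, not the benchmark's actual code: `fact_recall` and `keyword_judge` are hypothetical names, and `keyword_judge` is a toy word-matching stand-in for what would really be an LLM call.

```python
from typing import Callable, List


def fact_recall(
    answer: str,
    ground_truths: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Return the fraction of ground-truth facts the judge says the answer supports."""
    if not ground_truths:
        raise ValueError("need at least one ground-truth fact")
    hits = sum(1 for fact in ground_truths if judge(answer, fact))
    return hits / len(ground_truths)


def keyword_judge(answer: str, fact: str) -> bool:
    """Toy stand-in for an LLM judge (assumption): a fact counts as
    supported only if every word of it appears in the answer."""
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in fact.lower().split())


if __name__ == "__main__":
    score = fact_recall(
        "the base class copies defaults per instance",
        ["copies defaults", "uses locks"],
        keyword_judge,
    )
    print(score)
```

The point of the design is that the judge only answers a narrow yes/no question per fact, rather than ranking whole answers by overall impression.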

blazarquasar

[flagged]