Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%
12 comments
·September 17, 2025
dlojudice
I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.
blndrt
Thanks for the feedback, appreciate it! It makes a lot of sense - I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested in them. Of course I'd be happy to share the details.
barrkel
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
csoham
Really interesting. What did the original prompt look like? Perhaps the original prompt was not that good? I feel like the changes Claude suggested (except a couple, maybe) are already pretty well-known prompt engineering practices.
blndrt
Thank you for the feedback!
In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...
Of course these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.
In case someone is not familiar with the framework's methodology, I've written a separate article covering that (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...
BrunoDCDO
I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt without including any specifics related to the problems.
amelius
My take: we have no clue how this works and the performance can be down tomorrow just as well.
grej
DSPy was ahead of its time and still underutilized.
CuriouslyC
This sort of stuff is well-trodden ground; if this seems exciting to you, check out DSPy.
moralestapia
No before/after prompt.
Into the trash it goes.
Here is the summary of key improvements made:
1. Structure & Flow
2. AI Agent Optimizations
3. Cognitive Load Reduction
4. Actionable Language