
Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%

jari_mustonen

Here is a summary of the key improvements made (an illustrative sketch follows the list):

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation

    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations

    - Prerequisites: Explicit dependency checks before proceeding

2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters

    - Binary Decisions: Clear yes/no conditions instead of ambiguous language

    - Error Handling: Specific failure conditions and next steps

    - Verification Steps: "Recheck" instructions after each fix

3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes

    - Pattern Recognition: Common issue combinations and their solutions

    - Critical Reminders: Common AI mistakes section to prevent errors

4. Actionable Language

    - Removed verbose explanations mixed with instructions

    - Consolidated multiple documents' logic into single workflows 

    - Used imperative commands: "Check X", "If Y then Z"

    - Added immediate verification steps
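
As a purely illustrative sketch (hypothetical, not taken from the benchmark or the article) of what these changes look like in practice, here is a telecom-style troubleshooting instruction rewritten from prose into the decision-tree, binary-check style described above; the tool names check_status_bar, toggle_airplane_mode, get_sim_status, and enable_sim are made up for the example:

    Before (prose):
    "If the user reports no service, it could be airplane mode, a disabled
    SIM, or lack of coverage; consider checking each and advising the user."

    After (decision tree, binary checks, verification):
    1. Call check_status_bar.
       ├── airplane_mode == true  -> call toggle_airplane_mode(off), then recheck signal.
       └── airplane_mode == false -> go to step 2.
    2. Call get_sim_status.
       ├── sim == disabled -> call enable_sim, then recheck signal.
       └── sim == enabled  -> escalate: the user is likely outside the coverage area.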

dlojudice

I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.

blndrt

Thanks for the feedback, appreciate it! It makes a lot of sense - I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and that no one would care, but it seems people are really interested in them. Of course I'd be happy to share the details.

barrkel

Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
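
For what it's worth, a minimal sketch of that loop in Python, assuming the OpenAI SDK; the model name, file path, and rewrite instructions are placeholders, not what the article used:

    # Sketch: ask a stronger model to rewrite an existing agent/system prompt.
    # Assumes the openai package is installed and OPENAI_API_KEY is set.
    from openai import OpenAI

    client = OpenAI()

    original_prompt = open("agent_policy.md").read()  # placeholder path

    rewrite_instructions = (
        "Rewrite this agent policy for an LLM agent: numbered steps, explicit "
        "decision trees, binary conditions, exact tool names, and a verification "
        "step after each fix. Do not change any of the policy's actual rules."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model works
        messages=[
            {"role": "system", "content": rewrite_instructions},
            {"role": "user", "content": original_prompt},
        ],
    )

    print(response.choices[0].message.content)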

csoham

Really interesting. What did the original prompt look like? Perhaps the original prompt was not that good? I feel like the changes Claude suggested (except a couple, maybe) are already pretty well-known prompt engineering practices.

blndrt

Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course, these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework's methodology, I've written a separate article covering it (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

BrunoDCDO

I wonder whether the benchmark score could be improved even further by simply showing Claude the current hardest problems and asking it to improve the prompt, without including any specifics from those problems.

amelius

My take: we have no clue how this works, and the performance could just as easily drop again tomorrow.

grej

DSPy was ahead of its time and is still underutilized.

CuriouslyC

This sort of stuff is trodden ground, if this seems exciting to you check out DSPy.
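
For the curious, a minimal DSPy sketch of that idea: instead of hand-editing the prompt, you declare the task and let an optimizer tune it against a metric. The model name, examples, and metric below are placeholders; this is not how the benchmark was run.

    import dspy

    # Placeholder model; any LM supported by dspy.LM works (needs an API key).
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    # Declare the task as a signature; DSPy renders the actual prompt.
    agent = dspy.ChainOfThought("customer_issue -> resolution")

    # Tiny placeholder training set; a real one would come from benchmark tasks.
    trainset = [
        dspy.Example(
            customer_issue="No mobile data right after a flight.",
            resolution="Disable airplane mode, then re-enable mobile data.",
        ).with_inputs("customer_issue"),
        dspy.Example(
            customer_issue="Phone says the SIM is disabled.",
            resolution="Enable the SIM in settings, then recheck signal.",
        ).with_inputs("customer_issue"),
    ]

    def metric(example, pred, trace=None):
        # Placeholder metric: substring match against the expected resolution.
        return example.resolution.lower() in pred.resolution.lower()

    # Optimize against the metric instead of rewriting prompts by hand.
    # (With a larger dataset, dspy.MIPROv2 can also rewrite the instructions.)
    optimizer = dspy.BootstrapFewShot(metric=metric, max_bootstrapped_demos=2)
    optimized_agent = optimizer.compile(agent, trainset=trainset)

    print(optimized_agent(customer_issue="I have no bars on my phone.").resolution)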


moralestapia

No before/after prompt.

Into the trash it goes.