
Structured Outputs on the Claude Developer Platform (API)

jmathai

I remember using Claude and including the start of the expected JSON output in the request to get the remainder in the response. I couldn't believe that was an actual recommendation from the company to get structured responses.

Like, you'd end your prompt like this: 'Provide the response in JSON: {"data":'
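For anyone who hasn't seen it, the prefill trick looks roughly like this with the Anthropic Python SDK (model name and prompt here are just placeholders):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Prefill the assistant turn so the model continues the JSON
    # instead of opening with prose like "Sure, here is your JSON:".
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Summarize this article as JSON under a 'data' key: ..."},
            {"role": "assistant", "content": '{"data":'},  # the prefill
        ],
    )

    # Stitch the prefill back onto the completion before parsing.
    raw = '{"data":' + response.content[0].text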

jascha_eng

I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.

I built a customized deep-research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON and become the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
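The retry loop itself is trivial, it's just boilerplate you end up carrying in every step. Roughly this (sketch only; call_llm stands in for whatever client you actually use):

    import json

    def call_llm(prompt: str) -> str:
        """Placeholder for whatever model client the pipeline uses."""
        raise NotImplementedError

    def step_json(prompt: str, retries: int = 3) -> dict:
        """Call the model and retry until the output parses as JSON."""
        last_error = None
        for _ in range(retries):
            raw = call_llm(prompt)
            try:
                return json.loads(raw)
            except json.JSONDecodeError as err:
                last_error = err  # malformed output, try again
        raise RuntimeError(f"no valid JSON after {retries} attempts: {last_error}")

    # Each step's parsed output becomes the next step's input:
    # findings = step_json("Find X and return {\"urls\": [...]}")
    # summary  = step_json(f"Summarize these sources as JSON: {findings['urls']}")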

simonw

Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.

I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
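The shape of the trick, for reference (a sketch against the Anthropic Python SDK; the schema and model name are illustrative):

    import anthropic

    client = anthropic.Anthropic()

    # Define a "tool" whose input schema is really just the output schema you want.
    record_tool = {
        "name": "record_summary",
        "description": "Record the structured summary of the document.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "tags": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "tags"],
        },
    }

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder
        max_tokens=512,
        tools=[record_tool],
        tool_choice={"type": "tool", "name": "record_summary"},  # always use this tool
        messages=[{"role": "user", "content": "Summarize this document: ..."}],
    )

    # The "tool call" arguments are your structured output.
    structured = next(b.input for b in response.content if b.type == "tool_use")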

fnordsensei

Same, but it’s a PITA when you also want to support tool calling at the same time. Had to do a double call: call and check if it will use tools. If not, call again and force the use of the (now injected) return schema tool.
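Something like this, in sketch form (continuing the forced-tool idea above; the names are made up):

    def ask_structured(client, conversation, real_tools, schema_tool):
        """Two-phase call: let the model use its real tools first; if it
        doesn't want any, call again with the schema tool injected and forced."""
        first = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder
            max_tokens=512,
            tools=real_tools,
            messages=conversation,
        )
        if first.stop_reason == "tool_use":
            return first  # caller handles the genuine tool call
        return client.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=512,
            tools=real_tools + [schema_tool],
            tool_choice={"type": "tool", "name": schema_tool["name"]},
            messages=conversation,
        )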

veonik

I have had fairly bad luck specifying the JSONSchema for my structured outputs with Gemini. It seems like describing the schema with natural language descriptions works much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?

sails

Agree, it feels so fundamental. Any idea why? Gemini has also had it for a long time

igor47

Curious if they're planning to support more complicated schemas. They claim to support JSON Schema, but I found it only accepts flat schemas and not, for example, unions or discriminated unions. I've had to flatten some of my schemas to be able to define tools for them.
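Roughly the kind of rewrite that forces (schemas here are made up):

    # What you'd like to send: a discriminated union.
    event_union = {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "kind": {"const": "click"},
                    "x": {"type": "integer"},
                    "y": {"type": "integer"},
                },
                "required": ["kind", "x", "y"],
            },
            {
                "type": "object",
                "properties": {"kind": {"const": "keypress"}, "key": {"type": "string"}},
                "required": ["kind", "key"],
            },
        ]
    }

    # What actually gets accepted: one flat object where everything except the
    # discriminator is optional, re-validated on your side afterwards.
    event_flat = {
        "type": "object",
        "properties": {
            "kind": {"type": "string", "enum": ["click", "keypress"]},
            "x": {"type": "integer"},
            "y": {"type": "integer"},
            "key": {"type": "string"},
        },
        "required": ["kind"],
    }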

huevosabio

Whoa, I always thought that tool use was Anthropic's way of doing structured outputs. Can't believe they're only supporting this now.

barefootford

I switched from structured outputs on the OpenAI APIs to unstructured on Claude (Haiku 4.5) and haven't had any issues (yet). But guarantees are always nice.

mkagenius

I always wondered how they achieved this - is it just retries while generating tokens, where they retry as soon as they find a mismatch? Or is the model itself trained extremely well in this version of 4.5?

simonw

They're using the same trick OpenAI have been using for a while: they compile a grammar and then have it running as part of token inference, such that only tokens that fit the grammar can be selected as the next token.

This trick has also been in llama.cpp for a couple of years: https://til.simonwillison.net/llms/llama-cpp-python-grammars
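Roughly what that looks like with llama-cpp-python (the grammar and model path here are just illustrative):

    from llama_cpp import Llama, LlamaGrammar

    # A tiny GBNF grammar: the output must be a JSON object with a single
    # string-valued "name" key.
    gbnf = r'''
    root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
    string ::= "\"" [^"]* "\""
    ws     ::= [ \t\n]*
    '''

    grammar = LlamaGrammar.from_string(gbnf)
    llm = Llama(model_path="./model.gguf")  # placeholder model path

    out = llm(
        "Return the speaker's name as JSON.",
        grammar=grammar,  # tokens that would violate the grammar are masked out
        max_tokens=64,
    )
    print(out["choices"][0]["text"])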

huevosabio

Yeah, and now there are mature OSS solutions like outlines and xgrammar, which makes it even weirder that Anthropic is only supporting this now.

Kuinox

Inference doesn't return a single token, it returns probabilities for all tokens. You just select from the tokens that are allowed according to the compiled grammar.
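Something like this, in toy form (names and numbers invented):

    import math

    def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
        """Toy constrained decoding step: mask every token the grammar forbids
        to -inf, then greedily take the most likely remaining token."""
        masked = {tok: (score if tok in allowed else -math.inf)
                  for tok, score in logits.items()}
        return max(masked, key=masked.get)

    # The model "wants" to start with prose, but the grammar only allows '{' or whitespace.
    logits = {"Sure": 3.1, "Here": 2.4, "{": 1.7, " ": 0.2}
    print(constrained_pick(logits, allowed={"{", " "}))  # -> "{"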

mkagenius

Hmm, wouldn't it sacrifice a better answer in some cases (not sure how many though)?

I'd be surprised if they hadn't also specifically trained for structured, "correct" output here, in addition to picking the next token that follows the structure.

mirekrusin

Sampling is already constrained with temperature, top_k, top_p, top_a, typical_p, min_p, entropy_penalty, smoothing, etc. – filtering tokens to the ones valid according to the grammar is just another constraint on top. It makes sense and can be used for producing programming-language output as well – what's the point in bothering to generate output that's known up front to be invalid? Better to filter it out and allow only valid completions.

tdfirth

In my experience (I've put hundreds of billions of tokens through structured outputs over the last 18 months), I think the answer is yes, but only in edge cases.

It generally happens when the grammar is highly constrained, for example if a boolean is expected next.

If the model assigns a low probability to both true and false coming next, then the sampling strategy will pick whichever one happens to score highest. Most tokens have very similar probabilities close to 0 most of the time, and if you're picking between two of these then the result will often feel random.

It's always the result of a bad prompt, though. If you improve the prompt so that the model understands the task better, there will be a clear difference in the scores the tokens get, and the output will seem less random.
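A toy illustration of the boolean case (numbers invented):

    import math

    # Toy next-token logits: the model really wants to hedge in prose,
    # but the grammar only allows "true" or "false" next.
    logits = {"I": 5.0, "It": 4.2, "true": 0.31, "false": 0.29}

    allowed = {"true", "false"}
    masked = {t: v for t, v in logits.items() if t in allowed}

    # Renormalize over the allowed tokens: 0.31 vs 0.29 in logit space is a
    # near coin flip, yet greedy decoding will always answer "true".
    z = sum(math.exp(v) for v in masked.values())
    probs = {t: math.exp(v) / z for t, v in masked.items()}
    print(probs)  # ~{'true': 0.505, 'false': 0.495}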

Kuinox

The "better answer" wouldn't have respected the schema in this case.

luke_walsh

makes sense