
Structured Outputs in LLMs

20 comments · September 23, 2025

iandanforth

This is a great writeup! There was a period when reliable structured output was a significant differentiator and the 'secret sauce' behind some companies' success. An NL->SQL company I am familiar with comes to mind. Nice to see this both public and supported by a growing ecosystem of libraries.

One statement that surprised me was that the author thinks "models over time will just be able to output JSON perfectly without the need for constraining over time."

I'm not sure how this conclusion was reached. "Perfectly" is a bar that probabilistic sampling cannot meet.

electroglyph

If the current position in the structure only has one possibility (like a comma, bracket, etc.) do you just force that as the next token and continue?

maccam912

I don't think so, because multiple tokens might match. If it needs a comma as the next character but you have tokens for `, "blah` and `, "foo`, you still want to leave those on the table.
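
A minimal sketch of that filtering idea (the tiny vocabulary and the `is_valid_prefix` helper are made up for illustration, not any particular library's API):

```python
# Keep every vocabulary entry whose decoded text is a valid continuation of
# the grammar, rather than forcing the single-character "," token.
def allowed_token_ids(vocab, is_valid_prefix):
    # vocab: {token_id: token_string}
    # is_valid_prefix: checks whether appending this string keeps the output parseable
    return [tid for tid, text in vocab.items() if is_valid_prefix(text)]

# The grammar expects a comma next; multi-character tokens that *start* with
# a comma (like ', "' or ', "foo') must stay on the table.
vocab = {0: ",", 1: ', "', 2: ', "foo', 3: "}"}
print(allowed_token_ids(vocab, lambda text: text.startswith(",")))  # -> [0, 1, 2]
```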

frotaur

When doing structured sampling, why is the token sampled, checked against the grammar, and then resampled with the mask applied if it's invalid?

Why wouldn't we apply the mask immediately for the first sampling? Is this some kind of optimization? Is masking expensive?

2THFairy

Implementation preference.

> is masking expensive?

It's not expensive per se; it's a single element-wise multiplication of the output vector.

The real "expense" is that you need to prepare masks for every element of your grammar ahead of time, as they are too expensive to recompute on the fly; LLM tokens do not cleanly map onto elements of your grammar. (Consider JSON: LLM tokens often combine several special characters such as curly braces, colons, and quotes.)

This isn't that hard to compute; it's just more work to implement.
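
A minimal sketch of that setup, assuming precomputed 0/1 masks keyed by grammar state (the states and vocabulary size here are invented):

```python
import numpy as np

VOCAB_SIZE = 8

# Precomputed offline: one allow-mask per grammar state. Building these is the
# expensive part, because tokens don't map cleanly onto grammar elements.
MASKS = {
    "expect_comma": np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=np.float64),
    "expect_value": np.array([0, 0, 0, 1, 1, 1, 1, 0], dtype=np.float64),
}

def constrained_sample(probs, state, rng):
    masked = probs * MASKS[state]   # the single element-wise multiplication
    masked /= masked.sum()          # renormalize over the allowed tokens
    return rng.choice(len(masked), p=masked)

rng = np.random.default_rng(0)
probs = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)  # stand-in for the softmax output
print(constrained_sample(probs, "expect_comma", rng))
```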

myflash13

Hmm, so if structured output affects the quality of the response, maybe it's better to convert the output to a structured format as a post-processing step?

NitpickLawyer

It's a tradeoff between getting "good enough" performance with guided/constrained generation and using 2x the calls to do the same task. Sometimes it works; sometimes it's better to have a separate model. One good case for two calls is the "code merging" approach, where you "chat" with a model, giving it a source file plus some instruction, and if it replies with something like `... //unchanged code here ... some new code ... //the rest stays the same`, you can use a code-merging model to apply the changes. But that's been made somewhat obsolete by the new "agentic" capabilities, where models learn how to diff files directly.

BoredPositron

Haiku is my favorite model for the second pass. It's small, cheap, and usually gets it right. If I see hallucinations, they are mostly from the base model in the first pass.

amelius

This constrains the output of the LLM to some grammar.

However, why not use a grammar that does not have invalid sentences, and from there convert to any grammar that you want?

cyptus

What if the converted version is not in the desired syntax?

NitpickLawyer

Constrained generation guarantees syntax. It does not guarantee semantic correctness tho. Imagine you want a json object with "hp" and "damage". If you use a grammar, the model will be forced to output a json object with those two values. But it's not guaranteed to get sensible values.

With a 2nd pass you basically "condition" it on the text right above, hoping to get better semantic understanding.

lyu07282

I'm pretty sure the grammar is generated from the JSON schema; it doesn't just constrain JSON syntax, it constrains output to the schema (including enums and such). The schema is also given to the model (at least with OpenAI), and you can put instructions in the JSON schema as well that will be taken into account.
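
For example, with a schema like this one (field names invented), the enum gets enforced by the grammar while the description is only seen as text by the model, as I understand it:

```python
import json

# Hypothetical schema: "hp"/"damage"/"element" are illustrative field names.
schema = {
    "type": "object",
    "properties": {
        "hp": {"type": "integer", "description": "Hit points, roughly 1-100"},
        "damage": {"type": "integer"},
        "element": {"type": "string", "enum": ["fire", "ice", "poison"]},
    },
    "required": ["hp", "damage", "element"],
    "additionalProperties": False,
}
print(json.dumps(schema, indent=2))
```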

CuriouslyC

Just wait till people realize that if you have agents speak in structured output rather than chatting with you, your observability and ability to finely program your agent go through the roof.

k__

Sounds like brute force to me.

thrance

It's still baffling to me that the various API providers don't let us upload our custom grammars. It would enable so many use cases, like HTML generation for example, at essentially no cost on their part.

barrkel

Using grammar-constrained output in llama.cpp - which has been available for ages and which I think is a different implementation from the one described here - does slow down generation quite a bit. I expect it has a naive implementation.

As to why providers don't give you a nice API, maybe it's hard to implement efficiently.

It's not too bad if inference is happening token by token and reverting to the CPU every time, but I understand high-performance LLM inference uses speculative decoding, with a smaller model guessing multiple tokens in advance and the main model doing verification. Doing grammar constraints across multiple tokens is tougher; there's an exponential number of states that need precomputing.

So you'd need to think about putting the parser automaton onto the GPU/TPU and using it during inference without stalling the pipeline by going back to the CPU.

And then you start thinking about how big that automaton is going to be: how many states, the pushdown stack. You're basically taking code from the API call and running it on your hardware. There are dragons here, around fair use, denial of service, etc.
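
For reference, grammar-constrained llama.cpp use looks roughly like this through the Python bindings, as far as I know (the model path is a placeholder and the exact API varies between versions):

```python
from llama_cpp import Llama, LlamaGrammar

# GBNF grammar for a tiny fixed-shape JSON object with "hp" and "damage".
GBNF = r'''
root   ::= "{" ws "\"hp\"" ws ":" ws number "," ws "\"damage\"" ws ":" ws number ws "}"
number ::= [0-9]+
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="model.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(GBNF)
out = llm("Stats for a goblin, as JSON:", grammar=grammar, max_tokens=64)
print(out["choices"][0]["text"])
```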

bubblyworld

Wouldn't that have implications for inference batching, since you would have to track state and apply a different mask for each sequence in the batch? If so, I think it would directly affect utilisation and hence costs. But I could be talking out of my ass here.

esafak

When you say custom grammar, do you mean something other than a JSON schema, because they support that?
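
For reference, the hosted JSON-schema support looks roughly like this, as far as I know (model name and schema fields are placeholders):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Give me a monster as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "monster",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "hp": {"type": "integer"},
                    "damage": {"type": "integer"},
                },
                "required": ["hp", "damage"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```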

SamLeBarbare

This post dives into that "black magic" layer, especially in the context of emerging thinking models and tools like Ollama or GPT-OSS. It’s a thoughtful look at why sampling, formatting, and standardization are not just implementation details, but core to the future of working with LLMs.

electroglyph

I don't know if you're purposely trying to be funny, but this is obnoxious, lol