About AI Evals

15 comments · July 1, 2025

calebkaiser

I'm biased in that I work on an open source project in this space, but I would strongly recommend starting with a free/open source platform for debugging/tracing, annotating, and building custom evals.

This niche of the field has come a very long way over just the last 12 months, and the tooling is so much better than it used to be. Trying to do this from scratch, beyond a "kinda sorta good enough for now" level, is a full-time engineering project in and of itself.

I'm a maintainer of Opik, but you have plenty of options in the space these days for whatever your particular needs are: https://github.com/comet-ml/opik
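
For a sense of what the tracing side looks like, here's a minimal sketch using the @track decorator (the exact API details are an assumption here; check the current docs):

    # Minimal tracing sketch -- double-check the Opik docs, the exact API
    # here is an assumption and may differ.
    from opik import track

    @track  # records this function's inputs/outputs as a trace you can annotate later
    def answer_question(question: str) -> str:
        # ... call your LLM of choice here (placeholder) ...
        return "some answer"

    answer_question("How do I reset my password?")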

afro88

Some great info, but I have to disagree with this:

> Q: How much time should I spend on model selection?

> Many developers fixate on model selection as the primary way to improve their LLM applications. Start with error analysis to understand your failure modes before considering model switching. As Hamel noted in office hours, “I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence. Does error analysis suggest that your model is the problem?”

If there's a clear jump in evals from one model to the next (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up your system pretty easily. Use the best models you can, if you can afford it.

simonw

I think the key part of that advice is the "without evidence" bit:

> I suggest not thinking of switching model as the main axes of how to improve your system off the bat without evidence.

If you try to fix problems by switching from e.g. Gemini 2.5 Flash to OpenAI o3, but you don't have any evals in place, how will you tell whether the model switch actually helped?
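
Even a tiny harness makes that comparison concrete. A rough sketch in Python, where the model callables and the exact-match scorer are placeholder stand-ins rather than anyone's real setup:

    # Rough sketch: run the same eval set against the current and candidate
    # models and compare aggregate scores. The model callables below are
    # placeholders for real API clients.
    from typing import Callable

    eval_set = [
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What is 2 + 2?", "expected": "4"},
    ]

    def exact_match(output: str, expected: str) -> float:
        # Toy scorer; real evals would use task-specific checks or an LLM judge.
        return 1.0 if expected.lower() in output.lower() else 0.0

    def run_eval(model: Callable[[str], str]) -> float:
        scores = [exact_match(model(c["input"]), c["expected"]) for c in eval_set]
        return sum(scores) / len(scores)

    current_model = lambda prompt: "Paris is the capital of France."  # placeholder
    candidate_model = lambda prompt: "The answer is 4."               # placeholder

    print("current:", run_eval(current_model))
    print("candidate:", run_eval(candidate_model))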

shrumm

The ‘with evidence’ part is key, as simonw said. One anecdote from evals at Cleric: it’s rare to see a new model do better on our evals than the current one. The reality is that you’ll have optimized prompts etc. for the current model.

Instead, if a new model does only marginally worse out of the box, that’s a strong signal that the new model is indeed better for our use case.

phillipcarter

> If there's a clear jump in evals from one model to the next (ie Gemini 2 to 2.5, or Claude 3.7 to 4)

How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but you don't have comprehensive enough evals for your own domain, so you simply don't know to check the things it's good at?

FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.

ndr

Quality can drop drastically even moving from Model N to N+1 from the same provider, let alone a different one.

You'll have to adjust a bunch of prompts and measure. And if you didn't have a baseline to begin with, good luck YOLOing your way out of it.

softwaredoug

I might disagree, as these models are pretty inscrutable, and behavior on your specific task can be dramatically different on a new/“better” model. Teams would do well to have the right evals in place to make this decision rather than get surprised.

Also, the “if you can afford it” part can be a fairly non-trivial decision in itself.

lumost

The vast majority of AI startups will fail for reasons other than model costs. If you crack your use case, model costs should fall exponentially.

smcleod

Yeah, totally agree. I see so many systems perform badly, only to find out they're using an older-generation model and that simply updating to the current model fixes many of their issues.

davedx

I've worked with LLMs for the better part of the last couple of years, including on evals, but I still don't understand a lot of what's being suggested. What exactly is a "custom annotation tool", and what is it annotating?

calebkaiser

Typically, you would collect a ton of execution traces from your production app. Annotating them can mean a lot of different things, but it often means some mixture of automated scoring and manual review. At the earliest stages, you're usually annotating common failure modes, so you can say things like "In 30% of failures, the retrieval component of our RAG app is grabbing irrelevant context" or "In 15% of cases, our chat agent misunderstood the user's query and did not ask clarifying questions."

You can then create datasets out of these traces, and use them to benchmark improvements you make to your application.
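
To make that concrete, here's a rough hand-rolled sketch of that workflow; the trace fields and failure-mode labels are purely illustrative placeholders:

    # Illustrative sketch only: hand-rolled trace annotation and dataset building.
    # The trace structure and failure-mode labels are hypothetical.
    from collections import Counter

    traces = [
        {"input": "What's your refund policy?", "retrieved": ["shipping FAQ"], "output": "..."},
        {"input": "Cancel my order", "retrieved": ["cancellation FAQ"], "output": "..."},
    ]

    # Manual review: attach a failure-mode label (or None if the trace looks fine).
    annotations = [
        {"trace": traces[0], "failure_mode": "irrelevant_retrieval"},
        {"trace": traces[1], "failure_mode": None},
    ]

    # Tally failure modes to see where to focus (e.g. "30% of failures are retrieval").
    failure_counts = Counter(a["failure_mode"] for a in annotations if a["failure_mode"])
    print(failure_counts)

    # Failing traces become a dataset for benchmarking future changes.
    eval_dataset = [a["trace"] for a in annotations if a["failure_mode"]]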

andybak

> About AI Evals

Maybe it's obvious to some, but I was hoping that page started off by explaining what the hell an AI Eval specifically is.

I can probably guess from context but I'd love to have some validation.

phren0logy

Here's another article by the same author with more background on AI Evals: https://hamel.dev/blog/posts/evals/

I've appreciated Hamel's thinking on this topic.
