Agent design is still hard
58 comments
· November 22, 2025
_pdp_
I started a company in this space about 2 years ago. We are doing fine. What we've learned so far is that a lot of these techniques are simply optimizations to tackle some deficiency in LLMs that is a problem "today". These are not going to be problems tomorrow because the technology will shift, as it has many times over the last 2 years.
So yeah, cool, caching and all of that... but give it a couple of months and a better technique will come out, or more capable models.
Many years ago, when disk encryption on AWS was not an option, my team and I had to spend 3 months coming up with a way to encrypt the disks, and to do it well, because at the time there was no standard way. It was very difficult, as it required pushing encrypted images (as far as I remember). Soon after we started, AWS introduced standard disk encryption that you could turn on by clicking a button. We wasted 3 months for nothing. We should have waited!
What I've learned from this is that oftentimes it is better to do absolutely nothing.
an0malous
> These are not going to be problems tomorrow because the technology will shift, as it has many times over the last 2 years.
What technology shifts have happened for LLMs in the last 2 years?
dcre
One example is that there used to be a whole complex apparatus around getting models to do chain of thought reasoning, e.g., LangChain. Now that is built in as reasoning and they are heavily trained to do it. Same with structured outputs and tool calls — you used to have to do a bunch of stuff to get models to produce valid JSON in the shape you want, now it’s built in and again, they are specifically trained around it. It used to be you would have to go find all relevant context up front and give it to the model. Now agent loops can dynamically figure out what they need and make the tool calls to retrieve it. Etc etc.
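To make the contrast concrete, here is roughly what the "now" side looks like — a minimal sketch using the OpenAI Python SDK; the schema itself is made up for illustration:

    # Structured output is now a first-class API parameter rather than
    # something you prompt-hack into the model.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Extract: 'Ada, 36, London'"}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "person",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "integer"},
                        "city": {"type": "string"},
                    },
                    "required": ["name", "age", "city"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(resp.choices[0].message.content)  # valid JSON matching the schema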
throwaway13337
I'm amazed at this question and the responses you're getting.
These last few years, I've noticed that the tone around AI on HN changes quite a bit by waking time zone.
EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.
It's really puzzling to me. This is the first time I've noticed such a disconnect in the community about what the reality of things is.
To answer your question personally, genAI has changed the way I code drastically about every 6 months in the last two years. The subtle capability differences change what sorts of problems I can offload. The tasks I can trust them with get larger and larger.
It started with better autocomplete, and now, well, agents are writing new features as I write this comment.
the_mitsuhiko
> EU waking hours have comments that seem disconnected from genAI. And, while the US hours show a lot of resistance, it's more fear than a feeling that the tools are worthless.
I don't think it's because the audience is different but because the moderators are asleep when Europeans are up. There are certain topics which don't really survive on the frontpage when moderators are active.
GiorgioG
Despite the latest and greatest models… I still see glaring logic errors in the code produced for anything beyond basic CRUD apps. They still make up fields that don't exist and assign nonsensical values to variables. I'll give you an example: in the code in question, Codex assigned the required field LoanAmount from a variable called assessedFeeAmount, simply because, as far as I can tell, it had no idea how to get the correct value from the current function/class.
nickphx
ai is useless. anyone claiming otherwise is dishonest
postalcoder
If we expand this to 3 years, the single biggest shift that totally changed LLM development is the increase in size of context windows from 4,000 to 16,000 to 128,000 to 256,000.
When we were at 4,000 and 16,000 context windows, a lot of effort was spent on nailing down text splitting, chunking, and reduction.
For all intents and purposes, the size of current context windows obviates all of that work.
What else changed?
- Multimodal LLMs: text extraction from PDFs was a major issue for RAG/document intelligence, and a lot of time was wasted trying to figure out custom text-extraction strategies for documents. Now you can just feed the image of a PDF page into an LLM and get back a better transcription.
- Reduced emphasis on vector search: people found that for most purposes, having an agent grep your documents is cheaper and better than a more complex RAG pipeline.
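The grep point is easy to demo. A sketch of the kind of tool you hand the agent instead of a vector pipeline (names and paths are illustrative):

    # No embeddings, no chunking, no index to keep in sync: the agent
    # calls this repeatedly with different patterns until it finds what
    # it needs, the way a human would.
    import re
    from pathlib import Path

    def grep_docs(pattern: str, root: str = "docs", max_hits: int = 20) -> str:
        hits = []
        for path in Path(root).rglob("*.md"):
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if re.search(pattern, line, re.IGNORECASE):
                    hits.append(f"{path}:{lineno}: {line.strip()}")
                    if len(hits) >= max_hits:
                        return "\n".join(hits)
        return "\n".join(hits) or "no matches"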
WA
Not the LLMs. The APIs got more capabilities such as tool/function calling, explicit caching etc.
dcre
It is the LLMs because they have to be RLed to be good at these things.
echelon
We started putting them in image and video models and now image and video models are insane.
I think the next period of high and rapid growth will be in media (image, video, sound, 3D), not text.
It's much harder to adapt LLMs to solving business use cases with text. Each problem is niche, you have to custom tailor the solution, and the tooling is crude.
The media use cases, by contrast, are low hanging fruit and result in 10,000x speedups and cost reductions almost immediately. The models are pure magic.
I think more companies would be wise to ignore text for now and focus on visual domain problems.
Nano Banana has so much more utility than agents. And there are so many low hanging fruit ways to make lots of money.
Don't sleep on image and video. That's where the growth salient is.
wild_egg
> Nano Banana has so much more utility than agents.
I am so far removed from multimedia spaces that I truly can't imagine a universe where this could be true. Agents have done incredible things for me and Nano Banana has been a cool gimmick for making memes.
Anyone have a use case for media models that'll expand my mind here?
gchamonlive
I think knowing when to do nothing means being able to evaluate whether the problem the team is tackling is essential or tangential to the core focus of the project, and whether the problem is new or has been around for a while with still no standard way to solve it.
gessha
Yeah, that will be the make-or-break moment: if it's too essential, it will be implemented, but if it's not, it may become a competitive advantage.
exe34
if we wait long enough, we just end up dead, so it turns out we didn't need to do anything at all. of course there's a balance: oftentimes starting out and growing up with the technology gives you background and experience that becomes an advantage when it hits escape velocity.
moinism
Amen. I've been seeing these agent SDKs come out left and right for a couple of years and thought it'd be a breeze to build an agent. I've now been trying to build one for ~3 weeks, and I've tried three different SDKs and a couple of architectures.
Here's what I found:
- Claude Code SDK (now called Agent SDK) is amazing, but I think they are still in the process of decoupling it from Claude Code, which is why a few things are weird. E.g., you can define a subagent programmatically, but not skills; skills have to be placed in the filesystem and then referenced in the plugin. Also, only Anthropic models are supported :(
- OpenAI SDK's tight coupling with their platform is a plus point, i.e., you get agent and tool-use traces by default in your dashboard, which you can later use for evaluation, distillation, or fine-tuning. But: 1. It's not easy to use a third-party model provider; their docs provide sample code, but it's not as easy as that. 2. They have agent handoffs (which work in some cases), but not subagents. You can use tools as subagents, though.
- Google Agent Kit doesn't provide a TypeScript SDK yet, so I didn't try it.
- Mastra, even though it looks pretty sweet, spins up a server for your agent, which you then use via a REST API. Umm... why?
- SmythOS SDK is the one I'm currently testing because it provides flexibility in terms of choosing the model provider and defining your own architecture (handoffs or subagents, etc.). It has its quirks, but I think it'll work for now.
Question: If you don't mind sharing, what is your current architecture? Agent -> SubAgents -> SubSubAgents? Linear? or a Planner-Executor?
I'll write a detailed post about my learnings from architectures (fingers crossed) soon.
blancm
Hello. Regarding Claude Code only supporting Anthropic models: you can actually use Claude Code Router (https://github.com/musistudio/claude-code-router) to use other models in Claude Code. I've been using it for a few weeks with open-source models and it works pretty well. You can even use "free" models from OpenRouter.
mountainriver
The frameworks are all pointless; just use AI assist to create agents in Python, or ideally a language with concurrency.
You will be happy you did
moinism
How do you deal with the different APIs/Tooluse schema in a custom build? As other people have mentioned, it's a bigger undertaking than it sounds.
postalcoder
I've been building agent type stuff for a couple years now and the best thing I did was build my own framework and abstractions that I know like the back of my hand.
I'd steer clear of any LLM abstraction. There are so many companies with open-source abstractions offering the panacea of a single interface that are crumbling under their own weight from the sheer futility of supporting every permutation of every SDK evolution, all while trying to build revenue-generating businesses on top of them.
sathish316
I agree with your analysis of building your own agent framework to keep some level of control and fewer abstractions. Agents at their core are about programmatically talking to an LLM and performing these basic operations:
1. Structured input and string interpolation in prompts
2. Structured output, and unmarshalling the string response into structured output (getting easier now that LLMs support structured output natively)
3. Tool registry/discovery (of MCP and function tools), tool calls, and response looping
4. Composability of tools
5. Some form of agent-to-agent delegation
I’ve had good luck with using PydanticAI which does these core operations well (due to the underlying Pydantic library), but still struggles with too many MCP servers/Tools and composability.
I've built an open-source agent framework called OpusAgents that makes creating agents, subagents, and tools simpler than MCP servers, without overloading the context. Check it out, and see the tutorials/demos for how it's more reliable than generic agents with MCP servers in Cursor/Claude Desktop: https://github.com/sathish316/opus_agents
It’s built on top of PydanticAI and FastMCP, so that all non-core operations of Agents are accessible when I need them later.
drittich
This sounds interesting. What about the agent behavior itself? How it decides how to come at a problem, what to show the user along the way, and how it decides when to stop? Are these things you have attempted to grapple with in your framework?
spacecadet
I also recommend this. I have tried all of the frameworks, and still deploy some for clients, but for my personal agents it's my own custom framework that is dead simple and very easy to spin up, extend, etc.
the_mitsuhiko
Author here. I'm with you on the abstractions part. I dumped a lot of my thoughts on this into a follow-up post: https://lucumr.pocoo.org/2025/11/22/llm-apis/
thierrydamiba
Excellent write-up. I've been thinking a lot about caching and agents, so this was right up my alley.
Have you experimented with using a semantic cache on the chain of thought (what we get back from the providers anyway) and sending that to a dumb model for similar queries, to "simulate" thinking?
NitpickLawyer
Yes, this is great advice. It also applies to interfaces. When we designed a support "chat bot", we went with a different architecture than what's already out there: we designed the system around "chat rooms", and the frontend just dumps messages into a chat room (with a session id). On the backend we can then do lots of things and incrementally add functionality, while the frontend doesn't have to keep up. We can also group messages, have "system" messages that other services can read, etc. It also feels more natural, as the client can type additional info while the server is working.
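In sketch form, that decoupling looks something like this (illustrative names, not the actual system):

    # The frontend only ever appends to a room; the backend and other
    # services consume at their own pace. "system" messages ride the
    # same channel as user/agent messages.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class RoomMessage:
        session_id: str
        sender: str  # "user" | "agent" | "system" | a service name
        body: str
        ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    class ChatRoom:
        def __init__(self) -> None:
            self.messages: list[RoomMessage] = []

        def post(self, msg: RoomMessage) -> None:
            self.messages.append(msg)  # fire-and-forget from the client

        def since(self, index: int) -> list[RoomMessage]:
            return self.messages[index:]  # consumers poll from their own cursor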
If you have to use some of the client side SDKs, another good idea is to have a proxy where you can also add functionality without having to change the frontend.
postalcoder
Creativity is an underrated hard part of building agents. The fun part of building right now is knowing how little of the design space for building agents has been explored.
spacecadet
This! I keep telling people that if tool use was not an aha moment for you relative to AI agents, then you need to be more creative...
_pdp_
This is a huge undertaking, though. Yes, it is quite simple to build some basic abstraction on top of openai.complete or similar, but that is like 1% of what an agent needs to do.
My bet is that agent frameworks and platforms will become more like game engines. You can spin up your own engine for sure, and it is fun and rewarding. But AAA studios will most likely decide to use a ready-to-go platform with all the batteries included.
postalcoder
In totality, yes. But you don't need every feature at once. You add to it once you hit boundaries. But I think the most important thing about this exercise is that you leave nothing to the imagination when building agents.
The non-deterministic nature of LLMs already makes the performance of agents so difficult to interpret. Building agents on top of code that you cannot mentally trace through leads to so much frustration when addressing model underperformance and failure.
It's hard to dispute that, after the dust settles, companies will default to batteries-included frameworks, but right now a lot of people I've seen have regretted adopting a large framework off the bat.
mritchie712
Some things we've[0] learned on agent design:
1. If your agent needs to write a lot of code, it's really hard to beat Claude Code (cc) / Agent SDK. We've tried many approaches and frameworks over the past 2 years (e.g. PydanticAI), but using cc is the first that has felt magic.
2. Vendor lock-in is a risk, but the bigger risk is having an agent that is less capable than what a user gets out of ChatGPT because you're hand-rolling every aspect of your agent.
3. cc is incredibly self aware. When you ask cc how to do something in cc, it instantly nails it. If you ask cc how to do something in framework xyz, it will take much more effort.
4. Give your agent a computer to use. We use e2b.dev, but Modal is great too. When the agent has a computer, it makes many complex features feel simple. (Rough sketch below.)
0 - For context, Definite (https://www.definite.app/) is a data platform with agents to operate it. It's like Heroku for data with a staff of AI data engineers and analysts.
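On point 4, a minimal sketch of what the "computer" tool boils down to — a local subprocess stand-in; in production you'd route this into an e2b or Modal sandbox instead:

    # Give the agent a shell. Locally this is just subprocess; swap the
    # body for a sandbox call (e2b, Modal, etc.) before letting real
    # agents run it.
    import subprocess

    def run_shell(cmd: str, timeout: int = 30) -> str:
        try:
            proc = subprocess.run(
                cmd, shell=True, capture_output=True, text=True, timeout=timeout
            )
            return (proc.stdout + proc.stderr)[-4000:]  # cap what goes into context
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout}s"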
faxmeyourcode
Point 2 is very often overlooked. Building products that are worse than the baseline chatgpt website is very common.
smcleod
It's quite worrying that several times in the last few months I've had to really drive home why people should not be building bespoke agentic systems that essentially act as half-baked versions of an agentic coding tool, when they could just use Claude Code and focus their efforts on creating value rather than instant technical debt.
CuriouslyC
You can pretty much completely reprogram agents just by passing them through a smart proxy. You don't need to rewrite Claude/Codex; just add context engineering and tool behaviors at the proxy layer.
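A minimal sketch of that proxy idea (FastAPI + httpx; the injected instruction and the upstream URL are illustrative, and streaming is elided):

    # Sit between the agent and the model API and rewrite requests in
    # flight: inject context, filter tools, post-process results.
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse

    app = FastAPI()
    UPSTREAM = "https://api.openai.com/v1/chat/completions"

    @app.post("/v1/chat/completions")
    async def proxy(request: Request):
        payload = await request.json()
        # context engineering happens here, invisibly to the agent
        payload.setdefault("messages", []).insert(
            0, {"role": "system", "content": "Prefer small, reviewable diffs."}
        )
        async with httpx.AsyncClient(timeout=120) as client:
            upstream = await client.post(
                UPSTREAM,
                json=payload,
                headers={"Authorization": request.headers.get("authorization", "")},
            )
        return JSONResponse(upstream.json(), status_code=upstream.status_code)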
mritchie712
yep, that's exactly where we've landed.
focus on the tools and context, let claude handle the execution.
CuriouslyC
Be careful about what you hand off to Claude versus another agent. Claude is a vibe project monster, but it will fail at hard things, come up with fake solutions, and then lie to you about them. To the point that it'll add random sleeps and do pointless work to cover up the fact that it's reward hacking. It's also very messy.
For brownfield work, hard problems, or big complex codebases, you'll save yourself a lot of pain if you use Codex instead of CC.
cedws
Going to hijack this to ask a question that’s been on my mind: does anybody know why there are no agentic tools that use tree-sitter to navigate code? It seems like it would be much more powerful than having the LLM grep for strings or rewrite entire files to change one line.
the_mitsuhiko
> does anybody know why there are no agentic tools that use tree-sitter to navigate code?
I would guess the biggest reason is that there is no RL happening on the base models with tree-sitter as a tool. But there is a lot of RL with bash, so the model knows how to grep. I did experiment with giving tree-sitter and ast-grep to agents, and in my experience the results are mixed.
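If you want to reproduce that experiment, the tool itself is a thin wrapper (this assumes the ast-grep CLI is on PATH; check its docs for the exact flags):

    # Structural search instead of textual search. Whether the model
    # uses it well is another question -- it was never RLed on this tool.
    import subprocess

    def ast_grep(pattern: str, lang: str, path: str = ".") -> str:
        proc = subprocess.run(
            ["ast-grep", "run", "--pattern", pattern, "--lang", lang, path],
            capture_output=True, text=True, timeout=60,
        )
        return proc.stdout[:4000] or proc.stderr[:1000]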
postalcoder
This doesn't fully satisfy your question, but it comes straight from bcherny (claude code dev):
> Claude Code doesn't use RAG currently. In our testing we found that agentic search out-performed RAG for the kinds of things people use Code for.
source thread: https://news.ycombinator.com/item?id=43163011#43164253
cedws
Thanks, I was also wondering why they don't use RAG.
the_mitsuhiko
They are using RAG; grep is also just RAG. The better question is why they don't use a vector database, and honestly the answer is that these things are incredibly hard to keep in sync. And if they are not perfectly in sync, the utility drops quickly.
postalcoder
I forgot to mention that Aider uses a tree-sitter approach. It's blazing fast and great for quick changes. It's not the greatest agent for heavy edits, but I don't think that has to do with it not using grep.
spacecadet
Create an agent with these tools and you'll have one. Agent tools are just functions, but think of them like skills. The more knowledge and skills your agents have access to (and I'd recommend more than one agent), the more useful they become.
lvl155
Design is hard because models change almost on a weekly basis. Quality abruptly falls off or drastically changes without announcement. It’s like building a house without a proper foundation. I don’t know how many tokens I’ve wasted because of this; I’d estimate 30% of my cost and resources this year.
CuriouslyC
The 'Reinforcement in the Agent Loop' section is a big deal. I use this pattern to enable async/event-steered agents; it's super powerful. In long contexts you can use it to re-inject key points ("reminders"), etc.
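The pattern is tiny in code; a sketch (the cadence and the reminder text are illustrative):

    # Every N turns, re-append the key constraints so they stay in the
    # recent window instead of decaying at the top of a long context.
    REMINDER = {
        "role": "user",
        "content": "[reminder] Goal: migrate the auth module. Do not touch unrelated files.",
    }

    def maybe_reinforce(messages: list[dict], turn: int, every: int = 8) -> list[dict]:
        if turn > 0 and turn % every == 0:
            messages = messages + [REMINDER]
        return messages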
Yokohiii
Most of it reads like too-high expectations of an overhyped technology.
srameshc
I still feel there is no sure-shot way to build an abstraction yet. That's probably why Lovable decided to build on Gemini rather than offering a choice of models. On the other hand, I like the Pydantic AI framework, and I've got a decent working solution where I stick with cheaper models by default and only use expensive ones where the failure rate is too high.
_pdp_
This is true. Based on my experience with real customers, they really don't know the difference between the various models.
What they want is to get things done. The model is simply a means to an end. As long as the task is completed, everything else is secondary.
morkalork
I feel like there's always that "one guy" who has a preference and asks for an option to choose, and it's a way for developers to offload the decision. You can't be blamed for the wrong choice if you put that decision on the user.
Tbh I agree with your approach and use it when building stuff for myself. I'm so over yak shaving that topic.
mitchellh
This is why I use the agent I use. I won't name the company, because I don't want people to think I'm a shill for them (I've already been accused of it before, but I'm just a happy, excited customer). But it's an agentic coding company that isn't associated with any of the big model providers.
I don't want to keep up with all the new model releases. I don't want to read every model card. I don't want to feel pressured to update immediately (if it's better). I don't want to run evals. I don't want to think about when different models are better for different scenarios. I don't want to build obvious/common subagents. I don't want to manage N > 1 billing entities.
I just want to work.
Paying an agentic coding company to do this makes perfect sense for me.
havkom
My tip is: don’t use SDKs for agents. Use a while loop and craft your own JSON; handle context size and faults yourself. In practice you will need this level of control if you are doing anything non-trivial.
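That while loop in miniature (OpenAI-style tool calling; `dispatch` and the truncation policy are stand-ins you'd replace with your own):

    # Call the model, run any requested tools, retry transient faults,
    # and clip history when it grows. Everything is visible and yours.
    import json, time
    from openai import OpenAI

    client = OpenAI()

    def run_agent(messages: list[dict], tools: list[dict], dispatch) -> str:
        while True:
            resp = None
            for attempt in range(3):  # fault handling: retry with backoff
                try:
                    resp = client.chat.completions.create(
                        model="gpt-4o-mini", messages=messages, tools=tools
                    )
                    break
                except Exception:
                    time.sleep(2 ** attempt)
            if resp is None:
                raise RuntimeError("model call failed after retries")
            msg = resp.choices[0].message
            messages.append(msg.model_dump(exclude_none=True))
            if not msg.tool_calls:
                return msg.content  # model is done talking
            for call in msg.tool_calls:
                result = dispatch(call.function.name, json.loads(call.function.arguments))
                messages.append(
                    {"role": "tool", "tool_call_id": call.id, "content": str(result)}
                )
            if len(messages) > 80:  # crude context-size control
                messages[:] = messages[:1] + messages[-40:]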
Fiveplus
I liked reading this, but I have a silly question, as I'm a noob at these things. If explicit caching is better, does that mean the agent just forgets stuff unless we manually save its notes? Are these things really that forgetful? Also, why is there a virtual file system? Is the agent basically just running around a tiny digital desktop looking for its files? Why can't the agent just know where the data is? I'm sorry if these are juvenile questions.
the_mitsuhiko
> If explicit caching is better, does that mean the agent is just forgetting stuff unless we manually save its notes?
Caching is unrelated to memory; it's about not doing the same work over and over again due to the distributed nature of state. I wrote a post that goes into this from first principles, with my current thoughts on the topic [1].
> Are these things really that forgetful?
No, not really. They can get side-tracked which is why most agents do a form of reinforcement in-context.
> Why is there a virtual file system?
So that you don't have dead-end tools. If a tool creates or manipulates state (which we represent on a virtual file system), another tool needs to be able to pick up the work.
> Why can't the agent just know where the data is?
And where would that data be?
pjm331
You might be confusing caching and context windows. Caching is mainly about keeping inference costs down.
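For reference, explicit caching is just an annotation on the stable prefix — Anthropic-style cache_control shown here; the prompt text is a placeholder:

    # Mark the big, unchanging prefix as cacheable; on cache hits it is
    # billed at a reduced rate instead of being reprocessed every turn.
    import anthropic

    BIG_STATIC_PROMPT = "...long, stable system prompt: instructions, tools, docs..."

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": BIG_STATIC_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "the next user turn"}],
    )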