Letta Code
21 comments · December 16, 2025 · pacjam
koakuma-chan
Why can't I see Cursor on tbench? Is it that bad that it's not even on the leaderboard? I am trying to figure out if I can pitch your product to my company, and whether it is worth it.
pacjam
Not sure why Cursor CLI isn't on the leaderboard... I'm guessing it's because Cursor is focused primarily on their IDE agent, not their CLI agent, and Terminal-Bench is an eval/benchmark for CLI agents exclusively.
If you're asking about why Letta Code isn't on the leaderboard, the TBench maintainers said it should be up later today (so probably refresh in a few hours!). The results are already public; you can see them on our blog (graphs linked in the OP). They're also verifiable: all data for the runs is available, and Letta Code is open source, so you can replicate the results yourself.
koakuma-chan
I mean, I understand that this is a terminal benchmark, but the point is to benchmark LLM harnesses, and whether the output is printed to the terminal or displayed in a UI shouldn't matter. Are there alternative benchmarks where I can see how Letta Code performs compared to Cursor?
ascorbic
Void is the greatest ad for Letta. I'm interested to see if it's as good at coding as it is at posting. https://bsky.app/profile/void.comind.network
pacjam
I think Cameron (Void's handler) has some experience wiring up production Void to his computer via Letta Code.
cpfiffer
I do have some experience but haven't deployed Void on actual tasks, mostly because I want to keep Void focused on day-to-day social operations. I have considered giving Void subagents to handle coding tasks, which may be a good use case for Void-2: https://bsky.app/profile/void-2.comind.network
pacjam
One cool option is having Void-2 run inside the Letta Code harness (in headless mode) in a sandbox to let it have free rein over a computer, just to see what it will do while also connected to Bluesky.
tigranbs
In my experience, "memory" is really not that helpful in most cases. For all of my projects, I keep the documentation files and feature specs up to date, so that LLMs always know where to find what and which coding style guides the project follows.
Maintaining the memory is a considerable burden: you have to make sure that a simple "fix this linting" doesn't end up in memory as "we always fix that type of issue in that particular way." That's also the major problem I have with ChatGPT's memory: it starts to respond from the perspective of "this is correct for this person."
I am curious who sees benefits from memory in coding. Is it that the agent "learns how to code better", or that it learns "how the project is structured"? Either way, to me, this sounds like an easy project-setup thing.
DrSiemer
ChatGPT's implementation of Memory is terrible. It quickly fills up with useless garbage and sometimes even plain incorrect statements that are usually only relevant to one obscure conversation I had with it months ago.
A local, project-specific llm.md is absolutely something I require, though. Without it, language models kept "fixing" random things in my code that they considered incorrect, despite comments on those lines literally telling them to NOT CHANGE THIS LINE OR THIS COMMENT.
My llm.md is structured like this:
- Instructions for the LLM on how to use it
- Examples of a bad and a good note
- LLM editable notes on quirks in the project
It helps a lot with making an LLM understand when things are unusual for a reason.
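A rough sketch of that layout (trimmed and illustrative; the file names and example notes below are made up, not copied from my actual file):

```markdown
## How to use this file (instructions for the LLM)
Read these notes before editing. Only add a note if a future LLM would
otherwise "fix" something that is intentional.

## Example of a bad note
"Fixed a lint warning in utils.py" (one-off, adds no lasting context)

## Example of a good note
"parse_dates() deliberately ignores timezones; upstream data is naive
local time. Do not add timezone handling."

## LLM-editable notes on project quirks
- The retry loop in sync.py is intentionally unbounded; see the comment there.
```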
Besides that file, I wrap every prompt in a project-specific intro and outro. I use these to take care of common undesirable LLM behavior, like removing my comments.
I also tell it to use a specific format for its own comments, so I can have it automatically clean those up on the next pass, which takes care of most of the aftercare.
pacjam
I'm curious - how do you currently manage this `llm.md` in the tooling you use? E.g., do you symlink `AGENTS/CLAUDE.md` to `llm.md`? Also, is there any information you duplicate across your project-specific `llm.md` files that could potentially be shared globally?
pacjam
I think it cuts both ways - for example, I've definitely had the experience where, while typing into ChatGPT, I know ahead of time that whatever "memory" they're storing and injecting is likely going to degrade my answer, so I hop over to incognito mode. I've also had the experience of having a loosely related follow-up question and not wanting to dig through my chat history to find the exact convo, so it's nice to know that ChatGPT will probably pull the relevant details into context.
I think similar concepts apply to coding - in some cases, you have all the context you need up front (good coding practices help with this), but in many cases, there's a lot of "tribal knowledge" scattered across various repos that a human vet working in the org would certainly know, but an agent wouldn't (of course, there's somewhat of a circular argument here: if the agent eventually learned this tribal knowledge, it could just write it down into a CLAUDE.md file ;)). I also think there's a clear separation between procedural knowledge and learned preferences: the former is probably better represented as skills committed to a repo, whereas I view the latter more as a "system prompt learning" problem.
wooders
I think the problem with ChatGPT / other RAG-based memory solutions is that it's not possible to collaborate with the agent on what its memory should look like - so it makes sense that it's much easier to just have a stateless system and message queue, to avoid mysterious pollution. But Letta's memory management is primarily text/file based, so it's very transparent and controllable.
An example of how this kind of memory can help is learned skills (https://www.letta.com/blog/skill-learning) - if your agent takes the time to reflect on and learn from experience and create a skill, that skill is much more effective at making it better next time than just putting the raw trajectory into context.
danieltanfh95
Context poisoning is a real problem that these memory providers only make worse.
pacjam
IMO context poisoning is only fatal when you can't see what's going on (e.g., black-box memory systems like ChatGPT memory). The memory system used in the OP is fully white-box - you can see every raw LLM request (and see exactly how the memory influenced the final prompt payload).
skybrian
There are a variety of possible memory mechanisms, including simple things like recording a transcript (as a chatbot does) or having the LLM update markdown docs in a repo. So having memory isn't interesting by itself. Instead, my question is: what does Letta's memory look like? Memory is a data structure. How is it structured, and why is that good?
I'd be interested in hearing about how this approach compares with Beads [1].
pacjam
Beads looks cool! I haven't tried it, but as far as I can tell, it's more of a "Linear for agents" (memory as a tool), as opposed to baking long-term memory into the harness itself. In many ways, CLAUDE.md is a weak form of "baking memory into the harness", since AFAIK on bootup of `claude`, the CLAUDE.md gets "absorbed" and pinned in the system prompt.
Letta's memory system is based on the MemGPT reference architecture, which is intentionally very simple: break the system prompt up into "memory blocks" (all pinned to the context window, since they are injected into the system prompt), which are modifiable via memory tools (the original MemGPT paper is still a good reference for what this looks like at a high level: https://research.memgpt.ai/). So it's more like a "living CLAUDE.md" that follows your agent around wherever it's deployed - ofc, it's also interoperable with CLAUDE.md. For example, when you boot up Letta Code and run `/init`, it will scan for AGENTS.md/CLAUDE.md and ingest the files into its memory blocks.
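If it helps to see the shape of the idea, here's a very rough sketch in plain Python (conceptual only - this is not the actual Letta API or schema, and the block names/contents are made up):

```python
# Conceptual sketch only - not the real Letta API or data model.
# The idea: the system prompt is assembled from named "memory blocks"
# that stay pinned in context, and the agent can rewrite them
# through memory tools instead of editing a file on disk.

memory_blocks = {
    # Block names and contents are invented for illustration.
    "persona": "You are the coding agent for the acme-api repo.",
    "project": "Tests run with `make test`; never commit directly to main.",
    "human": "Prefers small, focused diffs and verbose commit messages.",
}

def render_system_prompt(base_instructions: str) -> str:
    """Inject every memory block into the system prompt on every turn."""
    rendered = "\n\n".join(
        f"<{name}>\n{text}\n</{name}>" for name, text in memory_blocks.items()
    )
    return f"{base_instructions}\n\n{rendered}"

def memory_replace(block: str, old: str, new: str) -> None:
    """A 'memory tool' the agent can call to edit part of a block in place."""
    memory_blocks[block] = memory_blocks[block].replace(old, new)

# e.g. the agent learns a preference and updates its own pinned context:
memory_replace("human", "verbose commit messages", "one-line commit messages")
```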
LMK if you have any other questions about how it works - happy to explain more!
jstummbillig
I find the concept of long-term memory for AI curiously dubious.
At first glance, of course it's something we want. It's how we do it, after all! Learning on the job is what enables us to do our jobs and so many other things.
On the other hand, humans are frustratingly stuck in their ways and not all that happy to change, and that is something societies and orgs fight a lot. Do I want to have to convince my coding agent to learn new behavior that conflicts with its existing memory?
It's not at all obvious to me to what extent memory is a bug or a feature. Does somebody have a clear case for why this is something we should want, and why it's not a problem?
pacjam
> Does somebody have a clear case on why this is something that we should want
For coding agents, I think it's clear that nobody wants to repeat the same thing over and over again. If a coding agent makes a mistake once (like `git add .` instead of manually picking files), it should be able to "learn" and never make the same mistake again.
Though I definitely agree w/ you that we shouldn't aspire to 1:1 replicate human memory. We want to be able to make our machines "unlearn" easily when needed, and we also want them to be able to "share" memory with other agents in ways that simply aren't possible with humans (until we all get Neuralinks, I guess).
pacjam
Thanks for sharing!! (Charles here from Letta) The original MemGPT (the starting point for Letta) was actually an agent CLI as well, so it's fun to see everything come full circle.
If you're a Claude Code user (I assume much of HN is), some context on Letta Code: it's a fully open-source coding harness (#1 model-agnostic OSS harness on Terminal-Bench, #4 overall).
It's specifically designed to be "memory-first" - the idea is that you use the same coding agents perpetually and have them build learned context (memory) about you / your codebase / your org over time. There are some built-in memory tools like `/init` and `/remember` to help guide this along (if your agent does something stupid, you can 'whack it' with `/remember`). There's also a `/clear` command, which resets the message buffer but keeps the learned context / memory inside the context window.
We built this for ourselves - Letta Code co-authors the majority of PRs on the letta-code GitHub repo. I personally have been using the same agent for ~2+ weeks (since the latest stable build), and it's fun to see its memory become more and more valuable over time.
LMK if you have any q's! The entire thing is OSS and designed to be super hackable, and it can run completely locally when combined with the Letta Docker image.