Claude Code Can Debug Low-Level Cryptography

simonw

Using coding agents to track down the root cause of bugs like this works really well:

> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.

The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).

Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.

rtpg

I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution, and they end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes isn't _wrong_, is completely beside the point.

The end result is these robots bikeshedding. When paired with junior engineers looking at this output and deciding to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.

I'm trying to use these to some extent, when I find myself in a canonical situation where they should work, and I'm not getting the value everyone else seems to get in many cases. Very much a "trying to explain a thing to a junior engineer takes more time than doing it myself" thing, except at least the junior is a person.

Wowfunhappy

Sometimes you hit a wall where something is simply outside of the LLM's ability to handle, and it's best to give up and do it yourself. Knowing when to give up may be the hardest part.

Notably, these walls are never where I expect them to be; despite my best efforts, I can't find any sort of pattern. LLMs can find really tricky bugs and get completely stuck on relatively simple ones.

ori_b

Doing it yourself is how you build and maintain the muscles to do it yourself. If you only do it yourself when the LLM fails, how will you maintain those muscles?

adastra22

So you feed the output into another LLM call to re-evaluate and assess, until the number of actual reports is small enough to be manageable. Will this result in false negatives? Almost certainly. But what does come out the end of it has a higher prior for being relevant, and you just review what you can.

Again, worst case all you wasted was your time, and now you've bounded that.
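
As a rough sketch of the kind of pipeline I mean (the prompts and findings.txt are just placeholders, and it assumes claude -p will take piped stdin as context):

    # first pass: generate candidate findings
    claude -p 'review src/ for potential bugs and list each finding briefly' > findings.txt

    # second pass: triage, keeping only findings worth a human look
    claude -p 'discard anything speculative or trivial in these bug reports and return only the findings worth human review' < findings.txt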

majormajor

They're quite good at algorithm bugs, a lot less good at concurrency bugs, IME. Which is very valuable still, just that's where I've seen the limits so far.

They're also better at making tests for algorithmic things than for concurrency situations, but they can get pretty close. They just usually don't have great out-of-the-box ideas for "how to ensure these two different things run in the desired order."

Everything that I dislike about generating non-greenfield code with LLMs isn't relevant to the "make tests" or "debug something" usage. (Weird/bad choices about when to duplicate code vs refactor things, lack of awareness around desired "shape" of codebase for long-term maintainability, limited depth of search for impact/related existing stuff sometimes, running off the rails and doing almost-but-not-quite stuff that ends up entirely the wrong thing.)

bongodongobob

Well if you know it's wrong, tell it, and why. I don't get the expectation for one shotting everything 100% of the time. It's no different than bouncing ideas off a colleague.

ewoodrich

The weak points raised by the parent comment are specifically examples where the problem exists outside the model's "peripheral vision" from its context window and, speaking from personal experience, aren't as simple as adding a line to the CLAUDE.md saying "do this / don't do this".

I agree that the popular "one shot at all costs / end the chat at the first whiff of a mistake" advice is much too reductive but unlike a colleague, after putting in all that effort into developing a shared mental model of the desired outcome you reach the max context and then all that nuanced understanding instantly evaporates. You then have to hope the lossy compression into text instructions will actually steer it where you want next time but from experience that unfortunately is far from certain.

majormajor

I don't care about one-shotting; the stuff it's bad at debugging is the stuff where, even when you tell it "that's not it", it just makes up another plausible-but-wrong idea.

For code modifications in a large codebase the problem with multi-shot is that it doesn't take too many iterations before I've spent more time on it. At least for tasks where I'm trying to be lazy or save time.

nicklaf

It's painfully apparent when you've reached the limitations of an LLM on a problem it's ill-suited for (like a concurrency bug), because it will just keep spitting out nonsense, eventually going in circles or going totally off the rails.

lxgr

I’ve been pretty impressed with LLMs at (to me) greenfield hobby projects, but not so much at work in a huge codebase.

After reading one of your blog posts recommending it, I decided to specifically give them a try as bug hunters/codebase explainers instead, and I’ve been blown away. Several hard-to-spot production bugs down in two weeks or so that would have all taken me at least a few focused hours to spot all in all.

mschulkind

One of my favorite ways to use LLM agents for coding is to have them write extensive documentation on whatever I'm about to dig in coding on. Pretty low stakes if the LLM makes a few mistakes. It's perhaps even a better place to start for skeptics.

manquer

[delayed]

dboreham

Same. Initially I was surprised how good it was. Now I routinely do this on every new codebase. And this isn't JavaScript todo apps: these are large, complex distributed applications written in Rust.

jack_tripper

> Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.

Why? Bug hunting is more challenging and cognitively intensive than writing code.

theptip

Bug hunting tends to be interpolation, which LLMs are really good at. Writing code is often some extrapolation (or interpolating at a much more abstract level).

simonw

Sometimes it's the end of the day and you've been crunching for hours already and you hit one gnarly bug and you just want to go and make a cup of tea and come back to some useful hints as to the resolution.

lxgr

Why as in “why should it work” or “why should we let them do it”?

For the latter, the good news is that you’re free to use LLMs for debugging or completely ignore them.

teaearlgraycold

I’m a bit of an LLM hater because they’re overhyped. But in these situations they can be pretty nice if you can quickly evaluate correctness. If evaluating correctness is harder than searching on your own, then they’re a net negative. I’ve found with my debugging it’s really hard to know which will be the case. And since it’s my responsibility to build a “Do I give the LLM a shot?” heuristic, that’s very frustrating.

qa34514324

I have tested the AI SAST tools that were hyped after a curl article on several C code bases and they found nothing.

Which low level code base have you tried this latest tool on? Official Anthropic commercials do not count.

simonw

You're posting this comment on a thread attached to an article where Filippo Valsorda - a noted cryptography expert - used these tools to track down gnarly bugs in Go cryptography code.

tptacek

They're also using "AI SAST tools", which: I would not expect anything branded as a "SAST" tool to find interesting bugs. SAST is a term of art for "pattern matching to a grocery list of specific bugs".

delusional

These are not "gnarly bugs".

pton_xd

> Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers.

Related, lately I've been getting tons of Anthropic Instagram ads; they must be near a quarter of all the sponsored content I see for the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude. Or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.

simonw

I think it might be that they've hit product-market fit.

Developers find Claude Code extremely useful (once they figure out how to use it). Many developers subscribe to their $200/month plan. Assuming that's profitable (and I expect it is, since even for that much money it cuts off at a certain point to avoid over-use) Anthropic would be wise to spend a lot of money on marketing to try and grow their paying subscriber base for it.

chatmasta

What makes it better than VSCode Copilot with Claude 4.5? I barely program these days since I switched to PM, but I recently started using that and it seems pretty effective… why should I use a fork instead?

timr

There’s really no functional difference. The VSC agent mode can do everything you want an agent to do, and you can use Claude if you like. If you want to use the CLI instead, you can use Claude Code (or the GitHub one, or Codex, or Aider, or…)

I suspect that a lot of the “try using Claude code” feedback is just another version of “you’re holding it wrong” by people who have never tried VSC (parent is not in this group however). If you’re bought into a particular model system, of course, it might make more sense to use their own tool.

Edit: I will say that if you’re the YOLO type who wants your bots to be working a bunch of different forks in parallel, VSC isn’t so great for that.

danielbln

Claude Code is not a VSCode fork, it's a terminal CLI. It's a rather different interaction paradigm compared to your classical IDE (that said, you can absolutely run Claude Code inside a terminal inside VSCode).

undeveloper

I find copilot simply worse at "understanding" codebases than claude code

XenophileJKO

Personally my biggest piece of advice is: AI First.

If you really want to understand what the limitations are of the current frontier models (and also really learn how to use them), ask the AI first.

By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said.. you also need to understand how they fail and build an intuition for why they fail.

Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people that have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.

imiric

> By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests.

Unfortunately, it doesn't quite work out that way.

Yes, you will get better at using these tools the more you use them, which is the case with any tool. But you will not learn what they can do as easily, or at all.

The main problem with them is the same one they've had since the beginning. If the user is a domain expert, then they will be able to quickly spot the inaccuracies and hallucinations in the seemingly accurate generated content, and, with some effort, coax the LLM into producing correct output.

Otherwise, the user can be easily misled by the confident and sycophantic tone, and waste potentially hours troubleshooting, without being able to tell if the error is on the LLM side or their own. In most of these situations, they would've probably been better off reading the human-written documentation and code, and doing the work manually. Perhaps with minor assistance from LLMs, but never relying on them entirely.

This is why these tools are most useful to people who are already experts in their field, such as Filippo. For everyone else who isn't, and actually cares about the quality of their work, the experience is very hit or miss.

> That being said.. you also need to understand how they fail and build an intuition for why they fail.

I've been using these tools for years now. The only intuition I have for how and why they fail is when I'm familiar with the domain. But I had that without LLMs as well, whenever someone is talking about a subject I know. It's impossible to build that intuition with domains you have little familiarity with. You can certainly do that by traditional learning, and LLMs can help with that, but most people use them for what you suggest: throwing things over the wall and running with it, which is a shame.

> I work with people that have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.

I haven't used GPT-5-Codex, but have experience with Sonnet 4.5, and it's only marginally better than the previous versions IME. It still often wastes my time, no matter the quality or amount of context I feed it.

XenophileJKO

I guess there are several unsaid assumptions here. The article is by a domain expert working on their domain. Throw work you understand at it, see what it does. Do it before you even work on it. I kind of assumed based on the audience that most people here would be domain experts.

As for the building intuition, perhaps I am over-estimating what most people are capable of.

Working with and building systems using LLMs over the last few years has helped me build a pretty good intuition about what is breaking down when the model fails at a task. While having an ML background is useful in some very narrow cases (like: 'why does an LLM suck at ranking...'), I "think" a person can get a pretty good intuition purely based on observational outcomes.

I've been wrong before though. When we first started building LLM products, I thought, "Anyone can prompt, there is no barrier for this skill." That was not the case at all. Most people don't do well trying to quantify ambiguity, specificity, and logical contradiction when writing a process or set of instructions. I was REALLY surprised how I became a "go-to" person to "fix" prompt systems, all based on linguistics and systematic process decomposition. Some of this was understanding how the auto-regressive attention system benefits from breaking the work down into steps, but really most of it was just "don't contradict yourself and be clear".

Working with them extensively has also helped me hone in on how the models get "better" with each release. Though most of my expertise is with the OpenAI and Anthropic model families.

I still think most engineers "should" be able to build intuition generally about what works well with LLMs and how to interact with them, but you are probably right. It will be just like most ML engineers: they see something work in a paper and then just paste it onto their model with no intuition about what it structurally changes in the model dynamics.

fn-mote

> I kind of assumed based on the audience that most people here would be domain experts.

No take on the rest of your comment, but it’s the nature of software engineering that we work on a breadth of problems. Nobody can be a domain expert in everything.

For example: I use a configurable editor every day, but I’m not a domain expert in the configuration. An LLM wasted an hour of my day pointing me in “almost the right direction” when after 10 minutes I really needed to RTFM.

I am a domain expert in some programming languages, but now I need to implement a certain algorithm… I’m not an expert in that algorithm. There’s lots of traps for the unwary.

I just wanted to challenge the assumption that we are all domain experts in the things we do daily. We are, but … with limitations.

zcw100

I just recently found a number of bugs in both the RELIC and MCL libraries. It took a while to track them down but it was remarkable that it was able to find them at all.

Frannky

CLI terminals are incredibly powerful. They are also free if you use Gemini CLI or Qwen Code. Plus, you can access any OpenAI-compatible API (2k TPS via Cerebras at $2/M, or local models). And you can use them in IDEs like Zed with ACP mode.

All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. Also, they are great UIs—you can create smart programs just with CLI + MCP servers + MD files. Truly amazing tech.

BrokenCogs

How good is Gemini CLI compared to Claude Code and OpenAI Codex?

nl

Not great.

It's ok for documentation or small tasks, but consistently fails at tasks that both Claude or Codex succeed at.

Frannky

I started with Claude Code, realized it was too much money for every message, then switched to Gemini CLI, then Qwen. Probably Claude Code is better, but I don't need it since I can solve my problems without it.

distances

I've found the regular Claude Pro subscription quite enough for coding tasks when you have a bunch of other things to do anyway, like code reviews, and won't spend the whole day running it.

luxuryballs

Yeah, I was using OpenRouter for Claude Code and burned through $30 in credits to do things that would have been like $1.50 if I had just used the OpenRouter chat. I decided it was better for now to do the extra “secretary work” of manual entry, context management of the chat, and the pain of attaching files. It was pretty disappointing because at first I had assumed it would not be much different in price at all.

cmrdporcupine

Try what I've done: use the Claude Code tool but point your ANTHROPIC_BASE_URL at a DeepSeek API membership. It's like 1/10th the cost, and about 2/3rds the intelligence.

Sometimes I can't really tell.
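
Roughly like this, though the endpoint and model name below are from memory and may well be wrong, so treat it as a sketch and check DeepSeek's docs:

    # untested sketch: point Claude Code at an Anthropic-compatible endpoint
    export ANTHROPIC_BASE_URL='https://api.deepseek.com/anthropic'  # assumed endpoint
    export ANTHROPIC_AUTH_TOKEN='<your DeepSeek API key>'
    export ANTHROPIC_MODEL='deepseek-chat'                          # assumed model name
    claude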

wdfx

Gemini and its tooling are absolute shit. The LLM itself is barely usable and needs so much supervision you might as well do the work yourself. Then couple that with an awful CLI and VSCode interface and you'll find that it's just a complete waste of time.

Compared to the Anthropic offering, it's night and day. Claude gets on with the job and makes me way more productive.

Frannky

It's probably a mix of what you're working on and how you're using the tool. If you can't get it done for free or cheaply, it makes sense to pay. I first design the architecture in my mind, then use Grok 4 fast (free) for single-shot generation of main files. This forces me to think first, and read the generated code to double-check. Then, the CLI is mostly for editing, clerical work, testing, etc. That said, I do try to avoid coding altogether if the CLI + MCP servers + MD files can solve the problem.

SamInTheShell

> Gemini and its tooling are absolute shit.

Which model were you using? In my experience Gemini 2.5 Pro is just as good as Claude Sonnet 4 and 4.5. It's literally what I use as a fallback to wrap something up if I hit the 5 hour limit on Claude and want to just push past some incomplete work.

I'm just going to throw this out there. I get good results from a truly trash model like gpt-oss-20b (quantized at 4bits). The reason I can literally use this model is because I know my shit and have spent time learning how much instruction each model I use needs.

Would be curious what you're actually having issues with if you're willing to share.

idiotsecant

Claude code ux is really good, though.

qsort

This resonates with me a lot:

> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete

I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.

A "continuously running" mode that would ping me would be interesting to try.

imiric

On the one hand, I agree with this. The chat UI is very slow and inefficient.

But on the other, given what I know about these tools and how error-prone they are, I simply refuse to give them access to my system, to run commands, or do any action for me. Partly due to security concerns, partly due to privacy, but mostly distrust that they will do the right thing. When they screw up in a chat, I can clean up the context and try again. Reverting a removed file or messed up Git repo is much more difficult. This is how you get a dropped database during code freeze...

The idea of giving any of these corporations such privileges is unthinkable for me. It seems that most people either don't care about this, or are willing to accept it as the price of admission.

I experimented with Aider and a self-hosted model a few months ago, and wasn't impressed. I imagine the experience with SOTA hosted models is much better, but I'll probably use a sandbox next time I look into this.

cmrdporcupine

Aider hurt my head; it did not seem... good. Sorry to say.

If you want open source and want to target something over an API, "crush" (https://github.com/charmbracelet/crush) is excellent.

But you should try Claude Code or Codex just to understand them. Can always run them in a container or VM if you fear their idiocy (and it's not a bad idea to fear it)
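
By "container" I mean something roughly like this (untested sketch: it installs Claude Code from npm into a stock node image and only mounts the project directory):

    # disposable container that can only touch the mounted project
    docker run --rm -it \
      -v "$PWD":/work -w /work \
      -e ANTHROPIC_API_KEY \
      node:20 \
      sh -c 'npm install -g @anthropic-ai/claude-code && claude'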

Like I said sibling, it's not the right modality. Others agree. I'm a good typer and good at writing, so it doesn't bug me too much, but it does too much without asking or working through it. Sometimes this is brilliant. Other times it's like.. c'mon guy, what did you do over there? What Balrog have I disturbed?

It's good to be familiar with these things in any case because they're flooding the industry and you'll be reviewing their code for better or for worse.

cmrdporcupine

I absolutely agree with this sentiment as well and keep coming back to it. What I want is more of an actual copilot which works in a more paired way and forces me to interact with each of its changes and also involves me more directly in them, and teaches me about what it's doing along the way, and asks for more input.

A more socratic method, and more augmentic than "agentic".

Hell, if anybody has investment money and energy and shares this vision, I'd love to work on creating this tool with you. I think these models are being misused right now in an attempt to automate us out of work, when their real amazing latent power is the intuition that we're talking about on this thread.

Misused they have the power to worsen codebases by making developers illiterate about the very thing they're working on because it's all magic behind the scenes. Uncorked they could enhance understanding and help better realize the potential of computing technology.

reachableceo

Have you tried to ask the agents to work with you in the way you want?

I’ve found that using some high level direction / language and sharing my wants / preferences for workflow and interaction works very well.

I don’t think that you can find an off-the-shelf system to do what you want. I think you have to customize it to your own needs as you go.

Kind of like how you customize emacs as it’s running to your desires.

I’ve often wondered if you could put a mini LLM into emacs or vscode and have it implement customizations :)

cmrdporcupine

I have, but the problem is in part the tool itself and the way it works. It's just not written with an interactive prompting style in mind. CC is like "Accept/Ask For Changes/Reject" for often big giant diffs, and it's like... no, the UI should be: here's an editor let's work on this together, oh I see what you did there, etc...

mccoyb

I'm working on such a thing, but I'm not interested in money, nor do I have money to offer - I'm interested in a system which I'm proud of.

What are your motivations?

Interested in your work: from your public GitHub repos, I'm perhaps most interested in `moor` -- as it shares many design inclinations that I've leaned towards in thinking about this problem.

cmrdporcupine

Unfortunately... mooR is my passion project, but I also need to get paid, and nobody is paying me for that.

I'm off work right now, between jobs and have been working 10, 12 hours a day on it. That will shortly have to end. I applied for a grant and got turned down.

My motivations come down to making a living doing the things I love. That is increasingly hard.

spacechild1

> Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.

Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.

danielbln

OP said "for me to reason about it", not for the LLM to reason about it.

I agree though, LLMs can be incredible debugging tools, but they are also incredibly gullible and love to jump to conclusions. The moment you turn your own fleshy brain off is when they go to la-la land.

delaminator

> For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it did before we fixed it?

You can use Git hooks to do that. If you have tests and one fails, spawn an instance of claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try to work out why'

    $ claude -p 'hello, just tell me a joke about databases'

    A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"

    $ 
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push.

> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.

I like this idea. I think I shall get Claude to work out the mechanism itself :)

It is even a suggestion on this Claude cheat sheet:

https://www.howtouselinux.com/post/the-complete-claude-code-...

jamesponddotco

This could probably be implemented as a simple Bash script, if the user wants to run everything manually. I might just do that to burn some time.
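
Something like this, maybe (untested sketch; run-tests.sh and the prompt are placeholders, and it assumes claude -p will take the piped log as context):

    #!/usr/bin/env bash
    # run the suite; only on failure, hand the log to claude in print mode
    if ! ./run-tests.sh > test.log 2>&1; then
        claude -p 'these tests failed, look in src/ and suggest the likely root cause' < test.log
    fi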

delaminator

Sure, there are multiple ways of spawning an instance.

The only thing I imagine might be a problem is claude demanding a login token, as that happens quite regularly.

jasonjmcghee

I found LLM debugging to work better if you give the LLM access to a debugger.

You can build this pretty easily: https://github.com/jasonjmcghee/claude-debugs-for-you

phendrenad2

A whole class of tedious problems has been eliminated by LLMs because they are able to look at code in a "fuzzy" way. But this can be a liability, too. I have a codebase that "looks kinda" like a nodejs project, so AI agents usually assume it is one; even if I rename the package.json, they will inspect the contents and immediately clock it as "node-like".

gdevenyi

Coming soon, adversarial attacks on LLM training to ensure cryptographic mistakes.

didibus

This is basically the ideal scenario for coding agents. Easily verifiable through running tests, pure logic, algorithmic problem. It's the case that has worked the best for me with LLMs.