I used o3 to find a remote zeroday in the Linux SMB implementation
68 comments
May 24, 2025 · nxobject
kweingar
How do we benchmark these different methodologies?
It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?
nindalf
The author is up front about the limitations of their prompt. They say
> In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
mrlongroots
I think there are two aspects to LLM usage:
1. Having workflows to be able to provide meaningful context quickly. Very helpful.
2. Arbitrary incantations.
I think No. 2 may provide some random amounts of value with one model and not the other, but as a practitioner you shouldn't need to worry about it long-term. Patterns models pay attention to will change over time, especially as they become more capable. No. 1 is where the value is at.
As an example: as a systems grad student, I find it a lot more useful to maintain a project wiki with LLMs in the picture. It makes coordinating with human collaborators easier too, and I just copy-paste the entire wiki before beginning a conversation. Any time I have a back-and-forth with an LLM about some design discussion that I want archived, I ask it to emit markdown, which I then copy-paste into the wiki. It's not perfectly organized, but it keeps the key bits there and makes generating papers etc. that much easier.
p0w3n3d
Watch Karpathy's video about LLMs; he explains why made-up HTML tags work. It's to help the tokenizer.
threeseed
It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.
Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.
roywiggins
Engineering principles are probably the best we've got when it comes to trying to work with a poorly understood system? That doesn't mean they'll work necessarily, but...
Retr0id
The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.
ianbutler
We’ve been working on a system that dramatically increases the signal-to-noise ratio for finding bugs. At the same time, we’ve been thoroughly benchmarking the popular software agents in this space.
We’ve found a wide range of results, and we have a conference talk coming up soon where we’ll release everything publicly. Stay tuned; it'll be pretty illuminating on the state of the space.
Edit: confusing wording
sebmellen
Interesting. This is for Bismuth? I saw your pilot program link — what does that involve?
ianbutler
Yup! We have multiple businesses working with us. A pilot involves deploying the tool, providing feedback (we're connected over Slack with all our partners for a direct line to us), making sure the uses fit expectations for your business, and working towards a long-term partnership.
We have several deployments in other people's clouds right now, as well as usage of our own cloud version, so we're flexible here.
tough
I was thinking about this the other day: wouldn't it be feasible to fine-tune a model (or something like that) on every git change, mailing list post, etc. the Linux kernel has ever had?
Wouldn't such an LLM be the closest synthetic version of a person who has worked on a codebase for years, learnt all its quirks, etc.?
There's only so much you can fit in a large context window; some codebases are already 200k tokens just for the code as is, so idk.
sodality2
I'd be willing to bet the sum of all code submitted via patches, ideas discussed via lists, etc doesn't come close to the true amount of knowledge collected by the average kernel developer's tinkering, experimenting, etc that never leaves their computer. I also wonder if that would lead to overfitting: the same bugs being perpetuated because they were in the training data.
andix
1:50 is a great detection ratio for finding a needle in a haystack.
manmal
If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically. It’s just quite expensive to do all that right now.
threeseed
Except that in my experience half the time it will modify the implementation in order to make the tests pass.
And it will do this no matter how many prompts you try or how forcefully you ask it.
moyix
With security vulnerabilities, you don't give the agent the ability to modify the potentially vulnerable software, naturally. Instead you make them do what an attacker would have to do: come up with an input that, when sent to the unmodified program, triggers the vulnerability.
How do you know if it triggered the vulnerability? Luckily, for low-level memory safety issues like the ones Sean (and o3) found, we have very good oracles for detecting memory safety violations, like KASAN, so you can basically just let the agent throw inputs at ksmbd until you see something that looks kind of like this: https://groups.google.com/g/syzkaller/c/TzmTYZVXk_Q/m/Tzh7SN...
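A minimal sketch of such an oracle check, in Python. It assumes the harness has already captured the kernel log as text after sending an input. The function name and the sample log are made up for illustration; only the `BUG: KASAN:` report prefix is taken from real KASAN output.

```python
import re

# KASAN reports in the kernel log contain a line of the form
# "BUG: KASAN: <bug type> in <location>"; a timestamp may precede it.
KASAN_RE = re.compile(r"BUG: KASAN: (?P<kind>[\w-]+) in (?P<where>\S+)")


def find_kasan_reports(kernel_log: str) -> list[tuple[str, str]]:
    """Return (bug type, location) for every KASAN report line in the log."""
    return [(m.group("kind"), m.group("where")) for m in KASAN_RE.finditer(kernel_log)]


# Illustrative log text only, not real output from ksmbd.
sample_log = """\
[   12.345678] ksmbd: connection accepted
[   12.456789] BUG: KASAN: slab-use-after-free in example_handler+0x42/0x100
[   12.456800] Read of size 8 at addr ffff888012345678 by task kworker/0:1/42
"""

reports = find_kasan_reports(sample_log)
print(reports)
```

An empty list means the input did not trip the oracle, so the agent keeps trying; a non-empty list is a lead that comes with evidence attached.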
quentinp
Exactly. Many AI users can’t triage effectively; as a result, open source projects get a lot of spam now: https://arstechnica.com/gadgets/2025/05/open-source-project-...
iandanforth
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
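A rough sketch of what "run it many times" can look like in code. `run_once` is a hypothetical stand-in for a single model invocation that returns the findings it reported; here it is replaced by a deterministic stub with made-up findings so the example runs offline.

```python
from collections import Counter
from typing import Callable, Iterable


def tally_findings(run_once: Callable[[int], Iterable[str]], runs: int) -> Counter:
    """Call run_once(i) for each run and count in how many runs each finding appears."""
    counts: Counter = Counter()
    for i in range(runs):
        counts.update(set(run_once(i)))  # set(): count a finding at most once per run
    return counts


def fake_run(i: int) -> list[str]:
    """Stub in place of a real model call: made-up findings on a fixed schedule."""
    findings = []
    if i % 10 == 0:
        findings.append("use-after-free in session teardown")
    if i % 2 == 0:
        findings.append("spurious null-deref report")
    return findings


counts = tally_findings(fake_run, runs=100)
for finding, n in counts.most_common():
    print(f"{n:3d}/100  {finding}")
```

How often a finding recurs across runs says nothing by itself about whether it is real, so the tally only orders the triage queue; it does not replace triage.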
JFingleton
Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.
When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.
roncesvalles
A lot of money is all you need~
bbarnett
A lot of burned coal, is what.
The "don't blame the victim" trope is valid in many contexts. This one application might be "hackers are attacking vital infrastructure, so we need to fund vulnerabilities first". And hackers use AI now, likely hacked into and for free, to discover vulnerabilities. So we must use AI!
Therefore, the hackers are contributing to global warming. We, dear reader, are innocent.
sdoering
So basically running a microwave for about 800 seconds, or a bit more than 13 minutes per model?
Oh my god - the world is gonna end. Too bad, we panicked because of exaggerated energy consumption numbers for using an LLM when doing individual work.
Yes - when a lot of people do a lot of prompting, these one tenth of a second to 8 seconds of running the microwave per prompt add up. But I strongly suggest that we could all drop our energy consumption significantly using other means, instead of blaming the blog post's author for his energy consumption.
The "lot of burned coal" is probably not that much in this blog post's case, given that 1 kWh is about 0.12 kg coal equivalent (and yes, I know that we need to burn more than that for 1 kWh). Still not that much, compared to quite a few other human activities.
If you want to read up on it, James O'Donnell and Casey Crownhart try to pull together a detailed account of AI energy usage for MIT Technology Review.[1] I found that quite enlightening.
[1]: https://www.technologyreview.com/2025/05/20/1116327/ai-energ...
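For a sense of scale, the arithmetic above can be worked through in a few lines of Python. The 1 kW microwave power is an assumption (the comment gives only the running time), and the 0.12 kg/kWh figure is the coal-equivalent value quoted above.

```python
MICROWAVE_POWER_KW = 1.0    # assumed; household microwaves vary
RUN_TIME_S = 800            # from the comment: about 13 minutes per model
COAL_EQ_KG_PER_KWH = 0.12   # coal equivalent quoted in the comment


def energy_kwh(power_kw: float, seconds: float) -> float:
    """Energy in kWh used by a device drawing power_kw for the given time."""
    return power_kw * seconds / 3600.0


def coal_equivalent_kg(kwh: float) -> float:
    """Convert electrical energy to kg of coal equivalent (thermal equivalence only)."""
    return kwh * COAL_EQ_KG_PER_KWH


energy = energy_kwh(MICROWAVE_POWER_KW, RUN_TIME_S)
coal = coal_equivalent_kg(energy)
print(f"{RUN_TIME_S / 60:.1f} min, {energy:.3f} kWh, {coal * 1000:.0f} g coal equivalent")
```

Under these assumptions the total per model is a fraction of a kWh, which works out to tens of grams of coal equivalent before conversion losses.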
wongarsu
How much longer would OP have needed to find the same vulnerability without LLM help? Then multiply that by the energy used to produce 2000kcal/day of food as well as the electricity for running their computer.
Usually LLMs come out far ahead in those types of calculations. Compared to humans they are quite energy efficient
Balooga
Between $3k and $30k to solve a single ARC-AGI problem [1]. Not sure if "100 runs" makes this comparable.
[1] https://techcrunch.com/2025/04/02/openais-o3-model-might-be-...
KTibow
> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.
This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get differing results.
iamdanieljohns
Could you provide some links to relevant work/research on using a "scratchpad" that you liked?
gizmodo59
Having tried both, I’d say o3 is in a league of its own compared to 3.7 or even Gemini 2.5 Pro. The benchmarks may not show a lot of gain, but that gain matters a lot when the task is very complex. What’s surprising is that they announced it last November and only released it a month ago. (I’m guessing lots of safety work took time, but no idea.) Can’t wait for o4!
firesteelrain
I really hope this is legit and not what keeps happening to curl [1]
[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-f...
simonw
There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:
> I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.
logifail
My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server...
Q1: Who is using ksmbd in production?
Q2: Why?
donnachangstein
1. People that were using the in-kernel SMB server in Solaris or Windows.
2. Samba performance sucks (by comparison) which is why people still regularly deploy Windows for file sharing in 2025.
Anybody know if this supports native Windows-style ACLs for file permissions? That is the last remaining reason to still run Solaris but I think it relies on ZFS to do so.
Samba's reliance on Unix UID/GID and the syncing as part of its security model is still stuck in the 1970s unfortunately.
The caveat is the in-kernel SMB server has been the source of at least one holy-shit-this-is-bad zero-day remote root hole in Windows (not sure about Solaris) so there are tradeoffs.
raverbashing
> Samba's reliance on Unix UID/GID and the syncing as part of its security model is still stuck in the 1970s unfortunately.
Sigh. This is why we can't have nice things
Like yeah having smb in kernel is faster but honestly it's not fundamentally faster. But it seems the will to make samba better isn't there
AshamedCaptain
Licensing. Samba is GPLv3, Linux is only GPLv2.
noname120
The same reason people use kmod-trelay instead of relayd I guess
pixl97
I would assume for the reason of being lightweight and high performance?
foobar10000
Smb over 25gbit networks - user space samba is much worse there.
Henchman21
This is interesting to me! I regularly deploy 25G network connections, but I don’t think we’d run SMB over that. I am super curious about the industry and use case, if you’re willing to share!
martinald
I think this is the biggest alignment problem with LLMs in the short term. They are getting scarily good at this.
I recently found a pretty serious security vulnerability in a very niche open source server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there that wasn't worth mining for vulnerabilities manually for nefarious purposes, but that could lead to really serious problems once the search is automated.
tekacs
The (obvious) flipside of this coin is that it allows us to run this adversarially against our own codebases, catching bugs that could otherwise have been found by a researcher, but that we can instead patch proactively.
I wouldn't (personally) call it an alignment issue, as such.
Legend2440
If attackers can automatically scan code for vulnerabilities, so can defenders. You could make it part of your commit approval process or scan every build or something.
zielmicha
(To be clear, I'm not the author of the post, the title just starts with "How I")
jobswithgptcom
Wow, interesting. I've been hacking on a tool called https://diffwithgpt.com with a similar angle, but indexing git changelogs with qwen to have it flag backward-compatibility and security risks when upgrading k8s etc.
dboreham
I feel like our jobs are reasonably secure for a while because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"
A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for system prompt, background information, and auxiliary instructions [1], and then running it through `llm`.
It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.
[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
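A small sketch of that kind of organization, in Python. The file names and their contents here are hypothetical, and the linked repository may lay things out differently; the point is only that each concern lives in its own file and the final prompt is assembled mechanically.

```python
import tempfile
from pathlib import Path

# Hypothetical file names, one per concern.
PROMPT_PARTS = ("system.prompt", "background.prompt", "instructions.prompt")


def assemble_prompt(prompt_dir: Path, parts: tuple[str, ...] = PROMPT_PARTS) -> str:
    """Concatenate the prompt files in order, separated by blank lines."""
    texts = [(prompt_dir / name).read_text(encoding="utf-8").strip() for name in parts]
    return "\n\n".join(texts)


# Demo with throwaway files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "system.prompt").write_text("You are reviewing C code for memory-safety bugs.\n", encoding="utf-8")
    (d / "background.prompt").write_text("The code is a network file server.\n", encoding="utf-8")
    (d / "instructions.prompt").write_text("Report only issues you can justify.\n", encoding="utf-8")
    prompt = assemble_prompt(d)

print(prompt)
```

The assembled string could then be handed to whichever LLM client is in use. Keeping the parts in separate files makes it easy to vary one of them while holding the others fixed.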