Surprisingly fast AI-generated kernels we didn't mean to publish yet
197 comments
· May 30, 2025 · miki123211
neom
For fun last month I decided to see if I could build a fully functional business of agents. It's 175 Python files (employees) built up of internal roles within those files (tasks). So what I have is 175 employees who are able to pass output around each other, understand the work, complete the work, and understand where to send the output. The whole system has the ability to do around 275 base processes (same as a business at > $100MM ARR). I started on a Friday afternoon, slept a little bit, and finished on Monday afternoon. After I had it running I sent it to a VC friend to show them and they sent back the deck of a startup that is in stealth with $25MM doing it the exact same way. With 1 month, a designer, and an engineer, I could have it MVP functional for anyone to use ($40k?). Times are changing. Here is kinda how it looks: https://s.h4x.club/9ZuO4XQR / https://s.h4x.club/jkuB8ZED (I've evolved it a little since this, and if you're an engineer and look at my files and think, this guy is a moron: I know! :))
yusina
> understand the work
LLMs don't understand. It's mind-boggling to me that large parts of the tech industry think that.
Don't ascribe to them what they don't have. They are fantastic at faking understanding. Don't get me wrong, for many tasks, that's good enough. But there is a fundamental limit to what all this can do. Don't get fooled into believing there isn't.
motorest
> LLMs don't understand. It's mind-boggling to me that large parts of the tech industry think that.
I think you might be tied to a definition of "understanding" that doesn't really apply.
If you prompt an LLM with ambiguous instructions, it asks you to clarify (i.e., to extend the prompt and provide more context), and once you do, the LLM outputs something that exactly meets the goals of the initial prompt. Does that count as understanding?
If it walks like a duck and quacks like a duck, it's a duck, or something so close to a duck that we'd be better off calling it that.
GoatInGrey
I don't believe the user meant "understand" in the classical biological and philosophical sense, or were otherwise attempting to anthropomorphize the systems. They were speaking from the practical experience of "this thing takes a somewhat ambiguous input with unique constraints and implements the ask more-or-less as intended".
squidbeak
They understand. Anything able to reason about any arbitrary request and form a plan tailored to that request understands well enough to qualify for the verb. The mechanism behind it may feel hollow or fake. But if its responses reliably show understanding, the LLM understands - by any ordinary measure.
hayst4ck
Nearly every argument like this has the same fatal flaw, and it's generally not the critique of the AI, but the critique reflected back onto humans.
Humans also don't understand and are frequently faking understanding, which for many tasks is good enough. There are fundamental limits to what humans can do.
The AI of a few months ago, before OpenAI's sycophancy, was quite impressive; it is less so now, which suggests it is being artificially stunted so more can be charged later, and that privately it is much better than what is public. I can't say it "understands," but I can say it outclasses many, many humans. There are already a number of tasks based around understanding where I would choose an LLM over a human.
It's worth looking at Bloom's taxonomy (https://en.wikipedia.org/wiki/Bloom%27s_taxonomy): in the 2001 revised edition, the levels were renamed and reordered to Remember, Understand, Apply, Analyze, Evaluate, and Create. In my opinion it is at least human-competitive for everything but Create.
I used to be very bearish on AI, but if you haven't had a "wow" moment when using one, then I don't think you've tried to explore what it can do or tested its limits with your own special expertise/domain knowledge; or if you have, then I'm not sure we're using the same LLMs. Then compare that experience to normal people, not your peer group. Compare an LLM to people into astrology, crystal healing, or homeopathy and ask which has more "understanding."
zenburnmyface
meh. I feel this is just a linguistic shortcut, similar to how _trained_ biologists can talk about a species or organism evolving some trait. Of course the organism isn't _really_ evolving with any goal in mind, but that's clear to the speaker and audience. Whether or not LLMs understand (very unlikely), it's clear what we mean by an LLM "understanding": has the context + prior training to make reasonable predictions. But no one wants to write that each time.
bobxmax
How do you know?
neom
What is the limit my system will reach?
rzz3
That's an interesting word to pick on. Understanding still means something here in a relative sense.
robbomacrae
Is anyone else annoyed that VCs are out there sharing decks of startups in stealth with potential competitors? How often does this happen?
eterm
I would be annoyed along with you if I thought the post was true.
acchow
> The whole system has the ability to do around 275 base processes
It’s incredibly easy to get LLMs to do a lot of stuff that seems convincing.
They are literally trained for plausibility.
literalAardvark
Engineers who would judge someone's frontier MVP like that are not even worth worrying about.
This is epic work. Would love to see more of it but I guess you're gonna take it the startup route since you have connections. Best of luck.
neom
Thanks!!! I decided not to build it: that space is already too busy. There is a startup with $25MM in stealth, and who else is in stealth? On top of that, this method will get stale very, very quickly; foundation model businesses are just too hard to work around right now, and it's a silly way to do business. My magic is that I've built a startup from scratch to over 400 people and watched what they do, and it won't be long till that isn't worth much.
jprokay13
I’ve been floating around a similar set of ideas and it’s been very fun (if not all that useful yet) to build. Did you try taking it one step further, where a “recruiter” has to hire the engineers after a screening process? I wonder if this could get you even better AI engineers.
mucha
Cool. What goods/services does your business provide to customers?
neom
Goods and services are a byproduct of business; business is primarily concerned with systems and processes that facilitate value exchange. So my tool can work with a user to build a business, not a product or a service. If you bake cupcakes, my tool can get you 100 people at your door; it cannot open the door or provide the cakes.
iammrpayments
Sounds really interesting, but I have no idea what the purpose of having 175 “employees” here is. Maybe it is a smart way to sell the idea that you’re going to replace 175 people if you buy the product? You could just buy ChatGPT instead, I guess, but a chatbot doesn’t sound as cool as 175 employees.
neom
I would love to know how to do it another way if you have any ideas, I'm sadly not experienced or intelligent enough to think of another way to do it.
immibis
Does this experiment do anything useful or does it just soak up investor money? Not that there's anything wrong with the latter.
neom
The only investor is me. I built it on my own over a weekend. I just wanted to confirm it can be done and therefore will exist, that is all. Personally, I decided not to pursue it because I am old and lazy and don't want to compete against a16z- and Sequoia-funded, Adderall-filled teenagers.
dwohnitmok
> If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious.
And this is precisely how really bad things could happen:
https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality...
londons_explore
> forking itself into as many sub-agents as are needed to fulfill the task
The forking is free. Running the sub-agents is a linear cost, but the expensive bit is joining the agents' responses back together again.
If a task has 6 subtasks and an agent is spawned for each, at some point some 'joiner' agent needs to parse and summarize the findings of the sub-agents and feed them back to the parent. That step necessarily involves information loss, and it uses computation that a single linear agent design would not.
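A minimal sketch of that fan-out/join shape (nothing to do with the article's actual harness; call_llm is a hypothetical placeholder for whatever model API you'd use):

    import asyncio

    async def call_llm(prompt: str) -> str:
        # Placeholder for a real model API call (an HTTP request in practice).
        await asyncio.sleep(0)
        return f"[model response to: {prompt[:40]}...]"

    async def run_subagent(subtask: str) -> str:
        # Each sub-agent handles one subtask independently; spawning more is free.
        return await call_llm(f"Complete this subtask and report findings:\n{subtask}")

    async def fan_out_and_join(parent_goal: str, subtasks: list[str]) -> str:
        # Fan-out: all sub-agents run concurrently, so wall-clock time stays flat
        # while dollar cost grows linearly with the number of subtasks.
        findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
        # Join: the expensive, lossy step is compressing the findings back into
        # one context for the parent agent.
        return await call_llm(
            f"Goal: {parent_goal}\n"
            "Summarize these sub-agent reports for the parent:\n"
            + "\n\n".join(findings)
        )

    print(asyncio.run(fan_out_and_join("ship the feature", ["design", "code", "test"])))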
neom
I designed something for a business and found I needed 4 major sub-systems (like a real business) - insight/data, cognition, meta cognition and execution, and if you don't define all 4, the system is junk.
motorest
> I designed something for a business and found I needed 4 major sub-systems (like a real business) - insight/data, cognition, meta cognition and execution, and if you don't define all 4, the system is junk.
Might it be just another realization of Conway's law?
https://en.wikipedia.org/wiki/Conway%27s_law
Might it be possible that the only reason you're assuming a system is junk is just that it doesn't resemble the systems you know and expect? There are so many ways to skin a cat, and certainly no business process represents the optimal process.
yusina
> You effectively have an infinite number of agents
You don't.
Sincerely, Your Electricity Bill
TimPC
The challenge with fan-out is constructing a linear conversation that makes sense and captures previous history. In any context where the LLM needs that information, linear loops often perform better than trying to splice together conversations from multiple parallel processes.
kposehn
This is similar to something we've been doing for a while. Instead of individual agents we are creating many iterations and sub-iterations of spawned agents that are largely autonomous. A lot of the human-centric paradigms just don't really apply to LLMs/AI but people are used to approaching them that way.
viraptor
> They set up a limited number of agents to run in parallel (often just one),
Most of what people use agents for daily can often be one-shotted though and even collating/rating 10 results would be costly.
If I had a harness for evaluating the results and VC level money, I'd be throwing an army at well defined experimental tasks as well.
ekelsen
"FP32 is less common in modern ML workloads and often less optimized on recent hardware compared to FP16 or BF16, which may partly explain why it’s easier to achieve performance gains over PyTorch with FP32 kernels."
People haven't spent time optimizing the fp32 versions of these kernels in years. This will be much more interesting if they can improve the kernels where developer effort has gone and that are actually used.
adrian_b
I believe that these good results are explained at least in part by the fact that NVIDIA does not provide detailed enough documentation for their GPUs.
For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program, it is much less likely that applying ML/AI can be successful, except as a substitute for searching already known solutions.
On the other hand, for less documented microarchitectures, like those of the NVIDIA GPUs, finding an optimal program may be impossible other than by doing a random search guided by examples of previously optimized programs, and possibly doing some reverse-engineering work to determine the real behavior of the GPU in some circumstances.
Improving over something like this is likely to be feasible for ML/AI, where training over known good programs may be able to extract some of the undocumented behavior that may be non-obvious for humans reading those examples.
mjlee
> For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program
We don't even know the optimal algorithms! AlphaEvolve recently found "an algorithm to multiply 4x4 complex-valued matrices using 48 scalar multiplications, improving upon Strassen’s 1969 algorithm that was previously known as the best in this setting." - https://www.nature.com/articles/s41586-022-05172-4
hmry
For those who don't want to read the article: The previous best was 49 scalar multiplications.
david-gpu
> For a processor with well-documented microarchitecture, for which a programmer or a compiler can deterministically write an optimal program
You severely underestimate the landscape of possible implementations for these kernels. There are many ways of performing a matrix multiplication and predicting which one will perform best without running them all is nontrivial, even with perfect knowledge of the underlying system.
This is just a completely incorrect take, speaking as a former insider.
pca006132
While it is decidable, people typically never produce optimal programs even for the hot path. It is just intractable and too slow to do right now.
For register allocation and instruction selection, there is hope because they are fixed-parameter tractable (FPT) and there are algorithms to do them optimally in polynomial time, albeit with a constant factor large enough to make them impractical for compilers as of today. Instruction scheduling is just too hard: if you read the literature on scheduling algorithms, it is NP-hard even for apparently simple instances, e.g., 2 parallel identical machines with no preemption and bounded completion time (https://www2.informatik.uni-osnabrueck.de/knust/class/), while an actual microarchitecture is much more complicated than this...
Needless to say, these are already the simpler problems. The longer the program, or the more profiling data you can optimize for, the more tricks you can throw at it, and most of them are NP-hard to apply optimally.
Being NP-hard doesn't imply that you can't obtain the optimal result, but the compilers I know of do not implement these approaches, because most users are not willing to wait days for a compilation to complete. Ideally, one would build something that can run on clusters of CPUs or GPUs, and people with those clusters would typically be willing to use it, because they want to optimize the programs they later run on those clusters. However, to my knowledge, no one is working on this at the moment.
fulafel
Even with full information, we generally (or practically) aren't able to write optimal programs.
almostgotcaught
[flagged]
pca006132
While I think the OP did not mean that the compilation process is nondeterministic, I wouldn't be surprised if it actually were. A lot of algorithms and data structures rely on nondeterminism for performance or security (by default). It is too easy to introduce nondeterminism accidentally, and it is tempting to use it to speed up algorithms. Also, if it relies on floating point, results on different machines and environments may differ (depending on the libm and hardware implementation), which is, in some sense, nondeterministic.
throwaway81523
The running time of a CUDA kernel is apparently impossible to determine except by experiment and measurement, and might be nondeterministic. By contrast for a more typical CPU, there's a compiler whose assembly output you can examine, and there's a processor manual that gives the cycle timing of each instruction. So you can compute the running time at least of inner loops that stay in cache, and that sort of thing.
speerer
The point was about being able to write an optimal program with certainty, not about just getting the thing to operate.
suddenlybananas
I wonder if it's using known improvements from the fp16/bf16 kernels that are transferable to fp32?
moralestapia
>People haven't spent time optimizing the fp32 versions of these kernels in years.
Wow, so, you're basically saying the AI created new algos in a domain with no pre-existing solutions? Awesome!
Aurornis
No one said the AI created new algorithms nor that there weren’t pre-existing solutions.
The implication was that the FP32 versions of these kernels have lagged behind the more popular versions. There was opportunity to translate the advancements from other kernels into these. Someone would need to look closely to see exactly what was done, but it’s premature to suggest anything like “new algos” or “no pre-existing solutions”
This is a great use case for LLMs, though. I often do something similar where I make improvements to something I use most frequently and ask an LLM to translate that pattern to other similar parts of the code.
moralestapia
>The implication was that the FP32 versions of these kernels have lagged behind the more popular versions.
Help me understand this 'cause I'm a bit slow these days ...
Does that mean optimized FP32 versions of these kernels were already there or not?
vlovich123
The solution not existing in PyTorch does not mean the solution doesn’t exist elsewhere on the internet. Remember - PyTorch is largely maintained by employees of companies that have their own priorities for the SW and those priorities may not include hyper optimizing fp32 kernels.
That being said, it is cool if AI is enabling lower cost adoption of better more optimized kernels with less effort.
imtringued
Read the article before spouting lies. Actually never mind that.
Read the damn comment you're responding to. There have been human written kernels for both fp16 and fp32 for a long time.
Here is the corrected version of your comment:
"Wow, so, you're basically saying the AI created the same but faster algos in a well known domain with established pre-existing solutions, whose overall impact on the runtime of practical workloads is insignificant? Awesome!"
thorum
My takeaway - from this article, from Google’s AlphaEvolve [1], and the recent announcement about o3 finding a zero day in the Linux kernel [2] - is that Gemini Pro 2.5 and o3 in particular have reached a new level of capability where these ideas that were tried unsuccessfully with other models, suddenly just work.
[1] https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...
[2] https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
therealpygon
In my opinion, I wouldn’t say so much that they are suddenly working. Rather, we’ve reached a point where they can iterate and test significantly faster than humans are capable of doing, and they can call on significantly more immediately available information that they can make sense of; as a result, the combination of information, advancement, and intelligently applied brute force seems to be having success in certain applications.
thorum
Good points. I suspect that o3 is able to reason more deeply about different paths through a codebase than earlier models, though, which might make it better at this kind of work in particular.
westoncb
I was blown away by some debugging results I got from o3 early on and have been using it heavily since. The early results that caught my attention were from a couple cases where it tracked down some problematic cause through several indirect layers of effects in a way where you'd typically be tediously tracing step-by-step through a debugger. I think whatever's behind this capability has some overlap with really solid work it'll do in abstract system design, particularly in having it think through distant implications of design choices.
MangoToupe
In the context of LLMs, what do you mean by "reason"? What does reasoning look like in LLMs and how do you recognize it, and more importantly, how do you invoke it? I haven't had much success in getting LLMs to solve, well, basically any problem that involves logic.
Chain of thought at least introduces some skepticism, but that's not exactly reasoning. It makes me wonder what people refer to when they say "reason".
therealpygon
Very likely. Larger context is significantly beneficial to LLMs when they can maintain attention, which was part of my point. Imagine being able to hold the word-for-word text of your required reading book while you are taking a test, whereas older models from two years ago were more like a couple of chapters' worth of text.
geraneum
It’s true that there are similarities between what you mentioned and what’s happening in this case. From the article:
> The result is a test-time loop that looks less like “chat with a compiler” in the case of sequential revision, and more like structured exploratory search, guided by explicit optimization hypotheses and aggressively parallel evaluation.
My conclusion would be that we’ve now learned to apply LLMs’ capabilities to shrink the solution space where we have a clear evaluation function, as well as existing solutions to problems that follow similar patterns. This applies in this case as well.
IMO, It’s not about model X gaining on other models or model Y being able to reason about the solutions, etc. in a way that other models couldn’t.
MangoToupe
Interesting. Do you have stronger evidence to support your claim? A sample size of one is pretty unconvincing.
jiggawatts
Gemini Pro 2.5 is the first AI that I can productively use for anything other than human language translation, but it's just barely crossed that threshold. Sometimes I get success hit rates below 20%.
When 3.0 comes out, that... that's going to start getting a little scary.
manmal
o3 is in my experience often even better, but too slow and too rate limited to use it all the time.
jacob019
What domain?
jiggawatts
SRE / DevOps / coding mostly in the Azure and .NET ecosystems.
The problems I have to solve tend to be the horrible ones that nobody has answers to, anywhere on the Internet, so unsurprisingly the AIs aren't good at it either.
The trick has been to use the AIs for what they are good at, which used to be "nothing" for me at least, but now I can use them productively for certain "spot" tasks.
Random examples:
- Cross-language and cross-platform benchmarking of a bunch of different database clients to see how they stack up. I gave the AI a working example in one language and got it to whip up a series of equivalents with other DB drivers and languages. Sure, it's trivial, but it's way faster than doing it myself!
- Crash dump analysis using WinDbg. I read somewhere that "vibe debugging" of kernel dumps totally works, so when I had an actual crash I gave it a go for laughs. With AI help I managed to extract the name of the specific file that had NTFS corruption and was crashing the server. Deleted the file, restored it from backups, and the server was good to go again!
- If you ever watch the top mechanical engineers on YouTube, they all make their own tools instead of just buying them. Jigs, extenders, unusual sizes, etc... IT work is the same. As a recent example, I got Gemini to make me a code-AST rewriter for a specific issue I wanted to clean up in bulk across a huge code base. Using the Roslyn compiler SDK is a bit fiddly, but it spat out a working tool for me in under an hour. (This is not something you can solve with a script full of regex, it needed a proper parser to handle commented-out blocks and the like.)
zozbot234
Wait, what are you saying? These have nothing to do with the Linux kernel whatsoever, they are "kernels" in the GPU programming sense. Did you just hallucinate this whole comment or what?
thorum
Sorry, I added links! Just a week ago someone built a system that used o3 to find novel zero days in the Linux kernel’s SMB implementation.
stefan_
There are zero-days in obscure parts of the kernel nobody uses every other day. (It also, of course, found 100 other things that were not zero-days or vulnerabilities yet professed they were, which is why this trash, even on Gemini 9000 Pro, keeps spamming security mails.)
None4U
There was a post on HN a bit ago from someone who used o3 to find a vulnerability in the Linux kernel's SMB server, which this person is just saying should've been tried earlier and probably recently became possible
ekelsen
"the reference code is in the default FP32, and given a tolerance threshold (1e-02)"
that's a huge tolerance and allows them to use fp16 operations to replace the "fp32" kernel.
unignorant
Yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. I'm not sure exactly how challenging a port like that is, but intuitively it seems a bit less impressive.
maybe this intuition is wrong but would be great for the work to address it explicitly if so!
AlotOfReading
Only seems to have done that in a couple of places, like the MatMul. The softmax kernel (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) seems to be entirely bog-standard, and the layernorm kernels are only slightly more interesting.
beyonddream
Why do you think it is a huge tolerance? (Just curious, since it is not clear to me whether it will lead to too much of a reduction in numerical accuracy compared to the speedup.)
creato
The point is, this amount of error is huge for fp32 but may be expected for fp16. But then why compare to fp32 performance baselines? An algorithm that gives you the accuracy of fp16 should be compared to an fp16 baseline, and against that baseline this is probably not a speedup at all; it's likely much slower.
constantcrying
This means the results are useless. Did they even check the relative error at all?
Replacing float32 operations with float16 is also pointless. There is nothing to be gained by doing this, as it removes the actual accuracy advantage of float32, which would be the single most important reason to use that version of the algorithm.
threeducks
I ran their matrix multiplication code from GitHub (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) and got a mean squared error of approximately 0.056 for two 4096x4096 matrices containing random values between 0 and 1.
I think this error is large enough that referring to it as FP32 is misleading.
Also, the performance gains do not translate to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks the optimized hardware for half precision.
But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.
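For comparison, here is roughly what the error of a plain fp16 matmul looks like against the fp32 reference (a sketch, not the repo's test harness; it assumes a CUDA device is available):

    import torch

    torch.manual_seed(0)
    a = torch.rand(4096, 4096, device="cuda")
    b = torch.rand(4096, 4096, device="cuda")

    ref = a @ b                           # fp32 reference, as in the benchmark
    fp16 = (a.half() @ b.half()).float()  # naive fp16 baseline

    mse = torch.mean((ref - fp16) ** 2).item()
    max_err = torch.max(torch.abs(ref - fp16)).item()
    print(f"fp16 vs fp32 reference: mse={mse:.4g}, max abs err={max_err:.4g}")

If a generated "fp32" kernel lands in the same error ballpark as this baseline, then arguably its speed should be compared against PyTorch's fp16/tensor-core path rather than the fp32 one, which is the point made above.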
userbinator
Am I the only one who was enticed into this article by thinking they had AI generate an OS kernel?
dgfitz
Nope, I was as well.
vessenes
By far the most interesting part (after the 400% speed up in some cases) is the methodology: rather than hill climb on operations, they forced a language reasoning step between iterations to encourage diversity of search. This seems to have worked. Very very interesting.
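As I read the post, the loop is roughly this shape (toy stand-ins for the LLM calls and the GPU benchmarking harness, not their code):

    import random

    # Hypothetical stand-ins: in the real setup these are LLM calls (o3 / Gemini
    # 2.5 Pro) and an actual kernel benchmark; here they just return fake numbers.
    def propose_optimization_idea(best_ms: float) -> str:
        return f"a new optimization hypothesis, given best runtime {best_ms:.3f} ms"

    def write_and_benchmark_kernel(idea: str) -> float:
        return random.uniform(0.5, 2.0)  # pretend runtime in ms

    def search(rounds: int = 10, branch: int = 8) -> float:
        best_ms = 2.0  # reference (e.g. PyTorch) runtime
        for _ in range(rounds):
            # Language reasoning step first: branch into several distinct
            # hypotheses instead of hill-climbing on one kernel's code.
            ideas = [propose_optimization_idea(best_ms) for _ in range(branch)]
            # Each hypothesis becomes a candidate kernel; all are benchmarked
            # (in parallel, in the real system) and only the fastest survives.
            times = [write_and_benchmark_kernel(i) for i in ideas]
            best_ms = min(best_ms, min(times))
        return best_ms

    print(f"best simulated runtime: {search():.3f} ms")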
lucidrains
Oh wow, I was so busy looking for use of islands or MAP-Elites that I missed this... I thought it was the blandest memetic evolution possible.
vessenes
Just anecdotally I feel like hill climbing on operations is just so slow; I’m not saying it doesn’t work, but it always feels one step away from brute force search. I really like the idea of just throwing stuff at the LLM and giving it access to old strong variants in context.
FL33TW00D
Tried a replication here. The LayerNorm kernel is not numerically stable, so it cannot be counted as valid. They only test with zero mean and unit std, so the catastrophic cancellation doesn't show up until afterward.
EDIT: looks like they've since generated another one that is numerically stable! great work
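For anyone wondering what the instability looks like: if a kernel computes the variance with the one-pass E[x^2] - E[x]^2 formulation (a common choice in fused kernels; I haven't checked which form the generated kernel uses), inputs with a large mean make the two terms nearly cancel in fp32, while zero-mean, unit-std test data hides the problem completely. A minimal demonstration:

    import torch

    torch.manual_seed(0)
    # Unit variance, but a large mean: exactly what zero-mean tests never exercise.
    x = torch.randn(4096, dtype=torch.float32) + 10000.0

    # One-pass formula E[x^2] - E[x]^2: both terms are ~1e8 and nearly cancel,
    # so fp32 rounding error swamps the true variance (~1); it can even go negative.
    var_one_pass = (x * x).mean() - x.mean() ** 2

    # Two-pass formula: subtract the mean first, then square. Stays close to 1.
    var_two_pass = ((x - x.mean()) ** 2).mean()

    print(var_one_pass.item(), var_two_pass.item())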
poltomo
Beating PyTorch and TensorFlow kernels has been easy to do with ML compilers since ~2018. You typically train and evaluate your model in one of these frameworks, then hand off the computation graph to a compiler like Apache TVM or your hardware vendor's proprietary one. They should test their kernels against those kernels.
ML-guided heuristic search over compute schedules is as old as 2013 (Halide, for image processing).
bgwalter
> They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.
The PyTorch code base is NOT written by performance experts in any way. This is the wrong baseline. Nothing about that code base is clean or hand-optimized.
The "AI" generation methodology seems to give many instructions and even descends into instruction trees, manually throwing away results, etc. So it requires, as usual, extreme guidance.
Workaccount2
Very fascinating result, and it seems they wrote this blog post out of pure excitement to share their findings, and maybe to have someone throw cold water on it before publishing, ha.
Who knows if this is the actual fabled path of "self improvement", but results like this are what we expect to find on such a path.
suddenlybananas
> Who knows if this is the actual fabled path of "self improvement"
Seems doubtful as this works only on an extremely well-defined evaluation function.
observationist
Each time you define another task well enough for the system to work, you generalize the system just a little bit - repeat enough times and you can start to expand, develop taxonomies of functions, precisely define function spaces and metrics for improvement. This might not be a bootstrap for recursive self improvement generally, but it could definitely inform the theory or design of a system that does bootstrap rsi.
suddenlybananas
That's an entirely different idea that may or may not work. This is not evidence of that.
EMIRELADERO
That may be true, but this is the first example I've seen where the concept is successfully implemented in a noticeable way.
It's just like image generation: the first iteration is the worst it will ever be.
Mathnerd314
> we didn't mean to publish yet
I was thinking this was about leaking the kernels or something, but no, they are "publishing" them in the sense of putting out the blog post - they just mean they are skipping the peer review process and not doing a formal paper.
yahoozoo
Very cool. They used o3 and Gemini 2.5 Pro but unfortunately they don’t mention which one produced the better kernels.
I think how the authors of this post think about "AI agents" is really interesting.
Most people think of agents like they think of human employees. They set up a limited number of agents to run in parallel (often just one), with each agent running in a loop and doing one task at a time. They're still in a world where you have a fixed (on the timescale of hours or days) number of employees, each employee can only do one thing at a time, and transferring tasks between employees is slow and costly.
LLMs don't really work like that. You effectively have an infinite number of agents that you can conjure out of thin air at any time. There's no cost advantage to performing LLM requests in series rather than in parallel.
If you realize this, the pattern of each agent fanning out and forking itself into as many sub-agents as are needed to fulfill the task becomes obvious. This is exactly what the authors have done.
I think a better way to think of agents is as "tasks" or "jobs", like those you might find in Celery or Sidekiq, and to apply the learnings from those.
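For what it's worth, the "agents as jobs" framing is a small amount of code in practice. A hedged sketch with Celery (call_llm and the broker URL are placeholders, not anything from the article):

    from celery import Celery, chord

    app = Celery("agents", broker="redis://localhost:6379/0")  # assumed broker

    def call_llm(prompt: str) -> str:
        # Placeholder for a real model API call.
        return f"[model response to: {prompt[:40]}...]"

    @app.task
    def run_subagent(subtask: str) -> str:
        # Each "agent" is just a job; spawn as many as the task needs.
        return call_llm(f"Do this subtask and report back:\n{subtask}")

    @app.task
    def join_results(findings, parent_goal: str) -> str:
        # Chord callback: Celery passes the list of sub-results as the first
        # argument, and this step compresses them back into one context.
        return call_llm(f"Goal: {parent_goal}\nCombine these reports:\n" + "\n\n".join(findings))

    def fan_out(parent_goal: str, subtasks):
        # chord = run the group in parallel, then feed all results to the callback.
        return chord(run_subagent.s(t) for t in subtasks)(join_results.s(parent_goal))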