Andrej Karpathy: Software in the era of AI [video]
71 comments
·June 19, 2025sothatsit
jwblackwell
Yeah I am currently enjoying giving the LLM relatively small chunks of code to write and then asking it to write accompanying tests. While I focus on testing the product myself. I then don't even bother to read the code it's written most of the time
geeunits
I would venture that 'tightening the feedback loop' isn't necessarily 'increasing the number of back and forth prompts'- and what you're saying you want is ultimately his argument. i.e. if integral enough it can almost guess what you're going to say next...
gchamonlive
I think it's interesting to juxtapose traditional coding, neural network weights and prompts because in many areas -- like the example of the self driving module having code being replaced by neural networks tuned to the target dataset representing the domain -- this will be quite useful.
However I think it's important to make it clear that given the hardware constraints of many environments the applicability of what's being called software 2.0 and 3.0 will be severely limited.
So instead of being replacements, these paradigms are more like extra tools in the tool belt. Code and prompts will live side by side, being used when convenient, but none a panacea.
karpathy
I kind of say it in words (agreeing with you) but I agree the versioning is a bit confusing analogy because it usually additionally implies some kind of improvement. When I’m just trying to distinguish them as very different software categories.
miki123211
What do you think about structured outputs / JSON mode / constrained decoding / whatever you wish to call it?
To me, it's a criminally underused tool. While "raw" LLMs are cool, they're annoying to use as anything but chatbots, as their output is unpredictable and basically impossible to parse programmatically.
Structured outputs solve that problem neatly. In a way, they're "neural networks without the training". They can be used to solve similar problems as traditional neural networks, things like image classification or extracting information from messy text, but all they require is a Zod or Pydantic type definition and a prompt. No renting GPUs, labeling data and tuning hyperparameters necessary.
They often also improve LLM performance significantly. Imagine you're trying to extract calories per 100g of product, but some product give you calories per serving and a serving size, calories per pound etc. The naive way to do this is a prompt like "give me calories per 100g", but that forces the LLM to do arithmetic, and LLMs are bad at arithmetic. With structured outputs, you just give it the fifteen different formats that you expect to see as alternatives, and use some simple Python to turn them all into calories per 100g on the backend side.
dmitrijbelikov
I think that Andrej presents “Software 3.0” as a revolution, but in essence it is a natural evolution of abstractions.
Abstractions don't eliminate the need to understand the underlying layers - they just hide them until something goes wrong.
Software 3.0 is a step forward in convenience. But it is not a replacement for developers with a foundation, but a tool for acceleration, amplification and scaling.
If you know what is under the hood — you are irreplaceable. If you do not know — you become dependent on a tool that you do not always understand.
nilirl
Where do these analogies break down?
1. Similar cost structure to electricity, but non-essential utility (currently)?
2. Like an operating system, but with non-determinism?
3. Like programming, but ...?
Where does the programming analogy break down?
rudedogg
> programming
The programming analogy is convenient but off. The joke has always been “the computer only does exactly what you tell it to do!” regarding logic bugs. Prompts and LLMs most certainly do not work like that.
I loved the parallels with modern LLMs and time sharing he presented though.
null
wjohn
The comparison of our current methods of interacting with LLMs (back and forth text) to old-school terminals is pretty interesting. I think there's still a lot work to be done to optimize how we interact with these models, especially for non-dev consumers.
practal
Great talk, thanks for putting it online so quickly. I liked the idea of making the generation / verification loop go brrr, and one way to do this is to make verification not just a human task, but a machine task, where possible.
Yes, I am talking about formal verification, of course!
That also goes nicely together with "keeping the AI on a tight leash". It seems to clash though with "English is the new programming language". So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English? I think that is possible, if you have a formal language and logic that is flexible enough, and close enough to informal English.
Yes, I am talking about abstraction logic [1], of course :-)
So the goal would be to have English (German, ...) as the ONLY programming language, invisibly backed underneath by abstraction logic.
AdieuToLogic
> So the question is, can you hide the formal stuff under the hood, just like you can hide a calculator tool for arithmetic? Use informal English on the surface, while some of it is interpreted as a formal expression, put to work, and then reflected back in English?
The problem with trying to make "English -> formal language -> (anything else)" work is that informality is, by definition, not a formal specification and therefore subject to ambiguity. The inverse is not nearly as difficult to support.
Much like how a property in an API initially defined as being optional cannot be made mandatory without potentially breaking clients, whereas making a mandatory property optional can be backward compatible. IOW, the cardinality of "0 .. 1" is a strict superset of "1".
dang
This was my favorite talk at AISUS because it was so full of concrete insights I hadn't heard before and (even better) practical points about what to build now, in the immediate future. (To mention just one example: the "autonomy slider".)
If it were up to me, which it very much is not, I would try to optimize the next AISUS for more of this. I felt like I was getting smarter as the talk went on.
anythingworks
loved the analogies! Karpathy is consistently one of the clearest thinkers out there.
interesting that Waymo could do uninterrupted trips back in 2013, wonder what took them so long to expand? regulation? tailend of driving optimization issues?
noticed one of the slides had a cross over 'AGI 2027'... ai-2027.com :)
AlotOfReading
You don't "solve" autonomous driving as such. There's a long, slow grind of gradually improving things until failures become rare enough.
petesergeant
I wonder at what point all the self-driving code becomes replaceable with a multimodal generalist model with the prompt “drive safely”
anon7000
Very advanced machine learning models are used in current self driving cars. It all depends what the model is trying to accomplish. I have a hard time seeing a generalist prompt-based generative model ever beating a model specifically designed to drive cars. The models are just designed for different, specific purposes
AlotOfReading
One of the issues with deploying models like that is the lack of clear, widely accepted ways to validate comprehensive safety and absence of unreasonable risk. If that can be solved, or regulators start accepting answers like "our software doesn't speed in over 95% of situations", then they'll become more common.
null
ActorNightly
> Karpathy is consistently one of the clearest thinkers out there.
Eh, he ran Teslas self driving division and put them into a direction that is never going to fully work.
What they should have done is a) trained a neural net to represent sequence of frames into a physical environment, and b)leveraged Mu Zero, so that self driving system basically builds out parallel simulations into the future, and does a search on the best course of action to take.
Because thats pretty much what makes humans great drivers. We don't need to know what a cone is - we internally compute that something that is an object on the road that we are driving towards is going to result in a negative outcome when we collide with it.
AlotOfReading
Aren't continuous, stochastic, partial knowledge environments where you need long horizon planning with strict deadlines and limited compute exactly the sort of environments muzero variants struggle with? Because that's driving.
It's also worth mentioning that humans intentionally (and safely) drive into "solid" objects all the time. Bags, steam, shadows, small animals, etc. We also break rules (e.g. drive on the wrong side of the road), and anticipate things we can't even see based on a theory of mind of other agents. Human driving is extremely sophisticated, not reducible to rules that are easily expressed in "simple" language.
visarga
> We don't need to know what a cone is
The counter argument is that you can't zoom in and fix a specific bug in this mode of operation. Everything is mashed together in the same neural net process. They needed to ensure safety, so testing was crucial. It is harder to test an end-to-end system than its individual parts.
tayo42
Is that the approach that waymo uses?
mikewarot
A few days ago, I was introduced to the idea that when you're vibe coding, you're consulting a "genie", much like in the fables, you almost never get what you asked for, but if your wishes are small, you might just get what you want.
fudged71
“You are an expert 10x software developer. Make me a billion dollar app.” Yeah this checks out
anythingworks
that's a really good analogy! It feels like wicked joke that llms behave in such a way that they're both intelligent and stupid at the same time
hgl
It’s fascinating to think about what true GUI for LLM could be like.
It immediately makes me think a LLM that can generate a customized GUI for the topic at hand where you can interact with in a non-linear way.
cjcenizal
My friend Eric Pelz started a company called Malleable to do this very thing: https://www.linkedin.com/posts/epelz_every-piece-of-software...
karpathy
Fun demo of an early idea was posted by Oriol just yesterday :)
nbbaier
I love this concept and would love to know where to look for people working on this type of thing!
nico
Thank you YC for posting this before the talk became deprecated[1]
sandslash
We couldn't let that happen!
bedit
I love the "people spirits" analogy. For casual tasks like vibecoding or boiling an egg, LLM errors aren't a big deal. But for critical work, we need rigorous checks—just like we do with human reasoning. That's the core of empirical science: we expect fallibility, so we verify. A great example is how early migration theories based on pottery were revised with better data like ancient DNA (see David Reich). Letting LLMs judge each other without solid external checks misses the point—leaderboard-style human rankings are often just as flawed.
I find Karpathy's focus on tightening the feedback loop between LLMs and humans interesting, because I've found I am the happiest when I extend the loop instead.
When I have tried to "pair program" with an LLM, I have found it incredibly tedious, and not that useful. The insights it gives me are not that great if I'm optimising for response speed, and it just frustrates me rather than letting me go faster. Worse, often my brain just turns off while waiting for the LLM to respond.
OTOH, when I work in a more async fashion, it feels freeing to just pass a problem to the AI. Then, I can stop thinking about it and work on something else. Later, I can come back to find the AI results, and I can proceed to adjust the prompt and re-generate, to slightly modify what the LLM produced, or sometimes to just accept its changes verbatim. I really like this process.