
LLM Year in Review

27 comments · December 19, 2025

thoughtpeddler

I appreciate Andrej’s optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what’s happening with open-source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as “running on your computer”, but no, it’s just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.

ramoz

What he meant was that agents will probably not be web abstractions that run in deployed services (langchain, crew); "agents" here meaning the harnesses specifically, the software wrappers that call the LLM API.

It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
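For concreteness, a minimal sketch of what such a harness loop might look like, with a hypothetical call_llm standing in for the remote API (nothing here is Claude Code's actual implementation):

  import subprocess

  def call_llm(messages):
      """Hypothetical stand-in for the cloud LLM API (the brain in a vat).
      Returns {'type': 'bash', 'command': ...} or {'type': 'done', 'text': ...}."""
      raise NotImplementedError

  def agent_loop(task):
      # The harness (the mech suit) runs locally and owns the conversation state.
      messages = [{"role": "user", "content": task}]
      while True:
          action = call_llm(messages)
          if action["type"] == "done":
              return action["text"]
          # The model asked to run a shell command on *this* machine.
          result = subprocess.run(action["command"], shell=True,
                                  capture_output=True, text=True)
          # Feed the output back so the model can decide its next step.
          messages.append({"role": "tool", "content": result.stdout + result.stderr})

The inference is remote, but the side effects (Bash, the file system) are all local, which is the whole point.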

D-Machine

The section on Claude Code is written very ambiguously and confusingly. I think he meant that the agent runs on your computer (not inference), and that this is in contrast to agents running "on a website" or in the cloud:

> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.

However, if so, this is definitely a distinction that needs to be made far more clearly.

realcul

Well, Microsoft had their "localhost" AI before CC, but that was a ghost without a clear purpose or skill.

karpathy

The CC point is more about the data, environment, and general configuration context, not about compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.

CamperBob2

Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.

magicalhippo

From what I can gather, llama.cpp supports Anthropic's message format now[1], so you can use it with Claude Code[2] (rough sketch below).

[1]: https://github.com/ggml-org/llama.cpp/pull/17570

[2]: https://news.ycombinator.com/item?id=44654145
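Assuming the PR works as described, pointing Claude Code at a local llama.cpp server might look roughly like this (the llama-server flags are standard; ANTHROPIC_BASE_URL is a documented Claude Code override, but this exact combination is untested here):

  llama-server -m model.gguf --port 8080            # serve a local GGUF model
  ANTHROPIC_BASE_URL=http://localhost:8080 claude   # point Claude Code at it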

simonw

One of the most interesting coding agents to run locally is actually OpenAI Codex, since it has the ability to run against their gpt-oss models hosted by Ollama.

  codex --oss -m gpt-oss:20b
Or 120b if you can fit the larger model.
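Assuming a standard Ollama install, the larger model would be pulled and run the same way (model tags are from the Ollama registry):

  ollama pull gpt-oss:120b
  codex --oss -m gpt-oss:120b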

AlexCoventry

What do you find interesting about it, and how does it compare to commercial offerings?

mips_avatar

I would love Andrej's take on the fast models we got this year. Gemini 3 Flash and Grok 4 Fast have no business being as good + cheap + fast as they are. For Andrej's prediction about LLMs communicating with us via a visual interface, we're going to need fast models, but I feel like AI twitter/HN has mostly ignored these.

victorbuilds

Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.

vlod

Any tips to spot this? I want to avoid arguing with an X bot.

shtack

Really easy: don't argue on the internet. The approach has many benefits.

delichon

> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.

The idea of jaggedicity seems useful for advancing epistemology. If we could identify the domains that have useful data we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your own blind spots. But now we have an alien intelligence with an outside perspective. While we make AI less jagged, it might return the favor.

If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.

jkubicek

> In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.

If you think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action, for every user, on every device. What does command-W do in this app? It's literally impossible to predict; try it and see!

starchild3001

The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing; we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.

I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.

TheAceOfHearts

I think one thing missing from this post is an attempt to answer: what are the highest-priority AI-related problems the industry should be tackling?

Karpathy hints at one major capability unlock being UI generation, so that instead of interacting through text, the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain. Who are the key figures innovating in this space so far?

In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.

Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.

mvkel

> In this world view, nano banana is a first early hint of what that might look like.

What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?

simonw

What's interesting about Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model when you consider that they accept images as input and return images as output.

Give it an image of a maze, it can output that same image with the maze completed (maybe).

There's a fantastic article about that for image-to-video models here: https://video-zero-shot.github.io/

> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
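For concreteness, a minimal sketch of that image-in/image-out loop with the google-genai Python SDK (the model id and response shape follow current docs but may drift):

  from google import genai
  from PIL import Image

  client = genai.Client()  # reads GEMINI_API_KEY from the environment

  maze = Image.open("maze.png")
  response = client.models.generate_content(
      model="gemini-2.5-flash-image",  # "Nano Banana"; exact model id may change
      contents=["Solve this maze and draw the solution path on it.", maze],
  )

  # Responses can interleave text and image parts; save any image returned.
  for part in response.candidates[0].content.parts:
      if part.inline_data:  # image bytes come back as inline data
          open("maze_solved.png", "wb").write(part.inline_data.data)

Same signature as a chat model, except the "answer" is pixels, which is what makes the world-model framing plausible.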

dragonwriter

I think he is referring to capability, not architecture, and saying that NB is at the point where it is suggestive of the near-future capability of using GenAI models to create their own UI as needed.

NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.

bgwalter

Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% have to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.

zingar

I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck, even I will probably never use the latest log retrieval tool, which exists purely for Claude Code to invoke. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.

simonw

Do you mean vibe coding as-in producing unreviewed code with LLMs and prompting at it until it appears to work, or vibe coding as a catch-all for any time someone uses AI-assistance to help them write code?

ausbah

tl;dr: seems like LLMs are maturing on the product side and for day-to-day usage