
Gemini 2.5 Pro reasons about task feasibility

simonw

I've been having some very impressive results from Gemini 2.5 Pro for complex coding tasks in the few hours I've been experimenting with it so far.

I added a section about that to my review last night describing two of the larger examples: https://simonwillison.net/2025/Mar/25/gemini/#update-it-s-ve...

(It's always risky saying anything like this on a forum like Hacker News because it's inevitable someone will find a way to argue that the examples are trivial/unrealistic/show I don't know what I'm doing/clearly just regurgitated from StackOverflow/etc, but I'll take the risk anyway.)

mattlondon

Agree - seems like the rumours and mutterings were true: This model is very, very good.

Quite a few happy people at Google today I bet.

Which leads me to wonder: it's not like the Gemini 2 models were terrible either - they were consistently in the top 5 if not the top 3, and now they've smashed past everything with a +40 Elo jump.

Are we starting to see Google apply their compute/resources/data/money to assert dominance? What next from the recently-pretty-quiet OpenAI? Are we getting to the stage where well-funded startups like Anthropic et al simply cannot compete with "google-scale" for general-purpose models and end up as coding-only niche models? Sure you can throw GPUs at the problem and burn more investor cash, but are Google starting to run away with it with their data and infrastructure advantages? Who even comes close when you factor in data? Meta are the only people I can think of, but their data must be quite narrow (basically social graph and short-form videos and ad click data?)

Exciting times.

aoeusnth1

I think if they could have been on top earlier, they would have been. They’ve been struggling to catch Anthropic's and OpenAI’s lead and they finally did it (for now), probably due to TPU superiority plus some secret sauce of some kind. Good! More competition means better service for the consumer.

intellectronica

That sounds plausible. Google have two advantages: 1. They do their own capital allocation, 2. TPUs. That likely means that they can execute more training runs in parallel, experiment more, and release when they hit a crack of gold. Independent labs that depend on outside investment have to carefully trade off experiments. Hence Stargate.

intellectronica

So far everything I tried indicates that it is superb at coding. Maybe the best yet (though I understand that some of the published benchmarks dispute that).

Here it one-shotted a fully functional LISP interpreter for me: https://everything.intellectronica.net/p/the-little-lisper

bn-l

hmm but how many examples of that exist on the internet practically verbatim?

hadlock

Good point.

Turns out OpenAI's LLMs are pretty decent at coding x86_64 BIOS bootloaders in assembly, but as soon as you go off script from the two main examples online, they fall apart really quickly; it's crystal clear they have no idea what is actually going on or how bootloaders (and two- and three-stage bootloaders) actually work and what their limitations are.

intellectronica

Not sure that's relevant. Obviously an LLM has to learn from something, but it's not a database. I could also program this myself, and I don't think that it's an argument against my coding abilities that I have read the source code of many existing interpreters. I can only do it because I not only read but also understood and internalised them.

jgalt212

Simon is so important in this space, and rightfully so, that I wonder if these models overweight his blog in the training sets. Similar to how car manufacturers design to what they know will impress the car reviewers.

jjani

100% guaranteed at least someone at Google tried the pelican SVG prompt before the public release. Doesn't mean they necessarily adjusted anything based on its results, but no chance they're not at least taking them into account; the budgets are far too high for them not to do so.

naveen99

Love the quote from reed: “My hack to-do list is empty because I built everything. I keep thinking of new things and knocking them out while watching a movie or something.”

Reminds me of the time when I discovered GTD. Don’t worry, we will find a way to become overwhelmed again.

niyyou

Hi Simon, just read your blog post, thanks for the wrap-up. Just curious, what did you use to make Gemini look at all of your codebase, Aider, something else?

Jimpulse

niyyou

Many thanks! For the record, there is another useful tool in the same vein that packs a repo to be given to an LLM: https://github.com/yamadashy/repomix

mentalgear

There's only a risk if these statements are meant to generalise the overall performance of a system, instead of what they are: arbitrary samples.

simonw

It's getting a little less risky these days, but in the past just saying "LLMs are good at writing code" was enough to spark a hundred comment flame war.

philipwhiuk

Sure, because you're making a broad, provably false statement rather than a statement on a highly specific task.

Jensson

> (It's always risky saying anything like this on a forum like Hacker News because it's inevitable someone will find a way to argue that the examples are trivial/unrealistic/show I don't know what I'm doing/clearly just regurgitated from StackOverflow/etc, but I'll take the risk anyway.)

What is the risk? Do you lose anything? It's like saying it's risky to say these AIs are not useful, since people might say you are just using them wrong, etc.

simonw

The risk is that I end up wasting a bunch of my time arguing with people about it, because I can't let someone be wrong on the internet. https://xkcd.com/386/

tkgally

Very cool. Inspired by the post, I gave the following prompt to Gemini 2.5 Pro with the temperature set to 0.35:

“Create a music synthesizer that I can run in my browser. Give it original functionality, with buttons and other controls that create pleasing but unexpected sounds, melodies, rhythmic patterns, etc. as well as various sliders and dials that can be used to modify the sounds and patterns.”

Here is the result:

https://www.gally.net/temp/202540326geminisynthesizer/index....

Here’s an excerpt from the reasoning trace:

> Brainstorm "Original" and "Unexpected" Functionality:

> Standard Synths: Oscillators (sine, square, saw, triangle), filters (lowpass, highpass), envelopes (ADSR), LFOs. These are common. How to make them unexpected?

> Generative Aspects: Instead of just playing notes the user clicks, what if the synth generates patterns?

> Melody Generation: Use algorithms like random walks, cellular automata, or simple probability rules.

> Rhythm Generation: Euclidean rhythms, probabilistic triggers, evolving patterns. ...

> Interface Concept: How can the controls themselves be part of the "unexpected" experience? Buttons that trigger complex events, sliders with non-linear responses, dials that control abstract concepts.
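
As an aside, the "Euclidean rhythms" idea in that trace is simple to sketch. Here's a minimal illustration (a hypothetical helper, not the code Gemini actually generated) that spreads a number of hits as evenly as possible across a step grid:

```typescript
// Minimal sketch of a Euclidean-style rhythm: distribute `pulses` hits
// as evenly as possible across `steps` slots (Bresenham-style bucket,
// not the full Bjorklund algorithm).
function euclideanRhythm(pulses: number, steps: number): boolean[] {
  const pattern: boolean[] = [];
  let bucket = 0;
  for (let i = 0; i < steps; i++) {
    bucket += pulses;
    if (bucket >= steps) {
      bucket -= steps;
      pattern.push(true);   // hit
    } else {
      pattern.push(false);  // rest
    }
  }
  return pattern;
}

// euclideanRhythm(3, 8) gives 3 evenly spread hits over 8 steps
// (a rotation of the classic tresillo pattern): "..x..x.x"
console.log(euclideanRhythm(3, 8).map(h => (h ? "x" : ".")).join(""));
```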

tkgally

After sleeping on the above and watching some videos about Gemini 2.5 (especially Sam Witteveen’s at [1]), I decided to ask Gemini for an enhanced version of the synthesizer. Here it is:

https://www.gally.net/temp/202540327geminisynthesizer-v2/ind...

This was the prompt I gave to it (through a spoken interface, thus the length and repetition):

“Attached is a website I had you create for me yesterday based on the prompt that appears in another attached file. In that latter file I've also included your thinking process in response to my prompt as well as your explanation to me of how this synthesizer is supposed to work. I am basically happy with the synthesizer you created for me. It works very well, and the output is fascinating to listen to. But I would like the music produced by it to be more melodical and contrapuntal, that is, with more distinct notes that can be perceived forming melodies while still having the random and unexpected and creative generation of those melodies. I would also like to have a broader frequency range of tones that are being produced. For example it would be nice to have something like a bass line. Continue to make the music unexpected and creative and generative. That was one aspect of the music that was very positive for the first result: the fact that I could keep listening to the produced music for a long period of time and not get bored by it. So try to make the tone soundscape richer, more complex and with more sense of melody and counterpoint. Also add any more controls you can think of to make the, to give the user even more ways in which to affect the output, such as more fine tuning on the degree of tonality vs. atonality, conventional harmonic structures vs. unconventional harmonic structures, clear rhythmic patterns vs. unconventional rhythmic patterns, etc.”

The first result had a lot of digital clipping in the output on my M1 Mac mini. After some back and forth with Gemini about possible causes and solutions, it added a limiter and some more controls. The problem persists on the Mac mini. On my M4 iPad with Safari, the sound is clean. I kind of like it.

[1] https://www.youtube.com/watch?v=B3wLYDl2SmQ
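
For reference, the limiter fix described above corresponds to a small Tone.js pattern. This is a sketch under the assumption of Tone.js v14+ and a single shared output chain; the names are illustrative, not the actual generated code:

```typescript
import * as Tone from "tone";

// Sketch: route every voice through a limiter before the speakers so
// summed voices can't exceed roughly -3 dBFS and clip.
const limiter = new Tone.Limiter(-3).toDestination();

const synth = new Tone.PolySynth(Tone.Synth);
synth.connect(limiter); // voices -> limiter -> destination

// (Browsers require a user gesture before audio starts: await Tone.start())
synth.triggerAttackRelease(["C4", "E4", "G4"], "2n");
```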

jcims

I don't know what it was but this made my dog go nuts.

olalonde

Is there something like "Claude Code" for Gemini? Or do you have to manually copy/paste the code in files?

euazOn

Check out Aider [0] or Anon Kode [1] (clone of Claude Code). New models are why I try to build all my tools and infra to be model-independent. On that note, I also prefer to be provider-independent, using OpenRouter [2] or T3 Chat [3] and the like.

[0] https://aider.chat/ [1] https://github.com/dnakov/anon-kode [2] https://openrouter.ai/ [3] https://t3.chat/
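
On the provider-independence point: OpenRouter exposes an OpenAI-compatible chat completions endpoint, so swapping models is mostly a one-string change. A rough sketch (the model name and environment variable are illustrative, and assumes Node 18+ for global fetch):

```typescript
// Sketch of a provider-independent call through OpenRouter's
// OpenAI-compatible endpoint; swapping models only changes `model`.
async function chat(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`, // hypothetical env var
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // e.g. "google/gemini-2.5-pro" (illustrative id)
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content; // OpenAI-style response shape
}
```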

rsanek

OpenRouter is great for trying new models but I wouldn't use it long term since they add their cut on top of the provider's pricing.

bmichel

There is an open-source tool named Aider that can use Gemini: https://aider.chat/

notimetorelax

You can use Gemini from VS Code. (Well, at least Copilot can call it.)

intellectronica

WOW! Very cool and original!!

jrvarela56

Would be cool if the LLM could break up the request into sub-requests processable by LLMs. Current talk about agents mentions some sort of router/orchestrator that delegates to other agents. But these can be another LLM, another agent, another router itself, or a simple tool call, etc. - all function calls that wrap other LLM-enabled sub-components.

My feeling is that we have the pieces to build AGI. Like humans, we don't need a 400-IQ person to solve all problems ('AGI'). What we have are coordination problems, and in LLM land it's 'the glue' that's missing. Hopefully it's a matter of patterns/best practices emerging.
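
As a rough sketch of that router/orchestrator shape (purely illustrative; every name here is hypothetical), each "agent" can share one function signature, so a leaf LLM call, a plain tool, and another router become interchangeable:

```typescript
// Hypothetical sketch: every sub-agent shares the same signature, so an
// agent can be an LLM call, a plain tool, or another router, and the
// caller never needs to know which.
type Agent = (task: string) => Promise<string>;

// A router is itself an Agent: it picks a sub-agent and delegates.
function makeRouter(route: (task: string) => Agent): Agent {
  return async (task) => {
    const agent = route(task); // the chosen agent may itself be a router
    return agent(task);
  };
}

// Leaf agents (stubs standing in for real tool / LLM calls).
const calculator: Agent = async (task) => `result of: ${task}`;
const coder: Agent = async (task) => `/* code for: ${task} */`;

const router = makeRouter((task) =>
  /^[\d\s+\-*/().]+$/.test(task) ? calculator : coder
);

router("2 + 2").then(console.log); // delegates to the calculator stub
```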

dimitri-vs

Disagree that we have all pieces for AGI.

Memory, for one: not only do models need long-term and short-term memory, they also need to be able to selectively forget. Hallucinations are still a big problem; you can easily (unintentionally) put the models in situations where they make up facts. Context limits: comprehension limits are still effectively 8-10k tokens even though the token limits have been raised to infinity.

mortarion

Models already have long-term memory; that's all your basic LLM is, after all: a gigantic long-term memory with all the faults that come with a neural network, like lossy storage and imperfect memory retrieval.

But for AGI, we're indeed missing a short-term memory system with the ability to record the passage of time and filter out information not relevant to the task at hand, though I don't think it should be a neural network like the one we humans have. Neural networks for storing information are the only thing biology had to work with, but that doesn't mean they're the best solution for AGI, and I don't think the path to AGI is a complicated end-to-end neural network model.

AGI, no matter the level of consciousness* you aim for, will probably end up being more like an OS where processes are agents that work together. You'd have long-running agents, short-running agents, agents that analyze data, agents that apply algorithms, agents that come up with algorithms, agents that criticize and fact-check, agents that classify memories of other agents, agents that produce data for other agents to use in generating new models, and supervising agents and interface agents that run continuously to interact with the world and/or users.

*= which I define as the ability to understand that you are an entity existing in an environment that can be affected by an action, and also the ability to understand that an observed change in the environment might have been due to a previous action that you remember doing. This understanding can come at different levels and depends mainly on how detailed and fleeting your short-term memories are.

GaggiX

Why "selectively forget" should be a piece for AGI?

dimitri-vs

I guess we should start with the fact that models currently have no ability to remember at all.

You either fine-tune, which is a very lossy process that degrades generality, or you do in-context learning/RAG. Forgetting in its current form would be eliminating obsolete context; not forgetting would be using 1 million input tokens to answer "what is 2+2?".

In any case, any external mechanism to selectively manage context would be far too limiting for AGI.
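
To make the "eliminating obsolete context" idea concrete, here's a minimal sketch (illustrative only; a real system would score relevance with embeddings or rolling summaries rather than pure recency) that prunes the message history to a token budget before each call:

```typescript
// Sketch: "forgetting" as dropping obsolete messages so the prompt stays
// under a token budget. Relevance is approximated by recency, with system
// messages always kept; embeddings or summaries would be better in practice.
interface Message { role: "system" | "user" | "assistant"; content: string; }

const approxTokens = (m: Message) => Math.ceil(m.content.length / 4); // rough heuristic

function pruneContext(history: Message[], budget: number): Message[] {
  const system = history.filter(m => m.role === "system");
  const rest = history.filter(m => m.role !== "system");

  const kept: Message[] = [];
  let used = system.reduce((n, m) => n + approxTokens(m), 0);

  // Walk backwards (newest first) and keep whatever fits in the budget.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = approxTokens(rest[i]);
    if (used + cost > budget) break;
    kept.unshift(rest[i]);
    used += cost;
  }
  return [...system, ...kept];
}
```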

dnadler

I think maybe this refers to unlearning wrong information?

intellectronica

And yes, I share the view / feeling that we've basically got the AGI building blocks. Models will continue improving, but we can already get most of what we need just by orchestrating the latest generation of SOTA models. Crazy time to be alive!

NitpickLawyer

> But these can be another llm

Yes! I share the feeling that once LLMs get good enough at some abstraction level, you can always put another "level" on top that abstracts what already works into bite-sized pieces. Hassabis also mentions this in a recent podcast: different levels of abstraction. We'll probably see some tooling in this space shortly, to coordinate between the different levels. And then RL it and watch it demolish planning-task benchmarks.

We might very well already be at the point where every level is achievable, we just have to glue them together.

intellectronica

I bet it can do that if hooked up to an agent system. Rate limits are still very restrictive now in the free API, but as soon as they make it available for more frequent use we'll find out.

simonw

"Would be cool if the LLM can break up the request into sub-requests processable by LLMs."

It almost certainly can. Try asking Gemini 2.5 Pro to do that and see what happens.

viraptor

The LLM itself doesn't even need to do that. The actual system / front end that people interact with can wrap that step. Plandex does it, for example, and has been doing it since before the integrated reasoning models existed.

I mean, it's nice when the models can integrate the step-by-step internally... but I feel people have been missing out on the complex interactions by expecting it all in one ad hoc prompt.

jrvarela56

I think the feeling is that for this to really be AGI, it has to take in a single prompt and then delegate behind the scenes to an enormous tree of sub-agents if needed.

One app that comes to mind is Google's Conversational agents. The routing is just done by referencing another agent in the instructions, no need to explicitly link beyond the prompt.

jrvarela56

What parts of the stack or what patterns do you feel are missing? Where does your gut tell you is the 80/20?

eightysixfour

Tools like Claude Code already do this.


wiz21c

FTA:

> I have never seen an LLM do this

Interestingly, many of the programs we use provide a finite set of functionality that we can discover over time. But LLMs are different: you can't fully explore them because the input space is too big. Therefore, they can surprise us for a long time. That's cool!

patapong

Yes I agree! The more I use LLMs, the more use-cases for them I find. Even if the models stop advancing tomorrow (which is unlikely), I think we could spend many years just exploring what the current models are capable of.

intellectronica

Exactly! We don't know everything these machines are good for. Even with older and more established models new things that can be done are discovered many times every day.

coolgoose

Next step into LLM evolution is teaching them to procrastinate

pjmlp

Yesterday on German radio I heard about a study on how they can get "traumatized" if trained on content showing bad human behaviour and then produce output similar to humans in such situations, and how their output improves if you interact with them in a way that shows compassion.

Maybe being an LLM psychologist is a job with a future.

johnisgood

It already refuses to answer some questions. I wanted to know the mechanism of action of a medication and GPT did not want to answer me; I had to work around it by telling it that I am doing research or that it is for university. Like, come on, I can't even ask for the mechanism of action of medications? This is just one example; there is a lot of censorship going on around the models. Which ones are less censored? Are free & open-source ones even censored at all, or would they answer? I imagine they would. GPT and Claude may not answer so the companies can save their asses, so local ones should probably work.

ofirtwo

I'm curious how the model is going to face intellectual tasks it can't resolve by referring back to the user. Today most LLMs will give multiple answers to "what's the meaning of life?" and immediately hand the question back to the user. It would be interesting if they hung with the question more, dove deeper into contradictions, and eventually admitted they don't know.

retrofuturism

That's interesting, but I wonder if it's _just_ the system prompt dictating that a request that would likely consume too many resources and likely fail should be rejected with such an answer.

menzoic

"During its thinking session it reached the conclusion that this task is not feasible in one shot. It then stopped and explained that to me."

I've seen this happen with GPT-4 with zero-shot prompts. Similar to the author, "negotiating" allowed it to continue with an iterative approach.

cadamsdotcom

It’s a new type of refusal.

The model is unlikely to know its own limits. Hopefully these refusals are amenable to prompt engineering: “even if the task seems infeasible, try anyway.”

And hopefully next-gen models are trained to have more faith in themselves :)

vladmdgolam

I’ve encountered something similar when prompting o1-pro to make palindromes with some words; it actually answered that it’s impossible with some of them because they are gibberish when reversed, and then gave an example.

trash_cat

Would be interesting to see the input prompts.

intellectronica

""" Create a complete reproduction of the ReBirth virtual synth. Create it as a single HTML page with javascript. Use canvas and/or react for the UI and tone.js or something similar for the audio. It should be fully working and playable. Output it as a single HTML page. See screenshot for the UI and description below for how the virtual synth system works. """

(this is followed by a long spec of the RB-338, which I also generated and which is too long to include here, and a screenshot).

gverrilla

This is essential to your text and its significance - it should be there, imo.
