The last six months in LLMs, illustrated by pelicans on bicycles
243 comments
June 8, 2025 · isx726552
simonw
Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.
benmathes
"personal" doing a lot of work there :-)
(And I'd be envious of your impact, of course)
Choco31415
Just tried that canard on GPT-4o and it failed:
"The word "strawberry" contains 2 letter r’s."
belter
I tried
strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three
strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four
stawberrry -> DeepSeek and GeminiPro both correctly said three
ChatGPT4o, even in a new chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. It even provided this useful breakdown to let me know :-)
Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's
And then it asked if I meant "strawberry" instead, saying that one has 2 r's....
MattRix
This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org
whiplash451
Well, ARC-1 did not end well for the competitors of tech giants and it’s very unclear that ARC-2 won’t follow the same trajectory.
wolfmanstout
This doesn’t make ARC a bad benchmark. Tech giants will have a significant advantage in any benchmark they are interested in, _especially_ if the benchmark correlates with true general intelligence.
lofaszvanitt
You push SHA-512 hashes of things to a GitHub repo along with a short sentence:
x8 version: still shit . . x15 version: we are getting closer, but overall a shit experience :D
This way they won't know what to improve upon. Of course they can buy access. ;P
When they finally solve your problem, you can reveal what the benchmark was.
adrian17
> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.
Awkwardly, I had never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to accidentally miss or dismiss some big release.
haiku2077
Congratulations, you are almost fully unplugged from social media. This product launch was a huge mainstream event; for a few days GPT generated images completely dominated mainstream social media.
sigmoid10
If you primarily consume text-based social media (HN, reddit with legacy UI) then it's kind of easy to not notice all the new kinds of image infographics and comics that now completely flood places like instagram or linkedin.
derwiki
Not sure if this is sarcasm or sincere, but I will take it as sincere haha. I came back to work from parental leave and everyone had that same Studio Ghiblized image as their Slack photo, and I had no idea why. It turns out you really can unplug from social media and not miss anything of value: if it’s a big enough deal you will find out from another channel.
stavros
Why does everyone keep calling news "social media"? Have I missed a trend? Knowing what my friend Steve is up to is social media, knowing what AI is up to is news.
dgfitz
I missed it until this thread. I think I’m proud of myself.
Semaphor
Facebook, discord, reddit, HN. Hadn’t heard of it either. But for FB, Reddit, and Discord I strictly curate what I see.
azinman2
Except this went very mainstream. Lots of "turn myself into a muppet", "what is the human equivalent for my dog", etc. TikTok is all over this.
It really is incredible.
thierrydamiba
The big trend was around the ghiblification of images. Those images were everywhere for a period of time.
Jedd
Yeah, but so were the bored ape NFTs - none of these ephemeral fads are any indication of quality, longevity, legitimacy, or interest.
herval
They still are. Instagram is full of accounts posting gpt-generated cartoons (and now veo3 videos). I’ve been tracking the image generation space from day one, and it never stuck like this before
MattRix
To be clear: they already had image generation in ChatGPT, but this was a MUCH better one than what they had previously. Even for you with your stable diffusion app, it would be a significant upgrade. Not just because of image quality, but because it can actually generate coherent images and follow instructions.
MIC132
As impressive as it is, for some uses it still is worse than a local SD model. It will refuse to generate named anime characters (because of copyright, or because it just doesn't know them, even not particularly obscure ones) for example. Or obviously anything even remotely spicy. As someone who mostly uses image generation to amuse myself (and not to post it, where copyright might matter) it's honestly somewhat disappointing. But I don't expect any of the major AI companies to release anything without excessive guardrails.
bufferoverflow
Have you missed how everyone was Ghiblifying everything?
adrian17
I saw that, I just didn't connect it with newly added multimodal image generation. I knew variations of style transfer (or LoRA for SD) were possible for years, so I assumed it exploded in popularity purely as a meme, not due to OpenAI making it much more accessible.
Again, I was aware that they added image generation, just not how big a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.
nathan_phoenix
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
simonw
It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
demosthanos
I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
diggan
Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.
throwaway31131
I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.
Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.
6LLvveMx2koXfwn
I would definitely say he had no intention of doing that and was doubling down on the original joke.
telotortium
Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.
fzzzy
Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.
simonw
It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!
Breza
Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.
Breza
I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.
In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
dilap
Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!
ontouchstart
Very nice talk, accessible to the general public and to AI agents as well.
Any concerns that open-source “AI celebrity talks” like yours could be used in contexts that would allow LLM models to optimize their market share in ways we can’t yet imagine?
Your talk might influence the funding of AI startups.
#butterflyEffect
threecheese
I welcome a VC funded pelican … anything! Clippy 2.0 maybe?
Simon, hope you are comfortable in your new role of AI Celebrity.
planb
And by a sample that has become increasingly known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM’s ability to estimate what’s considered a good “pelican on a bike”.
criddell
And that’s why he says he’s going to have to find a new benchmark.
viraptor
Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.
I actually don't think I've seen a single correct svg drawing for that prompt.
cyanydeez
So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.
Call it wikipediaslop.org
YuccaGloriosa
If the "any other noun" becomes fish... I think I disagree.
puttycat
You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
ben_w
> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
jodrellblank
You claim those are drawn by people with "perfect knowledge about bikes" and "perfect drawing skills"?
rightbyte
That blog post is a 10/10. Oh dear I miss the old internet.
cyanydeez
Humans absolutely do not work discretely.
loloquwowndueo
They probably meant deterministically as opposed to probabilistically. Not that humans work that way either :)
bufferoverflow
> work discretely like humans
What kind of humans are you surrounded by?
Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.
mooreds
My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.
I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and if it differed from the LLM consensus
Anyway, great talk!
zahlman
It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....
timewizard
My biggest gripe is he didn't include a picture of an actual pelican.
https://www.google.com/search?q=pelican&udm=2
The "closest pelican" is not even close.
qeternity
I think you mean non-deterministic, instead of probabilistic.
And there is no reason that these models need to be non-deterministic.
skybrian
A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
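As a minimal sketch of that idea (using the well-known mulberry32 mixer, nothing LLM-specific): the function below is fully deterministic for a given seed, yet there is no practical way to predict how the output shifts when the seed changes.
```
// mulberry32: a tiny deterministic PRNG. Same seed -> exactly the same sequence.
function mulberry32(seed) {
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const a = mulberry32(42); // deterministic...
const b = mulberry32(43); // ...but good luck predicting how this differs from a
console.log(a(), a(), b(), b());
```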
rvz
> I think you mean non-deterministic, instead of probabilistic.
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".
zurichisstained
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0,      duration: 'triplet' }, // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```
But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
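For context, a minimal sketch of how an array like that could be fed to the Web Audio API; the tempo, oscillator type, and duration-to-seconds mapping here are assumptions for illustration, not necessarily what the codepens linked below do:
```
// Sketch: schedule the melody with the Web Audio API.
// Assumes 120 BPM: 'quarter' = 0.5s, 'half' = 1s, 'triplet' = a third of a beat.
const durations = { quarter: 0.5, half: 1.0, triplet: 0.5 / 3 };

function playMelody(melody) {
  const ctx = new AudioContext();
  let t = ctx.currentTime;
  for (const note of melody) {
    if (note.freq > 0) {
      const osc = ctx.createOscillator();
      osc.type = 'square';              // crude stand-in for a guitar tone
      osc.frequency.value = note.freq;
      osc.connect(ctx.destination);
      osc.start(t);
      osc.stop(t + durations[note.duration]);
    }
    t += durations[note.duration];      // rests (freq 0) just advance the clock
  }
}
```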
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
ojosilva
Drawbacks of using a pelican-on-a-bicycle SVG: it's a very open-ended prompt with no specific criteria to judge by, and lately the SVGs all start to look similar, or at least like they accomplish the same non-goals (there's a pelican, there's a bicycle, and I'm not sure whether its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And, certainly, with an LLM as a judge, the entire game becomes double-hinged and who knows what to think.
Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.
Side note: I'd really like to see the Language Benchmark Game become a prompt-based languages × models benchmark game, so we could say model X excels at Python Fasta, etc. Although then the risk is that, again, it becomes training data and the whole thing rigs itself.
dr_kretyn
I'm slightly confused by your example. What's the actual prompt? Is your expectation that a text model is going to know how to perform the exact song in audio?
zurichisstained
Ohhh absolutely not, that would be pretty wild - I just wanted to see if it could understand musical notation enough to come up with the correct melody.
I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.
My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).
bredren
Great writeup.
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won’t be a broad enough measurement of capability by this time next year (or perhaps even now).
So the test could also include an agentic portion that includes consultation of the latest blender documentation or even use of a search engine for blog entries detailing syntax and technique.
For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.
For usability, the objects can be converted to iOS’s native 3d format that can be viewed in mobile safari.
I built this workflow, including a service for Blender, as an initial test of what was possible in October of 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs would make those mistakes less often now.
joshstrange
I really enjoy Simon’s work in this space. I’ve read almost every blog post they’ve posted on this and I love seeing them poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely all without trying to do too much by themselves.
And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
alanmoraes
I also like what he writes and the way he does it.
franze
Here Claude Opus Extended Thinking https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c...
anon373839
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).
simonw
Omitting Qwen 3 is my great regret about this talk. Honestly I only realized I had missed it after I had delivered the talk!
It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.
Maxious
Cut for time - qwen3 was pelican tested too https://simonwillison.net/2025/Apr/29/qwen-3/
username223
Interesting timeline, though the most relevant part was at the end, where Simon mentions that Google is now aware of the "pelican on bicycle" question, so it is no longer useful as a benchmark. FWIW, many things outside of the training data will pants these models. I just tried this query, which probably has no examples online, and Gemini gave me the standard puzzle answer, which is wrong:
"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.
NohatCoder
If you calculate Elo based on a round-robin tournament with all participants starting out on the same score, then the resulting ratings should simply correspond to the win count. I guess the algorithm in use takes into account the order of the matches, but taking order into account is only meaningful when competitors are expected to develop significantly; otherwise it is just added noise, so we never want to do so in competitions between bots.
I also can't help but notice that the competition is exactly one match short, for some reason exactly one of the 561 possible pairings has not been included.
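For reference, a sketch of the standard Elo update being discussed here (the K-factor and the 0.5-for-a-draw convention are the usual defaults, not taken from Simon's script); because each match moves ratings from wherever they currently stand, the final numbers depend on the order the matches are processed:
```
// One Elo update. scoreA is 1 for a win, 0.5 for a draw, 0 for a loss.
function eloUpdate(ratingA, ratingB, scoreA, k = 32) {
  const expectedA = 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
  const delta = k * (scoreA - expectedA);
  return [ratingA + delta, ratingB - delta];
}

// In a single round-robin starting from equal ratings, the final values track
// win counts closely, but each update depends on the ratings at that moment,
// so the exact numbers still vary with match order.
```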
simonw
Yeah, that's a good call out: Elo isn't actually necessary if you can have every competitor battle every other competitor exactly once.
The missing match is because one single round was declared a draw by the model, and I didn't have time to run it again (the Elo stuff was very much rushed at the last minute.)
qwertytyyuu
https://imgur.com/a/mzZ77xI here are a few I tried with the models; looks like the newer version of Gemini is another improvement?
landgenoot
If you gave a human the SVG documentation and asked them to write an SVG, I think the results would be quite similar.
diggan
Lets give it a try, if you're willing to be the experiment subject :)
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
zahlman
> The colors use traditional bicycle brown (#8B4513) and a classic blue for the pelican (#4169E1) with gold accents for the beak (#FFD700).
The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.
diggan
My guess would be that it doesn't see the web colors (CSS color hexes) as proper hex triplets, but because of tokenization it could be something dumb like '#8B','451','3' instead. I think the same issue happens around multiple special characters after each other too.
mormegil
Did the testing prompt for LLMs include a clause forbidding the use of any tools? If not, why are you adding it here?
simonw
The way I run the pelican on a bicycle benchmark is to use this exact prompt:
Generate an SVG of a pelican riding a bicycle
And execute it via the model's API with all default settings, not via their user-facing interface. Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.
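As one concrete illustration of what that looks like in practice, here is a sketch of an equivalent raw call against OpenAI's chat completions endpoint (the model name is just an example; other providers have analogous APIs). No tools parameter is passed, so tool use never enters the picture:
```
// Sketch: the benchmark prompt sent with all-default settings.
// Omitting "tools" (and every other optional parameter) leaves the model on its own.
const res = await fetch('https://api.openai.com/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: 'gpt-4o', // example model; each model under test gets the identical prompt
    messages: [
      { role: 'user', content: 'Generate an SVG of a pelican riding a bicycle' },
    ],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content); // hopefully an SVG
```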
diggan
The models that are being put under the "Pelican" testing don't use a GUI to create SVGs (either via "tools" or anything else), they're all Text Generation models so they exclusively use text for creating the graphics.
There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/
ramesh31
>If you gave a human the SVG documentation and asked them to write an SVG, I think the results would be quite similar.
It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.
hae7eepa9eeY
[dead]
> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.