AI World Clocks
282 comments
· November 14, 2025
lanewinfield
hi, I made this. thank you for posting.
I love clocks and I love finding the edges of what any given technology is capable of.
I've watched this for many hours and Kimi frequently gets the most accurate clock, but also has the least variation and is the most boring. Qwen is oftentimes the most insane and makes me laugh. Which one is "better"?
jdietrich
Clock drawing is widely used as a test for assessing dementia. Sometimes the LLMs fail in ways that are fairly predictable if you're familiar with CSS and typical shortcomings of LLMs, but sometimes they fail in ways that are less obvious from a technical perspective but are exactly the same failure modes as cognitively-impaired humans.
I think you might have stumbled upon something surprisingly profound.
https://www.psychdb.com/cognitive-testing/clock-drawing-test
TheJoeMan
Figure 6 with the square clock would be a cool modern art piece.
smusamashah
Please make it show last 5 (or some other number) of clocks for each model. It will be nice to see the deviation and variety for each model at a glance.
bspammer
If you're keeping all the generated clocks in a database, I'd love to see a Facemash style spin-off website where users pick the best clock between two options, with a leaderboard. I want to know what the best clock Qwen ever made was!
abixb
We might be on to creating a new crowd-ranked LLM benchmark here.
addandsubtract
A pelican wearing a working watch
nightpool
Yes! Please do this
chemotaxis
This is honestly the best thing I've seen on HN this month. It's stupid, enlightening... funny and profound at the same time. I have a strong temptation to pick some of these designs and build them in real life.
I applaud you for spending money to get it done.
ks2048
Nice job! Maybe let users click an example to see the raw source (LLM output)
anigbrowl
I really like this. The broken ones are sometimes just failures, but sometimes provide intriguing new design ideas.
jdiff
This same principle is why my favorite image generation model is the earlier models from 2019-2020 where they could only reliably generate soup. It's like Rorschach tests, it's not about what's there, it's about what you see in them. I don't want a bot to make art for me, sometimes I just want some shroom-induced inspirational smears.
nemomarx
I really miss that deepdream aesthetic with the dog eyes popping up everywhere.
Fabricio20
Why is this different per user? I sent this to a few friends and they all see different things from what I'm seeing, for the same time?
samtheprogram
It regenerates on page load. I find that pretty useful.
Grok 4 and Kimi nailed it the first time for me, then only Kimi on the second pass.
csours
LOVE IT!
It would be really cool if I could zoom out and have everything scale properly!
otterley
Watching this over the past few minutes, it looks like Kimi K2 generates the best clock face most consistently. I'd never heard of that model before today!
Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.
oaktowner
Perhaps Qwen 2.5 should be known as Dali 2.‽
bArray
It could be that the prompt is accidentally (or purposefully) more optimised for Kimi K2, or that Kimi K2 is better trained on this particular data. LLMs need "prompt engineers" for a reason: to get the most out of a particular model.
bigfishrunning
How much engineering do prompt engineers do? Is it engineering when you add "photorealistic. correct number of fingers and teeth. High quality." to the end of a prompt?
we should call them "prompt witch doctors" or maybe "prompt alchemists".
davidsainez
Sure, we are still closer to alchemy than materials science, but it's still early days. Consider this blog post that was on the front page today: https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompt.... The table at the bottom shows a generally steady increase in performance just by iterating on prompts. It feels like we are on the path to true engineering.
int_19h
I write quite a lot of prompts, and the closest analogy that I can think of is a shaman trying to appease the spirits.
WJW
Well if it works consistently, I don't see any problem with that. If they have a clear theory of when to add "photorealistic" and when to add "correct number of wheels on the bus" to get the output they want, it's engineering. If they don't have a (falsifiable) theory, it's probably not engineering.
Of course, the service they really provide is for businesses to feel they "do AI", and whether or not they do real engineering is as relevant as if your favorite pornstars' boobs are real or not.
tomrod
It could be bioengineering if you add that to a clock prompt and then connect it to a CRISPR process for outputting DNA.
Horrifying prospect, tbh
scrollop
"...and do it really well or my grandmother will be killed by her kidnappers! And I'll give you a tip of 2 billion dollars!!! Hurry, they're coming!"
BoorishBears
I like that actually, I've spent the last year probably 60:40 between post-training and prompt engineering/witch doctoring (the two go together more than most people realize)
Some of it is engineering-like, but I've also picked up a sixth sense when modifying prompts about what parts are affecting the behavior I want to modify for certain models, and that feels very witch doctory!
The more engineering-like part is essentially trying to RE a black box model's post-training, but that goes over some people's heads so I'm happy to help keep the "it's just voodoo and guessing" narrative going instead :)
Dilettante_
"How is engineering a real science? You just build the bridge so it doesn't fall down."
andix
I think the selection of models is a bit off. Haiku instead of Sonnet for example. Kimi K2's capabilities are closer to Sonnet than to Haiku. GPT-5 might be in the non-reasoning mode, which routes to a smaller model.
ceroxylon
I had my suspicions about the GPT-5 routing as well. When I first looked at it, the clock was by far the best; after the minute went by and everything refreshed, the next three were some of the worst of the group. I was wondering if it just hit a lucky path in routing the first time.
woodson
Just use something like DSPy/Ax and optimize your module for any given LLM (based on sample data and metrics) and you’re mostly good. No need to manually wordsmith prompts.
energy123
Goes to show the "frontier" is not really one frontier. It's a social/mathematical construct that's useful for a broad comparison, but if you have a niche task, there's no substitute for trying the different models.
observationist
It's not fair to use prompts tailored to a particular model when doing comparisons like this - one shot results that generalize across a domain demonstrate solid knowledge of the domain. You can use prompting and context hacking to get any particular model to behave pseudo-competently in almost any domain, even the tiny <1B models, for some set of questions. You could include an entire framework and model for rendering clocks and times that allowed all 9 models to perform fairly well.
This experiment, however, clearly states the goal with this prompt: `Create HTML/CSS of an analog clock showing ${time}. Include numbers (or numerals) if you wish, and have a CSS animated second hand. Make it responsive and use a white background. Return ONLY the HTML/CSS code with no markdown formatting.`
An LLM should be able to interpret that, and should be able to perform a wide range of tasks in that same style - countdown timers, clocks, calendars, a floating quote bubble cycling through a list of 100 pithy quotations, etc. Individual, clearly defined elements should have complex representations in latent space that correspond to the human understanding of those elements. Tasks and operations and goals should likewise align with our understanding. Qwen 2.5 and some others clearly aren't modeling clocks very well, or maybe the HTML/CSS rendering latents are broken. If you pick a semantic axis (like analog clocks), you can run a suite of tests to demonstrate their understanding by using limited one-shot interactions.
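For reference, here's a minimal sketch (mine, in Python rather than the HTML/CSS the prompt asks for; the function name is just illustrative) of the arithmetic a correct answer has to encode in its hand rotations:

```python
def hand_angles(hour: int, minute: int, second: int) -> dict:
    """Clockwise rotation in degrees from 12 o'clock for each hand."""
    return {
        "second": second * 6,                      # 360 deg / 60 seconds
        "minute": minute * 6 + second * 0.1,       # 360 deg / 60 minutes, plus drift from seconds
        "hour": (hour % 12) * 30 + minute * 0.5,   # 360 deg / 12 hours, plus drift from minutes
    }

# e.g. 10:10:00 -> hour hand at 305 deg, minute hand at 60 deg, second hand at 0 deg
print(hand_angles(10, 10, 0))
```

It's tiny, which is why a right-looking face showing the wrong time (as noted elsewhere in the thread) is such a telling failure.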
Reasoning models can adapt on the fly, and are capable of cheating - one shots might have crappy representations for some contexts, but after a lot of repetition and refinement, as long as there's a stable, well represented proxy for quality somewhere in the semantics it understands, it can deconstruct a task to fundamentals and eventually reach high quality output.
These types of tests also allow us to identify mode collapses - you can use complex, sophisticated prompting to get most image models to produce accurate analog clocks displaying any time, but in simple one-shot tests, the models tend to only be able to produce the time 10:10, and you'll get wild artifacts and distortions if you try to force any other configuration of hands.
Image models are so bad at hands that they couldn't even get clock hands right, until recently anyway. Nano banana and some other models are much better at avoiding mode collapses, and can traverse complex and sophisticated compositions smoothly. You want that same sort of semantic generalization in text generating models, so hopefully some of the techniques cross over to other modalities.
I keep hoping they'll be able to use SAE or some form of analysis on static weight distributions in order to uncover some sort of structural feature of mode collapse, with a taxonomy of different failure modes and causes, like limited data, or corrupt/poisoned data, and so on. Seems like if you had that, you could deliberately iterate on, correct issues, or generate supporting training material to offset big distortions in a model.
jquery
Qwen 2.5 is so bad it’s good. Some really insane results if you watch it for a while. Almost like it’s taking the piss.
nightpool
It would be cool to also AI generate the favicon using some sort of image model.
frankfrank13
I find that Kimi K2 looks the best, but I've noticed the time is often wrong!
frizlab
I knew of Kimi K2 because it's the model used by Kagi to generate the AI answers when a query ends with a question mark.
OJFord
It's also one of the few 'recommended' models in Kagi Assistant (multi-model ChatGPT basically, available on paid plans).
dilap
I'm a huge K2 fan, it has a personality that feels very distinct from other models (not sycophantic at all), and is quite smart. Also pretty good at creative writing (tho not 100% slop-free).
K2 hosted on Groq is pretty crazy for intelligence/second. (Low rate limits still, tho.)
paulddraper
Kimi K2 is legitimately good.
abixb
>Qwen 2.5's clocks, on the other hand, look like they never make it out of the womb.
More like fell headfirst into the ground.
I'm disappointed with Gemini 2.5 (not sure if Pro or Flash) -- I've personally had _fantastic_ results with Gemini 2.5 Pro building PWAs, especially since the May 2025 "coding update." [0]
[0] https://blog.google/products/gemini/gemini-2-5-pro-updates/
baltimore
Since the first (good) image generation models became available, I've been trying to get them to generate an image of a clock with 13 instead of the usual 12 hour divisions. I have not been successful. Usually they will just replace the "12" with a "13" and/or mess up the clock face in some other way.
I'd be interested if anyone else is successful. Share how you did it!
andix
I gave this "riddle" to various models:
> The farmer and the goat are going to the river. They look into the sky and see three clouds shaped like: a wolf, a cabbage and a boat that can carry the farmer and one item. How can they safely cross the river?
Most of them just give the answer to the well-known river-crossing riddle. Some "feel" that something is off, but still have a hard time figuring out that the wolf, boat, and cabbage are just clouds.
jampa
There are a few examples of this as well:
https://www.reddit.com/r/singularity/comments/1fqjaxy/contex...
andix
It really shows how LLMs work. It's all about probabilities, not about understanding. If something looks very similar to a well-known problem, the LLM has a hard time "seeing" the contradictions, even if they're really easy for a human to notice.
userbinator
Basically a variation of https://en.wikipedia.org/wiki/Age_of_the_captain
Scene_Cast2
I've noticed that image models are particularly bad at modifying popular concepts in novel ways (way worse "generalization" than what I observe in language models).
emp17344
Maybe LLMs always fail to generalize outside their data set, and it’s just less noticeable with written language.
phire
Most image models are diffusion models, not LLMs, and have a bunch of other idiosyncrasies.
So I suspect it's more that lessons from diffusion image models don't carry over to text LLMs.
And the image models which are based on multimodal LLMs (like Nano Banana) seem to do a lot better at novel concepts.
cluckindan
This is it. They’re language models which predict next tokens probabilistically and a sampler picks one according to the desired ”temperature”. Any generalization outside their data set is an artifact of random sampling: happenstance and circumstance, not genuine substance.
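For anyone who hasn't seen that sampling step spelled out, here's a minimal toy sketch (plain Python, made-up logits, not any particular model's implementation):

```python
import math
import random

def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    """Softmax over logits scaled by temperature, then draw one token at random."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())                       # subtract the max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    tokens = list(exps)
    weights = [exps[tok] / total for tok in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

# Low temperature concentrates probability on the top token; high temperature flattens it out.
print(sample_next_token({"12": 3.0, "13": 0.5, "XIII": -1.0}, temperature=0.7))
```

Whether the variety you get out of that randomness counts as generalization is exactly the disagreement here.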
IshKebab
They definitely don't completely fail to generalise. You can easily prove that by asking them something completely novel.
Do you mean that LLMs might display a similar tendency to modify popular concepts? If so that definitely might be the case and would be fairly easy to test.
Something like "tell me the lord's prayer but it's our mother instead of our father", or maybe "write a haiku but with 5 syllables on every line"?
Let me try those ... nah ChatGPT nailed them both. Feels like it's particular to image generation.
CobrastanJorji
Also, they're fundamentally bad at math. They can draw a clock because they've seen clocks, but going further requires some calculations they can't do.
For example, try asking Nano Banana to do something simpler, like "draw a picture of 13 circles." It likely will not work.
deathanatos
Generate an image of a clock face, but instead of the usual 12 hour numbering, number it with 13 hours.
Gemini, 2.5 Flash or "Nano Banana" or whatever we're calling it these days: https://imgur.com/a/1sSeFX7
A normal (ish) 12h clock. It numbered it twice, in two concentric rings. The outer ring is normal, but the inner ring numbers the 4th hour as "IIII" (fine, and a thing that clocks do) and the 8th hour as "VIIII" (wtf).
bar000n
It should be pretty clear already that anything based on (limited to?) communicating in words/text can never grasp conceptual thinking.
We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.
Uehreka
I don’t think that’s clear at all. In fact the proficiency of LLMs at a wide variety of tasks would seem to indicate that language is a highly efficient encoding of human thought, much moreso than people used to think.
bayindirh
> We have yet to design a language to cover that, and it might be just a donquijotism we're all diving into.
We have a very comprehensive and precise spec for that [0].
If you don't want to hop through the certificate warning, here's the transcript:
- Some day, we won't even need coders any more. We'll be able to just write the specification and the program will write itself.
- Oh wow, you're right! We'll be able to write a comprehensive and precise spec and bam, we won't need programmers any more.
- Exactly
- And do you know the industry term for a project specification that is comprehensive and precise enough to generate a program?
- Uh... no...
- Code, it's called code.
[0]: https://www.commitstrip.com/en/2016/08/25/a-very-comprehensi...
XenophileJKO
I mean, that's not really "true".
https://claude.ai/public/artifacts/0f1b67b7-020c-46e9-9536-c...
rideontime
Really? I can grasp the concept behind that command just fine.
chanux
Ah! This is so sad. The manager types won't be able to add an hour (actually, two) to the day even with AI.
BrandoElFollito
This is really cool. I tried to prompt Gemini, but every time I got the same picture. I do not know how to share a session (like is possible with ChatGPT), but the prompts were:
If a clock had 13 hours, what would be the angle between two of these 13 hours?
Generate an image of such a clock
No, I want the clock to have 13 distinct hours, with the angle between them as you calculated above
This is the same image. There need to be 13 hour marks around the dial, evenly spaced
... And its last answer was
You are absolutely right, my apologies. It seems I made an error and generated the same image again. I will correct that immediately.
Here is an image of a clock face with 13 distinct hour marks, evenly spaced around the dial, reflecting the angle we calculated.
And the very same clock, with 12 hours, and a 13th above the 12...
ryandrake
This is probably my biggest problem with AI tools, having played around with them more lately.
"You're absolutely right! I made a mistake. I have now comprehensively solved this problem. Here is the corrected output: [totally incorrect output]."
None of them ever seem to have the ability to say "I cannot seem to do this" or "I am uncertain if this is correct, confidence level 25%." The only time they will give up or refuse to do something is when they are deliberately programmed to censor for often dubious "AI safety" reasons. All other times, they come back again and again with extreme confidence as they produce total garbage output.
int_19h
Gemini specifically is actually kinda notorious for giving up.
https://www.reddit.com/r/artificial/comments/1mp5mks/this_is...
BrandoElFollito
I agree, I see the same even in simple code, where they will bend over backwards apologizing and then generate very similar crap.
It is like they are sometimes stuck in a local energy minimum and will just wobble around various similar (and incorrect) answers.
What was annoying in my attempt above is that the picture was identical for every attempt.
notatoad
You can click the share icon (the two-way branch icon, it doesn't look like Apple's share icon) under the image it generates to share the conversation.
I'm curious if the clock image it was giving you was the same one it was giving me.
giancarlostoro
Weird, I never tried that. I tried all the usual tricks that usually work, including swearing at the model (this scarily works surprisingly well with LLMs), and nothing. I even tried to go the opposite direction: I want a 6-hour clock.
coffeecoders
LLMs are terrible at out-of-distribution (OOD) tasks. You should use chain-of-thought suppression and give constraints explicitly.
My prompt to Grok:
---
Follow these rules exactly:
- There are 13 hours, labeled 1–13.
- There are 13 ticks.
- The center of each number is at angle: index * (360/13)
- Do not infer anything else.
- Do not apply knowledge of normal clocks.
Use the following variables:
HOUR_COUNT = 13
ANGLE_PER_HOUR = 360 / 13 // 27.692307°
Use index i ∈ [0..12] for hour marks:
angle_i = i * ANGLE_PER_HOUR
I want html/css (single file) of a 13-hour analog clock.
---
Output from grok.
chemotaxis
> Follow these rules exactly:
"Here's the line-by-line specification of the program I need you to write. Write that program."
signatoremo
Can you write this program in any language?
BrandoElFollito
Well, that's cheating :) You asked it to generate code, which is OK because it does not represent a directly generated image of a clock.
Can grok generate images? What would the result be?
I will try your prompt on chatgpt and gemini
BrandoElFollito
Gemini failed miserably - a standard 12-hour clock.
Same for ChatGPT.
And Perplexity replaced the 12 with a 13.
chiwilliams
I'll also note that the output isn't quite right --- the top number should be 13 rather than 1!
layer8
I mean, the specification for the hour marks (angle_i) starts with a mark at angle 0. It just followed that spec. ;)
NooneAtAll3
close enough, but digit at the top should be the highest, not 1 :/
IAmGraydon
That's because they literally cannot do that. Doing what you're asking requires an understanding of why the numbers on the clock face are where they are and what it would mean if there were an extra hour on the clock (i.e., that you would have to divide 360 by 13 to begin to understand where the numbers would go). AI models have no concept of anything that's not included in their training data. Yet people continue to anthropomorphize this technology and are surprised when it becomes obvious that it's not actually thinking.
energy123
The hope was for this understanding to emerge as the most efficient solution to the next-token prediction problem.
Put another way, it was hoped that once the dataset got rich enough, developing this understanding would actually be more efficient for the neural network than memorizing the training data.
The useful question to ask, if you believe the hope is not bearing fruit, is why. Point specifically to the absent data or the flawed assumption being made.
Or more realistically, put in the creative and difficult research work required to discover the answer to that question.
bobbylarrybobby
It's interesting because if you asked them to write code to generate an SVG of a clock, they'd probably use a loop from 1 to 12, using sin and cos of the angle (given by the loop index over 12 times 2pi) to place the numerals. They know how to do this, and so they basically understand the process that generates a clock face. And extrapolating from that to 13 hours is trivial (for a human). So the fact that they can't do this extrapolation on their own is very odd.
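Something like this minimal sketch (mine, purely illustrative - the point is that the hour count is just a parameter):

```python
import math

def clock_face_svg(hours: int = 12, size: int = 200) -> str:
    """Place numerals 1..hours evenly around a circle; a 13-hour face is just hours=13."""
    cx = cy = size / 2
    radius = size * 0.4
    labels = []
    for i in range(1, hours + 1):
        angle = 2 * math.pi * i / hours - math.pi / 2   # start at the top, go clockwise
        x = cx + radius * math.cos(angle)
        y = cy + radius * math.sin(angle)
        labels.append(
            f'<text x="{x:.1f}" y="{y:.1f}" text-anchor="middle" '
            f'dominant-baseline="middle">{i}</text>'
        )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<circle cx="{cx}" cy="{cy}" r="{size * 0.45}" fill="white" stroke="black"/>'
        + "".join(labels)
        + "</svg>"
    )

# With hours=13 the highest numeral lands at the top, evenly spaced - the extrapolation the models miss.
with open("clock13.svg", "w") as f:
    f.write(clock_face_svg(hours=13))
```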
ryandrake
I wonder if you would have more success if you painstakingly described the shape and features of a clock in great detail but never used the words clock or time or anything that might give the AI the hint that they were supposed to output something like a clock.
BrandoElFollito
And this is a problem for me. I guess that it would work, but as soon as the word "clock" appears, gone is the request because a clock HAS.12.HOURS.
I use this a lot in cybersecurity when I need to do something "illegal". I am refused help, until I say that I am doing research on cybersecurity. In that case no problem.
Workaccount2
The problem is more likely the tokenization of images than anything. These models do their absolute worst when pictures are involved, but are seemingly miraculous at generalizing with just text.
chemotaxis
I wonder if it's because we mean different things by generalization.
For text, "generalization" is still "generate text that conforms to all the usual rules of the language". For images of 13-hour clock faces, we're explicitly asking the LLM to violate the inferred rules of the universe.
I think a good analogy would be asking an LLM to write in English, except the word "the" now means "purple". They will struggle to adhere to this prompt in a conversation.
godelski
Yes, the problem is that these so-called "world models" do not actually contain a model of the world, or of any world.
echelon
gpt-image-1 and Google Imagen understand prompts, they just don't have training data to cover these use cases.
gpt-image-1 and Imagen are wickedly smart.
The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.
phkahler
>> The new Nano Banana 2 that has been briefly teased around the internet can solve incredibly complicated differential equations on chalk boards with full proof of work.
That's great, but I bet it can't tie its own shoes.
kylecazar
Non-determinism at its finest. The clock is perfect, the refresh happens, and the clock looks like a Dali painting.
ryandrake
I've been struggling all week trying to get Claude Code to write code to produce visual (not the usual, verifiable, text on a terminal) output in the form of a SDL_GPU rendered scene consisting of the usual things like shaders, pipelines, buffers, textures and samplers, vertex and index data and so on, and boy it just doesn't seem to know what it's doing. Despite providing paragraphs-long, detailed prompts. Despite describing each uniform and each matrix that needs to be sent. Despite giving it extremely detailed guidance about what order things need to be done in. It would have been faster for me to just write the code myself.
When it fails a couple of times it will try to put logging in place and then confidently tell me things like "The vertex data has been sent to the renderer, therefore the output is correct!" When I suggest it take a screenshot of the output each time to verify correctness, it does, and then declares victory over an entirely incorrect screenshot. When I suggest it write unit tests, it does so, but the tests are worthless and only tests that the incorrect code it wrote is always incorrect in the same ways.
When it fails even more times, it will get into this what I like to call "intern engineer" mode where it just tries random things that I know are not going to work. And if I let it keep going, it will end up modifying the entire source tree with random "try this" crap. And each iteration, it confidently tells me: "Perfect! I have found the root cause! It is [garbage bullshit]. I have corrected it and the code is now completely working!"
These tools are cute, but they really need to go a long way before they are actually useful for anything more than trivial toy projects.
fancy_pantser
Have you tried using MCPs to provide documentation and examples? I always have to bring in docs, since I don't work in Python and TS+React (which it seems more capable at), and force it to review those in addition to any specification, e.g. Context7.
ryandrake
Haven't looked into MCPs yet. Thanks for the suggestion!
rossant
Have you tried OpenAI Codex with GPT5.1? I'm using it for similar GPU rendering stuff and it appears to do an excellent job.
jamilton
I know this has been said many times before, but I wonder why this is such a common outcome. Maybe from negative outcomes being underrepresented in the training data? Maybe that plus being something slightly niche and complex?
The screenshot method not working is unsurprising to me; VLMs' visual reasoning is very bad with details because (as far as I understand) they don't really have access to those details, just the image embedding and maybe an OCR'd transcript.
poszlem
I'm not sure if it's just me, but I've also noticed Claude becoming even lazier. For example, I've asked it several times to fix my tests. It'll fix four or five of them, then start struggling with the next couple, and suddenly declare something like: "All done, fixed 5 out of 10 tests. I can't fix the remaining ones", followed by a long, convoluted explanation of why that's actually a good thing.
anon_cow1111
I'm having a hard time believing this site is honest, especially with how ridiculous the scaling and rotation of the numbers is for most of them. I dumped his prompt into ChatGPT to try it myself, and it created a very neat clock face with the numbers in the correct positions and an animated second hand; it just got the exact time wrong, being a few hours off.
Edit: the time may actually have been perfect, now that I account for my ISP's geolocated time zone.
Zopieux
On the contrary, in my experience this is very typical of the average failure mode / output of early-2025 LLMs for HTML or SVG.
perfmode
I read that the OP limited the output to 2000 tokens.
lanewinfield
^ This! There are a lot of clocks to generate, so I've challenged it to stick to a small(er) amount of code.
anon_cow1111
I got a ~1600-character reply from GPT, including spaces, and it worked on the first shot when dumped into an HTML doc. I think that probably fits OK within the limit? (If I missed something obvious, feel free to tell me I'm an idiot.)
Springtime
On the second minute I had the AI World Clocks site open, the GPT-5-generated version displayed a perfect clock. Its clock before, and every clock from it since, has had very apparent issues though.
If you could get a perfect clock several times for the identical prompt in fresh contexts with the same model, then it'd be a better comparison. Potentially the ChatGPT site you're using is doing some adjustments that the API-fed version isn't.
munro
Amazing. Some people who use LLMs for soft outcomes are so enamored with them, and disagree with me when I say to be careful because they're not perfect -- this is such a great non-technical way to explain the reality I'm seeing when using them on hard-outcome coding/logic tasks. "Hey, this test is failing", LLM deletes test, "FIXED!"
derbOac
Something that struck me when I was looking at the clocks is that we know what a clock is supposed to look and act like.
What about when we don't know what it's supposed to look like?
Lately I've been wrestling with the fact that unlike, say, a generalized linear model fit to data with some inferential theory, we don't have a theory or model for the uncertainty about LLM products. We recognize when it's off about things we know are off, but don't have a way to estimate when it's off other than to check it against reality, which is probably the exception to how it's used rather than the rule.
worldsayshi
Yeah it seems crazy to use LLM on any task where the output can't be easily verified.
palmotea
> Yeah it seems crazy to use LLM on any task where the output can't be easily verified.
I disagree, those tasks are perfect for LLMs, since a bug you can't verify isn't a problem when vibecoding.
mopsi
> "Hey this test is failing", LLM deletes test, "FIXED!"
A nice continuation of the tradition of folk stories about supernatural entities like teapots or lamps that grant wishes and take them literally. "And that's why, kids, you should always review your AI-assisted commits."
porphyra
LLMs can't "look" at the rendered HTML output to see if what they generated makes sense or not. But there ought to be a way to do that, right? To let the model iterate until what it generates looks right.
Currently, at work, I'm using Cursor for something that has an OpenGL visualization program. It's incredibly frustrating trying to describe bugs to the AI because it is completely blind. Like I just wanna tell it "there's no line connecting these two points but there ought to be one!" or "your polygon is obviously malformed as it is missing a bunch of points and intersects itself" but it's impossible. I end up having to make the AI add debug prints to, say, print out the position of each vertex, in order to convince it that it has a bug. Very high friction and annoying!!!
firtoz
Cursor has this with their "browser" function for web dev, quite useful
You can also give it an MCP setup so that it can send a screenshot to the conversation, though I'm unsure if anyone has made an easy enough "take a screenshot of a specific window id" kind of MCP, so it may need to be built first.
I guess you could also ask it to build that mcp for you...
pil0u
I had some success providing screenshots to Cursor directly. It worked well for web UIs as well as generated graphs in Python. It makes them a bit less blind, though I feel more iterations are required.
EMM_386
You can absolutely do this. In fact, with Claude Anthropic encourages you to send it screenshots. It works very well if you aren't expecting pixel-perfection.
YMMV with other models but Sonnet 4.5 is good with things like this - writing the code, "seeing" the output and then iterating on it.
TheKidCoder
Kinda - hand-waving over the question of whether an LLM can really "look", but you can connect Cursor to a Puppeteer MCP server, which will allow it to iterate with "eyes" by using Puppeteer to screenshot its own output. It still has issues, but it often solves really silly mistakes simply by having this MCP available.
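If you'd rather not depend on an MCP server, the core of that loop is small. A minimal sketch using Playwright's Python API instead of the Puppeteer setup above (`generated_clock.html` is a hypothetical file holding the model's last output):

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

def screenshot_html(html: str, out_path: str = "clock.png") -> str:
    """Render an HTML string in headless Chromium and save a screenshot of it."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 400, "height": 400})
        page.set_content(html)            # load the generated markup directly, no server needed
        page.screenshot(path=out_path)
        browser.close()
    return out_path

# Attach the resulting image to the model's next turn so it can "see" what it built.
html = Path("generated_clock.html").read_text()
print(screenshot_html(html))
```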
fragmede
Claude totally can, same with ChatGPT. Upload a picture to either one of them via the app and tell it there's no line where there should be. There's some plumbing involved to get it to work in Claude Code or Codex, but yes, computers can "see". If you have lm-server, there are tons of non-text models you can point your code at.
zkmon
Why are DeepSeek and Kimi beating other models by such a margin? Is it to do with their specialization for this task?
anonzzzies
Sonnet 4.5 does it flawlessly. Tried 8 times.
edfletcher_t137
Lack of Claude is a glaring oversight given how popular it is as an agentic coding model...
em3rgent0rdr
Most look like they were done by a beginner programmer on crack, but every once in a while a correct one appears.
shafoshaf
It's interesting how drawing a clock is one of the primary signals for dementia. https://www.verywellhealth.com/the-clock-drawing-test-98619
BrandoElFollito
This is very interesting, thank you.
I could not get to the story because of the cookie banner, which does not work (at least on mobile Chrome and FF). The Internet Archive page: https://archive.ph/qz4ep
I wonder how this test could be modified for people who have neurological problems - my father's hands shake a lot, but I would like to try the test on him (I do not have suspicions, just curious).
I passed it :)
technothrasher
"One variation of the test is to provide the person with a blank piece of paper and ask them to draw a clock showing 10 minutes after 11. The word "hands" is not used to avoid giving clues."
Hmm, ambiguity. I would be the smart ass that drew a digital clock for them, or a shaku-dokei.
pixl97
DeepSeek and Kimi seem to have correct ones most of the time I've looked.
BrandoElFollito
DeepSeek told me that it cannot generate pictures and suggested code (which is very different)
energy123
If they can identify which one is correct, then it's the same as always being correct, just with an expensive compute budget.
morkalork
I'd say more like a blind programmer in the early stages of dementia. Able to write code, unable to form a mental image of what it would render as and can't see the final result.
"Every minute, a new clock is rendered by nine different AI models."