O3 beats a master-level GeoGuessr player, even with fake EXIF data
230 comments
· April 29, 2025
rafram
From one of o3's outputs:
> Rear window decal clearly reads “www.taxilinder.at”. A quick lookup shows Taxi Linder GmbH is based in Dornbirn, Vorarlberg.
That's cheating. If it can use web search, it isn't playing fair. Obviously you can get a perfect score on any urban GeoGuessr round by looking up a couple of businesses, but that isn't the point.
SamPatt
Author here - it's a fair criticism, and I point it out in the article. However, I kept it in for a few reasons.
I'm trying to show the model's full capabilities for image location generally, not just playing geoguessr specifically. The ability to combine web search with image recognition, iteratively, is powerful.
Also, the web search was only meaningful in the Austria round. It did use it in the Ireland round too, but as you can see by the search terms it used, it already knew the road solely from image recognition.
It beat me in the Colombia round without search at all.
It's worthwhile to do a proper apples-to-apples comparison - I'll run it again and update the post. But the point was to show how incredibly capable the model is generally, and the lack of search won't change that. Just read the chain of thought, it's incredible!
LeifCarrotson
There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!
It's not interesting playing chess against Stockfish 17, even for high-level GMs. It's alien and just crushes every human. Writing down an analysis to 20-move depth, following some lines to 30 or more, would be cheating for humans; it would take way too long (exceeding any time controls and, more importantly, exceeding the lifetime of the human), while a powerful computer can just crunch it in seconds. Referencing a tablebase of endgames for 7 pieces would also be cheating: memorizing terabytes of bitwise layouts is absurd for a human, but the computer just stores that on its hard drive.
Human geoguessr players have impressive memories way above baseline with respect to regional infrastructure, geography, trees, road signs, written language, and other details. Likewise, human Jeopardy players know an awful lot of trivia. Once you get to something like Scrabble or chess, it's less and less about knowing words or knowing moves, but more about synthesizing that knowledge intelligently.
One would expect a human to recognize some domain names like, I don't know, osu.edu: lots of people know that's Ohio State University, one of the biggest schools in the US, located in Columbus, Ohio. They don't have to cheat and go to an external resource. One would expect a human (a top human player, at least) to know that taxilinder.at is based in Austria. One would never expect any human to have every business or domain name memorized.
With modern AI models trained on internet data, searching the internet is not that different from querying its own training data.
mrlongroots
To reframe your takeaway: you want to benchmark the "system" and see how capable it is. The boundaries of the system are somewhat arbitrary - is it "AI + web" or "only AI"? - and it's not about fairness as much as about what you, the evaluator, want to know.
tshaddox
> There's some level at which an AI 'player' goes from being competitive with a human player, matching better-trained human strategy against a more impressive memory, to just a cheaty computer with too much memorization. Finding that limit is the interesting thing about this analysis, IMO!
And a lot of human competitions aren't designed in such a way that the competition even makes sense with "AI." A lot of video games make this pretty obvious. It's relatively simple to build an aimbot in a first-person shooter that can outperform the most skilled humans. Even in ostensibly strategic games like Starcraft, bots can micro in ways that are blatantly impossible for humans and which don't really feel like an impressive display of Starcraft skill.
Another great example was IBM Watson playing Jeopardy! back in 2011. We were supposed to be impressed with Watson's natural language capabilities, but if you know anything about high-level Jeopardy! then you know that all you were really seeing was that robots have better reflexes than humans, which is hardly impressive.
k4rli
It's still as much cheating as googling. Completely irrelevant. Even if it were to beat Blinky, it's no different from googlers/scripters.
SamPatt
I disagree. I ran those rounds again, without search this time, and the results were nearly identical:
IanCal
I tried the image without search, and it talked about Dornbirn anyway but ended up choosing Bezau, which is really quite close.
edit - the models are also at a disadvantage in one way: they don't have a map to look at while they pick the location.
Ukv
The author did specifically point out that
> Using Google during rounds is technically cheating - I’m unsure about visiting domains you find during the rounds though. It certainly violates the spirit of the game, but it also shows the models are smart enough to use whatever information they can to win.
and had noted in the methodology that
> Browsing/tools — o3 had normal web access enabled.
Still an interesting result - maybe more accurate to say o3 + search beats a human, but one could also consider the search index/cache to just be part of the system being tested.
godelski
Pointing out that it is cheating doesn't excuse the lie in the headline. That just makes it a bait-and-switch, a form of fraud. OP knew they were doing a bait-and-switch.
I remember when we were all pissed about clickbait headlines because they were deceptive. Did we just stop caring?
sdenton4
The people pissed about clickbait headlines were often overstating things to drum up outrage and accumulate more hacker news upboats...
627467
Cheating implies there's a game. There isn't.
> Titles and headlines grab attention, summarize content, and entice readers to engage with the material
I'm sorry you felt defrauded instead. To me, the title was very good at conveying the ability of o3 in geolocating photos.
SecretDreams
What's your suggestion for an alternative headline?
bahmboo
The headline said the AI beat him, it did not say it beat him in a GeoGuessr game. The article clearly states what he did and why.
jasonlotito
One of the rules bans the use of third-party software or scripts.
Any LLM attempting to play would lose because of that rule. So, if you know the rules and strictly adhere to them (as you seem to be doing), then there's no need to click on the link. You already know it's not playing by GeoGuessr rules.
That being said, if you are running a test, you are free to set the rules as you see fit and explain them, and under the conditions set by the person running the test, these are the results.
> Did we just stop caring?
We stopped caring about pedantry. Especially when the person being pedantic seems to cherry-pick to make their point.
_heimdall
This seems like a great example of why some are so concerned with AI alignment.
The game rules were ambiguous and the LLM did what it needed to (and was allowed to) to win. It probably is against the spirit of the game to look things up online at all but no one thought to define that rule beforehand.
umanwizard
No, the game rules aren't ambiguous. This is 100% unambiguously cheating. From the list of things that are definitely considered cheating in the rules:
> using Google or other external sources of information as assistance during play.
The contents of URLs found during play is clearly an external source of information.
spookie
A human could also use the same tools if it weren't for the rules and fair play. They should've simply redone the test.
ceph_
The AI should be forced to use the same rules as the human. Not the other way around. The AI shouldn't be using outside resources.
krferriter
An AI being better than a human at doing a google search and then skimming a bunch of pages to find location-related terms isn't as interesting of a result.
arandomhuman
But then they couldn't make a clickbait title for the article.
silveraxe93
Yeah, the author does note that in the article. He also points it out in the conclusion:
> If it’s using other information to arrive at the guess, then it’s not metadata from the files, but instead web search. It seems likely that in the Austria round the web search was meaningful, since the website it mentioned named the town itself. It appeared less meaningful in the Ireland round. It was still very capable in the rounds without search.
rafram
Seems like they should've just repeated the test. But without the huge point lead from the rounds where it cheated, it wouldn't have looked very impressive at all.
silveraxe93
People found the original post so impressive they were saying the results had to come from cheating by looking at EXIF data. The point of this article was to show it doesn't. It got an unfair advantage in 1 (arguably 1.5) out of 5 rounds, and the non-search rounds still did great.
If you think this is unimpressive, that's subjective so you're entitled to believe that. I think that's awesome.
SamPatt
I did repeat the test without search, and updated the post. It made no difference. Details here:
SamPatt
That's inaccurate. It beat me by 1,100 points, and given that the chain of thought demonstrated it knew the general region of both guesses before it employed search, it would likely have still beaten me in those rounds. Though probably by fewer points.
I will try it again without web search and update the post though. Still, if you read the chain of thought, it demonstrates remarkable capabilities in all the rounds. It only used search in 2/5 rounds.
clhodapp
The question is not only how much it helped the AI model but rather how much it would have helped the human.
This is because the AI model could have chosen to run a search whenever it wanted (e.g. perhaps if it knew how to leverage search better, it could have used it more).
In order for the results to be meaningful, the competitors have to play by the same rules.
ACS_Solver
I just tried (o4-mini-high) and had it come to the wrong conclusion when I asked about the location and date, because it didn't search the web. I have a photo of a bench with a sign mentioning the cancellation of an event due to the Pope's death. It impressively figured out the location but then decided that Pope Francis is alive and the sign is likely a prank, so the photo is from April Fools day.
Then after I explicitly instructed it to search the web to confirm whether the Pope is alive, it found news of his death and corrected its answer, but it was interesting to see how the LLM makes a mistake due to a major recent event being after its cutoff.
ricardo81
>isn't playing fair.
The idea of having n more dimensions of information, readable and ingestible within a short frame of time, probably isn't either.
WhitneyLand
As models continue to evolve it may not even need to cheat.
Since web-scale data is already part of pre-training, this info is in principle available for most businesses without a web search.
The exceptions would be if it’s recently added, or doesn’t appear often enough to generate a significant signal during training, as in this case with a really small business.
It’s not hard to imagine base model knowledge improving to the point where it’s still performing at almost the same level without any web search needed.
layman51
Using the decal as a clue is funny, because what if there were a street scene where that happened to be misleading? For example, I've seen that a Sacramento County Sheriff car made it to Europe, and I guess it now belongs to a member of the public who drives it with the original decals still attached. I wonder how the LLM would reason if it sees the car as "out of place".
yen223
Stands to reason a human might get fooled by this as well
SamPatt
Absolutely!
It happens occasionally - the most common example I can think of is getting a license plate or other location clue from a tractor-trailer (semi) on the highway. Those are very unreliable.
You also sometimes get flags in the wrong countries - immigrants showing their native pride, or even embassies.
victorbjorklund
Probabilities. That could happen with anything. Someone could build a classic Japanese house with a Japanese garden in Hawaii. But Japan is probably a better guess if you see a Japanese house with Japanese flora.
CamperBob2
To be fair, my local copy of R1 isn't doing any searching at all, but it frequently says "A search suggests..." or something along those lines.
SamPatt
Author here, I'm glad to see folks find this interesting.
I encourage everyone to try Geoguessr! I love it.
I'm seeing a lot of comments saying that the fact that the o3 model used web search in 2 of 5 rounds made this unfair, and the results invalid.
To determine if that's true, I re-ran the two rounds where o3 used search, and I've updated the post with the results.
Bottom line: It changed nothing. The guesses were nearly identical. You can verify the GPS coordinates in the post.
Here's an example of why it didn't matter. In the Austria round, check out how the model identifies the city based on the mountain in the background:
https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
It already has so much information that it doesn't need the search.
Would search ever be useful? Of course it would. But in this particular case, it was irrelevant.
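If you want to sanity-check the re-run yourself, the comparison is just great-circle distance between guess and target. A rough sketch (the coordinates below are illustrative placeholders, not the actual round locations, and the scoring line is a community approximation, not GeoGuessr's official formula):

```python
from math import radians, sin, cos, asin, sqrt, exp

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Illustrative: a guess near Dornbirn vs. one near Bezau (~12 km apart).
d = haversine_km(47.4125, 9.7417, 47.3850, 9.9010)

# Community-reverse-engineered world-map score approximation (not official):
score = 5000 * exp(-10 * d / 14916.862)
print(f"{d:.1f} km -> ~{score:.0f} points")
```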
bjourne
What's your take on man vs. machine? If AI already beats Master-level players, it seems certain that it will soon beat the GeoGuessr world champion too. Will people still derive pleasure from playing it, like with chess?
SamPatt
>Will people still derive pleasure from playing it, like with chess?
Exactly - I see it just like chess, which I also play and enjoy.
The only problem is cheating. I don't have an answer for that, except right now it's too slow to do that effectively, at least consistently.
Otherwise, I don't care that a machine is better than I am.
jvvw
I'm Master level at GeoGuessr - it's a rank where you definitely have to know what you are doing, but it isn't as high as it probably sounds from the headline.
Masters is about 800-1200 ELO whereas the pros are 1900-2000ish. I'll know the country straight away on 95% of rounds but I can still have no idea where I am in Russia or Brazil sometimes if there's no info. Scripters can definitely beat me!
SamPatt
Yeah I added a "My skill level" section to talk through that. I'm far from a professional.
But I know enough to be able to determine if the chain of thought it outputs is nonsense or comparable to a good human player. I found it remarkable!
paulcole
Gotta learn your Brazilian soil!
windowshopping
Was it worth it?
rosstex
I have 2000+ hours in Team Fortress 2. Was it worth it?
Cyph0n
Yes, it was. Granted, I probably have more than that.
650REDHAIR
Yes? It’s fun.
make3
it's a game, that's like asking why a public service is not profitable
OtherShrezzing
It's my understanding that o3 was trained on multimodal data, including imagery. Is it unreasonable to assume its training data includes images of these exact locations and features? GeoGuessr uses Google Maps, and Google Maps purchases most of its imagery from third parties these days. If those third parties aren't also selling to all the big AI companies, I'd be very surprised.
pests
> Google Maps purchases most of its imagery from third-parties these days
Maps maybe, but Street View? Rainbolt just did a video with two Maps PMs recently, and it sounds like they still source all their Street View imagery themselves, considering the special camera and car needed, etc.
OtherShrezzing
Maybe the end user isn't Google Maps, but TomTom has a pretty comprehensive street-view-ish product, called MoMa, for private buyers like car companies, Bing, and Apple Maps.
I'd be surprised if this building[0] wasn't included in their dataset from every road-side angle possible, alongside every piece of locational metadata imaginable, and I'd be surprised if that dataset hasn't made it into OpenAI's training data - especially when TomTom's relationship to Microsoft, and Microsoft's relationship to OpenAI, is taken into account.
[0] https://cdn.jsdelivr.net/gh/sampatt/media@main/posts/2025-04...
cpeterso
Here's a link to that interview: https://youtu.be/2T6pIJWKMcg
shpx
You can upload your own panoramic images to Street View; people do this for hiking trails. But I'm sure 99% of Street View imagery is Google-sourced, and GeoGuessr might not even use user-submitted imagery.
pests
I believe GeoGuessr categorizes its games on this facet. Rainbolt plays only on official imagery.
mikeocool
My understanding is you're correct -- Google still captures a lot of their own street view imagery.
Though there are other companies that capture the same sorts of imagery and license it. TomTom imagery is used on the Bing Maps street view clone.
Yenrabbit
Try it with your own personal photos. It is scarily good!
rafram
That's true for heavily photographed urban areas. I've tried it on some desert photos (even at identifiable points with human structures) and it basically just guesses a random trailhead in Joshua Tree and makes up a BS explanation for why it matches.
kube-system
I have had surprisingly good luck with beach photos that don’t have much beyond dunes and vegetation in them
thrance
A machine that's read every book ever written, seen every photo ever taken, visited every street on Earth... That feels a little frightening.
GaggiX
It does work well with images you have taken, not just Geoguessr: https://simonwillison.net/2025/Apr/26/o3-photo-locations/
thi2
> I’m confident it didn’t cheat and look at the EXIF data on the photograph, because if it had cheated it wouldn’t have guessed Cambria first.
Hm, no way to be sure though. It would be nice to do another run without EXIF information.
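Stripping the metadata for such a run is easy enough - a minimal sketch with Pillow (filenames are placeholders) that rebuilds the image from raw pixels so nothing from the original EXIF block survives:

```python
from PIL import Image

src = Image.open("round_photo.jpg")

# Rebuild the image from raw pixel data so no metadata
# (GPS coordinates, timestamps, camera model) carries over.
clean = Image.new(src.mode, src.size)
clean.putdata(list(src.getdata()))
clean.save("round_photo_noexif.jpg")

# Sanity check: the copy should report no EXIF at all.
assert not Image.open("round_photo_noexif.jpg").getexif()
```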
arm32
GeoGuessr aside, I really hope that this tech will be able to help save kids someday, e.g. help with FBI's ECAP (https://www.fbi.gov/wanted/ecap).
parsimo2010
Looking at those photos, those are some crazy hard pictures - masked regions of the image, partially cropped faces, blurry shots, pictures of the insides of rooms. I don't think any current LLM is going to be able to Sherlock Holmes its way into finding any of those people.
Maybe they will one day if there's a model trained on a facial recognition database with every living person included.
thrance
I wouldn't put too much hope in this technology bringing more good than harm to the world.
mopenstein
But it will bring some percentage of good and some percentage of bad. Which ain't half bad, if you ask me.
martinsnow
What do you do when it flags you or someone you know who's innocent? Blindly trusting these models without any verification will put innocent people in prison. Normal people don't understand why the models are so confident: they're confident because they believe all the data they have is correct. I foresee a future with many faux trials because people don't understand critical thinking.
moritzwarhier
What a quip! What if it's 51% bad?
ketzo
If we don’t actively try to identify and implement positive use cases, then yes, it’ll definitely bring more harm than good.
Isn’t that all the more reason to call out our high hopes?
thrance
I don't know what in my comment made you think I was opposed to seeking positive applications of this technology.
From the guidelines:
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
mkoubaa
The bad is already priced in. Nothing wrong with hoping for more good.
jampa
I was trying to play with o3 this week to see how closely it can identify things, and, interestingly, it leans more on pattern matching than on its own "logical deduction". For example, it can easily deduce any of my photos from Europe and the US, because there are many similar pictures online to match against.
However, when there are not many photos of the place online, it gets close but stops digging deeper, and instead tries to pattern-match against its corpus / the internet.
One example was an island's popular trail that no longer exists; it has been overgrown since 2020. It first said that the rocks were typical of an island and the vegetation was Brazilian, but then it ignored its hunch and tried to look for places in Rio de Janeiro.
Another one was a popular beach known for its natural pools during low tides. I took a photo during high tide, when no one posts pictures. It captured the vegetation and the state correctly. But then it started to search for more popular places elsewhere again.
orangecat
Amazing. I'm relatively bullish on AI and still I would have bet on the human here. Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
usaar333
Why? AI beat rainbolt 1.5 years ago: https://www.npr.org/2023/12/19/1219984002/artificial-intelli...
AI tends to have superhuman pattern matching abilities with enough data
karlding
If you watch the video, (one of) the reasons why the AI was winning was because it was using “meta” information from the Street View camera images, and not necessarily because it’s successfully identifying locations purely based on the landmarks in the image.
> I realized that the AI was using the smudges on the camera to help make an educated guess here.
ApolloFortyNine
Pro geoguessr players do the same thing. The vividness of the colors and weirdness in the sky are two examples I've seen Rainbolt use in the past (and he's not even the best).
ZeWaka
Meta is widely used by humans. One funny one is the different hiding-masks for the different streetview cars.
InkCanon
I think if your assumption was that the AI is deducing where it is through rational thought, you would have bet on the human too. In truth, what probably happened is that the significant majority of digital images of the world have been scraped, labeled, and used as training data.
Philpax
Try it with your own photos from around the world. I used my own photos from Stockholm, San Francisco, Tvarožná, Saas-Fee, London, Bergen, Adelaide, Melbourne, Paris, and Sicily, and can confirm that it was within acceptable range for almost all of them (without EXIF data), and it absolutely nailed some of the more obvious spots.
oncallthrow
How do you explain https://simonwillison.net/2025/Apr/26/o3-photo-locations/?
Rumudiez
They only posted one photo in the post, but going off of that, it's still an easy match based on Street View imagery. Furthermore, the AI just identified the license plate and got lucky that the photographer lives in a populous area, making it more prominent in the training data and therefore more likely to be found (even though it was off by 200 miles on its first guess).
TimTheTinker
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
It's less about the definition of "reasoning" and more about what's interesting.
Maybe I'm wrong here... but a chess bot that wins via a 100% game solution stored in exabytes of precomputed data might have an interesting internal design (at least the precomputing part), yet playing against it wouldn't stay interesting for most people, because it always wins optimally and there's no real-time reasoning going on (that is, unless you're interested in the experience of playing against a perfect player). For most people just interested in playing chess, I suspect it would get old quickly.
Now ... if someone followed up with a tool that could explain insightfully why any given move (or series) the bot played is the best, or showed when two or more moves are equally optimal and why, that would be really interesting.
SirHumphrey
My objection is not “that is not real reasoning”; my objection is “that's not that hard.”
I happen to do some geolocating from static images from time to time, and at least most of the images provided as examples contain a lot of clues - enough that I think a semi-experienced person could figure out the location, although, in fairness, in a few hours, not a few minutes.
Second, similar approaches were tried using CNNs, and they worked (somewhat) [1].
[1]: https://huggingface.co/geolocal/StreetCLIP
EDIT: I am not talking about GeoGuessr - I am talking about geolocating an image with everything available (e.g. Google…)
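For reference, [1] is a zero-shot CLIP model, so querying it takes only a few lines with the transformers library. A minimal sketch (the image path and candidate list are made up for illustration):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# StreetCLIP is a CLIP model tuned for street-level geolocation;
# zero-shot classification over candidate place names works out of the box.
model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

image = Image.open("street_scene.jpg")
candidates = ["Austria", "Ireland", "Colombia", "Brazil", "Japan"]

inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

for country, p in sorted(zip(candidates, probs.tolist()),
                         key=lambda x: -x[1]):
    print(f"{country}: {p:.2%}")
```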
TimorousBestie
I don’t think any goalposts need to be redecorated. The “inner monologue” isn’t a reliable witness to o3’s model, it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance. So its “testimony” about what it is doing is unreliable, and therefore it doesn’t move the needle on whether or not this is “real reasoning” for some value of that phrase.
In short, it’s still anthropomorphism and apophenia locked in a feedback loop.
katmannthree
Devil's advocate: as with most LLM issues, this applies to the meatbags that generated the source material as well. A quick example is asking someone to describe their favorite music and why they like it, and noting the probable lack of reasoning on the `this is what I listened to as a teenager` axis.
hombre_fatal
Good point. When we try to explain why we're attracted to something or someone, what we do seems closer to modeling what we like to think about ourselves. At the extreme, we're just storytelling about an estimation we like to think is true.
ewoodrich
Something as inherently subjective as personal preference doesn't seem like an ideal example to make that point. How could you expect to objectively evaluate something like "I enjoy songs in a minor scale" or "I hate country"?
TimorousBestie
I largely agree! Humans are notoriously bad at doing what we call reasoning.
I also agree with the cousin comment that (paraphrased) “reasoning is the wrong question, we should be asking about how it adapts to novelty.” But most cybernetic systems meet that bar.
empath75
I don't think the inner monologue is evidence of reasoning at all, but doing a task which can only be accomplished by reasoning is.
TimorousBestie
Geoguessr is not a task that can only be accomplished by reasoning. Famously, it took less than a day of compute time in 2011 to SLAM together a bunch of pictures of Rome (https://grail.cs.washington.edu/rome/).
jibal
Such as? Geoguessing certainly isn't that.
red75prime
> it’s at best a post-hoc estimation of what a human inner monologue might be in this circumstance
Nope. It's not autoregressive training on examples of human inner monologue. It's reinforcement learning on the results of generated chains of thoughts.
jibal
"It's reinforcement learning on the results of generated chains of thoughts."
No, that's not how LLMs work.
s17n
Geoguessing isn't much of a reasoning task; it's more about memorizing a bunch of knowledge. Since LLMs contain essentially all knowledge, it's not surprising that they would be good at this.
As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
Macha
Obviously, when the Turing Test was designed, the thought was that anything that could pass it would be so clearly human-like that passing it would be a clear signal.
LLMs really made it clear that it's not so clear cut. And so the relevance of the test fell.
zahlman
Look at contemporary accounts of what people thought a conversation with a Turing-test-passing machine would look like. It's clear they had something very different in mind.
Realizing problems with previous hypotheses about what might make a good test, is not the same thing as choosing a standard and then revising it when it's met.
distortionfield
Because the Chinese Room is a much better analogy for what LLMs are doing inside than the Turing test is.
CamperBob2
What happens if we give the operator of the Chinese Room a nontrivial math problem, one that can't simply be answered with a symbolic lookup but requires the operator to proceed step-by-step on a path of inquiry that he doesn't even know he's taking?
The analogy I used in another thread is a third grader who finds a high school algebra book. She can read the book easily, but without access to teachers or background material that she can engage with -- consciously, literately, and interactively, unlike the Chinese Room operator -- she will not be able to answer the exercises in the book correctly, the way an LLM can.
jibal
That's a non sequitur that mixes apples and giraffes, and is completely wrong about what happens in the Chinese Room and what happens in LLMs. Ex hypothesi, the "rule book" that the Searle homunculus in the Chinese Room uses is "the right sort of program" to implement "Strong AI". The LLM algorithm is very much not that sort of program, it's a statistical pattern matcher. Strong AI does symbolic reasoning, LLMs do not.
But worse, the Turing Test is not remotely intended to be an "analogy for what LLMs are doing inside" so your comparison makes no sense whatsoever, and completely fails to address the actual point--which is that, for ages the Turing Test was held out as the criterion for determining whether a system was "thinking", but that has been abandoned in the face of LLMs, which have near perfect language models and are able to closely model modes of human interaction regardless of whether they are "thinking" (and they aren't, so the TT is clearly an inadequate test, which some argued for decades before LLMs became a reality).
bluefirebrand
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days
To be honest I am still not entirely convinced that current LLMs pass the turing test consistently, at least not with any reasonably skeptical tester
"Reasonably Skeptical Tester" is a bit of goalpost shifting, but... Let's be real here.
Most of these LLMs have way too much of a "customer service voice"; it's not very conversational, and I think it's fairly easy to identify, especially if you suspect you're talking to an LLM and start to probe its behavior.
Frankly, if the bar for passing the Turing Test is "it must fool some number of low intelligence gullible people" then we've had AI for decades, since people have been falling for scammy porno bots for a long time
jibal
One needs to be more than "reasonably skeptical" and merely not "low intelligence gullible" to be a competent TT judge--it requires skill, experience, and understanding an LLM's weak spots.
And the "customer service voice" you see is one that is intentionally programmed in by the vendors via baseline rules. They can be programmed differently--or overridden by appropriate prompts--to have a very different tone.
LLMs trained on trillions of human-generated text fragments available from the internet have shown that the TT is simply not an adequate test for identifying whether a machine is "thinking"--which was Turing's original intent in his 1950 paper "Computing Machinery and Intelligence" in which he introduced the test (which he called "the imitation game").
TimorousBestie
A lot happens in seventy-five years.
jibal
People were talking about the Turing Test as the criterion for whether a system was "thinking" up until the advent of LLMs, which was far less than 75 years ago.
sundarurfriend
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
UCSD: Large Language Models Pass the Turing Test https://news.ycombinator.com/item?id=43555248
From just a month ago.
s17n
Exactly - maybe the most significant long-term goal in computer science history has been achieved and it's barely discussed.
darkwater
> As far as goalpost-moving goes, it's wild to me that nobody is talking about the turing test these days.
Well, in this case humans had to be trained as well, but now there are humans pretty good at detecting LLM slop too. (I'm half-joking and half-serious)
zahlman
> Looking forward to the inevitable goalpost-moving of "that's not real reasoning".
How is that moving the goalposts? Where did you see them set before, and where did your critics agree to that?
short_sells_poo
Can you please explain to me how this is evidence for reasoning?
z7
Quoting Chollet:
>I have repeatedly said that "can LLM reason?" was the wrong question to ask. Instead the right question is, "can they adapt to novelty?".
kelseyfrog
Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.
I have a simple question: Is text a sufficient medium to render a conclusion of reasoning? It can't be sufficient for humans and insufficient for computers - such a position is indefensible.
zahlman
> Because the output contains evidence of thought processes that have been established as leading to valid solutions to problems.
This sort of claim always just reminds me of Lucky's monologue in Waiting for Godot.
empath75
I would say that almost all of what humans do is not the result of reasoning, and that reasoning is an unnatural and learned skill for humans, and most humans aren't good at even very basic reasoning.
parsimo2010
My comment from the previous post:
> I’m sure there are areas where the location guessing can be scary accurate, like the article managed to guess the exact town as its backup guess. But seeing the chain of thought, I’m confident there are many areas that it will be far less precise. Show it a picture of a trailer park somewhere in Kansas (exclude any signs with the trailer park name and location) and I’ll bet the model only manages to guess the state correctly.
This post, while not a big sample size, reflects how I would expect these models to perform. The model managed to be reliable at guessing the right country, even in pictures without a lot of visual information (I'll claim that getting the country correct in Europe is roughly equivalent to guessing the right state in the USA). It does sometimes manage to get the correct town, but this is not a reliable level of accuracy. The previous article tested only one picture, it happened to get the correct town as its second guess, and the author called it "scary accurate." I suppose that's a judgment call. I've grown to expect that people can identify what country I'm in from a variety of things (IP address, my manner of speech, name, etc.), so I don't think that is "scary."
I will acknowledge that o3 with web search enabled seems capable of playing GeoGuessr at a high level, because that is less of a judgment call. What I want to see now is an o3 GeoGuessr bot that plays many matches, so we can see what its Elo is.
asdsadasdasd123
This is probably one of the less impressive LLM applications, imo. It already knows what every plant, street sign, etc. is. I would imagine a traditional neural net would do really well here too, if you can extract some crude features.
EGreg
Can't the same be said about "unimpressive" behavior by coding LLMs that know every algorithm, language, and library?
asdsadasdasd123
Disagree, because code has to be far more precise than "the location is in the jungles of Brazil." This level of coding has never been achievable by traditional ML methods AFAIK.
ksec
>But several comments intrigued me:
>>I wonder What happened if you put fake EXIF information and asking it to do the same. ( We are deliberately misleading the LLM )
Yay. That was me [1], which was actually downvoted for most of its time. But thank you for testing out my theory.
What I've realised over the years is that comments do get read by people and do shape other people's thoughts.
I honestly don't think looking things up online is cheating. Maybe in terms of the game. But in a real-life situation, which is most of the time, it is absolutely the right thing to do. The chain of thought is scary. I still don't know anything about how AI works other than the old "garbage in, garbage out." But CoT is definitely something else. Even though the author said it sometimes does needless work, in terms of computing resources I am not even sure that matters as long as it is accurate. And it is another proof that maybe, just maybe, AI taking over the world is much closer than I imagined.
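And on the fake-EXIF half of the theory: planting misleading GPS tags takes only a few lines too. A sketch with the piexif library (coordinates and filenames are made-up examples - Tokyo tags written onto a photo taken elsewhere):

```python
import piexif

def to_dms_rationals(deg):
    # EXIF stores coordinates as degree/minute/second rational pairs.
    d = int(deg)
    m = int((deg - d) * 60)
    s = round(((deg - d) * 60 - m) * 60 * 100)
    return ((d, 1), (m, 1), (s, 100))

# Illustrative: tag the photo as if it were taken in Tokyo.
gps = {
    piexif.GPSIFD.GPSLatitudeRef: b"N",
    piexif.GPSIFD.GPSLatitude: to_dms_rationals(35.6762),
    piexif.GPSIFD.GPSLongitudeRef: b"E",
    piexif.GPSIFD.GPSLongitude: to_dms_rationals(139.6503),
}
exif_bytes = piexif.dump({"GPS": gps})
piexif.insert(exif_bytes, "photo.jpg", "photo_fake_gps.jpg")
```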
exitb
I tried a picture of Dublin and it pointed out the hotel I took it from. Obviously that’s more data than any single person can keep in their head.
godelski
There are two important things here to consider when reading:
1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either
2) OP was much quicker. They didn't record their time, but if their final summary is accurate, then they were much faster.
It's an apples-to-oranges comparison. They're both fruit and round, but you're ignoring obvious differences. You're cherry-picking.
The title is fraudulent as you can't make a claim like that when one party cheats.
I would find it surprising if OP didn't know these rules considering their credentials. Doing this kind of clickbait completely undermines a playful study like this.
Certainly o3 is impressive, but by exaggerating its capabilities you taint any impressive feats with deception. It's far better to undersell than oversell. If it's better than expected, people are happier, even if the thing is crap. But if you oversell, people are angry and feel cheated, even if the thing is revolutionary. I don't know why we insist on doing this in tech, but if you're wondering why so many people hate "tech bros", this is one of the reasons. There's no reason to lie here either! Come on! We can't just normalize this behavior. It just creates a reasonable expectation for people to distrust technology and anything tech people say. It's pretty fucked up. And no, I don't think "it's just a blog post" makes it any better. It makes it worse, because it normalizes the behavior. There are other reasons to distrust big corporations; I don't want to live in a world where we have to keep our guards up all the time.
SamPatt
>1) O3 cheated by using Google search. This is both against the rules of the game and OP didn't use search either
I re-ran it without search, and it made no difference:
https://news.ycombinator.com/item?id=43837832
>2) OP was much quicker. They didn't record their time but if their final summary is accurate then they were much faster.
Correct. This was the second bullet point of my conclusion:
>Humans still hold a big edge in decision time—most of my guesses were < 2 min, o3 often took > 4 min.
I genuinely don't believe that I'm exaggerating or this is clickbait. The o3 geolocation capability astounded me, and I wanted to share my awe with others.