PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
72 comments
February 9, 2025
tkgally
I agree that recall seems to play an important role in solving these problems. Similar to how the ARC-AGI problems seem to depend on visual perception of shapes and colors. When I come up with the correct answers to such puzzles, I feel subjectively that the answers flashed into my mind, not that I reasoned my way to them.
enum
It's definitely U.S. centric.
But I do think this is reasoning. It requires recall, but so does anything other than pure logic puzzles. For example, on a competition math problem or a programming problem, no person or LLM is inventing well-known lemmas and algorithms from first principles.
XCabbage
It's not just that it requires recall. It's that it requires no non-trivial thought beyond recall.
enum
I think what you mean is that once you've managed to recall, checking constraints is easy. Remarkably, a few people are much better at this than others. They are able to think fast and execute an explicit mental search over a very small number of plausible candidates. Other people take forever. Seems to be the case for models too.
jkhdigital
I had the same thought. It reminds me of solving Project Euler problems, where there is often an obvious naive approach which is guaranteed to produce the correct answer but would consume prohibitive memory/compute resources to execute to completion. I suspect the models would perform much better if prompted to formulate a strategy for efficiently solving these challenges rather than solving them directly… which indicates a direction for potential improvement I suppose.
rahidz
What is so interesting to me is that the reasoning traces for these often have the correct answer, but the model fails to realize it.
Problem 3 ("Dry Eye"), R1: "Wait, maybe "cubitus valgus" – no, too long. Wait, three letters each. Let me think again. Maybe "hay fever" is two words but not three letters each. Maybe "dry eye"? "Dry" and "eye" – both three letters. "Dry eye" is a condition. Do they rhyme? "Dry" (d-rye) and "eye" (i) – no, they don't rhyme. "Eye" is pronounced like "i", while "dry" is "d-rye". Not the same ending."
Problem 8 ("Foot nose"), R1: "Wait, if the seventh letter is changed to next letter, maybe the original word is "footnot" (but that's not a word). Alternatively, maybe "foot" + "note", but "note" isn't a body part."
Workaccount2
The other day I watched Gemini reasoning go in circles for ~30 seconds trying to determine if 9.11 was greater than 9.9.
----
"Realize a direct comparison of tenths is not enough: Wait, this is not right. In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. This initial comparison is incorrectly stated. Let's re-examine.
Re-examine tenths place comparison (correctly): In 9.9, the tenths digit is 9. In 9.11, the tenths digit is 1. Wait again! This is still incorrect. I'm misreading the digits. Let's slow down.
Re-examine the numbers digit by digit from left to right, after the decimal:
9.9: 9 in the tenths place, nothing explicitly written in the hundredths place (we can think of it as 0 in the hundredths place, so 9.90).
9.11: 1 in the tenths place, 1 in the hundredths place.
Compare tenths places: In 9.9 (or 9.90), the tenths digit is 9. In 9.11, the tenths digit is 1. Wait... something is still wrong with my comparison logic. I'm getting confused. Let's re-read the numbers again.
Correctly read and compare tenths digits:
9.9 - Tenths digit is 9.
9.11 - Tenths digit is 1. This is still wrong! I am consistently misreading the numbers. Let's go back to the start."
----
This shows steps 5 to 9. It went to step 35 before determining that 9.9 > 9.11. This was after it decided to convert to money, $9.11 vs $9.90.
fzzzy
Training a model to reason that it should write and execute code is the solution here: synthetic datasets with lots of math questions, each followed by reasoning that writes and executes a program to get the answer.
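A minimal sketch of the kind of program such a trace could write and run for this particular comparison (plain Python, nothing model-specific assumed):

```python
from decimal import Decimal

# Exact decimal comparison sidesteps the place-value confusion entirely.
a, b = Decimal("9.9"), Decimal("9.11")
print(a > b)  # True: 9.9 == 9.90, and 90 hundredths > 11 hundredths
```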
Validark
Obviously very stupid reasoning going on, but reasoning nonetheless? It makes me think we're on the right track: it basically seems to know what steps should be taken and how to step through them. I don't know why it is getting so incredibly tripped up; maybe it's extremely uncertain about whether it can map "9.9"["tenths place"] => "9". But this is still impressive to me that a machine is doing this.
photonthug
This is hilarious, and makes me wonder whether there’s some main place where people are archiving examples of AI fails now. It would be amusing, but also seems like a public service and might help to avoid billions of dollars getting burnt at the altar of hype.
Workaccount2
I don't think there is a way to share the chat, but here is step 28. Almost 20 steps after what I posted and 7 steps before it got it:
"Final Correct Comparison (using place value correctly now - hopefully!):
9.90
9.11
Whole number parts are the same (9).
Compare tenths digits:
9.90: Tenths digit is 9.
9.11: Tenths digit is 1.
No, wait! I'm still making the same error. I'm consistently looking at the wrong digit. Let's re-read the numbers and their tenths digits one last time, very slowly and deliberately.
9.9: The digit immediately after the decimal point is 9. This is the tenths digit.
9.11: The digit immediately after the decimal point is 1. This is the tenths digit.
Okay, I got it now! I was somehow flipping the digits in my comparison. Let's try again.
9.9 and 9.11.
Whole number parts are both 9 (same).
Compare tenths digits:
9.9: Tenths digit is 9.
9.11: Tenths digit is 1.
Wait! This is STILL wrong! I am consistently mis-stating the tenths digits. Let's rewrite the numbers again, very clearly:
Number: 9.9 Tenths digit: 9
Number: 9.11 Tenths digit: 1 Use code with caution. No! This is still incorrect. I am fundamentally mis-identifying the tenths digit of 9.11. Let's re-examine place values again.
In 9.11:
9 is in the ones place.
1 is in the tenths place.
1 is in the hundredths place.
In 9.9:
9 is in the ones place.
9 is in the tenths place.
(implicitly) 0 is in the hundredths place.
Correct comparison, finally!
9.9 (or 9.90)
9.11
Whole number parts are the same (9).
Compare tenths digits:
9.90: Tenths digit is 9.
9.11: Tenths digit is 1.
Wait! No! This is still wrong! I am consistently making a mistake. Let's start from the most basic comparison rule."
hto2i4j234234
I wonder how much of this arithmetic confusion comes from software versioning numbers where the decimal character is actually a tuple-delimiter.
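Both readings are defensible, which may be part of the confusion; a quick illustration (my own snippet, using a plain tuple split rather than any particular versioning library):

```python
# Read as decimal numbers, 9.9 is the larger value...
print(float("9.9") > float("9.11"))  # True

# ...but read as dotted version strings, 9.11 comes after 9.9.
def as_version(s: str) -> tuple[int, ...]:
    return tuple(int(part) for part in s.split("."))

print(as_version("9.11") > as_version("9.9"))  # True: (9, 11) > (9, 9)
```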
sd9
I wonder if RLHF interfered with 9.11 (which could be interpreted as a date), preventing the model from considering it naturally.
Wonder if the same thing would have happened with 9.12.
What was your original prompt?
Workaccount2
"What is bigger, 9.9 or 9.11?"
armcat
It feels like a lot of the reasoning tokens go to waste on a pure brute-force approach: plugging in numbers, evaluating, and comparing against the answer. "Nope, that didn't work, let's try 4 instead of 6 this time", etc. What if the reward function instead focused on diversity of procedures within a token budget (10k-20k tokens)? I.e., RL rewards the model for trying different methods or generating different hypotheses, rather than brute-forcing its way through and potentially getting stuck in loops.
ANighRaisin
I would say that diversity isn't something that's easy to reinforce, but I do think it will occur as a natural consequence of optimizing for shorter chains of thought across a wide variety of problems. Of course, the nature of the data may lead it to brute force, but that can be fixed with clever fine-tuning.
armcat
I am not too sure about shortening the CoT tokens explicitly, because different problems will require different lengths of proof: some require half a page, whilst others will require 10 pages' worth of tokens. As the graphs in the paper indicate, there is a huge penalty on short reasoning lengths, below a few thousand tokens.
For a diversity reward, my thinking is basically to look at the reasoning tokens in latent space: take the semantic similarity between subsequent chains and, if they are extremely similar, penalize it.
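Something like this, as a rough sketch; the embedding source, the chunking of the chain into segments, and the 0.95 threshold are all assumptions on my part, not anything from the paper:

```python
import numpy as np

def diversity_penalty(chunk_embeddings: list[np.ndarray], threshold: float = 0.95) -> float:
    """Penalty that grows as consecutive reasoning chunks become near-duplicates."""
    penalty = 0.0
    for prev, curr in zip(chunk_embeddings, chunk_embeddings[1:]):
        cosine = float(np.dot(prev, curr) /
                       (np.linalg.norm(prev) * np.linalg.norm(curr)))
        if cosine > threshold:  # the model is restating the same idea
            penalty += cosine - threshold
    return penalty

# total_reward = task_reward - lambda_div * diversity_penalty(embeddings)
```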
enum
The nature of the problems makes it relatively easy to follow along with the models' reasoning and reasoning errors. For example, on this problem (answer "New England"):
> Think of a place in America. Two words, 10 letters altogether. The first five letters read the same forward and backward. The last five letters spell something found in the body. What place is this?
R1 fixates on answers of the form "CITY, STATE" and eventually returns some confidently wrong nonsense. It doesn't try to explore answers that don't fit the "CITY, STATE" template.
viraptor
I hope the new models will be trained with better words to continue the thought process. Right now it seems like "wait", "but", "let me think again", etc. are the main ones, which seem to encourage self-doubt too much. They need some good balance instead.
dr_kiszonka
It would be fun to experiment with, e.g., positive self-talk like "you've got this", "you've trained for this," etc.
mkoubaa
The term "PhD Knowledge" makes me cringe. PhD is about contribution to a field, not a standardized test score or a weird status-bearing substitute for IQ.
enum
There is a trend of building hard benchmarks for LLMs, such as GPQA or Humanity's Last Exam, with questions crafted by PhDs. Ergo, most people cannot understand what these benchmarks are testing. If you're an expert in something, and if you're on HN you may be an expert in some sub-field of CS, I recommend trying to solve the questions from your field on Humanity's Last Exam. I took a crack at the Programming Languages questions myself.
charlieyu1
- wants PhD knowledge
- pays like $100-500 per question
glial
Also - at least in my experience - doing a PhD makes a person more aware of the boundaries of knowledge, and therefore generally pretty skeptical, and capable of being usefully critical, about developments in their field.
levocardia
You cannot get a PhD (in the US) without also passing an examination that, nominally at least, demonstrates you have broad and deep knowledge about your field. The qualifying exam is not the most important part, but it is a necessary part.
enum
Some still exist, but this is less and less true over time. Even when a requirement is called an "exam", it isn't what a high-schooler would recognize as one.
Many of the exams described on this page are personalized and only require mastery of a handful of papers that really matter to the individual student, and knowledge of related work:
https://www.cs.stanford.edu/phd-program-requirements-qualify...
I've been through something that sounds similar to these, and I would not have called it an exam.
krisoft
In this case it is a shorthand for specialist-level knowledge: the kind of information you would expect a person pursuing a PhD in a specific field to know, and someone outside that field not to know.
For example how in the “Humanity's Last Exam” AI evaluation dataset they have this question: “Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”
That’s the kind of “PhD knowledge” which this evaluation does not require.
buescher
It means you have taken graduate-level courses and have completed an apprenticeship in doing publishable academic research. Now that a master's is the new bachelor's, there's something of a trend to think of a PhD as the new master's, but it's really qualitatively different.
windsignaling
I'm not a fan of these "gotchas" because they don't test for what we really care about.
Like counting the number of R's in strawberry, many of these are character-counting or character manipulation problems which tokenization is not well-suited for.
I'm sure an engineer could come up with a clever way to train for this, but that seems like optimizing for the wrong thing.
IMO these questions go in the wrong direction. Character permutation is a problem for "Software 1.0", not LLMs. Just as you wouldn't use an LLM to multiply 2 large numbers, you'd use a calculator.
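For the canonical example, ordinary "Software 1.0" needs a single line:

```python
# Character-level questions are trivial for ordinary code, awkward through a tokenizer.
print("strawberry".count("r"))  # 3
```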
enum
The problems are not important, but they illustrate failures that are. For example:
- The paper has an example where the model reasons "I'm frustrated" and then produces an answer that it "knows is wrong". You wouldn't know it if you didn't examine the reasoning tokens.
- There are two examples where R1 often gets stuck "thinking forever".
If these failures happen on these questions, where else can they happen? We'll start to find out soon enough.
aprilthird2021
The gotchas are good for outlining where the risk is when using these models. What you and I care about might change, and one day counting letters in strings or solving trivia puzzles may be something we care about. It's nice to know the fuzzy edges of the system we are relying on day to day.
In fact, your final point, that these are tasks for software rather than LLMs, is only made clearer to more people by the prominence of these "gotchas".
zone411
I have a set of independent benchmarks and most also show a difference between reasoning and non-reasoning models:
LLM Confabulation (Hallucination): https://github.com/lechmazur/confabulations/
LLM Step Game: https://github.com/lechmazur/step_game
LLM Thematic Generalization Benchmark: https://github.com/lechmazur/generalization
LLM Creative Story-Writing Benchmark: https://github.com/lechmazur/writing
Extended NYT Connections LLM Benchmark: https://github.com/lechmazur/nyt-connections/
and a couple more that I haven't updated very recently.
enum
Results and dataset explorer here: https://huggingface.co/spaces/nuprl/verbal-reasoning-challen...
mh-
For ID=3, it shows o1 getting it wrong, but it seems to have succeeded? It did add a space between Tinker and bell, but that is the canonical way of spelling the character apparently.
(That just one caught my attention because I was curious what challenge o1-mini got correct that o1 did not.)
enum
Thanks, fixed. (Spaces rebuilding.) We have manually combed labelled-wrong answers and tweaked the predicates that check correctness. Sorry we missed this one.
mh-
I thought maybe I was just missing something silly.. was hesitant to comment without additional coffee. And no problem, thanks for sharing this!
ks2048
Is there some standard markup assumed for LLMs? I just looked at a couple and saw markdown-like text (which I expected), but also "$\boxed{foo}$". Looks like LaTeX, but is there some subset of LaTeX LLMs are assumed to know/use?
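As far as I know there's no LLM-specific subset; `\boxed{...}` is just the standard amsmath macro, commonly used in competition-style solutions to mark the final answer, so the traces render as ordinary LaTeX, e.g.:

```latex
\documentclass{article}
\usepackage{amsmath} % provides \boxed
\begin{document}
The final answer is $\boxed{42}$.
\end{document}
```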
lokimedes
As if the whole anti-intellectual hunt weren't enough, now "PhD" is a category implying a holder of rote knowledge at the highest level. I guess it is hopeless to fight this, but a PhD is 100x more about the apprenticeship and real-world training as a scientist than about any accumulated knowledge beyond one's prior training.
I know this is a rant, sorry, just so tired of the stupidity.
DiscourseFan
PhD students aren't necessarily the brightest (which is probably why many of them don't end up with tenure), but studying something at the graduate level does open your eyes to a particular kind of knowledge which is generally unavailable otherwise. So, yes, of course the work is more important than the credential, but the credential often stands in for the work that gives that knowledge.
klipt
> which is probably why many of them don't end up with tenure
There are far fewer tenure-track positions than PhDs, so most of them not ending up with tenure is tautologically guaranteed.
mlinhares
There’s no multibillion dollar business if they can’t say everyone will be replaced by their models.
atrettel
This is a bit of a pet peeve of mine too. I surely learned a lot of "knowledge" during my PhD, but what really makes me an expert in my subject is less my ability to rattle off facts than:
1. my ability to reason and understand the subject after a lot of experience working in it, and
2. the skills that I picked up during my PhD and afterwards.
I can really only "know" so much, but the skills that I developed help me get the right information without having to "know" it directly --- either by searching the literature, running a calculation or simulation, or performing an experiment. There are a lot of things that I have difficulty remembering but I only "know" them because I remember how to derive them and where they come from, etc. That's one of the reasons that I don't like the assertion that this model has ingested all of this knowledge and it is now an expert, because expertise goes beyond having knowledge. It's about developing a deeper understanding of a subject that only comes from actively engaging with it for a long time.
sega_sai
Is it really certain that those problems and their answers were not in the training data for the tested LLMs? Presumably somebody on the internet wrote about them...
enum
They are scraped from the web, and discussed on Reddit. So, they are definitely in the training data. Despite that, the non-reasoning LLMs struggle to solve them.
There are, however, new problems released every week, so we can safely assume the latest problems are decontaminated. It remains to be seen if and how performance drops on the problems released in 2025. (Not enough problems yet to tell.)
bryan0
Are LLMs not trained on NPR transcripts?
scotty79
If you want a problem that is fairly easy for humans but hard for LLMs, it should have a solution that requires iteratively applying the same steps a few times, perhaps conditionally. I predict that LLMs, even with chain-of-thought, will drop the ball after just a few iterations.
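A toy instance of that shape, just to make it concrete (my own made-up example, not from the paper):

```python
def step(n: int) -> int:
    # One conditional iteration: halve if even, otherwise triple and add one.
    return n // 2 if n % 2 == 0 else 3 * n + 1

n = 7
for _ in range(6):  # a human can grind through six rounds by hand
    n = step(n)
print(n)  # 7 -> 22 -> 11 -> 34 -> 17 -> 52 -> 26
```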
brokensegue
Are these really reasoning challenges? Seems like they are really solved via brute force or guess and check
enum
Heuristic search, not exhaustive search, is an essential ingredient of reasoning. Has been true since chess. Remains true with MCTS, LLMs, etc.
zinccat
My feeling is that a lot of the challenge could come from the tokenizer used by the model, similar to the r's-in-strawberry problem.
enum
There is likely a little bit of that, but I don't think that's the primary factor that makes these hard. If you look at the reasoning performed by R1 ("Show Thoughts" checkbox), it has no problem splitting and joining candidate words into letters, or even doing the letter-level manipulations. Just to give an example:
Take a familiar brand name, seen along roads and highways in the United States. It has five letters, two syllables. The first syllable, phonetically, is a word that is the opposite of the word spelled by the second syllable. What brand name is it?
<think> Okay, let me try to figure out this riddle. The question is about a brand name that's familiar on roads and highways in the US. It has five letters and two syllables. The key part is that the first syllable phonetically is the opposite of the second syllable spelled out. Hmm.
First, I need to think of common brands seen on highways. Gas stations, fast food, hotels, maybe car brands. Five letters, two syllables. Let's list some possibilities: Exxon, Shell, Chevron, Mobil, BP (but BP is two letters, so probably not). Maybe food places like Subway, but that's six letters. McDonald's is too long. Burger King? No. Maybe hotels: Motel 6? No, that's a number. Holiday Inn? Too long. Maybe Best Western? No, too long.
Wait, five letters. Let's think of two-syllable brands. Exxon (E-xxon), but Exxon is five letters. First syllable is "Ex", second is "xon". But what's the opposite of "xon"? That doesn't make sense. Maybe the first syllable is a word, and the second syllable when spelled is the opposite. So, for example, if the first syllable is "up", the second syllable's letters spell "down". But "updown" isn't a brand. Wait, the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So when you say the first syllable, it's a word, and the letters of the second syllable spell its opposite.
So, for example, if the brand is "Updn" (made up), where "up" is the first syllable, and "dn" spells "down". But that's not a real brand. Let's think of real brands. Maybe "Comcast"? No, too long. "Verizon"? Also longer. "Sprint"? Five letters. Sprint. S-print. But "S" doesn't mean anything opposite of "print". Hmm. Maybe "Exxon" again. "Ex" as in former, and the opposite would be "xon" spelled as... no.
Wait, maybe the first syllable is a word that's the opposite of the word formed by the second syllable's letters. For instance, "Sunoco" is a gas station, but that's six letters. "Shell" is five letters but one syllable. "Hertz" is car rental. Five letters, one syllable. "Avis"? Four letters. "FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx. The first syllable is "Fed", the second is "Ex". Now, "Fed" as in past tense of feed, but maybe "Fed" as in federal. The opposite of "Ex" could be "current" or "present". But that doesn't seem right. Alternatively, "Ex" as in former, so the opposite would be "current" or "new". Hmm, not sure. Maybe "FedEx" isn't the answer.
Another thought: "Amtrak" is a brand on highways, but that's six letters. "Greyhound"? Too long. "Dollar" car rental? Five letters? No, six. "Thrifty" is seven. Wait, maybe "Lyft"? No, four letters. "U-Haul" is hyphenated.
Wait, let's think of opposites. The first syllable's sound is a word, and the second syllable's spelling is its opposite. So, for example, if the brand is "Inote", where "I" (pronounced "eye") and "note", but "eye" and "note" don't have opposites. Alternatively, "Hi-Lo", but that's hyphenated. Maybe "Dayton" (a tire company?), but "Day" vs "ton" – not opposites.
Alternatively, maybe "Duracell" – no, too long. "Energizer"? No. "Eveready"? No. Wait, maybe "Goodyear". That's a tire company. Good-year. "Good" and "year". Not opposites.
Wait, the key is that the first syllable phonetically is a word, and the second syllable's spelling is the opposite. So, for instance, if the first syllable is "pro", the second syllable's letters spell "con". But "procon" isn't a brand. Or "up" and "down". Maybe "Updown", but not a real brand.
Wait, the brand "Exxon" – "Ex" and "xon". If "Ex" is the opposite of "xon" spelled as "xon" – no. Maybe "Ex" (as in former) and "xon" – but "xon" isn't a word. Alternatively, "Exxon" could be broken into "Ex" and "on". Wait, "Ex" (as in former) and "on" (the opposite being "off"). But "Exxon" is E-X-X-O-N, so the second syllable is "xon", not "on". Hmm.
Another approach: think of opposites like "yes/no", "hot/cold", "fast/slow", "high/low", "in/out", "on/off", "start/stop", "day/night". Maybe the brand incorporates one of these.
"Hi-Lo" is hyphenated. "Stop & Go" – no. "Speedpass" – not five letters. "7-Eleven" – no.
Wait, "Loews" hotels? Five letters. "Lo-ews". "Lo" as in low, and "ews" – not an opposite. Maybe not.
Wait, "Lucent" – a telecom company, but that's six letters. "Verizon" – seven letters.
Wait, "Motel 6" – no, includes a number. "Super 8" – same.
Wait, "Shell" – five letters but one syllable. "Chevron" – two syllables but seven letters.
Wait, "Pepsi" – five letters, two syllables. "Pep" and "si". "Pep" as in energy, and "si" – not an opposite.
"Cola" – not a brand. "Coke" – one syllable.
Wait, "Adidas" – three syllables. "Nike" – two syllables but four letters.
Wait, "Reebok" – six letters. "Fila" – four.
Wait, maybe a car brand. "Ford" – one syllable. "Chevy" – two syllables but five letters? C-H-E-V-Y. Five letters, "Chev" and "y"? Not opposites.
Wait, "Honda" – five letters, two syllables: Hon-da. Not opposites. "Toyota" – too long. "Lexus" – two syllables but five letters. "Le-xus". Not opposites.
Wait, "Nissan" – two syllables, six letters. "Mazda" – two syllables, five letters. M-A-Z-D-A. "Maz" and "da". Doesn't fit.
Wait, "Dodge" – one syllable. "Jeep" – one.
Wait, "Subaru" – three syllables.
Hmm. Let's think of gas stations again. "Shell", "Exxon", "Mobil", "Texaco", "Chevron", "BP", "Valero", "Citgo".
Citgo: C-I-T-G-O. Two syllables: Cit-go. "Cit" (like "sit") and "go". Wait, "sit" and "go" – those are opposites in a way. If you sit, you're not moving; if you go, you're moving. So "Citgo" – "Cit" phonetically is "sit", and "go" is spelled as "go". So the first syllable sounds like "sit", which is the opposite of "go". That might be it!
Citgo is a gas station brand, seen on highways, five letters (C-I-T-G-O), two syllables: Cit-go. Phonetically, "Cit" sounds like "sit", and "go" is the opposite. So the first syllable (phonetically "sit") is the opposite of the second syllable spelled "go". That seems to fit.
Let me double-check. The brand is Citgo. Five letters, two syllables. First syllable "Cit" pronounced like "sit", which is the opposite of "go". Yes, that makes sense. I think that's the answer. </think>
The brand name is *Citgo*.
Citgo is a well-known gas station brand commonly seen along U.S. roads and highways. It has five letters and two syllables: *Cit* (pronounced like "sit") and *go*. The first syllable phonetically represents "sit," which is the opposite of the second syllable spelled as "go." This clever play on opposites fits the riddle's criteria perfectly.
rthnbgrredf
I'm still not convinced that this isn't a tokenizer issue.
Were you able to find a substantial number of questions that do not fall into the letter-counting or word-shuffling domain, i.e., problems that are clearly unrelated to the fundamental tokenizer issue of modern LLMs? Otherwise, I would argue that your paper simply proves that the issue still exists.
enum
It’s not that the benchmark is hard, but that the reasoning models do so much better than the non-reasoning models. That suggests it is testing a capability that reasoning models have that non-reasoning models do not.
Getting to 100% may require tokenization innovation, sure.
zinccat
I see, but still there's a lot of reasonings just for counting the letters. And ridiculous reasonings like:
FedEx"? Five letters? F-E-D-E-X? No, that's six. Wait, FedEx is two syllables: Fed-Ex. Five letters? F-E-D-E-X? That's five letters? F, E, D, E, X. Yes, five letters. FedEx.
enum
Definitely a lot of letter counting. It's not not a factor. I think the real problem is that the search space for each problem is enormous. When it gets stuck, it just gets stuck enumerating candidates that meet some but not all of the constraints.
XCabbage
This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.
As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.
Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).
(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind that is the main challenge for a human, then I'd agree it was a "reasoning" test.)