Recent results show that LLMs struggle with compositional tasks

moolimon

The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they do pattern matching on their input data and short-circuit to a statistically likely result. Chain-of-thought reasoning is still bound by this basic property of reflexive pattern matching, except that the LLM is forced to go through a process of iteratively refining the domain it does matching on.

Chain of thought is interesting because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. The caveat is that all RL needs a reward model: a clear definition of success, and some way of rewarding getting closer to it, before you can actually solve those problems.

Framing transformer-based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem-solving skills too. It's interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.

drakenot

With DeepSeek-R1-Zero, their use of RL didn't have reward functions that indicated progress towards the goal, afaik.

It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.

geoffhill

Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:

  solve(WaterDrinker, ZebraOwner) :-
      % H01: Five houses with positions 1..5.
      Houses = [ house(1, _, norwegian, _, _, _),  % H10: Norwegian lives in the first house.
                 house(2, blue, _, _, _, _),       % H15: Since the Norwegian lives next to the blue house,
                 house(3, _, _, milk, _, _),        %       and house1 is Norwegian, house2 must be blue.
                 house(4, _, _, _, _, _),
                 house(5, _, _, _, _, _) ],
  
      % H02: The Englishman lives in the red house.
      member(house(_, red, englishman, _, _, _), Houses),
      % H03: The Spaniard owns the dog.
      member(house(_, _, spaniard, _, dog, _), Houses),
      % H04: Coffee is drunk in the green house.
      member(house(_, green, _, coffee, _, _), Houses),
      % H05: The Ukrainian drinks tea.
      member(house(_, _, ukrainian, tea, _, _), Houses),
      % H06: The green house is immediately to the right of the ivory house.
      right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
      % H07: The Old Gold smoker owns snails.
      member(house(_, _, _, _, snails, old_gold), Houses),
      % H08: Kools are smoked in the yellow house.
      member(house(_, yellow, _, _, _, kools), Houses),
      % H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
      next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
      % H12: Kools are smoked in a house next to the house where the horse is kept.
      next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
      % H13: The Lucky Strike smoker drinks orange juice.
      member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
      % H14: The Japanese smokes Parliaments.
      member(house(_, _, japanese, _, _, parliaments), Houses),
      % (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
      
      % Finally, find out:
      % Q1: Who drinks water?
      member(house(_, _, WaterDrinker, water, _, _), Houses),
      % Q2: Who owns the zebra?
      member(house(_, _, ZebraOwner, _, zebra, _), Houses).
  
  right_of(Right, Left, Houses) :-
      nextto(Left, Right, Houses).
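  % (nextto/3 is the list-library built-in, e.g. from SWI-Prolog's library(lists):
  %  nextto(X, Y, List) holds when Y immediately follows X in List.)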
  
  next_to(X, Y, Houses) :-
      nextto(X, Y, Houses);
      nextto(Y, X, Houses).
Seems ok to me.

   ?- solve(WaterDrinker, ZebraOwner).
   WaterDrinker = norwegian,
   ZebraOwner = japanese .

orbital-decay

That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.

This has all been known for a long time and makes intuitive sense: you can't squeeze more computation out of the model than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with conclusions and headlines, as always.

[1] https://arxiv.org/abs/2412.02975

[2] https://news.ycombinator.com/item?id=42889786

leonidasv

Also, notice that the original study is from 2023.

simonw

Solving it with Prolog is neat, and a very realistic way of how LLMs with tools should be expected to handle this kind of thing.

EdwardDiego

I would've been very surprised if Prolog code to solve this wasn't something that the model had already ingested.

Early AI hype cycles, after all, are where Prolog, like Lisp, shone.

simonw

I'm certain models like o3-mini are capable of writing Prolog of this quality for puzzles they haven't seen before - it feels like a very straightforward conversion operation for them.

est

I'd say it's not only LLMs that struggle with these kinds of problems; 99% of humans do too.

lsy

If the LLM’s user indicates that the input can and should be translated as a logic problem, and then the user runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.

endofreach

Psst, don't tell my clients that it's not actually me but the syntax of the language I use that's solving their problem.

xyzzy123

Do you feel differently if it runs the prolog in a tool call?

echelon

The LLM doesn't understand it's doing this, though. It pattern matched against your "steering" in a way that generalized. And it didn't hallucinate in this particular case. That's still cherry picking, and you wouldn't trust this to turn a $500k screw.

I feel like we're at 2004 Darpa Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but leaves an enormous amount to be desired.

I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of an existential need to keep the funding flowing.

tuatoru

    solve (make me a sandwich)
Moravec's Paradox is still a thing.

mohsen1

https://chatgpt.com/share/679f0353-12bc-8007-91cf-dd63d52044...

O1 Pro gets the answer with proper reasoning

Can’t tell if this is due to data contamination or if it really figured it out.

How can we form the question in another way to avoid data contamination?

mohsen1

With the prompt modified to look less like the original, it thought 6x longer:

https://chatgpt.com/share/679f086d-a758-8007-b240-38e6843037...

klodolph

When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.

These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.
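
As a rough illustration of what that glue boils down to (a sketch using a CLP(FD) constraint solver rather than a SAT solver, but the division of labor is the same): the LLM's only job is to translate a problem statement into constraints like these, and the solver does the actual search.

  :- use_module(library(clpfd)).

  % Toy stand-in for LLM-emitted constraints: three tasks get distinct
  % slots 1..3, task A must come before task B, and task C is not last.
  % The search is entirely the solver's job.
  toy_schedule(A, B, C) :-
      [A, B, C] ins 1..3,
      all_distinct([A, B, C]),
      A #< B,
      C #\= 3,
      label([A, B, C]).

Querying toy_schedule(A, B, C) enumerates the two valid assignments; the model never has to reason about the combinatorics itself.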

chefandy

Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don’t have any specific examples, but it seems like most of the huge breakthrough discoveries I’ve seen announced end up being overstated, and for practical usage our choice of LLM-driven tools is only marginally better than it was a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from tooling/interface improvements rather than from miracles generated by the models themselves. But it could be that I just don’t have the right use cases.

munchler

Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.

[0] https://arcprize.org/

[1] https://arcprize.org/blog/oai-o3-pub-breakthrough

mrshadowgoose

Reviewing the actual problems is highly recommended: https://kts.github.io/arc-viewer/

They're not particularly difficult, but clearly require reasoning to solve.

EdwardDiego

So we're only 12% from AGI?

I'm dubious tbh. Given we still can't simulate a nematode.

liamwire

Not quite what you asked for, but it seems tangentially related and you might find it interesting: https://r0bk.github.io/killedbyllm/

intelkishan

Performance of OpenAI o3 in the ARC-AGI challenge fits the bill; however, the model is not released publicly.

gallerdude

Completely disagree… there are a crazy number of cases that didn’t work until the models scaled to a point where they magically did.

Best example I can think of is the ARC AGI benchmark. It was seen to measure human-like intelligence through special symmetries and abstract patterns.

From GPT-2 to GPT-4 there was basically no progress, then o1 got about 20%. Now o3 has basically solved the benchmark.

wslh

SMT solvers really.

xigency

I've been slacking but yeah it's on my list.

changoplatanero

By the time these academic studies get published they are usually already several months out of date. o3-mini was released yesterday, and if one wants to know about the limitations of current technology they are much better off checking Twitter than reading some research paper.

FuckButtons

I think the breathless hype train of Twitter is probably the worst place to get an actually grounded take on what the real-world implications of the technology are.

Seeing the 100th example of an LLM generating some toy code, for which there are a vast number of approximately similar examples in the training corpus, doesn’t give you a clearer view of what is or isn’t possible.

elicksaur

Or neither. Try it yourself.

For me, LLMs still don’t meet basic usefulness and are a net negative when I try to use them. I push code daily for my job.

Xmd5a

The paper is recent and being discussed here: https://news.ycombinator.com/item?id=42889786

anon291

It fundamentally does not matter. Matrix multiplication does not erase the truth of Gödel and Turing.

kadoban

Gödel and Turing just proved that there are some true things that can't be proved, and some things that cannot be computed. They didn't show where those boundaries are.

They certainly didn't show those boundaries to be below the level of human cognition.

anon291

Gödel proved that there are unprovable statements. Turing showed that certain classes of problems can only be solved by machines with infinite tapes. Thus no bounded LLM can possibly solve every problem a Turing machine can; only a theoretically infinite chain of thought could give us that power.

Gödel then tells us that, if we have such a system, there are problems on which it may get stuck.

Indeed, this is what we see in chain-of-thought models. If you give them an impossible problem they either give up or produce a seemingly endless stream of tokens without ever emitting the </think> tag.

And Turing tells us that deciding whether a set of matrices modeling a finite state machine over an infinite token stream will ever halt is the halting problem.

andrewflnr

Yeah, the grounded take is that Turing and Gödel apply just as much to human intelligence. If not, someone please go ahead and use this to physically prove the existence of an immortal, hypercomputational soul.

drdeca

Who is trying to “erase the truth of Gödel and Turing”? (Well, some cranks are, but I don’t think that’s who you are talking about.)

Gödel and Turing’s results do not appear to give any reason that a computer program can’t do what a person can do.

anon291

That's not the point. A computer program with a finite number of steps (an autoregressive LLM without chain of thought) has a limit on what it can reason about in one step. This article does a lot of wordcelling to show this obvious point.

mikeknoop

One must now ask whether research results are analyzing pure LLMs (e.g. the GPT series) or LLM synthesis engines (e.g. the o-series and r-series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on the new synthesis engines. In fact, evidence strongly suggests the opposite, given o3's significant performance on ARC-AGI-1, which requires on-the-fly composition capability.

orbital-decay

It's Quanta being misleading. They mention several papers but end up with this one [1], which talks about decoder-only transformers, not LLMs in general, chatbots, or LLM synthesis engines, whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation out of a transformer, which is... obvious? But formally proven this time. Models trained to do CoT don't have some magical on-the-fly compositional ability; they just invest more computation (possibly dozens of millions of tokens in the case of o3-high solving ARC-AGI-1 tasks).

[1] https://arxiv.org/abs/2412.02975

WillAdams

Perhaps applying:

https://www.doc.ic.ac.uk/~re14/Evans-R-2020-PhD-Thesis.pdf

"Kant’s Cognitive Architecture"

will help?

> ...we provide a precise formalization of what it means to “make sense” of a sensory sequence. According to our definition, making sense means constructing a symbolic causal theory that explains the sensory sequence and satisfies a set of unity conditions that were inspired by Kant’s discussion in the first half of the Critique of Pure Reason. According to our interpretation, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis.

d0mine

> multilayer transformers indeed cannot solve certain complicated compositional tasks

> chain-of-thought prompting essentially turns a large problem into a sequence of smaller problems, making it possible for transformers to tackle more complex compositional tasks

--- [out of order]

> the model could be trained on 20-digit numbers and still reliably (with 98% accuracy) add 100-digit numbers, whereas a model trained without the extra positional embedding was only about 3% accurate

thfuran

98% seems pretty damn unreliable to me given that a computer ought to be able to do that addition with probably about 12 nines instead of 1 without really putting much effort into reliability.

bruce511

This speaks to the psychology of it all.

Over my life I've been "trained" that humans are unreliable and subject to change, while computers are deterministic and will give the same output every time (given the same input).

LLMs though are very human. Ask them the same question and you'll get different answers. Sometimes grammatically different, sometimes factually different.

This is partly why hallucinations freaked people out. I expect a correct, deterministic answer. And it just made stuff up. Then it made up URLs to validate the answer.

In other words you have computers behaving like humans. Which was indeed the whole point of them to begin with.

But psychologically this is disturbing.

szvsw

Can you add 100-digit numbers reliably, in a short amount of time, over a large sample size? How about without a piece of paper and a pen? Can you even remember a 100-digit number? It’s likely quite difficult…

This is not about comparing to the reliability “of a computer”, but rather to a process which responds to a wide array of natural language queries (both human processes and algorithmic processes)… and obviously the reliability goes up once tool use is included.

apsec112

"showing that multilayer transformers indeed cannot solve certain complicated compositional tasks. Basically, some compositional problems will always be beyond the ability of transformer-based LLMs."

Pretty sure this is just false and the paper doesn't show this. I could be misunderstanding, but it looks like the result is only about a single token/forward pass, not a reasoning model with many thousands of tokens like o1/o3.

simonw

I'm not sure that the statement "some compositional problems will always be beyond the ability of transformer-based LLMs" is even controversial to be honest.

There's a reason all of the AI labs have been leaning hard into tool use and (more recently) inference-time compute scaling (o1/o3/Gemini Thinking/R1 etc): those are just some of the techniques you can apply to move beyond the unsurprising limitations of purely guessing the next token.

apsec112

o3 is still a transformer-based LLM, just one with a different loss function

simonw

Huh, yeah that's a good point. The various distilled R1 models are definitely regular transformer-based LLMs because the GGUF file versions of them work without any upgrades to the underlying llama.cpp library.

simonw

Here's "Einstein’s puzzle" from this paper: https://www.researchgate.net/publication/341189675_Is_Einste...

  H01 There are five houses.
  H02 The Englishman lives in the red house.
  H03 The Spaniard owns the dog.
  H04 Coffee is drunk in the green house.
  H05 The Ukrainian drinks tea.
  H06 The green house is immediately to the right of the ivory house.
  H07 The Old Gold smoker owns snails.
  H08 Kools are smoked in the yellow house.
  H09 Milk is drunk in the middle house.
  H10 The Norwegian lives in the first house.
  H11 The man who smokes Chesterfields lives in the house next to the man with the fox.
  H12 Kools are smoked in a house next to the house where the horse is kept.
  H13 The Lucky Strike smoker drinks orange juice.
  H14 The Japanese smokes Parliaments.
  H15 The Norwegian lives next to the blue house.
  Now,
  Q1 Who drinks water?
  Q2 Who owns the zebra?

simonw

I ran a few replacements because the puzzle is probably in the training data somewhere:

  Englishman -> Scotsman
  Spaniard -> Welshman
  dog -> parrot
  Coffee -> Kombucha
  green -> scarlet
  tea -> butterscotch
  Ukrainian -> Romanian
  ivory -> pink
  snails -> scorpions
  yellow -> turquoise
  Milk -> Red Bull
  Norwegian -> Brazilian
  fox -> bear
  horse -> mule
  orange juice -> rum
  Japanese -> German
  blue -> brown
  red -> purple
Here's my new prompt after replacing those words:

  H01 There are five huts.
  H02 The Scotsman lives in the purple hut.
  H03 The Welshman owns the parrot.
  H04 Kombucha is drunk in the scarlet hut.
  H05 The Romanian drinks butterscotch.
  H06 The scarlet hut is immediately to the right of the pink hut.
  H07 The Old Gold smoker owns scorpions.
  H08 Kools are smoked in the turquoise hut.
  H09 Red Bull is drunk in the middle hut.
  H10 The Brazilian lives in the first hut.
  H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
  H12 Kools are smoked in a hut next to the hut where the mule is kept.
  H13 The Lucky Strike smoker drinks rum.
  H14 The German smokes Parliaments.
  H15 The Brazilian lives next to the brown hut.
  Now,
  Q1 Who drinks water?
  Q2 Who owns the zebra?

Then I ran the new prompt through o3-mini-high and it gave this answer (after a thought process that looks legit to me, like it actually did reason through the puzzle and not just spit out an answer it had seen):

  Q1: Who drinks water?
  A: The Brazilian (in Hut 1) drinks water.

  Q2: Who owns the zebra?
  A: The German (in Hut 5) owns the zebra.
Undoing my replacements, that's the Norwegian drinking water and the Japanese person owning a zebra. Which I think is the solution from the original paper.

Here's my o3-mini-high transcript: https://chatgpt.com/share/679efae7-8544-8006-960b-a744cc1fea...

simonw

I tried it against deepseek-r1-distill-llama-70b running on Groq (which is really fast) and it didn't get the right answer: https://gist.github.com/simonw/487c4c074cd6ad163dba061e1e594...

I ran it like this:

  llm -m groq/deepseek-r1-distill-llama-70b '
    H01 There are five huts.
    H02 The Scotsman lives in the purple hut.
    H03 The Welshman owns the parrot.
    H04 Kombucha is drunk in the scarlet hut.
    H05 The Romanian drinks butterscotch.
    H06 The scarlet hut is immediately to the right of the pink hut.
    H07 The Old Gold smoker owns scorpions.
    H08 Kools are smoked in the turquoise hut.
    H09 Red Bull is drunk in the middle hut.
    H10 The Brazilian lives in the first hut.
    H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
    H12 Kools are smoked in a hut next to the hut where the mule is kept.
    H13 The Lucky Strike smoker drinks rum.
    H14 The German smokes Parliaments.
    H15 The Brazilian lives next to the brown hut.
    Now,
    Q1 Who drinks water?
    Q2 Who owns the zebra?'
Using this plugin: https://github.com/angerman/llm-groq

simonw

Full DeepSeek R1 - accessed through the DeepSeek API (their "deepseek-reasoner" model) - got the right answer: https://gist.github.com/simonw/f77be3bbc720e1314235d42593562...

mbernstein

I just did a pass with some replacements with o1 and it very much still recognized it as the Einstein riddle and actually seems to have cheated a bit :)

"Revisiting assumptions

Considering "Camels" might be a mistake for "Kools," leading to confusion. This inconsistency complicates solving the puzzle, showing the need for careful brand assignment."

Tracking puzzle progress

I’m mapping out various house and nationality combinations, but the classic conclusion is the Norwegian drinks water and the Japanese owns the zebra.

Analyzing the arrangement

I’m working through the classic puzzle structure and noting variations, while consistently identifying the Norwegian drinking water and the Japanese owning the zebra as the final solution."

simonw

Hah, that's fun. My o3-mini-high transcript didn't hint that it recognized the puzzle and looked legit when I scanned through it, but I'm still very suspicious since this is evidently such a classic puzzle.

I should have changed the cigarette brands to something else too.

RossBencina

> the puzzle is probably in the training data somewhere

Given that these models can perform translation I'm not sure why you think renaming things is sufficient to put your version out of distribution.

simonw

I don't - but like I said, I reviewed the thought process in the transcript and it looked legit to me.

I'm not sure what else I could do here to be honest, without coming up with a completely new puzzle that captures the same kind of challenge as the original. I'm not nearly patient enough to do that!

lgats

Interestingly, o3-mini produces a correct response: https://chatgpt.com/share/679efd3f-8e88-800d-9aa8-e3fadccf7c...

gorgoiler

I am reminded of this Reddit post from November last year, except this is fifth grade math homework:

https://www.reddit.com/r/mildlyinfuriating/comments/1gippqm/...

The puzzle is a grid of simultaneous equations with 9 unknowns:

  9 + ? - 6 = ? - 3
  -   =   -   =   x
  4 + ? x 5 - ? = 6
  /   +   +   -   -
  8 - 5 = ? - ? x 2
  x   -   -   -   x
  ? + 1 + 7 - 4 = 8
  =   -   =   +   =
  7 = ? x 3 - ? - 2
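
As a hedged sketch of how a constraint solver would encode a grid like this (and of the kind of program an LLM could be asked to emit instead of answering directly), here is just the first row, 9 + ? - 6 = ? - 3, in CLP(FD). The 1..9 digit range and the row-by-row encoding are my assumptions; the full puzzle would add one such constraint per row and per column.

  :- use_module(library(clpfd)).

  % First row only: 9 + A - 6 = B - 3, with the unknowns as digits 1..9.
  % (The digit range and evaluation order are assumptions, not taken
  % from the original worksheet.)
  first_row(A, B) :-
      [A, B] ins 1..9,
      9 + A - 6 #= B - 3,
      label([A, B]).

This first row alone admits A = 1, B = 7; A = 2, B = 8; and A = 3, B = 9. The column constraints are what would pin down a unique assignment.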

dweekly

(Not fair because it is in the training set, but o3-mini-high gives a thorough answer after 66s)

Below is a step‐by‐step reasoning that leads to the well‐known solution of the “Zebra Puzzle.”

Step 1. Number the Houses 1 to 5

We assume the houses are arranged in a row from left to right, with House 1 on the far left and House 5 on the far right.

Step 2. Assign the Easy Clues

1. H10: “The Norwegian lives in the first house.” → House 1: Nationality = Norwegian.
2. H15: “The Norwegian lives next to the blue house.” House 1 has only one neighbor (House 2), so → House 2: Color = Blue.
3. H09: “Milk is drunk in the middle house.” → House 3: Beverage = Milk.

Step 3. Determine the House Colors with the Ivory/Green Pair

• H06: “The green house is immediately to the right of the ivory house.” There are two possible placements for this pair: either Houses (3,4) or Houses (4,5). The option (3,4) would force House 3 to be ivory, but later clues will assign House 3 a color (red, see H02) for the Englishman. So we take the pair as Houses 4 and 5:
  • House 4: Color = Ivory.
  • House 5: Color = Green.
• H04: “Coffee is drunk in the green house.” → House 5: Beverage = Coffee.
• The remaining colors for Houses 1–3 are now Red, Yellow, and (already used) Blue. H02: “The Englishman lives in the red house.” Since House 1 is taken by the Norwegian and House 2 is blue, the only possibility is: → House 3: Color = Red, Nationality = Englishman.
• The only color left for House 1 is Yellow. H08: “Kools are smoked in the yellow house.” → House 1: Color = Yellow, Cigarette = Kools.

Step 4. Assign the Remaining Nationalities and Beverages

The nationalities so far:
• House 1: Norwegian
• House 3: Englishman

Remaining nationalities: Ukrainian, Spaniard, Japanese.
• H05: “The Ukrainian drinks tea.” House 2 already is blue and unassigned for nationality and beverage, so: → House 2: Nationality = Ukrainian, Beverage = Tea.

Now for Houses 4 and 5, the remaining nationalities are Spaniard and Japanese.
• H03: “The Spaniard owns the dog.”
• H14: “The Japanese smokes Parliaments.”

We still have the beverages and cigarettes to assign. Notice that House 5 (green) already has its beverage (coffee). It is natural then to assign:
• House 5: Nationality = Japanese, Cigarette = Parliaments.
This leaves:
• House 4: Nationality = Spaniard, and by H03, → House 4: Pet = Dog.

Step 5. Use the Clues about Cigarettes and Pets

The remaining cigarette brands to assign (across Houses 2, 3, and 4) are: Old Gold, Chesterfields, and Lucky Strike.
• H13: “The Lucky Strike smoker drinks orange juice.” House 5 already drinks coffee; House 3 has milk; House 2 has tea; so the only house with an unassigned beverage is House 4. → House 4: Beverage = Orange juice, Cigarette = Lucky Strike.
• Now, the remaining cigarettes for Houses 2 and 3 are: Old Gold and Chesterfields.
• H07: “The Old Gold smoker owns snails.”
• H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.”
• H12: “Kools are smoked in a house next to the house where the horse is kept.” House 1 (which smokes Kools) has only one neighbor, House 2, so: → House 2: Pet = Horse.

Now, consider Houses 2 and 3 for the remaining cigarette assignments:
• If House 2 were the Old Gold smoker, then by H07 it would have to own snails, but it already has the horse. So House 2 must smoke Chesterfields: → House 2: Cigarette = Chesterfields.
• This forces: → House 3: Cigarette = Old Gold, and by H07, → House 3: Pet = Snails.

Finally, H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.” House 2 (smoking Chesterfields) is adjacent to Houses 1 and 3. House 3 already has snails, so → House 1: must have the fox. → House 1: Pet = Fox.

The only pet left is the zebra. The only house without a pet now is House 5, so: → House 5: Pet = Zebra.

Step 6. Assign the Final Beverage

At this point, the beverages are:
• House 2: Tea
• House 3: Milk
• House 5: Coffee
• House 4: Orange juice

The only beverage not yet used is water. The only house still without a beverage is House 1, so: → House 1: Beverage = Water.

Final House Assignments

• House 1: Color: Yellow, Nationality: Norwegian, Beverage: Water, Cigarette: Kools, Pet: Fox
• House 2: Color: Blue, Nationality: Ukrainian, Beverage: Tea, Cigarette: Chesterfields, Pet: Horse
• House 3: Color: Red, Nationality: Englishman, Beverage: Milk, Cigarette: Old Gold, Pet: Snails
• House 4: Color: Ivory, Nationality: Spaniard, Beverage: Orange juice, Cigarette: Lucky Strike, Pet: Dog
• House 5: Color: Green, Nationality: Japanese, Beverage: Coffee, Cigarette: Parliaments, Pet: Zebra

Cross Check Against the Clues

• H02: Englishman in the red house → House 3 is red and English.
• H03: Spaniard owns the dog → House 4 (Spaniard) has a dog.
• H04: Coffee in the green house → House 5 (green) has coffee.
• H05: Ukrainian drinks tea → House 2 (Ukrainian) has tea.
• H06: Green house is immediately right of the ivory house → Houses 4 (ivory) and 5 (green).
• H07: Old Gold smoker owns snails → House 3 (Old Gold) has snails.
• H08: Kools in the yellow house → House 1 (yellow) smokes Kools.
• H09: Milk in the middle house → House 3 has milk.
• H10: Norwegian in House 1 → correct.
• H11: Chesterfields smoker (House 2) lives next to fox owner (House 1).
• H12: Kools (House 1) next to horse (House 2).
• H13: Lucky Strike smoker (House 4) drinks orange juice.
• H14: Japanese smokes Parliaments → House 5.
• H15: Norwegian (House 1) lives next to blue house (House 2).

All clues are satisfied.

Final Answers

• Q1: Who drinks water? → The Norwegian in House 1 drinks water.
• Q2: Who owns the zebra? → The Japanese in House 5 owns the zebra.

Sources and Further Reading

• For a classic discussion of this puzzle and its variations, see “Einstein’s Riddle” as presented in various logic puzzle collections and articles, such as those found on Brainzilla or in books on lateral thinking puzzles.

This structured reasoning confirms the answers with a high degree of certainty.

EGreg

Maybe they’re not training them on the right thing; maybe math could be a different “mode”, just like sound or images.

anon291

This is ultimately a basic adaptation of the pigeonhole principle and is not surprising. A finite system of matrix multiplications cannot be Turing complete. You cannot expect one trip through a series of matrix multiplications and bias additions, with a final sampling step at the end that commits it to a certain answer, to always produce a correct answer. It's a mathematical impossibility. No talk of quantum woo, emergent phenomena, or whatever other pseudo-science has arisen to explain AI intelligence can get around this simple truth of mathematics.

However, chain-of-thought reasoning, where token streams can continue ad infinitum, could potentially solve large swaths of problems whose ordinary solutions require Turing machines. It could also solve problems that cannot generally be solved by Turing machines, but where you only need solutions for a few classes of instances.

Either way, even with chain of thought, you would expect that in some instances the model's output diverges and does not complete. And unsurprisingly, this is exactly what you see with the DeepSeek models (and other CoT models) when you pose them difficult questions: they never emit the </think> tag.