
Learning how to think with Meta Chain-of-Thought

drcwpl

I find their critique compelling, particularly their emphasis on the disconnect between CoT’s algorithmic mimicry and true cognitive exploration. The authors illustrate this with examples from advanced mathematics, such as the "windmill problem" from the International Mathematical Olympiad, a puzzle whose solution eludes brute-force sequential thinking. These cases underscore the limits of a framework that relies on static datasets and rigid generative processes. CoT, as they demonstrate, falters not because it cannot generate solutions, but because it cannot conceive of them in ways that mirror human ingenuity.

As they say - "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."

seagullz

And then other problems would perhaps turn up down the track that would call for "discovering new ways to discover new ways of discovery" and so on.

KaoruAoiShiho

Just train it on meta reasoning, ie train it on people discovering ways to discover. It's not really a big problem, just generate the dataset and have at it.

derefr

This doesn't give you the ability to process ideas through the derived new insights, any more than loading the contents of a VLSI program into regular RAM gives you an FPGA.

The linear-algebra primitives used in LLM inference fundamentally do not have the power to allow an LLM to "emulate" its own internals (i.e., to have the [static!] weights + [runtime-mutable] context together encode [runtime-mutable] virtual weights that the same host context can then be passed through). You need host support for that.
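
A deliberately crude numpy sketch of the distinction I'm drawing (every name and shape below is invented purely for illustration): the standard inference path only ever pushes the context through fixed matrices, while turning part of the context into weights is an extra step the host would have to perform.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8                           # toy hidden size
    W = rng.normal(size=(d, d))     # static weights, fixed for the whole inference pass

    def plain_layer(x):
        # The standard primitive: the runtime context only ever flows *through* W.
        # Nothing in this operation lets x rewrite W or act as a new weight matrix.
        return x @ W

    def host_assisted_layer(ctx):
        # Hypothetical "host support" (a hypernetwork-style step): the host code
        # explicitly reinterprets part of the runtime context as a weight matrix
        # and applies it. The reshape below is exactly the step the in-model
        # matmuls cannot perform on their own.
        virtual_W = ctx[:d * d].reshape(d, d)
        return ctx[d * d:d * d + d] @ virtual_W

    print(plain_layer(rng.normal(size=(d,))).shape)                   # (8,)
    print(host_assisted_layer(rng.normal(size=(d * d + d,))).shape)   # (8,)

The second function is of course not how any shipped LLM works; it only shows where the "host support" would have to enter.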

lxgr

> The linear-algebra primitives used in LLM inference fundamentally do not have the power to allow an LLM to "emulate" its own internals […] You need host support for that.

Neither do biological brains (explicitly), yet we can hypothesize just fine.

keeganpoppen

“it’s not really a big problem”… surely you can’t be serious… this comment betrays such a profound ignorance that it could only have come from a genius or a... well, let’s not resort to name-calling…

but, seriously: play the tape forward literally one frame and outline what this dataset even remotely resembles… a core sample from a living human brain? “yeah, just train it on thinking about everything at once”. strong ai isn’t like the restaurant: the path to success doesn’t involve starting with more than you finished with.

FuckButtons

Sure, what's your training corpus for that then?

I find that fairly often if I'm really wrestling with a novel or difficult problem, I will work and work at it, and then one day I will wake up with the answer fully formed with no clear understanding of any of the thought processes that got me to arrive at the solution.

Are you going to record people's subconscious as they sleep? How do you train on something that is so poorly understood in the first place? It's nonsense.

KaoruAoiShiho

I'm sure if you take an hour to recall you'll be able to come up with a process. Or ask a philosophy professor who specializes in reason.

But the easiest way I can think of ATM is to go through all the questions that AI currently fails on, and then have a human work through them and show the chain of thought a human would do, including the false starts, and describing the strategy pivots. Then generate your corpus based on that. However, that burns the problem-set so you'll have to constantly try to come up with new problems.
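
Concretely, one record of such a corpus might look something like the sketch below; every field name is invented, purely to illustrate the shape of the data.

    # Hypothetical shape of a single "meta reasoning" training record.
    # All field names and contents here are made up for illustration.
    record = {
        "question": "a problem the current models fail on",
        "attempts": [
            {"strategy": "direct induction", "trace": "...", "outcome": "dead end"},
            {"strategy": "work backwards from the goal", "trace": "...", "outcome": "partial progress"},
        ],
        "pivots": [
            "noticed the invariant breaks for even n, so switched to a parity argument",
        ],
        "final_chain_of_thought": "...",
        "answer": "...",
    }

The point would be that the false starts and the strategy pivots are first-class data, not just the polished final chain.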

mrbungie

That would still be limited eventually, at what point do we stop adding layers?

hnuser123456

The point where it gets better at discovering ways of discovering things than the combination of the rest of humanity.

What is the combination of parameters that makes a text generator quick-thinking, self-checking, confidence-estimating? Jumping directly from question to accurate, precise, confidence-measured answers, regardless of how abstract the question is?

dartos

> "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."

Wow I love that quote.

leobg

That’s meta. Literally.

Edit: Sorry. This was based on the false assumption that this was research by Meta, Inc.

WillieCubed

I love the quote you mentioned at the end. Do you remember the original source?

TaurenHunter

Thank you for mentioning the windmill problem. Great insights!

https://www.3blue1brown.com/lessons/windmills

adampk

This is the big idea in the paper, basically that CoT is limited for some complex problems because there is a class of problems where there is no 'textbook' way to find a solution. These are novel problems that need a unique methodology. "Essentially, to start generating the solution requires that we already know the full approach. The underlying generative process of the solution is not auto-regressive from left-to-right."

Mathematical meaning:

We can formalize this argument through the interpretation of reasoning as a latent variable process (Phan et al., 2023). In particular, classical CoT can be viewed as

    p(a | q) = Σ_{s_1, ..., s_n} p(a, s_1, ..., s_n | q)

i.e., the probability of the final answer being produced by a marginalization over latent reasoning chains.

We claim that for complex problems, the true solution generating process should be viewed as

    p(a, s_1, ..., s_n | q) = Σ_{z_1, ..., z_m} p(a, s_1, ..., s_n | q, z_1, ..., z_m) · p(z_1, ..., z_m | q)

i.e., the joint probability distribution of the solution (a, s_1, ..., s_n) is conditioned on the latent generative process. Notice that this argument is a meta-generalization of the prior CoT argument, hence why we will refer to the process q → z_1 → ... → z_m as Meta-CoT.
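
To make the two factorizations concrete, here is a toy, fully discrete sketch in Python (the numbers and variable names are mine, not the paper's): the classical CoT view has the latent process marginalized away before the answer is produced, while the Meta-CoT view keeps the solution conditioned on it.

    strategies = ["direct", "search"]        # latent generative process z
    chains = ["s_good", "s_bad"]             # latent reasoning chain s
    answers = ["a_right", "a_wrong"]

    p_z = {"direct": 0.5, "search": 0.5}                   # p(z | q)
    p_s = {"direct": {"s_good": 0.2, "s_bad": 0.8},        # p(s | q, z)
           "search": {"s_good": 0.9, "s_bad": 0.1}}
    p_a = {"s_good": {"a_right": 0.95, "a_wrong": 0.05},   # p(a | q, s)
           "s_bad":  {"a_right": 0.10, "a_wrong": 0.90}}

    # Classical CoT: p(a | q) = sum over s of p(a | q, s) * p(s | q), where
    # p(s | q) already has the latent process z marginalized out.
    p_s_marg = {s: sum(p_z[z] * p_s[z][s] for z in strategies) for s in chains}
    p_a_cot = {a: sum(p_a[s][a] * p_s_marg[s] for s in chains) for a in answers}

    # Meta-CoT: keep the conditioning on z, i.e. work with p(a, s | q, z); here,
    # the answer distribution once the "search" process has been selected.
    p_a_meta = {a: sum(p_a[s][a] * p_s["search"][s] for s in chains) for a in answers}

    print(p_a_cot)    # roughly {'a_right': 0.5675, 'a_wrong': 0.4325}
    print(p_a_meta)   # roughly {'a_right': 0.865, 'a_wrong': 0.135}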

I think this is seminal. It is getting at the heart of some issues. Ask o1-pro how you could make a 1550 nm laser diode operating at 1 GHz have low geometric loss without an expensive collimator, using commodity materials or novel manufacturing approaches from first-principles physics, and the illusion that o1-pro is a big deal is lost. 'Novel' engineering is out of reach because there is no textbook on how to do novel engineering, and this class of problems is 'not auto-regressive from left-to-right'.

gjm11

I think it's remarkable how the goalposts have shifted.

For an AI model to be "a big deal", apparently we need to be able to give it a hard problem in an arbitrary field, one that humans have not yet solved[1], and have it spit out a good solution.

[1] At least, I think that's your intent. I am not a laser expert so I don't have a sense of where your challenge lies on a scale from "known but only to experts" to "major research project, may turn out to be impossible".

I very much agree that an AI system that could do that would be a big deal. An AI that could do that would be a world-changing deal. But it's pretty startling if everything short of that is not "a big deal" now, no?

lacksconfidence

The problem is this is what people are being told is happening. I've talked to laypeople who think ChatGPT is a superintelligent thing they get 100% truthful answers from. I saw a podcast last week from a PhD (in an unrelated field) claiming AGI will be here in 2027. As long as there are people out there claiming AI is everything, there will be people who look at what's available and say no, it's not actually that (yet).

keeganpoppen

respectfully, i feel i am alone in this opinion, but i’m not even remotely convinced that there isn’t a “superintelligent being” hiding in plain sight with tools that we already have at hand. people always grouse about the quality of LLM outputs, and then you realize that they (tend to) think that somehow the LLM is supposed to read their minds and deliver the answer they “didn’t need, but deserved”… i’d take my chances being dumped in 12th-century england getting bleated at in old english over being an LLM that has to suffer through a three sentence essay about someone’s brilliant, life-altering startup idea, having to grapple with the overwhelming certainty that there is absolutely no conceivable satisfactory answer to a question poorly conceived.

for all we (well, “i”, i guess) know, “superintelligence” is nothing more than a(n extremely) clever arrangement of millions of gpt-3 prompts working together in harmony. is it really so heretical to think that silicon + a semi-quadrillion human-hour-dollars might maybe have the raw information-theoretical “measurables” to be comparable to those of us exalted organic, enlightened lifeforms?

clearly others “know” much more about the limits of these things than i do. i just have spent like 16 hours a day for ~18 months talking to the damned heretic with my own two hands— i am far from an authority on the subject. but beyond the classical “hard” cases (deep math, … the inevitability of death …?), i personally have yet to see a case where an LLM is truly given all the salient information in an architecturally useful way and still produces “troublesome output”. you put more bits into the prompt, you get more bits out. yes, there’s, in my opinion, an incumbent conservation law here— no amount of input bits yields superlinear returns (as far as i have seen). but who looks at an exponential under whose profoundly extensive shadow we have continued to lose ground for… a half-century? … and says “nah, that can never matter, because i am actually, secretly, so special that the profound power i embody (but, somehow, never manage to use in such a profound way as to actually tilt the balance “myself”) is beyond compare, beyond imitation”— not to be overly flip, but it sure is hard to distinguish that mindset from… “mommy said i was special”. and i say this all with my eyes keenly aware of my own reflection.

the irony of it all is that so much of this reasoning is completely contingent on a Leibniz-ian, “we are living in the best of all possible worlds” axiom that i am certain i am actually more in accord with than anyone who opines thusly… it’s all “unscientific”… until it isn’t. somehow in this “wtf is a narcissus” society we live in, we have gone from “we are the tools of our tools” to “surely our tools could never exceed us”… the ancient greek philosopher homer of simpson once mused “could god microwave a burrito so hot that even he could not eat it”… and we collectively seem all too comfortable to conclude that the map Thomas Aquinas made for us all those scores of years ago is, in fact, the territoire…

tomrod

> For an AI model to be "a big deal", apparently we need to be able to give it a hard problem in an arbitrary field, one that humans have not yet solved[1], and have it spit out a good solution.

Once you've been to the moon, the next stage is Mars or Deimos. Humans celebrate progress but also appreciate incremental improvements.

I run an AI/ML consultancy so I have skin in this game. The "traditional" model approaches still have tons, tons, tons of value to offer. Few need to have the frontier right away.

adampk

Yes! The ChatGPT moment has worn off. And there hasn't been a step change in dramatic impact since then, other than Claude Sonnet 3.5 + Cursor (which is only for coding).

I 100% agree with you that AI is fantastic and it is a big deal in general. But now that the world has gotten used to it being able to parrot back something it learned (including reasoning) in the training set, the next 'big deal' is actual insight.

But I see your point, I still think what we have currently is out of a sci-fi book, but I am also not that amazed by computers in our pockets anymore :)

YeGoblynQueenne

No, and no goalposts have shifted. What's happened instead is that the claims made by LLM makers keep getting more and more outlandish as time passes, and they do that as a response to criticism that keeps pointing out the shortcomings of their systems. Every new model is presented as a breakthrough [1] and its makers rush to show off the results like "the new model is 100% better than the old one in passing the Bar exam!". You can almost hear the unsaid triumphant question hanging in the air "Are you convinced now? Are we having fun yet?".

We're not. The big deal with LLMs is that they are large enough language models that they can generate fluent, grammatical text that is coherent and keeps to a subject over a very, very long context. We never could do this with smaller language models. Because statistics.

What LLMs absolutely cannot do is generate novel text. This is perhaps hard to explain to anyone who hasn't trained a small language model, but generativity (the ability to generate text that isn't in a training set) is a property of the tiniest language model, as it is of the largest one [2]. The only difference is that the largest model can generate a lot more text.

And still that is not what we mean by novelty. For example, take art. When ancient humans created art, that was a new thing that had never before existed in the world and was not the result of combining existing things. It was the result of a process of abstraction, and invention: of generalisation. That is a capability that LLMs (as other statistical systems) lack.

The goalposts therefore have not moved because the criticism is as old as nails and the LLM makers have still not been able to comprehensively address it. They just try to ignore it. If the goalposts are here and you're shooting goals over there and then doing a little victory run every time the ball breaks Col. Mustard's windows, that's not the goalposts that have moved, it's you that keeps missing them.

_____________

[1] I'm old enough to remember... GPT-3 and how it blew GPT-2 out of the water; GPT-3.5 and how it blew GPT-3 out of the water; GPT-4 and how it blew GPT-3.5 out of the water... And all the users who would berate you for using the older model since "the new one is something completely different". Every single model. A yuuuge breakthrough. What progress!

[2] Try this. Take the sentence "<start> the cat sat on the mat with the bat as a hat <end>" and generate its set of bi-grams ("<start> the", "the cat", "cat sat", etc.). Then generate permutations of that set. You'll get a whole bunch (14! - 1, as in |sentence|! minus the initial one) of sentences that were not in the training set. That's generativity in a tiny language model, and that's how it works in the largest one too, hard as that may be to believe. It shouldn't be: it's a very simple mechanism that is extremely powerful. Large models are simply better at assigning weights to permutations, so that the ones more often encountered in a corpus are weighted more.
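
A runnable toy version of that footnote (my own sketch, nothing from a paper): build the bigram table from that one sentence and random-walk over it, and most of what comes out was never in the "training set".

    import random

    corpus = "<start> the cat sat on the mat with the bat as a hat <end>"
    tokens = corpus.split()

    # Bigram table: for each word, the words observed to follow it.
    bigrams = {}
    for w1, w2 in zip(tokens, tokens[1:]):
        bigrams.setdefault(w1, []).append(w2)

    def generate(max_len=20):
        # Random walk over the bigram table from <start> until <end> or max_len.
        out = ["<start>"]
        while out[-1] != "<end>" and len(out) < max_len:
            out.append(random.choice(bigrams[out[-1]]))
        return " ".join(out)

    for _ in range(5):
        print(generate())
    # Walks like "<start> the bat as a hat <end>" or "<start> the mat with the
    # cat sat on ..." never appear in the one-sentence training data: that is
    # the generativity being described, from the tiniest possible model.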

adampk

Agreed! Don't get me wrong, the statistical distribution modeling of human language is still SUPER helpful. And for things like legal/tax/coding, which have a lot to do with applying language patterns, this is a very big deal. But the ability to find the true 'substructure' of the content they are trained on is not something they can do. It is like there is some lower substrate that is 'missing'. That is a lot to ask for, but once we get there it will be the 'magic' that is promised, rather than amazing, super-helpful parlor tricks.

pillefitz

I do wonder whether a human could come up with a working solution for this problem without querying physical reality, i.e., experimentation. Parts of reality are uncomputable, so they can only be arrived at by letting the universe simulate them.

adampk

The closest example I could think of is the (maybe true, maybe myth making) story of SpaceX using car wash valves instead of super expensive 'space grade' valves that did the same thing, and were orders of magnitude cheaper. Doesn't seem like embodied AI is necessary to figure this out.

nuancebydefault

> CoT is limited for some complex problems because there is a class of problems where there is no 'textbook' way to find a solution.

This is contrary to my findings when interacting with LLMs. I can ask questions in ways not understandable for most human beings and from the reply I can derive the question is interpreted correctly (leaving aside the correctness of the answer). Some non-textbook-example of interpretation did emerge.

adampk

Interesting, could you give me an example? LLMs definitely can "understand" what I am asking at times when a human couldn't. They have more data to 'find similarity' to what I might mean. But I do not think you are saying they answer questions a human couldn't?

moffkalast

> 'Novel' engineering is out of reach because there is no text book on how to do novel engineering

There's no book on the scientific method?

As other commenters point out, it's kind of physically impossible to expect even a superintelligence in a box to figure something out that takes experimentation and observation. It would have to be limited to exploring pure math concepts and other fields where you only need to write on paper and think about axioms. And that's arguably the hardest type of field to make progress in; it took us millennia to produce enough polymaths to get where we are, and they all contributed a very small part each.

adampk

I don't disagree that 'new data' is sometimes needed to make progress. But there is plenty of novel engineering that can be done without 'new data', just with insights and connections.

But realizing that you can use certain commodity devices or known processing techniques in different problem spaces does not require new data, just 'insight'.

erikerikson

> That is, language models learn the implicit meaning in text, as opposed to the early belief some researchers held that sequence-to-sequence models (including transformers) simply fit correlations between sequential words.

Is this so, i.e., is the research community agreed on it? Are there papers discussing this topic?

jbarrow

The research community is definitely not agreed on this, and there are a number of different camps. Broadly, two perspectives from the NLP community:

The 2020 Bender and Koller paper[1] that argues that meaning is not learnable from form, and LLMs are trained on form. They propose a thought experiment ("The Octopus Test" section of the paper) featuring an octopus that can intercept the conversation two humans are having, but "having only form available as training data, [the Octopus] did not learn meaning."

And a contradicting response from Yoav Goldberg (another NLP researcher)[2], with a much more informal discussion of "groundedness" and what LLMs learn. His argument is broadly that instruction tuning + post-training can meaningfully ground terms like "summarize" etc.

[1] https://aclanthology.org/2020.acl-main.463/

[2] https://gist.github.com/yoavg/59d174608e92e845c8994ac2e234c8...

naasking

> The 2020 Bender and Koller paper[1] that argues that meaning is not learnable from form, and LLMs are trained on form. They propose a thought experiment ("The Octopus Test" section of the paper) featuring an octopus that can intercept the conversation two humans are having, but "having only form available as training data, [the Octopus] did not learn meaning."

This is just Searle's Chinese Room, and it's obviously false. How can we know it's false? Because there is no "meaning" in the standard model of particle physics (all interactions are by "form"/syntax), and therefore all humans must learn meaning from "form" as well.

wavemode

My sense has always been that, there actually is no difference between "the implicit meaning in text" and "correlations between sequential words".

That is to say, the fact that LLMs are able to communicate effectively with humans is a discovery about the regularity of the semantics of human communication, rather than a discovery about the intelligence of neural networks.

naasking

Agreed: semantics boils down to a relational network between words that designate concepts, so large language models build a relational network of concepts. Meta just published Large Concept Models, which build on this:

https://ai.meta.com/research/publications/large-concept-mode...

mjburgess

This is certainly not agreed. Computer scientists here don't even have a theory of meaning, because it isn't part of the discipline, nor do almost any of them have a prior research background in it -- hence these sorts of outrageous claims all over the place. However you want to give natural language semantics, ML models certainly do not use this semantics.

The very best that might be said is that the correlational structure of words under transformer-like supervision (ie., where "predict the next word" is the goal) produces a distribution which is an extremely approximate model of natural language semantics.

Though this has never been disputed. The question comes down to what kind of extreme approximation is involved.

Eg., the truth conditions for "I have a pen in my hand" are that I have a pen in my hand -- direct access to these truth conditions is very plausibly necessary to mean "I have a pen in my hand" in the relevant context. Since a machine has no access to the truth conditions of such utterances it cannot possibly mean them.

Thus if a machine manages to say, "I have a pen in my hand" at an appropriate occasion -- the "extreme approximation to natural language semantics" has to do with this occasion and what "appropriateness" means.

Critics of LLMs and "computer-science-addled thinking" about such matters (such as myself) would say that there is a very narrow range of "occasions" (i.e., situations in which you're prompting) that allow such responses to seem appropriate.

That a response seems appropriate to a user is a good engineering condition on a tool working -- it has nothing to do with whether a model understands natural language semantics.

What we might say is that it approximates conversations between agents who understand such semantics on a narrow range of occasions, and succeeds in modelling appropriate language use. And so you might call LLMs models of 'average appropriateness of replies'.

It obviously does not, and cannot, mean "I have a pen in my hand".

gjm11

The truth conditions for the sentence "The composer Johann Sebastian Bach died in 1750" are not directly accessible to me. Can I mean that, if I say it?

The truth conditions for "The god of the evangelical Christians exists" and "The god of the evangelical Christians does not exist" have, arguably, never been directly accessible to any ordinary human being. (Though some of their consequences could be accessible.) Can people mean such things, when they say them?

The truth conditions for "There are infinitely many prime numbers" are ... unclear, really, but maybe they're vacuous (there is no possible world in which there aren't infinitely many prime numbers) or they involve only abstracta (such as those numbers). How do you feel about the possibility of an AI saying that and meaning it, and why?

The first of these examples is the most directly relevant one. I have no direct access to the truth conditions of that sentence, but I think I can still mean it, have good reason to think it true, etc. The processes by which I got into that state involve ... learning things by reading about them, which is exactly what I think you're saying cannot in principle ever give genuine knowledge.

Anticipating a possible response: Of course many of the other things I know, some of which are relevant to the way I understand those words, I learned more directly. For instance, part of what "died" means is the cessation of various natural processes like breathing and having one's heart beat, and I have direct experience of breathing and having a beating heart. One could argue that real knowledge and understanding needs to be somehow traceable back to direct experience, and therefore LLM-type systems cannot have them. But that would be a different argument from the one you've made, and I think it's less compelling (though more likely to be right!) than the simpler "knowledge isn't real unless it's based on direct access to the relevant truth conditions".

mjburgess

The mechanism of access varies depending on the claim made. "The sun is engaged in nuclear fusion" could not have been meant in 100 BC. But "I have a pen in my hand" could have. Julius Caesar could have made those sounds, but he could never have meant the meaning of those words.

... to mean "I have" requires an "I" to "have", and so on. So what parts of non-linguistic reality language refers to matter for evaluating whether the user means what they say. An actor is likewise pretending to mean, and a child may say something without knowing what it means (as in, eg., a claim about nuclear fusion).

If two children were imitating sounds to each other, such that one "said" "the sun is nuclear fusion" and so on -- then neither party in this conversation is communicating; neither knows what these words mean. No child involved could ever come up with these words in this order and mean their meaning; they can only have this conversation via imitation. This is the case with an LLM -- it's an imitation game wherein the goal is either to fool the adult overhearing the child, or to generate some useful material (depending on whether you're the CEO or the CTO).

The problem with a "predict the next word" training goal is that any patterns which emerge will only be coincidentally related to the non-linguistic reality words refer to -- because the machine isn't trained on reference: it is not participating in reality and associating words with it.

The kind of participation necessary for an agent to acquire the meaning of words has no universal answer, but it is always "some". An LLM has none.

For a claim about a composer, an agent who means to make this claim (rather than a child who imitates the sounds of words) must be aware of what a composer is, and so on. They cannot mean this claim if they don't have access to the non-linguistic reality to which these words refer (or are unable, via imagination, to simulate similar ways the world might be, such that it has composers, given their prior knowledge -- e.g., they at least have to have some prior direct access to music, leading-groups-of-people, and the like).

We can weaken all this slightly, but it'll make no difference for an LLM -- however weakly we require access, accessing the meaning of words requires access to a non-linguistic reality. Words mean non-linguistic things -- that is their point.

idiotsecant

In what way do you have access to the truth conditions for 'I have a pen in my hand' that an LLM can not? This smells circular to me.

mjburgess

Well, by having a hand, and having a pen in it

YeGoblynQueenne

>> Behind this approach is a simple principle often abbreviated as "compression is intelligence", or the model must approximate the distribution of data and perform implicit reasoning in its activations in order to predict the next token (see Solomonoff Induction; Solomonoff 1964)

For the record, the word "intelligence" appears in the two parts of "A Formal Theory of Inductive Inference" (referenced above) a total of 0 times. The word "Compression" appears a total of 0 times. The word "reasoning" once; in the phrase "using similar reasoning".

Unsurprisingly, Solomonoff's work was preoccupied with Inductive Inference. I don't know that he ever said anything about "compression is intelligence", but I believe this is an idea, and a slogan, that was developed only much later. I am not sure where it comes from, originally.

It is correct that Solomonoff induction was very much about predicting the next symbol in a sequence of symbols, and not necessarily linguistic tokens either. The common claim that LLMs are "in their infancy", or similar, is dead wrong. Language modelling is basically ancient (in CS terms) and we have long since crossed into the era of its technological maturity.

_______________

[1] https://raysolomonoff.com/publications/1964pt1.pdf

[2] https://raysolomonoff.com/publications/1964pt2.pdf

naasking

It makes perfect sense that intelligence is a form of compression. An inductive model is small but can potentially generate arbitrary amounts of information.
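
A toy way to see the point (just a sketch, nothing more): a generator a few lines long can reproduce, and extend indefinitely, a data stream vastly larger than its own description.

    from itertools import islice

    def fib():
        # The "model": a handful of bytes of source code.
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

    data = [str(n) for n in islice(fib(), 10_000)]   # the "data" it accounts for
    print(sum(len(s) for s in data))   # ~10 million characters from a few lines of code
    # The generator never stores the stream; it regenerates it (and can extend it)
    # on demand, which is the compression-flavored intuition above.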

YeGoblynQueenne

That's not in the reference (Solomonoff's 1964 paper) either.

naasking

You're the only one talking about the reference, I'm talking about what makes sense/what logically follows.

pama

Congrats to the authors for a thoughtful work! I have been thinking about and working on related ideas for a few months now, but I have not yet spent commensurate compute on them and might have gone in a different direction; this work certainly helps create better baselines along the way to making better use of decoder transformer architectures. Please keep it coming!

lawlessone

Is Meta the company here or are they using meta the word? or both?

tomrod

vlovich123

But be careful with that output. It completely hallucinated sympy, and the way it's using it wouldn't do anything, because it keeps calling it on the original problem statement rather than using it as an aid to the LLM. So it's entirely unclear where the mistakes are in the summary without reading and fully understanding the paper.

tomrod

Feedback noted! Too late for me to edit comment. Will see if I can wipe the hallucinating chat.

baobun

Right answer, otherwise garbage source with much incorrectness. Please stop using CGPT links as references or source-of-truth.

lawlessone

Thank you!

j45

I'm a little curious: would anyone have a way to know how many researchers study something they came up with themselves, versus something an independent developer was already doing online that then got picked up, researched, and reported on?

jpcom

The example in the paper, using a plug-and-chug algebra equation and the step-by-step process to solve it, reinforces the notion that LLMs can only reproduce recipes they have seen before. This is really no different from how we learn mathematics in school: the teacher shows a starting point and moves, step by step, to the end of the process. Calling this "Meta Chain-of-Thought" feels like an aggrandizement of a basic educational process to me. Next we'll be labeling the act of holding basic utensils as Layered Physical Kineticism, or something contrived like that. In school this "Meta Chain of Thought" was called "show your work." Is this really a "phenomenon" that needs explaining? It might teach us more about how we achieve logical induction (steps of reasoning), but we are pretty deep in the soup to be able to accurately describe the shape of the pot.

keeganpoppen

“can only reproduce recipes they have seen before”… are you talking about llms or about yourself?

jpcom

That's weirdly ad hominem, clearly I meant LLMs. They gave it a basic algebra problem and it could do it if it had broken down a problem step-by-step in a similar way. What's with the attitude? Edit: I don't even know why I replied to your vitriolic nonsense, I even used LLM in the sentence preceding what you quoted...

naasking

Meta's recently released Large Concept Models + this Meta Chain of Thought sounds very promising for AGI. The timeline of 2030 sounds increasingly plausible IMO.