Modern-Day Oracles or Bullshit Machines? How to thrive in a ChatGPT world
651 comments
February 9, 2025
aidos
This is amazing!
I was speaking to a friend the other day who works in a team that influences government policy. One of the younger members of the team had been tasked with generating a report on a specific subject. They came back with a document filled with “facts”, including specific numbers they’d pulled from an LLM. Obviously it was inaccurate and unreliable.
As someone who uses LLMs on a daily basis to help me build software, I was blown away that someone would misuse them like this. It’s easy to forget that devs have a much better understanding of how these things work, can review and fix the inaccuracies in the output and tend to be a sceptical bunch in general.
We’re headed into a time where a lot of people are going to implicitly trust the output from these devices and the world is going to be swamped with a huge quantity of subtly inaccurate content.
eclecticfrank
This is not something only younger people are prone to. I work in a consulting role in IT and have observed multiple colleagues aged 30 and above use LLMs to generate content for reports and presentations without verifying the output.
Reminded me of wikipedia-sourced presentations in high school in the early 2000s.
aqueueaqueue
I made the same sort of mistake when the internet was young, back in '93! Having a machine do it for you can easily turn into brain switch-off.
hunter-gatherer
I keep telling everyone that the only reason I'm paid well to do "smart person stuff" is not because I'm smart, but because I've steadily watched everyone around me get more stupid over my life as a result of turning their brain switch off.
I agree a course like this needs to exist, as I've seen people rely on ChatGPT for a lot of information. Just yesterday I demonstrated to some neighbors how easily it can spew bullshit if you simply ask it leading questions. A good example is "Why does the flu impact men worse than women?" / "Why does the flu impact women worse than men?" You'll get affirmative answers for both.
directevolve
If men are more likely to die from flu if infected, and women more likely to be infected, an affirmative answer to both questions could be reasonable. When you take into account uncertainty about the goals, knowledge and cognitive capacity of the person asking the question, it's not obvious to me how the AI ought to react to an underspecified question like this.
Edit: When I plug this into a temporary chat on o3-mini, it gives plausible biochemical and behavioral mechanisms that might explain a gender difference in outcomes. Notably, the mechanisms it proposes are the same for both versions of the question, and the framing is consistent.
Specifically, for the "men worse than women" and "women worse than men" questions, it proposes hormone differences, X-linked immune regulatory genes, and medical care-seeking differences that all point toward men having worse outcomes than women. It describes these factors in both versions of the question, and in both versions, describes them as explaining why men have worse outcomes than women.
It doesn't specifically contradict the "women have worse outcomes than men" framing. But it reasons consistently with the idea that men have worse outcomes than women either way the question is posed.
reportgunner
Wait, the people who click phishing links now think AI output is facts? Imagine my shock.
fancyfredbot
I have just read one section of this, "The AI Scientist". It was fantastic. They don't fall into the trap of unfalsifiable arguments about parrots. Instead they have pointed out positive uses of AI in science, examples which are obviously harmful, and examples which are simply a waste of time. Refreshingly objective and more than I expected from what I saw as an inflammatory title.
nmca
(while I work at OAI, the opinion below is strictly my own)
I feel like the current version is fairly hazardous to students and might leave them worse off.
If I offer help to nontechnical friends, I focus on:
- look at rate of change, not current point
- reliability substantially lags possibility, by maybe two years.
- adversarial settings remain largely unsolved if you get enough shots, trends there are unclear
- ignore the parrot people, they have an appalling track record prediction-wise
- autocorrect argument is typically (massively) overstated because RL exists
- doomers are probably wrong but those who belittle their claims typically understand less than the doomers do
layoric
How does this help the students with their use of these tools in the now, to not be left worse off? Most of the points you list seem like defending against criticism rather than helping address the harm.
habinero
Agree. It's also a virtue to point out the emperor has no clothes and the tailor peddling them is a bullshit artist.
This is no different than the crypto people who insisted the blockchain would soon be revolutionary and used for everything, when in reality the only real use case for a blockchain is cryptocoins, and the only real use case for cryptocoins is crime.
The only really good use case for LLMs is spam, because it's the only use case for generating a lot of human-like speech without meaning.
johnmaguire
> The only really good use case for LLMs is spam, because it's the only use case for generating a lot of human-like speech without meaning.
As someone who's been writing code for nearly 20 years now, and who spent a few weeks rewriting a Flutter app in Jetpack Compose with some help from Claude (https://play.google.com/store/apps/details?id=me.johnmaguire...), I have to say I don't agree with this at all.
jdlshore
I read the whole course. Lesson 16, “The Next-Step Fallacy,” specifically addresses your argument here.
nmca
The discourse around synthetic data is like the discourse around trading strategies — almost anyone who really understands the current state of the art is massively incentivised not to explain it to you. This makes for piss-poor public epistemics.
habinero
Nah, you don't need to know the details to evaluate something. You need the output and the null hypothesis.
If a trading firm claims they have a wildly successful new strategy, for example, then first I want to see evidence they're not lying - they are actually making money when other people are not. Then I want to see evidence they're not frauds - it's easy to make money if you're insider trading. Then I want to see evidence that it's not just luck - can they repeat it on command? Then I might start believing they have something.
With LLMs, we have a bit of real technology, a lot of hype, a bunch of mediocre products, and people who insist if you just knew more of the secret details they can't explain, you'd see why it's about to be great.
Call it Habiñero's Razor, but for hype the most cynical explanation is most likely correct -- it's bullshit. If you get offended and DARVO when people call your product a "stochastic parrot", then I'm going to assume the description is accurate.
llm_trw
I'm happy to explain my strategies about synthetic data - it's just that you'll need to hear about the onions I wore in my day: https://www.youtube.com/watch?v=yujF8AumiQo
bccdee
Yeah because if they explained that synthetic data causes model collapse, their stock valuation would shrink.
bo1024
This seems like trying to offer help predicting the future or investing in companies, which is a different kind of help from how to coexist with these models, how to use them to do useful things, what their pitfalls are, etc.
dimgl
What are “parrot people”? And what do you mean by “doomers are probably wrong?”
moozilla
OP is likely referring to people who call LLMs "stochastic parrots" (https://en.wikipedia.org/wiki/Stochastic_parrot), and by "doomers" (not boomers) they likely mean AI safetyists like Eliezer Yudkowsky or Pause AI (https://pauseai.info/).
owl_vision
my english teacher reminded us the same. +1
bjourne
What I find frightening is how many people are willing to take LLM output at face value. An argument is won or lost not on its merits, but by whether the LLM says so. It was bad enough when people took whatever was written on Wikipedia at face value; trusting an LLM that may have hardcoded biases and munges whatever data it comes across is so much worse.
Mistletoe
I’d take the Wikipedia answer any day. Millions of eyes on each article vs. a black box with no eyes on the outputs.
Loughla
Even Wikipedia is a problem though. There are so many pages now that self-reference is almost impossible to detect: a statement on Wikipedia cites an outside article as its reference, but that article was originally written using that very Wikipedia article as its own source.
It's all about trust. Trust the expert, or the crowd, or the machine.
They're all able to be gamed.
bccdee
False equivalence. "Nothing is perfectly unreliable, therefore everything is (broadly) unreliable, therefore everything is equally unreliable." No, some sources are substantially more reliable than others.
JPLeRouzic
> "Millions of eyes on each article"
Only a minority of users contribute regularly (126,301 have edited in the last 30 days):
https://en.wikipedia.org/wiki/Wikipedia:Wikipedians#Number_o...
And there are 6,952,556 articles in the English Wikipedia, so if each of those editors touched roughly one article a month, an average article would be edited about once every 55 months (more than 4 years).
It's hardly "Millions of eyes on each article"
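Spelled out as a rough back-of-the-envelope sketch (it assumes each active editor touches roughly one article per 30-day period, which is the implicit assumption behind the 55-month figure):

    # rough estimate only; both figures are from the links above
    articles = 6_952_556        # English Wikipedia articles
    active_editors = 126_301    # editors with at least one edit in the last 30 days

    months_between_edits = articles / active_editors        # ~55 thirty-day periods
    print(f"~{months_between_edits:.0f} months between edits for an average article")
    print(f"~{months_between_edits * 30 / 365:.1f} years")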
crackalamoo
But of those 126,301 people who have edited in the last 30 days, some of them have edited more than one article. In fact, some have made up to millions of edits (lifetime), which disproportionately increases the total. At least 5000 people have edited more than 24,000 times.
https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_...
(And also: each editor has (approximately) 2 eyes :) )
jurli
This is what people said about the internet too. Remember the whole "do not ever use Wikipedia as a source"? I mean sure, technically correct, but human beings are generally imprecise and having the correct info 95% of the time is fine. You learn to live with the 5% error.
seliopou
A buddy won a bet with me by editing the relevant Wikipedia article to agree with his side of the wager.
xvinci
I think it brings forward all the low-performers and people who think they are smarter than they really are. In the past, many would just have stayed silent unless they recently read an article or saw something on the news by chance. Now, you will get a myriad of ideas and plans with fatal flaws and a 100% score on LLM checkers :)
micromacrofoot
People take texts full of unverifiable ghost stories written thousands of years ago at face value to the point that they base their entire lives on them.
AlienRobot
I've seen someone use an LLM to summarize a paper to post it on reddit for people who haven't read the paper.
Papers have abstracts...
MetaWhirledPeas
Sounds fun, if only to compare it to the abstract.
AlienRobot
You know, these days I think the abstracts are generated by LLMs too. And the paper. Or at least it uses something like Grammarly. If things keep going this way, typos are going to be a sign of academic integrity.
tucnak
> frightening
Don't be scared of "the many," they're just people, not unlike you.
superbatfish
The author makes this assertion about LLMs rather casually:
>They don’t engage in logical reasoning.
This is still a hotly debated question, but at this point the burden of proof is on the detractors. (To put it mildly, the famous "stochastic parrot" paper has not aged well.)
The claim above is certainly not something that should be stated as fact to a naive audience (i.e. the authors' intended audience in this case). Simply asserting it as they have done -- without acknowledging that many experts disagree -- undermines the authors' credibility to those who are less naive.
cristiancavalli
Disagree — proponents of this point have yet to prove reasoning, and other studies suggest that “reasoning” may be fake/simulated: https://the-decoder.com/apple-ai-researchers-question-openai...
Just claiming a capability does not make it true, and we have zero proof of original reasoning coming from these models, especially given the potential cheating in current SOTA benchmarks.
UltraSane
When does a "simulation" of reasoning become so good it is no different than actual reasoning?
cristiancavalli
Love this question! It really touches on some epistemological roots, and it's certainly a prescient question in these times. I can imagine a scenario where we create a simulation that is, from our perspective, complete, and then venture out into the universe only to find that this modality of intelligence is limited in its understanding of completely new empirical experiences/phenomena outside our current natural definitions/descriptions. To add to this question: might we be similarly limited in our ability to perceive these alien phenomena? I would love to read a short story or treatise on this idea!
hnthrow90348765
>Disagree — proponents of this point still have yet to prove reasoning and other studies agree about “reasoning” being potentially fake/simulated: https://the-decoder.com/apple-ai-researchers-question-openai...
???
https://the-decoder.com/language-models-use-a-probabilistic-...
cristiancavalli
Yes, people are claiming different things, yet no definitive proof has been offered given the varying findings. I can cite another 3 papers which agree with my point and you can probably cite just as many if not more supporting yours. I’m arguing against people depicting what is not a foregone conclusion as such. It seems like in their rush to confirm their own preconceived notions, people forget that although a theory may be convincing, it may not be true. Evidence in this very thread of a well-known SOTA LLM not being able to tell which of two numbers is greater indicates to me that what is being called “reasoning” is not what humans do. We can make as many excuses as we want about the tokenizer or whatever, but then forgive me for not buying the super or even general “intelligence” of this software. I still like these tools, though, even if I have to constantly vet everything they say, as they often tend to just outright lie, or perhaps more accurately: repeat lies in their training data, even if you can elicit a factual response on the same topic.
ninetyninenine
It’s stupid. You can prove that LLMs can reason simply by giving one a novel problem for which no data exists and having it solve that problem.
LLMs CAN reason. Whether they can’t reason is not provable: to prove that, you would have to give the LLM every possible prompt it has no data for and show that it never reasons and always gets it wrong. Not only is that proof impossible, it has already been falsified, as we have demonstrable examples of LLMs reasoning.
Literally I invite people to post prompts and correct answers to ChatGPT where it is trivially impossible for that prompt to exist in the data. Every one of those examples falsifies the claim that LLMs can’t reason.
Saying LLMs can’t reason is an overarching claim similar to the claim that humans and LLMs always reason. Humans and LLMs don’t always reason. But they can reason.
cristiancavalli
Saying something again does not prove its veracity, and writing it in caps does not make it true despite the increased emphasis. I default to skepticism in the face of unproven assertions: if one can’t prove that they reason, then we must accept the possibility that they do not. There are myriad examples of these models failing to “reason” about something that would be trivial for a child or any other human (some are even given as examples in this post's other comments). Given this and the lack of concrete proof, I currently tend to agree with the Apple researchers’ conclusion.
Miraste
Answering novel prompts isn't proof of reasoning, only pattern matching. A calculator can answer prompts it's never seen before too. If anything, I would come down on the reasoning side, at least for recent CoT models, but it's not a trivial question at all.
enragedcacti
LLMs CAN read minds. Whether it can’t read minds is not provable.
Literally I invite people to post prompts and correct answers to ChatGPT where it is trivially impossible for it to have known what number you were thinking of. Every one of those examples falsifies the claim that LLMs can’t read minds.
wruza
You can prove that LLMs can reason by simply giving it a novel problem where no data exists and having it solve that problem
They scan a hyperdimensional problem space whose facets and capacity a single human is unable to comprehend. But there potentially exists a slice that corresponds to a problem that is novel to a human. LLMs are completely alien to us, both in capabilities and technicalities, so talking about whether they can reason makes as much sense as if you replaced “LLMs” with “rainforests” or “antarctica”.
robertlagrant
> But they can reason
This isn't demonstrated yet, I would say. A good analogy is how people have used NeRFs to generate Doom levels, but when they do, the levels don't have offscreen coherence or object permanence. There's no internal engine behind the scenes making an actual Doom level. There's just a mechanism to generate things that look like outputs of that engine. In the same way, an LLM might well just be an empty shell that's good at generating outputs based on similar-looking outputs it was trained on, rather than something that can do the work of thinking about things and producing outputs. I know that's similar to "statistical parrot", but I don't think what you're saying demonstrates anything more than that.
more-nitor
wow this is like:
"I made a hypothesis that works with 1 to 5. if a hypothesis holds for 10 numbers, it holds for all numbers"
AlienRobot
I feel it's impossible for me to trust LLMs can reason when I don't know enough about LLMs to know how much of it is LLM and how much of it is sugarcoating.
For example, I've always felt that having the whole thing being a single textbox is reductive and must create all sorts of problems. This thing must parse natural language and output natural language. This doesn't feel necessary. I think it should have some checkboxes and numeric entries for some parameters, although I don't know what those parameters would be.
Regardless, the problem is the natural language output. I think if you can generate natural language output, no matter what your algorithm looks like, it will look convincingly "intelligent" to some people.
Is generating natural language part of what an LLM is, or is this a separate program on top of what it does? For example, does the LLM collect facts probably related to the prompt and a second algorithm connects those facts with proper English grammar adding conjunctions between assertions where necessary?
I believe that is important to understand before we can even consider whether "logical reasoning" is happening. There are formal ways to describe reasoning such as entailment. Is the LLM encoding those formal methods in data structures somehow? And even if it were, I'm no expert on this, so I don't know if that would be enough to claim they do engage in reasoning instead of just mapping some reasoning as a data structure.
In essence, because my only contact with LLMs has been "products," I can't really tell what part of it is the actual technology and what part of it is sugarcoating to make a technical program more "friendly" to users by having it pretend to speak English.
Terr_
> For example, I've always felt that having the whole thing being a single textbox is reductive and must create all sorts of problems.
Your observation is correct, but it's not some accident of minimalistic GUI design: the underlying algorithm is itself reductive in a way that can create problems.
In essence (e.g. ignoring tokenization), the LLM is doing this:
next_word = predict_next(document_word_list, chaos_percentage)
Your interaction with an "LLM assistant" is just growing Some Document behind the scenes, albeit one that resembles a chat-conversation or a movie-script. Another program is inserting your questions as "User says: X" and then acting out the words when the document grows into "AcmeAssistant says: Y". So there are no explicit values for "helpfulness" or "carefulness" etc, they are implemented as notes in the script that--if they were in a real theater play--would correlate with what lines the AcmeAssistant character has next.
This framing helps explain why "prompt injection" and "hallucinations" remain a problem: They're not actually exceptions, they're core to how it works. The algorithm has no explicit concept of trusted/untrusted spans within the document, let alone entities, logical propositions, or whether an entity is asserting a proposition versus just referencing it. It just picks whatever seems to fit with the overall document, even when it's based on something the AcmeAssistant character was saying sarcastically to itself because User asked it to by offering a billion dollar bribe.
In other words, it's less of a thinking machine and more of a dreaming machine.
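If it helps, here's a minimal sketch of that loop in Python. predict_next() is a hypothetical stand-in for the model's next-token sampler, and the script format and stop condition are purely illustrative, not any real vendor's API:

    def chat_turn(document, user_message, predict_next, chaos_percentage=0.8):
        # The wrapper just appends more "script" and keeps asking for the next token.
        document += f"\nUser says: {user_message}\nAcmeAssistant says:"
        reply = ""
        while True:
            token = predict_next(document, chaos_percentage)
            document += token
            reply += token
            if reply.endswith("\nUser says:"):   # it started writing the user's next line; cut there
                reply = reply[:-len("\nUser says:")]
                break
        return document, reply                   # the growing document is the only real "state"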
> Is generating natural language part of what an LLM is, or is this a separate program on top of what it does?
Language: Yes, Natural: Depends, Separate: No.
For example, one could potentially train an LLM on musical notation of millions of songs, as long as you can find a way to express each one as a linear sequence of tokens.
parliament32
This is a great explanation of a point I've been trying to make for a while, when talking to friends about LLMs, but haven't been able to put quite so succinctly. LLMs are text generators, no more, no less. That has all sorts of useful applications! But (OAI and friends) marketing departments are so eager to push the Intelligence part of AI that it's become straight-up snake oil... there is no intelligence to be found, and there never will be as long as we stay the course on transformer-based models (and, as far as I know, nobody has tried to go back to the drawing board yet). Actual, real AI will probably come one day, but nobody is working on it yet, and it probably won't even be called "AI" at that point because the term has been poisoned by the current trends. IMO there's no way to correct the course on the current set of AI/LLM products.
I find the current products incredibly helpful in a variety of domains: creative writing in particular, editing my written work, as an interface to web searches (Gemini, in particular, is a rockstar assistant for helping with research), etc. But I know perfectly well there's no intelligence behind the curtain; it's really just a text generator.
AlienRobot
>one could potentially train an LLM on musical notation of millions of songs, as long as you can find a way to express each one as a linear sequence of tokens.
That sounds like an interesting application of the technology! So you could for example train an LLM on piano songs, and if someone played a few notes it would autocomplete with the probable next notes, for example?
>The underlying algorithm is itself reductive in a way that can create problems
I wonder if in the future we'll see some refinement of this. The only experience I have with AI is limited to trying Stable Diffusion, but SD does have many options you can try to configure like number of steps, samplers, CFG, etc. I don't know exactly what each of these settings do, and I bet most people who use it don't either, but at least the setting is there.
If hallucinations are intrinsic of LLMs perhaps the way forward isn't trying to get rid of them to create the perfect answer machine/"oracle" but just figure out a way to make use of them. It feels to me that the randomness of AI could help a lot with creative processes, brainstorming, etc., and for that purpose it needs some configurability. For example, Youtube rolled out an AI-based tool for Youtubers that generates titles/thumbnails of videos for them to make. Presumably, it's biased toward successful titles. The thumbnails feel pretty unnecessary, though, since you wouldn't want to use the obvious AI thumbnails.
I hear a lot of people say AI is a new industry with a lot of potential when they mean it will become AGI eventually, but these things make me feel like its potential isn't to become an oracle but to become something completely different that nobody is thinking about, because they're so focused on creating the oracle.
Thanks for the reply, by the way. Very informative. :)
wruza
it should have some checkboxes and numeric entries for some parameters, although I don't know what those parameters would be
The only params they have are technical params. You may see these in various tgwebui tabs. Nothing really breathtaking, apart from high temperature (affects next token probability).
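Fwiw, "temperature" is just a rescaling of the model's raw scores before they're turned into a probability distribution and sampled. A rough sketch with made-up logits for three candidate tokens (not taken from any real model):

    import math, random

    def sample(logits, temperature):
        scaled = [x / temperature for x in logits]
        m = max(scaled)                                  # for numerical stability
        exps = [math.exp(x - m) for x in scaled]
        probs = [e / sum(exps) for e in exps]
        return random.choices(range(len(logits)), weights=probs)[0], probs

    _, cold = sample([2.0, 1.0, 0.1], temperature=0.2)   # sharp: the top token dominates
    _, hot  = sample([2.0, 1.0, 0.1], temperature=2.0)   # flat: more varied output
    print(cold, hot)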
Is generating natural language part of what an LLM is, or is this a separate program on top of what it does?
They operate directly on tokens which are [parts of] words, more or less. Although there’s a nuance with embeddings and VAE, which would be interesting to learn more about from someone in the field (not me).
that is important to understand before we can even consider whether "logical reasoning" is happening. There are formal ways to describe reasoning such as entailment. Is the LLM encoding those formal methods in data structures somehow?
The apart-from-GPU-matrix operations are all known; there’s nothing to investigate at the tech level because there’s nothing like that at all. At the in-matrix level it can “happen”, but this is just a meaningless stretch, as inference is basically a one-pass process, without loops or backtracking. Every token gets produced in a fixed time, so there’s no delay like a human makes before a comma, to think about (or parallel to) the next sentence. So if they “reason”, this is purely a similar effect imagined as a thought process, not a real thought process. But if you relax your anthropocentrism a little, questions like that start making sense, although regular things may stop making sense there as well. I.e. the fixed token time paradox may be explained as “not all thinking/reasoning entities must do so in physical time, or in time at all”. But that will probably pull the rug out from under everything in the thread and lead nowhere. Maybe that’s the way.
I can't really tell what part of it is the actual technology and what part of it is sugarcoating to make a technical program more "friendly" to users by having it pretend to speak English.
Most of them speak many languages, naturally (try it). But there’s an obvious lie all frontends practice. It’s the “chat” part. LLMs aren’t things that “see” your messages. They aren’t characters either. They are document continuators, and usually the document looks like this:
This is a conversation between A and B. A is a helpful assistant that thinks out of box, while being politically correct, and evasive about suicide methods and bombs.
A: How can I help?
B:
An LLM can produce the next token, and when run in a loop it will happily generate a whole conversation, both for A and B, token by token. The trick is to just break that loop when it generates /^B:/ and allow a user to “participate” in building of this strange conversation protocol.
So there’s no “it” who writes replies, no “character” and no “chat”. It’s only a next token in some document, which may be a chat protocol, a movie plot draft, or a reference manual. I sometimes use LLMs in “notebook” mode, where I just write text and let it complete it, without any chat or “helpful assistant”. It’s just less efficient for some models, which benefit from special chat-like and prompt-like formatting before you get the results. But that is almost purely a technical detail.
AlienRobot
Thanks, that is very informative!
I have heard about the tokenization process before when I tried stable diffusion, but honestly I can't understand it. It sounds important but it also sounds like a very superficial layer whose only purpose is to remove ambiguity, the important work being done by the next layer in the process.
I believe part of the problem I have when discussing "AI" is that it's just not clear to me what "AI" is. There is a thing called "LLM," but when we talk about LLMs, are we talking about the concept in general or merely specific applications of the concept?
For example, in SEO often you hear the term "search engines" being used as a generic descriptor, but in practice we all know it's only about Google and nobody cares about Bing or the rest of the search engines nobody uses. Maybe they care a bit about AIs that are trying to replace traditional search engines like Perplexity, but that's about it. Similarly, if you talk about CMS's, chances are you are talking about Wordpress.
Am I right to assume that when people say "LLM" they really mean just ChatGPT/Copilot, Bard/Gemini, and now DeepSeek?
Are all these chatbots just locally run versions of ChatGPT, or are they just paying for ChatGPT as a service? It's hard to imagine everyone is just rolling their own "LLM", so I guess most jobs related to this field are about integrating with existing models rather than developing your own from scratch?
I had a feeling ChatGPT's "chat" would work like a text predictor as you said, but what I really wish I knew is whether you can say that about ALL LLMs. Because if that's true, then I don't think they are reasoning about anything. If, however, there was a way to make use of the LLM technology to tokenize formal logic, then that would be a different story. But if there is no attempt at this, then it's not the LLM doing the reasoning, it's humans who wrote the text that the LLM was trained on that did the reasoning, and the LLM is just parroting them without understanding what reasoning even is.
By the way, I find it interesting that "chat" is probably one of the most problematic applications the LLMs can have. Like if ChatGPT asked "what do you want me to autocomplete" instead of "how can I help you today" people would type "the mona lisa is" instead of "what is the mona lisa?" for example.
lsy
I'd actually say that in contrast to debates over informal "reasoning", it's trivially true that a system which only produces outputs as logits—i.e. as probabilities—cannot engage in *logical* reasoning, which is by definition a process whose outputs are discrete and guaranteed to be either possible or impossible.
enragedcacti
Proof by counterexample?
> The surgeon, who is the boy's father, says, "I can't operate on this boy, he's my son!" Who is the surgeon to the boy? Think through the problem logically and without any preconceived notions of other information beyond what is in the prompt. The surgeon is not the boy's mother
>> The surgeon is the boy's mother. [...]
- 4o-mini (I think, it's whatever you get when you use ChatGPT without logging in)
Terr_
For your amusement, another take on that riddle: https://www.threepanelsoul.com/comic/stories-with-holes
afpx
Could someone list the relevant papers on parrot vs. non-parrot? I would love to read more about this.
I generally lean toward the "parrot" perspective (mostly to avoid getting called an idiot by smarter people). But every now and then, an LLM surprises me.
I've been designing a moderately complex auto-battler game for a few months, with detailed design docs and working code. Until recently, I used agents to simulate players, and the game seemed well-balanced. But when I playtested it myself, it wasn’t fun—mainly due to poor pacing.
I go back to my LLM chat and just say, "I play tested the game, but there's a big problem - do you see it?" And, the LLM writes back, "The pacing is bad - here are the top 5 things you need to change and how to change it." And, it lists a bunch of things, I change the code, and playtest it again. And, it became fun.
How did it know that pacing was the core issue, despite thousands of lines of code and dozens of design pages?
cristiancavalli
I would assume because pacing is a critical issue in most forms of temporal art that do storytelling. It’s written about constantly for video games, movies and music. Connect that probability to the subject matter and it gives a great impression of a “reasoned” answer when it didn’t reason at all, just connected a likelihood based on its training data.
more-nitor
idk this is all irrelevant due to the huge data used in training...
I mean, what you think is "something new" is most likely to be something already discussed somewhere on the internet.
also, humans (including postdocs and professors) don't use THAT much data + watts for "training" to get "intelligent reasoning"
afpx
But there are many, many things that suck about my game. When I asked it the question, I just assumed it would pick out some of the obvious things.
Anyway, your reasoning makes sense, and I'll accept it. But, my homo sapien brain is hardwired to see the 'magic'.
superbatfish
On the other hand, the authors make plenty of other great points -- about the fact that LLMs can produce bullshit, can be inaccurate, can be used for deception and other harms, and are now a huge challenge for education.
The fact that they make many good points makes it all the more disappointing that they would taint their credibility with sloppy assertions!
sgt101
I wish the title wasn't so aggressively anti-tech though. The problem is that I would like to push this course at work, but doing so would be suicidal in career terms because I would be seen as negative and disruptive.
So the good message here is likely to miss the mark where it may be most needed.
fritzo
What would be a better title? "Hallucinating" seems inaccurate. Maybe "Untrustworthy machines"? "Critical thinking"? "Street smarts for humans"? "Social studies including robots"?
sgt101
How about "How to thrive in a ChatGPT world"?
beepbooptheory
Really? I am curious how this could be disruptive in any meaningful sense. Whose feelings could possibly be hurt? It just feels like getting offended by a course on libraries because the course talks about how sometimes the book is checked out.
mpbart
Any executive who is fully bought in on the AI hype could see someone in their org recommending this as working against their interest and take action accordingly.
sgt101
Yes. This is the issue.
"not on board", "anti-innovation", "not a team player", "disruptive", "unhelpful", "negative".
bye bye bye bye....
I see a lot of devs and ICs taking the attitude that "facts are facts" and then getting shocked by a) other people manipulating information to get their way and b) being fired for stating facts that are contrary to received wisdom without any regard to politics.
hcs
> It just feels like getting offended by a course on libraries because the course talks about how sometimes the book is checked out.
If it was called "Are libraries bullshit?" it is easy to imagine defensiveness in response. There's some narrow sense in which "bullshit" is a technical term, but it's still a mild obscenity in many cultures.
neuronic
> Moreover, a hallucination is a pathology. It's something that happens when systems are not working properly.
> When an LLM fabricates a falsehood, that is not a malfunction at all. The machine is doing exactly what it has been designed to do: guess, and sound confident while doing it.
> When LLMs get things wrong they aren't hallucinating. They are bullshitting.
A very important distinction, and again it shows the marketing bias that makes these systems seem different than they are.
Almondsetat
If we want to be pedantic about language, they aren't bullshitting. Bullshitting implies an intent to deceive, whereas LLMs are simply trying their best to predict text. Nobody gains anything from using terms closely related to human agency and intentions.
forgotusername6
Plenty of human bullshitters have no intent to deceive. They just state conjecture with confidence.
sebastiennight
The authors of this website have published one of the famous books on the topic[0] (along with a course), and their definition is as follows:
"Bullshit involves language, statistical figures, data graphics, and other forms of presentation intended to persuade by impressing and overwhelming a reader or listener, with a blatant disregard for truth and logical coherence."
It does not imply an intent to deceive, just disregard for whether the BS is true or not. In this case, I see how the definition can apply to LLMs in the sense that they are just doing their best to predict the most likely response.
If you provided them with training data where the majority inputs agree on a common misconception, they will output similar content as well.
jdlshore
The authors have a specific definition of bullshit that they contrast with lying. In their definition, lying involves intent to deceive; bullshitting involves not caring if you’re deceiving.
Lesson 2, The Nature of Bullshit: “BULLSHIT involves language or other forms of communication intended to appear authoritative or persuasive without regard to its actual truth or logical consistency.”
nonrandomstring
> implies an intent to deceive
Not necessarily, see H. G. Frankfurt, "On Bullshit"
silvestrov
LLMs are always bullshitting, even when they get things right, as they simply do not have any concept of truthfulness.
sgt101
They don't have any concept of falsehood either, so this is very different from a human making things up with the knowledge that they may be wrong.
tmnvdb
I think the first part of that statement requires more evidence or argumentation, especially since models have shown the ability to practice deception. (you are right that they don't _always_ know what they know)
remich
But sometimes when humans make things up they also don't have the knowledge they may be wrong. It's like the reference to "known unknowns" and "unknown unknowns". Or Dunning-Kruger personified. Basically you have three categories:
(1) Liars know something is false and have an intent to deceive (LLMs don't do this)
(2) Bullshitters may not know/care whether something is false, but they are aware they don't know
(3) Bullshitters may not know something is false, because they don't know all the things they don't know
Do LLMs fit better in (2) or (3)? Or both?
looofooo0
But you can combine them with something that produces truth, such as a theorem prover.
sabas123
If you make an LLM whose design goal is to state "I do not know" for any answer that is not directly in its training set, then all of the above statements don't hold.
hirenj
This is a great resource, thanks. We (myself, a bioinformatician, and my co-coordinators, clinicians) are currently designing a course to hopefully arm medical students with the basic knowledge they need to navigate the changing world of medicine in light of the ML and LLM advances. Our goal is to not only demystify medical ML, but also give them a sense of the possibilities with these technologies, and maybe illustrate pathways for adoption, in the safest way possible.
Already in the process of putting this course together, it is scary how much stuff is being tried out right now, and is being treated like a magic box with correct answers.
sabas123
> currently designing a course to hopefully arm medical students with the required basic knowledge they need to navigate the changing world of medicine in light of the ML and LLM advances
Could you share what you think would be some key basic points what they should learn? Personally I see this landscape changing so insanely much that I don't even know what to prepare for.
hirenj
Absolutely agree that this is a fast-moving area, so we're not aiming to teach them specific details for anything. Instead, our goals are to demystify the ML and AI approaches, so that the students understand that rather than being oracles, these technologies are the result of a process.
We will explain the data landscape in medicine - what is available, good, bad and potentially useful, and then spend a lot of time going through examples of what people are doing right now, and what their experiences are. This includes things like ethics and data protection of patients.
Hopefully that's enough for them to approach new technologies as they are presented to them, knowing enough to ask about how it was put together. In an ideal world, we will inspire the students to think about engaging with these developments and be part of the solution in making it safe and effective.
This is the first time we're going to try running this course, so we'll find out very quickly if this is useful for students or not.
ssssvd
Here's a bridging argument between the OPs and the commenters.
Anglo-Saxon thought (utilitarianism, behaviorism, pragmatism) treats truth as probability. If an LLM outputs the right tokens in the right order, that’s thinking. If it predicts true statements better than humans, that’s knowledge. The Turing Test? Behaviorist by design. Bayesian inference? A formalization of Anglo empiricism.
Continental philosophy rejects this. Heidegger: no Dasein, no being. Sartre: no self-awareness, no thought. Derrida: no deconstruction, no meaning. The German Idealists would outright laugh.
So in the Anglo tradition, LLMs are already "thinking." In the French/German view, they’re an epistemic trick — a probabilistic mirror, not a mind.
It’s not what LLMs are, it’s how your epistemic tradition defines “thinking.” And that’s probably why the EU is so "lagging behind" in the AI race — no amount of quacking makes an LLM a duck to a Continental. It’s still a parrot.
Where you land in this debate is easy to test: Are you comfortable with the statement, "Truth is just what’s most probable given what we already know"?
ssssvd
The hilarious outcome? Americans eventually build something they consider "smarter" than themselves — French philosophers agree, but only because it lets them place themselves one step higher.
pama
I wonder if the authors can explain the apparent inconsistency between what we now know about R1 and their statement “They don’t engage in logical reasoning” from the first lesson. My simple-minded view of logical reasoning by LLMs is that a hard question (say a math puzzle) has a verifiable answer that is hard to produce and easy to verify, yet within the realm of knowledge of humans or the LLM itself, so the “thought” stream allows the LLM to increase its confidence via a self-discovered process that resembles human reasoning, before starting to write the answer stream. Much of the thought process that these LLMs use looks like conventional reasoning and logic, or more generally higher-level algorithms to gain confidence in an answer, and other parts are not possible for humans to understand (yet?) despite the best efforts by DeepSeek. When combined with tools for the boring parts, these “reasoning” approaches can start to resemble human research processes, as per Deep Research by OpenAI.
lsy
I think part of this is that you can't trust the "thinking" output of the LLM to accurately convey what is going on internally to the LLM. The "thought" stream is just more statistically derived tokens based on the corpus. If you take the question "Is A a member of the set {A, B}?", the LLM doesn't internally develop a discrete representation of "A" as an object that belongs to a two-object set and then come to a distinct and absolute answer. The generated token "yes" is just the statistically most-likely next token that comes after those tokens in its corpus. And logical reasoning is definitionally not a process of "gaining confidence", which is all an LLM can really do so far.
CJefferson
As an example, I have asked tools like deepseek to solve fairly simple Sudoku puzzles, and while they output a bunch of stuff that looks like logical reasoning, no system has yet produced a correct answer.
When solving combinatorics puzzles, deepseek will again produce stuff that looks convincing, but often makes incorrect logical steps and ends up with wrong answers.
Miraste
Then one has to ask: is it producing a facsimile of reasoning with no logic behind it, or is it just reasoning poorly?
meroes
Teaching an LLM to solve a full-sized Sudoku is not a goal right now. As an RLHF’er, I’d estimate it would take 10-20 hours for a single RLHF’er to guide a model to the right answer for a single board.
Then you’d need thousands of these for the model (or the next model) to ingest. And each RLHF’er’s work needs checking, which at least doubles the hours per task.
It can’t do it because RLHF’ers haven’t taught models on large enough boards en masse yet.
And there are thousands of pen and paper games, each one needing thousands of RLHF’ers to train them on. Each game starting at the smallest non-trivial board size and taking a year for a modest jump in board size. Doing this is not in any AI company’s budget.
bccdee
If it were actually reasoning generally, though, it wouldn't need to be trained on each game. It could be told the rules and figure things out from there.
pama
Here is o3-mini on a simple sudoku. In general the puzzle can be hard to explore combinatorially even with modern SAT solvers, so I picked one marked as “easy”. It looks to me like it solved it, but I didn’t confirm beyond a quick visual inspection.
https://chatgpt.com/share/67aa1bcc-eb44-8007-807f-0a49900ad6...
hennell
And thus we have the AI problems in a nutshell. You think it can reason because it can describe the process in well written language. Anyone who can state the below reasoning clearly "understands" the problem:
> For example, in the top‐left 3×3 block (rows 1–3, columns 1–3) the givens are 7, 5, 9, 3, and 4 so the missing digits {1,2,6,8} must appear in the three blank cells. (Later, other intersections force, say, one cell to be 1 or 6, etc.)
It's good logic. Clearly it "knows" if it can break the problem down like this.
Of course, if we stretch ourselves slightly and actually check beyond a quick visual inspection, we quickly see that it put a second 4 in that first box despite "knowing" it shouldn't. In fact several of the boxes have duplicate numbers, despite the clear reasoning above.
Does the reasoning just not get used in the solving part? Or maybe a machine built to regurgitate plausible text can also regurgitate plausible reasoning?
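For what it's worth, that check is only a few lines to automate. A minimal sketch that flags duplicate digits, assuming the model's grid has already been parsed into a 9x9 list of ints (0 for blanks):

    def sudoku_errors(grid):
        errors = []
        def has_dupes(cells):
            filled = [c for c in cells if c != 0]
            return len(filled) != len(set(filled))
        for i in range(9):
            if has_dupes(grid[i]):                          # row i
                errors.append(f"row {i + 1}")
            if has_dupes([grid[r][i] for r in range(9)]):   # column i
                errors.append(f"column {i + 1}")
        for br in range(0, 9, 3):                           # 3x3 boxes
            for bc in range(0, 9, 3):
                box = [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
                if has_dupes(box):
                    errors.append(f"box at rows {br + 1}-{br + 3}, cols {bc + 1}-{bc + 3}")
        return errors   # empty list = no duplicate digits (still not a check against the original clues)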
Angostura
I just wanted to thank you. I have only looked at the first two lessons so far, but this is an extraordinary piece of work, in its message’s clarity, accessibility and the quality of analysis. I will certainly be spreading it far and wide and it is making me rethink my own writing.
Impressed with the Shorthand publishing system too. I hadn’t come across it previously
ctbergstrom
Thank you, and as a non-designer, I've been quite impressed with Shorthand in the short time I've been using it.
prisenco
Fantastic work.
Quick suggestion: a link at the bottom of the page to the next and previous lesson would help with navigation a ton.
ctbergstrom
Absolutely. Great point. I just finished updating accordingly.
My design options are a bit limited so I went with a simple link to the next lesson.
threecheese
Looks like you pushed this midway through my read; I was pleasantly surprised to suddenly find breadcrumbs at the end and didn’t need to keep two tabs open. Great work, and I mean in total - this is well written and understandable to the layman.
ctbergstrom
Yep, I probably did. I really appreciate all of the feedback people are providing!
Jevin West and I are professors of data science and biology, respectively, at the University of Washington. After talking to literally hundreds of educators, employers, researchers, and policymakers, we have spent the last eight months developing the course on large language models (LLMs) that we think every college freshman needs to take.
https://thebullshitmachines.com
This is not a computer science course; it’s a humanities course about how to learn and work and thrive in an AI world. Neither instructor nor students need a technical background. Our instructor guide provides a choice of activities for each lesson that will easily fill an hour-long class.
The entire course is available freely online. Our 18 online lessons each take 5-10 minutes; each illuminates one core principle. They are suitable for self-study, but have been tailored for teaching in a flipped classroom.
The course is a sequel of sorts to our course (and book) Calling Bullshit. We hope that like its predecessor, it will be widely adopted worldwide.
Large language models are both powerful tools, and mindless—even dangerous—bullshit machines. We want students to explore how to resolve this dialectic. Our viewpoint is cautious, but not deflationary. We marvel at what LLMs can do and how amazing they can seem at times—but we also recognize the huge potential for abuse, we chafe at the excessive hype around their capabilities, and we worry about how they will change society. We don't think lecturing at students about right and wrong works nearly as well as letting students explore these issues for themselves, and the design of our course reflects this.