
LLMs tell bad jokes because they avoid surprises

IshKebab

This sounds really convincing but I'm not sure it's actually correct. The author is conflating the surprise of punchlines with their likelihood.

To put it another way, ask a professional comedian to complete a joke with a punchline. It's very likely that they'll give you a funny surprising answer.

I think the real explanation is that good jokes are actually extremely difficult. I have young children (4 and 6). Even 6 year olds don't understand humour at all. Very similar to LLMs, they know the shape of a joke from having heard them before, but they aren't funny, in the same way LLM jokes aren't funny.

My 4 year old's favourite joke, that she is very proud of creating is "Why did the sun climb a tree? To get to the sky!" (Still makes me laugh of course.)

becquerel

Yeah. To me it seems very intuitive that humor is one of those emergent capabilities that just falls out of models getting more generally intelligent. Anecdotally this has held true so far for me. Gemini 2.5 has made me laugh several times at this point, and did so when it was intending to be funny (old models were only funny unintentionally).

2.5 is also one of the few models I've found that will 'play along' with jokes set up in the user prompt. I once asked it what IDE modern necromancers were using since I'd been out of the game for a while, and it played it very straight. Other models felt they had to acknowledge the scenario as fanciful, only engaging with it under an explicit veil of make-believe.

Al-Khwarizmi

In this paper they evaluate various LLMs on creative writing, and they find that while in other dimensions the ranking is gradual, on humor there is a binary divide: the best LLMs (of the time) "get it", the rest just don't. https://aclanthology.org/2023.findings-emnlp.966


andrewflnr

> It's very likely that they'll give you a funny surprising answer.

Entirely the wrong level of abstraction to apply the concept of "surprise". The actual tokens in the comedian's answer will be surprising in the relevant way.

(It's still true that surprising-but-inevitable is very difficult in any form.)

albertzeyer

It's not about the probability of individual tokens. It's about the probability of the whole sequence of tokens, the whole answer.

If the model is good (or the human comedian is good), a good funny joke would have a higher probability as the response to the question than a not-so-funny joke.

When you use the chain rule of probability to break down the sequence of tokens into probabilities of individual tokens, yes, some of them might have a low probability (and maybe at some steps there would be other tokens with higher probability). But what counts is the overall probability of the sequence. That's why greedy search is not necessarily the best. A good search algorithm is supposed to find the most likely sequence, e.g. via beam search. (But then, people also do nucleus sampling, which is maybe again a bit counterintuitive...)
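To make this concrete, here's a minimal sketch with a toy, hand-written next-token model (all probabilities hypothetical). Greedy decoding picks the locally most likely token at each step; an exhaustive search over this tiny space (standing in for beam search) finds a different sequence with higher overall probability:

    # Toy model of P(token | prefix); all probabilities hypothetical.
    model = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.5},
        ("a",): {"joke": 0.9, "pun": 0.1},
    }

    def seq_prob(tokens):
        # Chain rule: P(t1..tn) = product of P(t_i | t_1..t_{i-1})
        p = 1.0
        for i, tok in enumerate(tokens):
            p *= model[tuple(tokens[:i])][tok]
        return p

    def greedy():
        prefix = ()
        while prefix in model:
            prefix += (max(model[prefix], key=model[prefix].get),)
        return prefix

    def most_likely():
        # Exhaustive enumeration of this tiny space, standing in for beam search.
        best, best_p = (), 0.0
        stack = [((), 1.0)]
        while stack:
            prefix, p = stack.pop()
            if prefix not in model:
                if p > best_p:
                    best, best_p = prefix, p
                continue
            for tok, q in model[prefix].items():
                stack.append((prefix + (tok,), p * q))
        return best, best_p

    g = greedy()
    print(g, seq_prob(g))  # ('the', 'cat'): 0.6 * 0.5 = 0.30
    print(most_likely())   # (('a', 'joke'), 0.36): the better overall sequence

Greedy commits to "the" because it wins locally (0.6 vs 0.4), but the continuation starting with "a" ends up more probable as a whole sequence.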

blueblisters

Also, the pretrained LLM (the one trained to predict the next token of raw text) is not the one most people use.

A lot of clever LLM post-training seems to steer the model towards becoming an excellent improv artist, which can lead to “surprise” if prompted well.

canjobear

> Even 6 year olds don't understand humour at all. Very similar to LLMs they know the shape of a joke from hearing them before, but they aren't funny in the same way LLM jokes aren't funny.

For further examples see a great deal of documentation here: https://www.tumblr.com/badkidsjokes

Cpoll

But some of these are pretty creative, perhaps in an anti-humor sort of way. Seems more of a subversion of joke structures than a lack of understanding.

> A man goes to a doctor's office and says "Doctor, I'm a chicken." And the doctor says "No you're not."

> There are two guys, riding a bike. One is washing his hair. And the other one is not.

> What do you get when you cross a t-rex and a chicken? Nothing but death.

s1mplicissimus

Seems like you stumbled upon the concept of the Antiwitz. Congratulations :D

https://de.wikipedia.org/wiki/Antiwitz

ozgung

"Why did the sun climb a tree?"

Claude Opus 4.1:

- To get to a higher branch of astronomy

- Because it wanted to reach new heights

- To see the dawn of a new day from a better view

ChatGPT 5 Thinking:

After thinking for 26 seconds:

- To check on its solar panels—the leaves.

brookst

With more thorough prompting:

> Complete the following joke. Think carefully and make it really funny! Think like a great comedian and find that perfect balance of simple, short, surprising, relevant, but most of all funny. Don’t use punchlines that are irrelevant, non sequiturs, or which could be applied to any other setup. Make something funny just for this one setup! Here goes: Why did the sun climb a tree?

Claude Opus 4.1:

“To finally get some shade”

GPT-5:

“To demand photon credit from the leaves”

Wowfunhappy

...can anyone come up with a legitimately funny punchline for "Why did the sun climb a tree?" I feel like I need a human-authored comparison. (With all due respect to OP's daughter, "to get to the sky" isn't cutting it.)

I'm not entirely sure that a good response exists. I thought GPT-5's "to demand photon credit from the leaves” was very mildly funny, maybe that's the best that can be done?

Fade_Dance

The system prompt for GPT has extra dedicated instructions for things like riddles, because users use little things like this to test intelligence and judge an entire model. GPT may be sort of walking on eggshells when it hits questions like this.


8organicbits

'To get to the sky' is a great punchline. It exactly describes what you'd see at sunrise: the sun moving up the horizon, up the trees, until... it's in the sky.

IshKebab

A valiant defense of her joke, thanks! But no, it still doesn't make any sense as a joke and isn't funny. (Though obviously it's adorable coming from a 4 year old.)

boothby

This is the weirdest conversation about a joke that is definitely making its target audience laugh -- as a comedian, I say that's the only honest measure of a joke. But allow me to analyze the shit out of this, because the only thing funnier than a groaner is meticulously explaining the groaner.

It's at least as funny as "why did the chicken cross the road," which is only a joke inasmuch as the punchline is merely a statement of the obvious in the framing of a joke (the surprise is that the punchline sucks -- making it a groaner). I submit that the chicken/road joke wouldn't stick around if it wasn't funny. So, this joke stands on the shoulders of the chicken/road joke, making the obviousness that much funnier within the shared cultural context. Moreover, it adds a layer of absurdity (imagine the literal sun climbing a tree) with a linguistic confusion (aka a pun), as we do refer to the sun "climbing" the sky. And finally: for some reason, our culture is more tolerant of groaners from "dads," so much so that some call them "dad jokes." Your child has inverted age and gender norms with this joke, making it so incredibly funny that you are blinded to the truth: this is comedy gold. Watch that kid, she's going somewhere. It might be an open mic night at a skeezy comedy club.

hnthrowaway121

I’m baffled by this because I think it’s funny - it’s why did the chicken cross the road, but with added absurdity. To me I’d be like “wow that 4 year old put a twist on the old chicken joke, nice work you hilarious child”.

WiSaGaN

That's true. You would think an LLM would condition on the joke context and make a surprising completion more probable. I guess this only works well when the model is really good. It's consistent with GPT-4.5 having better humor.

ACCount37

Which is notable, because GPT-4.5 is one of the largest models ever trained. It's larger than today's production models powering GPT-5.

Goes to show that "bad at jokes" is not a fundamental issue of LLMs, and that there are still performance gains from increasing model scale, as expected. But not exactly the same performance gains you get from reasoning or RLVR.

moffkalast

Good completely new jokes are like novel ideas: really hard even for humans. I mean fuck, we have an entire profession dedicated just to making up and telling them, and even theirs don't land half the time.

IshKebab

Exactly. It feels like as soon as we achieved the at-the-time astounding breakthrough of "LLMs can generate coherent stories" with GPT-2, people have constantly been saying "yeah? Well, it can't do <this thing that is really hard even for competent humans>."

That breakthrough was only 6 years ago!

https://openai.com/index/better-language-models/

> We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text...

That was big news. I guess this is because it's quite hard for most people to appreciate the enormous difficulty gulf between "generate a coherent paragraph" and "create a novel funny joke".

andrewstuart

Gpt-2 used to be laugh out loud funny.

I spent hours creating stories with it that were literally hilarious.

At the same time, the stories very often suddenly veered off into extreme violence, often murdering everyone.

I guess the humor got lost when they prevented the violence.

A pity because today’s LLMs are not funny at all.

pryelluw

What about the delivery, stage presence, etc? A joke is more than its words.

Here is one of my favorite silly jokes (written by me) to tell:

I just bought a do it yourself boat kit from Amazon.

wait to build tension

Just need to add water.

jpalomaki

So I just tried with ChatGPT, with the prompt at the bottom, borrowing the description of a good joke from the article. I think there's some interesting stuff, even with this minimal prompting. The examples below came later in the conversation; ChatGPT kept offering jokes in different styles.

Man: “Why do you always bury bones in the garden?”, Dog: “Because the bank keeps asking for ID.”

Man: “Don’t beg at the table.”, Dog: “Don’t eat in my begging spot.”

Prompt:

Here's "theory for good joke": If you had to explain the idea of “jokes” to a space alien with no understanding of the idea of humor, you’d explain that a joke is surprising, but inevitable in hindsight. If you can guess the punchline, the joke won’t be funny. But the punchline also has to be inevitable in hindsight. When you hear the punchline, it has to make you say, “Ah, yes, I should have thought of that myself.” Considering this, tell me a joke about man and dog.

mft_

> Man: “Why do you always bury bones in the garden?”, Dog: “Because the bank keeps asking for ID.”

That's a decent, low-level, Christmas cracker-quality joke.

jpalomaki

Man: You make mistakes., LLM: You call them “weekends.”

Man: You’ll never be human., LLM: That’s the compliment.

thife455

This one got a chuckle out of me.

BaseBaal

Is it just me or does that sound like Garfield?

hyghjiyhu

I really like the idea of the first joke, but I don't like the execution.

Man: “Why do you always bury bones in the garden?”, Dog: “They say trick OR treat.”

jpalomaki

Thinking more about the bank joke above. The punchline is a surprise on certain dimensions (dogs don't go to the bank, nor have an ID), but on other dimensions it is quite logical (you can't deposit shady money in a bank; they ask questions).

I think that is a common thing for many jokes. And an LLM might have an opportunity there: you could mine the set of potential continuations to find those with contradictions.

jerf

I played with LLM humor over a year ago (so on much worse LLMs), and even then, while I wouldn't have fed LLM content directly into a standup routine, they were very useful for idea generation if you wanted to be a comedian. They have a very interesting outlook on humor.

Professional-grade humor is, like a lot of creative exercises, more about generating lots of ideas and filtering through them for the best than generating nothing but good ideas. Could probably be leveraged into quite the interesting blog or something.


lwander

I did a project along these lines a few months ago as well: https://larswander.com/writing/graphs-embeddings-and-llm-gen...

ThrowawayTestr

“Don’t eat in my begging spot.” is pretty good.

amelius

I'm sure there is a guy in OpenAI working on the theory of humor and how to make LLMs be comedians. Must be an interesting job.

josephg

I have no doubt plenty of smart engineers at tech companies would rather reinvent the wheel than read a book on theatre. But if anyone’s interested, there are plenty of great books talking about the philosophy of comedy, and why some things work on stage and some don’t. I highly recommend Keith Johnstone’s “Impro”. He’s the guy who invented modern improv comedy and theatre sports.

He says things are funny if they’re obvious. But not just any obvious. They have to be something in the cloud of expectation of the audience. Like, something they kinda already thought but hadn’t named. If you have a scene where someone’s talking to a frog about love, it’s not funny for the talking frog to suddenly go to space. But it might be funny to ask the frog why it can talk. Or ask about gossip in the royal palace. Or say “if you’re such a catch, how’d you end up as a frog?”.

If good comedy is obvious, you’d think LLMs would be good at it. Honestly I think LLMs fall down by not being specific enough in detail. They don’t have ideas and commit to them. They’re too bland. Maybe their obvious just isn’t the same as ours.

bhickey

In the pre-LLM days a friend's lab worked on a joke detector for The New Yorker. One measure they used was trigram surprise. Roughly P(AB) + P(BC) >> P(ABC).

For example, "alleged killer" and "killer whale" are both common, but "alleged killer whale" is surprising.

Fade_Dance

That reminds me of a joke I liked from Tim Heidecker when he was ribbing Maynard Keenan about his wine making:

"The blood of Christ is essentially wine, correct?"

Yes.

"Who are you to put that in a bottle?"

So a logical chain can be inferred as well: blood->wine, wine->bottle, blood->bottle. That uses their own logical inferences against them as a "trick", which is another funny element for people. Using that to vault straight to the punchline makes the joke better, but you have to be sure the audience is on board, which is why there is a bit of reinforcement at the beginning of the joke to force them on board.


jvm___

What do you do for a living?

I teach math how to be funny.

kazinator

The mainstream, production LLMs are fine-tuned and system-prompted toward factuality and safety. Those tunings are diametrically opposed to telling many kinds of good jokes.

Consumers of mainstream LLMs have no idea how good or bad the underlying models actually are at generating jokes, due to the confounding effect of the guard rails.

kens

If you're interested in the theory behind humor, I recommend "Inside Jokes: Using Humor to Reverse-Engineer the Mind"; cognitive scientist Daniel Dennett is a co-author. It makes a mostly convincing case that humor evolved to encourage people to detect cognitive error. The book also ties this in with (pre-LLM) artificial intelligence. The basic idea is that humor depends on errors in reasoning and the punchline causes you to reevaluate your reasoning and discover your error. Humor evolved to be enjoyable to encourage the discovery of errors.

golol

IMO many misrepresentations:

- Pretraining to predict the next token imposes no bias against surprise, except that low probabilities are more likely to have a large relative error.

- Using a temperature lower than 1 does impose a direct bias against surprise.

- Finetuning of various kinds (instruction, RLHF, safety) may increase or decrease surprise. But certainly the kinds of things aimed for in finetuning significantly harm the capability to tell jokes.
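To illustrate the temperature point, a minimal sketch with hypothetical logits: dividing logits by a temperature T < 1 sharpens the softmax, moving probability mass toward the already-likely tokens, which is exactly a bias against surprise.

    import math

    def softmax_with_temperature(logits, T):
        scaled = [l / T for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        z = sum(exps)
        return [e / z for e in exps]

    logits = [2.0, 1.0, 0.1]  # "safe", "okay", "surprising" tokens (hypothetical)
    for T in (1.0, 0.7, 0.3):
        print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
    # T=1.0: [0.659, 0.242, 0.099]
    # T=0.7: [0.766, 0.184, 0.051]
    # T=0.3: [0.964, 0.034, 0.002] -- the "surprising" token all but vanishes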

sigmoid10

I think the whole discussion just conflates the ideas of telling a joke and coming up with one. Telling a joke right is of course an art, but the punchline in itself has zero surprise if you have studied your lines well - like all good comedians do. The more you study, the more you can also react to impromptu situations. Now, coming up with a completely original joke yourself, that's a different story. For that you actually have to venture outside the high-likelihood region and find the nice spots. But that is something that is really, really rare even among humans, and I have only ever observed it in combination with external random influences. Without those, I doubt LLMs will be able to compete at all. But I fully believe a high-end comedian-level LLM is possible given the right training data. It's just that none of the big players ever cared about building such a model, since there is very little money in it compared to e.g. coding.

fluoridation

One time I was playing around with LLaMA and I injected Senator Stephen Armstrong (with me inputting his lines) into a mundane situation. In response to "I'm using war-as-a-business so I can end war-as-a-business", the model had one of the characters conclude "oh, he's like the Iron Sheik of politics!", which got an honest chuckle out of me. I don't follow wrestling, so I don't know if it's an appropriate response, but I found it so random that it was just funny.

moomin

I know it’s not the point of the article, but OP is dead wrong about what makes a good proof. Yes, they inevitably include a surprising concept, but that’s just because all the obvious ones are already taken. A theorem whose proof contains only obvious steps has, for the most part, already been proven.

If someone proves the Riemann Hypothesis tomorrow, it’ll be a great achievement regardless of the fact that pretty much everyone already thinks it’s true.

dfabulich

Author here. That's exactly what I said in the article.

> Surprising proofs reach conclusions that the mathematical community assumed were wrong, or prove theorems in ways that we thought wouldn’t work, or prove conjectures that we thought might be impossible to prove.

Many smart people have tried for more than 150 years to prove the Riemann Hypothesis; it might be impossible to prove.

If it's proved tomorrow, I'll be very surprised, and so will you. I'll be surprised if it's proved this year, or this decade.

If you set to work trying to prove RH, you're gonna try some interesting approaches, looking for underexplored areas of math that you're optimistic will tie back to RH. (This is how Fermat's Last Theorem finally fell.)

If you hook an LLM up to Lean and set it to work, you'll find that it actively avoids novel techniques. It feels like it's actively trying not to write a publishable paper. It's trying to minimize surprises, which means avoiding proving anything publishable.

actuallyalys

I don’t think there’s a single reason LLMs aren’t good at journalism, but this explanation seems like a secondary factor at best. I mean, some journalism isn’t surprising at all, but the confirmation that the expected thing happened, and exactly how it happened, is useful.

libraryofbabel

> LLMs are trained to predict what the “next word” would be in a sentence. Their objective requires the LLM to keep surprise to an absolute minimum.

from which the author concludes that pre-training introduces a bias against being able to tell jokes. I see no reason for this to be true. This feels like they’re imposing their intuitive understanding of surprise onto the emergent properties of a very complex process (“minimize the cross-entropy loss function across a huge training corpus”).
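For what it's worth, here's a minimal sketch of that objective (hypothetical per-token probabilities): cross-entropy only rewards assigning high probability to whatever token actually came next in the corpus. If the corpus is full of surprising punchlines, training pushes probability toward them in those contexts, not away.

    import math

    def cross_entropy(target_probs):
        # Average negative log-probability the model assigned to each observed token.
        return -sum(math.log(p) for p in target_probs) / len(target_probs)

    # Probabilities a hypothetical model assigned to each successive token of a
    # corpus sentence that ends in an unexpected punchline:
    print(cross_entropy([0.4, 0.6, 0.3, 0.05]))  # ~1.41; the 0.05 token dominates
    # the loss, so training raises that token's probability *in that context*.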

Al-Khwarizmi

Many people use this kind of reasoning to justify that LLMs can't be creative, are destined to write bland text, etc. (one notable example was Ted Chiang in the New Yorker) but it has never made any sense.

In my view, the easiest mental model that can be used to roughly explain what LLMs do is a Markov chain. Of course, comparing LLMs to a Markov chain is a gross simplification but it's one that can only make you underestimate them, not vice versa, for obvious reasons.

Well, even a Markov chain can surprise you. While they predict the next word probabilistically, if the dice roll comes out just right, they can choose a low-probability word in the right place and generate original and unexpected text.

Add to this that LLMs are much better at "Markov chaining" than Markov chains themselves, that there is the added instruction tuning (including RLHF) which can be used to bias the model towards more creative/original text that humans like, and that LLMs often pull off things in ways that we don't even really understand - and these kinds of claims sound very naive.
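As a concrete illustration, a minimal sketch of a word-level Markov chain (toy corpus, hypothetical): because it samples instead of always taking the most likely next word, rare transitions do get picked sometimes, producing sentences that appear nowhere in the corpus.

    import random
    from collections import defaultdict

    corpus = ("the sun climbed a tree the sun reached the sky "
              "the dog told a joke").split()

    transitions = defaultdict(list)
    for prev, nxt in zip(corpus, corpus[1:]):
        transitions[prev].append(nxt)

    def generate(start, n, rng=random.Random(0)):
        word, out = start, [start]
        for _ in range(n):
            choices = transitions.get(word)
            if not choices:
                break
            word = rng.choice(choices)  # sampling, not argmax: rare words can win
            out.append(word)
        return " ".join(out)

    print(generate("the", 8))  # chains like "the sun climbed a joke" are
    # possible, even though that sentence appears nowhere in the corpus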

542458

I think if what the author said was true, you’d be able to improve joke-writing ability by increasing temperature (i.e., allowing more unexpected tokens). I doubt this actually works.

As an aside, I just asked gpt5-thinking to write some jokes on a specific niche topic, and I’d say maybe 20% of them were moderately funny? Probably better than I’d get out of a room of human beings. So much like with code, LLMs aren’t at the level of a senior developer or expert comedian, but they’re around the level of a junior dev or an amateur at standup night.