Is Winter Coming? (2024)
125 comments · May 19, 2025
lolinder
An example of the prompt engineering phenomenon: my wife and I were recently discussing a financial decision. I'd offered my arguments in favor of one choice and she was mostly persuaded, but decided to check in with ChatGPT to help reassure herself that I was right. She asked the financial question in layman's terms and got the opposite answer to the one I had given.
She showed me the result and I immediately saw the logical flaws and pointed them out to her. She pressed the model on it and it of course apologized and corrected itself. Out of curiosity I tried the prompt again, this time using financial jargon that I was familiar with and my wife was not. The intended meaning of the words was the same; the only difference was that my prompt sounded like it came from someone who knew finance. The result was that the model got it right and gave an explanation for the reasoning in exacting detail.
It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
colinmorelli
Related: a similar thing happened when I sent my dog's recent bloodwork to an LLM, including dates, tests, and values. The model suggested that a rise in her kidney values (all still within normal range) was likely evidence of chronic kidney disease in its early stage. Naturally this caused some concern for my wife.
But I work in healthcare and have enough knowledge of health to know that CKD almost certainly could not advance fast enough to cause the kidney value changes seen in labs only 6 weeks apart. I asked the LLM whether that's the best explanation for these values given they're only 6 weeks apart, and it adjusted its answer to say CKD is likely not the explanation, as progression at this stage would typically happen over 6+ months to a year, and that more likely explanations were nephrotoxins (recent NSAID use), temporary dehydration, or recent infection.
We then spoke to our vet who confirmed that CKD would be unlikely to explain a shift in values like this between two tests that were just 6 weeks apart.
That would almost certainly throw off someone with less knowledge about this, however. If the tests were 4-6 months apart, CKD could explain the change. It's not an implausible explanation, but it skipped over a critical piece of information (the time between tests) before originally coming to that answer.
osigurdson
The internet, and now LLMs, have always been bad at diagnosing medical problems. I think it comes from the data source. For instance, few articles would be linked to / popular if a given set of symptoms were just associated with not getting enough sleep. No, the articles that stand out are the ones where the symptoms are associated with some rare / horrible condition. This is our LLM training data, which is often missing the entire middle part of the bell curve.
colinmorelli
For what it's worth, this statement is actually not entirely correct anymore. Top-end models today are on par with the diagnostic capabilities of physicians on average (across many specialties) and, in some cases, can outperform them when RAG'd in with vetted clinical guidelines (like NIH data, UpToDate, etc.).
However, they do have particular types of failure modes that they're more prone to, and this is one of them. So they're imperfect.
whstl
I very often also get better programming results than less experienced engineers, even though I'm not remotely doing any kind of "prompt engineering".
Also, how you ask matters a lot. Sometimes it just wants to make you happy with whatever answer; if you go along without skepticism, it will definitely produce garbage.
Fun story: at a previous job a Product Manager made someone work a full week on a QR-code standard that doesn't exist, except in ChatGPT's mind. It produced test cases and examples, but since nobody had a way to test it, nothing caught the problem.
When it was sent to a bank in Sweden to test, the customer's response was just "wait, this feature doesn't exist in Sweden," and a heated discussion ensued until the PM admitted to using ChatGPT to create the requirements.
c22
This same phenomenon is true for classic search engines as well. Whenever I am becoming informed on a new topic, my first searches are always very naive and targeted just at discovering the relevant jargon that will let me make better searches. It turns out that many disciplines contain analogous concepts, just with different words used to describe them. Understanding the domain-specific language is more than half the battle.
skydhash
If I'm dealing with an unfamiliar domain, my next step is always Wikipedia or an introductory book, just to collect a set of keywords and references to narrow my future searches. I don't think I've ever asked Google a question-shaped query.
nothercastle
Yeah, I remember the trick to getting Google to provide good results being to find some key industry or area terms for what you were looking for. This doesn't work anymore because Google search has gotten so bad.
caust1c
This anecdote corroborates my theory that it will still be critical to become an expert in your field. Everyone is treating AI like it's a zero-sum game with regards to jobs being "lost" to AI, but the reality is that the best results will come from experts in the field who have the vocabulary and knowledge to get the best answers.
My fear is that people treat AI like an oracle when they should be treating it just like any other human being.
redeye100
This is just bad design. Or a faulty tool. Why should the job market shift to accommodate this gap in the functioning of LLMs? This is a bug that needs to be fixed.
I have a personal gripe about this: bringing an unfinished tool to market and then prophesying about its usefulness, and about how we'd all better get ready for it. This seems very hand-wavy and is looking more and more like vaporware.
It's like trying to quickly build a house on an unfinished foundation. Why are we rushing to build? Can't we get the foundational things right first?
xnorswap
People treat certain humans, or humans in certain roles, as oracles too.
lblume
What percentage of people are actually experts at their jobs though?
aleph_minus_one
For the sake of discussion, I want to play devil's advocate concerning your point:
> It was an interesting result to me because it shows that experts in a field are not only more likely to recognize when a model is giving incorrect answers but they're also more likely to get correct answers because they are able to tap into a set of weights that are populated by text that knew what it was talking about. Lay people trying to use an LLM to understand an unfamiliar field are vulnerable to accidentally tapping into the "amateur" weights and ending up with an answer learned from random Reddit threads or SEO marketing blog posts, whereas experts can use jargon correctly in order to tap into answers learned from other experts.
Couldn't it be the case that people who are knowledgeable in the topic (in this case recognizable to the AI by their choice of wording) need different advice than people who know less about the topic?
To give one specific example from finance: if you know a lot about finance, a deep analysis of the best way to trade some exotic options is likely sound advice. On the other hand, for people who are not deeply into finance, the best advice is likely rather "don't do it!".
lolinder
In some cases, sure, but not here—neither option had more risk associated with it than the other, it was just an optimization problem. The first answer that the model gave to my wife was just wrong about the math, with no room for subjectivity.
dogleash
> Couldn't it be the case [...] need different advice than people who know less about the topic?
> for people who are not deeply into finance the best advice is likely rather "don't do it!".
Oh boy, more nanny software. This future blows.
aleph_minus_one
> Oh boy, more nanny software. This future blows.
I think this topic is a little more complicated: it is rather a balance the model has to strike between
1. "giving the best possible advice to the respective person given their circumstances" vs
2. "giving the most precise answer to the query to the user"
(if you ask me, the best decision would be to give the user a choice here, but that would be overtaxing for many users)
- Freedom-loving people will hate it if they don't get 2
- On the other hand, many people would like to actually get the advice that is most helpful to them (i.e. 1), and not the one that may answer their question exactly, but is likely a bad idea for them
cjohnson318
I had a similar experience. I did some back of the envelope math and my wife suggested I run it through ChatGPT. After actually doing the math, I felt a lot better about my understanding of the problem and I just... don't trust an LLM to understand algebra. Yeah, they're awesome most of the time, but I don't want to trust it with something important, be wrong, and then have to explain to someone that I trusted the opinion of a couple of big matrices, over my own knowledge and experience, on a high school word problem.
aaronbaugher
I've noticed Grok struggles with dates and relative time, especially when referring to things it "remembers" from earlier conversations. Even a phrase like "last night" will be wrongly interpreted sometimes. So although I've had it research numbers and create estimates for me, I wouldn't just assume the numbers are right without checking everything.
lblume
Since LLMs are basically linear algebra all the way down, this is vaguely reminiscent of how human brains also have a very hard time understanding neural circuitry despite literally being made from it.
ivanjermakov
LLM users are highly susceptible to confirmation bias, since they put their expectations into the prompt.
amelius
Maybe try a 2-step approach: first ask the LLM to translate your question into expert-language, then ask that question :)
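A minimal sketch of that two-step idea, assuming the OpenAI Python client (the model name, prompts, and the ask_as_expert helper are all illustrative, not anything specified in the comment):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"  # illustrative model name

def ask_as_expert(layman_question: str) -> str:
    # Step 1: have the model rewrite the question in domain jargon.
    rewritten = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's question the way a domain expert would "
                    "phrase it, using precise technical terminology. "
                    "Return only the rewritten question."
                ),
            },
            {"role": "user", "content": layman_question},
        ],
    ).choices[0].message.content

    # Step 2: ask the expert-phrased question in a fresh conversation.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": rewritten}],
    ).choices[0].message.content
    return answer

print(ask_as_expert("Is it better to pay off my loan early or invest the money?"))
```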
aleph_minus_one
The expert's answer when asked in expert language is "it's complicated". :-)
Fraterkes
I love reading, and I enjoy long-form articles, but I really wish technical bloggers especially would practice distilling their point into shorter posts. I notice it a lot with (older) Scott Alexander articles: this implicit assumption that your writing is informative/entertaining enough that you can stretch a simple idea to many pages.
I want to reiterate that I don't want dull, minimal writing. I don't subscribe to the "reduce your wordcount until it can't be reduced any further" school of writing advice. I just think that many people have very similar ideas about AI (and have written very similar things), and if you have something to say that you haven't seen expressed before, it is worthwhile (imo) to express it without preamble.
stego-tech
As someone with this very style (my own blog posts often rise into the 5k word range) and also from a technical background, I can at least explain my motivations for length: absolute domination of the argument.
In professional settings, brevity is often mistaken for inexperience or a weak position. As the thinking goes, a competent engineer should be able to defend every position they take like a PhD candidate defending their dissertation. At the same time, however, excess verbosity is viewed as distinctly “cold” and “engineer” in tone, and frowned upon by non-technical folks in my experience; they wanted an answer, not an explainer.
The problem is that each of us has data points about what succeeds in convincing others: the longer argument, every single time. Thus we use it in our own writing because we want to convince the imagined reader (as well as ourselves) that our position is correct, or at the very least, sound. In doing so we write lengthy posts, while often doing research to validate our positions with charts, screenshots, Wikipedia articles, news sources, etc. It's as much about convincing ourselves as it is other readers, hence why we go for longer posts based on real-world experiences.
One plot twist particular to me: my lengthy posts are also about quelling my brain, in a very real sense. My brain is the reader, and if I do not get everything out of my head about a topic and onto "paper", it will continue to dwell and gnaw on the missed points in perpetuity. Thus, 5k-word posts about things like the inefficiency of hate in Capital or a Systems Analysis of American Hegemony, just so I can have peace and quiet in my own head by getting it completely out of said head.
collinmcnulty
As someone who similarly writes to think, I found a lot of insight from this video [0] from the University of Chicago. Long story appropriately short, he recommends writing something twice: once for yourself and once for the reader.
mannykannot
I don't recall if this is covered in the video, but here are two pitfalls I have noticed from my own attempts:
1) If I am considering possible objections to my position, I have to be very clear which points I am raising only for the sake of argument, and which are the ones I am actually advocating for, or else it will appear confused or self-contradictory.
A related issue is preempting possible objections to the point where the reader might lose track of the main issue.
2) After making several passes to hone my position, it can seem so obvious to me that what I write for the reader is too terse for anyone who is approaching the issue for the first time.
stego-tech
That approach has helped me immensely in my communications, but less so for blog posts. I think it’s because I’ve fully internalized writing in my downtime as writing for myself first, and I just like longer, in-depth reads as a personal preference.
542354234235
Also, the internet in particular has a tendency to go out of its way to interpret things in the least charitable way possible. If there is a way to take anything you said negatively, or hyper literally, or any other way to misinterpret your intention, it will. So you tend to assume a bad-faith reading and preemptively explain/respond to possible nitpicks.
keiferski
Ironically this is one of the best use cases I’ve found for AI tools at the moment.
Summarize and critique this argument in a series of bullet points.
More seriously though, I think there is a lack of rigorous thinking about AI specifically and technology in general. And hence you get a lot of these rambling thought-style posts which are no doubt by intelligent people with something compelling to say, but without any fundamental method for analyzing those thoughts.
Which is why I really recommend taking a course in symbolic logic or analytic philosophy, if you are able to. You’ll quickly learn how to communicate your ideas in a straightforward, no nonsense manner.
SilverSlash
> Which is why I really recommend taking a course in symbolic logic or analytic philosophy, if you are able to. You’ll quickly learn how to communicate your ideas in a straightforward, no nonsense manner.
Do you have any free online course recommendations?
keiferski
I haven’t taken any online courses unfortunately (took them in person in college) but for symbolic logic I recommend the book by Klenk. I used that in my course and found it to be a good intro.
There are a bunch of lectures on YouTube about analytic philosophy though, and from a quick look they seem solid.
edent
The problem is, unless you expand every point then some jerk on HN will nit-pick your "logical fallacies".
I find myself writing longer and more defensively because lots of people don't understand nuance or subtext. Forget hyperbole or humour - lots of technical readers lack the ability to understand them.
Finally, editing is hard work. Revising and refining a document often takes several times longer than writing the first draft.
mvdtnz
Strategery is the most extreme example of this. He has interesting things to say but boy does he love the act of actually _saying_ them.
divan
I wish it were a skill that's easy to learn, but it's not.
disambiguation
I mean it's their blog, the writing is just as much for them as it is for you.
decimalenough
I'm generally quite skeptical of AI, but this overstates its case. Two things stand out:
> what LLMs do is string together words in a statistically highly probable manner.
This is not incorrect, but it's no longer a sufficient mental model for reasoning models. For example, while researching new monitors today, I told Gemini to compare $NEW_MODEL_1 with $NEW_MODEL_2. Its training data did not contain information about either model, but it was capable of searching the Internet to find information about both and provide me with a factual (and, yes, I checked, accurate) comparison of the differences in the specs of the models as well as a summary of sentiment for reliability etc for the two brands.
> Currently available software may very well make human drivers both more comfortable and safe, but the hype has promised completely autonomous cars reliably zipping about in rush hour traffic.
And this is already not hype, it's reality anywhere Waymo operates.
lolinder
To have a good mental model for modern AI agents you have to understand both the LLM and the other stuff that's built up around it. OP is correct about the behavior of LLMs, and that is valuable information to keep in mind. Then you layer on top of that an understanding that some implementations of agents will sometimes automatically feed search results into context, if you ask them to or are paying for an advanced tier or whatever the extra qualifications are for your particular tool.
If you skip this two-part understanding then you run the risk of missing when the agent decided not to do a search for some reason and is therefore entirely dependent on statistical probability in the training data. I've personally seen people without this mental model take an LLM at its word when it was wrong because they'd gotten used to it looking things up for them.
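For concreteness, a hedged sketch of that layering using the OpenAI Python client's function-calling interface (the web_search stub and the model name are placeholders, not any particular vendor's agent): the model may or may not decide to call the search tool, and when it doesn't, the answer comes purely from the statistics of its training data.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def web_search(query: str) -> str:
    # Placeholder: swap in a real search backend here.
    return f"(stub search result for: {query})"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Compare monitor model A with monitor model B."}]

while True:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    if not msg.tool_calls:
        # No tool call: the answer comes from the model's weights alone,
        # i.e. pure training-data statistics, with no fresh data in context.
        print(msg.content)
        break
    # The model chose to search: run the tool and feed results back into context.
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(args["query"]),
        })
```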
jvanderbot
Defending TFA a little ... the rest of the article builds up context around the word "Completely", so that the single example of highway zipping is not all that is being discussed.
"Completely" here should be expanded to include all the unique and unforeseen circumstances a driver might encounter, such as a policeman directing traffic manually or any other "soft" situation that is not well represented in training.
Not to mention the somewhat extreme amount of a priori and continuous mapping that goes into operating a fleet of AVs. That is hardly to be considered "Completely autonomous".
This isn't just pedantry, the disconnect between a technical person's deep understanding and a common user's everyday experience is pretty much what the article hinges on. Try taking a Waymo from SF to NYC. This seems like something a "Completely autonomous" car should be able to do given a layperson's understanding of "Completely", without the experts' long list of caveats.
efficax
waymo is pretty amazing but it's still short of complete full self-driving, and it's not at all clear that it can close the gap. similarly, current LLMs are really remarkable, but it's not at all clear that they will make the leap into "real" intelligence.
zahlman
> For example, while researching new monitors today, I told Gemini...
You told an agent, not just an LLM.
> And this is already not hype, it's reality anywhere Waymo operates.
Some beg to differ; see e.g. https://www.youtube.com/watch?v=040ejWnFkj0 .
pbalau
> For example, while researching new monitors today, I told Gemini to compare $NEW_MODEL_1 with $NEW_MODEL_2.
But this feature was a staple of most online shops that sell monitors, and of a bunch of "review" sites. You don't need a highly complex system to compare 2 monitors; you need a spreadsheet.
sceptic123
Waymo only works because it's geofenced — that's a massive barrier to "completely autonomous" (or level 5 automation)
agumonkey
> what LLMs do is string together words in a statistically highly probable manner.
I guess the article fails to admit that when you have billions of connected points in a vector space, "stringing together" is not simply "stringing together". I'm not a fanboy, but somehow GPT/attention-based logic is capable of parsing input and data and then remodeling it to a surprising depth.
belter
Waymo operates in highly controlled and mapped environments. Can they handle Rome or Mumbai?
whynotminot
Why is that the standard? I, a human, can’t handle driving in Mumbai.
And lol at anyone who thinks any urban driving environment is “highly controlled”.
jogjayr
I, a human who learned to drive in Mumbai, can't handle driving in Mumbai anymore.
rad_gruchalski
> Why is that the standard? I, a human, can’t handle driving in Mumbai.
Aren’t you confusing “navigating” vs “driving”?
jogjayr
Waymo works where it works and it's useful where it works. Can a Mumbai autorickshaw handle an American freeway? Does that make it a pointless vehicle?
smus
Have you been to San Francisco?
riehwvfbk
Compared to driving in the developing world though, SF traffic is very structured and very tame.
My favorite data point here is Cairo: the sound of traffic there is horns blaring and metal-on-metal. Driving in Cairo is a contact sport. And it doesn't seem to matter how nice a car is: a fancy Mercedes will have as many body dents as a rust-bucket Lada.
colinmorelli
This feels a lot like "moving the goalposts." First, it was complete science fiction to have technology in the car. Then, it was in the car, but it could only do navigation and music, it can't operate the car the way humans can. Then, it can prevent you from weaving out of your lane, and it can stop the car if you're about to crash into something, but it can't help you with your commute. Then, it can speed up, slow down, and steer on the highway, but it can't take you door to door. Now, it can take you door to door, but only in certain environments, it can't do it everywhere.
All of the above happened over the last ~20 years or so. The progression clearly seems to point to this being more than hype, even if it takes us longer to realize than originally anticipated.
thesuitonym
It's not really moving the goalposts, though. The idea of a self driving car has always been "I can get in my car, tell it where I want to go, and then it goes there while I read a book."
Having navigation and music, and lane assist, and adaptive cruise control, and some cars that can operate autonomously in some environments is great, but it's not what we meant when we said self driving cars.
gilleain
Not so much moving the goalposts as pointing out that playing American football is not like playing soccer (football - that is, driving in Rome) or even cricket (Mumbai).
In fact, cricket doesn't even _have_ goalposts, it has wickets. Driving in cities outside North America is very different.
decimalenough
Describing SF's Tenderloin at night as a "highly controlled environment" would be stretching it.
Anyway, you're moving the goalposts here. Waymo is operating at scale in actual human cities in actual rush hour traffic. Sure, it would struggle in Buffalo during a snowstorm or in Mumbai during the monsoon, but so do human drivers.
hiatus
> Sure, it would struggle in Buffalo during a snowstorm or in Mumbai during the monsoon, but so do human drivers.
We don't expect technology to be on par with human capabilities but to exceed them.
tim333
Winter seems unlikely in the near term because the steady increase in computing power has brought us to around human level, like in this cartoon: https://x.com/waitbutwhy/status/1919870578502021257
Havoc
The hype cooling down a bit might not be a terrible thing
osigurdson
A different kind of AI winter is already here. This "winter" is associated with companies laying people off and then lazily waiting around for AGI to emerge. This is leading to a kind of malaise that I think will ultimately be bad for economies. It is fine to use any available tool to boost productivity, but magical thinking is not sound management.
antirez
Orthogonal: the lemons in the picture, from Palermo (Sicily), could be not only lemons or lemon-shaped soap, but also a sweet, our very famous "frutta martorana": https://en.wikipedia.org/wiki/Frutta_martorana
wellUc
I predict a lot of circumlocutions about AI, but most people won't notice since they blindly follow TV/politics as-is anyway.
A lot of people (still a tiny proportion of the population) will be loud in opposition but ultimately overwhelmed by the nihilism and indifference of the aggregate.
The loudest will be those who perceive some loss to their own lifestyle that relies on exploiting others' attention, as AI presents a new risk to their attention-grabbing behaviors.
Then they will die off and humanity will carry on with AI not them.
Circle of life Simba.
qudat
While reading this article, I kept asking myself the question: "Why can't LLMs ask us follow-up questions?"
lblume
They absolutely can if you prompt them to. You can even add it to your system prompt for it to happen in every new conversation!
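A minimal sketch of that system-prompt approach, assuming the OpenAI Python client (the model name and the prompt wording are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "system",
            "content": (
                "Before answering, ask clarifying follow-up questions whenever "
                "the request is ambiguous or missing key details (time frames, "
                "units, constraints). Only answer once you have what you need."
            ),
        },
        {"role": "user", "content": "Is this bloodwork result concerning?"},
    ],
)
print(resp.choices[0].message.content)
```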
esjeon
AI winters will keep coming as long as the definition of AI stays relative. We used to call chess programs chess "AIs", but hardly anyone says that anymore. We call LLMs "AIs" now, but let's be real: a few decades from now, we'll probably be calling them token predictors, while some shiny new "AIs" are already out there kicking asses.
At the end of the day, "AI" really just means throwing expensive algorithms at problems we've labeled as "subjective" and hoping for the best. More compute, faster communication, bigger storage, and we get to run more of those algorithms. Odds are, the real bottleneck is hardware, not software. Better hardware just lets us take bolder swings at problems, basically wasting even more computing power on nothing.
So yeah, we’ll get yet another AI boom when a new computing paradigm shows up. And that boom will hit yet another AI winter, because it'll smack into the same old bottleneck. And when that winter hits, we'll do what we've always done. Move the goalposts, lower the bar, and start the cycle all over again. Just with new chips this time.
Ah, Jesus. I should quit drinking Turkish coffee.
rvz
Let's just say that people once thought Big Tech was invincible, until it wasn't.
Obviously those exposed to the AI hype will tell you that there is no winter.
Until the music stops and little to no one can make money out of this AI race to zero.
dinfinity
> Let's just say that people once thought Big Tech was invincible, until it wasn't.
Half the world runs on Big Tech. Some of them have cash reserves bigger than the GDP of sizeable countries. They lead in R&D investment: https://www.rdworldonline.com/top-15-rd-spenders-of-2024/
> Obviously those exposed in the AI hype will tell you that there is no winter.
Go look at how much money was spent on AI R&D in the last AI 'summers' (and winters). Pennies compared to the billions and billions of dollars the private and public sector is throwing at it right now.
Will some investments turn out to be a waste of time and money? Yes.
Will investment be reduced to a fraction of what it is today? Hell no.
The music stops when humans are economically obsolete.
GaggiX
>Spring 2024
Reasoning models like o1 had not yet been released at that time. It's amazing how much progress has been made since then.
Edit: also, Search wasn't available, as the blog mentions "citations".
netdevphoenix
The point isn't that progress is not happening but that it's slowing down. You get more of the same: smaller memory footprint, faster responses, fewer hallucinations, etc. Significant progress would be another DeepSeek kind of breakthrough: a near-0% hallucination rate, performing like current models with less than half of their dataset, epistemic self-awareness (i.e. "I am not sure of the correctness of the answer I just gave you"), the ability to override assumptions from the training dataset, etc.
We are just getting cars of different shapes and colours, with built-in speakers and radio. Not exactly progress
patapong
This assumes that LLMs are only useful if they are AGI. I don't think they need to be - what we have today is already sufficient to unlock an enormous amount of value; we just haven't done so yet.
Thus, I think we can compare them to electricity - a sophisticated technology with a ton of potential, which will take years to fully exploit, even if there are no more fundamental breakthroughs. But also not the solution to every single problem.
zahlman
Arguably, LLMs - or whatever systems succeed them - are only useful if they are not AGI. Given the evidence already collected about how willing humans are to make these systems "agentive", we pretty well have to worry about the possibility of an AGI using us instead. Even if there's some other logical barrier to recursive self-improvement ("hard takeoff") scenarios.
eisfresser
> another DeepSeek kind of breakthrough
That was only six months ago. I don't think this is an argument that things are slowing down (yet).
netdevphoenix
I didn't say that progress stopped, only that it is slowing down (i.e. breakthroughs become less frequent). DeepSeek happening 6 months ago doesn't counter what I said.
pixl97
>but that it's slowing down
Progress isn't a smooth curve but more step like.
Also, the last 10% of getting AI right is 90% of the work, but it doesn't seem that way to us humans. I don't think you understand the gigantic impact that last 10% is going to make on the world and how fast it will change things once we accomplish it.
Personally, I hope it takes us a while. We're not ready for this as a society and planet.
GaggiX
Reasoning models like o1 are on a whole new level compared to previous models. For example, they are incredible at math, something previous models struggled with a lot. This seems pretty huge to me, as the performance of previous models was flattening; it's kind of a new paradigm.
empath75
The consequences of a new thing being invented are not entirely dependent on progress in the thing itself. It takes quite a long time for people to build _on top of_ a new technology. LLMs could stop appreciably improving, and we've still barely scratched the surface of applying what we have.
Your car example is a perfect one -- society was _completely reordered_ around the car, even though the fundamental technology behind the car didn't change from the early 20th century until the invention of the electric car.
netdevphoenix
Surely, finding new applications for existing tech can't be considered progress in the development of that tech.
Jedd
That's also one of those things that would probably confuse LLMs as readily as it confuses North Americans (for much the same reason - training).
Spring 2024 for me was from the 1st of September to the 30th of November.
barbazoo
Only about 10% of Earth's population lives in the southern hemisphere. It's pretty fair to assume northern.
rgreeko42
o1 was not a major advancement