
O1 isn't a chat model (and that's the point)

geor9e

Instead of learning the latest workarounds for the kinks and quirks of a beta AI product, I'm going to wait 3 weeks for the advice to become completely obsolete

gwern

What people are discovering with the latest models is that often their errors are due to entirely reasonable choices and assumptions... which happen to be wrong in your specific case. They call a library you don't have installed, or something like that. Short of inventing either telepathy or spice which can allow LLMs to see the future, it will increasingly be the case that you cannot use the best models efficiently without giving them extensive context. Writing 'reports' where you dump in everything even tangentially relevant is the obvious way to do so, and so I would expect future LLMs to be even more like this than o1-preview/pro.
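
Concretely, the 'report' approach can be as crude as something like this sketch (assuming the OpenAI Python SDK; the file names and task are made up for illustration):

    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    # Dump everything even tangentially relevant into one "report".
    context_files = ["schema.sql", "billing/service.py", "billing/tests.py", "README.md"]
    report = "\n\n".join(f"### {name}\n{Path(name).read_text()}" for name in context_files)

    prompt = (
        "Here is everything relevant to the task:\n\n"
        f"{report}\n\n"
        "Task: add proration support to the billing service. "
        "Don't assume anything about the environment beyond what is shown above; "
        "ask me questions first if anything is unclear."
    )

    response = client.chat.completions.create(
        model="o1",  # or "o1-preview", depending on access
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)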

sheepscreek

I get much better output from o1* models when I dump a lot of context + leave a detailed but tightly scoped prompt with minimal ambiguity. Sometimes I even add - don’t assume, ask me if you are unsure. What I get back is usually very very high quality. To the point that I feel my 95th percentile coding skills have diminishing returns. I find that I am more productive researching and thinking about the what and leaving the how (implementation details) to the model - nudging it along.

One last thing, anecdotally: I find that it's often better to start a new chat after implementing a chunky bit of functionality.

gwern

Yes, I've tried out both: ordering it to ask me questions upfront, and sometimes restarting with an edited 'report' and a prototype implementation for a 'clean start'. It feels like it sometimes helps... but I have no numbers or rigorous evidence on that.

bionhoward

The economics of the deal over the long term are exponentially more critical than the performance in the short term, right?

In that context, how is convincing intelligent people to pay OpenAI to help train their own replacements while agreeing not to compete with them anything but the biggest, dumbest, most successful nerd snipe in history?

Dumping more context just implies getting brain raped even harder. Y’all are the horses paying to work at the glue factory. “Pro” users paying extra for that privilege, no thank you!

pizza

Maximum likelihood training tinges, nay, corrupts, everything it touches. That's before you pull apart the variously-typed maximum likelihood training processes that the artifice underwent.

Your model attempts to give you a reasonably maximum-likelihood output (in terms of KL-ball-constrained preference distributions not too far from language), and it expects you to be the maximum-likelihood user, since its equilibration is intended for a world in which you, the user, are just like the people who ended up in the training corpus, and in which the prompt you gave would be a maximum-likelihood query. The implication is that there are times when it's better for the model to ignore the you-specific contingencies in your prompt and instead re-envision your question as a noisily worded version of a more normal question.
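
Roughly formalized, the KL-constrained preference tuning being alluded to is the standard objective (notation illustrative, not the commenter's):

    \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
      \;-\; \beta\, D_{\mathrm{KL}}\!\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

where \pi_{\mathrm{ref}} is the maximum-likelihood pretrained model, r is a learned reward over preferences, and \beta bounds how far the tuned model may drift from the base distribution.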

I think there are probably some ways to still use maximum likelihood but switch out the 'what' that is being assumed as likely - e.g. models that attenuate dominant response strategies as needed by the user, and easy UX affordances for the user to better and more fluidly align the model with their own dispositional needs.

cap11235

MLE is a basic statistical technique. Feel free to go REEEE when Amazon recommends products.

lumost

Alternatively, we can standardize the environment. It takes humans weeks to adapt to a new interface or starting point. Why would AI be different?

raincole

There was a debate over whether to integrate Stable Diffusion into the curriculum in a local art school here.

Personally while I consider AI a useful tool, I think it's quite pointless to teach it in school, because whatever you learn will be obsolete next month.

Of course some people might argue that the whole art school (it's already quite a "job-seeking" type, mostly digital painting/Adobe After Effects) will be obsolete anyway...

simonw

The skill that's worth learning is how to investigate, experiment and think about these kinds of tools.

A "Stable Diffusion" class might be a waste of time, but a "Generative art" class where students are challenged to explore what's available, share their own experiments and discuss under what circumstances these tools could be useful, harmful, productive, misleading etc feels like it would be very relevant to me, no matter where the technology goes next.

moritzwarhier

Very true regarding the subjects of a hypothetical AI art class.

What's also important is the teaching of how commercial art or art in general is conceptualized, in other words:

What is important and why? Design thinking. I know that phrase might sound dated, but that's the work humans should fear being replaced in, and where they should foster their skills.

That's also the line that at first seems to be blurred when using generative text-to-image AI, or LLMs in general.

The seemingly magical connection between prompt and result appears to human users like the work of a creative entity distilling and developing an idea.

That's the most important aspect of all creative work.

If you read my reply: thanks Simon, your blog's an amazing companion in the boom of generative AI. I was a regular reader in 2022/2023 and should revisit! I think you guided me through my first local Llama setup.

londons_explore

All knowledge degrades with time. Medical books from the 1800's wouldn't be a lot of use today.

There is just a different decay curve for different topics.

Part of 'knowing' a field is to learn it and then keep up with the field.

dutchbookmaker

I would say it really depends on the goal of art school.

There is a creative purist idea that the best thing that can happen to an art student is to be thrown out of school early on, before it ruins their creativity.

If you put that aside, a Stable Diffusion art school class just sounds really cool to me as an elective. Especially the group that would be in this class. The problem I find with these tools is that they're overwhelmed by the average person, the non-artist, making pictures of cats and Darth Vader, so it's hard to find what real artists are doing in the space.

dyauspitr

Integrating it into the curriculum is strange. They should do one-time introductory lectures instead.

swyx

> whatever you learn will be obsolete next month

this is exactly the kind of attitude that turns university courses into dinosaurs with far less connection to the “real world” industry than ideal. frankly it's an excuse for laziness and luddism at this point. much of what i learned about food groups and economics and politics and writing in school is obsolete at this point; should my teachers not have bothered at all? out of what? fear?

the way stable diffusion works hasn't really changed, and in fact people have just built comfyui layers and workflows on top of it in the ensuing 3 years. the more you stick your head in the sand because you've already predetermined the outcome, the more you pile up debt that your students will have to pay off on their own, because you were too insecure to make a call and trust that your students can adjust as needed

loktarogar

The answer in formal education is probably somewhere in the middle. The stuff you learn shouldn't be obsolete by the time you graduate, but at the same time schools should be integrating new advancements sooner.

The problem has also always been that those who know enough about cutting edge stuff are generally not interested in teaching for a fraction of what they can get doing the stuff.

thornewolf

To be fair, the article basically says "ask the LLM for what you want in detail"

fullstackwife

great advice, but difficult to apply given the very small context window of the o1 models

jameslk

The churn is real. I wonder if so much churn due to innovation in a space can suppress adoption enough that it actually reduces innovation.

dartos

It’s churn because every new model may or may not break strategies that worked before.

Nobody is designing how to prompt models. Prompting is an emergent property of these models, so it can change entirely from one generation of any model to the next.

kyle_grove

IMO the lack of real version control and lack of reliable programmability have been significant impediments to impact and adoption. The control surfaces are more brittle than say, regex, which isn’t a good place to be.

I would quibble that there is a modicum of design in prompting; RLHF, DPO and ORPO are explicitly designing the models to be more promptable. But the methods don’t yet adequately scale to the variety of user inputs, especially in a customer-facing context.

My preference would be for the field to put more emphasis on control over LLMs, but it seems like the momentum is again on training LLM-based AGIs. Perhaps the Bitter Lesson has struck again.

miltonlost

A constantly changing “API” coupled with an inherently unreliable output is not conducive to a stable business.

ithkuil

It's interesting that despite all these real issues you're pointing out a lot of people nevertheless are drawn to interact with this technology.

It looks as if it touches some deep psychological lever: having an assistant that can help carry out tasks without your having to bother learning the boring details of a craft.

Unfortunately lead cannot yet be turned into gold

bbarnett

Unless your business is customer service reps, with no ability to do anything but read scripts, who have no real knowledge of how things actually work.

Then current AI is basically the same, for cheap.

QuantumGood

Great summary of how AI compresses the product development (and hype) cycle

AbstractH24

Out of curiosity, where do you look for the advice?

icpmacdo

Modern AI both shortens the useful lifespan of software and increases the importance of development speed. Waiting around doesn’t seem optimal right now.

goolulusaurs

The reality is that o1 is a step away from general intelligence and back towards narrow AI. It is great for solving the kinds of math, coding and logic puzzles it has been designed for, but for many kinds of tasks, including chat and creative writing, it is actually worse than 4o. It is good at the specific kinds of reasoning tasks it was built for, much like AlphaGo is great at playing Go, but that does not actually mean it is more generally intelligent.

madeofpalk

LLMs will not give us "artificial general intelligence", whatever that means.

righthand

AGI currently is an intentionally vague and undefined goal. This allows businesses to operate towards a goal, define the parameters, and revel in the “rocket launches”-esque hype without leaving the vague umbrella of AI. It allows businesses to claim a double pursuit: not only are they building AGI, but all their work will surely benefit AI as well. How noble. Right?

Its vagueness is intentional and allows you to turn a blind eye to the truth and fill in the gaps yourself. You just have to believe it's right around the corner.

pzs

"If the human brain were so simple that we could understand it, we would be so simple that we couldn’t." - without trying to defend such business practice, it appears very difficult to define what are necessary and sufficient properties that make AGI.

swalsh

In my opinion it's probably closer to real AGI than not. I think the missing piece is learning after the pretraining phase.

UltraSane

An AGI will be able to do any task any human can do. Or all tasks any human can do. An AGI will be able to get any college degree.

layer8

> any task any human can do

That doesn’t seem accurate, even if you limit it to mental tasks. For example, do we expect an AGI to be able to meditate, or to mentally introspect itself like a human, or to describe its inner qualia, in order to constitute an AGI?

Another thought: The way humans perform tasks is affected by involuntary aspects of the respective individual mind, in a way where the involuntariness itself is relevant (for example being repulsed by something, or something not crossing one's mind). If it is involuntary for the AGI as well, then it can't perform tasks in all the different ways that different humans would. And if it isn't involuntary for the AGI, can it really reproduce the way (all the ways) individual humans would perform a task? To put it more concretely: For every individual, there is probably a task that they can't perform (with a specific outcome) but that another individual can. If the same is true for an AGI, then by your definition it isn't an AGI, because it can't perform all tasks. On the other hand, if we assume it can perform all tasks, then it would be unlike any individual human, which raises the question of whether this is (a) possible, and (b) conceptually coherent to begin with.

nkrisc

So it’s not an AGI if it can’t create an AGI?

qup

You can't do any task humans can do

Xmd5a

No but they gave us GAI. The fact they flipped the frame problem(s) upside down is remarkable but not often discussed.

nurettin

I think it means a self-sufficient mind, which LLMs inherently are not.

ben_w

What is "self-sufficient" in this case?

Lots of debate since ChatGPT and Stable Diffusion can be summarised as A: "AI cheated by copying humans, it just mixes the bits up really small like a collage" B: "So like humans learning from books and studying artists?" A: "That doesn't count, it's totally different"

Even though I am quite happy to agree that differences exist, I have yet to see a clear answer as to what people even mean when asserting that AI learning from books is "cheating", given that it's *mandatory* for humans in most places.

swyx

it must be wonderful to live life with such supreme unfounded confidence. really, no sarcasm, i wonder what that is like: to be so sure of something when many smarter people are not, and when we don't know how our own intelligence fully works or evolved, and don't know if ANY lessons from our own intelligence even apply to artificial ones.

and yet, so confident. so secure. interesting.

sandspar

Social media doesn't punish people for overconfidence. In fact social media rewards people's controversial statements by giving them engagement - engagement like yours.

adrianN

So-so general intelligence is a lot harder to sell than narrow competence.

kilroy123

Yes, I don't understand their ridiculous AGI hype. I get it, you need to raise a lot of money.

We need to crack the code for updating the base model on the fly or daily / weekly. Where is the regular learning by doing?

Not over the course of a year, spending untold billions to do it.

tomohelix

Technically, the models can already learn on the fly. It's just that the knowledge they can learn is limited to the context length. They cannot, to use the trendy word, "grok" it and internally adjust the weights in their neural network yet.

To change this you would either need to let the model retrain itself every time it receives new information, or have such a great context length that there is no effective difference. I suspect even meat models like our brains still struggle to do this effectively and need a long rest cycle (i.e. sleep) to handle it. So the problem is inherently more difficult to solve than just "thinking". We may even need an entirely new architecture, different from the neural network, to achieve this.

chikere232

> Technically, the models can already learn on the fly. Just that the knowledge it can learn is limited to the context length.

Isn't that just improving the prompt to the non-learning model?

mike_hearn

Google just published a paper on a new neural architecture that does exactly that, called Titans.

KuriousCat

The only small problem is that the models are neither thinking nor understanding; I am not sure how this kind of wording is allowed with these models.

ninetyninenine

I understand the hype. I think most humans understand why a machine responding to a query like never before in the history of mankind is amazing.

What you're going through is hype overdose. You're numb to it. I can get it if someone disagrees, but it's a next-level lack of understanding of human behavior if you don't get the hype at all.

There exist living human beings, still children or with brain damage, whose intelligence is comparable to an LLM's, and we classify those humans as conscious but we don't classify LLMs that way.

I'm not trying to say LLMs are conscious, just that the creation of LLMs marks a significant turning point. We crossed a barrier 2 years ago somewhat equivalent to landing on the moon, and I am just dumbfounded that someone doesn't understand why there is hype around this.

bbarnett

The first plane ever flies, and people think "we can fly to the moon soon!".

Yet powered flight has nothing to do with space travel, no connection at all. Gliding in the air via low/high pressure doesn't mean you'll get near space, ever, with that tech. No matter how you try.

AI and AGI are like this.

golol

This is kind of true. I feel like the reasoning power of o1 is really only truly available on the kinds of math/coding tasks it was trained on so heavily.

raincole

Which sounds like... a very good thing?

samrolken

I have a lot of luck using 4o to build and iterate on context and then carry that into o1. I'll ask 4o to break down concepts, make outlines, identify missing information and think of more angles and options. Then at the end, I switch to o1, which can use all that context.
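
Something like this sketch of the handoff (assuming the OpenAI Python SDK; the prompts and task are illustrative):

    from openai import OpenAI

    client = OpenAI()
    history = []

    def ask_4o(question: str) -> str:
        # Iterate cheaply with 4o to build up outlines, options, missing info, etc.
        history.append({"role": "user", "content": question})
        reply = client.chat.completions.create(model="gpt-4o", messages=history)
        answer = reply.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    ask_4o("Break down the concepts involved in adding rate limiting to our API.")
    ask_4o("What information is missing? List more angles and options.")

    # At the end, hand the accumulated context to o1 for the heavier pass.
    final = client.chat.completions.create(
        model="o1",
        messages=history + [{"role": "user", "content": "Using everything above, produce the full implementation."}],
    )
    print(final.choices[0].message.content)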

ttul

FWIW: OpenAI provides advice on how to prompt o1 (https://platform.openai.com/docs/guides/reasoning/advice-on-...). Their first bit of advice is to, “Keep prompts simple and direct: The models excel at understanding and responding to brief, clear instructions without the need for extensive guidance.”

jmcdonald-ut

The article links out to OpenAI's advice on prompting, but it also claims:

    OpenAI does publish advice on prompting o1, 
    but we find it incomplete, and in a sense you can
    view this article as a “Missing Manual” to lived
    experience using o1 and o1 pro in practice.
To that end, the article does seem to contradict some of the advice OpenAI gives. E.g., the article recommends stuffing the model with as much context as possible... while OpenAI's docs note to include only the most relevant information to prevent the model from overcomplicating its response.

I haven't used o1 enough to have my own opinion.

irthomasthomas

Those are contradictory. OpenAI claims that you don't need a manual, since o1 performs best with simple prompts. The author claims it performs better with more complex prompts, but provides no evidence.

Terretta

The claims are not contradictory.

They are bimodal.

Bottom 20% of users can't prompt because they don't understand what they're looking for or couldn't describe it well if they did. This model handles them asking briefly, then breaks it down, seeks implications, and prompts itself. OpenAI's How to Prompt is for them.

Top 20% of users understand what they're looking for and how to frame and contextualize well. The article is for them.

The middle 60%, YMMV. (But in practice, they're probably closer to bottom 20 in not knowing how to get the most from LLMs, so the bottom 20 guide saves typing.)

orf

In case you missed it

    OpenAI does publish advice on prompting o1, 
    but we find it incomplete, and in a sense you can
    view this article as a “Missing Manual” to lived
    experience using o1 and o1 pro in practice.

The last line is important

yzydserd

I think there is a distinction between “instructions”, “guidance” and “knowledge/context”. I tend to provide o1 pro with a LOT of knowledge/context, a simple instruction, and no guidance. I think TFA is advocating the same.

chikere232

So in a sense, being an early adopter of the previous models makes you worse at this one?

wahnfrieden

The advice is wrong

3abiton

But the way they did their PR for o1 made it sound like it was the next step, while in reality it was a side step: a branch away from the current direction towards AGI.

isoprophlex

People agreeing and disagreeing about the central thesis of the article, which is fine because I enjoy the discussion...

no matter where you stand in the specific o1/o3 discussion, the concept of "question entropy" is very enlightening.

What is the question of minimum theoretical complexity that still solves your problem adequately? Or, for a specific model, are its users capable of supplying the minimum intellectual complexity that the model needs?

Would be interesting to quantify these two and see if our models are close to converging on certain task domains.

dutchbookmaker

Good stuff.

I am going to try to start measuring my prompts.

Thinking about it, I am not sure what the entropy is for the above vs "start measuring prompts".

isoprophlex

That's a tough one. I'm not sure how to get a quantifiable number on it except by painstakingly ablating a prompt until the answer you get becomes significantly degraded.

But then still, how do you measure how much you have ablated your prompt? How do you measure objectively how badly the answer has degraded?
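
One crude way to attach numbers to both (a sketch only, assuming the OpenAI Python SDK; prompt length and embedding cosine similarity are arbitrary proxies, not established practice):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def answer(prompt: str) -> str:
        reply = client.chat.completions.create(
            model="o1", messages=[{"role": "user", "content": prompt}]
        )
        return reply.choices[0].message.content

    def embed(text: str) -> np.ndarray:
        emb = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(emb.data[0].embedding)

    def degradation_curve(full_prompt: str, ablated_prompts: list[str]) -> None:
        # Compare each ablated prompt's answer against the full-prompt baseline.
        baseline = embed(answer(full_prompt))
        for ablated in ablated_prompts:
            v = embed(answer(ablated))
            sim = float(v @ baseline / (np.linalg.norm(v) * np.linalg.norm(baseline)))
            print(f"{len(ablated):6d} chars  similarity to baseline: {sim:.3f}")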

martythemaniak

One thing I'd like to experiment with is "prompt to service". I want to take an existing microservice of about 3-5kloc and see if I can write a prompt to get o1 to generate the entire service, proper structure, all files, all tests, compiles and passes etc. o1 certainly has the context window to do this at 200k input and 100k output - code is ~10 tokens per line of code, so you'd need like 100k input and 50k output tokens.

My approach would be:

- take an exemplar service, dump it in the context

- provide examples explaining specific things in the exemplar service

- write a detailed formal spec

- ask for the output in JSON to simplify writing the code - [{"filename":"./src/index.php", "contents":"<?php...."}]

The first try would inevitably fail, so I'd provide errors and feedback, and ask for new code (ie complete service, not diffs or explanations), plus have o1 update and rewrite the spec based on my feedback and errors.
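
Roughly, the loop would look something like this sketch (untested; it assumes the OpenAI Python SDK, a pre-dumped exemplar file, and a "make test" target, all invented for illustration):

    import json, subprocess
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()
    spec = Path("SPEC.md").read_text()                 # the detailed formal spec
    exemplar = Path("exemplar_dump.txt").read_text()   # the exemplar service, pre-dumped

    feedback = ""
    for attempt in range(5):
        prompt = (
            f"Exemplar service:\n{exemplar}\n\nSpec:\n{spec}\n\n"
            "Return the COMPLETE service as a JSON array of objects like "
            '[{"filename": "./src/index.php", "contents": "<?php..."}]. '
            "No diffs, no explanations, JSON only.\n"
            + (f"\nPrevious attempt failed with:\n{feedback}\nFix and regenerate everything." if feedback else "")
        )
        reply = client.chat.completions.create(
            model="o1", messages=[{"role": "user", "content": prompt}]
        )
        for f in json.loads(reply.choices[0].message.content):
            path = Path("generated") / f["filename"]
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(f["contents"])
        result = subprocess.run(["make", "test"], cwd="generated", capture_output=True, text=True)
        if result.returncode == 0:
            break
        feedback = (result.stdout + result.stderr)[-8000:]  # keep just the tail of the errors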

Curious if anyone's tried something like this.

swyx

coauthor/editor here!

we recorded a followup conversation after the surprise popularity of this article breaking down some more thoughts and behind the scenes: https://youtu.be/NkHcSpOOC60?si=3KvtpyMYpdIafK3U

cebert

Thanks for sharing this video, swyx. I learned a lot from listening to it. I hadn’t considered checking prompts for a project into source control. This video has also changed my approach to prompting in the future.

swyx

thanks for watching!

“prompts in source control” is kinda like “configs in source control” for me. recommended for small projects, but at scale eventually you wanna abstract it out into some kind of prompt manager software for others to use and even for yourself to track and manage over time. git isn't the right database for everything.
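
for the small-project end of that spectrum, it can be as simple as a prompts/ directory plus a tiny loader, something like this (illustrative sketch, layout and names are made up):

    from pathlib import Path

    PROMPTS_DIR = Path(__file__).parent / "prompts"

    def load_prompt(name: str, **params: str) -> str:
        # e.g. prompts/summarize_ticket.txt containing "Summarize {ticket} for an exec audience."
        template = (PROMPTS_DIR / f"{name}.txt").read_text()
        return template.format(**params)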

dutchbookmaker

Great stuff, thanks for this.

keizo

I made a tool for manually collecting context. I use it when copying and pasting multiple files is cumbersome: https://pypi.org/project/ggrab/

franze

I created thisismy.franzai.com for the same reason

patrickhogan1

The buggy nature of o1 in ChatGPT is what prevents me from using it the most.

Waiting is one thing, but waiting to return to a prompt that never completes is frustrating. It's the same frustration you get from a long-running ‘make/npm/brew/pip’ command that errors out right as it's about to finish.

One pattern that’s been effective is

1. Use Claude Developer Prompt Generator to create a prompt for what I want.

2. Run the prompt on o1 pro mode

swalsh

Work with chat bots like a junior dev, work with o1 like a senior dev.

inciampati

o1 appears to not be able to see its own reasoning traces. Or its own context is potentially being summarized to deal with the cost of giving access to all those chain-of-thought traces and the chat history. This breaks the computational expressivity of chain of thought, which supports universal (general) reasoning if you have reliable access to the things you've thought, and collapses to a threshold-circuit (TC0) bounded parallel pattern matcher when you don't.

PoignardAzur

My understanding is that o1's chain-of-thought tokens are in its own internal embedding, and anything human-readable the UI shows you is a translation of these CoT tokens into natural language.

inciampati

I found this documentation from OpenAI that supports my hunch: https://platform.openai.com/docs/guides/reasoning/advice-on-...

The reasoning tokens from each step are lost. And there is no indication that they are different tokens than regular tokens.

inciampati

Where is that documented? Fwiw, interactive use suggests they are not available to later invocations of the model. Any evidence this isn't the case?

timewizard

> To justify the $200/mo price tag, it just has to provide 1-2 Engineer hours a month

> Give a ton of context. Whatever you think I mean by a “ton” — 10x that.

One step forward. Two steps back.