
Train Your Own O1 Preview Model Within $450

danielhanchen

If anyone's interested, I made Colab notebooks with free GPUs for both GRPO (the algo DeepSeek used) to train a reasoning model from scratch, and also general finetuning, which the Berkeley team employed!

GRPO notebook for Llama 3.1 8B: https://colab.research.google.com/github/unslothai/notebooks...

General finetuning notebook: https://colab.research.google.com/github/unslothai/notebooks...

The Berkeley team's 17K dataset: https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k Hugging Face also released a 220K dataset: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
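
For anyone who wants to poke at the data before opening the notebooks, here's a minimal sketch (not the notebook code) that just pulls both datasets with the `datasets` library; split and config names are assumptions you may need to adjust.

    # Minimal sketch, not the notebook code: pull the two SFT datasets mentioned above.
    # Assumes `pip install datasets`; split/config names may need adjusting.
    from datasets import load_dataset

    sky_t1 = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")
    openr1 = load_dataset("open-r1/OpenR1-Math-220k", split="train")

    print(sky_t1)     # row count and column names
    print(sky_t1[0])  # inspect one example before mapping it into a chat template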

threecheese

How long does this take on a free tier T4? This is really neat, I’d assumed this type of “playing with the guts” work was more difficult to access as a normie programmer. Looks like something I’d like to try!

mkagenius

Weird that they had to resort to clickbait by using "O1 preview" in the name.

I expected some sort of way to actually get o1 preview retrained (and downloadable).

Also, calling it O1 preview based on just 7 benchmarks isn't right. What if someone comes up with use cases where O1 preview does better than this?

Apart from that, it's good that things are becoming cheaper.

jug

It’s dishonest because they not only point towards a specific language model, but towards the beta version of a specific model. WTH?

anigbrowl

You should always assume headlines are hyperbolic, and 'verb your own noun for cheap' headlines are always offering a way to make your own version of $expensive_thing for hobby prices, not to provide a copy of $expensive_thing.

If you see a headline saying 'make your own James Webb Space Telescope in a weekend', they're offering a project that leverages some tech concept from the JWST, like mirror arrays or a particular sort of sensor. They're not promising that you'll be able to build a space-capable telescope the size of a semi truck.

echelon

It's not dishonest, it's simple human behavior.

The vocabulary used to describe the culturally prevailing leader gets used to explain similar concepts and to create analogies. That's an easier way to communicate with the masses than crafting super-tailored messages aimed only at domain experts.

It's why we keep doing this, and it's also why trademarks become generics.

"Google it", "Uber for X", "band aid", "the band sounds like Y", "the actor looks like Z", etc. etc.

This is a core part of how human language works and how we as a species communicate with one another.

michaelt

"Build your own Lamborghini Huracan at home for $450"

"Wow! Quite a feat to deliver an iconic design, a 631 horsepower engine, and performance of 0-150 mph in 15.4 seconds on such a small budget!"

"Actually what we mean is, like the Lamborghini Huracan, our vehicle has two seats."

yieldcrv

ChatGPT is the market leader; nobody except enthusiasts is distinguishing between their models, or between any models. And the enthusiasts know the difference.

Verdict: dishonest

codelion

Yeah, I agree. The "O1 preview" naming feels a bit misleading. It sets an expectation of broader coverage than just those specific benchmarks. It's cool to see cost reductions, but the marketing could be more transparent about the scope.

fl4tul4

I do love competition.

In the last few weeks we are seeing a torrent of advances, just because someone opened their architectures.

Imagine where we could go if the training datasets were also publicly available and unbounded by any copyright laws. (I'm not talking about doing anything illegal).

I can only dream, I guess.

Lucasoato

A torrent of advances is the right way to word it, especially after it has been discovered what Meta trained their models on :)

paper2d

Those training datasets can never be free, as almost all of them are copyrighted.

landryraccoon

Japan has said AI can train on copyrighted materials.

https://www.privacyworld.blog/2024/03/japans-new-draft-guide...

I imagine if copyright is a big issue for AI, Japanese startups will have an advantage.

0xdeadbeefbabe

Does China need to say anything or can you guess their policy?

chii

Perhaps copyright needs to be updated. In any case, my personal belief is that training on publicly released data, as well as on purchased media, is fair use.

philipwhiuk

If anything it needs to be updated to actually prevent the rampant profit extraction from human creation in order to protect actual creators.

azinman2

Why should it be? I’d personally be pissed if my book, which came from my own hard work and is sold per person, all of a sudden got subsumed by a general AI. Even worse if it were commercialized and I got nothing for it.

tonyedgecombe

The UK government is doing that at the behest of the AI companies, which tends to indicate they have been misbehaving up to now.

taosx

Share the non-copyrighted ones and it's still a win if you make it possible for people to contribute, through PRs, testing, and discussion.

lionkor

almost all free things are copyrighted

Kye

It seems like the torrent was already happening and DeepSeek's part is just one example of that. They did help bring attention to those advancements, and that's led to lots more people contributing and finding more niche applications.

noduerme

Isn't the general attitude these days to just break laws and bribe officials once you own the hottest startup? /s

edit: re. the /s I was living offshore and running the most popular bitcoin casino at the time, spending a vast amount of money and energy to block any player who might be American. As a result I didn't make that much money. And I tried to calculate how much I would need to make if I wanted to break the law and hide out forever. I figured I could make $10-15M a year but that wouldn't be enough to hide. I fucked up, I guess. Because the richest man in the world made most of his first round of money facilitating gambling transactions, and he's now got his snout in every federal agency. I should have had the balls, I guess, to ask forgiveness rather than permission.

coliveira

It was always like this. YouTube started out hosting mostly copyrighted content; then Google settled with the copyright owners. Google, by the way, has perfected the "art" of training its algos on content without approval from copyright owners.

rdli

The blog post was a little unclear, so my summary was:

- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)

- The training data was then used to fine-tune (FT) Qwen2.5-32B-Instruct (a non-reasoning model)

- Result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks

There are a few dismissive comments here but I actually think this is pretty interesting as it shows how you can FT a foundation model to do better at reasoning.
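
As a rough sketch of that recipe (illustrative only, not the Sky-T1 training script; the dataset schema, model handling, and hyperparameters below are assumptions), the supervised fine-tuning step looks something like this with `trl`:

    # Rough sketch of the distill-then-SFT recipe summarized above; NOT the Sky-T1 script.
    # Assumes trl + datasets are installed; field names and hyperparameters are placeholders,
    # and a 32B model needs multi-GPU plus DeepSpeed/FSDP in practice.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    traces = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")

    def to_text(example):
        # Flatten whatever schema the dataset uses into one training string;
        # a real run would apply the student model's chat template instead.
        return {"text": str(example)}

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-32B-Instruct",  # the non-reasoning student
        train_dataset=traces.map(to_text, remove_columns=traces.column_names),
        args=SFTConfig(output_dir="sky-t1-sft", num_train_epochs=3, bf16=True),
    )
    trainer.train()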

azinman2

I wish they had compared against the R1 distills of Qwen2.5.

scosman

Inference-time compute is still very underutilized in actual AI deployments. Lots of folks are working on foundation models, which require reasoning about broad problem domains. Not enough people are using the same techniques for task-specific performance improvements. You can easily distill the reasoning from larger models like R1 for your task. Often better, you can mix in custom thinking instructions for specific sub-problems, so a fine-tuned model learns a mix of task-specific reasoning and custom logic. It’s not hard and easily beats prompt iteration. When you find bugs, you can fix them.

I made a GitHub project for distilling thinking models (and custom CoT inference-time fine-tuning): https://docs.getkiln.ai/docs/guide-train-a-reasoning-model
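
If you want to roll the trace-collection part yourself rather than use a framework, a generic sketch looks like this (this is not Kiln's API; the endpoint, model name, and the `reasoning_content` field are provider-specific assumptions):

    # Generic sketch of harvesting reasoning traces from a larger model for
    # task-specific distillation. NOT Kiln's API; the endpoint, model name, and
    # `reasoning_content` attribute are assumptions that depend on your provider.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # hypothetical setup

    task_prompts = [
        "Classify this support ticket into {billing, bug, feature}: ...",
        "Extract the invoice total from this email: ...",
    ]

    with open("distilled_traces.jsonl", "w") as f:
        for prompt in task_prompts:
            resp = client.chat.completions.create(
                model="deepseek-reasoner",  # assumed reasoning-capable teacher
                messages=[{"role": "user", "content": prompt}],
            )
            msg = resp.choices[0].message
            f.write(json.dumps({
                "prompt": prompt,
                "reasoning": getattr(msg, "reasoning_content", None),  # provider-specific field
                "answer": msg.content,
            }) + "\n")

The resulting JSONL of prompt / reasoning / answer triples is then what you'd feed into a fine-tuning run, optionally mixed with custom thinking instructions for specific sub-problems as described above.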

anon373839

Thanks for linking to this. That’s a good resource!

Do you have any pointers on assembling fine-tuning data not for isolated tasks, but for a flexible range of queries in a particular problem domain? Similar to general purpose instruction-tuning, but much more focused.

For example, suppose you’re building an app that helps doctors search through research literature to aid in diagnosis, check hypotheses, etc. Of course you would want some domain experts and real users available to see what kinds of queries they would create. But getting from that point to a well-balanced dataset that adequately represents the distribution of possible queries, instructions, writing/cognitive styles, formatting, dialog flows, etc. your app will encounter just seems hard to know how to approach. There are infinitely many dimensions you could accidentally overfit on.

pizza

General advice? Collect data, train a model, note the mistakes in the model and the mistakes in the data, and think critically about what it is you're actually ending up teaching. Repeat many, many, many times. For some tasks, don't be surprised if it ends up taking months, or a year, or several. It took me 6 months of building a dataset by hand, by myself, to produce ~1600 'gold standard' text examples (bolstered by ~100K synthetic examples): texts plus 20 dimensions rated 1-4. But I managed to beat the SOTA models from all the frontier labs on this task by doing so. It also makes sense to consider all of the various shortcomings of the competing models.

It's quite difficult to foresee all the decisions you'll end up making, driven by insights from future versions of the whole loop. But you will need to make some.

I will say one more concrete thing though: the more metadata you collect, generally, the better, but this can make it more expensive.

Also, if you ever need to update your schema... well, this is actually one reason why text data for LLMs is nice: your schema is essentially fluid in the first place, so you could, e.g., stick metadata in the text itself if at some future point you start collecting it (toy sketch below).
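
A toy illustration of that point; the [key=value] tag format here is an arbitrary choice, not a standard:

    # Toy illustration: because LLM training data is just text, metadata you start
    # collecting later can be folded into the example itself instead of migrating a schema.
    example = {
        "text": "Review: The pacing drags in the middle but the ending lands.",
        "ratings": {"coherence": 3, "specificity": 4},  # dimensions added later
        "source": "gold",                               # added later too
    }

    def to_training_text(ex):
        meta = " ".join(f"[{k}={v}]" for k, v in ex["ratings"].items())
        return f"<source:{ex['source']}> {meta}\n{ex['text']}"

    print(to_training_text(example))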

I guess, also, it's a good thing to constantly add new benchmarks, if possible. Treat your model's capabilities as knowable, but never treat your model's capabilities as actually known.

anon373839

Thanks for the input. It sounds like the task is about as daunting as it seems, then, but doable. Are there any resources (such as papers) you’ve found especially helpful?

magicalhippo

So this is a fine-tune and not from scratch, which makes the proposition much more reasonable.

That said, for someone who's not in the game but been curious as to the details of fine-tuning, it's great to get both the dataset and the code.

moconnor

They trained on QwQ traces and in their evaluation they are… mostly slightly worse than QwQ.

Hardly a huge win.

genpfault

> The model training finishes in 19 hours on 8 H100 with DeepSpeed Zero-3 offload (~ $450 according to Lambda Cloud pricing).
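
For reference, the arithmetic roughly checks out: 19 hours x 8 GPUs is about 152 H100-hours, and at roughly $3 per H100-hour of on-demand pricing that comes to approximately $450.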

JoshTko

Has anyone tested whether a consensus of the top 4-5 mini models together would outperform the best frontier model?

_joel

It's not from scratch, though, right? Am I missing something here as to why it's at the top of the posts?

twobitshifter

There’s no real reason to start from true scratch anymore. You don’t harvest wheat, mill flour, milk a cow, and churn butter for your cake.