SmolGPT: A minimal PyTorch implementation for training a small LLM from scratch
46 comments · January 29, 2025
sitkack
Neat, I love projects like these.
The next level down is to do it directly in numpy.
And then from there, write a minimal numpy work-a-like to support the model above.
You start with a working system using the most powerful abstractions. Then you iteratively remove abstractions, lowering your solution; when you get low enough but still riding on an external abstraction, you rewrite that, but ONLY to support the layers above you.
Following the above pattern, you can bootstrap yourself to full system understanding. This is not unlike the RL+distillation process humans use to learn complex topics.
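To give a flavour of that numpy level, here is a rough sketch of single-head causal attention in plain numpy (shapes and names are illustrative, not taken from SmolGPT):

    import numpy as np

    def softmax(x, axis=-1):
        # subtract the max for numerical stability
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def causal_attention(q, k, v):
        # q, k, v: (seq_len, d_head) activations for a single head
        d = q.shape[-1]
        scores = q @ k.T / np.sqrt(d)                # (seq, seq) similarity scores
        mask = np.triu(np.ones_like(scores), k=1)    # 1s above the diagonal = "the future"
        scores = np.where(mask == 1, -1e9, scores)   # forbid attending to future tokens
        return softmax(scores) @ v                   # weighted sum of the values

    x = np.random.randn(8, 16)
    print(causal_attention(x, x, x).shape)           # (8, 16)

Backprop through that by hand is where the real learning (and pain) starts.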
bee_rider
Numpy can use the chipmaker’s BLAS (Intel MKL or AMD’s Blis fork). Trying to replace it could be a good academic exercise but I think most people wisely leave that to the vendors.
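(If you're curious which BLAS your NumPy build is actually linked against, np.show_config() will tell you:)

    import numpy as np

    # prints the BLAS/LAPACK libraries this NumPy build was compiled against
    # (MKL, OpenBLAS, BLIS, ...)
    np.show_config()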
sitkack
It is a purely pedagogical device, like building a go kart.
lagrange77
> but still riding on an external abstraction, you rewrite that, but ONLY to support the layers above you.
I don't get it. Why do I stop before stripping all abstractions?
byteknight
Where do you get that? He is postulating that the external abstraction you are using has more features than you need. He is saying implement only the parts you use.
lagrange77
> Where do you get that?
From "when you get low enough but still riding on an external abstraction".
> He is saying implement only the parts you use.
Thanks.
tomrod
Likewise. And your comment reminded me of real programmers*
c0wb0yc0d3r
Can someone help me understand what I’m looking at here? This repository allows me to train a specific model on a specific data set, and finally test the result? Is that correct?
I am interested in how large and small language models are trained, but as someone who has little knowledge in this world I find it hard to cut through the noise to find useful information.
Really I'm looking for an open source project that helps a person gain this knowledge. Something like a Docker container that encapsulates all the dependencies. When training, it will use any available GPU, or tell me why my GPU can't be used and then fall back to CPU. Then it has a simple interface to test the training results. Finally, you can easily pull back the curtain to understand the process in better detail and maybe even adapt it to a different model to experiment.
Does something like that exist?
timnetworks
As opposed to inference (like generating text and images), training needs more compute, usually in reduced precision (fp16 or bf16), and a single CPU generally won't cut it.
The prepare/train/generate instructions in the github linked are pretty much it for the 'how' of training a model. You give it a task and it does it for 1 billion trillion epochs and saves the changes incrementally (or not).
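To make the reduced-precision point concrete, here is a rough PyTorch sketch of one mixed-precision training step (toy model and data, not SmolGPT's actual loop; it also shows the usual GPU-or-CPU fallback):

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Linear(64, 64).to(device)        # toy stand-in for a real transformer
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(100):                     # real training runs for far longer
        x = torch.randn(32, 64, device=device)
        y = torch.randn(32, 64, device=device)
        # forward pass runs in bf16 where supported; the weights stay in fp32
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()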
Training a LoRA for an image model may be more approachable, there's more blog entries etc on this, and the process is largely similar, except you're doing it for a single slice instead of the whole network.
[edit] I'm also learning so correct me if I'm off, hn!
SJC_Hacker
Do you have a good theoretical foundation in ML? You will also need some linear algebra.
If not, I would invest the time in a decent course; there are plenty online, and even offline if you are close enough to where one is offered. I took one from Andrew Ng on Coursera years ago, which used Matlab. There are probably better, more up-to-date options now, especially now that LLMs are very in vogue. The fundamentals, such as gradient descent, ANNs, and back-propagation, are still relevant and haven't changed much.
Trying to understand what code is doing without that foundation will be an exercise in futility.
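Gradient descent itself is tiny once you see it; a toy numpy sketch fitting y = 3x + 2 with hand-derived gradients:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 100)
    y = 3 * x + 2 + 0.1 * rng.normal(size=100)   # noisy line

    w, b, lr = 0.0, 0.0, 0.1
    for step in range(500):
        err = w * x + b - y
        grad_w = 2 * np.mean(err * x)            # d(mean squared error)/dw
        grad_b = 2 * np.mean(err)                # d(mean squared error)/db
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b)                                  # should land near 3 and 2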
c0wb0yc0d3r
I don’t have a solid ML foundation, and it’s been a decade or more since I’ve worked with linear algebra.
For now I think that might be too deep for what I’m after. I’m at the beach looking out at the vast ocean that is machine learning and LLMs.
barrenko
Your hunch is probably right; it takes a crapload of time, especially if you want to implement things and not just "get an intuition".
MacTea
https://course.fast.ai/ is the best. From their site: " A free course designed for people with some coding experience, who want to learn how to apply deep learning and machine learning to practical problems. "
c0wb0yc0d3r
This is at the top of my lunch time learning list. Not quite what I’ve been envisioning but it’s in the right direction. Thanks!
numba888
GitHub has had a bunch of these for years; the best known is from Andrej Karpathy:
https://github.com/karpathy/nanoGPT
Some others have MoE implemented.
syassami
Personal fave: https://github.com/karpathy/llama2.c
benreesman
nanoGPT is awesome (and I highly recommend his videos on it), but it’s closer to a direct reproduction of GPT-2, so it’s cool to have a really clean implementation of some newer ideas.
Nimitz14
nanoGPT contains some new ideas. https://github.com/karpathy/minGPT is more plain
OmAlve
Thanks a lot for posting this here! I can't believe it went viral; it makes all the effort feel worth it now! - Om Alve
Lerc
The example story is interesting.
I have made my own implementation from scratch with my own multi-channel tokeniser; each channel gets its own embedding table (sizes 32768, 256, 256, 64, and 4), which are summed along with the position encoding.
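Roughly, the summing looks like this (a simplified sketch, not my actual code; d_model and the shapes are illustrative):

    import torch
    import torch.nn as nn

    channel_vocab_sizes = [32768, 256, 256, 64, 4]
    d_model, max_len = 512, 1024

    channel_embs = nn.ModuleList(nn.Embedding(v, d_model) for v in channel_vocab_sizes)
    pos_emb = nn.Embedding(max_len, d_model)

    def embed(channel_tokens):
        # channel_tokens: one (batch, seq) LongTensor per channel
        seq_len = channel_tokens[0].shape[1]
        out = pos_emb(torch.arange(seq_len))      # (seq, d_model), broadcasts over batch
        for emb, toks in zip(channel_embs, channel_tokens):
            out = out + emb(toks)                 # sum the per-channel embeddings
        return out                                # (batch, seq, d_model)

    toks = [torch.randint(0, v, (2, 16)) for v in channel_vocab_sizes]
    print(embed(toks).shape)                      # torch.Size([2, 16, 512])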
Yet with all of those differences, my stories have Lily as a protagonist often enough that I thought I had a bug somewhere.
Might have to check TinyStories for its name distribution.
Most questionable output from mine so far:
"one day, a naughty man and a little boy went to the park place to find some new things."
febin
Here's a Google Colab notebook built from this. It takes ~2 hours on an A100 GPU if you have Colab Pro. It might work on a free account as well.
https://colab.research.google.com/drive/1dklqzK8TDPfbPbyHrk3...
Diffused_asi
How many parameters does this model have?
brap
It’s interesting that technology so transformative is only a few hundred lines of code (excluding underlying frameworks and such).
How big would you guess state of the art models are, in terms of lines of code?
miki123211
Llama2 inference can be implemented in 900-ish lines of dependency-free C89, with no code golfing[1]. More modern architectures (at least the dense, non-MoE models) aren't that much more complicated.
That code is CPU only, uses float32 everywhere and doesn't do any optimizations, so it's not realistically usable for models beyond 100m params, but that's how much it takes to run the core algorithm.
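Stripped of the transformer math, the core generation loop really is small; a Python sketch with a dummy forward pass standing in for the model:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 32000

    def forward(tokens):
        # stand-in for the real transformer: returns logits for the next token
        return rng.normal(size=VOCAB)

    def generate(prompt_tokens, n_new, temperature=1.0):
        tokens = list(prompt_tokens)
        for _ in range(n_new):
            logits = forward(tokens) / temperature
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                             # softmax over the vocabulary
            tokens.append(int(rng.choice(VOCAB, p=probs)))   # sample the next token, append, repeat
        return tokens

    print(generate([1, 2, 3], 10))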
hatthew
A minimal hardcoded definition of the structure: probably a few hundred lines.
The actual definition, including reusable components, optional features, and flexibility for experimentation: probably a few thousand.
The code needed to train the model, including all the data pipelines and management, training framework, optimization tricks, etc.: tens of thousands.
The whole codebase, including experiments, training/inference monitoring, modules that didn't make it into the final architecture, unit tests, and all custom code written to support everything mentioned so far: hundreds of thousands.
ks2048
So, this has nothing to do with "SmolLM" - a set of models (with data, training recipes, etc) released by HuggingFace? https://huggingface.co/blog/smollm
mkagenius
Looks like a rip-off of https://github.com/PraveenRaja42/Tiny-Stories-GPT,
without any credit to the above or to the TinyStories paper.
imdsm
Is there a corresponding article for this? I'd love to read through it!
This is cool, and timely (I wanted a neat repo like that).
I have also been working for the last 2 weeks on a GPT implementation in C. Eventually it turned out to be really slow (without CUDA), but it taught me how much memory management and data management there is when implementing these systems. You are running a loop billions of times, so you need to preallocate the computational graph and related buffers. If anyone wants to check it out, it's ~1500 LOC in a single file:
https://github.com/attentionmech/gpt.c/blob/main/gpt.c