Writing an LLM from scratch, part 8 – trainable self-attention
32 comments
· March 5, 2025
nate
What always surprises me is how "sleep" is the ingredient I need here. Yes repetition helps. But space out the repetition with sleep and "woah! i get it!" is often uttered. Obviously there's so much written about sleep as some kind of RAM to SSD offloading step, but it never ceases to amaze me when it happens so well.
gpjt
OP here: that is an excellent point! Of the eight read-throughs, four were on Friday night, then I had dreams involving some kind of TRON-like vector spaces, and at brunch on Saturday things seemed to start gelling. The four extra read-throughs were to crystallise that intuition to a level that I felt I could start writing it up. I'm 100% sure the sleep was what built some kind of intuition at a pre-linguistic stage that I could build on.
diggan
Yeah, or just leaving whatever it is you're trying to learn for your subconscious to process while you do other stuff.
I've found myself having a much easier time learning difficult new concepts by (somewhat counterintuitively) stopping trying to learn them for a week or two, then coming back; suddenly they're a lot easier to grasp, even if I haven't consciously thought about them since I left off.
snozolli
I've had concepts click in my head literally years after studying them. I have no idea why. I assume I was exposed to tangentially related information at some point, then an old thought bubbles up and it suddenly makes sense.
jay-barronville
> What always surprises me is how "sleep" is the ingredient I need here. Yes repetition helps. But space out the repetition with sleep and "woah! i get it!" is often uttered.
I really relate to this. In fact, a lot of times, when working on a difficult problem, I’ll intentionally go take a nap when I feel stuck, and more often than not, I’ll wake up feeling like the problem is much easier than I thought before my nap. I can’t explain why it works so well—it just does!
SpaceManNabs
This becomes very obvious to one's self if you learn an instrument or another mechanical/procedural thing. For example, cramming 4 hours into guitar is less useful than 1 hour over 4 days cause of sleep. Sometimes I will struggle with something for quite a bit towards the end of a practice session, and then I will get it right on the first try in my next session on the following day. Hell, there are so many times when I go to my guitar teacher and I tell him "yeah so i practiced this part of the assignment over the past week and I could never figure it out" and then I try to show him the issue and I play it mostly correctly to my surprise.
With something like learning a language, problem-solving skills, a code base, or new maths, it is less apparent because the mind can delude itself, and it is much harder to delude yourself about actual physical movements. You either do it, or you don't...
miki123211
For me, it's not just about repetition, but also repetition of different explanations of the same concept.
Sometimes one gets stuck, and a different explanation, previously completely incomprehensible, gets one "unstuck" enough to make all the other explanations start making more sense.
Sometimes you realize midway through that there's some other concept you don't fully understand (e.g. you can't fully understand why an attention layer looks like it does without understanding matmul or tensor ops). Getting some explanations of these other concepts can also get one unstuck.
I imagine this as every explanation containing little "nuggets" of knowledge at various "skill levels", but you can only grasp at those that are slightly beyond yours at the moment. Too high and nothing makes sense, too low and it's obvious.
As you re-read the explanation, you understand more, and so more "nuggets" are available to you. Sometimes the explanation is too hard, there's nothing to grasp onto and you're stuck, this is where other explanations, or explanations of other concepts, can push you over the edge.
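The matmul prerequisite mentioned above is easy to make concrete: a self-attention layer is, at its core, a few matrix multiplies and a softmax. Here is a minimal sketch in NumPy (dimensions invented purely for illustration; not code from the book):

```python
import numpy as np

# Toy self-attention: 3 tokens with 4-dim embeddings (sizes chosen arbitrarily)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))      # token embeddings
W_q = rng.normal(size=(4, 4))    # trainable query/key/value projections
W_k = rng.normal(size=(4, 4))
W_v = rng.normal(size=(4, 4))

q, k, v = x @ W_q, x @ W_k, x @ W_v           # three matmuls
scores = q @ k.T / np.sqrt(k.shape[-1])       # (3, 3): every token scored against every token
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability before softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
context = weights @ v                           # weighted mix of the value vectors
```

Once the matmuls stop being mysterious, the shape of the attention layer starts to look inevitable rather than arbitrary.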
SpaceManNabs
> also repetition of different explanations of the same concept.
multiple encodings are stronger since they end up fusing into not just more robust and general representations, but also more well connected.
i forget which neuroscience textbook I read all this in, but Dr. Barbara Oakley talks about this in her course too, so anyone can look through there for more details and sources.
> desirable difficulty
Well regarded in Anki circles, music practice, and most problem-solving books (Polya's "How To Solve It"). I think some neuroscience texts also try to formalize it, but I don't know enough there to point to a more concrete abstraction of the phenomenon.
> As you re-read the explanation
I suggest something even better: try to do the explanation yourself from free recall, and when you fail, re-read and note exactly where and why you failed. This is a mid-level efficiency technique (I forget the source) compared to free recall, active testing, and spaced repetition, but definitely better than plain re-reading.
neuralkoi
This can even happen long after you've done an initial pass through the material. A few years after college I reviewed some electrical engineering concepts that had simply never made sense to me, but this time around everything suddenly clicked into place.
deskr
Oh I know that feeling and it's amazing.
It's so strange: you read the material and sometimes you struggle to understand it, or you don't even see how it's relevant and not just a waste of time.
Then when you review it much later, it's like opening a new room of knowledge, it just makes sense and it makes me so happy in a strange way.
I've had this happen often with music theory. Music theory is pretty dull on its own, and small gaps in knowledge exclude you from understanding a lot more.
penguin_booze
Relatable. Presumably what makes it click is the original knowledge, as presented in the book, plus the experience one happens to accumulate over time, even if unrelated to the original material. In other words, the author expected the reader to get it from the pure knowledge alone, which has since been proven inadequate. So, in a sense, it's the author's fault.
PS: electrical engineering has one of the most reader-hostile presentations, so all authors should be tried for crimes against humanity /s
serviceberry
I think some of this is on the authors of reference texts, though. I often catch myself writing what I think is a very accessible intro to a particular topic, only to re-read it a month later and be mortified by all the logical leaps, niche terminology that's used before it's defined, and so forth.
It's really hard to spell out new concepts in a way that doesn't require the reader to make several passes to resolve all the nuance... but it's not necessarily a matter of how we learn, just how we write.
ForOldHack
Yes. This exists. It is the expression of a facet of genius called Explanation, and there is such a tiny line between this and mansplaining. Think of Einstein, trying to explain the math before explaining relativity.
(Read the glossary first).
maccard
I did a mix of CS and EE in university, and I was really remarkably poor at one particular part: signal processing. I scraped by the first module, just about. I took a class which I can only describe as "applied DSP" -- it was basically just programming the DSP theory we had done the previous semester -- and I was top of the class. When we did DSP 2 the following semester with the same lecturer, I was pretty much top of the class again, without going back to basics on the theory stuff. Either I was unfairly graded in the first instance (probably some amount of this is true), or I learned enough practical application during the second class to fill in the gaps, where it all fell into place.
josephg
When I used to teach CS, I’d tell my students they would need to learn pointers 3 times. First, they’d hear me explain it in a lecture and think they understood it. Then they’d try and implement a linked list and realise they didn’t get it at all, and figure it all out from scratch. Then at some point it would all click and seem really easy.
Lots of stuff is like this. You learn what you do. Sitting in a lecture theatre hearing about pointers, or photography, or DSP, isn’t really learning those things. Some people are definitely better at using their imagination to actually learn things during a lecture. But it’s a long shot at the best of times I think. For everyone.
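The linked-list exercise described above can be sketched in Python, where object references play the role of pointers (a simplification of the usual C exercise; the names here are invented):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None  # the "pointer" to the next node

def push_front(head, value):
    """Prepend a value; the new node becomes the head."""
    node = Node(value)
    node.next = head   # new node points at the old head
    return node

def to_list(head):
    """Walk the chain of references until it ends."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = None
for v in [3, 2, 1]:
    head = push_front(head, v)
print(to_list(head))  # [1, 2, 3]
```

The "click" usually happens when you realise that `node.next` is not the next node itself, just an address-like reference to it -- which is exactly the mental model a C pointer needs.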
melvinroest
Pointers finally clicked for me when I started doing things with a normal int, treating it as a memory address. I forget the details, but I remember making a Stack Overflow post about it. The people on Stack Overflow didn't seem too happy with my program from a software engineering standpoint, not realizing I was just trying to really test some ideas about what pointers actually are.
Found the SO post: https://stackoverflow.com/questions/35379191/cannot-cast-var...
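The int-as-address experiment can be reproduced in Python with `ctypes`; this is a hedged sketch of the same idea, not the original C program from the SO post:

```python
import ctypes

value = ctypes.c_int(42)
addr = ctypes.addressof(value)   # an ordinary Python int holding a memory address

# Treat that plain int as a pointer again and dereference it
ptr = ctypes.cast(ctypes.c_void_p(addr), ctypes.POINTER(ctypes.c_int))
print(ptr.contents.value)        # 42: reading through the "pointer"

ptr.contents.value = 99          # writing through it...
print(value.value)               # 99: ...mutates the original object
```

Seeing that an address is just a number, and that dereferencing is just "interpret this number as a location", is the whole trick.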
igleria
I've found out that sometimes there is an "echo" of sorts when hearing or reading something. The first instinct one has is to say "what?" followed by "ah, you said flan, not plan". But if I stay silent, I can get access to that "echo" and find out "on my own" what the person just said.
Sorry if this explanation is convoluted. There is a lot going on in the brain!
Robin_Message
Sounds like the Phonological Loop: https://en.wikipedia.org/wiki/Baddeley%27s_model_of_working_...
igleria
Of course someone already had a name for this, I appreciate the link!
penguin_booze
One sees from scratch, and then one also sees
from fancy_module import magic_functions
I'm semi-serious here, of course. To me, for something to be called 'from scratch', the requisite knowledge should be built up from the ground. To wit, I'd want to write the tokenizer myself, but I don't want to derive the laws of quantum physics that make the computation happen.
gpjt
I do see your point!
But at the end of the day, it depends on where you want to spend your time. "Build an LLM from scratch" is over 300 pages -- and they are very dense pages. My blog post covers fewer than 10 of them (though TBF they are the hardest pages). Adding on tokenizers in depth from scratch would add on 100 or so more. Adding on efficient-enough matrix multiplication to do anything would add on a few hundred more, and doing it in CUDA would probably be a couple of thousand. Now add on automated differentiation to work out the gradients for training -- a few thousand more? Optimizers for the training -- even more than that, perhaps.
You have to draw the line somewhere, as otherwise (as you suggest) the "from scratch" book has to start "go out and get some really clean sand" so that you can start fabbing your own chips. I think that tiktoken and PyTorch are a solid choice for that line, as it means that the book is manageable in size and gives you enough of an overview of the underlying stuff to be able to work out what you want to dig into next.
slicktux
I totally agree; I was contemplating giving the book a try…I still might but knowing myself I’m going to go down the rabbit hole of how those mathematical functions work before I can continue…
gpjt
I have 100% set myself a "no side quests" rule while going through the book so that I don't do that. I've had... patchy success with that, but I think I'm doing pretty well apart from the week I spent getting LaTeX rendering working on my blog so that I could do the pretty maths in that post.
What I'm doing is building up a list of things to dig into in depth once I've finished the book. Kind of like a treat to encourage me to push forward when I'm working through a bit that's tough to understand.
shpongled
I feel very similarly. When I want to understand from scratch, that means no pytorch, no numpy, etc.
skipants
I went through parts 1 and 8 and did not see an import statement in any of the posts
gpjt
OP here: my posts are kind of reading notes for the book, so I don't normally copy the code from there -- so you wouldn't have seen the packages used. So far there's been tiktoken for tokenization (Raschka shows how to write a simple tokenizer and explains the workings of the byte-pair tokenization that he recommends, though) and PyTorch for CUDA-acceleratable matrix maths, automated differentiation for gradient descent, and so on.
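For a flavour of what byte-pair tokenization does under the hood (a toy sketch of one merge step, not tiktoken's actual implementation):

```python
from collections import Counter

def most_common_pair(tokens):
    """Find the most frequent adjacent pair in a token sequence."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
pair = most_common_pair(tokens)   # ('l','o') and ('o','w') each occur three times
tokens = merge(tokens, pair, "".join(pair))
print(tokens)
```

Real BPE repeats this merge step thousands of times over a large corpus, recording each merged pair as a new vocabulary entry; tiktoken ships the resulting merge tables precomputed.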
jatins
what's the fancy_module you are talking about here?
OnACoffeeBreak
PyTorch
The book's code repo: https://github.com/rasbt/LLMs-from-scratch
ForOldHack
Part 8? Wait... Is this a story that wrote itself?
1) I am kidding. 2) At what point does it become self replicating? 3) skynet. 4) kidding - not kidding.
andsoitis
> For me, it took eight re-reads of Raschka's (eminently clear and readable) explanation to get to a level where I felt I understood it.
It’s interesting to observe in oneself how repetition can result in internalizing new concepts. It is less about rote memorization and more about becoming aware of nuance and letting our minds “see” things from different angles, integrating them with existing world models through augmentation, replacement, or adjustment. The same goes for practicing activities that require some form of motor ability.
Some concepts are internalized less explicitly, like when we “learn” through role-modeling behaviors or feedback loops through interaction with people, objects, and ideas (like how to fit into a society).