FFmpeg School of Assembly Language
132 comments
February 22, 2025
zbobet2012
So on point. We do _a lot_ of hand-written SIMD on the other side (encoders) as well, for similar reasons. In addition, on the encoder side it's often necessary to "structure" the problem so you can perform things like early elimination of loops, and especially loads. Compilers simply cannot generate autovectorized code that does those kinds of things.
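Roughly the kind of restructuring meant here, as an illustrative sketch (my example, not code from any real encoder): a motion-estimation SAD loop that bails out once the block can no longer beat the best candidate so far. No autovectorizer will invent that early exit for you.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical motion-estimation helper: accumulate the sum of absolute
 * differences row by row, and stop as soon as we already exceed the best
 * score seen so far, skipping the remaining rows (and their loads). */
int sad16x16_early_exit(const uint8_t *cur, const uint8_t *ref,
                        int stride, int best_so_far)
{
    int sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[x] - ref[x]);
        if (sad >= best_so_far)  /* early elimination of loops and loads */
            return sad;
        cur += stride;
        ref += stride;
    }
    return sad;
}
```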
dundarious
What does Zig offer in the way of builtin SIMD support, beyond overloads for trivial arithmetic operations? 90% of the utility of SIMD is outside of those types of simple operations. I like Zig, but my understanding is you have to reach for CPU specific builtins for the vast majority of cases, just like in C/C++.
GCC and Clang support the vector_size attribute and overloaded arithmetic operators on those "vectorized" types, and a LOT more besides -- in fact, that's how intrinsics like _mm256_mul_ps are implemented: `#define _mm256_mul_ps(a,b) (__m256)((v8sf)(a) * (v8sf)(b))`. The utility of all of that is much, much greater than what's available in Zig.
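For anyone who hasn't seen the extension, a minimal sketch of what those "vectorized" types look like (standard GCC/Clang behavior; the function name is mine):

```c
/* GCC/Clang vector extensions: the compiler maps the arithmetic onto
 * whatever SIMD unit the target has (e.g. vmulps on AVX targets). */
typedef float v8sf __attribute__((vector_size(32)));  /* 8 floats = 256 bits */

v8sf mul8(v8sf a, v8sf b)
{
    return a * b;  /* element-wise multiply, no intrinsics needed */
}
```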
kierank
I am the author of these lessons.
Ask me anything.
cnt-dracula
Hi, thanks for your work!
I have a question, as someone who can just about read assembly but still doesn't intuitively understand how to write it, or how to decompose ideas to utilise assembly: do you have any suggestions to learn / improve this?
As in, at what point would someone realise this thing can be sped up by using assembly? If you found a function that would be really performant in assembly, how would you go about writing it? Would you start from the compiler's assembly output, or would you start from scratch? Does it even matter?
qingcharles
You're looking for the tiniest blocks of code that are run an exceptional number of times.
For instance, I used to work on graphics renderers. You'd find the bit that was called the most (writing lines of pixels to the screen) and try to jiggle the order of the instructions to decrease the number of cycles used to move X bits from system RAM to graphics RAM.
When I was doing it, branching (usually checking an exit condition on a loop) was the biggest performance killer. The CPU couldn't queue up instructions past the check because it didn't know whether it was going to go true or false until it got there.
qingcharles
As someone who wrote x86 optimization code professionally in the 90s: do we still need to do this manually in 2025?
Can we not just write tests and have some LLM try 10,000 different algorithms and profile the results?
Or is an LLM unlikely to find the optimal solution even with 10,000 random seeds?
Just asking. Optimizing x86 by hand isn't the easiest, because to think it through you have to fit all the registers in your mind and work through the combinations. You also need to know how long each instruction combination will take, and some of these instructions have weird edge cases that run vastly slower or faster, which is hard for a human to take into account.
qingcharles
LOL downvoted immediately for having "LLM" in my question. If it's not a legit question, respond, don't just downvote. Teach me.
christiangenco
Hacker News is such a cool website.
Hi thank you for writing this!
wruza
I don’t care about the split, just wanted to say that this guide is so good. I wish I had this back when I was interested in low-low-level.
Daniel_Van_Zant
I'm curious from anyone who has done it. Is there any "pleasure" to be had in learning or implementing assembly (like there is for LISP or RISC-V), or is it something you learn and implement because you want to do something else (like learning COBOL if you need to work with certain kinds of systems)? It has always piqued my interest but I don't have a good reason in my day-to-day job to get into it. Wondering if it is worth committing some time to for the fun of it.
btown
One “fun” thing about it is that it’s higher level than you think, because the actual chip may do things with branch prediction and pipelining that you can only barely control.
I remember a university course where we competed on who could have the most performant assembly program for a specific task; everyone tried various variants of loop unrolling to eke out the best performance and guide the processor away from bad branch predictions. I may or may not have hit Ballmer Peak the night before the due date and tried a setup that most others missed, and won the competition by a hair!
There’s also the incredible joy of seeing https://github.com/chrislgarry/Apollo-11 and quipping “this is a Unix system; I know this!” Knowing how to read the language of how we made it to the moon will never fade in wonder.
Short answer: yes!
crq-yml
Learning at least one assembly language is very rewarding because it puts you in touch with the most primitive forms of practical programming: while there are theoretical models like Turing machines or lambda calculus that are even more simplistic, the architectures that programmers actually work with have some forgiving qualities.
It isn't a thing to be scared of - assembly is verbose, not complex. Everything you do in it needs load and store, load and store, millions of times. When you add some macros and build-time checks, or put it in the context of a Forth system (which wraps an interpreter around "run chunks of assembly", enabling interactive development and scripting), it's not that far off from C, and it removes the magic of the compiler.
I'm an advocate for going retro with it as well: an 8-bit machine in an emulator keeps the working model small and well documented, and adds constraints that make it worthwhile to do more tasks in assembly, which is rarely the case once you're on a 32-bit or later architecture with plenty of resources to throw around. People who develop in assembly for work will have more specific preferences, but beginners mostly need an environment where the documentation and examples are good. Rosetta Code has some good assembly language examples that are worth using as a way to learn.
msaltz
I did the first 27 chapters of this tutorial just because I was interested in learning more and it was thoroughly enjoyable: https://mariokartwii.com/armv8/
I actually quite like coding in assembly now (though I haven’t done much more than the tutorial, just made an array library that I could call from C). I think it’s so fun because at that level there’s very little magic left - you’re really saying exactly what should happen. What you see is mostly what you get. It also helped me understand linking a lot better and other things that I understood at a high level but still felt fuzzy on some details.
Am now interested to check out this ffmpeg tutorial bc it’s x86 and not ARM :)
brown
Learning assembly was profound for me, not because I've used it (I haven't in 30 years of coding), but because it completed the picture - from transistors to logic gates to CPU architecture to high-level programming. That moment when you understand how it all fits together is worth the effort, even if you never write assembly professionally.
daeken
I have spent the last ~25 years deep in assembly because it's fun. It's occasionally useful, but there's so much pleasure in getting every last byte where it belongs, or working through binaries that no one has inspected in decades, or building an emulator that was previously impossible. It's one of the few areas where I still feel The Magic, in the way I did when I first started out.
sigbottle
If you're working with C++ (and I'd imagine C), knowing how to debug the assembly comes up. And if you've written assembly it helps to be aware of basic patterns such as loops, variables, etc. to not get completely lost.
Compilers have debug symbols, you can tune optimization levels, etc. so it's hopefully not too scary of a mess once you objdump it, but I've seen people both use their assembly knowledge at work and get rewarded handsomely for it.
ghhrjfkt4k
I once used it to get a 4x speedup of sqrt computations, by using SIMD. It was quite fun, and also quite self contained and manageable.
The library sqrt handles all kinds of edge-cases which prevent the compiler from autovectorizing it.
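The idea, as a rough sketch (illustrative; assumes SSE and inputs known to be non-negative, which is exactly the edge-case knowledge libm can't assume):

```c
#include <math.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Four square roots per instruction, with a scalar tail for n % 4.
 * Skipping NaN/negative/errno handling is the part the library sqrt
 * can't do for you, and what blocks autovectorization. */
void sqrt_array(float *dst, const float *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_sqrt_ps(_mm_loadu_ps(src + i)));
    for (; i < n; i++)
        dst[i] = sqrtf(src[i]);
}
```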
mobiledev2014
Given that there's a mini-genre of games that emulate using assembly to solve puzzles, the answer is clearly yes. Not sure if any of them teach a real language.
The most popular are the Zachtronics games and Tomorrow Corp games. They’re so so good!
jupp0r
I personally don't think there's much value in writing assembly (vs using intrinsics), but it's been really helpful to read it. I have often used Compiler Explorer (https://godbolt.org/) to look at the assembly generated and understand optimizations that compilers perform when optimizing for performance.
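A toy example of that workflow (mine, just for illustration): paste something like this into Compiler Explorer and flip between -O0 and -O2 to watch the branches typically turn into conditional moves.

```c
/* At -O2, GCC and Clang usually compile this to branchless cmov
 * instructions; at -O0 you see the naive compare-and-jump version. */
int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}
```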
imchaz
I'll be honest, I didn't read through much. FFmpeg gives me severe PTSD. My first task out of college was to write a procedurally generated video using ffmpeg, conform to DASH, and get it under 150kb/s while staying readable. The docs were unusable. DASH was only a few months old. And Stack Overflow was devoid of help. I kid you not, the only way to get any insight was some sketchy IRC channel. (2016 btw, well past IRC's prime.)
thegrim33
Not trying to be too negative, but given the memories your comment brought up in me, I need to rant about ffmpeg for a minute. ffmpeg is the worst-documented major library I've ever used in my life. I integrated with it to render videos inside my 3D engine, and boy do I shiver at any thought of having to work with it again.
The "documentation" is a collect of 15-20 year old source samples. The vast majority of them either won't compile anymore because the API has changed, or they use 2, 3, or 4 times deprecated functions that shouldn't be used anymore. The source examples also have almost no comments explaining anything. They have super dense, super complicated code, with no comments, but then there will be a line like "setRenderSpeed(3)" or whatever and it'll have a comment: "Sets render speed to 3", the absolute most useless comment ever. The source examples are also all written in 30 year old as C-Style of C code as you can get, incredibly horribly dense, with almost no organization, have to jump up and down all over the file to find the variables being accessed, it's just gross and barely comprehensible.
They put a lot of effort into producing doxygen documentation for everything, but the doxygen documentation is nearly useless: it just lists the API with effectively zero explanation of the functions or parameters. There's so little explanation of how to do anything. On the website they have sections for each library, and for most libraries you get 2-3 sentences of explanation on what the library is for, and that's it. That's the extent to which the entire library is documented. They just drop an undocumented massive C API split across a dozen or so libraries on you and wish you luck.
The API has also gotten absolutely wrecked as it has evolved over the last 20 years or however long it's been around. Sometimes they straight up delete functions to deprecate them; sometimes they create a new version of a function as function2 and then function3 and keep all of them around; sometimes they replace a function with a completely differently named function and keep both around. And there's nothing written anywhere about the "right" way to do anything or which functions you should actually be using. So many times I went down rabbit holes reading some obscure 15-year-old mailing list post, trying to find anyone who had successfully done the thing I was trying to do. And again, the obscure message board posts and documentation that do exist are almost all deprecated at this point and shouldn't be used.
Then there's the custom build system: if you need a custom build that enables or disables certain features, you can't use any modern build system. It's all custom scripts that do weird things like hardcoded dumping of build output into your home directory, which makes it difficult to integrate with a modern build system.
It has so much momentum, and so many users, but man, there has to be a massive opening for someone to replace ffmpeg with a modern programming language and a modern build system, built with GPU acceleration in mind from the beginning rather than tacked on top 20 years later, not written in 30-year-old C style, and actually documented.
slicktux
Kudos for the K&R reference! That was the book I bought to learn C and programming in general. I had initially tried C++ as my first language but I found it too abstract to learn because I kept asking what was going on underneath the hood.
lukaslalinsky
This is perfect. I used to know x86 assembly back in the 386 days, but the more advanced processors got too complex. I'd definitely like to learn more about SIMD on recent CPUs, so this seems like a great resource.
foresto
> Note that the “q” suffix refers to the size of the pointer *(*i.e in C it represents *sizeof(*src) == 8 on 64-bit systems, and x86asm is smart enough to use 32-bit on 32-bit systems) but the underlying load is 128-bit.
I find that sentence confusing.
I assume that i.e is supposed to be i.e., but what is *(* supposed to mean? Shouldn't that be just an open parenthesis?
In what context would *sizeof(*src) be considered valid? As far as I know, sizeof never yields a pointer.
I get the impression that someone sprinkled random asterisks in that sentence, or maybe tried to mix asterisks-denoting-italics with C syntax.
kevingadd
Yes, this looks like something went wrong with the markdown itself or the conversion of the source material to markdown.
sweeter
Wouldn't it return the size of the pointer? I would guess it's exclusively used to handle architecture differences.
imglorp
Asm is 10x faster than C? That was definitely true at some point, but is it still true today? Have compilers really stagnated so badly that they can't come close to hand-coded asm?
jsheard
C with intrinsics can get very close to straight assembly performance. The FFmpeg devs are somewhat infamously against intrinsics (IIRC they don't allow them in their codebase even if the performance is as good as the equivalent assembly), but even by TFA's own estimates the difference between intrinsics and assembly is on the order of 10-15%.
You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that, which is often, because auto-vectorization still mostly sucks beyond trivial cases. It's not really a surprise that expert code runs circles around naive code though.
CyberDildonics
> You might see a 10x difference if you compare meticulously optimized assembly to naive C in cases where vectorization is possible but the compiler fails to capitalize on that,
I can get far more than 10x over naive C just by reordering memory accesses. With SIMD it can be 7x more on top of that, but that can be done with ISPC; it doesn't need to be done in asm.
UltraSane
"The FFmpeg devs are somewhat infamously against intrinsics (they don't allow them in their codebase even if the performance is as good as equivalent assembly)"
Why?
Narishma
I don't know if it's their reason but I myself avoid them because I find them harder to read than assembly language.
schainks
Did you read lesson one?
TL;DR They want to squeeze every drop of performance out of the CPU when processing media, and maintaining a mixture of intrinsics and assembly is not worth the trade-off when doing 100% assembly offers better performance guarantees, readability, and easier maintenance / onboarding of developers.
oguz-ismail
Have you seen C code with SIMD intrinsics? They are an eyesore
lukaslalinsky
This is for heavily vectorized code, using every hack possible to fully utilize the CPU. Compilers are smart when it comes to normal code, but codecs are not really normal code. Not an ffmpeg programmer, but I have some background dealing with audio.
PaulDavisThe1st
> codecs are not really normal code.
Not really a fair comment. They are entirely normal code in most senses. They differ in one important way: they are (frequently) perfect examples of where "single instruction, multiple data" completely makes sense. "Do this to every sample" is the order of the day, and that is a bit odd when compared with text processing or numerical computation.
But this is true of the majority of signal processing, not just codecs. As simple a thing as increasing the volume of an audio data stream means multiplying every sample by the same value - more or less the definition of SIMD.
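The volume example, sketched with AVX intrinsics (illustrative only, compile with -mavx; the function name is mine):

```c
#include <immintrin.h>
#include <stddef.h>

/* "Do this to every sample": one broadcast multiply applied 8 lanes
 * at a time, with a scalar tail for the leftover samples. */
void apply_gain(float *samples, size_t n, float gain)
{
    __m256 g = _mm256_set1_ps(gain);  /* gain in all 8 lanes */
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(samples + i,
                         _mm256_mul_ps(_mm256_loadu_ps(samples + i), g));
    for (; i < n; i++)
        samples[i] *= gain;
}
```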
astrange
There's a difference because audio processing is often "massively parallel", or at least like 1024 samples at once, but in video codecs operations could be only 4 pixels at once and you have to stretch to find extra things to feed the SIMD operations.
bad_username
> codecs are not really normal code.
Codecs are pretty normal code. You can get decent performance by just writing quality idiomatic C or C++, even without asm. (I implemented a commercial H.264 codec and worked on a bunch of audio codecs.)
warble
I highly doubt it's true. I can usually approach the same speed in C if I'm working with a familiar compiler. Sometimes I can do significantly better in assembly but it's rare.
I work on bare metal embedded systems though, so maybe there's some nuance when working with bigger OS libs?
umanwizard
The difference is probably that you don’t work in an environment that supports SIMD or your code can’t benefit from it.
variadix
C compilers are still pretty bad at auto-vectorization. For problems where SIMD is applicable, you can reasonably expect a 2x-16x speedup over the naive scalar implementation.
astrange
Also, if you write code with intrinsics, autovectorization can make it _worse_. E.g. a common pattern is to write a SIMD main loop and then a scalar tail, but the compiler can autovectorize the scalar tail and mess it up.
epolanski
I remember a series of lectures where an Intel engineer went into how difficult it is to write assembly for x86. He basically stated that the number of cases where you can really write code faster than what a compiler would produce is close to zero.
Essentially, people think they are writing low-level code, but that's not how CPUs actually execute it, so he explained how writing manual assembly kills performance pretty much always (at least on modern x86).
iforgotpassword
That's for random "I know asm so it must be faster".
If you know it really well, have already optimized everything at the algorithmic level, and have code that can benefit from SIMD, 10x is real.
FarmerPotato
You have to consider that modern CPUs don't execute code in-order, but speculatively, in multiple instruction pipelines.
I've used Intel's icc compiler and profiler tools in an iterative fashion. A compiler like Intel's can profile cache misses, pipeline utilization, branches, and stalls, and supposedly improve the next compilation based on that.
The assembly programmer has to consider those factors. Sure would be nice to have a computer check those things!
In the old days, we only worried about cycle counts, wait states, and number of instructions.
1propionyl
It's not a matter of compiler stagnation. The compiler simply isn't privy to the information the assembly author makes use of to inform their design.
Put more simply: a C compiler can't infer from a plain C implementation that you're trying to do certain mathematics that could alternatively be expressed more efficiently with SIMD intrinsics. It doesn't have access to your knowledge about the mathematics you're trying to do.
There are also target-specific considerations. A compiler is, necessarily, a general-purpose compiler. Problems like resource (e.g. register) allocation are NP-complete (equivalent to graph coloring), and very few people want their compiler to spend hours upon hours searching for the absolute most optimal (if indeed you can even know that statically...) asmgen.
bob1029
This gets even more complex once you start looking at dynamic compilations. Some of the JIT compilers have the ability to hot patch functions based upon runtime statistics. In very large, enterprisey applications with unknowns regarding how they will actually be used at build time, this can make a difference.
You can go nuclear option with your static compilations and turn on all the optimizations everywhere, but this kills inner loop iteration speed. I believe there are aspects of some dynamic compiling runtimes that can make them superior to static compilations - even if we don't care how long the build takes.
astrange
Statistics aren't magic, and a JIT isn't going to find superoptimization cases like this by using them. I think this is only helpful when you get a lot of incoming poorly-written/dynamic code that needs a lot of inlining, and that maybe was just generated in the first place. So, basically, serving ads on websites.
In ffmpeg's case you can just always do the correct thing.
jki275
Probably some very niche things. I know I can't write ASM that's 10x better than C, but I wouldn't assume no one can.
CyberDildonics
It isn't very hard to write C that is 10x better than typical C, because most programs have too many memory allocations and terrible memory access patterns. Once you sort that out you are already more than 10x ahead; then you can turn on the juice with SIMD, parallelization, and possibly optimizing for memory bandwidth as well.
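A toy illustration of the memory-access point (my example; the exact factor depends on the machine): both functions below compute the same sum, but the first walks the array in cache-line order while the second strides across it.

```c
#define N 4096
static float a[N][N];  /* 64 MB, far larger than any cache */

float sum_row_major(void)  /* sequential: each cache line fully used */
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

float sum_col_major(void)  /* strided: roughly one cache miss per access */
{
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```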
1propionyl
It depends on what you're trying to do. I would in general only expect such substantial speedups when considering writing computation kernels (for audio, video, etc).
Compilers today are liable in most circumstances to know many more tricks than you do. Especially if you make use of hints (e.g. "this memory is almost always accessed sequentially", "this branch is almost never taken", etc) to guide it.
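Concretely, the hints meant here are things like GCC/Clang's __builtin_expect (a sketch; the macro and function names are mine, and whether such hints still pay off is exactly what the reply below disputes):

```c
#include <stddef.h>

/* The common likely/unlikely idiom built on __builtin_expect. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* The hint tells the compiler to lay out the loop so the common path
 * falls straight through; the rare negative case leaves the hot path. */
long sum_nonnegative(const long *v, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))
            continue;
        s += v[i];
    }
    return s;
}
```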
astrange
Mm, those hints don't matter on modern CPUs. There's no good way for the compiler to pass it down to them either. There are some things like prefetch instructions, but unless you know the exact machine you're targeting, you won't know when to use them.
jki275
Oh I definitely agree that in the vast majority of cases the compiler will probably win.
But I suspect there are cases where the super experts exist who can do things better.
fracus
I'm halfway through this tutorial and I'm really enjoying it. I haven't touched assembly since back in university decades ago. I've always had an urge to optimize processes for some reason. This scratches that itch. I was also more curious about SIMD since hearing about it on Digital Foundry.
Charon77
This is very approachable and beginner-friendly. Kudos to the authors.
Another resource on the same topic: https://blogs.gnome.org/rbultje/2017/07/14/writing-x86-simd-...
computerbuster
As I'm seeing in the comments here, the usefulness of handwritten SIMD ranges from "totally unclear" to "mission critical". I'm seeing a lot on the "totally unclear" side, but not as much on the "mission critical" side, so I'll talk a bit about that.
FFmpeg is a pretty clear use case because of how often it is used, but I think it is easier to quantify the impact of handwriting SIMD with something like dav1d, the universal production AV1 video decoder.
dav1d is used pretty much everywhere, from major browsers to the Android operating system (superseding libgav1). A massive element of dav1d's success is its incredible speed, which is largely due to how much of the codebase is handwritten SIMD.
While I think it is a good thing that languages like Zig have built-in SIMD support, there are some use cases where it becomes necessary to do things by hand because even a potential performance delta is important to investigate. There are lines of code in dav1d that will be run trillions of times in a single day, and they need to be as fast as possible. The difference between handwritten & compiler-generated SIMD can be up to 50% in some cases, so it is important.
I happen to be somewhat involved in similar use cases, where things I write will run a lot of times. To make sure these skills stay alive, resources like the FFmpeg school of assembly language are pretty important, in my opinion.