FFmpeg Assembly Language Lessons

95 comments

August 18, 2025

cr125rider

I can’t imagine the scale that FFmpeg operates at. A small improvement has to be thousands and thousands of hours of compute saved. Insanely useful project.

prisenco

Their commitment to performance is a beautiful thing.

Imagine all projects were similarly committed.

godelski

There's tons of backlash here as if people think better performance requires writing in assembly.

But to anyone complaining, I want to know, when was the last time you pulled out a profiler? When was the last time you saw anyone use a profiler?

People asking for performance aren't pissed you didn't write Microsoft Word in assembly; we're pissed it takes 10 seconds to open a fucking text editor.

I literally timed it on my M2 Air. 8s to open and another 1s to get a blank document. Meanwhile it took (neo)vim 0.1s and it's so fast I can't click my stopwatch fast enough to properly time it. And I'm not going to bother checking because the race isn't even close.

I'm (we're) not pissed that the code isn't optimal, I'm pissed because it's slower than dialup. So take that Knuth quote you love about optimization and do what he actually suggested. Grab a fucking profiler; it is more important than your Big O.

nwallin

Another datapoint that supports your argument is the Grand Theft Auto Online (GTAO) thing a few months ago.[0] GTAO took 5-15 minutes to start up. Like you click the icon and 5-15 minutes later you're in the main menu. Everyone was complaining about it for years. Years. Eventually some enterprising hacker disassembled the binary and profiled it. 95% of the runtime was in `strlen()` calls. Not only was that where all the time was spent, but it was all spent `strlen()`ing the exact same ~10MB resource string. They knew exactly how large the string was because they allocated memory for it, and then read the file off the disk into that memory. Then they were tokenizing it in a loop. But their tokenization routine didn't track how big the string was, or where the end of it was, so for each token it popped off the beginning, it had to `strlen()` the entire resource file.

The enterprising hacker then wrote a simple binary patch that reduced the startup time from 5-15 minutes to like 15 seconds or something.

To me that's profound. It implies that not only was management not concerned about the start up time, but none of the developers of the project ever used a profiler. You could just glance at a flamegraph of it, see that it was a single enormous plateau of a function that should honestly be pretty fast, and anyone with an ounce of curiosity would be like, ".........wait a minute, that's weird." And then the bug would be fixed in less time than it would take to convince management that it was worth prioritizing.

It disturbs me to think that this is the kind of world we live in, where people lack such basic curiosity. The problem wasn't that optimization was hard (though optimization can be extremely hard); it was that nobody gave a shit and nobody was even remotely curious about bad performance. They just accepted bad performance as if that's just the way the world is.

[0] Oh god it was 4 years ago: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
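The tokenization pattern described above can be sketched in a few lines of C. This is a simplified illustration, not the game's actual code: the comma-separated format and both function names are made up for the example. The first version rescans the rest of the buffer with `strlen()` for every token, so the total work grows quadratically; the second takes the length once and carries an end pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical slow tokenizer: every iteration calls strlen() on the
 * remaining buffer, so counting n bytes of tokens costs O(n^2). */
size_t count_tokens_quadratic(const char *buf)
{
    size_t count = 0;
    const char *p = buf;

    while (*p != '\0') {
        size_t remaining = strlen(p);   /* walks to the end every time */
        const char *comma = memchr(p, ',', remaining);
        count++;
        if (comma == NULL)
            break;
        p = comma + 1;
    }
    return count;
}

/* Same result, but the caller passes the length it already knows
 * (it allocated the buffer), so the whole pass is O(n). */
size_t count_tokens_linear(const char *buf, size_t len)
{
    size_t count = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        const char *comma = memchr(p, ',', (size_t)(end - p));
        count++;
        if (comma == NULL)
            break;
        p = comma + 1;
    }
    return count;
}
```

On a few bytes the two are indistinguishable; on a ~10MB buffer with many small tokens, the first one is the kind of plateau that stands out in a flamegraph.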

AdieuToLogic

> People asking for performance aren't pissed you didn't write Microsoft Word in assembly; we're pissed it takes 10 seconds to open a fucking text editor.

It could be worse I suppose...

Some versions of Microsoft Excel had a flight simulator embedded in them[0]!

:-D

0 - https://web.archive.org/web/20210326220319/https://eeggs.com...

1vuio0pswjnm7

"I literally timed it on my M2 Air."

I bet it opens faster on a Surface Pro

therealmarv

like Slack or Jira... lol.

sfn42

That would be an enormous waste of time. 99.9% of software doesn't have to be anywhere near optimal. It just has to not be wasteful.

Sadly lots of software is blatantly wasteful. But it doesn't take fancy assembly micro optimization to fix it, the problem is typically much higher level than that. It's more like serialized network requests, unnecessarily high time complexities, just lots of unnecessary work and unnecessary waiting.

Once you have that stuff solved you can start looking at lower level optimization, but by that point most apps are already nice and snappy so there's no reason to optimize further.

harikb

Sorry, I would word it differently. 99.9% of software should be decently performant. Yes, it doesn't need 'fancy assembly micro optimization'. That said, today some large portion of software is written by folks who absolutely don't care about performance - just duct-taping some sh*t to somehow make it work and call it a day.

Almondsetat

Yeah no, I'd like non-performance critical programs to focus on other things than performance thank you

Sesse__

Hard disagree. I'd like word processors to not need ten seconds just to start up. I'd like chat clients not to use _seconds_ to echo my message back to me. I'd like news pages that don't empty my mobile data cap just by existing. All of these are “non-performance critical”, but I'd _love_ for them to focus on performance.

not_your_vase

This mentality brings you a loading screen when you start the calculator on windows.

EliRivers

Surely all programs are performance critical. Any program we think isn't is just a program where the performance met the criteria already.

lo_zamoyski

Indeed. All else remaining the same, a faster program is generally more desirable than a slower program, but we don't live in generalities where all else remains the same and we simply need to choose fast over slow. Fast often costs more to produce.

Programming is a small piece of a larger context. What makes a program "good" is not a property of the program itself, but measured by external ends and constraints. This is true of all technology. Some of these constraints are resources, and one of these resources is time. In fact, the very same limitation on time that motivates the prioritization of development effort toward some features other than performance is the very same limitation that motivates the desire for performance in the first place.

Performance must be understood globally. Say we need a result in three days. If it takes two days to write a program that produces the result in one day, but a week to write a program that produces it in one second, then obviously it is better to write the program the first way. In a week's time, your fast program will no longer be needed! The value of the result will have expired.

This is effectively a matter of opportunity cost.

byteknight

Seems so easy! You only need the entire world even tangentially related to video to rely solely on your project for a task and you too can have all the developers you need to work on performance!

astrange

ffmpeg has competition. For the longest time it wasn't the best audio encoder for any codec[0], and it wasn't the fastest H.264 decoder when everyone wanted that because a closed-source codec named CoreAVC was better[1].

ffmpeg was, however, always the best open-source project, basically because it had all the smart developers who were capable of collaborating on anything. Its competition either wasn't smart enough and got lost in useless architecture-astronauting[2], or were too contrarian and refused to believe their encoder quality could get better because they designed it based on artificial PSNR benchmarks instead of actually watching the output.

[0] For complicated reasons I don't fully understand myself, audio encoders don't get quality improvements by sharing code or developers the way decoders do. Basically because they use something called "psychoacoustic models" which are always designed for the specific codec instead of generalized. It might just be that no one's invented a way to do it yet.

[1] I eventually fixed this by writing a new multithreading system, but it took me ~2 years of working off summer of code grants, because this was before there was much commercial interest in it.

[2] This seems to happen whenever I see anyone try to write anything in C++. They just spend all day figuring out how to connect things to other things and never write the part that does anything?

ackfoobar

I seem to recall that they lamented on twitter the low amount of (monetary or code) contribution they got, despite how heavily they are used.

hluska

You know friend, if open source actually worked like that I wouldn’t be so allergic to releasing projects. But it doesn’t - a large swath of the economy depends on unpaid labour being treated poorly by people who won’t or can’t contribute.

zahlman

It'd be nice, though, to have a proper API (in the traditional sense, not SaaS) instead of having to figure out these command lines in what's practically its own programming language....

codys

FFmpeg does have an API. It ships a few libraries (libavcodec, libavformat, and others) which expose a C API that is used in the ffmpeg command line tool.

They publish doxygen generated documentation for the APIs, available here: https://ffmpeg.org/doxygen/trunk/

zahlman

Don't know how I overlooked that, thanks. Maybe because the one Python wrapper I know about is generating command lines and making subprocess calls.

xxpor

I get why the CLI is so complicated, but I will say AI has been great at figuring out what I need to run given an English language input. It's been one of the highest value uses of AI for me.

gooob

hell yeah, same here. i made a little python GUI app to edit videos

KwanEsq

Prior discussion 2025-02-22, 222 comments: https://news.ycombinator.com/item?id=43140614

NullCascade

What is the actual process of identifying hotspots caused by suboptimal compiler-generated assembly?

Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?

astrange

So the main issues here are not what people think they are. They generally aren't "suboptimal assembly", at least not what you can reasonably expect out of a C compiler.

The factors are something like:

- specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them.

- predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it.

- optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything. Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing.

- taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`.
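The aliasing point in the list above can be illustrated with a small sketch. The function names and the gain/shift arithmetic are invented for the example and are not taken from FFmpeg. Because `uint8_t` is normally `unsigned char`, a store through `dst` may legally modify anything, including `*gain`, so the first version has to assume the gain can change on every iteration. The second version uses C99 `restrict` to promise the buffers don't overlap, which leaves the compiler free to keep the gain in a register and vectorize.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: dst may alias src or *gain as far as the compiler
 * knows, so *gain may be reloaded after every byte store. */
void scale_may_alias(uint8_t *dst, const uint8_t *src,
                     const int *gain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] * *gain) >> 8);
}

/* Same arithmetic, but restrict tells the compiler the three pointers
 * refer to non-overlapping memory. Calling it with overlapping buffers
 * is undefined behavior. */
void scale_no_alias(uint8_t *restrict dst, const uint8_t *restrict src,
                    const int *restrict gain, size_t n)
{
    const int g = *gain;    /* hoisted once; safe because nothing aliases it */

    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((src[i] * g) >> 8);
}
```

Whether a given compiler actually generates better code for the second version is something to check in the disassembly or a benchmark, not assume.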

derf_

One time I spent a week carefully rewriting all of the SIMD asm in libtheora, really pulling out all of the stops to go after every last cycle [0], and managed to squeeze out 1% faster total decoder performance. Then I spent a day reorganizing some structs in the C code and got 7%. I think about that a lot when I decide what optimizations to go after.

[0] https://gitlab.xiph.org/xiph/theora/-/blob/main/lib/x86/mmxl... is an example of what we are talking about here.
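The comment doesn't say what the struct reorganization actually was, so the following is only a guess at the general kind of change that can pay off this way: ordering fields from largest to smallest so less padding is wasted and more entries fit in each cache line. Both structs and all field names are hypothetical and have nothing to do with libtheora's real data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-block record with small and large fields interleaved.
 * The compiler inserts padding after each uint8_t to align what follows. */
struct block_info_loose {
    uint8_t  coded;
    void    *data;
    uint8_t  qindex;
    int32_t  mv_x;
    uint8_t  mode;
    int32_t  mv_y;
};

/* Same fields, largest alignment first. The three bytes now share one
 * slot at the end instead of each dragging padding along. */
struct block_info_packed {
    void    *data;
    int32_t  mv_x;
    int32_t  mv_y;
    uint8_t  coded;
    uint8_t  qindex;
    uint8_t  mode;
};

/* Helpers so the two layouts can be compared at run time. */
size_t loose_size(void)  { return sizeof(struct block_info_loose); }
size_t packed_size(void) { return sizeof(struct block_info_packed); }
```

On common 64-bit ABIs the first layout is typically 32 bytes and the second 24, but the exact numbers depend on the platform, so measure with `sizeof` rather than trusting the arithmetic.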

ack_complete

> And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything.

Wouldn't Intel be the one defining the intrinsics? They're referenced from the ISA manuals, and the Intel Intrinsics Guide regularly references intrinsics like _allow_cpu_features() that are only supported by the Intel compiler and aren't implemented in MSVC.

duped

Normally you spin up a tool like vtune or uprof to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM.

> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?

IME, not really. I've done a fair bit of hand-written assembly and it exclusively comes up when dealing with architecture-specific problems - for everything else you can just write C (unless you hit one of the edge cases where C semantics don't allow you to express something in C, but those are rare).

For example: C and C++ compilers are really, really good at writing optimized code in general. Where they tend to be worse are things like vectorized code which requires you to redesign algorithms such that they can use fast vector instructions, and even then, you'll have to resort to compiler intrinsics to use the instructions at all, and even then, compiler intrinsics can lead to some bad codegen. So your code winds up being non-portable, looks like assembly, and has some overhead just because of what the compiler emits (and can't optimize). So you wind up just writing it in asm anyway, and get smarter about things the compiler worries about like register allocation and out-of-order instructions.

But the real problem once you get into this domain is that you simply cannot tell at a glance whether hand written assembly is "better" (insert your metric for "better" here) than what the compiler emits. You must measure and benchmark, and those benchmarks have to be meaningful.

Sesse__

> Normally you spin up a tool like vtune or uprof to analyze your benchmark hotspots at the ISA level. No idea about tools like that for ARM.

perf is included with the Linux kernel, and works with a fair number of architectures (including Arm).

godelski

You may still need to install linux-tools to get the perf command.

duped

perf doesn't give you instruction level profiling, does it? I thought the traces were mostly at the symbol level

jcranmer

> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?

Not really. There are a couple of reasons to reach for handwritten assembly, and in every case, IR is just not the right choice:

If your goal is to ensure vector code, your first choice is to try slapping explicit vectorize-me pragmas onto the loop. If that fails, your next effort is either to use generic or arch-specific vector intrinsics (or jump to something like ISPC, a language for writing SIMT-like vector code). You don't really gain anything in this use case from jumping to IR, since the intrinsics will satisfy your code.

If your goal is to work around compiler suboptimality in register allocation or instruction selection... well, trying to write it in IR gives the compiler a very high likelihood of simply recanonicalizing the exact sequence you wrote to the same sequence the original code would have produced for no actual difference in code. Compiler IR doesn't add anything to the code; it just creates an extra layer that uses an unstable and harder-to-use interface for writing code. To produce the best handwritten version of assembly in these cases, you have to go straight to writing the assembly you wanted anyways.
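The "vectorize-me pragma" first step mentioned above might look like the sketch below. It assumes a compiler that honors `#pragma omp simd` (GCC and Clang do with `-fopenmp-simd` or `-fopenmp`; without those flags the pragma is ignored and the loop is just plain C). The function name and the saturating-add kernel are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical kernel: saturating add of two byte arrays.
 * The pragma asks the compiler to vectorize the loop; restrict removes
 * the aliasing concern that would otherwise block it. */
void add_saturate_u8(unsigned char *restrict dst,
                     const unsigned char *restrict a,
                     const unsigned char *restrict b,
                     size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (unsigned char)(sum > 255u ? 255u : sum);  /* clamp at 255 */
    }
}
```

If the generated code still isn't what you wanted, the next rungs on the ladder are intrinsics and then hand-written assembly, as described above.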

astrange

Loop vectorization doesn't work for ffmpeg's needs because the kernels are too small and specialized. It works better for scientific/numeric computing.

You could invent a DSL for writing the kernels in… but they did, it's x86inc.asm. I agree ispc is close to something that could work.

commandlinefan

Shame this doesn't start with a quick introduction to running the examples with an actual assembler like NASM.

WhitneyLand

I was expecting to read pearls of wisdom gleaned from all the hard work done on the project, but I’m not really getting how this relates to ffmpeg.

The few chapters I saw seemed to be pretty generic intro to assembly language type stuff.

SilentM68

Why not include the required or targeted math lessons needed for the FFmpeg Assembly Lessons in the GitHub repository? It'd be easier for people to get started if everything was in one place :)

snickerbockers

NTA but if the assumption is that the reader has only a basic understanding of C programming and wants to contribute to a video codec, there is a lot of ground that needs to be covered just to get to how the Cooley-Tukey algorithm works, and even that's just the basic fundamentals.

byryan

I read the repo more as "go through this if you want to have a greater understanding of how things work on a lower level inside your computer". In other words, presumably it's not only intended for people who want to contribute to a video codec/other parts of ffmpeg. But I'm also NTA, so could be wrong.

Alifatisk

How do they make these assembly instructions portable across different cpus?

CannotCarrot

I think there's a generic C fallback, which can also serve as a baseline. But for the big (targeted) architectures, there's one handwritten assembly version per arch.

faluzure

Yup.

On startup, it runs cpuid and assigns each operation the most optimal function pointer for that architecture.

In addition to things like ‘supports avx’ or ‘supports sse4’ some operations even have more explicit checks like ‘is a fifth generation celeron’. The level of optimization in that case was optimizing around the cache architecture on the cpu iirc.

Source: I did some dirty things with Chrome's Native Client and ffmpeg 10 years ago.
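A minimal sketch of that startup dispatch, assuming GCC or Clang on x86 for the feature check. Every name here (`dsp_context`, `dsp_init`, `add_bytes_c`, the placeholder AVX2 routine) is hypothetical and not FFmpeg's real API; the "AVX2" function just forwards to the C version so the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*add_bytes_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Portable C fallback; also the baseline to test optimized versions against. */
static void add_bytes_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);
}

/* Stand-in for a hand-written AVX2 routine. In a real project this would
 * live in a separate assembly file; here it forwards to the C version. */
static void add_bytes_avx2_placeholder(uint8_t *dst, const uint8_t *src, size_t n)
{
    add_bytes_c(dst, src, n);
}

/* Returns nonzero if the running CPU reports AVX2 support.
 * Uses GCC/Clang builtins on x86; reports "no" everywhere else. */
static int cpu_has_avx2(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Hypothetical table of per-operation function pointers. */
struct dsp_context {
    add_bytes_fn add_bytes;
};

/* Called once at startup: start from the C fallback, then override each
 * entry with the best implementation the CPU supports. */
void dsp_init(struct dsp_context *ctx)
{
    ctx->add_bytes = add_bytes_c;
    if (cpu_has_avx2())
        ctx->add_bytes = add_bytes_avx2_placeholder;
}
```

Callers then go through `ctx->add_bytes(dst, src, n)` and never need to know which implementation was picked.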

KeplerBoy

They don't. It's just x86-64.

ahartmetz

The lessons yes, but the repo contains assembly for the 5-6 architectures in wide use in consumer hardware today. Separate files of course. https://github.com/FFmpeg/FFmpeg/tree/master/libavcodec

KeplerBoy

Yeah, sure. I was specifically referring to the tutorials. FFmpeg needs to run everywhere, although I believe they are more concerned about data center hardware than consumer hardware. So probably also stuff like PowerPC.

abhisek

Love it. Thanks for taking the time to write this. Hope it will encourage more folks to contribute.

ngcc_hk

More interesting than I thought it could be. A domain specific tutorial is so much better.

sylware

There is serious abuse of the NASM macro preprocessor. Going to be tough to move away to another assembler.

loeg

Why move away?

oguz-ismail

Where? There's very little code in those lessons

pveierland

The lessons reference `cglobal` in `x86inc.asm`:

https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/x...

nisten

I feel like I just got a 3 page intro to autism.

It's glorious.
