Video encoding requires using your eyes
48 comments · February 28, 2025
AlotOfReading
You can see some ringing in the sky around the trees and along the line between the crow's beak and feathers if you look closely. (Alvin's?) fur also goes much farther down his forehead. People who work deeply with codecs are usually hypersensitive to these sorts of issues that mere mortals like us have to strain to see.
There used to be a legendary blog called "Diaries of an x264 developer" by Fiona Glaser [0] where she'd go on long rants about various ways to cheat in encoder comparisons [1], much like this.
[0] https://web.archive.org/web/2012/http://x264dev.multimedia.c...
[1] https://web.archive.org/web/20130215095527/http://x264dev.mu...
machrider
Appreciate this, I was feeling the same way as the original commenter. It looks maybe over-sharpened, but I don't see anything as glaring as the article's text makes it sound! (Of course, I'm not a video codec developer.)
It does remind me of how stereo & speaker manufacturers sometimes boost treble a little bit (rather than being perfectly "transparent" to the original signal) because it gives the impression of clarity. But ideally each step in the processing chain "colors" the signal as little as possible, because those little differences can add up.
LegionMammal978
Yeah, audio response curves have always been a bit confusing to me. Like, they say that headphones should use a Harman curve because that sounds 'best' to listeners, but how valid is it as an objective measure? (E.g., will listeners 50 years from now find a different curve 'better', the same way that instrument tuning has changed over centuries?) And how much of it is responding to current practices in recording and mixing?
Of course, you won't get a sound as if you're in the same room (without a very fancy setup), so you'll generally want some sort of transformation to get an acceptable output. And artists often want to aim for a certain effect on top of that. But with how things currently are, many of the decisions going into the final sound are very opaque.
ksec
In the era of x264, we (the user and enthusiast communities) and the x264 developers deeply cared about preserving the original content as much as possible, even the noise and artefacts.
That was in the 00s. It was not the encoder's job to remove or filter out details, background or not. There were some caveats to that, but only comparatively speaking against the codecs of the time, say RealMedia's RMVB or WMV.
Worth remembering it wasn't really the streaming era back then. People encoded so they could fit more onto CDs and DVDs. At least that was how it started.
Somewhere along the line the internet, or mobile internet (i.e. the iPhone), happened. Now everyone watches on a small screen, with all the detail washed out; people just want to consume, and none of the details matter. What we used to do only with AviSynth filters is now done automatically by the encoder. The 10-minute YouTube video doesn't care about any of that, then it was the 3-minute one, and now the attention span is basically a 30-second TikTok or Instagram Reel. Worst of all, that attention to detail is also gone from Netflix and other long-form movie streaming: VMAF 90 is good enough, let's minimise the bitrate as much as possible.
We should want the highest quality at a minimal bitrate, not bare-minimum "good enough" quality at the lowest possible bitrate. The two are very different.
While the tech giants' march on video codecs means we may someday lose out, fortunately there is still a small group of old hands in movie production, the broadcast industry, and private torrent release groups who care about these things.
Hopefully someday we, especially the West, can move back to celebrating greatness rather than mediocrity.
layer8
The color shift was the most obvious to me. The other artifacts may not be super visible, but if you care about preserving the original picture, it’s certainly an important quality difference.
crazygringo
I agree the "negative" artifacts are almost impossible to see, and came here to the comments to see what the heck the author was talking about.
> People who work deeply with codecs are usually hypersensitive to these sorts of issues that mere mortals like us need to try to see.
I think that kind of shows that the author is unfairly critical.
They're saying "this should not have shipped", when it seems just fine to us "mere mortals".
Yes, video encoding requires using your eyes. But it also seems like it should use normal eyes, not hypersensitive eyes...?
HPsquared
Also video is viewed in motion, not as static frames. And end-users watching on low bitrates aren't going to freeze-frame and zoom in.
trashtensor
It does strike me that the video encoding blog posts that show up here are often these kinds of toxic rants that exaggerate whatever it is they're ranting about, and assume the people working on these things are complete morons for missing whatever minute detail the author is angry about.
sgarland
It’s mostly only visible in the closeup of the kid. There are hairs that have been unnecessarily accentuated, and his eye and eyebrow outline look hyper-sharpened, with rough edges.
The non-zoomed image looks fine to me, and I (to some extent) know what I'm looking for. Some private torrent trackers that pride themselves on transparent encodes will look for this kind of stuff; you have to do multiple test encodes tweaking various ffmpeg parameters, agonizing over A/B screencaps, only to inevitably be told that you either missed some minuscule detail in a single scene or that your encode is bloated.
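For anyone curious what the screencap step looks like in practice, here's a rough sketch: grab the same frame from the source and the test encode so they can be A/B'd pixel-for-pixel. Filenames and timestamps here are made up.

    import subprocess

    def grab_frame(video, timestamp, out_png):
        # -ss before -i seeks the input; -frames:v 1 dumps a single
        # frame as a lossless PNG for comparison.
        subprocess.run([
            "ffmpeg", "-y", "-ss", timestamp, "-i", video,
            "-frames:v", "1", out_png,
        ], check=True)

    for ts in ["00:01:23.456", "00:12:34.000"]:  # scenes under scrutiny
        tag = ts.replace(":", "-")
        grab_frame("source.mkv", ts, f"A_{tag}.png")
        grab_frame("test-encode.mkv", ts, f"B_{tag}.png")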
klik99
As someone who works a lot with sound I notice a ton of artifacts most people don't recognize. Often those artifacts are harder to hear on cheap speakers but become far more obvious with a good setup. But they also add up and while untrained ears can't hear specific examples, they do result in an overall worse experience which is especially frustrating for high end users when they paid a lot for nice speakers and it's just revealing the grunge that was always there.
That's all to say: I also could not find all the things they're talking about, probably a combination of not being trained, not working much with video codecs, and not having the best monitors. But I get the author's frustration, and I'm glad there are people who care about these things! That said, I hunted for that color shift and I'm just not seeing it...
ksec
I remember a thread on HN a while ago discussing this. 50% of the world can't taste the difference between Pepsi and Coca-Cola. People can't tell the difference between 128 kbps and 256 kbps MP3. Etc.
For the people who are sensitive to these things, it is more of a curse than a gift. Some can't taste the difference between corn-fed (or corn-finished) and grass-fed beef, or see the colour shift in this article, or tell how the latest TVs compare across OLED, QD-OLED, four-layer OLED, and mini-LED from different brands.
It turns out that being able to compare is a skill set in itself. I would assume comparing is also a function that requires more brain power and energy, and most people's natural state is to conserve that energy.
I have been thinking about this for quite some time. Most people don't know how to compare, or what to compare against. And precisely because most people don't know how to compare, or how not to, we need marketing. I think most successful founders are very good at comparing things; Steve Jobs would be a prime example.
mrob
The ringing is most obvious in the striped shirt in the painting on the wall. It's added entirely new stripes that don't exist in the original.
ZeroGravitas
I don't immediately see the issue either.
Though even if I could: this is a new way to preprocess an image before feeding it into an encoder, and the examples have both been fed through a downscaler, then the standard encoder, presumably at standard Netflix bitrates, and then (I think) upscaled back to the original size.
So if this didn't look a little compressed then that would be a methodological mistake, as you don't use downscaled encoding unless you've already decided that a full size encode has too much quality for your task.
And Netflix generally has incentives set up to reduce quality until their customers notice. That's why they quote stats based on that.
In web dev terms it's like reducing the size of your product images until it hurts sales. You're almost guaranteed to have artifacts visible to image compression experts before you hit the point that it affects your bottom line. If you are targeting customers on slow internet (and again if you are downscaling then you basically are) your sales are likely to initially rise as you get usable pictures to people faster.
SG-
The fact that it looks different is the problem.
dist-epoch
Ringing is pretty obvious to me. It has a specific meaning in this context: it means edges are over-sharpened to the point that "fake" extra edges appear.
astrange
That is what it should mean, but in image compression people use it to mean "mosquito noise" artifacts, which come from quantized DCT compression applied to edges. (Nyquist theorem: the DCT is bad at edges because they're made of an infinite number of frequencies.)
bestham
That is the exact same phenomenon. Artificial sharpening introduces high-frequency components into a signal. If you band-limit (low-pass) a signal to fit below the Shannon-Nyquist limit, you will get ringing, because the signal cannot be represented accurately and smears in the time domain. Given a bandwidth constraint, artificial sharpening above a certain threshold will result in ringing.
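You can see this numerically in a few lines of NumPy: a minimal sketch that band-limits an ideal step edge by zeroing its high-frequency FFT bins (the cutoff choice is arbitrary).

    import numpy as np

    n = 256
    step = np.zeros(n)
    step[n // 2:] = 1.0              # ideal step edge, range [0, 1]

    spectrum = np.fft.rfft(step)
    spectrum[16:] = 0                # hard low-pass: keep 16 lowest bins
    limited = np.fft.irfft(spectrum, n)

    # The band-limited step now overshoots past 0 and 1 (Gibbs
    # phenomenon), which is exactly the halo/ripple seen around edges.
    print(limited.min(), limited.max())   # roughly -0.09 and 1.09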
cvz
It's the exact same phenomenon in both. Not sure where you're making the distinction.
kelseyfrog
I'd imagine that for the group of people who can see the difference, it must be very aggravating - much like when you start to see kerning mistakes. For the rest of us, though, it is imperceptible.
mikeryan
Sorta related story.
I worked for a cable channel (TechTV) in the late '90s and early 2000s, until Comcast bought it, laid everyone off, and turned it into G4. This was in the early days of cable VOD. At that time you had to pay a service by the minute to watch your video before they'd distribute it. That was the QA the cable companies forced on you.
The fun note is that those services charged double for “adult” content.
runeblaze
> Is Lanczos an example, or the current best option
What’s the best (computationally not that more expensive than Lanczos) option?
Edit: also, some CV researchers write like that (like the Netflix post). Bicubic is just a flag in OpenCV that they use without thinking about it. Those researchers were probably much more preoccupied with the research problem than with actual wide deployment, as many researchers are.
bick_nyers
It's not really possible to say what's "best" because the criterion is super subjective.
I personally like the Spline family, and I default to Spline36 for both upscaling and downscaling in ffmpeg. Most people can't tell the difference between Spline36 and Lanczos3. If you want more sharpness, go for Spline64, for less sharpness, try Spline16.
Edit: as far as I'm aware, though, OpenCV doesn't have spline as an option for resizing.
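For reference, here's roughly how I invoke it. This assumes an ffmpeg build with the zimg-based zscale filter, which is where the spline16/36/64 kernels live (plain swscale only exposes a generic spline flag); filenames and the target size are made up.

    import subprocess

    subprocess.run([
        "ffmpeg", "-i", "input.mkv",
        # swap in spline16 (softer) or spline64 (sharper) to taste
        "-vf", "zscale=w=1280:h=720:filter=spline36",
        "-c:v", "libx264", "-crf", "18",
        "output.mkv",
    ], check=True)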
lxe
Thought this was gonna be a thoughtful nuanced critique but instead we get a wild unsubstantiated rant.
GrantMoyer
The thing is there's not much nuance to be had. Netflix's approach is just worse than existing best approaches on all axes.
bbstats
Speaking of using your eyes - zooming in on small parts of single frames is not an accurate representation of watching something!
bonoboTP
Is downscaling difficult? I can understand that upscaling is hard and you need learning. But when downscaling, for me OpenCV's "area" interpolation always gives great results super fast.
Sesse__
Is “area” just a box filter, which is what it sounds like? If so, it gives really blurry results; hardly great.
bonoboTP
Yes, it can be seen as a box filter (with the size equal to the stride length) applied to the nearest-neighbor-upscaled image. Basically, imagine overlaying the new square grid on the old one; each new pixel gets a combination of the overlapping old pixels, weighted by the overlap areas.
To my eyes it's more pleasant than Lanczos, which has too much ringing.
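For integer factors you can check this equivalence directly. A small sketch, assuming the factor divides the image size evenly, in which case INTER_AREA reduces to averaging each k*k block:

    import numpy as np
    import cv2

    img = (np.random.rand(512, 512, 3) * 255).astype(np.uint8)
    k = 4  # downscale factor

    small_cv = cv2.resize(img, (512 // k, 512 // k),
                          interpolation=cv2.INTER_AREA)

    # The same thing as an explicit box filter: average each k*k block.
    small_box = img.reshape(512 // k, k, 512 // k, k, 3).mean(axis=(1, 3))

    # Matches up to uint8 rounding.
    print(np.abs(small_cv - small_box).max())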
derf_
In short: no, it shouldn't be.
The Netflix post is sort of bizarre. They claim to be optimizing for minimum mean-squared error (MSE) given a conventional (bicubic) upscaling process [0], but... that should have an analytic closed-form solution, as this post states? You definitely do not need multiple layers of neural networks to achieve it. Then they present VMAF results, but VMAF is very much not equivalent to MSE, so you have no idea if they even improved the metric they optimized for.
Subjective results are similarly unpersuasive: it isn't clear if "the deep downscaler was preferred by 77% of test subjects" means they thought it was closer to the original or simply "better" than Lanczos [1]. Netflix may not care: longer watch times are longer watch times. But as an engineer you might want to know if that is because you actually achieved the thing you were optimizing for, or if it is due to an artifact of the process that might go away the next time you change something to actually improve what you were optimizing for. You can famously make people prefer one audio track over another by making it louder, and video has similar effects around sharpness and contrast (and now, thanks to ML, hallucinated detail).
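To make the closed-form point concrete, here is a 1-D toy sketch. The upscaling matrix below is a stand-in linear interpolator, not Netflix's actual bicubic kernel, but the algebra is the same: if upscaling is a fixed linear map U, the MSE-optimal "downscale" of x is just the least-squares solution d = pinv(U) @ x.

    import numpy as np

    n, factor = 16, 4
    m = n // factor

    # Toy upscaling matrix U (n x m): linear interpolation back to size n.
    U = np.zeros((n, m))
    for i in range(n):
        pos = i * (m - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        t = pos - lo
        U[i, lo] += 1 - t
        U[i, hi] += t

    x = np.sin(np.linspace(0, 3, n))   # toy "full-res" signal

    d_opt = np.linalg.pinv(U) @ x      # closed-form MSE-optimal downscale
    d_naive = x[::factor]              # naive subsampling

    print(np.sum((U @ d_opt - x) ** 2))    # always <= the naive error
    print(np.sum((U @ d_naive - x) ** 2))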
I agree that you can do better than Lanczos for large downscale factors [2]: you need to do something area-based, like you suggest (I have not looked at OpenCV's implementation, but it could be fine). The biggest thing to get wrong is handling gamma incorrectly, but the right thing to do depends on whether you intend to display the result at the downscaled size or upscaled back to the original size as seems to be assumed here (and whether or not your upscaler is gamma-aware, which it probably isn't).
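A minimal sketch of the gamma-aware version, using the common gamma-2.2 approximation rather than the exact piecewise sRGB transfer curve, with made-up filenames and sizes:

    import numpy as np
    import cv2

    img = cv2.imread("input.png").astype(np.float32) / 255.0

    linear = img ** 2.2                # sRGB -> (approximately) linear light
    small = cv2.resize(linear, (960, 540), interpolation=cv2.INTER_AREA)
    out = small ** (1 / 2.2)           # back to sRGB for storage/display

    cv2.imwrite("output.png",
                np.clip(out * 255.0, 0, 255).round().astype(np.uint8))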
As an aside to those struggling to see the visual differences, make sure you are looking at the image in its original (1874x1596) resolution: https://redvice.org/assets/images/netflix-downscaler-compare... (or right-click, View Image on the original page). Otherwise you also have your browser's resampling algorithm in the mix. To my eyes on my display there is also a pretty big color shift in the featureless pink background of the painting on the right wall, but when I look at the actual pixel values, that appears to be an optical illusion. Subjective comparison is hard!
[0] Unlike the post, I think this is reasonable. In the past, we did experiments that showed that optimizing for bilinear when nearest-neighbor is used (for chroma) is worse than optimizing for nearest-neighbor when bilinear is used. I suspect something similar will be true for bicubic and bilinear, but these days it may be safe to assume that you will get at least bicubic upscaling (for luma), because bilinear luma looks really bad. I haven't done a recent survey of actual playback devices, though.
[1] It's also not how you report subjective test results: what was the statistical significance? There are standard protocols for these kinds of tests and it would have been helpful to cite which one was used.
[2] Nobody ever says what downscaling factors are being tested here. The example graph shows 1080p to 342p, or ~3.16x, but Netflix goes as low as 144p (from, e.g., 2160p), so the factors can get pretty large (15x) in practice. A 6-tap filter is not going to cut it.
steinuil
I spent a good while looking at the image on my phone trying to spot what the author was talking about, but ironically the comparison image itself is compressed (not using a NN ;) ) and that obscures the artifacts you're supposed to be looking at.
If you're looking for examples of ringing and hallucinated details, they're really obvious in the framed picture on the right, on the character's shirt and on the frame respectively.
bonoboTP
Since you mention gamma, I have to link this: http://www.ericbrasseur.org/gamma.html?i=1
Also, side-by-side comparisons are hard; it's best to flip back and forth between the two images, like opening them in an image viewer and pressing the arrow keys. Or cross your eyes, as with Magic Eye stereograms, so you see them "layered".
But yes, more fundamentally, I think you're right that this is not really image-content dependent; it doesn't need any image prior if all you want is to minimize mean squared error after upscaling with a fixed bicubic interpolation.
xmprt
Is it just me or does the author here sound like they're hating just to hate? The writing doesn't sound that terrible. Maybe it's a bit amateur, but isn't that what you'd expect from an engineering blog post written by people whose day job is to write code? And to a layman, the image comparison isn't as bad as they make it out to be.
cvz
I am unsurprised that the author would hold a video distribution company which supposedly pays good money for experts to a higher standard than the average hobbyist blogger. I don't think it's hating for the sake of hating.
henning
I love bagging on lazy engineers who just chuck code over the fence without caring about the user experience, but I seriously doubt I would notice this. The sad truth is a lot of video is watched in the background and Netflix knows this. For the specific case of a children's cartoon, I doubt the children watching will notice or care.
If there is user feedback about the quality, then by all means listen to users and at least have an "advanced settings" menu in the app to let users toggle between encoders if they really care.
lofaszvanitt
When will people finally learn that users do not know what they want? You, the expert, have to know what works, how it works, and the best way to use it properly.
And do not rely on user feedback. Again: people don't know what's wrong; they don't even know what's right. They just feel that something is off. Only a very small percentage of users will write an email, and even fewer will get through the automated AI bullshit, or the snotty support person who would dare to hand you a HOW TO USE NETFLIX PDF when you want to report something.
henning
I am not the user. What I honestly believe is right may not work out the way I expect despite my best efforts as an expert.
If you can't get any user feedback on your products, that is its own problem.
I don't see (or maybe don't recognize) any of the issues with the image that they're talking about (ringing, color shift, or fake details). It most certainly doesn't look awful to me; it looks exactly the same, only a bit sharper.