Don't "optimize" conditional moves in shaders with mix()+step()
181 comments
February 9, 2025
quuxplusone
I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
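For context, the two forms being argued about look roughly like this in GLSL (a paraphrase for illustration, not the article's exact code; the function names and the 0.5 threshold are made up):

// Conditional form the article defends; compiles to a select / conditional move, not a jump.
vec3 pickSelect(float x, vec3 a, vec3 b)
{
    return (x > 0.5) ? a : b;
}

// The "branchless" rewrite the article argues against; same result, extra arithmetic.
vec3 pickStep(float x, vec3 a, vec3 b)
{
    return mix(b, a, step(0.5, x));
}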
azeemba
The main point is that the conditional didn't actually introduce a branch.
Showing the other generated version would only show that it's longer. It is not expected to have a branch either. So I don't think it would have added much value
comex
But it's possible that the compiler is smart enough to optimize the step() version down to the same code as the conditional version. If true, that still wouldn't justify using step(), but it would mean that the step() version isn't "wasting two multiplications and one or two additions" as the post says.
(I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)
MindSpunk
Okay but how does this help the reader? If the worse code happens to optimize to the same thing, it's still awful and you get no benefits. It's likely not to optimize down unless you have fast-math enabled, because the extra float ops have to be preserved to be IEEE 754 compliant.
idunnoman1222
Unless you’re writing an essay on why you’re right…
chrisjj
> Unless you’re writing an essay on why you’re right…
He's writing an essay on why they are wrong.
"But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."
Hence his branchless codegen samples are sufficient.
Further, regarding the side-issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.
ncruces
The other either optimizes the same, or has an additional multiplication, and it's definitely less readable.
Lockal
You missed the second part, where the article says that "it actually runs much slower than the original version", "wasting two multiplications and one or two additions", based on the idea that the compiler is unable to do a very basic optimization, implying that the compiler will actually multiply by one. No benchmarks, no checking assembly, just straightforward misinformation.
TheRealPomax
Correct: it would show proof instead of leaving it up to the reader to believe them.
creata
Generated code for RDNA 1:
https://shader-playground.timjones.io/5d3ece620f45091678dcee...
stevemk14ebr
There are 10 types of people in this world. Those who can extrapolate from missing data, and
account42
Making assumptions about performance when you can measure is generally not a good idea.
robertlagrant
and what? AND WHAT?
alkonaut
I wish there was a good way of knowing when an if forces an actual branch rather than when it doesn't. The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
pandaman
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways have the same value, but in the general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
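A rough GLSL sketch of that distinction (the shader, uMode and the shade functions are invented for illustration; what actually gets emitted is up to the compiler):

#version 330 core
in vec2 uv;
out vec4 fragColor;
uniform int uMode;

vec3 shadeA(vec2 p) { return vec3(p, 0.0); }
vec3 shadeB(vec2 p) { return vec3(0.0, p.yx); }

void main()
{
    // Divergent condition (depends on per-fragment data): the compiler
    // typically evaluates both shadeA and shadeB and masks/selects per lane.
    vec3 divergent = (uv.x > 0.5) ? shadeA(uv) : shadeB(uv);

    // Uniform condition (same value for every invocation): a real scalar
    // branch is possible, so the untaken side can be skipped entirely.
    vec3 uniformPick = (uMode == 0) ? shadeA(uv) : shadeB(uv);

    fragColor = vec4(divergent + uniformPick, 1.0);
}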
account42
> So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly to e.g. take a faster special case branch if possible for the entire workgroup and otherwise a slower general case branch but also for the entire workgroup instead of doing both and then selecting.
pandaman
And this is why I wrote "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways have the same value, but in the general case everything will be computed for every condition being true as well as being false."
ryao
You can always have the compiler dump the assembly output so you can examine it. I suspect few do that.
vanderZwan
Does this also apply to shaders? And is it even useful given the enormous variation in hardware capabilities out there? My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
account42
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
mpreda
Exactly. Consider this example:
a = f(z);
b = g(z);
v = x > y ? a : b;
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
dragontamer
This is a GPU focused article.
The GPU will almost always execute both f and g, due to how GPUs differ from CPUs.
You can avoid the f vs g if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always' but requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
justsid
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can’t know it is uniform. It’s sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
danybittel
f or g may have side effects too. Like writing to memory. Now a conditional has a different meaning.
You could also have some fun stuff, where f and g return a boolean, because thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
chrisjj
The good way is to inspect the code :)
> it's also concerning that we have syntax where an if is some times a branch and some times not.
It would be more concerning if we didn't. We might get a branch on one GPU and none on another.
phkahler
>> The good way is to inspect the code :)
The best way is to profile the code. Time is what we are after, so measure that.
mwkaufma
One can do precisely what's done in the article -- inspect the assembly.
NohatCoder
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
account42
You do need to care about large non-uniform branches as in the general case the GPU will have to execute both sides.
ajross
> it's also concerning that we have syntax where an if is some times a branch and some times not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
masklinn
For scalar CPUs, historically CMOV used to be relatively slow on x86, and notably for reliable branching patterns (>75% reliable) branches could be a lot faster.
cmov also has dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a cmov can cost a fair amount of waiting.
Finally, cmov was absolutely terrible on P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.
nosferalatu123
A lot of the myth that "branches are slow on GPUs" is because, way back on the PlayStation 3, they were quite slow. NVIDIA's RSX GPU was on the PS3; it was documented that it was six cycles IIRC, but it always measured slower than that to me. That was for even a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower because the IFEH instruction took six cycles, and the GPU would have to execute both sides of the branch. I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap especially coherent branches.
nice_byte
coherent branches are "free" but the extra instructions increase register pressure. that's the main reason why dynamic branches are avoided, not that they are inherently "slow".
aappleby
These sorts of avoid-branches optimizations were effective once upon a time, as I profiled them on the Xbox 360 and some ancient Intel iGPUs, but yeah - don't do this anymore.
Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
Agentlien
> now every GPU has fast integer ops.
Is that true and to what extent? Looking at the ISA for RDNA2[0] for instance - which is the architecture of both PS5 and Xbox Series S|X - all I can find is 32-bit scalar instructions for integers.
[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
LegionMammal978
You're likely going to have a rough time with 64-bit arithmetic in any GPU. (At least on Nvidia GPUs, the instruction set doesn't give you anything but a 32-bit add-with-carry to help.) But my understanding is that a lot of the arithmetic hardware used for 53-bit double-precision ops can also be used for 32-bit integer ops, which hasn't always been the case.
ryao
The PTX ISA for Nvidia GPUs supports 64-bit integer arithmetic:
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
It needs to support 64-bit integer arithmetic for handling 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I-suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...
Agentlien
I'm less concerned about it being 32-bit and more about them being exclusively scalar instructions, no vector instructions. Meaning only useful for uniforms, not thread-specific data.
[Update: I remembered and double checked. While there are only scalar 32-bit integer instructions you can use 24-bit integer vector instructions. Essentially ignoring the exponent part of the floats.]
qwery
It's less of a big deal than it used to be -- at least on "big" GPUs -- but the article isn't really about avoiding branches. The code presented is already branchless. The people giving out the advice seem to think they are avoiding branches as an optimisation, but their understanding of what branching code is appears to be based on whether they can see some sort of conditional construct.
layer8
This article is also relevant: https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bf...
“If you consult the internet about writing a branch on a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.
Let’s correct that.”
qwery
Some of the mistakes/confusion being pointed out in the article are being replicated here, it seems.
The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.
The article is pointing out that conditional logic in the form presented does not get compiled into conditionally branching code. And that people should not continue to propagate the harmful advice to cover up every conditional thing in sight[0].
Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.
[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.
[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
magicalhippo
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.
As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
Amadiro
I think figuring out the fastest version of a shader at runtime is very non-trivial, I'm not aware of any game or engine that can do this.
I think it'd be possible in principle, because most APIs (D3D, GL, Vulkan etc) expose performance counters (which may or may not be reliable depending on the vendor), and you could in principle construct a representative test scene that you replay a couple times to measure different optimizations. But a lot of games are quite dynamic, having dynamically generated scenes and also dynamically generated shaders, so the number of combinations you might have to test seems like an obstacle. Also you might have to ask the user to spend time waiting on the benchmark to finish.
You could probably just do this ahead of time with a bunch of different GPU generations from each vendor if you have the hardware, and then hard-code the most important decision. So not saying it'd be impossible, but yeah I'm not aware of any existing infrastructure for this.
hansvm
The last time I did anything like this (it was for CPU linear algebra code designed to run in very heterogeneous clusters), I first came up with a parameterization that approximated how I'd expect an algorithm to perform. Then, once for each hardware combination, you sweep through the possible parameterization space. I used log-scaled quantization to make it cheap to index into an array of function pointers based on input specifics.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
ijustlovemath
you'd only have to test worst/median case scenes, which you could find with a bit of profiling!
alexvitkov
This would be acceptable if it meant adding one more shader, but with "modern" graphics APIs forcing us to sometimes have thousands of permutations for the same shader, every variant you add multiplies that count by 2x.
We also don't have an infinite amount of time to work on each shader. You profile on the hardware you care about, and if the choice you've made is slower on some imaginary future processor, so be it - hopefully that processor is faster enough that this doesn't matter.
account42
Graphics APIs don't force you to have thousands of shaders. The abstraction in your engine might.
dist-epoch
Funnily enough, this is sort of what the NVIDIA drivers do: they intercept game shaders and replace them by custom ones optimized by NVIDIA. Which is why you see stuff like this in NVIDIA drivers changelog: "optimized game X, runs 40% faster"
Cieric
I don't work on the nvidia side of things but it's likely to be the same. Shader replacement is only one of a whole host of things we can do to make games run faster. It's actually kind of rare for us to do them since it bloats the size of the driver so much. A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
chrisjj
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
That will break code sufficiently reliant on the behaviour of single precision, though.
SpaghettiCthulu
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
What benefit would that give? Is double precision faster than single on modern hardware?
crazygringo
Wow, how did they pick which games to optimize?
Does the studio pay them to do it? Because Nvidia wouldn't care otherwise?
Does Nvidia do it unasked, for competitive reasons? To maximize how much faster their GPU's perform than competitors' on the same games? And therefore decide purely by game popularity?
Or is it some kind of alliance thing between Nvidia and studios, in exchange for something like the studios optimizing for Nvidia in the first place, to further benefit Nvidia's competitive lead?
Cieric
I can't give details on how we do our selections (not nvidia but another gpu manufacturer). But we do have direct contacts into a lot of studios and we do try and help them fix their game if possible before ever putting something in the driver to fix it. Studios don't pay us; it's mutually beneficial for us to improve the performance of the games. It also helps the game run better on our cards by avoiding some of the really slow stuff.
In general if our logo is in the game, we helped them by actually writing code for them; if it's not, then we might have only given them directions on how to fix issues in their game or put something in the driver to tweak how things execute. From an outside perspective (but still inside the gpu space), nvidia does give advice to keep their competitive advantage. In my experience so far: ignoring barriers that are needed as per the spec, defaulting to massive numbers when the gpu isn't known ("batman and tessellation" should be enough to find that), and doing outright weird stuff that doesn't look like something any sane person would do in writing shaders (I have a thought in my head for that one, but it's not considered public knowledge).
flohofwoe
AFAIK NVIDIA and AMD do this unasked for popular game releases because it gives them a competitive advantage if 'popular game X' runs better on NVIDIA than AMD (and vice versa). If you're an AAA studio you typically also have a 'technical liaison' at the GPU vendors though.
It's basically an arms race. This is also the reason why graphics drivers for Windows are so frigging big (also AFAIK).
esperent
I'd love to read more about this, what kind of changes they make and how many games they do it for. Do they ever release technical articles about it?
sigmoid10
The other commenter makes it sound a bit more crazy than it is. "Intercept shaders" sounds super hacky, but in reality, games simply don't ship with compiled shaders. Instead they are compiled by your driver for your exact hardware. Naturally that allows the compiler to perform more or less aggressive optimisations, similar to how you might be able to optimise CPU programs by shipping C code and only compiling everything on the target machine once you know the exact feature sets.
snicker7
Imagine being the dev of competing game Y and seeing the changelog.
surajrmal
It wouldn't be surprising to find out Nvidia talks directly with game developers to give them hints as to how to optimize their games
chrisjj
> "optimized game X, runs 40% faster"
... and looks 40% crappier? E.g. stuttery, because the driver does not get to see the code ahead of time.
tsylba
It's funny because I've rarely seen this (wrong) approach done anywhere else, but I picked it up by myself (like a lot of people did, I presume) and am still the first to reach for it every time I see the occasion; not so much for optimization (though I admit I thought it wouldn't hurt) but for the flow and natural look of it. It feels somehow more right to me to compose effects by signal interpolation rather than explicit ternary branch instructions.
Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.
At least in WebGPU's WGSL we have the `select` instruction that does that ternary operation hidden as a method, so there is that.
cwillu
Hmm, godbolt is showing branches in the vulkan output:
return x>0.923880?vec2(s.x,0.0):
x>0.382683?s*sqrt(0.5):
vec2(0.0,s.y);
turns into
%24 = OpLoad %float %x
%27 = OpFOrdGreaterThan %bool %24 %float_0_923879981
OpSelectionMerge %30 None
OpBranchConditional %27 %29 %35
%29 = OpLabel
%31 = OpAccessChain %_ptr_Function_float %s %uint_0
%32 = OpLoad %float %31
%34 = OpCompositeConstruct %v2float %32 %float_0
OpStore %28 %34
OpBranch %30
%35 = OpLabel
%36 = OpLoad %float %x
%38 = OpFOrdGreaterThan %bool %36 %float_0_382683009
OpSelectionMerge %41 None
OpBranchConditional %38 %40 %45
%40 = OpLabel
%42 = OpLoad %v2float %s
%44 = OpVectorTimesScalar %v2float %42 %float_0_707106769
OpStore %39 %44
OpBranch %41
%45 = OpLabel
%47 = OpAccessChain %_ptr_Function_float %s %uint_1
%48 = OpLoad %float %47
%49 = OpCompositeConstruct %v2float %float_0 %48
OpStore %39 %49
OpBranch %41
%41 = OpLabel
%50 = OpLoad %v2float %39
OpStore %28 %50
OpBranch %30
%30 = OpLabel
%51 = OpLoad %v2float %28
OpReturnValue %51
https://godbolt.org/z/aqob7YfWq
SideQuark
Vulkan's SPIR-V shader language is not what gets executed. It's a platform-neutral intermediate language, so it won't have the special-purpose optional instructions most GPUs do, since GPUs aren't required to have them.
It likely compiles down on the relevant platforms to something like what the original article showed.
londons_explore
So why isn't the compiler smart enough to see that the 'optimised' version is the same?
Surely it understands step() and can optimize the step()==0.0 and step()==1.0 cases separately?
This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
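Concretely, the simplification the compiler would have to prove looks like this (sketched in GLSL; the function name and the use of x >= y are my own placeholders):

// step(y, x) is either 0.0 or 1.0, so
//   mix(b, a, step(y, x)) = b * (1.0 - t) + a * t   with t in {0.0, 1.0}
// which collapses to a single select, provided the compiler may assume
// the arithmetic introduces no NaN/Inf surprises in a and b.
vec3 folded(float x, float y, vec3 a, vec3 b)
{
    return (x >= y) ? a : b;
}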
NohatCoder
It may very well be, it is the type of optimisation where it is quite possible that some compilers may do it some of the time, but it is definitely also possible to write a version that the compiler can't grok.
ttoinou
Thanks Inigo!
> The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}
How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
Const-me
The only way is to do what OP did – compile your shader, disassemble, and read the assembly.
I do that quite often with my HLSL shaders and have learned a lot about that virtual instruction set. For example, it's interesting that GPUs have a sincos instruction, while inverse trigonometry is emulated by the compiler.
account42
Why are you supposed to know?
Because you care about performance? step being implemented as a library function on top of a conditional doesn't really say anything about its performance vs being a dedicated instruction. Don't worry about the implementation.
Because you are curious about GPU architectures? Look at disassembly, (open source) driver code (including LLVM) and/or ISA documentation.
SideQuark
I’ve never seen a GPU with special primitives for any functions beyond what you’d see in PC-style assembly. Every time I’ve looked at a decompiled shader, it’s always been pretty much what you think of in C.
Also, specs like OpenGL specify much intrinsic behavior, which is then implemented per the spec using standard assembly instructions.
Find an online site that decompiles to various architectures.
TeMPOraL
EDIT: I see my source of confusion must be that "branch" must have a well-understood hardware-specific meaning that goes beyond the meaning I grew up with, which is that a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
To me the problem was obvious, but then again I'm having trouble with both your and author's statements about it.
The problem I saw was, obviously by going for a step() function, people aren't turning logic into arithmetic, they're just hiding logic in a library function call. Just because step() is a built-in or something you'd find used in a mathematical paper doesn't mean anything; the definition of step() in mathematics is literally a conditional too.
Now, the way to optimize it properly to have no conditionals, is you have to take a continuous function that resembles your desired outcome (which in the problem in question isn't step() but the thing it was used for!), and tune its parameters to get as close as it can to your target. I.e. typically you'd pick some polynomial and run the standard iterative approximation on it. Then you'd just have an f(x) that has no branching, just a bunch of extra additions and multiplications and some "weirdly specific" constants.
Where I don't get the author is in insisting that conditional move isn't "branching". I don't see how that would be except in some special cases, where lack of branching is well-known but very special implementation detail - like where the author says:
> also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.
That's because we standardized on two's complement representation for ints, which has the convenient quality of isolating sign as the most significant bit, and for floats the representation (IEEE-754) was just straight up designed to achieve the same. So in both cases, abs() boils down to unconditionally setting the most significant bit to 0 - or, equivalently, masking it off for the instruction that's reading it.
step() isn't like that, nor any other arbitrary ternary operation construct, and nor is - as far as I know - a conditional move instruction.
As for where I don't get 'ttoinou:
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives
The basics like abs() and sqrt() and basic trigonometry are standard knowledge, the rest... does it even matter? step() obviously has to branch somewhere; whether you do it yourself, let a library do it, or let the hardware do it, shouldn't change the fundamental nature.
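For the float case mentioned above, the "free" abs() really is just a sign-bit clear. A purely illustrative GLSL equivalent (real hardware does it as an operand modifier, not as code like this):

// Clears the IEEE-754 sign bit; behaves like abs() for floats.
float absViaBits(float x)
{
    return uintBitsToFloat(floatBitsToUint(x) & 0x7FFFFFFFu);
}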
dahart
> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMP, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the conditional, and sum them up with a 2D dot product (against the compare result and its negation). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler is otherwise going to produce real branch instructions.
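A minimal GLSL sketch of that trick (my own paraphrase of the description above; the function name is made up):

// Selects a when x >= y, otherwise b, without any jump:
// dot((a, b), (t, 1 - t)) = a*t + b*(1 - t), where t is the Heaviside of the compare.
float selectByDot(float x, float y, float a, float b)
{
    float t = step(y, x); // 1.0 when x >= y, else 0.0
    return dot(vec2(a, b), vec2(t, 1.0 - t));
}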
dkersten
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but the GPU itself is just executing a mostly linear set of instructions - or is my understanding completely off? I guess I don’t really understand what it’s doing, especially for loops.
mymoomin
A "branch" here is a conditional jump. This has the issues the article mentions, which branchless programming avoids:
"Again, there is no branching - the instruction pointer isn't manipulated, there's no branch prediction involved, no instruction cache to invalidate, no nothing."
This has nothing to do with whether the behaviour of some instruction depends on its arguments. Looking at the Microsoft compiler output from the article, the iadd (signed add) instruction will get different results depending on its arguments, and the movc (conditional move) will store different values depending on its arguments, but after each the instruction pointer will just move onto the next instruction, so there are no branches.
chrisjj
> a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
s/conditional is a/conditional jump is a/
Problem solved.
Non-jump conditionals have been a thing for decades.
burch45
Branching means different instruction paths, so it requires reading instructions from different memory, which causes a delay when jumping to those new instructions rather than plowing ahead on the current stream. So a conditional jump is a branch, but a conditional move is just an instruction that moves one of two values into a register and doesn’t affect what code is executed next.
ttoinou
It kinda does when you’re wondering what’s going on backstage and working with shaders across multiple OSes, drivers and hardware.
> Now, the way to optimize it properly to have no conditionals, is you have to take a continuous [...]
I suspect that we shader authors really like Clean Math, and that’s also why we like to think such “optimizations” with the step function are a nice modification :-)
Waterluvian
This is a great question that I see everywhere in programming and I think it is core to why you measure first when optimizing.
You generally shouldn’t know or care how a built-in is implemented. If you do care, you’re probably thinking about optimization. At that point the answer is “measure and find out what works better.”
mirsadm
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured, it's been a performance drop doing this. Sometimes significant.
WJW
Is that weird? LLMs will just repeat what is in their training corpus. If most of the internet is recommending something wrong (like this conditional move "optimization") then that is what they will recommend too.
xbar
Not weird but important to note.
diath
> Even Claude/ChatGPT will suggest it as an optimisation.
LLMs just repeat what people on the internet say, and people are often wrong.