Don't "optimize" conditional moves in shaders with mix()+step()
181 comments
February 9, 2025
quuxplusone
I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
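For context, the two forms being argued about look roughly like this in GLSL (a paraphrase for illustration, not the article's exact code; the function names and the 0.5 threshold are made up):

// Conditional form the article defends; compiles to a select / conditional move, not a jump.
vec3 pickSelect(float x, vec3 a, vec3 b)
{
    return (x > 0.5) ? a : b;
}

// The "branchless" rewrite the article argues against; same result, extra arithmetic.
vec3 pickStep(float x, vec3 a, vec3 b)
{
    return mix(b, a, step(0.5, x));
}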
azeemba
The main point is that the conditional didn't actually introduce a branch.
Showing the other generated version would only show that it's longer. It is not expected to have a branch either. So I don't think it would have added much value
comex
But it's possible that the compiler is smart enough to optimize the step() version down to the same code as the conditional version. If true, that still wouldn't justify using step(), but it would mean that the step() version isn't "wasting two multiplications and one or two additions" as the post says.
(I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)
MindSpunk
Okay but how does this help the reader? If the worse code happens to optimize to the same thing, it's still awful and you get no benefits. It's likely not to optimize down unless you have fast-math enabled, because the extra float ops have to be preserved to be IEEE 754 compliant.
idunnoman1222
Unless you’re writing an essay on why you’re right…
chrisjj
> Unless you’re writing an essay on why you’re right…
He's writing an essay on why they are wrong.
"But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."
Hence his branchless codegen samples are sufficient.
Further, regarding the side-issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.
ncruces
The other either optimizes the same, or has an additional multiplication, and it's definitely less readable.
Lockal
You missed the second part, where the article says that "it actually runs much slower than the original version", "wasting two multiplications and one or two additions", based on the idea that the compiler is unable to do a very basic optimization, implying that the compiler will actually multiply by one. No benchmarks, no checking assembly, just straightforward misinformation.
TheRealPomax
Correct: it would show proof instead of leaving it up to the reader to believe them.
creata
Generated code for RDNA 1:
https://shader-playground.timjones.io/5d3ece620f45091678dcee...
stevemk14ebr
There are 10 types of people in this world. Those who can extrapolate from missing data, and
account42
Making assumptions about performance when you can measure is generally not a good idea.
robertlagrant
and what? AND WHAT?
alkonaut
I wish there was a good way of knowing when an if forces an actual branch rather than when it doesn't. The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
pandaman
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways have the same value, but in the general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
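A rough GLSL sketch of that distinction (the shader, uMode and the shade functions are invented for illustration; what actually gets emitted is up to the compiler):

#version 330 core
in vec2 uv;
out vec4 fragColor;
uniform int uMode;

vec3 shadeA(vec2 p) { return vec3(p, 0.0); }
vec3 shadeB(vec2 p) { return vec3(0.0, p.yx); }

void main()
{
    // Divergent condition (depends on per-fragment data): the compiler
    // typically evaluates both shadeA and shadeB and masks/selects per lane.
    vec3 divergent = (uv.x > 0.5) ? shadeA(uv) : shadeB(uv);

    // Uniform condition (same value for every invocation): a real scalar
    // branch is possible, so the untaken side can be skipped entirely.
    vec3 uniformPick = (uMode == 0) ? shadeA(uv) : shadeB(uv);

    fragColor = vec4(divergent + uniformPick, 1.0);
}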
account42
> So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly to e.g. take a faster special case branch if possible for the entire workgroup and otherwise a slower general case branch but also for the entire workgroup instead of doing both and then selecting.
pandaman
And this is why I wrote "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways have the same value, but in the general case everything will be computed for every condition being true as well as being false."
ryao
You can always have the compiler dump the assembly output so you can examine it. I suspect few do that.
vanderZwan
Does this also apply to shaders? And is it even useful given the enormous variation in hardware capabilities out there? My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
account42
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
mpreda
Exactly. Consider this example:
a = f(z);
b = g(z);
v = x > y ? a : b;
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
dragontamer
This is a GPU focused article.
The GPU will almost always execute both f and g, due to how GPUs differ from CPUs.
You can avoid the f vs g if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always' but requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
justsid
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can’t know it is uniform. It’s sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
danybittel
f or g may have side effects too. Like writing to memory. Now a conditional has a different meaning.
You could also have some fun stuff, where f and g return a boolean, because thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
chrisjj
The good way is to inspect the code :)
> it's also concerning that we have syntax where an if is some times a branch and some times not.
It would be more concerning if we didn't. We might get a branch on one GPU and none on another.
phkahler
>> The good way is to inspect the code :)
The best way is to profile the code. Time is what we are after, so measure that.
mwkaufma
One can do precisely what's done in the article -- inspect the assembly.
NohatCoder
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
account42
You do need to care about large non-uniform branches as in the general case the GPU will have to execute both sides.
ajross
> it's also concerning that we have syntax where an if is some times a branch and some times not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
masklinn
For scalar CPUs, historically CMOV used to be relatively slow on x86, and notably for reliable branching patterns (>75% reliable) branches could be a lot faster.
cmov also has dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a cmov can cost a fair amount of waiting.
Finally, cmov was absolutely terrible on P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.
nosferalatu123
A lot of the myth that "branches are slow on GPUs" is because, way back on the PlayStation 3, they were quite slow. NVIDIA's RSX GPU was on the PS3; it was documented that it was six cycles IIRC, but it always measured slower than that to me. That was for even a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower because the IFEH instruction took six cycles, and the GPU would have to execute both sides of the branch. I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap especially coherent branches.
nice_byte
coherent branches are "free" but the extra instructions increase register pressure. that's the main reason why dynamic branches are avoided, not that they are inherently "slow".
aappleby
These sorts of avoid-branches optimizations were effective once upon a time, as I profiled them on the Xbox 360 and some ancient Intel iGPUs, but yeah - don't do this anymore.
Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
Agentlien
> now every GPU has fast integer ops.
Is that true and to what extent? Looking at the ISA for RDNA2[0] for instance - which is the architecture of both PS5 and Xbox Series S|X - all I can find is 32-bit scalar instructions for integers.
[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
LegionMammal978
You're likely going to have a rough time with 64-bit arithmetic in any GPU. (At least on Nvidia GPUs, the instruction set doesn't give you anything but a 32-bit add-with-carry to help.) But my understanding is that a lot of the arithmetic hardware used for 53-bit double-precision ops can also be used for 32-bit integer ops, which hasn't always been the case.
ryao
The PTX ISA for Nvidia GPUs supports 64-bit integer arithmetic:
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
It needs to support 64-bit integer arithmetic for handling 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I-suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...
Agentlien
I'm less concerned about it being 32-bit and more about them being exclusively scalar instructions, no vector instructions. Meaning only useful for uniforms, not thread-specific data.
[Update: I remembered and double checked. While there are only scalar 32-bit integer instructions you can use 24-bit integer vector instructions. Essentially ignoring the exponent part of the floats.]
qwery
It's less of a big deal than it used to be -- at least on "big" GPUs -- but the article isn't really about avoiding branches. The code presented is already branchless. The people giving out the advice seem to think they are avoiding branches as an optimisation, but their understanding of what branching code is appears to be based on whether they can see some sort of conditional construct.
layer8
This article is also relevant: https://medium.com/@jasonbooth_86226/branching-on-a-gpu-18bf...
“If you consult the internet about writing a branch on a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.
Let’s correct that.”
qwery
Some of the mistakes/confusion being pointed out in the article are being replicated here, it seems.
The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.
The article is pointing out that conditional logic in the form presented does not get compiled into conditionally branching code. And that people should not continue to propagate the harmful advice to cover up every conditional thing in sight[0].
Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.
[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.
[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
magicalhippo
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.
As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
Amadiro
I think figuring out the fastest version of a shader at runtime is very non-trivial, I'm not aware of any game or engine that can do this.
I think it'd be possible in principle, because most APIs (D3D, GL, Vulkan etc) expose performance counters (which may or may not be reliable depending on the vendor), and you could in principle construct a representative test scene that you replay a couple times to measure different optimizations. But a lot of games are quite dynamic, having dynamically generated scenes and also dynamically generated shaders, so the number of combinations you might have to test seems like an obstacle. Also you might have to ask the user to spend time waiting on the benchmark to finish.
You could probably just do this ahead of time with a bunch of different GPU generations from each vendor if you have the hardware, and then hard-code the most important decision. So not saying it'd be impossible, but yeah I'm not aware of any existing infrastructure for this.
hansvm
The last time I did anything like this (it was for CPU linear algebra code designed to run in very heterogeneous clusters), I first came up with a parameterization that approximated how I'd expect an algorithm to perform. Then, once for each hardware combination, you sweep through the possible parameterization space. I used log-scaled quantization to make it cheap to index into an array of function pointers based on input specifics.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
ijustlovemath
you'd only have to test worst/median case scenes, which you could find with a bit of profiling!
alexvitkov
This would be acceptable if it meant adding one more shader, but with "modern" graphics APIs forcing us to sometimes have thousands of permutations for the same shader, every variant you add multiplies that count by 2x.
We also don't have an infinite amount of time to work on each shader. You profile on the hardware you care about, and if the choice you've made is slower on some imaginary future processor, so be it - hopefully that processor is faster enough that this doesn't matter.
account42
Graphics APIs don't force you to have thousands of shaders. The abstraction in your engine might.
dist-epoch
Funnily enough, this is sort of what the NVIDIA drivers do: they intercept game shaders and replace them by custom ones optimized by NVIDIA. Which is why you see stuff like this in NVIDIA drivers changelog: "optimized game X, runs 40% faster"
Cieric
I don't work on the nvidia side of things but it's likely to be the same. Shader replacement is only one of a whole host of things we can do to make games run faster. It's actually kind of rare for us to do them since it bloats the size of the driver so much. A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
chrisjj
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
That will break code sufficiently reliant on the behaviour of single precision, though.
SpaghettiCthulu
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
What benefit would that give? Is double precision faster than single on modern hardware?
crazygringo
Wow, how did they pick which games to optimize?
Does the studio pay them to do it? Because Nvidia wouldn't care otherwise?
Does Nvidia do it unasked, for competitive reasons? To maximize how much faster their GPU's perform than competitors' on the same games? And therefore decide purely by game popularity?
Or is it some kind of alliance thing between Nvidia and studios, in exchange for something like the studios optimizing for Nvidia in the first place, to further benefit Nvidia's competitive lead?
Cieric
I can't give details on how we do our selections (not nvidia but another gpu manufacturer). But we do have direct contacts into a lot of studios and we do try and help them fix their game if possible before ever putting something in the driver to fix it. Studios don't pay us; it's mutually beneficial for us to improve the performance of the games. It also helps the game run better on our cards by avoiding some of the really slow stuff.
In general if our logo is in the game, we helped them by actually writing code for them; if it's not, then we might have only given them directions on how to fix issues in their game or put something in the driver to tweak how things execute. From an outside perspective (but still inside the gpu space), nvidia does give advice to keep their competitive advantage. In my experience so far: ignoring barriers that are needed as per the spec, defaulting to massive numbers when the gpu isn't known ("batman and tessellation" should be enough to find that), and doing outright weird stuff that doesn't look like something any sane person would do in writing shaders (I have a thought in my head for that one, but it's not considered public knowledge).
flohofwoe
AFAIK NVIDIA and AMD do this unasked for popular game releases because it gives them a competitive advantage if 'popular game X' runs better on NVIDIA than AMD (and vice versa). If you're an AAA studio you typically also have a 'technical liaison' at the GPU vendors though.
It's basically an arms race. This is also the reason why graphics drivers for Windows are so frigging big (also AFAIK).
esperent
I'd love to read more about this, what kind of changes they make and how many games they do it for. Do they ever release technical articles about it?
sigmoid10
The other commenter makes it sound a bit more crazy than it is. "Intercept shaders" sounds super hacky, but in reality, games simply don't ship with compiled shaders. Instead they are compiled by your driver for your exact hardware. Naturally that allows the compiler to perform more or less aggressive optimisations, similar to how you might be able to optimise CPU programs by shipping C code and only compiling everything on the target machine once you know the exact feature sets.
snicker7
Imagine being the dev of competing game Y and seeing the changelog.
surajrmal
It wouldn't be surprising to find out Nvidia talks directly with game developers to give them hints as to how to optimize their games
chrisjj
> "optimized game X, runs 40% faster"
... and looks 40% crappier? E.g. stuttery, because the driver does not get to see the code ahead of time.
tsylba
It's funny because I've rarely seen this (wrong) approach done anywhere else, but I picked it up by myself (like a lot of people did, I presume) and am still the first to reach for it every time I see the occasion; not so much for optimization (though I admit I thought it wouldn't hurt) but for the flow and natural look of it. It feels somehow more right to me to compose effects by signal interpolation rather than explicit ternary branch instructions.
Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.
At least in WebGPU's WGSL we have the `select` instruction that does that ternary operation hidden as a method, so there is that.
cwillu
Hmm, godbolt is showing branches in the vulkan output:
return x>0.923880?vec2(s.x,0.0):
x>0.382683?s*sqrt(0.5):
vec2(0.0,s.y);
turns into
%24 = OpLoad %float %x
%27 = OpFOrdGreaterThan %bool %24 %float_0_923879981
OpSelectionMerge %30 None
OpBranchConditional %27 %29 %35
%29 = OpLabel
%31 = OpAccessChain %_ptr_Function_float %s %uint_0
%32 = OpLoad %float %31
%34 = OpCompositeConstruct %v2float %32 %float_0
OpStore %28 %34
OpBranch %30
%35 = OpLabel
%36 = OpLoad %float %x
%38 = OpFOrdGreaterThan %bool %36 %float_0_382683009
OpSelectionMerge %41 None
OpBranchConditional %38 %40 %45
%40 = OpLabel
%42 = OpLoad %v2float %s
%44 = OpVectorTimesScalar %v2float %42 %float_0_707106769
OpStore %39 %44
OpBranch %41
%45 = OpLabel
%47 = OpAccessChain %_ptr_Function_float %s %uint_1
%48 = OpLoad %float %47
%49 = OpCompositeConstruct %v2float %float_0 %48
OpStore %39 %49
OpBranch %41
%41 = OpLabel
%50 = OpLoad %v2float %39
OpStore %28 %50
OpBranch %30
%30 = OpLabel
%51 = OpLoad %v2float %28
OpReturnValue %51
https://godbolt.org/z/aqob7YfWq
SideQuark
Vulkan's SPIR-V shader language is not what gets executed. It's a platform-neutral intermediate language, so it won't have the special-purpose optional instructions most GPUs do, since GPUs aren't required to have them.
It likely compiles down on the relevant platforms to something like what the original article showed.
londons_explore
So why isn't the compiler smart enough to see that the 'optimised' version is the same?
Surely it understands step() and can optimize the step()==0.0 and step()==1.0 cases separately?
This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
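Concretely, the simplification the compiler would have to prove looks like this (sketched in GLSL; the function name and the use of x >= y are my own placeholders):

// step(y, x) is either 0.0 or 1.0, so
//   mix(b, a, step(y, x)) = b * (1.0 - t) + a * t   with t in {0.0, 1.0}
// which collapses to a single select, provided the compiler may assume
// the arithmetic introduces no NaN/Inf surprises in a and b.
vec3 folded(float x, float y, vec3 a, vec3 b)
{
    return (x >= y) ? a : b;
}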
NohatCoder
It may very well be, it is the type of optimisation where it is quite possible that some compilers may do it some of the time, but it is definitely also possible to write a version that the compiler can't grok.
ttoinou
Thanks Inigo!
> The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
float step( float x, float y )
{
return x < y ? 1.0 : 0.0;
}
How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
Const-me
The only way is to do what OP did – compile your shader, disassemble, and read the assembly.
I do that quite often with my HLSL shaders and have learned a lot about that virtual instruction set. For example, it's interesting that GPUs have a sincos instruction, while inverse trigonometry is emulated by the compiler.
account42
Why are you supposed to know?
Because you care about performance? step being implemented as a library function on top of a conditional doesn't really say anything about its performance vs being a dedicated instruction. Don't worry about the implementation.
Because you are curious about GPU architectures? Look at disassembly, (open source) driver code (including LLVM) and/or ISA documentation.
SideQuark
I’ve never seen a GPU with special primitives for any functions beyond what you’d see in PC-style assembly. Every time I’ve looked at a decompiled shader, it’s always been pretty much what you think of in C.
Also, specs like OpenGL specify much intrinsic behavior, which is then implemented per the spec using standard assembly instructions.
Find an online site that decompiles to various architectures.
TeMPOraL
EDIT: I see my source of confusion must be that "branch" must have a well-understood hardware-specific meaning that goes beyond the meaning I grew up with, which is that a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
To me the problem was obvious, but then again I'm having trouble with both your and author's statements about it.
The problem I saw was, obviously by going for a step() function, people aren't turning logic into arithmetic, they're just hiding logic in a library function call. Just because step() is a built-in or something you'd find used in a mathematical paper doesn't mean anything; the definition of step() in mathematics is literally a conditional too.
Now, the way to optimize it properly to have no conditionals, is you have to take a continuous function that resembles your desired outcome (which in the problem in question isn't step() but the thing it was used for!), and tune its parameters to get as close as it can to your target. I.e. typically you'd pick some polynomial and run the standard iterative approximation on it. Then you'd just have an f(x) that has no branching, just a bunch of extra additions and multiplications and some "weirdly specific" constants.
Where I don't get the author is in insisting that conditional move isn't "branching". I don't see how that would be except in some special cases, where lack of branching is well-known but very special implementation detail - like where the author says:
> also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.
That's because we standardized on two's complement representation for ints, which has the convenient quality of isolating sign as the most significant bit, and for floats the representation (IEEE-754) was just straight up designed to achieve the same. So in both cases, abs() boils down to unconditionally setting the most significant bit to 0 - or, equivalently, masking it off for the instruction that's reading it.
step() isn't like that, nor any other arbitrary ternary operation construct, and nor is - as far as I know - a conditional move instruction.
As for where I don't get 'ttoinou:
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives
The basics like abs() and sqrt() and basic trigonometry are standard knowledge, the rest... does it even matter? step() obviously has to branch somewhere; whether you do it yourself, let a library do it, or let the hardware do it, shouldn't change the fundamental nature.
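For the float case mentioned above, the "free" abs() really is just a sign-bit clear. A purely illustrative GLSL equivalent (real hardware does it as an operand modifier, not as code like this):

// Clears the IEEE-754 sign bit; behaves like abs() for floats.
float absViaBits(float x)
{
    return uintBitsToFloat(floatBitsToUint(x) & 0x7FFFFFFFu);
}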
dahart
> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMP, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the conditional, and sum them up with a 2D dot product (against the compare result and its negation). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler is otherwise going to produce real branch instructions.
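A minimal GLSL sketch of that trick (my own paraphrase of the description above; the function name is made up):

// Selects a when x >= y, otherwise b, without any jump:
// dot((a, b), (t, 1 - t)) = a*t + b*(1 - t), where t is the Heaviside of the compare.
float selectByDot(float x, float y, float a, float b)
{
    float t = step(y, x); // 1.0 when x >= y, else 0.0
    return dot(vec2(a, b), vec2(t, 1.0 - t));
}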
dkersten
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but the GPU itself is just executing a mostly linear set of instructions - or is my understanding completely off? I guess I don’t really understand what it’s doing, especially for loops.
mymoomin
A "branch" here is a conditional jump. This has the issues the article mentions, which branchless programming avoids:
"Again, there is no branching - the instruction pointer isn't manipulated, there's no branch prediction involved, no instruction cache to invalidate, no nothing."
This has nothing to do with whether the behaviour of some instruction depends on its arguments. Looking at the Microsoft compiler output from the article, the iadd (signed add) instruction will get different results depending on its arguments, and the movc (conditional move) will store different values depending on its arguments, but after each the instruction pointer will just move onto the next instruction, so there are no branches.
chrisjj
> a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
s/conditional is a/conditional jump is a/
Problem solved.
Non-jump conditionals have been a thing for decades.
burch45
Branching means different instruction paths, so it requires reading instructions from different memory, which causes a delay when jumping to those new instructions rather than plowing ahead on the current stream. So a conditional jump is a branch, but a conditional move is just an instruction that moves one of two values into a register and doesn’t affect what code is executed next.
ttoinou
It kinda does when you’re wondering what’s going on backstage and working with shaders across multiple OSes, drivers and hardware.
> Now, the way to optimize it properly to have no conditionals, is you have to take a continuous [...]
I suspect that we shader authors really like Clean Math, and that’s also why we like to think such “optimizations” with the step function are a nice modification :-)
Waterluvian
This is a great question that I see everywhere in programming and I think it is core to why you measure first when optimizing.
You generally shouldn’t know or care how a built-in is implemented. If you do care, you’re probably thinking about optimization. At that point the answer is “measure and find out what works better.”
mirsadm
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured, it's been a performance drop doing this. Sometimes significant.
WJW
Is that weird? LLMs will just repeat what is in their training corpus. If most of the internet is recommending something wrong (like this conditional move "optimization") then that is what they will recommend too.
xbar
Not weird but important to note.
diath
> Even Claude/ChatGPT will suggest it as an optimisation.
LLMs just repeat what people on the internet say, and people are often wrong.