Beware of Fast-Math
209 comments
May 31, 2025
orlp
I helped design an API for "algebraic operations" in Rust: <https://github.com/rust-lang/rust/issues/136469>, which is coming along nicely.
These operations are:
1. Localized, not a function-wide or program-wide flag.
2. Completely safe. -ffast-math includes assumptions such as that there are no NaNs, and violating that is undefined behavior.
So what do these algebraic operations do? Well, one by itself doesn't do much of anything compared to a regular operation. But a sequence of them is allowed to be transformed using optimizations which are algebraically justified, as if all operations were done using real arithmetic.
pclmulqdq
-ffast-math is actually something like 15 separate flags, and you can use them individually if you want. 3 of them are "no NaNs," "no infinities," and "no subnormals." Several of the other flags allow you to treat math as associative or distributive if you want that.
The library has some merit, but the goal you've stated here is given to you with 5 compiler flags. The benefit of the library is choosing when these apply.
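For example, GCC exposes the relevant sub-flags individually (a sketch only; the exact set and their interactions vary by compiler and version, and the file name is just a placeholder):
gcc -O2 -fassociative-math -fno-signed-zeros -fno-trapping-math -freciprocal-math -fno-math-errno -c kernel.c
(-fassociative-math only takes effect when -fno-signed-zeros and -fno-trapping-math are also given; NaNs and infinities stay well-defined because -ffinite-math-only is left off.)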
foota
There's probably a benefit to being able to choose which you want in different places though, right?
glkindlmann
That sounds neat. What would be really neat is if the language helped to expose the consequences of the ensuing rounding error by automating things that are otherwise clumsy for programmers to do manually, like running twice with opposite rounding directions, or running many many times with internally randomized directions (two of the options in Sec 4 of *). That is, it would be cool if Rust enabled people to learn about the subtleties of floating point, instead of hiding them away.
eqvinox
Are these calls going to clear the FTZ and DAZ flags in the MXCSR on x86? And FZ & FIZ in the FPCR on ARM?
orlp
I don't believe so, no. Currently these operations only set the LLVM flags to allow reassociation, contraction, division replaced by reciprocal multiplication, and the assumption of no signed zeroes.
This can be expanded in the future as LLVM offers more flags that fall within the scope of algebraically motivated optimizations.
eqvinox
Ah sorry I misunderstood and thought this API was for the other way around, i.e. forbidding "unsafe" operations. (I guess the question reverses to setting those flags)
('Naming: "algebraic" is not very descriptive of what this does since the operations themselves are algebraic.' :D)
evrimoztamur
Does that mean that a physics engine written with these operations will always compile to yield the same deterministic outcomes across different platforms (assuming they correctly implement algebraic operations, or are able to do so)?
Sharlin
It's more like the opposite. These tell the compiler to assume for optimization purposes that floats are associative and so on (ie. algebraic), even when in reality they aren't. So the results may vary depending on what transformations the compiler performs – in particular, they may vary between optimized and non-optimized builds, which normally isn't allowed.
vanderZwan
> These tell the compiler to assume for optimization purposes that floats are associative and so on (ie. algebraic), even when in reality they aren't.
I wonder if it is possible to add an additional constraint that guarantees the transformation has equal or fewer numerical rounding errors. E.g. for floating point doubles (0.2 + 0.1) - 0.1 results in 0.20000000000000004, so I would expect that transforming some (A + B) - B to just A would always reduce numerical error. OTOH, it's floating point maths, there's probably some kind of weird gotcha here as well.
orlp
No, there is no guarantee which (if any) optimizations are applied, only that they may be applied. For example a fused multiply-add instruction may be emitted for a*b + c on platforms which support it, which is not cross-platform.
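A small C sketch of how contraction alone already changes results (the constants are chosen so the exact product needs more than 53 bits; the volatile is only there to stop the compiler from contracting the "separate" version itself):

#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + 0x1p-30;        /* 1 + 2^-30 */
    double b = 1.0 + 0x1p-30;
    double c = -1.0;

    volatile double p = a * b;       /* product rounded to double: 1 + 2^-29 */
    double separate = p + c;         /* 2^-29 */
    double fused = fma(a, b, c);     /* exact product kept: 2^-29 + 2^-60 */

    printf("%a\n%a\n", separate, fused);   /* the two results differ */
    return 0;
}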
SkiFire13
No, the result may depend on how the compiler reorders them, which could be different on different platforms.
smcameron
One thing I did not see mentioned in the article, or in these comments (according to ctrl-f anyway) is the use of feenableexcept()[1] to track down the source of NaNs in your code.
feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
will cause your code to get a SIGFPE whenever a NaN crawls out from under a rock. Of course it doesn't work with fast-math enabled, but if you're unknowingly getting NaNs without fast-math enabled, you obviously need to fix those before even trying fast-math. They can be hard to find, and feenableexcept() makes finding them a lot easier.
jart
Trapping math is the enlightened way to do things. I wrote an example in the cosmo repo of how to use it. https://github.com/jart/cosmopolitan/blob/master/examples/tr...
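A minimal, self-contained sketch of the same idea (feenableexcept is a glibc extension, so this is Linux-specific):

#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void) {
    /* Trap with SIGFPE on division by zero, invalid operations
       (the NaN-producers) and overflow, instead of letting NaN/Inf
       silently propagate through the computation. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    volatile double zero = 0.0;
    double x = zero / zero;          /* raises FE_INVALID -> SIGFPE here */
    printf("never reached: %f\n", x);
    return 0;
}

Running it under a debugger then stops exactly at the first NaN-producing operation.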
DavidVoid
Yeah it's pretty useful to enable every once in a while just to see if anything complains.
Be very careful with it in production code though [1]. If you're in a dll then changing the FPU exception flags is a big no-no (unless you're really really careful to restore them when your code goes out of scope).
[1]: https://randomascii.wordpress.com/2016/09/16/everything-old-...
emn13
I get the feeling that the real problem here is the IEEE specs themselves. They include a huge bunch of restrictions, each of which individually isn't relevant to something like 99.9% of floating point code, and probably, even in aggregate, they aren't relevant to a large majority of code segments out in the wild. That doesn't mean they're not important - but some of these features should have been locally opt-in, not opt-out. And at the very least, standards need to evolve to support the hardware realities of today.
Not being able to auto-vectorize seems like a pretty critical bug given hardware trends that have been going on for decades now; on the other hand sacrificing platform-independent determinism isn't a trivial cost to pay either.
I'm not familiar with the details of OpenCL and CUDA on this front - do they have some way to guarantee a specific order of operations such that code always has a predictable result on all platforms and nevertheless parallelizes well on a GPU?
adrian_b
Not being able to auto-vectorize is not the fault of the IEEE standard, but the fault of those programming languages which do not have ways to express that the order of some operations is irrelevant, so they may be executed concurrently.
Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.
Because nowadays only a relatively small number of users care about computational applications, this defect has not been corrected in any mainline programming language, though for some programming languages there are extensions that can achieve this effect, e.g. OpenMP for C/C++ and Fortran. CUDA is similar to OpenMP, even if it has a very different syntax.
The IEEE standard for floating-point arithmetic has been one of the most useful standards in all history. The reason is that both hardware designers and naive programmers have always had the incentive to cheat in order to obtain better results in speed benchmarks, i.e. to introduce errors in the results with the hope that this will not matter for users, who will be more impressed by the great benchmark results.
There are always users who need correct results more than anything else and it can be even a matter of life and death. For the very limited in scope uses where correctness does not matter, i.e. mainly graphics and ML/AI, it is better to use dedicated accelerators, GPUs and NPUs, which are designed by prioritizing speed over correctness. For general-purpose CPUs, being not fully-compliant with the IEEE standard is a serious mistake, because in most cases the consequences of such a choice are impossible to predict, especially not by the people without experience in floating-point computation who are the most likely to attempt to bypass the standard.
Regarding CUDA, OpenMP and the like, by definition if some operations are parallelizable, then the order of their execution does not matter. If the order matters, then it is impossible to provide guarantees about the results, on any platform. If the order matters, it is the responsibility of the programmer to enforce it, by synchronization of the parallel threads, wherever necessary.
Whoever wants vectorized code should never rely on programming languages like C/C++ and the like, but they should always use one of the programming language extensions that have been developed for this purpose, e.g. OpenMP, CUDA, OpenCL, where vectorization is not left to chance.
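For example, an OpenMP reduction is one way to state explicitly that the compiler and runtime may regroup the additions (a sketch; compile with something like -fopenmp):

double sum(const double *a, int n) {
    double s = 0.0;
    /* reduction(+:s) declares that the partial sums may be formed in any
       order and in parallel; the result can differ slightly from a strict
       left-to-right sequential sum. */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}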
emn13
If you care about absolute accuracy, I'm skeptical you want floats at all. I'm sure it depends on the use case.
Whether it's the standard's fault or the language's fault for following the standard in terms of preventing auto-vectorization is splitting hairs; the whole point of the standard is to have predictable and usually fairly low-error ways of performing these operations, which only works when the order of operations is defined. That very aim is the problem; to the extent the standard is harmless when ordering guarantees don't exist, you're essentially applying some of those tricky -ffast-math suboptimizations.
But to be clear in any case: there are obviously cases where order of operations is relevant enough and accuracy-altering reorderings are not valid. It's just that those are rare enough that for many of these features I'd much prefer that to be the opt-in behavior, not opt-out. There's absolutely nothing wrong with having a classic IEEE 754 mode and I expect it's an essential feature in some niche corner cases.
However, given the obviously huge application of massively parallel processors and algorithms that accept rounding errors (or sometimes, conversely, overly precise results!), clearly most software is willing to accept rounding errors in general in order to run efficiently on modern chips. It just so happens that none of the computer languages that rely on mapping floats to IEEE 754 floats in a straightforward fashion are any good at that, which seems like a bad trade-off.
There could be multiple types of floats instead; or code-local flags that delineate special sections that need precise ordering; or perhaps even expressions that clarify how much error the user is willing to accept and then just let the compiler do some but not all transformations; and perhaps even other solutions.
alfiedotwtf
> Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.
We have memory ordering functions to let compilers know the atomic-operation preferences of the programmer… couldn't we do the same for math, and in general for a set of expressions?
adrian_b
An example of programming language syntax that avoids specifying sequential execution where it is not needed is to specify that a sequence of expressions separated by semicolons must be executed sequentially, but a sequence of expressions separated by commas may be executed in any order or concurrently.
This is just a minor change from the syntax of the most popular programming languages, because they typically already specify that the order of evaluation of the expressions used for the arguments of a function, which are separated by commas, can be arbitrary.
Early in its history, the C language was close to specifying this behavior for its comma operator, but unfortunately its designers changed their minds and made the comma operator behave like a semicolon, in order to be able to use it inside for-statement headers, where the semicolons have a different meaning. A much better solution for C, instead of making both comma and semicolon have the same behavior, would have been to allow a block to appear in any place where an expression is expected, giving it the value of the last expression evaluated in the block.
dzaima
The precise requirements of IEEE-754 may not be important for any given program, but as long as you want your numbers to have any form of well-defined semantics beyond "numbers exist, and here's a list of functions that do Something™ that may or may not be related to their name", any number format that's capable of (approximately) storing both 10^20 and 10^-20 in 64 bits is gonna have those drawbacks.
AFAIK GPU code is basically always written as scalar code acting on each "thing" separately, that's, as a whole, semantically looped over by the hardware, same way as multithreading would (i.e. no order guaranteed at all), so you physically cannot write code that'd need operation reordering to vectorize. You just can't write an equivalent to "for (each element in list) accumulator += element;" (or, well, you can, by writing that and running just one thread of it, but that's gonna be slower than even the non-vectorized CPU equivalent (assuming the driver respects IEEE-754)).
Affric
How does IEEE 754 prevent auto-vectorisation?
dahart
The spec doesn’t prevent auto-vectorization, it only says the language should avoid it when it wants to opt in to producing “reproducible floating-point results” (section 11 of IEEE 754-2019). Vectorizing can be implemented in different ways, so whether a language avoids vectorizing in order to opt in to reproducible results is implementation dependent. It also depends on whether there is an option to not vectorize. If a language only had auto-vectorization, and the vectorization result was deterministic and reproducible, and if the language offered no serial mode, this could adhere to the IEEE spec. But since C++ (for example) offers serial reductions in debug & non-optimized code, and it wants to offer reproducible results, then it has to be careful about vectorizing without the user’s explicit consent.
kzrdude
If you write a loop `for x in array { sum += x }` Then your program is a specification that you want to add the elements in exactly that order, one by one. Vectorization would change the order.
dahart
The bigger problem there is the language not offering a way to signal the author’s intent. If an author doesn’t care about the order of operations in a sum, they will still write the exact same code as the author who does care. This is a failure of the language to be expressive enough, and doesn’t reflect on the IEEE spec. (The spec even does suggest that languages should offer and define these sorts of semantics.) Whether the program is specifying an order of operations is lost when the language offers no way for a coder to distinguish between caring about order and not caring. This is especially difficult since the vast majority of people don’t care and don’t consider their own code to be a specification on order of operations. Worse, most people would even be surprised and/or annoyed if the compiler didn’t do certain simplifications and constant folding, which change the results. The few cases where people do care about order can be extremely important, but they are rare nonetheless.
stingraycharles
Yup, because of the imprecision of floating point, you cannot just assume that "(a + c) + (b + d)" is the same as "a + b + c + d".
It would be pretty ironic if at some point fixed point / bignum implementations end up being faster because of this.
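A tiny C illustration of that point, assuming IEEE-754 doubles:

#include <stdio.h>

int main(void) {
    double a = 1e16, b = 1.0, c = -1e16, d = 1.0;

    double left_to_right = ((a + b) + c) + d;  /* 1e16 + 1 rounds back to 1e16; prints 1 */
    double regrouped     = (a + c) + (b + d);  /* prints 2 */

    printf("%g %g\n", left_to_right, regrouped);
    return 0;
}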
Kubuxu
IIRC reordering additions can cause the result to change which makes auto-vectorisation tricky.
goalieca
Floating point arithmetic is neither commutative nor associative, so you shouldn't.
lo0dot0
While it is technically correct to say this, it also gets the wrong point across, because it leaves out the fact that ordering changes create only a small difference. Other examples where arithmetic is not commutative, e.g. matrix multiplication, can create much larger differences.
layer8
IEEE-754 addition and multiplication are commutative. They aren't distributive, though.
eapriv
Why is it not commutative?
ajross
> I get the feeling that the real problem here are the IEEE specs themselves.
Well, all standards are bad when you really get into them, sure.
But no, the problem here is that floating point code is often sensitive to precision errors. Relying on rigorous adherence to a specification doesn't fix precision errors, but it does guarantee that software behavior in the face of them is deterministic. Which 90%+ of the time is enough to let you ignore the problem as a "tuning" thing.
But no, precision errors are bugs. And the proper treatment for bugs is to fix the bugs and not ignore them via tricks with determinism. But that's hard, as it often involves design decisions and complicated math (consider gimbal lock: "fixing" that requires understanding quaternions or some other orthogonal orientation space, and that's hard!).
So we just deal with it. But IMHO -ffast-math is more good than bad, and projects should absolutely enable it, because the "problems" it discovers are bugs you want to fix anyway.
chuckadams
> (consider gimbal lock: "fixing" that requires understanding quaternions or some other orthogonal orientation space, and that's hard!)
Or just avoiding gimbal lock by other means. We went to the moon using Euler angles, but I don't suppose there's much of a choice when you're using real mechanical gimbals.
ajross
That is the "tuning" solution. And mostly it works by limiting scope of execution ("just don't do that") and if that doesn't work by having some kind of recovery method ("push this button to reset", probably along with "use this backup to recalibrate"). And it... works. But the bug is still a bug. In software we prefer more robust techniques.
FWIW, my memory is that this was exactly what happened with Apollo 13. It lost its gyro calibration after the accident (it did the thing that was the "just don't do that") and they had to do a bunch of iterative contortions to recover it from things like the sun position (because they couldn't see stars out the iced-over windows).
NASA would have strongly preferred IEEE doubles and quaternions, in hindsight.
Sharlin
> -funsafe-math-optimizations
What's wrong with fun, safe math optimizations?!
(:
keybored
Hah! I was just about to comment that I immediately read it as fun-safe, every time I see it.
I guess that happens when I don’t deal with compiler flags daily.
storus
This problem is happening even on Apple MPS with PyTorch in deep learning, where fast math is used by default in many operations, leading to a garbage output. I hit it recently while training an autoregressive image generation model. Here is a discussion by folks that hit it as well:
Sophira
Previously discussed at https://news.ycombinator.com/item?id=29201473 (which the article itself links to at the end).
anthk
In Forth, there's the philosophy of fixed point:
https://www.forth.com/starting-forth/5-fixed-point-arithmeti...
With 32 and 64 bit numbers, you can just scale decimals up. So, Torvalds was right. In dangerous contexts (super-precise medical doses) FP has good reasons to exist, though I am not completely sure.
Also, both Forth and Lisp internally suggest using rationals before floating point numbers. Even toy lisps from https://t3x.org have rationals too. In Scheme, you have both exact->inexact and inexact->exact, which convert rationals to FP and vice versa.
If you have a Linux/BSD distro, you may already have Guile installed as a dependency.
Hence, run it and then:
scheme@(guile-user)> (inexact->exact 2.5)
$2 = 5/2
scheme@(guile-user)> (exact->inexact (/ 5 2))
$3 = 2.5
Thus, in Forth, I have a good set of q{+,-,*,/} operations for rationals (custom coded, literally four lines) and they work great for a good 99% of the cases. As for irrational numbers, NASA used up to 16 decimals, and the old 355/113 can be precise enough for 99.99% of the pieces built on Earth. Maybe not for astronomical distances, but hey...
In Scheme:
scheme@(guile-user)> (exact->inexact (/ 355 113))
$5 = 3.1415929203539825
In Forth, you would just use : pi* 355 113 m*/ ;
with great precision for most of the objects being measured.
AlotOfReading
Floats are fixed point, just done in log space. The main change is that the designers dedicated a few bits to variable exponents, which introduces alignment and normalization steps before/after the operation. If you don't mix exponents, you can essentially treat it as identical to a lower precision fixed point system.
anthk
No, not even close. Scaling integers to mimic decimals in 32 and 64 bits can be much faster. And with 32-bit double numbers you can cover Planck-scale numbers, so with 64-bit double numbers you can handle any field.
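A minimal sketch of that scaling idea in C, keeping everything in 64-bit integer "micro-units" (the names and the 1e6 scale are just illustrative):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

#define SCALE 1000000LL   /* store values as millionths: 2.5 -> 2500000 */

/* Addition and subtraction are exact; multiplication needs one rescale.
   (Watch for overflow if the operands can be large.) */
static int64_t fx_mul(int64_t a, int64_t b) { return a * b / SCALE; }

int main(void) {
    int64_t dose = 2500000;                 /* 2.5 */
    int64_t factor = 400000;                /* 0.4 */
    printf("%" PRId64 "\n", fx_mul(dose, factor));  /* 1000000, i.e. exactly 1.0 */
    return 0;
}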
eqvinox
Those rational numbers fly out the window as soon as your math involves any kind of more complicated trigonometry, or even a square root…
stassats
You can turn them back into rationals, (rational (sqrt 2d0)) => 6369051672525773/4503599627370496
Or write your own operations that compute to the precision you want.
dreamcompiler
If you want high precision trig functions on rationals, nothing's stopping you from writing a Taylor series library for them. Or some other polynomial approximation, or a lookup table, or CORDIC.
anthk
Check CORDIC, please.
https://en.wikipedia.org/wiki/CORDIC
Also, on sqrt functions, even a FP-enabled toy EForth under the Subleq VM (just as a toy, again, but it works) provides some sort of fsqrt functions:
2 f fsqrt f.
1.414 ok
Under PFE Forth, something 'bigger': 40 set-precision ok
2e0 fsqrt f. 1.4142135623730951454746218587388284504414 ok
EForth's FP precision is tiny but good enough for very small microcontrollers.
But it wasn't so far from the exponents the 80's engineers worked with to create properly usable machinery/hardware and even software.
chuckadams
I haven't worked with C in nearly 20 years and even I remember warnings against -ffast-math. It really ought not to exist: it's just a super-flag for things like -funsafe-math-optimizations, and the latter makes it really clear that it's, well, unsafe (or maybe it's actually funsafe!)
teleforce
“Nothing brings fear to my heart more than a floating point number.” - Gerald Jay Sussman
Is there any IEEE standards committee working on FP alternatives, for example Unum and Posit [1], [2]?
[1] Unum & Posit:
[2] The End of Error:
https://www.oreilly.com/library/view/the-end-of/978148223986...
kvemkon
I'm wondering why there are still no announcements of hardware support for such approaches in CPUs.
neepi
HP had proper deterministic decimal arithmetic since the 1970s.
Q6T46nT668w6i3m
Is this sarcasm? If not, the proposed posit standard, IEEE P3109.
pclmulqdq
The current P3109 draft has no posits in it.
teleforce
Great, didn't know that it exists.
cycomanic
I think this article overstates the importance of the problems, even for scientific software. In the scientific code I've written, noise processes are often orders of magnitude larger than what is discussed here, and I believe this applies to many (most?) simulations modelling the real world (i.e. physics, chemistry, ...). At the same time, enabling fast-math has often yielded a very significant (>10%) performance boost.
I find the discussion of -fassociative-math particularly interesting, because I assume that most writers of code that translates a mathematical formula into a simulation will not know which order of operations would be the most accurate, and will simply codify their derivation of the equation to be simulated (which could have the operations in any order). So if this switch changes your results, it probably means that you should have a long hard look at the equations you're simulating and which ordering will give you the most correct results.
That said I appreciate that the considerations might be quite different for libraries and in particular simulations for mathematics.
DavidVoid
It matters for reproducibility between software versions, right?
I work in audio software and we have some comparison tests that compare the audio output of a chain of audio effects with a previous result. If we make some small refactoring of the code and the compiler decides to re-organize the arithmetic operations then we might suddenly get a slightly different output. So of course we disable fast-math.
One thing we do enable though, is flushing denormals to zero. That is predictable behavior and it saves some execution time.
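On x86 with SSE, flushing denormals is typically enabled through the MXCSR control bits; a minimal sketch:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ bit) */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ bit) */

void enable_ftz_daz(void) {
    /* FTZ: results that would be subnormal are flushed to zero.
       DAZ: subnormal inputs are treated as zero.
       Both avoid the slow microcoded path for denormals, e.g. in
       audio processing where signals decay toward zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}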
recursivecaveat
Yeah that is the killer for me. I'm not particularly attached to IEEE semantics. Unfortunately the replacement is that your results can change between any two compiles, for nearly any reason. Even if you think you don't care about tiny precision variances: consider that if you ever score and rank things with an algorithm that involves floats, the resulting order can change.
londons_explore
It would be nice if there was some syntax for "math order matters, this is the order I want it done in".
Then all other math will be fast-math, except where annotated.
hansvm
The article mentioned that gcc and clang have such extensions. Having it in the language is nice though, and that's the approach Zig took.
sfn42
I thought most languages have this? If you simply write a formula, operations are ordered according to the language specification. If you want a different ordering you use parentheses.
Not sure how that interacts with this fast math thing, I don't use C
kstrauser
That’s a different kind of ordering.
Imagine a function like Python's `sum(list)`. In abstract, Python should be able to add those values in any order it wants. Maybe it could spawn a thread so that one thread sums the first half of the list, another sums the second half at the same time, and then you return the sum of those intermediate values. You could imagine a clever `sum()` being many times faster, especially using SIMD instructions or a GPU or something.
But alas, you can’t optimize like that with common IEEE-754 floats and expect to get the same answer out as when using the simple one-at-a-time addition. The result depends on what order you add the numbers together. Order them differently and you very well may get a different answer.
That’s the kind of ordering we’re talking about here.
on_the_train
I worked in CAD, robotics and now semiconductor optics. In every single field, floating point precision down to the very last digits was a huge issue.
AlotOfReading
"precision" is an ambiguous term here. There's reproducibility (getting the same results every time), accuracy (getting as close as possible to same results computed with infinite precision), and the native format precision.
-ffast-math sacrifices both the first and the second for performance. Compilers usually sacrifice the first for the second by default with things like automatic FMA contraction. This isn't a necessary trade-off, it's just easier.
There's very few cases where you actually need accuracy down to the ULP though. No robot can do anything meaningful with femtometer+ precision, for example. Instead you choose a development balance between reproducibility (relatively easy) and accuracy (extremely hard). In robotics, that will usually swing a bit towards reproducibility. CAD would swing more towards accuracy.
cycomanic
Interesting, I stand corrected. In most of the fields I'm aware of, one could easily work in 32 bits without any issues.
I find the robotics example quite surprising in particular. I think the precision of most input sensors is less than 16 bits or so. If your inputs have this much noise on them, how come you need so much precision in your calculations?
spookie
The precision isn't uniform across the range of possible inputs. This means you need a higher bit depth, even though "you aren't really using it", just so you can establish a good base precision that you are sure you are hitting at every range. The phrase "most sensors" is doing a lot of the work here.
datameta
Luckily outside of mission critical systems, like in demoscene coding, I can happily use "44/7" as a 2pi approximation (my beloved)
zinekeller
(2021)
Previous discussion: Beware of fast-math (Nov 12, 2021, https://news.ycombinator.com/item?id=29201473)
Affric
For non-associativity what is the best way to order operations? Is there an optimal order for precision whereby more similar values are added/multiplied first?
EDIT: I am now reading Goldberg 1991
Double edit: Kahan Summation formula. Goldberg is always worth going back to.
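For reference, a minimal C sketch of Kahan (compensated) summation, which carries the lost low-order bits along in a second variable:

double kahan_sum(const double *x, int n) {
    double sum = 0.0;
    double c = 0.0;               /* running compensation for lost low-order bits */
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;      /* apply the previous correction to the next term */
        double t = sum + y;       /* big + small: y's low-order bits are lost here... */
        c = (t - sum) - y;        /* ...and recovered here */
        sum = t;
    }
    return sum;
}

Note that this is exactly the kind of code that -fassociative-math can silently break, since algebraically c is always zero.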
zokier
Herbie can optimize arbitrary floating point expressions for accuracy