C is not suited to SIMD (2019)
116 comments
January 23, 2025
nine_k
Why does FORTRAN, which is older than C, have no trouble with auto-vectorization?
Mostly because of a very different memory model, of course. Arrays are first-class things, not hacks on top of pointers, and can't alias each other.
nitwit005
When I tested this before (some years ago), adding "restrict" to the C pointers resulted in the same behavior, so the aliasing default seems to be the key issue.
AlotOfReading
Your compiler very likely assumes strict aliasing at whatever optimization level you're already using, so restrict usually won't do much. GCC does at O2, Keil (and Clang?) at O1.
kazinator
But restrict says "I promise that these things don't overlap, on penalty of my program going to hell". Is that what's going on in Fortran?
TinkersW
The article seems to conflate C and C++ in its very first sentence, though in reality these are very different languages. In C++ you can use std::array, and you can make your SIMD containers use restrict to make it more like Fortran.
Anyway, auto-vectorization will never compete with intrinsics for performance, so this all seems rather silly. Try expressing pshufb in an auto-vectorizer. Or a bunch of optimal loads/swizzles to outperform gather.
Const-me
I believe it’s the ecosystem. For example, Python is a very high-level language; I'm not sure it even has a memory model. But it has libraries like NumPy which support all these vectorized exponents when processing long arrays of numbers.
goku12
Everything that's heavily vectorized in the Python ecosystem, including numpy, achieves it using optimized backend code written in other languages - Fortran in particular. Python is only a thin veneer over those backends. In fact, you're constantly reminded to offload the control flow as much as possible to those backends for the sake of performance, instead of doing things like looping in Python. If that's enough to consider Python good at vectorization, I can just link high-performance Fortran libraries with C, handle non-vector control flow from there, and call it a day. I guarantee you that this arrangement will be far more performant than what Python can ever achieve. I have to strongly agree with the other commenter's observation that the memory model is the key to vector performance.
And of course Python has a memory model. While that model is not as well understood as C's model, it is the key to Python's success and popularity as a generic programming language and as a numeric/scientific programming language. Python's memory model unlike C's or Fortran's, isn't designed for high performance. It's designed for rich abstractions, high ergonomics and high interoperability with those performant languages. For most people, the processing time lost executing python code is an acceptable tradeoff for the highly expressive control that Python gives them over the scheduling of lower level operations.
vrighter
any Python libraries that offer any sort of real performance are written in C
MisterTea
> Arrays are first-class things, not hacks on top of pointers, and can't alias each other.
"Hacks" feels like the wrong term to use here. Comparing managed objects surrounded by runtime to memory isn't a fair comparison.
nine_k
There's no language-level difference between int[] and int* in C. Indexing is just pointer arithmetic. No support for (non-jagged) multidimensional arrays either.
A proper array would at least know its length. That doesn't require much runtime, and most C code requires libc anyway for basic stuff like malloc().
pjmlp
Fortran is not a managed language, and has a runtime as big as C's.
pjmlp
None of that is ISO C though, and any language can have similar extensions, no need to be stuck with C for that.
Laiho
I've had great success with autovectorization in Rust. Majority of non-branching loops seem to consistently generate great assembly.
Keyframe
isn't DirectXMath C++, not C?
npalli
Yes, it is in C++.
Const-me
Yeah, but if you need C versions of low-level SIMD algorithms from there, copy-paste the implementation and it will probably compile as C code. That’s why I mentioned the MIT license: copy-pasting from GPL libraries may or may not work depending on the project, but the MIT license is copy-paste friendly.
AnimalMuppet
The argument seems to be that C isn't suited to SIMD because some functions (exp, for instance) are not. But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation. That's not a C issue, that's a CPU issue. No language can save you there (unless it uses a software-only function for exp).
Is this article really confused, or did I misunderstand it?
The thing that makes C/C++ a good language for SIMD is how easily it lets you control memory alignment.
woooooo
Not in the article, but I've read that the way C does pointers, a single address with length implicit, makes it hard for compilers to assert that 2 arrays don't overlap/alias and this is an obstacle for generating SIMD instructions.
drysine
That's what the restrict type qualifier[0] is for.
alkh
I thought you could simply add `restrict`[1] to indicate that only one pointer points to a specific object? Shouldn't this help?
aidenn0
You can add that qualifier, but nothing will enforce that objects don't overlap. So you replaced slow correct code with fast incorrect code.
wayoverthecloud
In C++ there is the aligned_alloc specifier. https://en.cppreference.com/w/c/memory/aligned_alloc
Not sure for C
ryanpetrich
aligned_alloc is a C function. It doesn't help the compiler prove that two pointers can't alias. restrict is the keyword that does.
bee_rider
When compiling a library, it is generally impossible for the compiler to know whether or not the pointers will be aliased, right? That decision is made after the library is already compiled.
drwu
The function declaration in the header file of the library can carry the required 'restrict'. It works for C++ callers too, as most C++ compilers also support and check __restrict__ for plain old pointer types.
1718627440
Not really. Pointers from different allocations are guaranteed to never alias. If they do, it is UB.
AnimalMuppet
Right, but the compiler in general has no way to know that pointers are from different allocations - especially several function calls away from the allocations, with some pointer arithmetic in between.
ta3401
> But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation.
The fact that exp is implemented in hardware is not the argument. The argument is that exp is a library function, compiled separately, and thus the compiler cannot inline the function and fuse it with an array-wide loop, to detect later on an opportunity to generate the SIMD instructions.
It is true however that exp is implemented in hardware in the X86 world, and to be fair, perhaps a C compiler only needs to represent that function as an intrinsic instead, to give itself a chance to later replace it either with the function call or some SIMD instruction; but I guess that the standard doesn't provide that possibility?
AlotOfReading
The compiler can (and does) know enough about the stdlib functions to treat them specially. The C standard doesn't require that libm exist as a separately linked library either, that's just an implementation detail from an earlier time that got enshrined in POSIX.
vmchale
> perhaps a C compiler only needs to represent that function as an intrinsic instead, to give itself a chance to later replace it either with the function call or some SIMD instruction
I gesture to this in the blog post:
> In C, one writes a function, and it is exported in an object file. To appreciate why this is special, consider sum :: Num a => [a] -> a in Haskell. This function exists only in the context of GHC.
> ...
> perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
ta3401
I now realize that I somewhat rephrased what was said in the blog. How strange!
adgjlsfhk1
> But it then talks about how exp is implemented in hardware, in a way that forces it to be a non-SIMD calculation.
This is especially surprising given that very few architectures have an `exp` in hardware. It's almost always done in software.
stabbles
I think the interpretation is more conceptual about functions, reusability and composability.
The function exp has an abstract mathematical definition as mapping from numbers to numbers, and you could implement that generically if the language allows for it, but in C you cannot because it's bound to the signature `double exp(double)` and fixed like that at compile time. You cannot use this function in a different context and pass e.g. __m256d.
Basically because C does not have function dispatch, it's ill-suited for generic programming. You cannot write an algorithm on scalar, and now pass arrays instead.
Someone
> You cannot write an algorithm on scalar, and now pass arrays instead.
It is limited to cases where you know all overloads beforehand and limited because of C’s weak type system (can’t have an overload for arrays of int and one for pointers to int, for example), but you can. https://en.cppreference.com/w/c/language/generic:
```c
#include <math.h>
#include <stdio.h>

// Possible implementation of the tgmath.h macro cbrt
#define cbrt(X) _Generic((X), \
    long double: cbrtl,       \
    default: cbrt,            \
    float: cbrtf              \
)(X)

int main(void)
{
    double x = 8.0;
    const float y = 3.375;
    printf("cbrt(8.0) = %f\n", cbrt(x));    // selects the default cbrt
    printf("cbrtf(3.375) = %f\n", cbrt(y)); // converts const float to float,
                                            // then selects cbrtf
}
```
It also, IMO, is a bit ugly, but that fits in nicely with the rest of C, as seen through modern eyes. There also is a drawback for lots of cases: you cannot overload operators. Implementing vector addition as ‘+’ isn’t possible, for example.
Many compilers partially support that as a language extension, though, see for example https://clang.llvm.org/docs/LanguageExtensions.html#vectors-....
uecker
I am not sure what you mean by "weak" type system. C's type system is not weak IMHO and also differentiates between arrays and pointers:
_Generic(typeof(x), int*: 1, int[]: 2)
_Generic with a type argument is a new feature; it is a bit less elegant without it: _Generic((typeof(x)*)0, typeof(int*)*: 1, typeof(int[])*: 2)
uecker
You can definitely implement generic algorithms in C and there are different ways to do this. For example, you create an abstract data type "ring" with addition, multiplication, and multiplication with a scalar, and pass objects to this type and functions for these three operations to a generic exponential function.
leecarraher
I think the article is confused. I write quite a bit of SIMD code for HPC, and C is my go-to for low-level computing because of the low-level memory allocation in C. In fact, that tends to be the lion's share of the work in implementing SIMD operations where memory isn't the bottleneck. If only it were as simple as calling vfmadd132ps on two datatypes.
high_na_euv
Even high-level languages like C# allow you to express such a thing
https://stackoverflow.com/questions/73269625/create-memory-a...
Also StructLayout and FieldOffset
PaulHoule
So are those functional languages. The really cool SIMD code I see people writing now is by people like Lemire using the latest extensions who do clever things like decoding UTF-8 which I think will always take assembly language to express unless you are getting an SMT solver to write it for you.
janwas
Lemire and collaborators often write in C++ intrinsics, or thin platform-specific wrappers on top of them.
I count ~8 different implementations [1], which demonstrates considerable commitment :) Personally, I prefer to write once with portable intrinsics.
https://github.com/simdutf/simdutf/tree/1d5b5cd2b60850954df5...
mananaysiempre
I don’t think that really works beyond 128 bits. AVX/AVX2 with its weird tiered 2×128-bit structure occasionally forces you down some really awkward paths. AVX-512 with its masks is powerful and quite elegant, but the idioms it enables don’t really translate to other ISAs. And even in 128-bit land, the difference between good code for SSE and NEON can be quite significant, all because of a single instruction (PMOVMSKB) the latter lacks[1].
Portable instructions and “scalable” length-agnostic vectors are usually fine for straightforward mathy (type-1[2]) SIMD code. But real SIMD trickery, the one that starts as soon as you take your variable byte shuffle out of the stable, is rarely so kind.
[1] https://branchfree.org/2019/04/01/fitting-my-head-through-th...
[2] https://branchfree.org/2024/06/09/a-draft-taxonomy-of-simd-u...
janwas
Valid points, but I am doing exactly that full-time for the past years :) Including compression and quicksort, which are nontrivial uses of SIMD.
The 2x128 is handled by adopting AVX2 semantics for our InterleaveLower/Upper op, and when we want the extra fixup, there is also InterleaveWholeLower. I agree AVX-512 is awesome, it's my favorite target. Some of its useful ops (compress[store]) are fine to emulate. What idioms did you have in mind that don't translate? Spot on about PMOVMSKB, that's a pain point. We provide several ops to replace it with more relaxed semantics that are easier to emulate on other platforms, for example AllTrue, CountTrue, AllFalse. (I think DanilaK's NEON VSHRN trick is faster than the one from your first link?)
kibwen
> So are those functional languages.
The author is suggesting array languages as the solution, which are a separate category from functional languages.
dagelf
Link?
throwawaymaths
not even an SMT solver, that's not what they are for.
1propionyl
No? SMT solvers underpin the majority of program synthesis research tools.
When you get down to it, you're optimizing (searching) for some program that maximizes/minimizes some objective function with terms for error relative to specification/examples and size of synthesized program, while applying some form of cleverness to pruning the search space.
This is absolutely within the wheelhouse of SMT solvers, and something that they are used for.
SMT doesn't have to be used, but for implementation it enables iterating on the concept more quickly (see also, more broadly: prototyping in Prolog), and in other cases it's simply the most effective tool for the job. So, it tends to get a lot of play.
throwawaymaths
If SMT solvers were all you needed, then for example, LLVM wouldn't need a polyhedral optimizer. And a polyhedral optimizer is a subset of the sorts of things you will need for doing the sort of things you're asking to do automatically.
npalli
vmchale
From highway:
> Does what you expect: Highway is a C++ library with carefully-chosen functions that map well to CPU instructions without extensive compiler transformations. The resulting code is more predictable and robust to code changes/compiler updates than autovectorization.
So C compilers are not a good place to start if one wants to write a compiler for an array language (which naturally expresses SIMD calculations). Which is what I point out in the last paragraph of the blog post:
> To some extent this percolates compilers textbooks. Array languages naturally express SIMD calculations; perhaps there are more fluent methods for compilation (and better data structures for export à la object files).
pjmlp
It is more like, ISO C++ with compiler extensions does ok.
mlochbaum
Several comments seem confused about this point: the article is not about manual SIMD, which of course C is perfectly capable of with intrinsics. It's discussing problems in compiling architecture-independent code to SIMD instructions, which in practice C compilers very often fail to do (so, exp having scalar hardware support may force other arithmetic not to be vectorized). An alternative mentioned is array programming, where operations are run one at a time on all the data; these languages serve as a proof of concept that useful programs can be run in a way that uses SIMD nearly all the time it's applicable, but at the cost of having to write every intermediate result to memory. So the hope is that "more fluent methods for compilation" can generate SIMD code without losing the advantages of scalar compilation.
As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....
mlochbaum
Finding it increasingly funny how many people have come out of the woodwork to "defend" C by offering this or that method of explicit vectorization. What an indictment! C programmers can no longer even conceive of a language that works from a description of what should be done and not which instructions should be used!
mlochbaum
Got the introduction written at https://mlochbaum.github.io/BQN/implementation/compile/fusio....
brundolf
Thank you. The title is confusing
pjmlp
Compiler specific C is capable of, ISO C has no SIMD support of any form.
commandlinefan
This seems more like a generalization of Richard Stevens's observation:
"Modularity is the enemy of performance"
If you want optimal performance, you have to collapse the layers. Look at Deepseek, for example.
camel-cdr
This depends entirely on compiler support. Intel's ICX compiler can easily vectorize a sigmoid loop by calling SVML's vectorized expf function: https://godbolt.org/z/no6zhYGK6
If you implement a scalar expf in a vectorizer friendly way, and it's visible to the compiler, then it could also be vectorized: https://godbolt.org/z/zxTn8hbEe
dzaima
gcc and clang are also capable of it, given certain compiler flags: https://godbolt.org/z/z766hc64n
camel-cdr
Thanks, I didn't know about this. Interesting that it seems to require fast-math.
ComputerGuru
That means it’ll never be used. ffast-math is verboten in most serious codebases.
dzaima
Seems "-fno-math-errno" is enough for clang. gcc needs a whole "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only".
bee_rider
I’m not sure what “suited to SIMD” means exactly in this context. I mean, it is clearly possible for a compiler to apply some SIMD optimizations. But the program is essentially expressed as a sequential thing, and then the compiler discovers the SIMD potential. Of course, we write programs that we hope will make it easy to discover that potential. But it can be difficult to reason about how a compiler is going to optimize, for anything other than a simple loop.
camel-cdr
Suited for SIMD means you write the scalar equivalent of what you'd do on a single element in a SIMD implementation.
E.g. you avoid lookup tables when you can, or only use smaller ones you know fit in one or two SIMD registers. GCC and Clang can't vectorize it as is, but they do if you remove the branches that handle infinity and over/under-flow.
In the godbolt link I copied the musl expf implementation, and icx was able to vectorize it even though it uses a LUT too large for SIMD registers.
#pragma omp simd and equivalents will encourage the compiler to vectorize a specific loop and produce a warning if a loop isn't vectorized.
bee_rider
I shouldn’t have started my comment with the sort of implied question or note of confusion. Sorry, that was unclear communication.
I agree that it is possible to write some C programs that some compilers will be able to discover the parallel potential of. But it isn’t very ergonomic or dependable. So, I think this is not a strong counter-argument to the theory of the blog post. It is possible to write SIMD friendly C, but often it is easier for the programmer to fall back to intrinsics to express their intent.
sifar
It means auto-vectorization. Write scalar code that can be automatically vectorized by the compiler by using SIMD instructions
notorandit
C was invented when CPUs, those few available, were just single-core and single-threaded.
Wanna SMP? Use multi-threading libraries. Wanna SIMD/MIMD? Use (inline) assembler functions. Or design your own language.
dragontamer
Fortran is older, but its array manipulation maps naturally to SIMD.
LiamPowell
That's missing the point of the article, but it's also not true, at least not at the time of C89. This can be easily verified by a Google search. As an aside, Ada also had multi-threading support before C89 was released, but this article is about SIMD, not multi-threading.
johnnyjeans
C89 is 19 years into C's life
pjmlp
You had plenty of places outside Bell Labs doing concurrency stuff, Concurrent Pascal and Solo OS is one example among many.
LiamPowell
Yes, but before standardisation there were many implementations all doing different things based on a book with no particularly formal specifications, so picking 1989 makes more sense than 1970 in my opinion. Multi-threading also existed in 1970 but wasn't as widespread.
svilen_dobrev
reminds me of this old article.. ~2009, Sun/MIT - "The Future is Parallel" .. and "sequential decomposition and usual math are wrong horse.."
https://ocw.mit.edu/courses/6-945-adventures-in-advanced-sym...
vkaku
I also think that vectorizers and compilers can detect parallel memory adds/moves/subs; even so, many people do not take the time to provide adequate hints to the compiler about it.
Some people have vectorized successfully with C, even with all the hacks/pointers/union/opaque business. It requires careful programming, for sure. The ffmpeg cases are super good examples of how compiler misses happen, and how to optimize for full throughput in those cases. Worth a look for all compiler engineers.
exitcode0000
Of course C easily allows you to directly write SIMD routines via intrinsic instructions or inline assembly:
```ada
generic
type T is private;
Aligned : Bool := True;
function Inverse_Sqrt_T (V : T) return T;
function Inverse_Sqrt_T (V : T) return T is
Result : aliased T;
THREE : constant Real := 3.0;
NEGATIVE_HALF : constant Real := -0.5;
VMOVPS : constant String := (if Aligned then "vmovaps" else "vmovups");
begin
Asm (Clobber => "xmm0, xmm1, xmm2, xmm3, memory",
Inputs => (Ptr'Asm_Input ("r", Result'Address),
Ptr'Asm_Input ("r", V'Address),
Ptr'Asm_Input ("r", THREE'Address),
Ptr'Asm_Input ("r", NEGATIVE_HALF'Address)),
Template => VMOVPS & " (%1), %%xmm0 " & E & -- xmm0 ← V
" vrsqrtps %%xmm0, %%xmm1 " & E & -- xmm1 ← Reciprocal sqrt of xmm0
" vmulps %%xmm1, %%xmm1, %%xmm2 " & E & -- xmm2 ← xmm1 \* xmm1
" vbroadcastss (%2), %%xmm3 " & E & -- xmm3 ← NEGATIVE_HALF
" vfmsub231ps %%xmm2, %%xmm0, %%xmm3 " & E & -- xmm3 ← (V - xmm2) \* NEGATIVE_HALF
" vbroadcastss (%3), %%xmm0 " & E & -- xmm0 ← THREE
" vmulps %%xmm0, %%xmm1, %%xmm0 " & E & -- xmm0 ← THREE \* xmm1
" vmulps %%xmm3, %%xmm0, %%xmm0 " & E & -- xmm0 ← xmm3 \* xmm0
VMOVPS & " %%xmm0, (%0) "); -- Result ← xmm0
return Result;
end;
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_2D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_3D, Aligned => False);
function Inverse_Sqrt is new Inverse_Sqrt_T (Vector_4D);
```

```c
vector_3d vector_inverse_sqrt(const vector_3d* v) {
...
vector_4d vector_inverse_sqrt(const vector_4d* v) {
vector_4d out;
static const float THREE = 3.0f; // 0x40400000
static const float NEGATIVE_HALF = -0.5f; // 0xbf000000
__asm__ (
// Load the input vector into xmm0
"vmovaps (%1), %%xmm0\n\t"
"vrsqrtps %%xmm0, %%xmm1\n\t"
"vmulps %%xmm1, %%xmm1, %%xmm2\n\t"
"vbroadcastss (%2), %%xmm3\n\t"
"vfmsub231ps %%xmm2, %%xmm0, %%xmm3\n\t"
"vbroadcastss (%3), %%xmm0\n\t"
"vmulps %%xmm0, %%xmm1, %%xmm0\n\t"
"vmulps %%xmm3, %%xmm0, %%xmm0\n\t"
"vmovups %%xmm0, (%0)\n\t" // Output operand
:
: "r" (&out), "r" (v), "r" (&THREE), "r" (&NEGATIVE_HALF) // Input operands
: "xmm0", "xmm1", "xmm2", "memory" // Clobbered registers
);
return out;
}
```
dvorack101
Not my code, but illustrates how SIMD works.
https://github.com/dezashibi-c/a-simd_in_c
Copyright goes to Navid Dezashibi.
Const-me
I would rather conclude that automatic vectorizers are still less than ideal, despite SIMD instructions having been widely available in commodity processors for 25 years now.
The language is actually great with SIMD, you just have to do it yourself with intrinsics, or use libraries. BTW, here’s a library which implements 4-wide vectorized exponent functions for FP32 precision on top of SSE, AVX and NEON SIMD intrinsics (MIT license): https://github.com/microsoft/DirectXMath/blob/oct2024/Inc/Di...