
Towards fearless SIMD, 7 years later

173 comments · March 29, 2025

ashvardanian

I’ve said it before and I’ll say it again: Rust feels like a Python developer’s idea of a high-performance computing language. It’s a great language for many kinds of applications — just not when you need to squeeze out every bit of performance from advanced hardware.

Even before getting into SIMD, try using Rust for concurrent, succinct, or external-memory data structures. It quickly becomes clear where the friction is.

Cargo is fantastic — clean, ergonomic, and a joy compared to many toolchains. But it’s much easier to keep things simple when you don’t have to support dozens of AVX-512 variants, AMX, SME, different CUDA generations, ROCm, or any of the other modern hardware capabilities.

Standardising SIMD in the standard library — in Rust or C++ — has always been a questionable idea. Most of these APIs cater to operations that compilers already auto-vectorize reasonably well, and they barely touch the recent capabilities of SIMD. Just consider how hard it is to build any meaningful abstraction over the predicate/register models across AVX-512, SVE, and RVV.

RVV aside, this should illustrate the point: https://www.modular.com/blog/understanding-simd-infinite-com...

1932812267

I've written a fair bit of SIMD code in Rust, and it definitely had lots of sore spots.

The main advantage was that, because Rust doesn't use TBAA, it's completely legal (and safe, if you use bytemuck) to freely cast pointers and values around. TBAA in C++ makes it much easier to hit undefined behavior.
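
For illustration, a minimal sketch of that kind of reinterpretation, assuming the bytemuck crate (the function name is made up):

    fn sum_as_u32(bytes: &[u8]) -> u32 {
        // cast_slice checks alignment and length at runtime and panics on
        // mismatch, so the reinterpretation itself is safe (no TBAA in Rust).
        let words: &[u32] = bytemuck::cast_slice(bytes);
        words.iter().copied().sum()
    }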

But also, because of various miscompilations, Rust refuses (or at least refused) to pass SIMD arguments in registers, so every non-inlined function call passes arguments via the stack. There were also miscompilations if you enabled a target_feature for just one function, so we ended up just passing `-C target-cpu=...` globally, and if we wanted to support a different microarchitecture, we just recompiled the whole program. On top of that, there's no good way to check to see what microarchitecture you're compiling for, so we had to resort to specifying the target cpu in multiple places, with comments reminding us to keep the places in sync.
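
For context, a sketch of the per-function `target_feature` pattern being referred to, assuming an x86_64 target; the function names are illustrative:

    use std::arch::is_x86_feature_detected;

    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(v: &[f32]) -> f32 {
        v.iter().sum() // body compiled with AVX2 enabled; LLVM may vectorize it
    }

    fn sum(v: &[f32]) -> f32 {
        if is_x86_feature_detected!("avx2") {
            // Safe: the running CPU was just verified to support AVX2.
            unsafe { sum_avx2(v) }
        } else {
            v.iter().sum()
        }
    }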

dzaima

I don't think Rust is particularly problematic here. As long as you don't want to do funky things like use immutable argument memory as temporary scratch space (with you restoring the values afterwards of course), all it means is some `unsafe`ing at worst, compared to C/C++. And there are some safe abstractions you can make over loads/stores (everything else being safe, even if not yet marked as such).

Do agree that a standard SIMD type is rather pointless, if not immediately, then in like 5 years. (and, seemingly, both Rust and C++ are like over 10 years behind on SIMD, so they're already out-of-date)

Maybe somewhat useful if you just want the simple ~8x speedup, and not to squeeze out the last 1.4x or whatever, but autovectorization should be capable of covering a significant amount of that.

ashvardanian

… except for byte-level processing, variable-length codecs, or mixed-precision numerics. That never works with autovectorization and can’t be solved with general-purpose SIMD wrappers. For me the solution was to implement those manually, and even at a scale of just 2 libraries I’ve ended up with somewhat different project layouts & dispatch mechanisms: https://github.com/ashvardanian/SimSIMD , https://github.com/ashvardanian/StringZilla

One big family not covered there is sparse data structures and related algorithms. I’ve only started integrating scatter/gather in AVX-512 and SVE, and on synthetic benchmarks both look promising: https://github.com/ashvardanian/less_slow.cpp/releases/tag/v...

Those should probably unlock a much wider set of applications for SIMD, but designing libraries for those may benefit from yet another project structure.

burntsushi

You started with this... take:

> Rust feels like a Python developer’s idea of a high-performance computing language. It’s a great language for many kinds of applications — just not when you need to squeeze out every bit of performance from advanced hardware.

And went on to say that Rust in particular is problematic for:

> byte-level processing

It's particularly odd for you to say this given that the memchr Rust crate is just as fast as stringzilla for substring search. And is generally faster in cases where the needle is invariant, because stringzilla doesn't have APIs for amortizing searcher construction.
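
For concreteness, a rough sketch of the searcher amortization being described, using memchr's memmem API (the function itself is illustrative):

    use memchr::memmem;

    fn count_matches(needle: &str, haystacks: &[&[u8]]) -> usize {
        // Build the searcher once; reuse it across every haystack.
        let finder = memmem::Finder::new(needle);
        haystacks.iter().filter(|h| finder.find(h).is_some()).count()
    }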

We've had a discussion about this before where I provided receipts[1] and we have not had a meeting of the minds on this point. The thing I'm trying to achieve here is to point out that your claims are contested and there is evidence that you're wrong. And so I'd caution readers to also in turn question your higher level claims about Rust being a "Python developer's idea of a high-performance computing language."

[1]: https://old.reddit.com/r/rust/comments/1ayngf6/memchr_vs_str...

dzaima

Ah yeah, gather/scatter are indeed a rather problematic thing for autovectorization. That said, with no-alias info (which Rust has a lot of) it's possible: https://rust.godbolt.org/z/zTfo9nxhd.

Unfortunately it doesn't get autovectorized without the unsafes, but theoretically it should be possible-ish for bounds checking to be autovectorized (the most problematic aspect being that it might be anywhere from annoying to impossible to ensure that, when there are multiple panic/UB sources, the proper one happens first).
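
For illustration, the kind of indexed loop in question might look like this (a hedged sketch, not the godbolt example itself); the remaining bounds check on `src` is exactly what the unsafe removes:

    fn gather_add(dst: &mut [f32], src: &[f32], idx: &[u32]) {
        assert_eq!(dst.len(), idx.len());
        for (d, &i) in dst.iter_mut().zip(idx) {
            // Indexing still carries a bounds check on `src`; removing it
            // (e.g. via get_unchecked in an unsafe block) is what currently
            // lets the gather get autovectorized.
            *d += src[i as usize];
        }
    }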

I'd imagine in any non-trivial situation you'd want a custom layer of abstractions over whatever the language provides for all languages. For that a portable-simd thing is actually a rather good base, on which you could add custom arch-specific abstractions/ops as desired.

Not sure what's problematic with mixed-precision (I know SVE is rather weird for mixed-width elements, but that's about it?), though I primarily don't care about float stuff generally. Also no clue what's problematic with byte-level stuff.

Indeed there are still a bunch of things that you want proper manual SIMD for (hell the SIMDful project I work on has an entire DSL for doing nice SIMD), but autovectorization still covers a good amount.

janwas

> except for byte-level processing, variable-length codecs, or mixed-precision numerics. That never works with autovectorization and can’t be solved with general-purpose SIMD wrappers.

Counterexamples: Chromium's byte-level HTML scanning, several var-len bit packing codecs, and Gemma.cpp's matmul is mixed-precision (fp8->bf16->fp32->fp64). All written with the Highway general-purpose SIMD wrapper. Please revise your post or expand upon the structure/dispatch concern.

scottlamb

> Do agree that a standard SIMD type is rather pointless, if not immediately, then in like 5 years. (and, seemingly, both Rust and C++ are like over 10 years behind on SIMD, so they're already out-of-date)

What language would you consider to have cutting-edge SIMD support?

I've dipped my toes into SIMD with Rust, [1] on stable with platform-specific intrinsics (SSE2, AVX2, NEON). I would have liked to use stable `std::simd`. I learned that (particularly on AVX2) getting things into the right lanes efficiently is a pain. I would have liked to just use `simd_swizzle!` for that part, and mix that with intrinsics calls. My approach of writing a small C++ or unstable Rust program that does the swizzling and then copying the intrinsics operations it chose into my program's "source" code worked, but I prefer to not have a manual copy'n'paste step between compilation and assembly.

If there's something much better out there in another language, well, I'd be very interested to see it.

[1] I wrote this: https://github.com/infiniteathlete/pixelfmt/blob/main/docs/s...

dzaima

Don't think any language has standardized SIMD that's particularly nice; Highway is probably a quite nice library on C++, though I haven't used it enough to get comfortable.

The thing I use for my projects is Singeli[1], a DSL specifically made for SIMD stuff (though it's capable of generally sanely doing abstractions over types/operations/loops; it's just a fancy code generator). Obligatory disclaimer that I'm one of the two people working on its design. It's far from a nice experience starting from nothing, but it's pretty nice for what I do.

CBQN's the main place it's used, can click around its source: https://github.com/dzaima/CBQN/tree/develop/src/singeli/src

Its goal isn't necessarily to unify architectures, but rather make it as easy as possible to make abstractions that do; as such its built-in includes for x86 don't have arbitrary shuffling, but do provide a sane interface over the cases that are supported in a single instruction (not including constant creation/loading), and those can run on NEON unchanged (assuming they're ran on 128-bit vectors, of course, as NEON doesn't support larger ones); and, with Singeli just generating C/C++ currently, you can just map in __builtin_shufflevector if desired. e.g. here's your AVX2 `pre`:

    include 'skin/c' # defines infix a+b & a*b etc to run __add/__mul/... (yes, those aren't here by default, and you can define custom infix/prefix ops)
    include 'arch/c' # defines __add & __mul to do C ops
    include 'arch/iintrinsic/basic' # not necessary for a shuffle, but provides basic x86 arith ops
    include 'arch/iintrinsic/select' # x86 shuffles; there's similar 'arch/neon_intrin/basic' & 'arch/neon_intrin/select' for NEON

    fn pre(inp: [32]i8, out: *[32]i8) : void = {
      store{out, 0, vec_shuffle{16, inp, merge{ # 16 specifies to repeat per 16-elt lane
        range{8}*2+1,            # lower half: 8 Y components; compile-time index calculations
        range{4}*4, range{4}*4+2 # upper half: (4 * U), (4 * V).
      }}}
    }
As a more fancy thing, I've got this working (via bodging together the definitions in CBQN with some sugar to make this pretty; not including all that boilerplate here), compilable to SSE2/AVX2/NEON producing a 4x unrolled core loop, plus tail handling (via reading past the end and doing a load-blend-store if necessary because that's what CBQN's fine with; could easily define a fancy_loop such that it does a scalar tail though). (also can be compiled to RVV via currently-unpublished mappings; no need to unroll for RVV; can choose to do either a stripmined loop or one with a separate tail):

    fn sigmoid{E}(r:*E, x:*E, n:ux) : void = {
      def V = arch_preferred_vector{E}
      @fancy_loop{V,4}(r in tup{'dst',r}, x, M in 'mask' over n) {
        # this loop body is generated 3 times for x86 & ARM - with x being a 4-elt tuple (core unrolled loop); a 1-elt tuple and no masking; a 1-elt tuple and masking
        if (any_hom{M, ...(x!=x)}) {
          emit{void, 'abort'}
        }
        r{x / __sqrt{1 + x*x}}
      }
      # were it not for a bug in tuple loop var mutation in Singeli having undesired pervasion, this would be possible:
      # @fancy_loop{V,4}(r, x, M in 'mask' over n) {
      #   if (...) ...
      #   r = x / __sqrt{1 + x*x}
      # }
    }
    export{'sigmoid', sigmoid{f32}}
(e.g. generated C for AVX2: https://godbolt.org/z/KTeGsazKP)

(I'm not actually particularly expecting interest in Singeli; I just like writing stuff :) )

[1]: https://github.com/mlochbaum/Singeli

jandrewrogers

One of the things I find interesting about SIMD is that a lot of behavior that is “undefined” for scalar types in C-derived languages is explicitly fully defined when you use SIMD intrinsics with the same integral types. UB exists to hide the fact that major CPU architectures give different results for basic ALU operations in some cases. SIMD makes no such pretense of abstraction. If I am using AVX-512 I explicitly get the full Intel architecture experience, the implementation details are not hidden behind UB. Same with ARM, etc.

For example, shift overflows are masked on x86, zero-filled on ARM, and undefined in C/C++. In SIMD-land, none of this is hidden and so you design your code to leverage the reality that those instructions behave differently, whereas in C/C++ only the behavior they have in common is “defined”.

The vector ISAs are sufficiently different from each other (and normal CPUs) that it is like trying to build a compiler that can automagically produce optimized code for both CPUs and GPUs from the same source tree. I am not optimistic that this will happen anytime soon. AVX-512 essentially started life as a GPU ISA, which probably explains the interesting fact that a modern x86 CPU core has more AVX-512 registers than x86 registers.

grandempire

Of course. Many things are UB because dictating a policy for all machines doesn’t make sense. Whereas AVX is a specification for a specific hardware capability.

pcwalton

> Even before getting into SIMD, try using Rust for concurrent, succinct, or external-memory data structures. It quickly becomes clear where the friction is.

It's the exact opposite for me: I use concurrent data structures more often in Rust than I do in C++ because I don't have to worry about dumb data race bugs. If one of my Bevy systems is slow, I slap par_iter() on the query and if it compiles it probably works, or at least fails for a not-stupid reason.
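
For readers unfamiliar with the pattern, roughly this (shown here with plain rayon rather than Bevy's own Query::par_iter, as a hedged sketch):

    use rayon::prelude::*;

    fn lengths(positions: &[(f32, f32, f32)]) -> Vec<f32> {
        positions
            .par_iter() // work-stealing parallel iterator, no manual threading
            .map(|&(x, y, z)| (x * x + y * y + z * z).sqrt())
            .collect()
    }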

pclmulqdq

Concurrent data structures are rarely faster than "lock + non-concurrent data structure" and if you're putting constructs like par_iter() in a lot of places, there's a good chance you would be better off with the "dumb" pattern than the concurrent data structure.
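
To spell out the "dumb" pattern (a minimal sketch; whether it beats a concurrent map depends entirely on contention):

    use std::collections::HashMap;
    use std::sync::Mutex;

    fn record(counts: &Mutex<HashMap<String, u64>>, key: &str) {
        // One coarse lock around a plain HashMap.
        let mut map = counts.lock().unwrap();
        *map.entry(key.to_owned()).or_insert(0) += 1;
    }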

The same goes with Arc - if you're using it a lot there's a good chance that code with a lot of Arcs is slower than equivalent GC-ed code.

1932812267

While it's true that par_iter() uses a concurrent data structure under the hood, it's specifically designed to use work-stealing to avoid needing threads to communicate.

Why would putting a lock over a global workqueue be faster than per-thread workqueues that don't require inter-thread communication (except in the case where work-stealing is required)?

SleepyMyroslav

What do you think about 'par_iter' having to wait out work imbalance before returning execution? With a 'when_all'-like primitive one can continue execution on any thread without losing one to waiting. P.S. As someone who doesn't have a Rust job, I would like to see an example of how Rust deals with task-based systems, if you have a public one at hand, of course.

pornel

The rayon library uses work stealing for this. Its parallel iterators offer some control of splitting strategies and granularity, so you can tune the trade-off between full utilization and cost of moving things between threads.

Additionally, in Bevy, independent queries (systems) are executed in parallel, so there's always something else to do, except your one worst loop.

pclmulqdq

Rust is easy to understand as "a language by browser writers for writing browsers." That statement alone gives you most of Rust's design choices:

* Safety over everything else

* Very good single threaded performance

* Javascript-like syntax and a Javascript-like package manager

I have been working on some software in Rust recently that needs bit and byte manipulation, and we have "unsafe" everywhere and hugely complicated spaghetti compared to the equivalent code in C.

pcwalton

> I have been working on some software in Rust recently that needs bit and byte manipulation, and we have "unsafe" everywhere and hugely complicated spaghetti compared to the equivalent code in C.

I'm curious what makes this so different from my experience. I rarely ever have to write "unsafe", and I'm writing quite low-level engine code that certainly uses bit manipulation. In fact, crates like bitflags and fixedbitset make it so easy that I tend to get dinged in code reviews for using bit flags when structs of booleans would be simpler :)
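
For reference, a minimal sketch of the bitflags pattern mentioned (bitflags 2.x macro syntax; the flag names are made up):

    use bitflags::bitflags;

    bitflags! {
        #[derive(Clone, Copy)]
        pub struct RenderFlags: u32 {
            const SHADOWS   = 0b001;
            const BLOOM     = 0b010;
            const WIREFRAME = 0b100;
        }
    }

    fn wants_shadows(flags: RenderFlags) -> bool {
        flags.contains(RenderFlags::SHADOWS)
    }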

eru

> I'm curious what makes this so different from my experience. I rarely ever have to write "unsafe", and I'm writing quite low-level engine code that certainly uses bit manipulation.

Perhaps your usecase is similar enough to eg JavaScript engines? Because that's a usecase that browser writers would at least have in mind?

motorest

> * Javascript-like syntax and a Javascript-like package manager

I think this is not serious criticism. The "JavaScript-like package manager" reference actually refers to fixing a major problem plaguing the developer experience of legacy programming languages such as C or C++. Java has one, .NET has one, every single mainstream programming language has one. Except C or C++.

Rust might be riddled with "the emperor has no clothes" aspects, but having a package manager is not it.

pclmulqdq

That's not a negative aspect of Rust, and I'm not picking out a list of things I dislike about Rust. Making a statement that isn't unequivocally positive does not equal criticizing something.

Cargo is, however, a similarity with JS. On the whole a good one. Also, cargo works much more like npm than maven, for example.

pjmlp

Most likely because many keep forgetting that bit and byte manipulation in C is a mix of implementation-defined behavior and UB, depending on how it is coded.

motorest

What's the problem of using toolchain-specific features instead of behavior defined by the standard? Isn't this the bread and butter of embedded development and the reason why some behavior is purposely left undefined in the standard?

pclmulqdq

Exactly this. Most of the art of it is avoiding UB and sticking to implementation-defined behavior.


1932812267

I love your username, btw :)

lifthrasiir

> Just consider how hard it is to build any meaningful abstraction over the predicate/register models across AVX-512, SVE, and RVV.

Note that Highway mentioned in the post does take care of this, which is no easy feat but also a proof that it is doable.

creata

> Rust feels like a Python developer’s idea of a high-performance computing language.

I might be wrong, but I think it sounds more like Rust doesn't move as far away from the C or C++ way of doing things as you want it to. At the very least, Rust is no worse than C or C++ at any of the things you mentioned.

ashvardanian

In my (biased) experience, Rust is much harder to use for advanced projects than C and C++. On the bright side, it’s also harder to misuse :)

vlovich123

I've found it much easier to use, from writing a web service to writing a high-performance DB that outperforms RocksDB, and clearly people are using it for things like writing operating systems as well as game engines. I'm not sure what in your mind falls under "advanced projects", but I suspect it's something like number crunching (although you link StringZilla, so not sure).

I'm still not seeing any description of specific challenges you feel are harder in Rust than in C/C++. In my mind Rust is completely equivalent in being able to accomplish the same tasks.

saagarjha

I don’t think Rust is worse at SIMD code, though.

knorker

In my experience Rust is much easier for advanced projects. Once you get into high performance code, C++ takes much more time, for two reasons.

1. With C++ you have to think a lot. Like, a lot. To convince yourself that what you are doing is safe, and that the lifetimes and races of various objects are safe. Rust takes a huge load off, by failing to compile your mistakes. You know that for the code that doesn't use "unsafe", if it compiled, then you don't even have to think about races or lifetimes.

2. C++ has so many places where copies sneak in. Tracking down needless copies in C++, for object types where it's not as simple as "just disable copy construct & copy assign", can be tricky and is extremely brittle to future changes of code. And not just for the CPU cost of copies, but RAM costs too.

And I say that as someone who's been coding C++ on a daily basis since the 1990s.

the__alchemist

I guess it comes down to application. If you don't attempt to find the most general solution, you can dodge those pitfalls. Case in point, abstracting over AVX-512, SVE, and RVV may be tough, but picking one is fine (on nightly only for now), and with the right abstractions it can be almost as straightforward as using normal scalar values. I don't have a solution on the CUDA variants either; I have been hard-coding that as well... (Cudarc lib with CUDA-version feature gates and GPU-series-specific code). Haven't hit a brick wall yet, but might... or might not.

dzaima

Seems rustc nightly does successfully vectorize the first sigmoid example: https://rust.godbolt.org/z/e1WYexqWY

Also there's progress on making safe intrinsics safe: https://github.com/rust-lang/stdarch/pull/1714

the__alchemist

Very interesting! I posted a vector and quaternion lib here a few weeks ago, and got great feedback on the state of SIMD on these things. I've since gone on a deep dive and implemented wrapper types in a similar way to this library. Used macros to keep repetition down. Split into three main sections:

  - Floating point primitives. Like this lib. Basically, copied `core::simd`'s API. Will delete this part once core::simd is stable. `f32x8`, `f64x4` types etc, with standard operator overloads, and utility methods like `splat`, `to_array` etc.

  - Vec and Quaternion analogs. Same idea, similar API. Vec3x8, Quaternionx8 etc.

  - Code to convert slices of floating point values, or non-SIMD vectors and quaternions to SIMD ones, including (partial) handling of accessing valid lanes in the last chunk.
I've incorporated these `x8` types into a WIP molecular dynamics application; relatively painless after setting up the infra. Would love to try `Vec3x16` etc, but 512-bit types aren't stable yet. But from Github activity on Rust, it sounds like this is right around the corner!

Of note, as others pointed out in the thread I mentioned, the other vector libs use the AoS approach, where a single f32x4 value is used to represent a Vec3. With this SoA approach, a `Vec3x8` instead performs operations on 8 Vec3s at once.
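
A hypothetical sketch of that SoA layout (shown with nightly std::simd types for brevity; the actual lib wraps raw intrinsics, and these names are illustrative):

    #![feature(portable_simd)]
    use std::simd::f32x8;

    #[derive(Clone, Copy)]
    struct Vec3x8 {
        x: f32x8,
        y: f32x8,
        z: f32x8,
    }

    impl std::ops::Add for Vec3x8 {
        type Output = Self;
        fn add(self, rhs: Self) -> Self {
            // Eight Vec3 additions done in three SIMD adds.
            Self { x: self.x + rhs.x, y: self.y + rhs.y, z: self.z + rhs.z }
        }
    }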

The article had interesting and surprising points on AVX-512 (needed for f32x16, Vec3x16 etc). Not sure what the implications of exposing this in a public library are, i.e. it might be a trap if the user isn't on one of the AMD Zen CPUs mentioned.

From a few examples, I seem to get 2-4x speedup from using the x8 intrinsics, over scalar (non-SIMD) operations.

camel-cdr

Why do the x4/x8 types seem to be the default in Rust?

A portable SIMD feature should encourage portable SIMD and not a specific vector register size.

janwas

Amen! I really do not understand this. It has been 7 years since SVE was introduced.

Writing an application in terms of a specific lane count loses performance portability - either it's too many, or too few, for the particular CPU. And it also enables/encourages antipatterns like putting RGB in Vec4.

the__alchemist

This sounds like a great idea. I went with this approach because I'm new to SIMD, so I aped the most promising API (core::simd), extending it naturally.

I need to think through the consequences. It might involve feature gates, and/or an enum. So, for example, instead of:

  pub struct f32x8(__m256);
It might be this internally, with some method to auto-choose variant based on system capability?:

  pub enum f32_simd {
    X8(__m256),
    X16(__m512), 
  }
etc. Thoughts?

camel-cdr

I'm not proficient in Rust, but API wise I'd conceptually define types like f32xn or f32s, which have the number of elements that fit into a vector register for your target architecture, so 4 for NEON/SSE, 8 for AVX and 16 for AVX512.
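
A rough sketch of that idea in Rust, picking the alias at compile time from target features (the type name and cfg split are illustrative, and nightly std::simd is assumed):

    #![feature(portable_simd)]

    #[cfg(target_feature = "avx512f")]
    pub type F32s = std::simd::f32x16;

    #[cfg(all(target_feature = "avx2", not(target_feature = "avx512f")))]
    pub type F32s = std::simd::f32x8;

    #[cfg(not(any(target_feature = "avx2", target_feature = "avx512f")))]
    pub type F32s = std::simd::f32x4; // SSE2 / NEON baseline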

I can recommend looking at the Highway library: https://github.com/google/highway

isusmelj

I’ve been playing around with SIMD since uni lectures about 10 years ago. Back then I started with OpenMP, then moved to x86 intrinsics with AVX. Lately I’ve been exploring portable SIMD for a side project where I’m (re)writing a Numpy-like library in Rust, mostly sticking to the standard library. Portable SIMD has been super helpful so far.

I’m on an M-series MacBook now but still want to target x86 as well, and without portable SIMD that would’ve been a headache.

If anyone’s curious, the project is here: https://github.com/IgorSusmelj/rustynum. It's just a learning exercise for learning Rust, but I’m having a lot of fun with it.
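
As a flavor of what portable SIMD buys here, a minimal dot-product sketch on nightly std::simd (illustrative, not taken from rustynum):

    #![feature(portable_simd)]
    use std::simd::{f32x8, num::SimdFloat};

    fn dot(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len());
        let mut acc = f32x8::splat(0.0);
        let chunks = a.len() / 8;
        for i in 0..chunks {
            let va = f32x8::from_slice(&a[i * 8..]);
            let vb = f32x8::from_slice(&b[i * 8..]);
            acc += va * vb;
        }
        let mut sum = acc.reduce_sum(); // horizontal add of the 8 lanes
        for i in chunks * 8..a.len() {
            sum += a[i] * b[i]; // scalar tail for leftover elements
        }
        sum
    }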

IshKebab

A problem for RISC-V is going to be that there's currently no way for user code to detect the presence of RVV. I have no idea how you can do multiversioning with that limitation.

hmry

The solution is to ask the OS to detect it for you. Linux offers a syscall for this (riscv_hwprobe). Has the drawback that it requires OS support, of course. But RVV requires OS support anyway (e.g. managing mstatus, saving vector registers on context switch), so that seems fine to me.

dzaima

There is some work on an OS-agnostic feature detection C API: https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/s.... Still quite new though, and potentially might change (as it did a month ago).

janwas

Or also getauxval? Highway has code to check for this, including that vectors are at least 128 bits: https://github.com/google/highway/blob/master/hwy/targets.cc...

thomashabets2

Oh? Isn't that what this does? std::arch::is_riscv_feature_detected!("v")

Hmm… now that I actually experiment with it, I can't get it to return `true` on hardware that does support it, unless I also compile with -Ctarget-feature=+v. And if I do, then the binary crashes with SIGILL before getting to that point on hardware without rvv.

So if it's always equal to cfg!(target_feature="v"), then what does that even mean?

I have created https://github.com/rust-lang/rust/issues/139139

quesomaster9000

It's a shame that the `MISA` CSR is in the 'Privileged Architecture' spec, otherwise you could just check bit 21 for 'V', but that appears to only be available in the highest privilege machine-mode.

Presumably your OS could trap attempts to read the CSR and allow it, but if not then it's a fatal error and your program shits the bed, otherwise you rely on some OS-specific way of getting that info at runtime.

fulafel

What happens there when you try to execute a missing RVV instruction? On other archs you get a SIGILL which you can handle.

dzaima

With RISC-V being an open ISA, a vendor can freely add some non-RVV thing in the instruction space used by RVV if it doesn't desire to support RVV. And that already exists with Xtheadvector (aka RVV 0.7.1), where T-Head made hardware with pre-ratification RVV that's rather incompatible with the ratified RVV 1.0 but still uses generally the same encoding space.

IshKebab

It's "reserved" which is basically the same as C's UB - anything can happen (nasal dragons etc.) so you can't rely on it.

fulafel

I guess being unprivileged this still guarantees the dragons are restricted to your process, otherwise the chip would have a security problem. So as a heuristic you could fork your process, try to execute SIMD operations (detecting known nonstandard versions and failing them), and if the results seem fine, send an "ok" flag up a pipe. (Or compile this into a separate test executable.)

bigstrat2003

Just because it's reserved doesn't mean anything can happen. On x86_64, you get a clearly defined error when you use reserved bits and the like. I don't know if RISC-V is the same, but if it isn't it's because they chose to be vague, not because that's what it means to be reserved.