Span<T>.SequenceEquals is faster than memcmp

108 comments

·March 30, 2025

xnorswap

A more meaningful adventure into microbenchmarking than my last. I look at why we no longer need to P/Invoke memcmp to efficiently compare arrays in C# / .NET.

Old stackoverflow answers are a dangerous form of bit-rot. They get picked up by well-meaning developers and LLMs alike and recreated years after they are out of date.

neonsunset

For loop regression in .NET 9, please submit an issue at dotnet/runtime. It’s yet another loop tearing miscompilation caused by suboptimal loop lowering changes if my guess is correct.

xnorswap

No problem, I've raised the issue as https://github.com/dotnet/runtime/issues/114047 .

neonsunset

Thanks!

neonsunset

UPD: For those interested, it was an interaction between microbenchmark algorithm and tiered compilation and not a regression.

https://github.com/dotnet/runtime/issues/114047#issuecomment...

Dylan16807

This is a ten line function that takes half a second to run.

Why do you have to call it more than 50 times before it gets fully optimized?? Is the decision-maker completely unaware of the execution time?

timewizard

The number of times I've caught developers wholesale copying stack overflow posts, errors and all, is far too high.

guerrilla

Indeed, the problems of LLMs are not new. We just automated what people who have no idea what they are doing were doing anyway. We... optimized incompetence.

SketchySeaBeast

The problem with the LLM equivalent is that you can't see the timestamp of the knowledge it's drawing from. With stack overflow I can see a post is from 2010 and look for something more modern, that due diligence is no longer available with an LLM, which has little reason to choose the newest solution.

eddythompson80

This is a bit elitist isn’t it. It highly depends on the type of code copied and it’s huge part of software engineer bullishness approach to LLMs compared to most other professions.

Regardless of how competent as a programmer you are, you don’t necessarily possess the knowledge/answer to “How to find open ports on Linux” or “How to enumerate child pids of a parent pid” or “what is most efficient way to compare 2 byte arrays in {insert language}” etc. A search engine or an LLM is a fine solution for those problems.

You know that the answer to that question if what you’re after. I’d generally consider you knowing the right question to ask is all that matters. The answer is not interesting. It’s most likely a deeply nested knowledge about how Linux networking stack works, or how process management works on a particular OS. If that was the central point of the software we’re build (like for example we’re a Linux Networking Stack company) then by all means. It’s silly to find a lead engineer in our company who is confused about how open ports work in Linux.

jayd16

What's even worse is when you catch someone copying from the questions instead of the answers!

Dwedit

The call to "memcmp" has overhead. It's an imported function which cannot be inlined, and the marshaller will automatically create pinned GC handles to the memory inside of the arrays as they are passed to the native code.

I wonder how it would compare if you passed actual pointers to "memcmp" instead of marshalled arrays. You'd use "fixed (byte *p = bytes) {" on each array first so that the pinning happens outside of the function call.

MarkSweep

I'm pretty sure the marshaling code for the pinvoke is not creating GC handles. It is just using a pinned local, like a fixed statement in csharp does. This is what the LibraryImport at least and I don't see why the built in marshaller would be different. The author says in the peer comment that they confirmed the performance is the same.

I think the blog post is quite good at showing that seemingly similar things can have different performance tradeoffs. A follow up topic might digging deeper into the why. For example, if you look at the disassembly of the p/invoke method, you can see the source of the overhead: setting up a p/invoke frame so the stack is walkable while in native code, doing a GC poll after returning from the native function, and removing the frame.

https://gist.github.com/AustinWise/21d518fee314ad484eeec981a...

xnorswap

I tried that but cut it from the code because it had the same performance.

Rohansi

Have you tried the newer [LibraryImport] attribute?

xnorswap

I haven't, I wasn't aware of that attribute. I would gratefully accept a PR with such a benchmark case.

Edit: I've now tried it, and it reduced overhead a small amount. (e.g. Average 7.5 ns vs 8 ns for the 10-byte array )

mhh__

memcmp and friends can be a funny one when looking at disasm

Depending on context and optimization settings we might see:

  - Gone entirely
  - A memcmp call has been inlined and turned into a single instruction
  - It's turned into a short loop
  - A loop has been turned into a memcmp call.

FWIW This is also one of the reasons why I think the VM-by-default / JIT way holds dotnet back. I find it very hard to be confident about what the assembly actually looks like, and after that.

Subtly I think it also encourages a "that'll do" mindset up the stack. You're working in an environment where you're not really incentivised to care so some patterns just don't feel like they'd have happened in a more native language.

int_19h

For what it's worth, I have read .NET JIT disassembly as part of perf work on a couple of occasions. On Windows, at least, Visual Studio enables this seamlessly - if you break inside managed code, you can switch to Disassembly view and see the actual native code corresponding to each line, step through it etc.

neonsunset

> I find it very hard to be confident about what the assembly actually looks like, and after that.

Godbolt is your friend as a DPGO-less baseline. Having JIT is an advantage w.r.t. selecting the best SIMD instruction set.

> Subtly I think it also encourages a "that'll do" mindset up the stack.

What is the basis for this assumption?

mhh__

> Having JIT is an advantage w.r.t. selecting the best SIMD instruction set.

On paper yes but does anyone really rely on it? multiversioning is easy to do in a aot model too and even then most people don't bother. obviously sometimes its critical.

The more magic you put into the jit also makes it slower, so even though there are _loads_ of things you can do with a good JIT a lot them don't actually happen in practice.

PGO is one of those things. I've never really encountered it in dotnet but it is basically magic in frontend-bound programs like compilers.

> What is the basis for this assumption?

It's not an assumption, it's my impression of the dotnet ecosystem.

I do think also some patterns somewhat related to JITed-ness has led to some patterns (particularly around generics) that mean that common patterns in the language can't actually be expressed statically so one ends up with all kinds of quasi-dynamically typed runtime patterns e.g. dependency injection. But this is more of a design decision that comes from the same place.

null

[deleted]

merb

> That's not a super helpful description, but the summary is that it's stack-allocated rather than heap allocated.

I’m pretty sure that this is not 100% correct, since one can also use other allocation methods and use a span to represent it. Only with stackalloc will the memory it points to be stackallocated. What it basically means is that the type is stack allocated, always, but not the memory it points to.

MarkSweep

Yeah, as written this is quite confusing and does not describe why a Span is useful. It seems to be a garbled quoting of the first sentence of the supplement documentation about this API:

https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...

I think a better description of what a Span does is later in the article:

> A Span<T> represents a contiguous region of arbitrary memory. A Span<T> instance is often used to hold the elements of an array or a portion of an array. Unlike an array, however, a Span<T> instance can point to managed memory, native memory, or memory managed on the stack.

The fact that you have to put the Span<T> on the stack only is a limitation worth knowing (and enforced by the compiler). But it is not the most interesting thing about them.

xnorswap

Thank you, it was indeed a "garbled quoting" of that article. I am generally terrible at explaining things.

Trying to improve my ability to explain things was part of my motivation for taking up blogging.

int_19h

IIRC it is enforced not only by the compiler, but the runtime as well (for verifiable code).

john-h-k

Yes, this is correct. The span itself - the (ptr, len) pair - is on stack (by default) but the data is almost always on the heap, with stackalloc being the most notable exception

neonsunset

The design of spans does not make assumptions about this however. `ref T` pointer inside the span can point to any memory location.

It is not uncommon to wrap unmanaged memory in spans. Another popular case, even if it's something most developers not realize, is readonly spans wrapping constant data embedded in the application binary. For example, if you pass '[1, 2, 3, 4]' to an argument accepting 'ReadOnlySpan<int>' - this will just pass a reference to constant data. It also works for new T[] { } as long as T is a primitive and the target of the expression is a read-only span. It's quite prevalent nowadays but the language tries to get out of your way when doing so.

neonsunset

FWIW LINQ's SequenceEqual and many other CoreLib methods performing sequence comparison forward to the same underlying comparison routine used here whenever possible.

All of this builds on top of very powerful portable SIMD primitives and platform intrinsics that ship with the standard library.

runevault

The amount of optimizations, specifically around using stack allocated objects, .net has seen in recent years is amazing.

Another one beyond all the span stuff (though related) that got added in dotnet 9 was AlternateLookup for stuff like dictionary and HashSet where you create a stack allocated object that lets you use stack related objects to compare.

Simple example, if you have a dictionary you are building and you're parsing a json file, you can use spans and compare those directly into the dictionary without having to allocate new strings until you know it is a distinct value. (Yes I know you can just use the inbuilt json library, this was just he simplest example of the idea I could think of to get the point across).

junto

It’s astounding just how fast modern .NET has become. I’d be curious as to how the .NET (Framework excluded) benchmarks run in a Linux container.

jiggawatts

I just did some benchmarks of this!

Linux in general provides the same speed for pure CPU workloads like generating JSON or HTML responses.

Some I/O operations run about 20% better, especially for small files.

One killer for us was that the Microsoft.Data.SqlClient is 7x slower on Linux and 10x slower on Linux with Docker compared to a plain Windows VM!

That has a net 2x slowdown effect for our applications which completely wipes out the licensing cost benefit when hosted in Azure.

Other database clients have different performance characteristics. Many users have reported that PostgreSQL is consistent across Windows and Linux.

neonsunset

> Microsoft.Data.SqlClient is 7x slower on Linux

It is probably worth reporting your findings and environment here: https://github.com/dotnet/SqlClient

Although I'm not sure how well-maintained SqlClient w.r.t. such regressions as I don't use it.

Also make sure to use the latest version of .NET and note that if you give a container anemic 256MB and 1C - under high throughput it won't be able to perform as fast as the application that has an entire host to itself.

jiggawatts

I’m using the latest everything and it’s still slow as molasses.

This issue has been reported years ago by multiple people and Microsoft has failed to fix it, despite at least two attempts at it.

Basically, only the original C++ clients work with decent efficiency, and the Windows client is just a wrapper around this. The portable “managed”, MARS, and async clients are all buggy (including data corruption) and slow as molasses. This isn’t because of the .NET CLR but because of O(n^2) algorithms in basic packet reassembly steps!

I’ve researched this quite a bit, and a fundamental issue I noticed was that the SQL Client dev team doesn’t test their code for performance with realistic network captures. They replay traces from disk, which is “cheating” because they never see a partial buffer like you would see on an Ethernet network where you get ~1500 bytes per packet instead of 64KB aligned(!) reads from a file.

Analemma_

I agree, .NET Core has improved by gigantic leaps and bounds. Which makes it all the more frustrating to me that .NET and Java both had "lost decades" of little to no improvement. Java mostly only on the language side, where 3rd-party JVMs still saw decent changes, but .NET both on the language and runtime side. I think this freeze made (and continues to make) people think the ceiling of both performance and developer ergonomics of these languages is much lower than it actually is.

paavohtl

I certainly agree that Java / JVM had a lost decade (or even more), but not really with C# / .NET. When do you consider that lost decade to have been? C# has had a major release with new language features every 1-3 years, consistently for the past 20+ years.

CharlieDigital

Lost decade in another sense in the case of C#.

It's sooooo good now. Fast, great DX, LINQ, Entity Framework, and more!

But I still come across a lot of folks that think it's still in the .NET Framework days and bound to Windows or requires paid tooling like Visual Studio.

torginus

.NET was always fast. I remember in the .NET framework 2.0 days, .NET's JIT for derived from the Microsoft C++ compiler, with some of the more expensive optimizations (like loop hoisting) removed and general optimization effort pared back.

But If you knew what you were doing, for certain kinds of math heavy code, and aggressive use of low level features (like raw pointers) you could get within 10% of C++ code, with the general case being that garden variety non super optimized code being half as fast as equivalent C++ code.

I think this ratio has remained pretty consistent over the years.

api

I wonder how it compares to (1) Go, (2) the JVM, and (3) native stuff like Rust and C++.

Obviously as with all such benchmarks the skill of the programmer doing the implementing matters a lot. You can write inefficient clunky code in any language.

kfuse

All modern popular languages are fast, except the most popular one.

api

JavaScript is hella fast for a dynamically typed language, but that's because we've put insane amounts of effort into making fast JITing VMs for it.

paulddraper

And Python+Ruby

kristianp

I would say go is not in the same category of speed as rust anf c/c++. The level of optimisation done by them is next level. Go also doesn't inline your assembly functions, has less vectorisation in the standard libraries, and doesn't allow you to easily add vectorisation with intrinsics.

junto

https://medium.com/deno-the-complete-reference/net-vs-java-v...

jeffbee

Java and .NET (and JS or anything that runs under v8 or HotSpot) usually compare favorably to others because they come out of the box with PGO. The outcomes for peak-optimized C++ are very good, but few organizations are capable of actually getting from their C++ build what every .NET user gets for free.

metaltyphoon

.NET go as far as having D(ynamic)PGO, which is enabled by default.

userbinator

Interestingly, Intel made REP CMPS much faster in the latest CPUs:

https://stackoverflow.com/questions/75309389/which-processor...

bob1029

Span<T> is easily my favorite new abstraction. I've been using the hell out of it for building universal Turing machine interpreters. It's really great at passing arbitrary views of physical data around. I default to using it over arrays in most places now.

null

[deleted]

CyanLite2

The article missed the biggest thing:

SequenceEquals is SIMD accelerated. memcmp is not.

qingcharles

This is almost certainly the answer.

There are a bunch of Intel folks on the dotnet core github regularly pushing new SIMD updates for CPUs that aren't even released yet. They are trying to make sure your .NET code runs nice on your new datacenter servers.

OptionOfT

I did some digging, and found that SequenceEquals is heavily optimized for when T = Byte: https://github.com/dotnet/runtime/blob/454673e1d6da406775064...

Does memcmp do all of these things? Is msvcrt.dll checking at runtime which extensions the CPU support?

Because I don't think msvcrt.dll is recompiled per machine.

I think a better test would be to create a DLL in C, expose a custom version of memcmp, and compile that with all the vectorization enabled.

xnorswap

The comparison isn't to prove that .NET is always faster than C in all circumstances, it was to demonstrate that the advice to call out to C from .NET is outdated and now worse than the naive approach.

Can C wizards write faster code? I'm sure they can, but I bet it takes longer than writing a.SequenceEquals(b) and moving on to the next feature, safe in the knowledge that the standard library is taking care of business.

"Your standard library is more heavily optimised" isn't exactly a gotcha. Yes, the JIT nature of .NET means that it can leverage processor features at runtime, but that is a benefit to being compiled JIT.

asveikau

> Does memcmp do all of these things? Is msvcrt.dll checking at runtime which extensions the CPU support

It's possible for a C implemention to check the CPU at dynamic link time (when the DLL is loaded) and select which memcmp gets linked.

The most heavily used libc string functions also have a tendency to use SIMD when the data sizes and offsets align, and fall back to the slow path for any odd/unaligned bytes.

I don't know to what extent MSVCRT is using these techniques. Probably some.

Also, it's common for a compiler to recognize references to common string functions and not even emit a call to a shared library, but provide an inline implementation.

neonsunset

It's not limited to bytes. It works with any bitwise comparable primitive i.e. int, long, char, etc.

The logic which decides which path to use is here https://github.com/dotnet/runtime/blob/main/src/libraries/Sy... and here https://github.com/dotnet/runtime/blob/main/src/coreclr/tool... (this one is used by ILC for NativeAOT but the C++ impl. for the JIT is going to be similar)

The [Intrinsic] annotation is present because such comparisons on strings/arrays/spans are specially recognized in the compiler to be unrolled and inlined whenever one of the arguments has constant length or is a constant string or a span which points to constant data.

int_19h

memcmp is also supposed to be heavily optimized for comparing arrays of bytes since, well, that is literally all that it does.

msvcrt.dll is the C runtime from VC++6 days; a modern (as in, compiled against VC++ released in the last 10 years) C app would use the universal runtime, ucrt.dll. That said, stuff like memcpy or memcmp is normally a compiler intrinsic, and the library version is there only so that you can take an pointer to it and do other such things that require an actual function.

loeg

This has gotta be some sort of modest overhead from calling into C memcmp that is avoided by using the native C# construct, right? There's no reason the two implementations shouldn't be doing essentially the same thing internally.

xnorswap

Outside of the 10 elements case, I don't think it's an overhead issue, the overhead is surely minuscule compared to the 1GB of data in the final tests, which also show a large difference in performance.

I suspect it's that the memcmp in the Visual C++ redistributable isn't as optimised for modern processor instructions as the .NET runtime is.

I'd be interested to see a comparison against a better more optimised runtime library.

Ultimately you're right that neither .NET nor C can magic out performance from a processor that isn't fundamentally there, but it's nice that doing the out-of-the-box approach performs well and doesn't require tricks.

lstodd

might as well mean that msvcrt's memcmp is terrible

airybreath

The post links to one answer to a StackOverflow question, but the top answer to that same question when sorting by "Trending (recent votes count more)", << https://stackoverflow.com/a/48599119/1083771 >>, suggests exactly* the same thing: use ReadOnlySpan<T>.SequenceEqual

*the post suggests that IEnumerable<T>.SequenceEqual is more-or-less the same, but the underlying reason is because ReadOnlySpan<T>.SequenceEqual is so fast that the implementation of IEnumerable<T>.SequenceEqual spends a bit of overhead in order to let it use that exact same call when feasible: https://github.com/dotnet/runtime/blob/v9.0.3/src/libraries/...

HN

Span<T>.SequenceEquals is faster than memcmp

Span<T>.SequenceEquals is faster than memcmp