
21 GB/s CSV Parsing Using SIMD on AMD 9950X

chao-

It feels crazy to me that Intel spent years dedicating die space on consumer SKUs to "make fetch happen" with AVX-512, and as more and more libraries are finally using it, as Intel's goal is achieved, they have removed AVX-512 from their consumer SKUs.

It isn't that AMD has better AVX-512 support, which would be an impressive upset on its own. Instead, it is only that AMD has AVX-512 on consumer CPUs, because Intel walked away from their own investment.

Aurornis

In this article, they saw the following speeds:

Original: 18 GB/s

AVX2: 20 GB/s

AVX512: 21 GB/s

This is an AMD CPU, but it's clear that the AVX512 benefits are marginal over the AVX2 version. Note that Intel's consumer chips do support AVX2, even on the E-cores.

But there's more to the story: This is a single-threaded benchmark. Intel gave up AVX512 to free up die space for more cores. Intel's top of the line consumer part has 24 cores as a result, whereas AMD's top consumer part has 16. We'd have to look at actual Intel benchmarks to see, but if the AVX2 to AVX512 improvements are marginal, a multithreaded AVX2 version across more cores would likely outperform a multithreaded AVX512 version across fewer cores. Note that Intel's E-cores run AVX2 instructions slower than the P-cores, but again the AVX boost is marginal in this benchmark anyway.

I know people like to get angry at Intel for taking a feature away, but the real-world benefit of having AVX512 instead of only AVX2 is very minimal. In most cases, it's probably offset by having extra cores working on the problem.

sitkack

That is what Intel does: they build up a market (Optane) and then do a rug pull (depth cameras). They keep doing this thing where they make a huge push into a new technology, then don't see the uptake and let it die, instead of building slowly and making the big push at the right time. Optane support was just getting mature in the Linux kernel when they pulled it. And they focused on some weird cost-cutting move by marketing it as a RAM replacement for semi-idle VMs, ok.

They keep repeating the same mistakes all the way back to https://en.wikipedia.org/wiki/Intel_iAPX_432

gnfargbl

The rugpull on Optane was incredibly frustrating. Intel developed a technology which made really meaningful improvements to workloads in an industry that is full of sticky late adopters (RDBMSes). They kept investing until the point where they had unequivocally made their point and the late adopters were just about getting it... and then killed it!

It's hard to understand how they could have played that particular hand any worse. Even a few years on, I'm missing Optane drives because there is still no functional alternative. If they had just held out a bit longer, they would have created a set of enterprise customers who would still be buying the things in 2040.

jerryseff

Optane was incredible. It's insane that Intel dropped this.

Gud

Indeed. Optane/3D XPoint was mind-blowing, futuristic stuff, but it was gone after just 5 years on the market? Talk about short-sighted.

etaioinshrdlu

Well, Itanium might be a counterexample, they probably tried to make that work for far too long..

mrweasel

Itanium was more of an HP product than an Intel one.

sitkack

Itanium worked as intended.

sebmellen

Bad habits are hard to break!

sheepscreek

> They continue to do this thing where they do a huge push into a new technology, then don't see the uptake and let it die.

Except Intel deliberately made AVX-512 a feature exclusively available to Xeon and enterprise processors in future generations. This backward step artificially limits its availability, forcing enterprises to invest in more expensive hardware.

I wonder if Intel has taken a similar approach with Arc GPUs, which lack support for GPU virtualization (SR-IOV). They somewhat added vGPU support to all built-in 12th-14th Gen chips through the i915 driver on Linux. It’s a pleasure to have graphics-acceleration in multiple VMs simultaneously, through the same GPU.

sitkack

They go out of their way to segment their markets: ECC, AVX, Optane support (only on specific server-class SKUs). I hate it as a home PC user, I hate it as an enterprise customer, I hate it as a shareholder.

FpUser

I am very disappointed about Optane drives: a perfect fit for a superfast, vertically scalable database. I was going to build a solution based on them, but suddenly they are gone for all practical intents and purposes.

tedunangst

I mean, the most interesting part of the article for me:

> A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That is, it was better than the AVX-512 based parser by ~10%, which is pretty significant for Sep.

They fixed it, that's the whole point, but I think there's evidence that AVX-512 doesn't actually benefit consumers that much. I would be willing to settle for a laptop that can only parse 20GB/s and not 21GB/s of CSV. I think vector assembly nerds care about support much more than users.

vardump

That probably just means it's a memory bandwidth bound problem. It's going to be a different story for tasks that require more computation.

wyager

You can still saturate an ultrawide vector unit with narrower instructions if you have wide enough dispatch

neonsunset

If it's any consolation, Sep will happily use AVX-512 whenever available, without having to opt into that explicitly, including the server parts, as it will most likely run under a JIT runtime (although it's NAOT-compatible). So you're not missing out by being forced to target the lowest common denominator.

MortyWaves

It’s wild seeing how stupid Intel is being.

buyucu

Intel is horrible with software. My laptop has a pretty good iGPU, but it's not properly supported by PyTorch or most other software. Vulkan inference with llama.cpp does wonders, and it makes me sad that most software other than llama.cpp does not take advantage of it.

kristianp

Sounds like something to try. Do I just need to compile Vulkan support to use the igpu?


chpatrick

In my experience I've found it difficult to get substantial gains with custom SIMD code compared to modern compiler auto-vectorization, but to be fair that was with more vector-friendly code than JSON parsing.

stabbles

Instead of doing 4 comparisons against each character `\n`, `\r`, `;` and `"` followed by 3 OR operations, a common trick is to do 1 shuffle, 1 comparison and 0 OR operations. I blogged about this trick: https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottlenec... (Trick 2)

Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using `vpternlogd` and `vpor` respectively.
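A scalar sketch of the shuffle trick for these four separators (the table values and function name are illustrative, not Sep's actual code): since `\n` (0x0A), `\r` (0x0D), `"` (0x22) and `;` (0x3B) all have distinct low nibbles, a 16-entry table indexed by the low nibble can hold the one candidate separator per nibble, and vectorized this lookup is a single `vpshufb`:

```c
#include <stdint.h>

/* Scalar model of the shuffle-based classifier (illustrative, not
 * Sep's actual code). The four separators have distinct low nibbles,
 * so table[b & 0xF] == b identifies them with one lookup plus one
 * compare. Filler entries (0x01/0x02) have a low nibble that differs
 * from their own index, so they can never match any input byte,
 * avoiding false positives (including for NUL). */
static const uint8_t kSeparatorByNibble[16] = {
    /* 0 */ 0x01, /* 1 */ 0x02, /* 2 */ '"',  /* 3 */ 0x01,
    /* 4 */ 0x01, /* 5 */ 0x01, /* 6 */ 0x01, /* 7 */ 0x01,
    /* 8 */ 0x01, /* 9 */ 0x01, /* A */ '\n', /* B */ ';',
    /* C */ 0x01, /* D */ '\r', /* E */ 0x01, /* F */ 0x01,
};

/* One table lookup + one equality compare; vectorized this becomes
 * one vpshufb and one vpcmpeqb over 16/32/64 bytes at a time. */
static int is_csv_separator(uint8_t b)
{
    return kSeparatorByNibble[b & 0x0F] == b;
}
```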

justinhj

really cool thanks

winterbloom

> This is a staggering ~3x improvement in just under 2 years since Sep was introduced June, 2023.

You can't claim this when you also do a huge hardware jump

jbverschoor

They also included 0.9.0 vs 0.10.0 on the new hardware (21385 vs 18203), so the jump due to software is ~17%.

Then if we take 0.9.0 on the previous hardware (13088) and add the 17%, we get 15375. Version 0.1.0 was 7335.

So 15375/7335 -> a staggering ~2.1x improvement in just under 2 years

freeone3000

They claim a 3 GB/s improvement versus the previous version of Sep on equal hardware — and unlike "marketing" benchmarks, they include the actual speed achieved and the hardware used.

stabbles

Do note that even before the 3 GB/s improvement, this speed exceeds the bandwidth of most disks, so the bottleneck is loading data into memory. I don't know of many applications where CSV is produced and consumed in memory, so I wonder what the use is.

pdpi

"We can parse at x GB/s" is more or less the reciprocal of "we need y% of your CPU capacity to saturate I/O".

Higher x -> lower y -> more CPU for my actual workload.
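The relation above can be sketched with hypothetical numbers (the 7 GB/s drive figure is an assumption for illustration, not from the thread):

```c
/* pdpi's y as a function of x: the fraction of one core needed to
 * keep up with I/O is the ratio of I/O bandwidth to parse speed.
 * E.g. a 7 GB/s NVMe drive against a 21 GB/s parser needs ~1/3 of
 * a core, leaving the rest for the actual workload. */
static double cpu_fraction_needed(double io_gbps, double parse_gbps)
{
    return io_gbps / parse_gbps;
}
```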

vardump

Decompression is your friend. Usually CSV compresses really well.

Multiple cores decompressing LZ4 compressed data can achieve crazy bandwidth. More than 5 GB/s per core.

freeone3000

Slower than network! In-memory processing of OLAP tables, streaming splitters, large data set division… but also the faster the parser, the less time you spend parsing and the more you spend doing actual work

perching_aix

> You can't claim this when you also do a huge hardware jump

Well, they did. Personally, I find it an interesting way of looking at it, it's a lens for the "real performance" one could get using this software year over year. (Not saying it isn't a misleading or fallacious claim though.)

WD-42

Yea wtf is that chart, it literally skips 4 cpu generations where it shows “massive performance gain”.

Straight to the trash with this post.

ziml77

But it repeats the 0.9.0 test on the new hardware. So the first big jump is a hardware change, but the second jump is the software changes.

g-mork

It also appears to be reporting whole-CPU vs. single thread, 1.3 GB/sec is not impressive for single thread perf

Remnant44

I mean... A single 9950x core is going to struggle to do more than 16 GB/second of direct mem copy bandwidth. So being within an order of magnitude of that seems reasonable

iamleppert

Agreed. How hard is it to keep hardware fixed, load the data into memory, and use a single core for your benchmarks? When I see a chart like that I think, "What else are they hiding?"

Folks should check out https://github.com/dathere/qsv if they need an actually fast CSV parser.

matja

4 generations?

5950x is Zen 3

9950x is Zen 5

chupasaurus

Since Zen 2 (3000) the mobile CPUs are numbered a thousand higher than their desktop counterparts. edit: Or N×2000 where N is the Zen generation.

Aardwolf

Take that, Intel and your "let's remove AVX-512 from every consumer CPU because we want to put slow cores on every single one of them and also not consider multi-pumping it"

tadfisher

A lot of this stems from the 10nm hole they had to dig themselves out from. Yields are bad, so costs are high, so let's cut the die as much as possible, ship Atom-derived cores and market it as an energy-saving measure. The expensive parts can be bigger and we'll cut the margins on those to retain the server/cloud sector. Also our earnings go into the shitter and we lose market share anyway, but at least we tried.

wtallis

This issue is less about Intel's fab failures and more about their inability to decouple their architecture update cadence from their fab progress. They stopped iterating on their CPU designs while waiting for 10nm to get fixed. That left them with an oversized P core and an outdated E core, and all they could do for Alder Lake was slap them onto one die and ship it, with no ability to produce a well-matched pair of core designs in any reasonable time frame. We're still seeing weird consequences of their inability to port CPU designs between processes and fabs: this year's laptop processors have HyperThreading only in the lowest-cost parts—those that still have the CPU chiplet fabbed at Intel while the higher core count parts are made by TSMC.

vessenes

If we are lucky we will see Arthur Whitney get triggered and post either a one liner beating this or a shakti engine update and a one liner beating this. Progress!

voidUpdate

I shudder to think who needs to process a million lines of csv that fast...

moregrist

I have. I think it's a pretty easy situation for certain kinds of startups to find themselves in:

- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.

- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.

- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.

Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.

trollbridge

It's become a very common interchange format, even internally; it's also easy to deflate. I have had to work on codebases where CSV was being pumped out at basically the speed of a NIC card (its origin was Netflow, and then aggregated and otherwise processed, and the results sent via CSV to a master for further aggregation and analysis).

I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?

bombela

protobuf is more friction, and actually slow to write and read.

For better or worse, CSV is easy to produce via printf, and easy to read by breaking lines and splitting on the delimiter. Escaping delimiters that appear in the content is not hard, though it is often added as an afterthought.

Protobuf requires installing a library, understanding how it works, writing a schema file, and sharing the schema with others. The API is cumbersome.

Finally, to offer the mutable-struct-with-setters-and-getters abstraction, with variable-length encoded numbers, variable-length strings, etc., the library ends up quite slow.

In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.

See https://capnproto.org/ for details from the original creator of protobuf.

Is CSV faster than protobuf? I don't know, and I haven't tested. But I wouldn't be surprised if it is.

raron

> For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting by the delimiter. Escaping delimiters part of the content is not hard, though often added as an afterthought.

Based on the amount of software I've seen that produces broken CSV or can't parse (more or less) valid CSV, I don't think that is true.

It seems easy, because it's just printf("%s,%d,%d\n", ...), but it is full of edge cases most programmers don't think about.

nobleach

Extremely hard to tell an HR person, "Right-click on here in your Workday/Zendesk/Salesforce/etc UI and export a protobuf". Most of these folks in the business world LIVE in Excel/Spreadsheet land so a CSV feels very native. We can agree all day long that for actual data TRANSFER, CSV is riddled with edge cases. But it's what the customers are using.

heavenlyblue

It's extremely unlikely they need to load spreadsheets large enough for 21 GB/s parsing speed to matter.

matja

Kind of, there isn't a 1:1 mapping of protobuf wire types to schema types, so you need to package the protobuf schema with the data and compile it to parse the data, or decide on the schema before-hand. So now you need to decide on a file format to bundle the schema and the data.

sunrunner

I shudder to think of what it means to be storing the _results_ of processing 21 GB/s of CSV. Hopefully some useful kind of aggregation, but if this was powering some kind of search over structured data then it has to be stored somewhere...

devmor

Just because you’re processing 21GB/s of CSV doesn’t mean you need all of it.

If your data is coming from a source you don’t own, it’s likely to include data you don’t need. Maybe there’s 30 columns and you only need 3 - or 200 columns and you only need 1.

Enterprise ETL is full of such cases.

hermitcrab

For all its many weaknesses, I believe CSV is still the most common data interchange format.

adra

Erm, maybe file-based? JSON is the king if you count exchanges worldwide per second. Maybe number 2 is form-data, which is basically email multipart, and of course there's email itself as a format. Very common =)

hermitcrab

I meant file-based.

devmor

I honestly wonder if JSON is king. I used to think so until I started working in fintech. XML is unfortunately everywhere.

segmondy

Lots of folks are in finance; you can share CSV with any finance company and they can process it. It's text.

zzbn00

Humans generate decisions / text information at rates of ~bytes per second at most. There are barely enough humans around to generate 21 GB/s of information even if all they did was make financial decisions!

So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?

cyral

The only real example I can think of is the US options market feed. It is up to something like 50 GiB/s now, and is open 6.5 hours per day. Even a small subset of the feed that someone may be working on for data analysis could be huge. I agree CSV shouldn't even be used here but I am sure it is.

hermitcrab

CSV is a questionable choice for a dataset that size. It's not very efficient in terms of size (real numbers take more bytes to store as text than as binary), it's not the fastest to parse (due to escaping), and a single delimiter or escape out of place corrupts everything afterwards. That's not to mention all the issues around encoding, different delimiters, etc.

adrianN

You might have accumulated some decades of data in that format and now want to ingest it into a database.

internetter

> Humans generate decisions / text information at rates of ~bytes per second at most

Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.

wat10000

Standards (whether official or de facto) often aren't the best in isolation, but they're the best in reality because they're widely used.

Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.

First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"

Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."

Next place asks the same questions, and since the first place wasn't interested, you have the same answers....

Replacing existing standards that are Good Enough is really, really hard.

ourmandave

That cartesian product file accounting sends you at year end?


pak9rabid

Ugh.....I do unfortunately.

criddell

I was expecting to see assembly language and was pleasantly surprised to see C#. Very impressive.

Nice work!

gavinray

Modern .NET has the deepest integration with SIMD and vector intrinsics of what most people would consider "high-level languages".

https://learn.microsoft.com/en-us/dotnet/standard/simd

Tanner Gooding at Microsoft is responsible for a lot of the developments in this area and has some decent blogposts on it, e.g.

https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intr...

jerryseff

Christ using... .NET?

I want to vomit.

Use Elixir; you can easily get close to this using Rust NIFs and pattern matching.

anthk

> .NET 9.0

heh, do it again with mawk.

constantcrying

There are very good alternatives to csv for storing and exchanging floating point/other data.

The HDF5 format is very good and allows far more structure in your files, as well as metadata and different types of lossless and lossy compression.
