Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL
84 comments · May 12, 2025
andygrove
Congrats on the launch!
I contributed to the NVIDIA Spark RAPIDS project for ~4 years and for the past year have been contributing to DataFusion Comet, so I have some experience in Spark acceleration and I have some questions!
1. Given the momentum behind the existing OSS Spark accelerators (Spark RAPIDS, Gluten + Velox, DataFusion Comet), have you considered collaborating with and/or extending these projects? All of them are multi-year efforts with dedicated teams. Both Spark RAPIDS and Gluten + Velox are leveraging GPUs already.
2. You mentioned that "We're fully compatible with Spark SQL (and Spark)." and that is very impressive if true. None of the existing accelerators claim this. Spark compatibility is notoriously difficult with Spark accelerators built with non-JVM languages and alternate hardware architectures. You have to deal with different floating-point implementations and regex engines, for example.
Also, Spark has some pretty quirky behavior. Do you match Spark when casting the string "T2" to a timestamp, for example? Spark compatibility has been pretty much the bulk of the work in my experience so far.
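(For anyone curious, this quirk is easy to poke at from a PySpark shell. A sketch — and as I understand it, Spark parses 'T2' as today's date at hour 2, something most other engines would reject, but check your Spark version:)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Spark's string-to-timestamp cast accepts partial forms like 'T2';
    # as I understand it, the result is the current date at 02:00:00.
    spark.sql("SELECT CAST('T2' AS TIMESTAMP) AS ts").show(truncate=False)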
Providing acceleration at the same time as guaranteeing the same behavior as Spark is difficult and the existing accelerators provide many configuration options to allow users to choose between performance and compatibility. I'm curious to hear your take on this topic and where your focus is on performance vs compatibility.
winwang
1. Yes! Would love to contribute back to these projects, since I am already using RAPIDS under the hood. My general goal is to bring GPU acceleration to more workloads. Though, as a solo founder, I am finding it difficult to have any time for this at the moment, haha.
2. Hmm, maybe I should mention that we're not "accelerating all operations" -- merely compatible. Spark-RAPIDS has the goal of being byte-for-byte compatible unless incompatible ops are specifically allowed. But... you might be right about that kind of quirk. Would not be surprising, and reminds me of checking behavior between compilers.
I'd say the default should be a focus on compatibility, and work through any extra perf stuff with our customers. Maybe a good quick way to contribute back to open source is to first upstream some tests?
Thanks for your great questions :)
sitkack
> I was trying to craft a CUDA-based lambda calculus interpreter
This is awesome!
I assume you have seen https://github.com/HigherOrderCO/Bend https://github.com/higherorderco/hvm
Previous discussions https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
winwang
Indeed I have, though I have some reservations about its focus on interaction nets (or rather, its marketing).
I ended up making a CUDA-based, data-parallel STLC typechecker (Hindley-Milner)... I want to formally prove its correctness first, but maybe a blog post would be okay either way.
dogman123
This could be incredibly useful for me. Currently struggling to complete jobs with massive amounts of shuffle with Spark on EMR (large joins yielding 150+ billion rows). We use Glue currently, but it has become cost prohibitive.
threeseed
You should try using an S3 based shuffle plugin: https://github.com/IBM/spark-s3-shuffle
Then mount FSX for Lustre on all of your EMR nodes and have it write shuffle data there. It will massively improve performance and shuffle issues will disappear.
It is expensive, though. But you can offset the cost because you can run your workers entirely on Spot instances: if you lose a node, there's no recomputation of the shuffle data.
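A rough sketch of the wiring (the FSx mount path is a placeholder, and the S3 shuffle-manager class name should be taken from the plugin's README rather than from here):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # Point shuffle/spill scratch space at the FSx for Lustre mount
        # (path is a placeholder for wherever you mounted it on the nodes).
        .config("spark.local.dir", "/mnt/fsx/spark-scratch")
        # To push shuffle data to S3 instead, set spark.shuffle.manager to the
        # manager class documented in the spark-s3-shuffle README (not shown here).
        .getOrCreate()
    )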
winwang
Is the shuffle the biggest issue? Not too sure about joins but one of the datasets we're currently dealing with has a couple trillion rows. Would love to chat about this!
wenbin
Congrats on the launch.
This reminds me of https://www.heavy.ai/ (previously MapD back in 2015/16?)
jelder
Any relationship with the PG-Strom project?
winwang
No relationship... yet! Hoping to have a good relationship in the future so I have a business reason to fly to Japan :D
Btw, interesting thing they said here: "By utilization of GPU (Graphic Processor Unit) device which has thousands cores per chip"
It's more like "hundreds", since the number of "real" cores is like (CUDA cores / 32). Though I think we're about to see 1k cores (SMSPs).
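Applying that rule of thumb to public CUDA-core counts (spec-sheet numbers, treat as approximate):

    # "Real" cores ~= CUDA cores / 32, per the rule of thumb above.
    cuda_cores = {"T4": 2560, "L4": 7424, "A100": 6912, "H100 SXM": 16896}
    for gpu, cores in cuda_cores.items():
        print(f"{gpu}: {cores} CUDA cores -> ~{cores // 32} 'real' cores")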
That being said, I do believe CUDA cores have more interesting capabilities than a typical vector lane, i.e. for memory operations (thank the compiler). Would love to be corrected!
random17
Congrats on the launch!
I'm curious about what kinds of workloads you see GPU-accelerated compute having a significant impact on, and what kinds still pose challenges. You mentioned that I/O is not the bottleneck; is that still true for queries that require large-scale shuffles?
winwang
Large scale shuffles: Absolutely. One of the larger queries we ran saw a 450TB shuffle -- this may require more than just deploying the spark-rapids plugin, however (depends on the query itself and specific VMs used). Shuffling was the majority of the time and saw 100% (...99%?) GPU utilization. I presume this is partially due to compressing shuffle partitions. Network/disk I/O is definitely not the bottleneck here.
It's difficult to say what "workloads" are significant, and easier to talk about what doesn't really work AFAIK. Large-scale shuffles might see 4x efficiency, assuming you can somehow offload the hash shuffle memory, have scalable fast storage, etc... which we do. Note this is even on GCP, where there isn't any "great" networking infra available.
Things that don't get accelerated include multi-column UDFs and some incompatible operations. These aren't physical/logical limitations, it's just where the software is right now: https://github.com/NVIDIA/spark-rapids/issues
Multi-column UDF support would likely require some compiler-esque work in Scala (which I happen to have experience in).
A few things I expect to be "very" good: joins, string aggregations (empirically), sorting (clustering). Operations which stress memory bandwidth will likely be "surprisingly" good (surprising to most people).
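If you want to see what falls back on your own queries, a minimal sketch (assumes the rapids-4-spark jar and a GPU are already on the cluster; the query and path are placeholders):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("rapids-check")
        .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
        .config("spark.rapids.sql.enabled", "true")
        # Log which operators stay on the CPU and why (e.g. unsupported UDFs).
        .config("spark.rapids.sql.explain", "NOT_ON_GPU")
        .getOrCreate()
    )

    spark.sql("SELECT key, count(*) FROM parquet.`/tmp/data` GROUP BY key").explain()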
Otherwise, Nvidia has published a bunch of really-good-looking public data, along with some other public companies.
Outside of Spark, I think many people underestimate how "low-latency" GPUs can be. 100 microseconds and above is highly likely to be a good fit for GPU acceleration in general, though that could be as low as 10 microseconds (today).
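As a rough illustration of that latency floor, a sketch using CuPy (exact numbers depend heavily on the GPU, driver, and kernel cache state):

    import time
    import cupy as cp

    x = cp.arange(1_000_000, dtype=cp.float32)
    cp.sum(x)                          # warm-up: compile/cache the reduction kernel
    cp.cuda.Device().synchronize()

    iters = 1000
    t0 = time.perf_counter()
    for _ in range(iters):
        cp.sum(x)
        cp.cuda.Device().synchronize()  # round trip: launch + compute + sync
    per_call_us = (time.perf_counter() - t0) / iters * 1e6
    print(f"~{per_call_us:.0f} us per 1M-element reduction, end to end")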
_zoltan_
8TB/s bandwidth on the B200 helps :-) (yes, yes, that is at the high end, but 4.8TB/s@H200, 4TB/s@H100, 2TB/s@A100 is nothing to sneeze at either).
winwang
Very true. Can't get those numbers even if you get an entire single-tenant CPU VM. Minor note, A100 40G is 1.5TB/s (and much easier to obtain).
That being said, ParaQuery mainly uses T4 and L4 GPUs with "just" ~300 GB/s bandwidth. I believe (correct me if I'm wrong) that should be around a 64-core VM, though obviously dependent on the actual VM family.
threeseed
Many of us have been using GPU accelerated Spark for years:
https://developer.nvidia.com/rapids/
https://github.com/NVIDIA/spark-rapids
And it's supported on AWS: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spar...
winwang
Indeed, Spark-RAPIDS has been around for a while! And it's quite simple to have a setup that works. Most of the issues come after the initial PoC, especially for teams not wanting to manage infra, not to mention GPU infra.
mritchie712
> they're saving over 60% off of their BigQuery bill
how big is their data?
A lot of BigQuery users would be surprised to find they don't need BigQuery.
This[0] post (written by a founding engineer of BigQuery) has a bit of hyperbole, but this part is in line with my experience:
> A couple of years ago I did an analysis of BigQuery queries, looking at customers spending more than $1000 / year. 90% of queries processed less than 100 MB of data. I sliced this a number of different ways to make sure it wasn’t just a couple of customers who ran a ton of queries skewing the results. I also cut out metadata-only queries, which are a small subset of queries in BigQuery that don’t need to read any data at all. You have to go pretty high on the percentile range until you get into the gigabytes, and there are very few queries that run in the terabyte range.
We're[1] built on duckdb and I couldn't be happier about it. Insanely easy to get started with, runs locally and client-side in WASM, great language features.
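For a flavor of how little ceremony that involves, a sketch (file and column names are placeholders):

    import duckdb

    # Query a Parquet file in place -- no server, no load step.
    con = duckdb.connect()
    top = con.sql("""
        SELECT user_id, count(*) AS n
        FROM 'events.parquet'
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).df()
    print(top)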
winwang
They have >1PB of data to ETL, with some queries hitting 450TB of pure shuffle.
It's very true that most users don't need something like BigQuery or Snowflake. That's why some startups have sprung up to cut Snowflake costs by "simply" putting a Postgres instance in front of it!
In fact, I just advised someone recently to simply use Postgres instead of BigQuery since they had <1TB and their queries weren't super intensive.
threeseed
> A lot of BigQuery users would be surprised to find they don't need BigQuery.
No they wouldn't.
a) BigQuery is the only managed, supported solution on GCP for SQL based analytical workloads. And they are using it because they started with GCP and then chose BigQuery.
b) I have supported hundreds of Data Scientists over the years using Spark, and it is nothing like BigQuery. You need to have much more awareness of how it all fits together, because it sits on a JVM that, when exposed to memory pressure, will do a full GC and kill the executor. When this happens, at best your workload gets significantly slower and at worst your job fails.
winwang
Hopefully, we can be another managed solution for those on GCP.
And as for your second point, yep, Spark tuning is definitely annoying! BigQuery is a lot more than just the engine, and building a simple interface for a complicated, high-performance process is hard. That's a big reason why I made ParaQuery.
threeseed
You may want to look into DataMechanics, another YC startup that tried something similar. They were acqui-hired by NetApp.
If I remember correctly, they focused on the SME space, because in enterprise you will likely struggle against pre-allocated cloud spend budgets which lock companies into just using GCP services. I've worked at a dozen enterprise companies now and every one had this.
mritchie712
> No they wouldn't.
haha, you're giving people way too much credit. Tons of people make bad software purchasing decisions. It's hard, people make mistakes.
torsstei
Cool! Do you have a positioning versus Databricks support for Spark-RAPIDS (https://github.com/NVIDIA/spark-rapids-ml/blob/main/notebook...)?
Boxxed
I'm surprised the GPU is a win when the data is coming from GCS. The CPU still has to touch all the data, right? Or do you have some mechanism to keep warm data live in the GPUs?
winwang
Yep, CPU has to transfer data because no RDMA setup on GCP lol. But that's like 16-32 GB/s of transfer per GPU (assuming T4/L4 nodes), which is much more than network bandwidth. And we're not even network bound, even if there's no warm data (i.e. for our ETL workloads). However, there is some stuff kept on GPU during actual execution for each Spark task even if they aren't running on the GPU at the moment, which makes handling memory and partition sizes... "fun", haha.
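Back-of-envelope version of that comparison (the per-GPU PCIe figures are the ones quoted above; the NIC figure is an assumption):

    pcie_gb_s = {"T4 node (PCIe gen3 x16)": 16, "L4 node (PCIe gen4 x16)": 32}
    nic_gb_s = 50 / 8  # assume a 50 Gbit/s NIC, i.e. ~6.25 GB/s per VM

    for gpu, bw in pcie_gb_s.items():
        print(f"{gpu}: {bw} GB/s host->GPU, ~{bw / nic_gb_s:.1f}x the assumed NIC bandwidth")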
threeseed
I've used GPU based Spark SQL for many years now and it sounds flashy but it's not going to make a meaningful difference for most use cases.
As you say, the issue is that you have an overall process to optimise: getting the data off slow GCS onto the nodes, shuffling it (which often writes it to a slow disk) before the real processing even starts, then writing back to slow GCS.
acstorage
Why do you say GCS performance isn’t an issue? I would imagine a highly parallel compute system would require higher throughput from object storage? I’m surprised you aren’t I/O bound.
debarshri
It reminds me of Hadoop days, where the data would be stored in the HDFS and you would use mapreduce to process it. However, the concept was to send computation to the location of the data.
This really makes sense. I might be a little out of touch. I wonder: do you incur transfer costs when your data is in buckets and you process it by bringing the data to the compute?
winwang
If you stand up your compute cluster in the same region as your bucket, there are no egress fees. Otherwise, yes, in general. There are some clouds that don't have egress fees though, i.e. Cloudflare R2.
Hey HN! I'm Win, founder of ParaQuery (https://paraquery.com), a fully-managed, GPU-accelerated Spark + SQL solution. We deliver BigQuery's ease of use (or easier) while being significantly more cost-efficient and performant.
Here's a short demo video demonstrating ParaQuery (vs. BigQuery) on a simple ETL job: https://www.youtube.com/watch?v=uu379YnccGU
It's well known that GPUs are very good for many SQL and dataframe tasks, at least by researchers and GPU companies like NVIDIA. So much so that, in 2018, NVIDIA launched the RAPIDS program and the Spark-RAPIDS plugin (https://github.com/NVIDIA/spark-rapids). I actually found out because, at the time, I was trying to craft a CUDA-based lambda calculus interpreter…one of several ideas I didn't manage to implement, haha.
There seems to be a perception among at least some engineers that GPUs are only good for AI, graphics, and maybe image processing (maybe! someone actually told me they thought GPUs are bad for image processing!) Traditional data processing doesn’t come to mind. But actually GPUs are good for this as well!
At a high level, big data processing is a high-throughput, massively parallel workload. GPUs are a type of hardware specialized for this, are highly programmable, and (now) happen to be highly available on the cloud! Even better, GPU memory is tuned for bandwidth over raw latency, which only improves their throughput capabilities compared to a CPU. And by just playing with cloud cost calculators for a couple of minutes, it's clear that GPUs are cost-effective even on the major clouds.
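(A back-of-envelope version of that calculator exercise. The $/hr figures below are illustrative placeholders, not quotes -- plug in your region's current on-demand rates; the bandwidth figures echo the T4/L4-vs-64-core comparison elsewhere in the thread:)

    # All $/hr values are placeholders for illustration only.
    options = {
        #                    (memory bandwidth GB/s, assumed $/hr)
        "64-vCPU VM":        (300, 3.00),
        "small VM + 1x T4":  (320, 1.00),
        "small VM + 1x L4":  (300, 1.50),
    }
    for name, (bw, price) in options.items():
        print(f"{name}: ~{bw / price:.0f} GB/s of memory bandwidth per $/hr")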
To be honest, I thought using GPUs for SQL processing would have taken off by now, but it hasn't. So, just over a year ago, I started working on actually deploying a cloud-based data platform powered by GPUs (i.e. Spark-RAPIDS), spurred by a friend-of-a-friend(-of-a-friend) who happened to have BigQuery cost concerns at his startup. After getting a proof of concept done and a letter of intent... well, nothing happened! Even after over half a year. But then, something magical did happen: their cloud credits ran out!
And now, they're saving over 60% off of their BigQuery bill by using ParaQuery, while also being 2x faster -- with zero data migration needed (courtesy of Spark's GCS connector). By the way, I'm not sure about other people's experiences but... we're pretty far from being IO-bound (to the surprise of many engineers I've spoken to).
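(The zero-migration part looks roughly like this from the Spark side -- a sketch assuming the GCS connector jar and credentials are already set up on the cluster; bucket, path, and column names are placeholders:)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gcs-etl").getOrCreate()

    # Read directly from the existing GCS bucket -- no copy into a warehouse.
    df = spark.read.parquet("gs://your-bucket/events/")
    df.groupBy("event_type").count().write.parquet("gs://your-bucket/rollups/")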
I think that the future of high-throughput compute is computing on high-throughput hardware. If you think so too, or you have scaling data challenges, you can sign up here: https://paraquery.com/waitlist. Sorry for the waitlist, but we're not ready for a self-serve experience just yet—it would front-load significant engineering and hardware cost. But we’ll get there, so stay tuned!
Thanks for reading! What have your experiences been with huge ETL / processing loads? Was cost or performance an issue? And what do you think about GPU acceleration (GPGPU)? Did you think GPUs were simply expensive? Would love to just talk about tech here!