ArkFlow – High-performance Rust stream processing engine
47 comments · March 14, 2025
bbminner
I work at one of the large tech companies, and I can attest that while the idea seems very neat in theory (especially if your schemas are typed), and even if you define an API for defining new building blocks, sooner or later people realize that they need to dynamically adjust parts of the pipeline. So they write components to dynamically set and resolve these, then other components on top of those components, then components for composing components - and now you have forced yourself into implementing a weird, hard-to-debug functional programming language in YAML, which is not a place anyone wants to find themselves in :'(
One lesson I learned from this: any bit of logic that defines a computation should prefer explicit imperative code (e.g. Python) over configuration, because you are likely to end up implementing an imperative language in that configuration language anyway.
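A hypothetical illustration of that lesson (Rust here, but any imperative language makes the point): logic that in a config language grows into a templated condition is just an ordinary, debuggable function in code.

    // Hypothetical illustration: routing logic that, in a config language,
    // tends to end up as a templated condition like
    //     when: "{{ and (eq .region `eu`) (gt .retries 3) }}"
    // is an ordinary, testable function in imperative code:
    fn route(region: &str, retries: u32) -> &'static str {
        if region == "eu" && retries > 3 {
            "dead-letter"
        } else {
            "main"
        }
    }

    fn main() {
        assert_eq!(route("eu", 5), "dead-letter");
        assert_eq!(route("us", 1), "main");
    }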
bob1029
> sooner or later people realize that they need to dynamically adjust parts of the pipeline
The customer is the hard part in all of this, but there is respite if you are patient and careful with the tech.
If you are in a situation where you need to go from one SQL database to another SQL database, the # of additional tools required should be zero. Using a merge statement & recursive CTEs per target table, you can transform any schema into any other. Most or all of the actual business logic can reside in the command text - how we filter & project data into the target system.
If we accept that the SQL-to-SQL case has a good general solution, I would then ask whether it is possible to refactor all problems such that they wind up with this shape in the middle. All of that nasty systems code could then be focused more on loading and extracting data into and out of this regime, where it can be trivially sliced & diced. Once you have something in Postgres or SQL Server, you are at the top of the hill. Everything adapts to you at that point. Talking to another instance of yourself - or something that looks & talks like you - is trivial.
The other advantage of this path is that refactoring SQL scripts is something the customer (B2B) can directly manage in many situations. The entire pipeline can live in a single text file that you pass around an email chain. You don't have to teach them things like Python, YAML, or source control.
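To make that concrete, a per-table step might look like the following (hypothetical schemas, columns, and connection string; sqlx is just one way to dispatch the command text from Rust):

    // Hypothetical sketch of the SQL-to-SQL idea: all business logic lives in
    // the command text of one MERGE per target table. Schema/column names and
    // the connection string are made up.
    use sqlx::postgres::PgPoolOptions;

    const MERGE_CUSTOMERS: &str = r#"
    MERGE INTO target.customers AS t
    USING (SELECT id, trim(name) AS name, email FROM staging.clients) AS s
    ON t.id = s.id
    WHEN MATCHED THEN
        UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED THEN
        INSERT (id, name, email) VALUES (s.id, s.name, s.email);
    "#;

    #[tokio::main]
    async fn main() -> Result<(), sqlx::Error> {
        let pool = PgPoolOptions::new().connect("postgres://localhost/etl").await?;
        sqlx::query(MERGE_CUSTOMERS).execute(&pool).await?;
        Ok(())
    }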
lucyjojo
yeah, in most projects, when you spot a config file, its complexity will tend to scale with the increasing complexity of the domain you're capturing.
so either the project is very small or mature and you don't have to worry too much, or, under active development, your config files are pretty much the instruction set of some kind of foggy logical VM... and eventually a whole environment of tools will "compile down" to your config files and you get a knot of pain to endlessly massage...
chenquan
Thank you for sharing your valuable experience; I will think seriously about what you said.
NeutralForest
Pretty much my take any time I see all the convoluted Bicep and YAML we have, since there's a bunch of conditional logic and more baked into our pipelines.
_ink_
So far, this has been exactly my experience as well. Well said.
simgt
I worked on something very similar for inference on video streams. To avoid the limitations of the config files mentioned in a sibling comment, I added a tool to convert a config to plain Rust. Your primary focus has to be the quality of the Rust API; the config files are then just syntactic sugar for getting started or for simpler projects.
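A toy sketch of that config-to-Rust idea (not the actual tool; the stage names and builder API are made up): treat the parsed config as a front end and emit the equivalent calls against the real Rust API.

    // Toy sketch, assuming a hypothetical Pipeline builder API: take a parsed
    // pipeline config and emit the equivalent Rust calls, so the config stays
    // syntactic sugar over the first-class API.
    fn emit_rust(stages: &[(&str, &str)]) -> String {
        let mut out = String::from("Pipeline::new()\n");
        for (kind, arg) in stages {
            out.push_str(&format!("    .{kind}({arg:?})\n"));
        }
        out.push_str("    .run()?;");
        out
    }

    fn main() {
        // Stand-in for a parsed YAML config: (stage kind, argument) pairs.
        let stages = [
            ("source", "rtsp://camera/1"),
            ("infer", "models/yolo.onnx"),
            ("sink", "mqtt://broker:1883"),
        ];
        println!("{}", emit_rust(&stages));
    }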
chenquan
Hi, friend. How did you do it specifically?
abound
Very cool! Seems like a Rust version of something like Bento? [1] Have you done any benchmarking against similar stream processing tools?
regecks
I haven’t benchmarked this, but I have recently benchmarked Spark Streaming vs self-rolled Go vs Bento vs RisingWave (which is also in Rust) and RW matched/exceeded self-rolled, and absolutely demolished Bento and Spark. Not even in the same ballpark.
Highly recommend checking RisingWave out if you have real time streaming transformation use cases. It’s open source too.
The benchmark was a set of high-throughput, low-latency JSON transformations.
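For a sense of the kind of work involved (hypothetical payload; serde_json used just to illustrate the shape of a per-message transform):

    // Hypothetical example of a per-message JSON transform of the sort such
    // benchmarks exercise: reshape a payload and normalize a field.
    use serde_json::{json, Value};

    fn transform(msg: &Value) -> Value {
        json!({
            "user": msg["user_id"],
            // convert a float dollar amount into integer cents
            "total_cents": (msg["total"].as_f64().unwrap_or(0.0) * 100.0).round() as i64,
        })
    }

    fn main() {
        let msg = json!({ "user_id": 42, "total": 19.99 });
        println!("{}", transform(&msg)); // {"total_cents":1999,"user":42}
    }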
chenquan
Thanks for your recommendation.
chenquan
Yes, they are similar. ArkFlow is mainly based on DataFusion; Bento actually comes from Benthos. The ArkFlow project is currently in its early stages and no performance comparisons have been run yet, but I believe ArkFlow will outperform them in the long run.
Benthos: https://github.com/redpanda-data/benthos
DataFusion: https://github.com/apache/datafusion
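For a taste of what building on DataFusion looks like, here is a minimal example of its standard public API (the CSV file name is a made-up placeholder):

    // Minimal DataFusion usage: register a CSV file as a table, run SQL over
    // it, and print the result. "events.csv" is a placeholder.
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;
        let df = ctx.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").await?;
        df.show().await?;
        Ok(())
    }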
agallego
What we found with RPCN (Redpanda Connect)/old Benthos is that most systems are very slow, and only CPU-intensive paths require manual CPU-instruction optimizations, like the Snowflake connector we wrote (https://docs.redpanda.com/redpanda-connect/components/output...). The bulk of the work is just about completeness. Go feels like the Perl of the 2020s: cool little libs for just about everything.
chenquan
Yes, RPCN (Redpanda Connect)/old Benthos is very cool and covers most scenarios. Let me tell you quietly: I'm using it too.
tzm
I love the simplicity of this design
chenquan
- High Performance: Built on Rust and the Tokio async runtime, offering excellent performance and low latency
- Multiple Data Sources: Support for Kafka, MQTT, HTTP, files, and other input/output sources
- Powerful Processing Capabilities: Built-in SQL queries, JSON processing, Protobuf encoding/decoding, batch processing, and other processors
- Extensible: Modular design, easy to extend with new input, output, and processor components (see the sketch below)
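To illustrate the extensibility point, here is a hypothetical sketch of what a pluggable processor component in a modular engine like this might look like; this is not ArkFlow's actual API, and the Processor trait shape is an assumption:

    // Hypothetical sketch, NOT ArkFlow's real API: one plausible shape for a
    // pluggable processor component in a modular stream engine.
    use async_trait::async_trait;

    #[async_trait]
    trait Processor: Send + Sync {
        /// Transform a batch of raw messages into a new batch.
        async fn process(&self, batch: Vec<Vec<u8>>) -> anyhow::Result<Vec<Vec<u8>>>;
    }

    /// Example component: uppercases every message in the batch.
    struct Uppercase;

    #[async_trait]
    impl Processor for Uppercase {
        async fn process(&self, batch: Vec<Vec<u8>>) -> anyhow::Result<Vec<Vec<u8>>> {
            Ok(batch.into_iter().map(|m| m.to_ascii_uppercase()).collect())
        }
    }

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let out = Uppercase.process(vec![b"hello".to_vec()]).await?;
        println!("{}", String::from_utf8(out[0].clone())?);
        Ok(())
    }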
esafak
Is there a product motivation, i.e., a deficiency you seek to rectify in existing solutions?
chenquan
I think a stream processing engine written in Rust will offer better performance, lower latency, more stable service, a lower memory footprint, and cost savings. At the same time, ArkFlow is built on DataFusion, which puts ArkFlow on top of a strong open source community.
winwang
Are there benchmarks you can share? Not discounting Rust, just wondering if you're already seeing some obvious numbers.
heyheyyouyouqq
Reminds me of Pathway https://pathway.com/
Keyframe
yeah, hopefully without the OpenTelemetry spying.
chenquan
Good job, this is a valuable reference.
yu3zhou4
Good job, brother! What do you think you need to implement before it's production-ready?
chenquan
Hi, brother! I'm still thinking about that, but it's certainly not production-ready yet.