
Streaming Joins Are Hard

14 comments · October 22, 2024

mattxxx

JOINs are just hard, period. When you're operating at large scale, you need to think carefully about exactly how to partition and index your data for the kinds of queries you want to write with JOINs.

Streaming joins are so hard that they're an anti-pattern. If you're using external storage to make one work, then your architecture has probably gone badly wrong, or you're using streams for something you shouldn't.

crazygringo

Can someone explain what the use case is for streaming joins in the first place?

I've written my fair share of joins in SQL. They're indispensable.

But I've never come across a situation where I needed to join data from two streams in real time as they're both coming in. I'm not sure I even understand what that's supposed to mean conceptually.

It's easy enough to dump streams into a database and query the database but clearly this isn't about that.

So what's the use case for joins on raw stream data?

GeneralMayhem

Event correlations are a typical one. Think about ad tech: you want every click event to be hydrated with information about the impression or query that led to it. Both of those are high-volume log streams.

You want to end up with the results of:

```
select *
from clicks
left join impressions on (clicks.impression_id = impressions.id)
```

but you want to see incremental results - for instance, because you want to feed the joined rows into a streaming aggregator to keep counts as up to date as possible.
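That incremental left join can be pictured with a toy sketch (hedged: the event shapes and the `window_size` bound are my own assumptions, not anyone's actual implementation; real systems persist and shard this state):

```python
from collections import deque

def stream_left_join(events, window_size=10_000):
    """Incrementally left-join clicks against impressions.

    `events` is one interleaved stream of ("impression", payload) and
    ("click", payload) tuples. Impressions are buffered (bounded, FIFO)
    so later clicks can be hydrated; joined rows are yielded as soon as
    a click arrives, without waiting for the stream to end.
    """
    impressions = {}   # impression_id -> impression payload
    order = deque()    # insertion order, for evicting the oldest entries
    for kind, payload in events:
        if kind == "impression":
            impressions[payload["id"]] = payload
            order.append(payload["id"])
            if len(order) > window_size:           # bound the join state
                impressions.pop(order.popleft(), None)
        elif kind == "click":
            match = impressions.get(payload["impression_id"])
            yield (payload, match)                 # match is None if missing/evicted
```

Each yielded pair can then feed a downstream streaming aggregator, which is exactly the "keep counts as up to date as possible" motivation above.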

tshaddox

Isn't the use case just any time you want a client to essentially subscribe to a SQL query and receive a message every time the result of that query changes?

BeefWellington

I'll use a contrived example here to explain what the value of streaming the data itself is.

Let's say you run a large installation with a variety of very important gauges and sensors. Due to the size and complexity of the installation, readings from these gauges and sensors need to be fed back to a console somewhere so that an overseer of sorts can get the big-picture view and ensure the installation is fully healthy.

For that scenario, if you look at your data in the sense of a typical RDBMS / data warehouse, you would probably want to save as much over-the-wire traffic as possible to ensure there are no delays in getting the sensor information fed into the system reliably and on time. So you trim things down to just a station ID and some readings coming into your "fact" table (it could be modeled more transactionally, but it mostly fits the same bill).

Basically, streaming is useful so that you can live-scroll the recordset in near-realtime as data comes in. Your SQL query becomes more of an infinite cursor.

Older ways of doing this worked just fine on SQL databases: typically you'd have some kind of record marker, whether ROWID, a DateTime, etc., and you'd just reissue an identical query to get the newer records. That introduces some overhead, though, and the streaming approach largely minimizes or eliminates it.
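That record-marker polling pattern can be sketched as follows (a minimal sketch only; the `readings` table, its columns, and the choice of SQLite are my own stand-ins, not anything from the thread):

```python
import sqlite3

def poll_new_readings(conn, last_rowid):
    """Re-issue the same query, restricted to rows past the marker we saw
    last time. Returns (new_rows, new_marker)."""
    rows = conn.execute(
        "SELECT rowid, station_id, reading FROM readings"
        " WHERE rowid > ? ORDER BY rowid",
        (last_rowid,),
    ).fetchall()
    new_marker = rows[-1][0] if rows else last_rowid
    return rows, new_marker
```

Every poll pays for a fresh query and the round trip even when nothing is new; a push-based stream is what removes that overhead.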

hotstickyballs

Should’ve just cached the output of group bys.

closeparen

Anything you can do with stateful streaming technology, you can do with a database and a message handler. It’s just a question of programming model and scaling characteristics. You typically get an in-process embedded DB per shard, with an API that makes it seem closer to managing state in memory.
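A toy version of that equivalence, with a plain dict standing in for the per-shard embedded store (all names here are mine, not from any framework; a real system would persist and checkpoint the state):

```python
class CountByKeyHandler:
    """A message handler plus a keyed store: the database-and-handler
    equivalent of a streaming GROUP BY count."""

    def __init__(self):
        # key -> running count; an embedded per-shard DB in real systems
        self.state = {}

    def handle(self, message):
        key = message["key"]
        self.state[key] = self.state.get(key, 0) + 1
        # emit the updated aggregate downstream
        return key, self.state[key]
```

The programming model really is just "look up state, update it, emit"; what the streaming frameworks add is sharding, checkpointing, and an API that makes the store feel like in-memory state.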

10000truths

Streams are conceptually infinite, yes, but many streaming use cases are dealing with a finite amount of data that's larger than memory but fits on disk. In those cases, you can typically get away with materializing your inputs to a temporary file in order to implement joins, sorts, percentile aggregations, etc.
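A minimal sketch of that spill-to-disk idea, materializing one finite side to an indexed SQLite temp file and streaming the other side against it (function and table names are my own assumptions; right-side keys are assumed unique):

```python
import os
import sqlite3
import tempfile

def disk_backed_join(left_rows, right_rows):
    """Materialize the right input (finite, possibly larger than memory)
    to an on-disk table, then stream the left input against it, yielding
    (key, left_value, right_value_or_None). Inputs are (key, value) pairs."""
    with tempfile.TemporaryDirectory() as d:
        conn = sqlite3.connect(os.path.join(d, "spill.db"))
        # PRIMARY KEY doubles as the lookup index; assumes unique keys
        conn.execute("CREATE TABLE right_side (k INTEGER PRIMARY KEY, v TEXT)")
        conn.executemany("INSERT INTO right_side VALUES (?, ?)", right_rows)
        conn.commit()
        for k, v in left_rows:
            hit = conn.execute(
                "SELECT v FROM right_side WHERE k = ?", (k,)
            ).fetchone()
            yield (k, v, hit[0] if hit else None)
        conn.close()
```

This only works because the materialized side is finite; it is exactly the move that stops working for a truly unbounded stream.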

fifilura

A couple of years ago Materialize had all the buzz; I'm not sure what the difference is.

https://materialize.com/

neeleshs

"Unlike batch tables, streams are infinite. You can't "just wait" for all the rows to arrive before performing a join."

I view batch tables as simply a given state of some set of streams at a point in time. Running the same query against "batch" tables at different points in time yields different results (assuming the table is churning over time).

jdelman

The ability to express joins in terms of SQL with Estuary is pretty cool. Flink can do a lot of what is described in this post, but you have to set up a lot of intermediate structures, write a lot of Java/Scala, and store your state as protos to support backwards compatibility. Abstracting all of that away would be a huge time saver, but I imagine not having fine grained control over the results and join methods could be frustrating.

fiddlerwoaroof

Flink does have a SQL join now that you can make work. Streaming joins remain a hard problem, though, and, IMO, SQL doesn't map nicely onto streaming systems.

hamandcheese

It seems intuitive to me that a correct streaming join is impossible without an infinite buffer and strong guarantees on how events are ordered. The number of real-world systems offering both of those guarantees is zero. Anyone espousing streaming joins as a general solution should be avoided at all costs, particularly if they have a title that contains "architect" or "enterprise" (god forbid both in the same title).

At best, it is a trick to be applied in very specific circumstances.
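The bounded-buffer failure mode behind that claim can be made concrete (a toy sketch with invented event tuples, not any real system): once the buffer is finite, a late-arriving event whose match was evicted simply produces no join result.

```python
from collections import OrderedDict

def windowed_join(events, buffer_limit):
    """Inner-join 'left' events against 'right' events on key, keeping at
    most `buffer_limit` right-side events. A left event whose match has
    already been evicted is silently dropped: bounded memory is bought
    with incorrect results."""
    right = OrderedDict()   # key -> value, insertion-ordered for eviction
    out = []
    for side, key, value in events:
        if side == "right":
            right[key] = value
            if len(right) > buffer_limit:
                right.popitem(last=False)   # evict the oldest entry
        else:  # "left"
            if key in right:
                out.append((key, value, right[key]))
            # else: the match was never seen or was evicted -> row lost
    return out
```

With a large enough buffer the join is complete; shrink the buffer and the same event stream quietly loses rows, which is the trade-off every real windowed join makes.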