Data modeling guide for real-time analytics with ClickHouse
11 comments · September 5, 2025
nrjames
I went down this road with ClickHouse and spent 18 months setting it up for real-time analytics. I'm not sure it's going to stick, primarily because our data transformation and aggregation steps require some gnarly joins across large tables, and ClickHouse does not handle large joins well. The consequence is that the aggregation has to happen in a separate system (currently Snowflake), and when there were changes to what we were processing, it sometimes required gymnastics both in the aggregation layer and inside of ClickHouse to accommodate the change. Denormalizing was rife with tradeoffs, mostly just to make ClickHouse happy. On top of that, we leaned heavily on projections for performance, which is wonderfully automated but also meant waiting for unpredictable background processing during backloads, etc.
We might stick with ClickHouse, but after working with it for a year and a half, I'm curious to see whether a system that handles joins more effectively would be a better fit. To that end, my next R&D project is to set up a vertical slice of our analytics on Apache Doris to see how well it handles a similar workload.
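For readers who haven't used projections, here is a rough sketch of the kind of setup being described: a denormalized table (joins already done upstream) with a projection that pre-aggregates into the dashboard query shape. It assumes the clickhouse-connect Python client, and every table and column name is invented for illustration.

    import clickhouse_connect

    # Connection details are placeholders.
    client = clickhouse_connect.get_client(host="localhost", username="default")

    # Denormalized fact table: the joins were already done upstream (e.g. in
    # Snowflake), so each row carries the dimension attributes it needs.
    client.command("""
        CREATE TABLE IF NOT EXISTS events_denorm
        (
            event_date   Date,
            customer_id  UInt64,
            region       LowCardinality(String),
            product_name LowCardinality(String),
            revenue      Decimal(18, 2),
            -- The projection pre-aggregates into the shape the dashboards ask
            -- for; ClickHouse uses it automatically when a query matches.
            PROJECTION daily_revenue_by_region
            (
                SELECT event_date, region, sum(revenue)
                GROUP BY event_date, region
            )
        )
        ENGINE = MergeTree
        ORDER BY (event_date, customer_id)
    """)

    # Rows inserted before the projection existed (e.g. a backload) are only
    # covered once the projection is materialized, which runs as background
    # mutations -- the unpredictable processing mentioned above.
    client.command(
        "ALTER TABLE events_denorm MATERIALIZE PROJECTION daily_revenue_by_region"
    )

A dashboard query such as SELECT region, sum(revenue) ... GROUP BY event_date, region would then be served from the projection rather than by scanning the base table.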
Merick
Curious: you mention Doris. I wonder if you've looked into StarRocks?
DataWizard1986
+1 to StarRocks. You can cut out a lot of the weight associated with denormalization, which ClickHouse almost forces you to do. Crazy big cluster sizes as well
apwell23
What about Druid? I don't hear much about Druid these days. Has it fallen off?
apwell23
Low latency and high concurrency is a hard problem with large data.
nrjames
We also have a budget issue with Snowflake, so it would help to aggregate elsewhere.
miked98
In our experience, the primary driver of Snowflake costs is not the compute for aggregation, but the compute required for lots of reads/scans.
We recently built a Snowflake-to-ClickHouse pipeline for a customer where aggregates are built hourly in Snowflake, then pushed into a ClickHouse table to power their user-facing dashboards.
By offloading dashboard queries to ClickHouse, they slashed their Snowflake bill by ~25%, which was worth millions to them.
(Admittedly, running aggregations elsewhere—for example, in Spark—could further reduce costs, but you would then need Iceberg to make the tables queryable in Snowflake.)
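As a rough illustration of how small such a pipeline can be, here is a sketch of the hourly push, assuming the snowflake-connector-python and clickhouse-connect client libraries; the warehouse, table, and column names (and the aggregate query itself) are invented, not taken from the actual customer setup.

    import os

    import clickhouse_connect
    import snowflake.connector

    # Hypothetical hourly aggregate computed in Snowflake.
    AGG_SQL = """
        SELECT date_trunc('hour', event_ts) AS bucket,
               region,
               count(*)                     AS events,
               sum(revenue)                 AS revenue
        FROM analytics.events
        WHERE event_ts >= dateadd('hour', -1, current_timestamp())
        GROUP BY 1, 2
    """

    def run_hourly_push() -> None:
        # 1. Build the aggregate in Snowflake; this is the cheap part, since
        #    the expensive dashboard scans move to ClickHouse.
        sf = snowflake.connector.connect(
            account=os.environ["SF_ACCOUNT"],
            user=os.environ["SF_USER"],
            password=os.environ["SF_PASSWORD"],
            warehouse="AGG_WH",
            database="ANALYTICS",
        )
        try:
            rows = sf.cursor().execute(AGG_SQL).fetchall()
        finally:
            sf.close()

        # 2. Push the (now small) aggregate into the ClickHouse table that
        #    backs the user-facing dashboards.
        ch = clickhouse_connect.get_client(
            host=os.environ["CH_HOST"],
            username=os.environ.get("CH_USER", "default"),
            password=os.environ["CH_PASSWORD"],
        )
        ch.insert(
            "dashboard_hourly_agg",
            rows,
            column_names=["bucket", "region", "events", "revenue"],
        )

    if __name__ == "__main__":
        run_hourly_push()

If reruns are a concern, landing the rows in a ReplacingMergeTree keyed on (bucket, region) would make the hourly push idempotent.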