We built a modern data stack from scratch and reduced our bill by 70%
45 comments · March 9, 2025 · snake_doc
moltar
I came from traditional engineering into data engineering by accident and had a similar view. But every time I tried to build a pipeline from first principles, it eventually ended up looking something like this, for a reason. This is especially true when trying to bridge many teams and skillsets - everyone wants their favourite tool.
polskibus
What is the current state of the art (open source) for OLTP-to-OLAP pipelines these days? I don't mean a one-off ETL-style load at night, but a continuous process with relatively low latency.
williamdclt
Idk what the state of the art is, but I've used change data capture with Debezium and Kafka, sinking into Snowflake. Not sure Kafka is the right tool since you don't need persistence, though, and having replication slots makes a lot of operations (e.g. DB engine upgrades) a lot harder.
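If it helps, this is roughly what the setup looks like: you register a Debezium Postgres source against the Kafka Connect REST API. A minimal sketch in Python - hostnames, credentials, topic prefix and table list below are placeholders:

    # Rough sketch: register a Debezium Postgres CDC source with Kafka Connect.
    # Hostnames, credentials, topic prefix and table list are placeholders.
    import json
    import requests

    connector = {
        "name": "orders-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "db.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "app",
            "topic.prefix": "app",            # Debezium 2.x (older versions use database.server.name)
            "slot.name": "debezium_orders",   # the replication slot that makes DB upgrades harder
            "table.include.list": "public.orders",
        },
    }

    resp = requests.post(
        "http://kafka-connect:8083/connectors",  # Kafka Connect REST endpoint
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()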
SkyPuncher
I know it's easy to be critical, but I'm having trouble seeing the ROI on this.
This is a $20k/year savings. Perhaps I'm not aware of the pricing in the Indian market (where this startup is), but that simply doesn't seem like a good use of time. There's a real cost to doing these implementations, both in hard financial dollars (the salaries of the people doing the work) and in the trade-offs of deprioritizing other work.
paxys
The biggest issue IMO is that engineers who work on projects like these inevitably get bored and move on, and then the company is stuck trying to add features, fix bugs and generally untangle the mess, all taking away time and resources from their actual product.
mattbillenstein
Yeah, but you can always make this argument and build nothing - and then deal with all the problems of every 3rd-party SaaS/PaaS under the sun. Sometimes it's much easier to just build the thing; then you know where its limitations are and you can address them over time.
naijaboiler
Even with one engineer working on this for only one month a year, it's still not cost-effective.
TZubiri
>Perhaps, I'm not aware of the pricing in the Indian market
It's approximately 4 annual salaries (non dev)
bob1029
When working with ETL, it really helps to not conflate the letters or worry about them in the wrong order. A lot of the most insane complexity comes out of moving too quickly with data.
If you don't have good staging data after running extraction (i.e., a 1:1 view of the source system data available in your database), there is nothing you can do to help with this downstream. You should stop right there and keep digging.
Extracting the data should be the most challenging aspect of an ETL pipeline. It can make a lot of sense to write custom software to handle this part. It is worth the investment because if you do the extraction really well, the transform & load stages can happen as a combined afterthought [0,1,2,3] in many situations.
This also tends to be one of the fastest ways to deal with gigantic amounts of data. If you are doing things like pulling 2 different tables and joining them in code as part of your T/L stages, you are really missing out on the power of views, CTEs, TVFs, merge statements, etc.
[0] https://learn.microsoft.com/en-us/sql/t-sql/statements/merge...
[1] https://www.postgresql.org/docs/current/sql-merge.html
[2] https://docs.oracle.com/database/121/SQLRF/statements_9017.h...
[3] https://www.ibm.com/docs/en/db2/12.1?topic=statements-merge
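To make the "combined afterthought" concrete, here is a rough sketch of the staging-then-merge pattern (Postgres 15+ MERGE via psycopg2; the table and column names are made up):

    # Rough sketch: once extraction has landed a clean 1:1 staging table,
    # transform + load collapses into a single MERGE per target table.
    # Table and column names are hypothetical; assumes Postgres 15+.
    import psycopg2

    MERGE_SQL = """
    MERGE INTO dim_customer AS t
    USING stg_customer AS s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
        UPDATE SET name = s.name, email = s.email, updated_at = now()
    WHEN NOT MATCHED THEN
        INSERT (customer_id, name, email, updated_at)
        VALUES (s.customer_id, s.name, s.email, now())
    """

    with psycopg2.connect("dbname=warehouse") as conn:
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL)
        conn.commit()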
mulmen
> Extracting the data should be the most challenging aspect of an ETL pipeline.
Why should this be difficult? It’s the easiest part. You run SELECT * and you’re done.
The difficult part is transforming all the disparate upstream systems and their evolving schemas into a useful analytical model for decision support.
bob1029
Not all data lives in a SQL database. Much of the extraction code I write does things like loading flat files from unusual sources and querying APIs.
If the source data is already in a SQL store, then the solution should be obvious. You don't need any other tools to produce the desired view of the business at that point. Transforming an upstream schema is a select statement per target table. This doesn't need to be complicated.
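For the flat-file/API cases, the extraction code mostly just lands the source payloads 1:1 in a staging table before any transform happens. A rough sketch - the endpoint, DSN and table name are hypothetical:

    # Rough sketch: land raw API responses 1:1 in a staging table; transform later in SQL.
    # The endpoint, database DSN and table name are hypothetical.
    import json
    import requests
    import psycopg2

    rows = requests.get("https://api.example.com/v1/invoices", timeout=30).json()

    with psycopg2.connect("dbname=warehouse") as conn:
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO stg_invoices_raw (payload) VALUES (%s)",
                [(json.dumps(r),) for r in rows],
            )
        conn.commit()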
mulmen
Yeah I extract a lot of data out of Dynamo. It’s still the easiest part. Change capture just isn’t complicated. You need some basic constructs and then you’re golden. The data mart design phase is orders of magnitude more effort.
1a527dd5
There is something here that doesn't sit right.
We use BQ and Metabase heavily at work. Our BQ analytics pipeline is several hundred TBs. In the beginning we had data (engineer|analyst|person) run amok and run up a BQ bill of around 4,000 per month.
By far the biggest issues were:
- partition key was optional -> fix: required
- queries bypassing the BQ caching layer -> fix: make queries use deterministic inputs [2]
It took a few weeks to go through each query using the metadata tables [1], but it was worth it. In the end our BQ analysis pricing was down to something like 10 per day.
[1] https://cloud.google.com/bigquery/docs/information-schema-jo...
[2] https://cloud.google.com/bigquery/docs/cached-results#cache-...
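For anyone curious, the audit was roughly queries like this against the jobs metadata [1], plus making the partition filter mandatory. A rough sketch - region, dataset and table names are placeholders:

    # Rough sketch of the BigQuery cost audit; region/dataset/table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Find the heaviest queries over the last 30 days by bytes billed [1].
    audit_sql = """
    SELECT user_email, query, SUM(total_bytes_billed) AS bytes_billed
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
    GROUP BY user_email, query
    ORDER BY bytes_billed DESC
    LIMIT 20
    """
    for row in client.query(audit_sql).result():
        print(row.user_email, row.bytes_billed, row.query[:80])

    # The "partition key required" fix: reject unpartitioned scans on a partitioned table.
    client.query(
        "ALTER TABLE `analytics.events` SET OPTIONS (require_partition_filter = TRUE)"
    ).result()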
ripped_britches
So you saved just $20k per year? I don't know the context of your company, but I'm not sure this turns out to be a net win given the cost of the engineering resources needed to produce this infra gain.
TZubiri
If it's only for cost savings it's a hard sell.
But generally rolling your own has other benefits.
crazygringo
> But generally rolling your own has other benefits.
Not for startups it doesn't. The only rolling-your-own they should be doing is their main product.
Once you get bigger with many hundreds of employees, and existing software starts becoming a measurable blocker, then you can gradually build your own stuff once the tradeoffs actually make sense. But it generally takes a while for that to become the case. For startups, premature optimization isn't the root of all evil, but it's the root of a lot of it.
TZubiri
You might argue that it is never worth it to roll anything on your own (which is already an extreme proposition), but to argue that it has no benefits (other than cost?) is either bad reading comprehension or an overzealous rush to the keyboard to type out the dogma that you should never roll your own, and should instead download 1,000 dependencies plus a dependency manager to handle all the version conflicts.
vood
Rolling your own generally has mainly downsides in the context they are in:
1. This is clearly a small team with very little spend.
2. Tomorrow someone leaves and the next engineer has to manage all of this.
3. I don't think they realize that they actually increased the cost of this service, not decreased it. Now they need to manage their own Kafka every month, and engineering time is expensive.
TZubiri
Yes, rolling your own X has downsides and it has upsides.
Welcome to tradeoff engineering.
jchandra
We did have a discussion on self-managed vs. managed and the TCO associated with each.
1. We have a multi-regional setup, so data sovereignty requirements came up.
2. Vendor lock-in - a few of the services were not available in that geographic region.
3. With managed services, you often pay for capacity you might not always use. Our workloads were consistent and predictable, so self-managed solutions helped us fine-tune our resources.
4. One of the goals was to keep our storage and compute loosely coupled while staying Iceberg-compatible for flexibility. Whether it's Trino today or Snowflake/Databricks tomorrow, we aren't locked in.
mosselman
You'd think that pushing all of the data into any OLAP database, but especially some of the newer Postgres-based ones, would give you all the performance you need at 10% of the cost? Let alone all the maintenance of the mind-boggling architecture drawing.
tacker2000
Is Debezium the only good CDC tool out there? I have a fairly simple data stack and am looking at integrating a CDC solution, but I really don't want to touch Kafka just for this. Are there any easier alternatives?
rockwotj
Redpanda Connect is just a YAML file: it has a CDC input, and you can process the data and send it anywhere - no Kafka required.
I've seen Postgres CDC written directly to Snowflake.
https://docs.redpanda.com/redpanda-connect/components/inputs...
nchmy
Conduit.io is where it's at. FAR more source and destination connectors, easier to deploy etc... Pair it with NATS to replace Kafka's mess
rockwotj
Why Confluent instead of something like MSK, Redpanda or one of the new leaderless, direct-to-S3 Kafka implementations?
thecleaner
I think they do mean Kafka. Anyway, there are connectors from Kafka to a bunch of things, so I think it's a reasonable choice.
reillyse
How much did this cost in engineering time, and how much will it cost to maintain? How about when you need to add a new feature? It seems like you saved roughly $1.5k per month, which pays for a couple of days of engineering time (ignoring product, management, and the costs of maintaining the software).
grayhatter
No idea how many hours it took to build, but I maintain something similar (different stack though), and it's so trivial I don't even count the hours - probably about half a day of maintenance every 3 months?
Even if you needed to invent a new feature, you could invent a month's worth of features every year and still save money.
nxm
Keeping components up to date and dealing with schema changes will easily take more than half a day every 3 months.
reillyse
I know this is off the topic of the actual post, but I'm confused as to why I've been downvoted - other people have since made similar comments, yet my comment was downvoted. Help me out here: why?
vivahir215
Good read.
I do have a question on BigQuery. If you were experiencing unpredictable query costs or customization issues, that sounds like user error. There are ways to optimize, or to commit to slots, to reduce the cost. Did you try that?
jchandra
As for BigQuery, while it's a great tool, we faced challenges with high-volume, small queries where costs became unpredictable, since it is priced per volume of data scanned. Clustered tables and materialized views helped to some extent, but they didn't fully mitigate the overhead for our specific workloads. There are certainly ways to overcome and optimize this, so I wouldn't exactly pin it on BigQuery or its limitations.
It’s always a trade-off, and we made the call that best fit our scale, workloads, and long-term plans
throwaway7783
Did you consider slots based pricing model for BQ?
vivahir215
Hmm, Okay.
I am not sure whether managing a Kafka Connect cluster is too expensive in the long term. This solution might work for you based on your needs, but I would suggest looking at alternatives.
cratermoon
AKA The Monty Hall Rewrite https://alexsexton.com/blog/2014/11/the-monty-hall-rewrite
lifeisstillgood
This is a great concept and of course implies the very real idea that you should just rewrite a lot of your stuff anyway …
These just seem like over-engineered solutions trying to guarantee job security. When the dataflows are this straightforward, just replicate into the OLAP of your choice and transform there.