Bauplan – Git-for-data pipelines on object storage
45 comments · April 16, 2025 · jtagliabuetooso
anentropic
I am very interested in this but have some questions after a quick look
It mentions "Serverless pipelines. Run fast, stateless Python functions in the cloud." on the home page... but it took me a while of clicking around to figure out exactly what the deployment model is
e.g. is it the cloud provider's own "serverless functions"? or is this a platform that maybe runs on k8s and provides its own serverless compute resources?
Under examples I found https://docs.bauplanlabs.com/en/latest/examples/data_product... which shows running a cli command `serverless deploy` to deploy an AWS Lambda
for me, deploying to a regular Lambda function is a plus, but this example raises more questions...
https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h... doesn't show any 'serverless' or 'deploy' command... presumably the example is using an external tool i.e. the Serverless framework?
which is fine, great even - I can presumably use my existing code deployment methodology like CDK or Terraform instead
Just suggesting that the underlying details could be spelled out a bit more up front.
In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?
I like it
Last question is re here https://docs.bauplanlabs.com/en/latest/tutorial/index.html
> "Need credentials? Fill out this form to get started"
Should I understand therefore that this is only usable with an account from bauplanlabs.com ?
What does that provide? There's no pricing mentioned so far - what is the model?
zenlikethat
> or is this a platform that maybe runs on k8s and provides its own serverless compute resources?
This one, although it’s a custom orchestration system, not Kubernetes. (there are some similarities but our system is really optimized for data workloads)
We manage Iceberg for easy data versioning, take care of data caching and Python modules, etc., and you just write some Python and SQL and exec it over your data catalog without having to worry about Docker and all the infra stuff.
I wrote a bit on what the efficient SQL half takes care of for you here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...
> In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?
Philosophically, yes. In practice so far we manage the machines in separate AWS accounts _for_ the customers, in a sort of hybrid approach, but the idea is not dissimilar.
> Should I understand therefore that this is only usable with an account from bauplanlabs.com ?
Yep. We’d help you get started and use our demo team. Send jacopo.tagliabue@bauplanlabs.com an email
RE: pricing. Good question. Early startup stage bespoke at the moment. Contact your friendly neighborhood Bauplan founder to learn more :)
anentropic
So there's no self-hosted option?
I think currently the docs are lacking some context if you arrive there via a link rather than via your SaaS home page
esafak
It is a service, not an open source tool, as far as I can tell. Do you intend to stay that way? What is the business model and pricing?
I am a bit concerned that you want users to swap out both their storage and workflow orchestrator. It's hard enough to convince users to drop one.
How does it compare to DuckDB or Polars for medium data?
barabbababoon
- Yes, it is a service, and at least the runner will stay that way for the time being.
- We are not quite live yet, but the pricing model is based on compute capacity and is divided into tiers (e.g. small = 50GB for concurrent scans = $1500/month; large can go up to a TB). Infinite queries, infinite jobs, infinite users. The idea is to have very clear pricing with no sudden increases due to volume.
- You do not have to swap your storage - our runner comes to your S3 bucket and your data never has to live anywhere other than your S3.
- You do not have to swap your orchestrator either. Most of our clients actually use it with their existing orchestrator. You call the platform's APIs, including run, from your Airflow/Prefect/Temporal tasks (see the sketch below): https://www.prefect.io/blog/prefect-on-the-lakehouse-write-a...
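As a rough illustration (method names follow the SDK as discussed in this thread; exact signatures and the ./my_pipeline path are placeholders), kicking off a run from a Prefect task could look roughly like this:

    # Sketch only: Client.run follows the Bauplan SDK as discussed here,
    # exact signatures may differ; ./my_pipeline is a placeholder project dir.
    import bauplan
    from prefect import flow, task

    @task
    def run_bauplan_pipeline(project_dir: str, branch: str):
        client = bauplan.Client()  # credentials come from your environment/profile
        # execute the pipeline in `project_dir` against the given data branch
        return client.run(project_dir=project_dir, ref=branch)

    @flow
    def nightly_lakehouse_job():
        run_bauplan_pipeline("./my_pipeline", "main")

    if __name__ == "__main__":
        nightly_lakehouse_job()

The same pattern works from an Airflow PythonOperator or a Temporal activity: the orchestrator keeps the scheduling, Bauplan does the data work.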
Does it help?
zenlikethat
Yep, staying service.
RE: workflow orchestrators. You can use the Bauplan SDK to query, launch jobs and get results from within your existing platform; we don’t want to replace it entirely if that doesn’t fit for you, just to augment it.
RE: DuckDB and Polars. It literally uses DuckDB under the hood, but with two huge upgrades: one, we plug into your data catalog for really efficient scanning, even on massive data lakehouses, before anything hits the DuckDB step. Two, we do efficient data caching: query results, intermediate scans and so on can be reused across runs.
More details here: https://www.bauplanlabs.com/blog/blending-duckdb-and-iceberg...
As for Polars, you can easily use Polars itself within your Python models by specifying it in a pip decorator. We install all requested packages for each Python model.
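To make that concrete, a model pulling in Polars could look something like this (decorator and argument names approximate the documented API, and the table/column names are placeholders - double-check the docs for the exact syntax):

    # Sketch of a Python model that requests Polars via a pip decorator.
    import bauplan

    @bauplan.model()                                  # this function becomes a pipeline step
    @bauplan.python("3.11", pip={"polars": "1.9.0"})  # per-step interpreter + packages
    def cleaned_trips(trips=bauplan.Model("taxi_fhvhv")):
        import polars as pl

        df = pl.from_arrow(trips)                     # the runtime hands us an Arrow table
        out = df.filter(pl.col("trip_miles") > 1.0)
        return out.to_arrow()                         # hand an Arrow table back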
pablomendes
In what kinds of workloads or usage patterns do you see the biggest performance gains vs traditional FaaS + storage stacks?
jtagliabuetooso
In a nutshell, data and AI workloads require fast re-building and vertical scaling:
1) you should not need to redeploy a Lambda if you're now running January and February vs only January before. In the same vein, you should not need to redeploy a Lambda if you upgrade from pandas to Polars: rebuilding functions is 15x faster than Lambda, 7x faster than Snowpark (-> https://arxiv.org/pdf/2410.17465)
2) the only way (even in popular orchestrators, e.g. Airflow, not just FaaS) to pass data around in DAGs is through object storage, which is slow and costly: we use Arrow as the intermediate data format and over the wire, with a bunch of optimizations in caching and zero-copy sharing to make the development loop extra fast and the compute usage efficient!
Our current customers run near real-time analytics pipelines (Kafka -> S3 / Iceberg -> Bauplan run -> Bauplan query), DS / AI workloads and WAP for data ingestion.
sbpayne
I have really enjoyed the conversations I have had with Jacopo and Ciro over the years. They have revisited a lot of assumptions behind commonly used tools/infrastructure in the data space and built something that has a much better developer experience.
So excited to see them take this step!
jtagliabuetooso
Thanks @sbpayne <3
korijn
How does this compare to dbt? Seems like it can do the same?
zenlikethat
Some similarities, but Bauplan offers:
1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.
Also, you can request any Python package you want, and even have different Python versions and packages in different workflow steps.
2. Catalog integration. Safely make changes and run experiments in branches.
3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and pass data between steps as zero-copy Arrow tables. This means your DAGs run a lot faster because the time spent shuffling bytes around is minimal.
laminarflow027
To me they seem like the pythonic version of dbt! Instead of yaml, you write Python code. That, and a lot of on-the-fly computations to generate an optimized workflow plan.
barabbababoon
Plenty of stuff in common with dbt's philosophy. One big thing though: dbt does not run your compute or manage your lake. It orchestrates your code and pushes it down to a runtime (e.g. 90% of the time Snowflake).
This IS a runtime.
You import bauplan, write your functions and run them straight in the cloud - you don't need anything more. When you want to make a pipeline, you chain the functions together, and the system manages the dependencies, the containerization, the runtime, and gives you git-like abstractions over runs, tables and pipelines.
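As a rough sketch of what that chaining looks like (decorator and Model() names approximate the documented API; the table and column names - orders, status, amount - are made up for illustration):

    import bauplan

    @bauplan.model()
    def completed_orders(orders=bauplan.Model("orders")):
        # `orders` arrives as an Arrow table read from the catalog
        import pyarrow.compute as pc
        return orders.filter(pc.equal(orders["status"], "completed"))

    @bauplan.model()
    def top_customers(completed=bauplan.Model("completed_orders")):
        # depends on the step above by name, so the two form a small DAG
        import pyarrow as pa
        df = completed.to_pandas()
        out = (df.groupby("customer_id", as_index=False)["amount"]
                 .sum()
                 .nlargest(10, "amount"))
        return pa.Table.from_pandas(out)

Running the project then executes both steps in order, with the system handling containers, packages and the data movement between them.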
dijksterhuis
the big question i have is — where is the code executed? “the cloud”? whose cloud? my cloud? your environment on AWS?
the paper briefly mentions “bring your own cloud” in 4.5 but the docs page doesn’t seem to have any information on doing that (or at least none that i can find).
zenlikethat
The code you execute on your data currently runs in a per-customer AWS account managed by us. We leave the door open for BYOC based on the architecture we’ve designed, but due to lean startup life, that’s not an option yet. We’d definitely be down to chat about it
buremba
Looks interesting! Bauplan seems like a mix of an orchestration engine and a data warehouse. It's similar to Motherduck as it runs DuckDB on managed EC2, with more data engineer-focused branching and Python support similar to SQLMesh.
It's interesting that most vendors run compute in their own managed accounts instead of BYOC, though. I understand it's hard for vendors to manage compute in the customer's cloud, but I was under the impression that it's a no-go for most enterprise companies. Maybe I'm wrong?
jtagliabuetooso
Correct.
Unlike warehouses or SQL lakehouses, we also run any Python code, including from your private AWS repositories for example, through a simple decorator, while giving you transactional pipelines, fully versioned and revertible, like it's a database on your S3.
Wrt deployment, I think things are a bit more nuanced: we are SOC 2 compliant and provide an enterprise-ready control plane vs data plane separation - data is only processed in a single-tenant VPC, which is connected to your account via PrivateLink, effectively making it the same account networking-wise. If you insist on having the data plane in your own account, the architecture supports that, as our only data plane dependency is VMs (we install our own custom runtime there!).
To give you a sense, one of our large customers is a $4BN/year broadcaster with tens of millions of users, and they run with the above AWS security posture.
Happy to answer more offline if you're curious (jacopo.tagliabue@bauplanlabs.com)
tech_ken
The Git-like approach to data versioning seems really promising to me, but I'm wondering what those merge operations are expected to look like in practice. In a coding environment, I'd review the PR basically line-by-line to check for code quality, engineering soundness, etc. But in the data case it's not clear to me that a line-by-line review would be possible, or even useful; and I'm also curious about what (if any) tooling is provided to support it?
For example: I saw the YouTube video demo someone linked here where they had an example of a quarterly report pipeline. Say that I'm one of two analysts tasked with producing that report, and my coworker would like to land a bunch of changes. Say in their data branch, the topline report numbers are different from `main` by X%. Clearly it's due to some change in the pipeline, but it seems like I will still have to fire up a notebook and copy+paste chunks of the pipeline to see step-by-step where things are different. Is there another recommended workflow (or even better: provided tooling) for determining which deltas in the pipeline contributed to the X% difference?
zenlikethat
That’s a great question. Diffing is one area we’ve thought a bit about but still need to dedicate more cycles to. One thing I would be curious about is: what are you doing in these notebooks to check? For what it’s worth, you could possibly have an intermediate Python model that does some calculation to look at differences and materializes the results to a table, which you could then query directly for further insight.
One thing we do have is support for “expectations” — model-like Python steps that check data quality and can flag the run if the pipeline violates them.
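For a feel of what such an expectation step might look like (decorator names are approximated from the docs; the column name is just an example):

    import bauplan

    @bauplan.expectation()
    @bauplan.python("3.11")
    def no_null_customer_ids(data=bauplan.Model("top_customers")):
        # flag the run if any customer_id is null; a failing assert (or a
        # falsy return) is how the violation gets surfaced
        nulls = data["customer_id"].null_count
        assert nulls == 0, f"found {nulls} rows with a null customer_id"
        return True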
tech_ken
> For what it’s worth, could possibly have an intermediate Python model that...materializes the results to a table
I think this is kind of the answer I was looking for, and in other systems I've actually manually implemented things like this with a "temp materialize" operator that's enabled by a "debug_run=True" flag. With the NB thing, basically I'm trying to "step inside" the data pipeline, like how an IDE debugger might run a script line by line until an error hits and then drop you into a REPL located 'within' the code state. In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes. I think the dream for me would be to have something that can say "the delta on line N is responsible for X% of the variance in the output", although I recognize that's probably not a well defined calculation in many cases. But either way, at a high level my goal is to understand why my data changes, so I can be confident that those changes are legit and not an artifact of some error in the pipeline.
Asserting that a set of expectations is met at multiple pipeline stages also gets pretty close, although I do think it's not entirely the same. Seems loosely analogous to the difference between unit and integration/E2E tests. Obviously I'm not going to land something with failing unit tests, but even if tests are passing the delta may include more subtle (logical) changes which violate the assumptions of my users or integrated systems (ex. that their understanding of the business logic is aligned with what was implemented in the pipeline).
jtagliabuetooso
"In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes."
You can automate many changes / tests by materializing the parent(s) of the target table and using the SDK to produce variations of a pipeline programmatically. If your pipeline has a free parameter (say top-k=5 for some algorithm), you could just write a Python for loop that calls client.create_branch() and client.run() for each variation, materializing k versions at the end that you can inspect with client.query("SELECT MAX ..."). Something like:
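(Exact SDK signatures and the parameters argument below are illustrative; table and column names are placeholders.)

    import bauplan

    client = bauplan.Client()

    for k in (3, 5, 10):
        branch = f"experiments.topk_{k}"
        client.create_branch(branch, from_ref="main")
        # same code, same data sources; only the free parameter changes
        client.run("./my_pipeline", ref=branch, parameters={"top_k": k})
        # inspect each materialized variation directly
        result = client.query("SELECT MAX(score) AS best FROM my_table", ref=branch)
        print(k, result.to_pandas())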
The broader concept is that every operation in the lake is immutably stored with an ID, so every run can be replicated with the exact same data sources and the exact same code (even if not committed to GitHub), which also means you can run the same code varying the data source, or run different code on the same data: all zero-copy, all in production.
As for the semantics of merge and other conflicts, we will be publishing some new research by the end of the summer: look out for a new blog post and paper if you like this space!
whinvik
How do you compare with DVC and LakeFS?
jtagliabuetooso
Thanks for the question!
On the data side of things, DVC is more about versioning static datasets / local files, while Bauplan manages your entire lakehouse, potentially hundreds of tables, with point-in-time versioning (time travel) and branching (different versions of the same table at any given time) -> https://docs.bauplanlabs.com/en/latest/tutorial/02_catalog.h....
On the compute side of things, Bauplan runs the functions for you, unlike catalogs, which only see a partial truth and provide only a piece of the puzzle: Bauplan knows both your code (because it runs your pipeline) and your data (because it handles all the commits on the lakehouse), which allows a one-liner reply to questions such as:
"Who changed this table on this branch, when, and with which code?"
It also allows a lot of optimizations in multi-player mode, such as efficient caching of data (https://arxiv.org/abs/2411.08203) and packages (https://arxiv.org/pdf/2410.17465).
russellthehippo
Congrats on the more official launch! Super promising - the first product that pairs dbt-type data organization/orchestration capabilities with a compute layer worthy of replacing existing data warehouses / Python environments.
jtagliabuetooso
Glad to see it resonates, especially the Python part <3
rustyconover
I'd love to see a 10 minute YouTube video of the capabilities of this product.
jtagliabuetooso
Thanks for your interest! Aside from the demo video on the home page, our quick start takes <3 minutes, which is way less! Just ask for an invite to the free sandbox on our website.
If you love videos and would like to understand the decisions behind it, the GeekNarrator episode is a good start: https://www.youtube.com/watch?v=8aMm7RHEgIw&t=4812s
For real-world enterprise grade deployment stories, just use our blog or reach out to any of us to learn more (jacopo.tagliabue@bauplanlabs.com).
mehdmldj
Not really 10 minutes, but here is what you're looking for: https://www.youtube.com/watch?v=Di2AkSmitTc
vira28
For someone like me (who is not an ML expert, but can write Python fluently) Bauplan looks like an ideal fit. Looking forward to taking a deeper look and building something in production.
jtagliabuetooso
Glad to see it resonates! Happy to help if needed!
gigatexal
I’m intrigued but what’s the pricing going to be? What am I paying for? Something to make faas easier? What’s the magic behind the scenes?
jtagliabuetooso
The pricing is a bit bespoke at the moment as we work closely with our customers - you can reach out at any time to any of us for a chat (jacopo.tagliabue@bauplanlabs.com). The general driver is just compute capacity: how many resources do you want to have available at any point in time?
My usual suggestion is to get a feel for the APIs and capabilities on the public sandbox on our home page, which is free and comes with a lot of examples and datasets to start from!
As for the magic, the reasons behind building a FaaS runtime and the main optimizations have been shared with the community in a few recent papers - e.g. https://arxiv.org/pdf/2410.17465 and https://arxiv.org/abs/2411.08203 - and in deep-dive podcasts (e.g. https://www.youtube.com/watch?v=gPJvgkHIEBY).
If you want to geek out more, just reach out!
davistreybig
This is, from first principles, where data infrastructure should go in terms of developer ergonomics.
redskyluan
Amazing, I've been seeking a similar service for years.
jtagliabuetooso
Glad it resonates! Happy to help answer any questions you may have - the sandbox on our home page is free to try: just ask for an invite!
Looking to get feedback for a code-first platform for data: instead of custom frameworks, GUIs, and notebooks on a cron, Bauplan runs SQL / Python functions from your IDE, in the cloud, backed by your object storage. Everything is versioned and composable: time travel, git-like branches, scriptable meta-logic.
Perhaps surprisingly, we decided to co-design the abstractions and the runtime, which allowed novel optimizations at the intersection of FaaS and data - e.g. rebuilding functions can be 15x faster than the corresponding AWS stack (https://arxiv.org/pdf/2410.17465). All capabilities are available to humans (CLI) and machines (SDK) through simple APIs.
Would love to hear the community’s thoughts on moving data engineering workflows closer to software abstractions: tables, functions, branches, CI/CD etc.