Pg_lake: Postgres with Iceberg and data lake access
24 comments · November 4, 2025
ozgune
This is huge!
When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”
Crunchy’s Postgres extension is by far the most advanced solution in the market.
Huge congrats to Snowflake and the Crunchy team on open sourcing this.
anentropic
When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this
It's great that I can run this locally in a Docker container, but I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account.
gajus
Man, we are living in the golden era of PostgreSQL.
ayhanfuat
With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.
dkdcio
I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! Very interesting development in the data lake space alongside DuckLake and things.
pgguru
You create foreign tables in postgres using either the pg_lake_table wrapper or pg_lake_iceberg.
Once those tables exist, queries against them can either be pushed down entirely to the remote tables, using a Custom Scan to execute and pull the results back into Postgres, or we transform/extract the pieces that can be executed remotely via the FDW and treat that as a tuple source.
In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.
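A minimal sketch of what that looks like in practice; the server and option names below are assumptions for illustration, not necessarily the exact pg_lake API:

```sql
-- Illustrative sketch only: exact server/option names may differ in pg_lake.
CREATE EXTENSION IF NOT EXISTS pg_lake;

-- A foreign table over Parquet files in object storage via the pg_lake_table
-- wrapper (the 'path' option and wildcard location are hypothetical).
CREATE FOREIGN TABLE events (
    event_id   bigint,
    user_id    bigint,
    created_at timestamptz
)
SERVER pg_lake_table
OPTIONS (path 's3://my-bucket/events/*.parquet');

-- Queried like any other Postgres table; the parts that can run remotely
-- are pushed down to the DuckDB-backed engine.
SELECT user_id, count(*)
FROM events
WHERE created_at >= now() - interval '7 days'
GROUP BY user_id;
```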
dharbin
Why would Snowflake develop and release this? Doesn't this cannibalize their main product?
kentm
It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending toward heterogeneous compute with a shared storage backend -- you have large, autoscaling systems to process the raw data down to something that is usable by a smaller, cheaper query engine supporting UIs/services.
hobs
But if you are used to this kind of compute per dollar, what on earth would make you want to move to Snowflake?
oulipo2
Interesting! How does it compare with DuckLake?
mslot
You could say:
- With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.
- With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.
Of course, they also use different table formats (though similar in the data layer) with different pros and cons, and the query frontends differ in significant ways.
An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.
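As a hedged illustration of the standalone-catalog point: something like the following creates an Iceberg table through the pg_lake_iceberg wrapper (the option names are assumptions), and an engine like Spark could then reach the same table by pointing Iceberg's JDBC catalog at this Postgres instance:

```sql
-- Illustrative sketch: the 'location' option and exact syntax are assumptions.
CREATE FOREIGN TABLE orders (
    order_id   bigint,
    amount     numeric,
    ordered_at timestamptz
)
SERVER pg_lake_iceberg
OPTIONS (location 's3://my-bucket/warehouse/orders');

-- Reads and writes go through Postgres, while the Iceberg data and metadata
-- live in object storage; external engines can discover the table via the
-- Iceberg JDBC catalog (jdbc:postgresql://...) with no separate catalog service.
INSERT INTO orders VALUES (1, 19.99, now());
SELECT count(*) FROM orders;
```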
chaps
I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?
beoberha
Curious why pgduck_server is a totally separate process?
rmnclmnt
The README explains it:
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
pgguru
Beyond what has been pointed out from the README, also:
- Separation of concerns: with a single external process we can share object store caches without complicated locking dances between multiple processes.
- Memory limits are easier to reason about with a single external process.
- Postgres backends end up being more robust, as you can restart the pgduck_server process separately.
dkdcio
from the README:
> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.
mberning
Does anyone know how access control works to the underlying s3 objects? I didn’t see anything regarding grants in the docs.
mslot
There are Postgres roles for read/write access to the S3 objects that DuckDB has access to. Those roles can create tables from specific files or at specific locations, and can then assign more fine-grained privileges to other Postgres roles (e.g. read access on a specific view or table).
pgguru
Hi, one of the developers here. You define credentials that can access the S3 buckets and use those as DuckDB secrets, usually in an init script for pgduck_server. (You can see some examples of this in the testing framework.)
I'll see if we can improve the docs, or highlight that part better if it is already documented; we did move some things around prior to release.
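For concreteness, a sketch of what such an init script might contain, using DuckDB's secrets syntax; the bucket, credentials, and scoping here are placeholders, and how pg_lake actually wires them up may differ:

```sql
-- DuckDB secret definitions, e.g. run at pgduck_server startup.

-- Static credentials scoped to one bucket (values are placeholders):
CREATE SECRET my_s3_secret (
    TYPE s3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1',
    SCOPE 's3://my-bucket'
);

-- Or defer to the ambient AWS credential chain (instance profile, env vars, etc.):
CREATE SECRET my_iam_secret (
    TYPE s3,
    PROVIDER credential_chain
);
```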
onderkalaci
Maybe this could help: https://github.com/Snowflake-Labs/pg_lake?tab=readme-ov-file...
mberning
Interesting. I am working on a project to integrate access management for Iceberg/Parquet files with SageMaker: controlling what users logged into SageMaker Studio have access to in S3. It's fine using static policies for an MVP, but eventually it needs to be dynamic and integrated with enterprise IAM tools. Those tools generally have great support for managing SQL grants. Not so much for S3 bucket policies.
pgguru
DuckDB secrets management supports custom IAM roles and the like; at this point we are basically treating the pgduck_server external system as a black box.
For the Postgres grants themselves, we provide privs to allow read/write on the remote tables, granted via `pg_lake_read`, `pg_lake_write`, or `pg_lake_read_write`. This is a blanket all-or-nothing grant, however, so it would need some design work/patching to support per-relation grants, say.
(You could probably get away with making roles in postgres that have the appropriate read/write grant, then only granting those specific roles to a given relation, so it's probably doable though a little clunky at the moment.)
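A rough sketch of that workaround, with role, view, and table names made up for illustration; whether view-based narrowing composes cleanly with pg_lake's blanket grants is an assumption, not something the docs confirm:

```sql
-- The blanket grants named above: all-or-nothing access to pg_lake tables.
GRANT pg_lake_read_write TO etl_service;  -- full read/write
GRANT pg_lake_read TO reporting_admin;    -- full read-only

-- The clunky-but-workable idea for narrower access: keep the blanket grant on
-- an owning role and expose individual relations to other users through views
-- owned by that role, granting SELECT per view.
CREATE ROLE lake_owner;
GRANT pg_lake_read TO lake_owner;
CREATE VIEW orders_public AS
    SELECT order_id, amount FROM orders;   -- 'orders' is a pg_lake table
ALTER VIEW orders_public OWNER TO lake_owner;
GRANT SELECT ON orders_public TO analyst;  -- analyst has no blanket grant
```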
This is really nice though, looking at the code - a lot of the Postgres types are missing, as well as a lot of the newer Parquet logical types - but this is a great start and a nice use of FDWs.