
Apache Iceberg: the Hadoop of the modern data stack?

hendiatris

This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.
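To make those knobs concrete, here is a minimal sketch (my own illustration, not the commenter's actual setup) of tuning row-group size and dictionary encoding with pyarrow and then querying the file with DuckDB. Column names and sizes are made up, and column-level bloom-filter options vary by writer, so they are left out here.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

# Hypothetical sensor-style table; names and sizes are invented for the example.
table = pa.table({
    "sensor_id": pa.array(["sensor-1", "sensor-2"] * 50_000),  # low cardinality: dictionary/RLE friendly
    "ts":        pa.array(range(100_000), type=pa.int64()),
    "value":     pa.array([0.5] * 100_000),
})

pq.write_table(
    table,
    "readings.parquet",
    row_group_size=50_000,         # smaller row groups -> finer-grained pruning (at the cost of more metadata)
    use_dictionary=["sensor_id"],  # dictionary-encode only the repetitive column
    compression="zstd",
)

# DuckDB reads only the row groups and columns the query actually touches.
con = duckdb.connect()
print(con.sql("SELECT count(*) FROM 'readings.parquet' WHERE sensor_id = 'sensor-1'"))
```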

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.
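I am not aware of a dedicated plan viewer either, but engines such as Spark expose Iceberg's metadata tables, which surface the per-file statistics the planner prunes on. A hedged sketch, assuming a Spark session already configured with an Iceberg catalog and a hypothetical my_catalog.db.events table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg-enabled catalog is configured

# Per-file stats (record counts, sizes, min/max bounds) that drive scan pruning.
spark.sql("""
    SELECT file_path, record_count, file_size_in_bytes, lower_bounds, upper_bounds
    FROM my_catalog.db.events.files
""").show(truncate=False)

# EXPLAIN shows the scan Spark builds from that metadata, including pushed-down filters.
spark.sql(
    "EXPLAIN SELECT * FROM my_catalog.db.events WHERE ts >= TIMESTAMP '2024-01-01'"
).show(truncate=False)
```

It is not a full answer to "where is the bottleneck", but comparing the pushed-down filters against the file-level bounds is often enough to see why a scan touches more files than expected.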

jasonjmcghee

Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.

joking

¿chatgpt?

Gasp0de

Does anyone have a good alternative for storing large numbers of very small files that need to be individually queryable? We are dealing with a large volume of sensor readings that we need to query on a per-sensor basis over a given timespan, and we are hitting the problem mentioned in the article: storing millions of small files in S3 is expensive.

this_user

Do you absolutely have to write the data to files directly? If not, then using a time series database might be the better option. Most of them are pretty much designed for workloads with large numbers of append operations. You could always export to individual files later on if you need it.

Another option, if you have enough local storage, would be something like JuiceFS, which creates a virtual file system where files are first written to a local cache before JuiceFS uploads the data to your S3 provider as larger chunks.

SeaweedFS can do something similar if you configure it the right way. But both options require that you have enough storage outside of your object storage.

alchemist1e9

https://github.com/mxmlnkn/ratarmount

> To use all fsspec features, either install via pip install ratarmount[fsspec] or pip install ratarmount[fsspec]. It should also suffice to simply pip install fsspec if ratarmountcore is already installed.

paulsutter

If you want to keep them in S3, consolidate into sorted Parquet files. You get random access to row groups, and only the columns you need are read, so it's very efficient. DuckDB can both build and access these files efficiently. You could compact files hourly, nightly, or weekly, whatever fits.
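A rough sketch of that compaction pattern with DuckDB (paths, column names, and schedule are hypothetical): merge the small files into one Parquet file sorted by sensor and time, so a single-sensor range query only touches the relevant row groups and columns.

```python
import duckdb

con = duckdb.connect()

# Compact many small files into one sorted file so each sensor lands in a
# handful of row groups. Assumes S3 credentials / the httpfs extension are
# configured; bucket and column names are made up.
con.execute("""
    COPY (
        SELECT * FROM read_parquet('s3://my-bucket/raw/*.parquet')
        ORDER BY sensor_id, ts
    ) TO 'compacted.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)
""")

# One sensor over a timespan: row-group min/max stats on sensor_id and ts
# let DuckDB skip most of the file.
con.sql("""
    SELECT ts, value
    FROM 'compacted.parquet'
    WHERE sensor_id = 'sensor-42'
      AND ts BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-01-02'
""").show()
```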

Of course, for a simpler solution you could also use Aurora, which gives you a clean, scalable Postgres that can survive zone failures.

Gasp0de

The problem is that the initial writing is already so expensive that I guess we'd have to write multiple sensors into the same file instead of having one file per sensor per interval. I'll look into Parquet access options; if we could write 10k sensors into one file but still read a single sensor from that file, that could work.

spothedog1

New S3 Table Buckets [1] do automatic compaction

[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...

alexmorley

Most of these issues will ring true for lots of folks using Iceberg at the moment. But this does not:

> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [ ... ]
> Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.

I think extremely few people are betting on any other open-source table format now; that consolidation already happened in 2023-2024 (see, e.g., Databricks, who have their own competing format, leaning heavily into Iceberg, or the adoption by all of the major data warehouse providers).

twoodfin

Microsoft is right now making a huge bet on Delta by way of their “Microsoft Fabric” initiative (as always with Microsoft: Is it a product? Is it a branding scheme? Yes.)

They seem to be the only vendor crazy enough to try to fast-follow Databricks, who is clearly driving the increasingly elaborate and sophisticated Delta ecosystem (check the GitHub traffic…)

But Microsoft + Databricks is a lot of momentum for Delta.

On the merits of open & simple, I agree, better for everyone if Iceberg wins out—as Iceberg and not as some Frankenstandard mashed together with Delta by the force of 1,000 Databricks engineers.

datadrivenangel

The only reason Microsoft is using Delta is to emphasize to CTOs and investors that Fabric is as good as Databricks, even when that is obviously false to anyone who has smelled the evaporative scent of vaporware before.

twoodfin

Very different business, of course, but Databricks v. Fabric reminds me a lot of Slack v. Teams.

Regardless of the relative merits now, I think everyone agrees that a few years ago Slack was clearly superior. Microsoft certainly could have bought Slack instead of pumping probably billions into development, marketing, and discounts to destroy them.

I think Microsoft could and would consider buying Databricks—$80–100B is a lot, but not record-shattering.

If I were them, though, I’d spend a few billion competing as an experiment, first.

esafak

Microsoft's gonna Microsoft.

paulsutter

Does this feel about 3x too verbose, like it’s generated?

jasonjmcghee

Idk if it's the verbosity but yes, reads as generated to me. Specifically sounds like ChatGPT's writing.