Embedding user-defined indexes in Apache Parquet

jasim

I think this post is a response to some new file format initiatives, based on the criticism that the Parquet file format is showing its age.

One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog, etc.).

This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.

They insert arbitrary data between the data pages and the footer, which other readers ignore but query engines like DataFusion can exploit. Here they embed a new index in the .parquet file and use it to improve query performance.
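To make the layout idea concrete, here is a rough sketch in plain Python. It is a toy container, not real Parquet: the footer is JSON instead of Thrift, and the key names `my_index_offset` and `my_index_length` are made up for illustration. The only parts borrowed from the real format are the `PAR1` magic bytes and the 4-byte little-endian footer length that precedes the trailing magic.

```python
import io
import json
import struct

MAGIC = b"PAR1"  # real Parquet files also start and end with these 4 bytes


def write_toy_file(data_pages: bytes, index_blob: bytes) -> bytes:
    """Write: magic | data pages | custom index | footer | footer length | magic."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    buf.write(data_pages)
    index_offset = buf.tell()  # remember where the custom index starts
    buf.write(index_blob)
    # Toy footer: JSON key-value metadata (real Parquet uses Thrift here).
    # The key names are hypothetical.
    footer = json.dumps(
        {"kv": {"my_index_offset": index_offset, "my_index_length": len(index_blob)}}
    ).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # 4-byte little-endian footer length
    buf.write(MAGIC)
    return buf.getvalue()


def read_footer(file_bytes: bytes) -> dict:
    """Locate the footer from the end of the file, as any reader would."""
    if file_bytes[-4:] != MAGIC:
        raise ValueError("missing trailing magic")
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    return json.loads(file_bytes[-8 - footer_len:-8])


def read_index(file_bytes: bytes):
    """An index-aware reader follows the offset; others never look at it."""
    kv = read_footer(file_bytes)["kv"]
    offset = kv.get("my_index_offset")
    if offset is None:
        return None  # file has no custom index: fall back to normal reading
    return file_bytes[offset:offset + kv["my_index_length"]]
```

A reader that does not know the extra keys still finds the footer from the end of the file and only ever seeks to the byte ranges the footer points at, so the index bytes are never touched. That is the whole backward-compatibility trick.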

In this specific instance, they add an index holding all the distinct values of a column. Then they extend the DataFusion query engine so that a query like `WHERE nation = 'Singapore'` can check that index to decide whether the value exists in that .parquet file at all, without having to scan the data pages. (Parquet's existing min-max statistics already let readers skip some data, but they cannot rule out a value that falls inside the min-max range.)
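A minimal sketch of why a distinct-value index prunes more than min-max statistics, again in plain Python. The function names and sample values are mine, not DataFusion's; this only illustrates the logic for an equality predicate.

```python
def min_max_may_match(col_min: str, col_max: str, literal: str) -> bool:
    """Min-max pruning: the file can be skipped only if the literal is out of range."""
    return col_min <= literal <= col_max


def distinct_index_may_match(distinct_values, literal: str) -> bool:
    """Distinct-value pruning: the file can be skipped if the literal is absent."""
    if distinct_values is None:
        return True  # no index embedded: be conservative and keep the file
    return literal in distinct_values


# Hypothetical contents of the `nation` column in one file.
nations = ["Brazil", "Japan", "Vietnam", "Japan"]
index = set(nations)

# 'Singapore' sorts between 'Brazil' and 'Vietnam', so min-max cannot exclude it.
keep_by_min_max = min_max_may_match(min(nations), max(nations), "Singapore")
# The distinct-value index knows the value is absent, so the file can be skipped.
keep_by_index = distinct_index_may_match(index, "Singapore")
```

Here `keep_by_min_max` is `True` and `keep_by_index` is `False`, so only the index-aware engine gets to skip the file. The trade-off is size: a distinct-value index only pays off for columns with few distinct values.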

Also in general this is a really good deep dive into columnar data storage.

Nelkins

Cool, but this is very specific to DataFusion, no? Is there any chance this would be standardized so other Parquet readers could leverage the same technique?

gdubya

The technique can be applied by any engine, not just DataFusion. Each engine would have to know about the indexes in order to make use of them, but the fallback to standard Parquet behavior means that the data is still readable by all.

aerzen

But does DataFusion publish a specification of how this metadata can be read, along with a test suite for verifying implementations? Because if they don't, this cannot be reliably used by any other impl.

HN
