The two versions of Parquet

crmd

I am saying this as a lifelong supporter and user of open source software: issues like this are why governments and enterprises still run on Oracle and SQL Server.

The author was able to roll back his changes, but in some industries an unplanned enterprise-wide data unavailability event means the end of your career at that firm, if you don’t have a CYA email from the vendor confirming you were good to go. That CYA email, and the throat to choke, is why Oracle does 7- and 8-figure licensing deals with enterprises, selling inferior software solutions versus open source options.

It seems that Linux, through Linus’ leadership, has been able to solve this risk issue and fully displace commercial UNIX operating systems. I hope many other projects up and down the stack can have the same success.

atombender

Sorry, I think you misunderstood this article.

When the author talks about rolling back his changes, he's not referring to a database, but to a version of his library. If someone had tried to use his new version, I assume the only thing that would have gone wrong is that their code wouldn't work because Pandas didn't support the format.

This article is about how a new version of the Parquet format hasn't been widely adopted, and so now the Parquet community is in a split state where different forces are pulling the format in two directions, and this happens to be caused by two different areas of focus that don't need to be tightly coupled together.

I don't see how the problems the article discusses relate to the reliability of software.

forinti

People keep using Oracle because they have a ton of code and migration would be too costly.

Oracle is not immune to software issues. In fact, this year I lost two weekends because of a buggy upgrade on the cloud that left my production cluster in a failed state.

chrismustcode

A lot of these have business logic literally in the database built up over years.

It’s a mammoth task for them to migrate

reactordev

Oracle Consulting gladly built it all as stored procs with a UI.

taneq

It’s not about being immune to software issues. It’s about having a vendor to cop the blame if something goes wrong.

1a527dd5

Polite disagree; governments and enterprises remain on Oracle / SQL Server because migrating off them is borderline Sisyphean. It can be done (we are doing it) but it requires a team who are doing it non-stop. It's horrible work.

duncanfwalker

At the start of your comment I thought the 'issues like this' were going to be the 4 year discussions about what is and isn't core.

crmd

So did I :-) but I think the concepts are related: Linus’ ability to shift into autocratic leadership mode when necessary seems to prevent issues like the 4 year indecisiveness on v2/core from compromising product quality to the point where Linux is trusted in a way that rivals commercial software.

moelf

and why CERN is rocking their own file format, again, in 2025: https://cds.cern.ch/record/2923186

3eb7988a1663

To be fair, CERN's needs do seem fairly niche. Petabyte numeric datasets with all sorts of access patterns from researchers. All of which they want to maintain compatible software for, forever.

moelf

yeah except this new RNTuple thing is really really similar to Apache Arrow

adrian17

I was quite confused when I learned that the spec technically supports metadata about whether the data is already pre-sorted by some column(s); in my eyes it seemed like it would allow some no-brainer optimizations. And yet, last I checked, it looked like pretty much nothing actually uses it, and some libraries don't even read this field at all.
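To make the "no-brainer optimization" concrete, here is a minimal sketch in plain Python. It does not use any real Parquet library: `count_matches` and the `declared_sorted` flag are hypothetical stand-ins for a reader that trusts the file's sort metadata. When the column is declared sorted, an equality lookup can use binary search instead of scanning every value.

```python
from bisect import bisect_left, bisect_right


def count_matches(column, target, declared_sorted):
    """Count rows equal to target in one column chunk.

    declared_sorted is a hypothetical flag standing in for the sort
    metadata a file could carry; it is not a real library parameter.
    """
    if declared_sorted:
        # Sorted data: two binary searches bound the run of matches, O(log n).
        return bisect_right(column, target) - bisect_left(column, target)
    # No ordering guarantee: every value has to be inspected, O(n).
    return sum(1 for value in column if value == target)


column = [1, 3, 3, 3, 7, 9, 12]
print(count_matches(column, 3, declared_sorted=True))
print(count_matches(column, 3, declared_sorted=False))
```

Both calls give the same answer; only the amount of work differs. A reader that ignores the field has to take the slow path even when the writer did the sorting.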

viccis

Yeah I had to wait years to really use Parquet effectively in Python code back in the 2010s because there were two main libraries (Pyarrow and Fastparquet), and they were neither compatible with each other nor compatible with Spark. Parquet support is much like Javascript support in browsers. You only get to use the more advanced features when they are supported compatibly on every platform you expect them to be used on.

1a527dd5

https://www.jeronimo.dev/the-two-versions-of-parquet/#perfor...

First paragraph under that heading has a markdown error

    which I hadn’t considered in [my previous post on compression algorithms]](/compression-algorithms-parquet/).

sbassi

Shameless plug: made a parquet conversion utility: pip install parquetconv

It is a command line wrapper to generate a Pandas DF and save it as CSV (or the other way around)

sighansen

As long as Iceberg and Delta Lake don't support v2, adoption will be really hard. I'm working a lot with Parquet and wasn't even aware that there is a version 2.0.

lolive

Why wouldn't they adopt the v2.0?

lowbloodsugar

When working with your own datasets, v2 is a must. If you are willing to make trade offs you can get insane compression and speed.
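One reason the newer encodings can compress so well is delta encoding, which stores differences between neighbouring values instead of the values themselves. The sketch below is plain Python and only illustrates the idea: the helper names are made up, and zlib stands in for the bit-packing that Parquet's actual DELTA_BINARY_PACKED encoding uses, so the sizes are not what a real writer would produce.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def packed_size(values):
    """Size in bytes after packing as 64-bit integers and compressing with zlib."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw))


# Hypothetical data: one timestamp per minute.
timestamps = [1_700_000_000 + 60 * i for i in range(10_000)]
deltas = delta_encode(timestamps)

assert delta_decode(deltas) == timestamps  # the encoding is lossless
print(packed_size(timestamps), packed_size(deltas))
```

The trade-off is the one the thread is about: a file written this way is only useful if every reader you care about understands the encoding.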
