Sparrow, a modern C++ implementation of the Apache Arrow columnar format

27 comments

·January 31, 2025

binary132

I am glad to see others in the comments share my concerns about the code quality considering that the offered samples are not really idiomatic or modern C++. In particular, there is really no reason to use pointers here in either the accessor or the mutator; both of those probably should have been references, although the case can be made for the mutator at least since it makes it explicit in the written code that we may mutate the arguments.

seertaak

I had a go at this, and think my attempt is more idiomatic C++:

https://github.com/seertaak/xbow/blob/main/examples/table_ex...

https://github.com/seertaak/xbow/blob/main/examples/print_sc...

Handles nested structures iirc. It's been a while...

amluto

This is supposed to be idiomatic?!?

    namespace sp = sparrow;
    sp::primitive_array<int> ar = { 1, 3, 5, 7, 9 };
    // Caution: get_arrow_structures returns pointers, not values
    auto [arrow_array, arrow_schema] = sp::get_arrow_structures(std::move(ar));
    // Use arrow_array and arrow_schema as you need (serialization,
    // passing it to a third party library)
    // ...
    // do NOT release the C structures in the end, the "ar" variable will do it
   // for you

I’m sorry, resources are kept alive by an object that has been moved from?

senkora

This bothered me enough to check the source code, because I simply had to know:

    template <layout_or_array A>
    std::pair<ArrowArray*, ArrowSchema*> get_arrow_structures(A& a)
    {
        arrow_proxy& proxy = detail::array_access::get_arrow_proxy(a);
        return std::make_pair(&(proxy.array()), &(proxy.schema()));
    }

https://github.com/man-group/sparrow/blob/c01a768f590ebf3b22...

So the answer is that the `std::move` does nothing and should be omitted, because this function only has one overload, and that overload takes its argument by lvalue-reference.

(And as far as I can tell, `detail::array_access::get_arrow_proxy(a)` eventually just reads a member on `a`, so there's no copying anywhere)

It's a harmless mistake; I'm just surprised it wasn't caught. The authors seem pretty experienced with the language.

quietbritishjim

> So the answer is that the `std::move` does nothing and should be omitted

You can't assign an rvalue into an lvalue reference, precisely to avoid this sort of mistake. If this is the only overload then this just wouldn't compile (e.g. [1]). So the std::move isn't doing nothing, but yes it should be omitted. Maybe it's just a weird typo.

[1] https://ideone.com/4NS5dI

tempodox

Indeed, I don't even have to compile it, vim (+ Syntastic) tells me as soon as it sees it:

  candidate function not viable: expects an lvalue for 1st argument

BoingBoomTschak

Modern C++ in a nutshell

t. C++17 dev and pain connoisseur

jandrewrogers

Not speaking to the specific design choices here, but in C++ moved-from objects are not destroyed and must be valid in their moved-from state (e.g. a sentinel value to indicate they’ve been moved) so that they can be destroyed in the indefinite future. This is useful even though “destroy on move” is the correct semantics for most cases. Making “move” and “destroy” distinct operations increases the flexibility and expressiveness.

A common case where this is useful is if the address space where the object lives is accessible, for read or write, by references exogenous to the process like some kinds of shared memory or hardware DMA. If the object is immediately destroyed on move, it implies the memory can be reused while things your process can’t control may not know you destroyed the object. This is essentially a use-after-free bug factory. Being able to defer destruction to a point in time when you can guarantee this kind of UAF bug is not possible is valuable.

quietbritishjim

> Making “move” and “destroy” distinct operations increases the flexibility and expressiveness.

No, it does not. It's an artefact of the evolution of the language and highly undesirable. Rust has destructive moves (and copies built on top of moves, rather than the other way round) and it's far cleaner.

jandrewrogers

Sure, if you never need to deal with actual low-level high-performance systems code. Just because this use case doesn’t apply to anything you do doesn’t mean it applies to nobody. This is the kind of attitude that undermines languages like Rust (which I use in my systems). A fair criticism of Rust as a “systems language” is that it simply excludes all the really difficult parts of being a systems language.

C++ deserves a lot of criticism. Many aspects of the language are quite fucked. But willfully ignoring that it solves real problems that other nominal systems languages are unwilling to address doesn’t mean those problems don’t exist.

motorest

> (...) but in C++ moved-from objects are not destroyed and must be valid in their moved-from state (e.g. a sentinel value to indicate they’ve been moved) so that they can be destroyed in the indefinite future.

I don't think there is any requirement for a moved-from state other than the moved-from instance to remain in a valid state. There is zero relationship between moved-from state and the end of the object's life cycle. You can continue to use the instance of the moved-from object without any concern other than being in a valid state, but that applies to all instances of all conceivable object types.

Focus on the problem that move semantics solves: avoiding copies in general, specifically when resources are transfered to other instances. Does instance lifetimes change? No.

remexre

If an object is a RAII container though, shouldn't moving transfer the ownership of the contents to the destination? Otherwise there's no way to increase the lifetime of the contents.

amluto

> A common case where this is useful is if the address space where the object lives is accessible, for read or write, by references exogenous to the process like some kinds of shared memory or hardware DMA.

Huh? Okay, I allocate some memory and DMA-map it to the device [0]. Then:

1. Device uses it.

2. I copy or otherwise consume the data. Hopefully not in a way that causes UB.

3. I’m logically done, but maybe the device isn’t.

4. I’m 100% done, and I unmap the memory. And it’s an error if the device touches it again.

Why would I represent steps 2 and 3 as std::move? Maybe this results in efficient code in some particular code base, but it is not idiomatic, and I can almost guarantee that the same performance could be achieved in a less mind-bending and expectation-defying manner that doesn’t call itself a move.

I don’t see why Rust would have any particular trouble with this as long as you don’t try to force the lifetime of a DMA allocation or mapping into an object that doesn’t live long enough.

[0] The mapping operation may or may not be a no-op given IOMMUs, various VM translation schemes, etc.

SylvainCorlay

Excellent catch.

This std::move should not have been in this code snippet. It is a copy-paste mistake, carried over from the previous code snippet of the post, and should have been omitted.

(The `std::move` does nothing in this snippet, since `sp::get_arrow_structure` takes and lvalue reference).

In the previous example with `sp::extract_arrow_structures`, which takes an rvalue reference, std::move is required and the sparrow primitive array cannot be operated upon after.

juunpp

I think that comment is a copy-paste mistake. If you look at the next code snippet, the comment actually makes sense there.

That being said, I've also given up on C++ and learn it mostly to keep up with the job, if that's where you are coming from. I don't find Rust to be a satisfying replacement, though. No language scratches the itch for me right now.

SylvainCorlay

Excellent catch. The std::move in the snippet is a copy-paste mistake. It was carried over from the previous code snippet.

arka2147483647

Why does sp::primitive_array<> even exist?

Should it not be std::array<>?

SylvainCorlay

This is really not the same thing at all.

sp::primitive_array holds memory following the Arrow specification, which can be operated upon in place from e.g. ArrowCpp, PyArrow, etc.

null

[deleted]

JTyQZSnP3cQGa8B

It's a new technique called "reverse GC" where you care about the first living object instead of the last one /s

Also:

> nullable can hold references

We have: https://en.cppreference.com/w/cpp/utility/functional/referen...

> Assigning nullval [...] does not trigger the destruction of the underlying object

That's weird.

rubenvanwyk

Seems cool but I have questions about Arcticdb - is Polars, DuckDB etc really so limited for data science analysis that it's justified to write a new library specifically for time-series analysis on S3 files?

rubenvanwyk

Apparently this: https://www.infoq.com/presentations/arcticdb/

willdealtry

ArcticDB is more concerned with storage and persistence than with in-memory processing, so it's complementary to Polars/DuckDB etc, rather than an alternative.

nly

Trying to find a good C++ parquet library is equally hard

HN

Sparrow, a modern C++ implementation of the Apache Arrow columnar format

Sparrow, a modern C++ implementation of the Apache Arrow columnar format