SQLBackendArray._raw_getitem and the eager aggregation path both copy PyArrow RecordBatch columns into a preallocated numpy buffer (one read pass per batch column). The copy is necessary today because PyArrow buffers are immutable, while the scatter step writes into a numpy buffer that has to be writable.
xarray.Variable does not strictly require a numpy ndarray; it accepts anything that meets the _array_like contract (see xarray/core/variable.py#L373). If we can wrap a pyarrow.Array in a tiny adapter that satisfies that contract, the scatter step becomes a pointer assignment and the round-trip skips a full read pass.
Open questions:
- Do we need contiguous storage per dim, or can we hand xarray a list of pyarrow.Array chunks with a custom __array__ that materializes lazily?
- How does indexing (slicing, fancy indexing) compose with the adapter once xarray asks for sub-views?
- Does this play nicely with the chunked / dask path users will reach for after .chunk()?
Worth prototyping on a single dtype family first (float64) to see if the adapter shape is even tractable.
Spun out of #167.