Investigate numpy-array-like Arrow view to avoid the scatter copy #173

`SQLBackendArray._raw_getitem` and the eager aggregation path both copy PyArrow `RecordBatch` columns into a preallocated numpy buffer (one read pass per batch column). The copy is needed today because PyArrow buffers are immutable, so a zero-copy numpy view of them is read-only, while the current path expects a writable numpy buffer.

`xarray.Variable` does not strictly require a numpy ndarray; it accepts anything that meets the `_array_like` contract (see [xarray/core/variable.py#L373](https://github.com/pydata/xarray/blob/main/xarray/core/variable.py#L373)). If we can wrap a `pyarrow.Array` in a tiny adapter that satisfies that contract, the scatter step becomes a pointer assignment and the round-trip skips a full read pass.

Open questions:

- Do we need contiguous storage per dim, or can we hand xarray a list of `pyarrow.Array` chunks with a custom `__array__` that materializes lazily?
- How does indexing (slicing, fancy indexing) compose with the adapter once xarray asks for sub-views?
- Does this play nicely with the chunked / dask path users will reach for after `.chunk()`?

Worth prototyping on a single dtype family first (float64) to see if the adapter shape is even tractable.

Spun out of #167.
