Investigate numpy-array-like Arrow view to avoid the scatter copy #173

Description

@ghostiee-11

SQLBackendArray._raw_getitem and the eager aggregation path both copy PyArrow RecordBatch columns into a preallocated numpy buffer (one read pass per batch column). The copy is necessary today because PyArrow buffers are immutable views and numpy needs a writable buffer.

xarray.Variable does not strictly require a numpy ndarray; it accepts anything that meets the _array_like contract (see xarray/core/variable.py#L373). If we can wrap a pyarrow.Array in a tiny adapter that satisfies that contract, the scatter step becomes a pointer assignment and the round-trip skips a full read pass.

Open questions:

  • Do we need contiguous storage per dim, or can we hand xarray a list of pyarrow.Array chunks with a custom __array__ that materializes lazily?
  • How does indexing (slicing, fancy indexing) compose with the adapter once xarray asks for sub-views?
  • Does this play nicely with the chunked / dask path users will reach for after .chunk()?

Worth prototyping on a single dtype family first (float64) to see if the adapter shape is even tractable.

Spun out of #167.
