Skip to content

read_xarray registration is O(num_chunks): one partition per native chunk, intractable on finely-chunked stores #174

Description

@ghostiee-11

read_xarray_table creates one scan partition per native chunk and enumerates them all at registration, so registration cost is O(num_chunks) rather than O(data_size). On finely-chunked stores this makes registration intractable.

Measured (tiny synthetic data, same data, only the chunking changes)

import numpy as np, xarray as xr, time, xarray_sql as xql
for T in (1_000, 10_000, 100_000):
    ds = xr.Dataset(
        {"v": (("t", "y", "x"), np.zeros((T, 2, 2), "float32"))},
        coords={"t": np.arange(T), "y": np.arange(2), "x": np.arange(2)},
    ).chunk({"t": 1, "y": 2, "x": 2})
    t0 = time.time(); xql.read_xarray_table(ds, chunks={"t": 1})
    print(T, f"{time.time()-t0:.2f}s")
partitions register time
1,000 0.06s
10,000 0.24s
100,000 2.45s

~25us/partition, linear.

Real-world impact

Public Arraylake GOES (earthmover-public/goes-16, group ABI-L2-MCMIPF/post-2023-04-19) is chunked one timestep per chunk:

CMI_C13 chunks -> t: 102,988 chunks of size 1 ; y, x: 24 chunks of 226

That is 102988 * 24 * 24 = ~59M partitions for a single variable, i.e. ~25-30 minutes just to register one variable. Registering all 188 variables never finishes. Coarsening the time chunks and pre-loading coordinates did not help (still minutes).

Proposal

Decouple partition count from native chunk count: coalesce native chunks into a bounded number of scan partitions (e.g. a target partition count / size), or enumerate partitions lazily, so registration cost is independent of how finely the store is chunked.

Context

Surfaced via Lumen's XArraySQLSource (holoviz/lumen#1889), which compounds it by registering per-variable.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions