`read_xarray_table` creates one scan partition per native chunk and enumerates them all at registration, so registration cost is `O(num_chunks)` rather than `O(data_size)`. On finely-chunked stores this makes registration intractable.
Measured (tiny synthetic data, same data, only the chunking changes)

    import time

    import numpy as np
    import xarray as xr
    import xarray_sql as xql

    # T timesteps, one chunk per timestep, so the partition count equals T.
    for T in (1_000, 10_000, 100_000):
        ds = xr.Dataset(
            {"v": (("t", "y", "x"), np.zeros((T, 2, 2), "float32"))},
            coords={"t": np.arange(T), "y": np.arange(2), "x": np.arange(2)},
        ).chunk({"t": 1, "y": 2, "x": 2})
        # Time registration only; no query is executed.
        t0 = time.time(); xql.read_xarray_table(ds, chunks={"t": 1})
        print(T, f"{time.time()-t0:.2f}s")

| partitions | register time |
|---|---|
| 1,000 | 0.06s |
| 10,000 | 0.24s |
| 100,000 | 2.45s |

About 25 µs per partition at the two larger sizes, scaling linearly with partition count.
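
As a quick check on that figure, the per-partition cost can be derived directly from the measurements in the table above. The variable names are illustrative only.

```python
# (partitions, seconds) pairs copied from the measurements above.
measurements = [(1_000, 0.06), (10_000, 0.24), (100_000, 2.45)]

# Microseconds of registration time per partition at each size.
per_partition_us = {n: secs / n * 1e6 for n, secs in measurements}

for n, us in per_partition_us.items():
    print(f"{n:>7,} partitions: {us:.1f} us/partition")
```

The smallest run comes out noticeably higher per partition, which likely reflects fixed overhead, so the two larger runs are the better guide to the marginal cost.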
Real-world impact
Public Arraylake GOES (`earthmover-public/goes-16`, group `ABI-L2-MCMIPF/post-2023-04-19`) is chunked one timestep per chunk:

    CMI_C13 chunks -> t: 102,988 chunks of size 1 ; y, x: 24 chunks of 226

That is `102988 * 24 * 24 = ~59M` partitions for a single variable, i.e. ~25-30 minutes just to register one variable. Registering all 188 variables never finishes. Coarsening the time chunks and pre-loading coordinates did not help (still minutes).
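
The estimate can be reproduced with a few lines of arithmetic. This sketch assumes the ~25 µs per-partition cost from the synthetic benchmark carries over to the real store, which is an extrapolation rather than a measurement.

```python
# Chunk counts per dimension for CMI_C13, as reported above.
T_CHUNKS, Y_CHUNKS, X_CHUNKS = 102_988, 24, 24

# Assumed marginal cost, taken from the synthetic benchmark (~25 us/partition).
SECONDS_PER_PARTITION = 25e-6

partitions = T_CHUNKS * Y_CHUNKS * X_CHUNKS
est_minutes = partitions * SECONDS_PER_PARTITION / 60

print(f"{partitions:,} partitions, ~{est_minutes:.0f} min to register")
# prints: 59,321,088 partitions, ~25 min to register
```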
Proposal
Decouple partition count from native chunk count: coalesce native chunks into a bounded number of scan partitions (e.g. a target partition count / size), or enumerate partitions lazily, so registration cost is independent of how finely the store is chunked.
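
To make the two options concrete, here is a minimal sketch of what coalescing and lazy enumeration could look like. It is not xarray-sql's implementation: `coalesce_chunks`, `iter_partitions`, and the `max_partitions` parameter are hypothetical names, and the halving strategy is just one simple way to reach a bounded partition count.

```python
import math
from itertools import accumulate, product


def coalesce_chunks(chunks, max_partitions):
    """Merge neighbouring native chunks until at most `max_partitions` remain.

    `chunks` maps dimension name -> tuple of native chunk sizes, the same
    layout as `xarray.Dataset.chunks`. Merged chunks always end on native
    chunk boundaries, so no native chunk is split across partitions.
    """
    merged = {dim: list(sizes) for dim, sizes in chunks.items()}
    while math.prod(len(s) for s in merged.values()) > max_partitions:
        # Halve the chunk count of whichever dimension is most fragmented.
        dim = max(merged, key=lambda d: len(merged[d]))
        sizes = merged[dim]
        if len(sizes) == 1:
            break  # every dimension is already a single chunk
        merged[dim] = [sum(sizes[i:i + 2]) for i in range(0, len(sizes), 2)]
    return {dim: tuple(sizes) for dim, sizes in merged.items()}


def iter_partitions(chunks):
    """Lazily yield one {dim: slice} mapping per partition.

    Only the per-dimension boundaries are computed up front (cost is the sum
    of the chunk counts); the cross product is produced on demand.
    """
    bounds = {}
    for dim, sizes in chunks.items():
        stops = list(accumulate(sizes))
        starts = [0] + stops[:-1]
        bounds[dim] = [slice(a, b) for a, b in zip(starts, stops)]
    dims = list(bounds)
    for combo in product(*(bounds[d] for d in dims)):
        yield dict(zip(dims, combo))


# Chunk layout of CMI_C13 as described above.
native = {"t": (1,) * 102_988, "y": (226,) * 24, "x": (226,) * 24}
merged = coalesce_chunks(native, max_partitions=1024)
n_partitions = math.prod(len(s) for s in merged.values())

# Pulling the first partition does not enumerate the rest.
first = next(iter_partitions(merged))
print(n_partitions, first)
```

With either approach the work done at registration depends on `max_partitions` or on the per-dimension chunk counts, not on their product.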
Context
Surfaced via Lumen's `XArraySQLSource` (holoviz/lumen#1889), which compounds the problem by registering each variable separately.