fix: coalesce native chunks into bounded scan partitions (#174) by ghostiee-11 · Pull Request #175 · alxmrs/xarray-sql

ghostiee-11 · 2026-06-15T11:37:18Z

What this does

read_xarray_table creates one DataFusion scan partition per native xarray chunk, so table registration is O(num_chunks) rather than O(data_size). On finely chunked stores this becomes intractable. The case in #174 is a single Arraylake GOES-16 variable with roughly 102,988 x 24 x 24 (about 59M) native chunks, which takes 25 to 30 minutes to register and never finishes for all 188 variables.

This coalesces consecutive native chunks into at most target_partitions scan partitions (default 16384), allocated in a balanced way across dimensions so filter pushdown keeps pruning in every dimension. Each coalesced partition still streams one native chunk at a time, so peak per-partition memory during a query is unchanged. Passing target_partitions=None restores the previous behavior of one partition per native chunk.

Fixes #174.

Why this is safe

Coalescing only ever widens a partition's min/max bounding box. The pruning logic excludes a partition only when a filter provably cannot match its box, so a wider box can never cause a false exclusion. Query results are therefore identical with or without coalescing; only pruning granularity changes. An oracle test asserts this by comparing results at the default target against target_partitions=None.

The group-size allocation is O(D log D) in the number of dimensions and independent of the native chunk count, so it stays fast even at tens of millions of chunks.

Results

Registration time vs native chunk count (synthetic, perf_tests/registration_scaling.py):

native chunks	before (one per chunk)	after (coalesced)
1,000	0.02s	0.05s
10,000	0.25s	0.25s
100,000	2.93s	0.32s
1,000,000	54.27s	0.46s

Below the target the behavior is unchanged (no coalescing); above it, registration stays flat.

API

read_xarray_table(ds, ..., target_partitions=16384)
XarrayContext.from_dataset(..., target_partitions=16384) (the bound is applied per table)
target_partitions=None disables coalescing.

Testing

tests/test_coalesce.py: group-size algorithm, exact sub-block tiling, bounded partition count, correctness oracle (coalesced vs uncoalesced including WHERE filters), bounded-memory streaming, explicit-chunks capping, and cftime (360_day) transparency.
perf_tests/registration_scaling.py: registration-scaling benchmark.
Full suite passes: uv run --no-project pytest -v . -m "not integration" (155 tests). cargo check --all-features passes. No Rust changes.
Docstrings updated (Google style) and a note added to the docs.

ghostiee-11 · 2026-06-19T08:10:13Z

Hey @alxmrs, can you review this?

read_xarray_table created one DataFusion scan partition per native xarray chunk, so table registration was O(num_chunks) rather than O(data_size). On finely chunked stores this is intractable: a single GOES-16 variable has roughly 59M native chunks and never finishes registering. Coalesce consecutive native chunks into at most target_partitions scan partitions (default 16384), allocated in a balanced way across dimensions so filter pushdown keeps pruning in every dimension. Each coalesced partition still streams one native chunk at a time, so peak per-partition memory is unchanged. Pass target_partitions=None to keep one partition per native chunk. Coalescing only widens a partition's min/max box, which can never cause a false prune, so query results are identical. Registration is now bounded regardless of chunking granularity: a synthetic 1M-chunk dataset drops from 54s to 0.46s locally. Adds algorithm, tiling, correctness-oracle, memory, capping, and cftime tests plus a registration-scaling benchmark. No Rust changes.

alxmrs · 2026-06-20T16:25:15Z

After merging my speedup changes with this PR, I found this turned ERA5 registration from 10s --> 1.2s. But, it made a query that previously took 1.4s to take 24s. Would you mind exploring more to see if we can get both the win of faster registration and fast queries? Please see if you can compare to main again -- esp, on the GOES example -- because I may have independently fixed the issue using a different means (addressing dask).

ghostiee-11 · 2026-06-21T14:39:04Z

Thanks for testing this against dask @alxmrs . I looked into both sides before changing anything.

The query regression is real: coalescing prunes to a whole coalesced group instead of a single chunk, so a selective filter reads many times more than it needs (the 1.4s to 24s you saw).

But coalescing is the only thing making GOES-scale registration tractable. On current main, registration is still O(num_chunks): 1M chunks take ~15s, so one 59M-chunk GOES variable is ~15 min, and there are 188. So #174 is still open at that scale, and this PR fixes it the wrong way, by sacrificing pruning.

I think we can get both, building on #176. The pruning metadata is separable per dimension: a partition's min/max box is fixed by per-axis chunk bounds, not the full product (about 103k per-axis values vs 59M boxes for GOES). So:

Store per-axis chunk metadata at registration. Bounded, O(sum of per-axis chunk counts).
Prune per axis at scan time and enumerate only the surviving partitions, at full native granularity. No pruning lost.
Coalesce only the survivors of an unfiltered scan, purely for parallelism.

That gives bounded registration and exact pruning. The cost is reworking the Rust table provider to prune-then-enumerate instead of materialising all N partitions up front.

Want me to prototype this so we can benchmark it on the GOES example? If you would rather keep it small for now, I can default target_partitions to None so this PR is opt-in and changes nothing by default.

alxmrs · 2026-06-22T07:23:21Z

Can you paste in a comment or the change a demo script of opening GOES? I'd like to make a few sanity checks to see if there are other ways we could make it open faster. For example, do you set chunks=None when you open the Zarr?

alxmrs force-pushed the fix/registration-bounded-partitions-174 branch from 8c97843 to 847fbc7 Compare June 20, 2026 16:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: coalesce native chunks into bounded scan partitions (#174)#175

fix: coalesce native chunks into bounded scan partitions (#174)#175
ghostiee-11 wants to merge 1 commit into
alxmrs:mainfrom
ghostiee-11:fix/registration-bounded-partitions-174

ghostiee-11 commented Jun 15, 2026

Uh oh!

ghostiee-11 commented Jun 19, 2026

Uh oh!

alxmrs commented Jun 20, 2026

Uh oh!

ghostiee-11 commented Jun 21, 2026

Uh oh!

alxmrs commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ghostiee-11 commented Jun 15, 2026

What this does

Why this is safe

Results

API

Testing

Uh oh!

ghostiee-11 commented Jun 19, 2026

Uh oh!

alxmrs commented Jun 20, 2026

Uh oh!

ghostiee-11 commented Jun 21, 2026

Uh oh!

alxmrs commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants