
fix: coalesce native chunks into bounded scan partitions (#174) #175

Open
ghostiee-11 wants to merge 1 commit into alxmrs:main from ghostiee-11:fix/registration-bounded-partitions-174


@ghostiee-11

Contributor

What this does

read_xarray_table creates one DataFusion scan partition per native xarray chunk, so table registration is O(num_chunks) rather than O(data_size). On finely chunked stores this becomes intractable. The case in #174 is a single Arraylake GOES-16 variable with roughly 102,988 x 24 x 24 (about 59M) native chunks: registering that one variable takes 25 to 30 minutes, and registering all 188 variables never finishes.

This coalesces consecutive native chunks into at most target_partitions scan partitions (default 16384), allocated in a balanced way across dimensions so filter pushdown keeps pruning in every dimension. Each coalesced partition still streams one native chunk at a time, so peak per-partition memory during a query is unchanged. Passing target_partitions=None restores the previous behavior of one partition per native chunk.
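A minimal sketch of the tiling idea, assuming a plain tuple of per-axis chunk counts and per-axis group sizes. The helper names `group_ranges`, `coalesced_partitions`, and `stream_partition` are hypothetical and are not the functions in this PR; the point is only that each coalesced partition covers a block of consecutive native chunks and still yields them one at a time.

```python
from itertools import product
from typing import Iterator


def group_ranges(num_chunks: int, group_size: int) -> list[range]:
    """Split chunk indices 0..num_chunks-1 into consecutive groups."""
    return [
        range(start, min(start + group_size, num_chunks))
        for start in range(0, num_chunks, group_size)
    ]


def coalesced_partitions(
    chunk_counts: tuple[int, ...], group_sizes: tuple[int, ...]
) -> list[tuple[range, ...]]:
    """One partition per combination of per-axis groups (a block of chunks)."""
    per_axis = [group_ranges(n, g) for n, g in zip(chunk_counts, group_sizes)]
    return list(product(*per_axis))


def stream_partition(partition: tuple[range, ...]) -> Iterator[tuple[int, ...]]:
    """Yield native chunk indices one at a time, never the whole block at once."""
    yield from product(*partition)


# 10 x 4 x 4 native chunks grouped as 4 x 2 x 2 gives 3 * 2 * 2 partitions.
partitions = coalesced_partitions((10, 4, 4), (4, 2, 2))
```

Because the groups tile each axis exactly, every native chunk lands in exactly one partition, which is the property the "exact sub-block tiling" test is described as checking.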

Fixes #174.

Why this is safe

Coalescing only ever widens a partition's min/max bounding box. The pruning logic excludes a partition only when a filter provably cannot match its box, so a wider box can never cause a false exclusion. Query results are therefore identical with or without coalescing; only pruning granularity changes. An oracle test asserts this by comparing results at the default target against target_partitions=None.
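The safety argument can be illustrated in one dimension. This is a sketch under assumptions: the `Interval` type and the `may_match` and `coalesce` helpers are hypothetical stand-ins for the real pruning code, and the filter is a simple closed range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float


def may_match(box: Interval, f_lo: float, f_hi: float) -> bool:
    """Keep a partition unless the filter range provably misses its box."""
    return box.lo <= f_hi and f_lo <= box.hi


def coalesce(boxes: list[Interval]) -> Interval:
    """The coalesced box is the union bounding box, so it can only be wider."""
    return Interval(min(b.lo for b in boxes), max(b.hi for b in boxes))


chunks = [Interval(0, 9), Interval(10, 19), Interval(20, 29)]
merged = coalesce(chunks)

# A filter on [12, 15] matches one native chunk, and therefore the merged box.
kept_native = [may_match(c, 12, 15) for c in chunks]
kept_merged = may_match(merged, 12, 15)
```

Whenever any member chunk survives the filter, the merged box survives too, so coalescing can keep extra data but never drops a matching chunk.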

The group-size allocation is O(D log D) in the number of dimensions and independent of the native chunk count, so it stays fast even at tens of millions of chunks.
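One way such an allocation could work is sketched below. This is an assumption for illustration, not the algorithm in this PR: it sorts the axes by chunk count (the O(D log D) step), gives each axis an equal share of the remaining partition budget, and lets axes with few chunks hand their unused share to the larger ones. The function names are hypothetical.

```python
import math


def balanced_group_sizes(chunk_counts: list[int], target: int) -> list[int]:
    """Pick per-axis group sizes so the partition count stays <= target."""
    order = sorted(range(len(chunk_counts)), key=lambda d: chunk_counts[d])
    sizes = [1] * len(chunk_counts)
    remaining = float(target)  # partition budget left for unprocessed axes
    for i, d in enumerate(order):
        dims_left = len(order) - i
        budget = remaining ** (1.0 / dims_left)  # equal share per axis
        parts = max(1, min(chunk_counts[d], math.floor(budget)))
        sizes[d] = math.ceil(chunk_counts[d] / parts)
        groups = math.ceil(chunk_counts[d] / sizes[d])  # never exceeds parts
        remaining /= groups
    return sizes


def partition_count(chunk_counts: list[int], sizes: list[int]) -> int:
    return math.prod(math.ceil(n / g) for n, g in zip(chunk_counts, sizes))


# GOES-like shape from #174 with the default target.
goes = [102_988, 24, 24]
sizes = balanced_group_sizes(goes, 16_384)
```

The loop touches each axis once, so the cost depends on the number of dimensions only, not on the 59M native chunks. In this sketch the two 24-chunk axes stay at native granularity and only the long axis is coalesced.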

Results

Registration time vs native chunk count (synthetic, perf_tests/registration_scaling.py):

| native chunks | before (one per chunk) | after (coalesced) |
| --- | --- | --- |
| 1,000 | 0.02s | 0.05s |
| 10,000 | 0.25s | 0.25s |
| 100,000 | 2.93s | 0.32s |
| 1,000,000 | 54.27s | 0.46s |

Below the target the behavior is unchanged (no coalescing); above it, registration stays flat.

API

  • read_xarray_table(ds, ..., target_partitions=16384)
  • XarrayContext.from_dataset(..., target_partitions=16384) (the bound is applied per table)
  • target_partitions=None disables coalescing.

Testing

  • tests/test_coalesce.py: group-size algorithm, exact sub-block tiling, bounded partition count, correctness oracle (coalesced vs uncoalesced including WHERE filters), bounded-memory streaming, explicit-chunks capping, and cftime (360_day) transparency.
  • perf_tests/registration_scaling.py: registration-scaling benchmark.
  • Full suite passes: uv run --no-project pytest -v . -m "not integration" (155 tests). cargo check --all-features passes. No Rust changes.
  • Docstrings updated (Google style) and a note added to the docs.

@ghostiee-11

Contributor Author

Hey @alxmrs, can you review this?

read_xarray_table created one DataFusion scan partition per native xarray
chunk, so table registration was O(num_chunks) rather than O(data_size). On
finely chunked stores this is intractable: a single GOES-16 variable has
roughly 59M native chunks and never finishes registering.

Coalesce consecutive native chunks into at most target_partitions scan
partitions (default 16384), allocated in a balanced way across dimensions so
filter pushdown keeps pruning in every dimension. Each coalesced partition
still streams one native chunk at a time, so peak per-partition memory is
unchanged. Pass target_partitions=None to keep one partition per native chunk.

Coalescing only widens a partition's min/max box, which can never cause a
false prune, so query results are identical. Registration is now bounded
regardless of chunking granularity: a synthetic 1M-chunk dataset drops from
54s to 0.46s locally.

Adds algorithm, tiling, correctness-oracle, memory, capping, and cftime tests
plus a registration-scaling benchmark. No Rust changes.
alxmrs force-pushed the fix/registration-bounded-partitions-174 branch from 8c97843 to 847fbc7 on June 20, 2026 16:16
@alxmrs

alxmrs commented Jun 20, 2026

Owner

After merging my speedup changes with this PR, I found it cut ERA5 registration from 10s to 1.2s. But it also made a query that previously took 1.4s take 24s. Would you mind exploring further to see whether we can get both faster registration and fast queries? Please compare against main again, especially on the GOES example, because I may have independently fixed the issue by different means (addressing dask).

@ghostiee-11

Contributor Author

Thanks for testing this against dask, @alxmrs. I looked into both sides before changing anything.

The query regression is real: coalescing prunes to a whole coalesced group instead of a single chunk, so a selective filter reads many times more than it needs (the 1.4s to 24s you saw).
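A rough way to see the read amplification along one axis, assuming fixed-size groups of consecutive chunks. The `chunks_scanned` helper is hypothetical and only counts how many native chunks a scan has to touch.

```python
def chunks_scanned(num_chunks: int, group_size: int, matching: set[int]) -> int:
    """Count native chunks read when pruning works on whole groups."""
    groups = {c // group_size for c in matching}
    return sum(
        min((g + 1) * group_size, num_chunks) - g * group_size for g in groups
    )


# A filter that matches a single native chunk out of 1,000:
native = chunks_scanned(1000, 1, {123})      # pruning per native chunk
coalesced = chunks_scanned(1000, 50, {123})  # pruning per 50-chunk group
```

The more selective the filter and the larger the group, the bigger the gap between the two counts.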

But coalescing is the only thing making GOES-scale registration tractable. On current main, registration is still O(num_chunks): 1M chunks take ~15s, so one 59M-chunk GOES variable is ~15 min, and there are 188. So #174 is still open at that scale, and this PR fixes it the wrong way, by sacrificing pruning.
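The estimate can be reproduced with back-of-the-envelope arithmetic. Two assumptions are baked in: registration time scales linearly with chunk count, and all 188 variables have the same chunking as the one in #174 (the thread only gives the chunk count for a single variable).

```python
SECONDS_PER_MILLION_CHUNKS = 15.0  # measured on current main, per the comment
NUM_VARIABLES = 188

goes_chunks = 102_988 * 24 * 24  # native chunks in one GOES-16 variable

# Linear extrapolation from the 1M-chunk measurement (an assumption).
per_variable_minutes = goes_chunks / 1e6 * SECONDS_PER_MILLION_CHUNKS / 60

# Assumes every variable has the same chunk count (also an assumption).
all_variables_hours = per_variable_minutes * NUM_VARIABLES / 60

print(f"{goes_chunks:,} chunks per variable")
print(f"~{per_variable_minutes:.1f} min per variable")
print(f"~{all_variables_hours:.1f} h for {NUM_VARIABLES} variables")
```

Under those assumptions a single variable takes a little under 15 minutes and the full dataset close to two days, which is why registration is still impractical at GOES scale.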

I think we can get both, building on #176. The pruning metadata is separable per dimension: a partition's min/max box is fixed by per-axis chunk bounds, not the full product (about 103k per-axis values vs 59M boxes for GOES). So:

  1. Store per-axis chunk metadata at registration. Bounded, O(sum of per-axis chunk counts).
  2. Prune per axis at scan time and enumerate only the surviving partitions, at full native granularity. No pruning lost.
  3. Coalesce only the survivors of an unfiltered scan, purely for parallelism.

That gives bounded registration and exact pruning. The cost is reworking the Rust table provider to prune-then-enumerate instead of materialising all N partitions up front.
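The prune-then-enumerate idea can be sketched in Python, with the caveat that the actual change would live in the Rust table provider and that every name here (`axis_chunk_bounds`, `surviving`, `prune_then_enumerate`) is hypothetical. Metadata is stored per axis, each axis is pruned independently, and partitions are enumerated only for the surviving combinations. For the GOES shape that is 102,988 + 24 + 24 = 103,036 per-axis entries instead of about 59M boxes.

```python
import math
from itertools import product


def axis_chunk_bounds(coords: list[float], chunk_len: int) -> list[tuple[float, float]]:
    """Registration step: one (min, max) pair per chunk along a single axis."""
    return [
        (min(coords[i : i + chunk_len]), max(coords[i : i + chunk_len]))
        for i in range(0, len(coords), chunk_len)
    ]


def surviving(bounds: list[tuple[float, float]], lo: float, hi: float) -> list[int]:
    """Scan step: chunk indices on one axis whose range overlaps [lo, hi]."""
    return [i for i, (cmin, cmax) in enumerate(bounds) if cmin <= hi and lo <= cmax]


def prune_then_enumerate(
    axis_bounds: dict[str, list[tuple[float, float]]],
    filters: dict[str, tuple[float, float]],
) -> list[tuple[int, ...]]:
    """Enumerate only the native chunks that survive per-axis pruning."""
    per_axis = []
    for name, bounds in axis_bounds.items():
        lo, hi = filters.get(name, (-math.inf, math.inf))  # no filter: keep all
        per_axis.append(surviving(bounds, lo, hi))
    return list(product(*per_axis))


axis_bounds = {
    "time": axis_chunk_bounds(list(range(100)), 10),  # 10 chunks
    "y": axis_chunk_bounds(list(range(8)), 4),        # 2 chunks
    "x": axis_chunk_bounds(list(range(8)), 4),        # 2 chunks
}
survivors = prune_then_enumerate(axis_bounds, {"time": (25, 34), "y": (0, 3)})
```

Here 14 per-axis entries describe 40 native chunks, and the filtered scan enumerates only the 4 that can match. Coalescing for parallelism would then apply to that survivor list rather than to the stored metadata.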

Want me to prototype this so we can benchmark it on the GOES example? If you would rather keep it small for now, I can default target_partitions to None so this PR is opt-in and changes nothing by default.

@alxmrs

alxmrs commented Jun 22, 2026

Owner

Can you paste a demo script that opens GOES, either in a comment or in the change? I'd like to make a few sanity checks to see if there are other ways we could make it open faster. For example, do you set chunks=None when you open the Zarr?


Development

Successfully merging this pull request may close these issues.

read_xarray registration is O(num_chunks): one partition per native chunk, intractable on finely-chunked stores
