fix: coalesce native chunks into bounded scan partitions (#174)#175
fix: coalesce native chunks into bounded scan partitions (#174)#175ghostiee-11 wants to merge 1 commit into
Conversation
|
Hey @alxmrs, can you review this? |
read_xarray_table created one DataFusion scan partition per native xarray chunk, so table registration was O(num_chunks) rather than O(data_size). On finely chunked stores this is intractable: a single GOES-16 variable has roughly 59M native chunks and never finishes registering. Coalesce consecutive native chunks into at most target_partitions scan partitions (default 16384), allocated in a balanced way across dimensions so filter pushdown keeps pruning in every dimension. Each coalesced partition still streams one native chunk at a time, so peak per-partition memory is unchanged. Pass target_partitions=None to keep one partition per native chunk. Coalescing only widens a partition's min/max box, which can never cause a false prune, so query results are identical. Registration is now bounded regardless of chunking granularity: a synthetic 1M-chunk dataset drops from 54s to 0.46s locally. Adds algorithm, tiling, correctness-oracle, memory, capping, and cftime tests plus a registration-scaling benchmark. No Rust changes.
8c97843 to
847fbc7
Compare
|
After merging my speedup changes with this PR, I found this turned ERA5 registration from 10s --> 1.2s. But, it made a query that previously took 1.4s to take 24s. Would you mind exploring more to see if we can get both the win of faster registration and fast queries? Please see if you can compare to main again -- esp, on the GOES example -- because I may have independently fixed the issue using a different means (addressing dask). |
|
Thanks for testing this against dask @alxmrs . I looked into both sides before changing anything. The query regression is real: coalescing prunes to a whole coalesced group instead of a single chunk, so a selective filter reads many times more than it needs (the 1.4s to 24s you saw). But coalescing is the only thing making GOES-scale registration tractable. On current main, registration is still O(num_chunks): 1M chunks take ~15s, so one 59M-chunk GOES variable is ~15 min, and there are 188. So #174 is still open at that scale, and this PR fixes it the wrong way, by sacrificing pruning. I think we can get both, building on #176. The pruning metadata is separable per dimension: a partition's min/max box is fixed by per-axis chunk bounds, not the full product (about 103k per-axis values vs 59M boxes for GOES). So:
That gives bounded registration and exact pruning. The cost is reworking the Rust table provider to prune-then-enumerate instead of materialising all N partitions up front. Want me to prototype this so we can benchmark it on the GOES example? If you would rather keep it small for now, I can default |
|
Can you paste in a comment or the change a demo script of opening GOES? I'd like to make a few sanity checks to see if there are other ways we could make it open faster. For example, do you set |
What this does
read_xarray_tablecreates one DataFusion scan partition per native xarray chunk, so table registration is O(num_chunks) rather than O(data_size). On finely chunked stores this becomes intractable. The case in #174 is a single Arraylake GOES-16 variable with roughly 102,988 x 24 x 24 (about 59M) native chunks, which takes 25 to 30 minutes to register and never finishes for all 188 variables.This coalesces consecutive native chunks into at most
target_partitionsscan partitions (default 16384), allocated in a balanced way across dimensions so filter pushdown keeps pruning in every dimension. Each coalesced partition still streams one native chunk at a time, so peak per-partition memory during a query is unchanged. Passingtarget_partitions=Nonerestores the previous behavior of one partition per native chunk.Fixes #174.
Why this is safe
Coalescing only ever widens a partition's min/max bounding box. The pruning logic excludes a partition only when a filter provably cannot match its box, so a wider box can never cause a false exclusion. Query results are therefore identical with or without coalescing; only pruning granularity changes. An oracle test asserts this by comparing results at the default target against
target_partitions=None.The group-size allocation is O(D log D) in the number of dimensions and independent of the native chunk count, so it stays fast even at tens of millions of chunks.
Results
Registration time vs native chunk count (synthetic,
perf_tests/registration_scaling.py):Below the target the behavior is unchanged (no coalescing); above it, registration stays flat.
API
read_xarray_table(ds, ..., target_partitions=16384)XarrayContext.from_dataset(..., target_partitions=16384)(the bound is applied per table)target_partitions=Nonedisables coalescing.Testing
tests/test_coalesce.py: group-size algorithm, exact sub-block tiling, bounded partition count, correctness oracle (coalesced vs uncoalesced including WHERE filters), bounded-memory streaming, explicit-chunks capping, and cftime (360_day) transparency.perf_tests/registration_scaling.py: registration-scaling benchmark.uv run --no-project pytest -v . -m "not integration"(155 tests).cargo check --all-featurespasses. No Rust changes.