Skip to content

Add lazy round-trip benchmark (case 09)#178

Open
ghostiee-11 wants to merge 1 commit into
alxmrs:geospatial-sql-benchmarksfrom
ghostiee-11:geobench-lazy-roundtrip
Open

Add lazy round-trip benchmark (case 09)#178
ghostiee-11 wants to merge 1 commit into
alxmrs:geospatial-sql-benchmarksfrom
ghostiee-11:geobench-lazy-roundtrip

Conversation

@ghostiee-11

@ghostiee-11 ghostiee-11 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

For #177 , Adds 09_lazy_roundtrip.py: the same air[t0] slab is 1,325 rows / ~0.7 MB via lazy to_dataset(chunks={"time":1}) + .sel(time=t0) (one WHERE pushed down) vs 3.86M rows / ~161 MB eager, all asserted equal to the xarray reference.
Also hardens _harness.py (grid-coverage guard in assert_grid_close, tracemalloc finally in measured); runs green locally via uv.

Add benchmarks/geospatial/09_lazy_roundtrip.py showing the SQL to xarray
round-trip (to_dataset) is lazy: a chunked to_dataset plus .sel(time=t0)
pushes a single WHERE into SQL (1,325 vs 3,869,000 rows; ~0.7 vs ~161 MB)
and asserts equal to the xarray reference. Also harden _harness.py:
assert_grid_close fails on a partial grid, and measured() stops tracemalloc
in a finally. Document case 09 in the suite README.
#
# [tool.uv.sources]
# xarray-sql = { path = "../../", editable = true }
# ///

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this as a good unit test or property that cross cuts all the other geo benchmarks, but I don't think it alone makes for a good benchmark example.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, makes sense

@alxmrs alxmrs left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few other notes. If you removed case 09 and we got to the bottom of the fixes, I'd be happy to merge this.

Comment on lines 187 to +193
t0 = time.perf_counter()
yield
elapsed = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
try:
yield
finally:
elapsed = time.perf_counter() - t0
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix seems like a good idea

Comment on lines +224 to +233
short = {
d: (got.sizes[d], ref.sizes[d])
for d in ref.dims
if d in got.sizes and got.sizes[d] != ref.sizes[d]
}
if short:
raise AssertionError(
f"{name}: SQL result does not cover the reference grid "
f"(dim: got vs ref = {short}); the comparison would be partial"
)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that Xarray's all close doesn't cover this case. Are you sure this is necessary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, you're right allclose would catch it on its own. it's the reindex_like(got) one line up that hides it: it shrinks ref down to got's coords first, so a result missing cells still passes on the subset.

got = ref.isel(lat=[0, 1, 2])  # 2 cells dropped
xr.testing.assert_allclose(got, ref.reindex_like(got))  # passes, silently
xr.testing.assert_allclose(got, ref)                    # raises

so the guard just restores the check reindex_like removes. could also drop reindex_like for xr.align(..., join="exact"), but that line handles label ordering so the guard felt smaller. either works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants