Geospatial SQL benchmark suite: prove core geo/climate ops are relational#177
Geospatial SQL benchmark suite: prove core geo/climate ops are relational#177alxmrs wants to merge 26 commits into
Conversation
A suite of worked examples arguing that core geospatial/climate operations
we associate with the array paradigm are, underneath, relational operations.
Each case poses the operation in SQL against xarray-sql and asserts the result
matches an xarray/array reference to floating-point tolerance.
Cases (all validated):
- 01 NDVI: per-pixel column algebra + CASE mask, on a real Sentinel-2 L2A
scene in Zarr (ESA EOPF sample service).
- 02 climatology: diurnal cycle as GROUP BY lat, lon, hour (ARCO-ERA5).
- 03 zonal mean: GROUP BY latitude over the full ARCO-ERA5 archive,
WHERE-pruned to one day.
- 04 anomaly: climatology CTE self-JOIN (ARCO-ERA5).
- 05 forecast skill: Pangu-Weather and GraphCast vs ERA5 (WeatherBench2),
a forecast<->truth JOIN on valid_time = init + lead; reproduces GraphCast
beating Pangu at every lead.
- 06 zonal stats: raster x vector range JOIN of full ERA5 against continental
bounding boxes.
- 07 reprojection: PROJ-backed scalar UDF (ST_Transform-style), returning a
{lon, lat} struct to keep PROJ single-threaded.
- 08 regridding: bilinear regrid as a sparse weight-table JOIN, validated
against xarray .interp().
Includes a shared harness (_harness.py: timing, peak memory, assert_allclose,
SQL echo), ERA5 helpers (_era5.py), a suite README, and a narrative write-up
(docs/geospatial.md) wired into the docs nav. Scripts carry PEP 723 inline
metadata so they are runnable standalone via `uv run`.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
| b04 = zarr.open_array(_R10M + "/b04", mode="r") | ||
| b08 = zarr.open_array(_R10M + "/b08", mode="r") | ||
| rows = slice(_ROW0, _ROW0 + _N) | ||
| cols = slice(_COL0, _COL0 + _N) | ||
| red = np.asarray(b04[rows, cols]) | ||
| nir = np.asarray(b08[rows, cols]) |
There was a problem hiding this comment.
Is this really the canonical way to open the dataset in Xarray? Please review the docs and sentinel example to see if we can make this more typical. Is this fixed if we use pystack?
| # Pixel-center coordinates from the affine transform. | ||
| x = _X0 + (np.arange(_COL0, _COL0 + _N) + 0.5) * _RES | ||
| y = _Y0 - (np.arange(_ROW0, _ROW0 + _N) + 0.5) * _RES | ||
| return xr.Dataset( | ||
| {"red": (["y", "x"], red), "nir": (["y", "x"], nir)}, | ||
| coords={"y": y.astype("float64"), "x": x.astype("float64")}, | ||
| ).chunk({"y": 256, "x": 256}) |
There was a problem hiding this comment.
We should be able to calculate NDVI in pure xarray after opening the dataset.
| got = got.sortby(["y", "x"]) | ||
| ref = ref.sortby(["y", "x"]) |
There was a problem hiding this comment.
I belive we can omit this if we use ORDER BY in the SQL statement.
…nual scaling/CASE/sort Addresses PR review of case 01: - Discover the Sentinel-2 L2A product with pystac-client instead of a hard-coded object-store href. - Open it the canonical xarray way: xr.open_datatree on the EOPF Zarr, then the measurements/reflectance/r10m node. This yields B04/B08 already scaled to reflectance with x/y coordinates, so the manual zarr.open_array reads and proj:transform coordinate reconstruction are gone. - NDVI is now pure column arithmetic on both sides: xarray decodes the band _FillValue to NaN on open, so nodata masking is free and the CASE WHEN is dropped. The reference is a clean one-line xarray expression. - Use ORDER BY y, x in SQL and compare via coordinate-aligned xr.testing.assert_allclose (reindex_like) instead of manual .sortby() on both sides. Docs (suite README, docs/geospatial.md) updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
|
Thanks for the review of case 01 — all four points addressed in 6bc41aa:
Verified end-to-end: 1,048,576 pixels match the xarray reference; mean NDVI 0.235 over the window. Docs (suite README + One note: |
| ds = load_window( | ||
| full, "2m_temperature", time=_TIME, latitude=_LAT, longitude=_LON | ||
| ) |
There was a problem hiding this comment.
I think repeating the sel command here is instructive.
There was a problem hiding this comment.
I recommend a fluent interface
There was a problem hiding this comment.
In fact, the open_zarr call and the slicing could be one fluent chain.
| show_sql(sql) | ||
|
|
||
| with timed("SQL diurnal climatology"): | ||
| got = ctx.sql(sql).to_pandas() |
There was a problem hiding this comment.
This should be loaded to a dataset. That's how climatologies are typically used.
| # Array reference: the textbook groupby-over-the-cycle reduction. | ||
| with timed("xarray reference"): | ||
| clim = ds["2m_temperature"].groupby("time.hour").mean("time") - 273.15 | ||
| ref = clim.to_dataframe(name="clim_c").reset_index() |
There was a problem hiding this comment.
Omit transforming this to a dataframe.
|
|
||
| ctx = xql.XarrayContext() | ||
| with timed("register full ERA5"): | ||
| register_era5(ctx, ds) |
There was a problem hiding this comment.
I don't like this helper either since repetition here serves as useful documentation.
| show_sql(sql) | ||
|
|
||
| with timed("SQL zonal mean (WHERE-pruned to one day)"): | ||
| got = ctx.sql(sql).to_pandas() |
There was a problem hiding this comment.
I believe to_pandas() invokes a copy. Is it needed?
… coordinate-aligned comparison Addresses round-2 PR review (cases 02, 03) and applies the same patterns across the suite: - Remove the _era5.py helper module; each case now shows the xr.open_zarr(...).sel(...).load() fluent chain inline (and registration inline), since the repetition documents the canonical xarray usage. - Round-trip gridded SQL results back to xarray Datasets via to_dataset(...) instead of to_pandas() — a climatology/zonal-mean/anomaly/regrid is gridded data and is used as such. - Keep references in pure xarray (no to_dataframe); compare with a new assert_grid_close helper that aligns by coordinate label (xr.testing.assert_allclose), so no manual .sortby() is needed. - Add ORDER BY to the climatology and anomaly queries. - Apply the same Dataset-return + xarray-comparison treatment to the forecast skill (RMSE-by-lead as a DataArray over lead), reprojection (lon/lat grid), and regrid (reshaped to the target grid) cases. - Drop the now-unused check_close harness helper. docs/geospatial.md notes that every query round-trips its answer back to an xarray.Dataset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
|
Round-2 feedback addressed in e1cbc8b (cases 02/03 explicitly, plus the same treatment swept across the suite). Your comments on 02 / 03:
Swept across the suite (anticipating the same notes):
Full suite re-run is green (NDVI 1.05M px; climatology 535,704 cells; anomaly 1.6M cells; GraphCast still beats Pangu at every lead). Background note: I evaluated Thanks for the careful review — the cases read much more like idiomatic xarray now. |
| _URL, chunks=None, storage_options={"token": "anon"} | ||
| )[["2m_temperature"]] | ||
| .sel(time=_TIME, latitude=_LAT, longitude=_LON) | ||
| .load() |
There was a problem hiding this comment.
I think we should omit the load calls to test out the lazy property of xql. What do you think?
| ds = ( | ||
| xr.open_zarr( | ||
| _URL, chunks=None, storage_options={"token": "anon"} | ||
| )[["2m_temperature"]] |
There was a problem hiding this comment.
I don't think we need this. Xarray-SQL with have a pushdown opt to only select columns we care about.
| # /// | ||
| """Forecast skill — scoring ML weather models against ERA5 is a JOIN + aggregate. | ||
|
|
||
| This is the real thing: scoring the **Pangu-Weather** and **GraphCast** machine- |
There was a problem hiding this comment.
I don't like "this is that real thing"
| Snaps lat/lon onto ERA5's exact coordinate arrays (identical 64×32 grid) so | ||
| the equality JOIN on coordinates is bit-safe across the two Zarr stores. | ||
| """ | ||
| da = _open(url)[_VAR].sel(time=_INIT) |
There was a problem hiding this comment.
I think at can omit the var selection here too, and elsewhere.
| return da.to_dataset().load() | ||
|
|
||
|
|
||
| def _rmse_sql(table: str) -> str: |
There was a problem hiding this comment.
This is vulnerable to a SQL injection IIUC. Can we demo this in a way that puts forth best SQL practices?
| era5_full = _open(_ERA5) | ||
|
|
||
| # Load the forecasts first; snap their grid to ERA5's exact coordinates. | ||
| with timed("read Pangu + GraphCast forecasts"): |
There was a problem hiding this comment.
If we don't load up front, then I don't think these timed blocks for measuring loading are necessary. This is good, it demos laziness.
…-3 review
Addresses round-3 PR review (cases 04, 05) and extends the patterns PR-wide:
- Demonstrate laziness + pushdown: ERA5 cases (02,03,04,06) now register the
whole archive lazily (no .load(), no variable pre-selection) and let
projection pushdown read only 2m_temperature and partition pruning read only
the queried window. References use lazy .sel(); the up-front "read into
memory" steps are gone.
- Parameterized queries (no SQL-string interpolation of values): time/region
bounds are passed via param_values rather than f-string-formatted into the
SQL. Verified pruning still fires with bound parameters.
- Forecast skill (05): stack the two models along a `model` dimension into one
forecast table and GROUP BY the model column, eliminating the `FROM {table}`
identifier interpolation (the injection smell) entirely. The 2m_temperature
selection stays because the models store different pressure-level sets
(Pangu 13, GraphCast 37) and must be aligned on the common surface field.
Reword the "this is the real thing" docstring opening.
docs/geospatial.md and the suite README updated to describe lazy registration,
parameterized filters, and the model-column forecast query.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
|
Round-3 feedback addressed in a971f2b (cases 04/05 explicitly, extended across the suite). Your comments:
One deliberate exception: 05 still selects Full suite re-run is green (NDVI 1.05M px; climatology 535,704 cells; anomaly 1.6M cells; Sahara 33 °C vs Greenland −8 °C; GraphCast still beats Pangu at every lead). The lazy cases are slower now (each ERA5 case reads its window for the SQL and again for the reference), which is the expected cost of demonstrating laziness rather than pre-loading. Background note (carried over): |
alxmrs
left a comment
There was a problem hiding this comment.
A few last thoughts. The geospatial docs are excellent.
| For worked examples of common geospatial operations expressed in SQL — NDVI, | ||
| climatology, anomalies, zonal statistics, forecast skill (Pangu-Weather and | ||
| GraphCast vs ERA5), reprojection, and regridding, each checked against an xarray | ||
| reference — see [`benchmarks/geospatial/`](benchmarks/geospatial/) and the | ||
| [Geospatial operations are relational operations](docs/geospatial.md) write-up. | ||
|
|
There was a problem hiding this comment.
I think this paragraph should live in a new section below called "does it work?" and be framed to answer that question. Then add can close this issue: #46
There was a problem hiding this comment.
I think "does it work" should be between "how does this work" and "why does this work".
| interpolation — which is case 08, where regridding turns out to be a JOIN | ||
| against a weight table rather than a scalar UDF. | ||
|
|
||
| **An honest caveat on threading:** PROJ's transformation context is not |
…real SRTM regrid) - 07 reprojection: open a UTM grid through Xee carrying ee.Image.pixelLonLat(), so Earth Engine's own geodesy reports each pixel's true lon/lat. The SQL PROJ UDF is now validated against an *independent* implementation (EE), not against PROJ via pyproj. Grid built with xee.helpers.fit_geometry; compared at ~1e-5°. - 08 regridding: regrid real SRTM elevation (USGS/SRTMGL1_003, Sierra Nevada) opened via Xee, validated against xarray's bilinear .interp() on the same source field — a new (DEM) data type for the suite. - Both require Earth Engine access (earthengine authenticate + EARTHENGINE_PROJECT) and skip cleanly when EE is not initialized, like the other network cases. Deps add xee, earthengine-api, shapely. Verified offline: imports, the graceful EE-not-initialized skip, ruff/mypy, and the xee open_dataset / fit_geometry API surface. The live EE round-trip was not run here (no credentials in this environment). docs/geospatial.md and the suite README updated for the EE/Xee sources. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
- Move the geospatial-suite pointer out of "What is this?" into a new "Does it work?" section, placed between "How does it work?" and "Why does this work?", reframed to answer that question with a findings summary (validated operations, real datasets, the GraphCast-beats-Pangu result). Addresses #46. - Reword the two remaining uses of "honest" in docs/geospatial.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
|
Round-4 feedback addressed in df67d98:
README section order is now: How does it work? → Does it work? → Why does this work? |
… both pass live - Add _harness.initialize_earth_engine(): initialize EE from Application Default Credentials (gcloud auth application-default login) with the Earth Engine scope, using the ADC/EARTHENGINE_PROJECT project. This avoids the separate `earthengine authenticate` OAuth flow (and the "this app is blocked" org-policy error), and needs no service account. - 07/08 use the helper instead of bare ee.Initialize. - 07: Xee names the projected grid coords x/y (not X/Y), so drop the erroneous rename. Verified end-to-end against Earth Engine: - 07 reprojection: PROJ UDF matches EE's own ee.Image.pixelLonLat() to ~1e-5° over 2,565 UTM pixels (independent geodesy engines agree; pixel centers align). - 08 regrid: real SRTM elevation (Sierra Nevada, 1206–3439 m) regridded 50×50 → 60×72 matches xarray bilinear .interp() exactly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
…xarray-sql - Resolve the script's own directory (BASH_SOURCE) so it works from any cwd — the old `uv run ./01_nvdi.py` only worked from benchmarks/geospatial/ (and had a typo: 01_nvdi → 01_ndvi). - Glob all NN_*.py cases instead of an incomplete hand-written list; report a per-case PASS/FAIL summary and exit non-zero if any case fails. - Hand `uv` the local checkout via `--with-editable $REPO_ROOT`: these cases use xarray-sql features (XarrayDataFrame.to_dataset) newer than the latest PyPI release, so plain `uv run` would resolve an xarray-sql without `to_dataset`. - 01_ndvi.py: declare aiohttp + requests (needed by pystac-client / zarr-over- HTTPS under uv's isolated environment). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
The latest PyPI xarray-sql predates XarrayDataFrame.to_dataset, so plain
`uv run <script>` resolved a build without it. Each script now declares
[tool.uv.sources]
xarray-sql = { path = "../../", editable = true }
in its PEP 723 metadata. uv resolves this path relative to the *script*, so
`uv run <script>` uses the in-repo build from any working directory. run_all.sh
drops the `--with-editable` workaround and is back to plain `uv run`.
Verified `uv run 08_regrid_weights.py` from /tmp uses the local xarray_sql (with
to_dataset) and passes. Once a release ships these features, the [tool.uv.sources]
block can be removed for standalone PyPI use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Add _harness.show_result(), and call it in every case right after
assert_grid_close passes, so a run shows *what* it computed — the gridded answer
round-tripped back out of SQL as an xarray Dataset/DataArray, via the standard
repr. Drop the now-redundant "<case> Dataset: {sizes}" one-liners in 02/04
(the repr already shows sizes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Revise the prose across the suite for the future reader/user — someone from the
SQL or Xarray world — rather than for the PR conversation:
- Frame features by what that audience values (lazy, idiomatic) instead of the
mechanics that justified review decisions ("nothing loaded or column-selected
up front", "bounds passed as query parameters, not formatted into the SQL",
"no table name formatted into the SQL").
- Fix stale claims: the suite compares with xarray (not `numpy.assert_allclose`),
Earth Engine uses your `gcloud` login (not `earthengine authenticate`), and the
run instructions cover `run_all.sh` and the in-repo build.
- Trim jargon from runtime labels and tighten a couple of code comments.
Touches README ("Does it work?"), docs/geospatial.md, the suite README, and the
02–06 docstrings/labels. No code behavior changes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Add a "Where this list comes from" section linking the Large Scale Geospatial Benchmarks discussion (coiled/benchmarks #1545, opened by James Bourbeau, 2024) and explaining what it asked: the end-to-end geospatial/climate workflows the Xarray/Dask ecosystem needs at the 100-TB scale. A table maps the discussion's seven workflow categories to the cases here — the suite covers six of the seven (all but vector polygon-to-polygon spatial joins) — underscoring that the operations are an independent community survey, not a SQL-friendly cherry-pick. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
|
Skimmed the PR; wondering if the benchmark timings can be documented (e.g. how long it took on the last run) |
With GEOBENCH_PROFILE set, run_case runs a case's main() a warmup plus GEOBENCH_REPS times, suppressing the repeated output, and writes one row of summary stats (min/median/mean/stdev/max seconds + peak MB) per timed step to GEOBENCH_CSV — a single shared table across all cases. Because each main() is self-contained (builds its own context, reads its own data), re-running it samples the operations with no change to the case scripts: all eight remain byte-identical, and runs without the flag are unchanged. Documented in the suite README (Profiling section). Verified end-to-end on case 08 (normal run unchanged; profiled run emits a clean per-step summary and CSV). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
…ain() Cleaner than the previous approach (which re-ran each case's main() and suppressed its output): a repeatable step is now wrapped in `for _ in measured(label): ...` — one line different from `with timed(...)`, body unchanged. Without profiling it runs once; under GEOBENCH_PROFILE it runs a warmup plus GEOBENCH_REPS measured passes and writes one summary row per step to GEOBENCH_CSV. The harness no longer re-runs main() or redirects stdout, and the perf table now carries both the SQL query and the xarray reference per case (a built-in side-by-side). Registration/opens stay in `with timed(...)`. The case bodies remain idiomatic xarray/SQL; only the SQL-query and reference blocks gained the measured() wrapper. Verified normal runs are unchanged and profiled runs of 08 (EE) and 05 (WB2 JOIN) emit clean summaries + CSV. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
ghostiee-11
left a comment
There was a problem hiding this comment.
Read through the suite end to end, really like the relational framing and that every case round-trips back to xarray and asserts against a reference. Two small things on the harness and one idea below.
| t0 = time.perf_counter() | ||
| yield | ||
| elapsed = time.perf_counter() - t0 | ||
| _, peak = tracemalloc.get_traced_memory() | ||
| tracemalloc.stop() |
There was a problem hiding this comment.
nit: if the body raises, we skip tracemalloc.stop() and leave it running. small try/finally like timed():
| t0 = time.perf_counter() | |
| yield | |
| elapsed = time.perf_counter() - t0 | |
| _, peak = tracemalloc.get_traced_memory() | |
| tracemalloc.stop() | |
| t0 = time.perf_counter() | |
| try: | |
| yield | |
| finally: | |
| elapsed = time.perf_counter() - t0 | |
| _, peak = tracemalloc.get_traced_memory() | |
| tracemalloc.stop() |
alxmrs
left a comment
There was a problem hiding this comment.
Latest round of feedback
| r10m[["b04", "b08"]] | ||
| .rename(b04="red", b08="nir") | ||
| .isel(y=slice(_Y0, _Y0 + _N), x=slice(_X0, _X0 + _N)) | ||
| .load() |
There was a problem hiding this comment.
Why do we have this load op here still?
| # decode_timedelta=True: forecasts store prediction_timedelta as a | ||
| # real duration (and it silences xarray's decode-timedelta warning). | ||
| return xr.open_zarr( | ||
| url, chunks=None, storage_options=_SO, decode_timedelta=True |
There was a problem hiding this comment.
Does this storage options value have to be extracted into a constant if we are already extracting the open call to a function?
| got = xr.DataArray( | ||
| flat.values.reshape(len(tlat), len(tlon)), | ||
| dims=["lat", "lon"], | ||
| coords={"lat": tlat, "lon": tlon}, | ||
| ) |
There was a problem hiding this comment.
So we need to reorganize the script as a data array? Why do we do this copy? Isn't the output + da access enough?
|
built the lazy round-trip case I mentioned and opened it against this branch as #178 (also carries the two harness tweaks). same slab is 1,325 rows / ~0.7 MB lazy vs 3.86M / ~161 MB eager, asserted equal. |
|
@alxmrs a few cases I'd like to add next, all pulled from the #1545 workflow list so they're community-asked rather than SQL-friendly cherry-picks:
|
- assert_grid_close: fail when the SQL result misses grid cells. reindex_like
aligned a short result onto the reference and silently passed; now a per-dim
size mismatch raises, since the suite's whole point is "same numbers".
- measured(): wrap the profiled yield in try/finally so tracemalloc.stop()
always runs even if the body raises.
- 01 NDVI: register the scene lazily (drop the eager .load()); the xarray
reference now .compute()s in its block so its perf row measures the same
read-and-compute the SQL side does, not just graph construction.
- 05: inline storage_options={"token": "anon"} into the single _open() that
uses it; drop the _SO module constant.
- 08 regrid: name each target cell by its (dst_lat, dst_lon) in the weight
table and GROUP BY those, so the result round-trips straight to a (lat, lon)
grid via to_dataset — no manual reshape/copy into a new DataArray.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
The in-process warm loop was unfair: xr.open_zarr(chunks=None) caches each variable in memory after the first read, so the xarray reference served its later repetitions from RAM while the SQL side re-read the store every query — inflating the gap (case 05 read as 23.5s SQL vs 0.29s reference; cold-vs-cold it's ~0.8s vs ~0.6s). run_perf.sh runs each case once per fresh process with no warmup, repeated GEOBENCH_REPS times, so both the SQL operation and the xarray reference pay a cold read on every measurement. perf_summary.py aggregates the independent cold runs into a per-step table. The cases and the measured() harness are unchanged; fairness comes from process isolation, not in-process repetition. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Add Results, Analysis, and Conclusion sections to docs/geospatial.md. Results publishes the cold-vs-cold timings (run_perf.sh on an e2-standard-8 in-region), with the methodology note on why a fresh process per repetition is the only fair comparison. Analysis decomposes case 05 into read vs compute vs round-trip (cProfile): the read is comparable on both sides, the ~30x gap is compute (explode to rows, hash JOIN, aggregate vs a vectorized NumPy reduction), and the SQL->xarray round-trip is sub-millisecond. It explains the table's shape — JOINs widest, GROUP BYs narrowest — and why the ratio shifts with hardware. Conclusion states the tradeoff plainly: arrays for dense, grid-aligned numerics; SQL for relationally shaped work and SQL-fluent audiences; and the distributed query-engine payoff the single-node benchmark can only point toward. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
| @@ -0,0 +1,113 @@ | |||
| #!/usr/bin/env python3 | |||
| """Aggregate the cold-run perf CSV into a per-step summary and a markdown table. | |||
There was a problem hiding this comment.
Let's not check in this script.
There was a problem hiding this comment.
Done — removed perf_summary.py and folded its aggregation into run_perf.sh as an inline step, so the perf tooling is one self-contained driver (0941e0b).
|
|
||
| * **Spectral indices** (NDVI) — column arithmetic over a real Sentinel-2 scene. | ||
| * **Climatology, anomalies, zonal means** — `GROUP BY` and self-`JOIN` over the | ||
| full 0.25° **ARCO-ERA5** archive (≈1.3M hourly steps), read lazily so each |
There was a problem hiding this comment.
Is it really this many hourly steps?
There was a problem hiding this comment.
Did we really do this broad of a query?
There was a problem hiding this comment.
Good catch — I dropped the “≈1.3M hourly steps” figure; it overstated what the queries touch. Reworded to say the archive is multi-decade and registered as a lazy table, while each query reads only its small window (0941e0b).
There was a problem hiding this comment.
Right — the queries are bounded to a small window (a few days over a region), not a scan of the archive. Reworded so the README says exactly that (0941e0b).
| ``da.groupby("time.hour").mean()``. ERA5 is hourly, so grouping by hour of day | ||
| gives a clean 24-bin **diurnal cycle**, one sample per day in the window. | ||
|
|
||
| The table is the *whole* ARCO-ERA5 archive, opened lazily: the query reads only |
There was a problem hiding this comment.
This doesn't appear to be the whole ERA5 archive, because we have a time range. What is the truth?
There was a problem hiding this comment.
Agreed. Reworded: we register the whole archive as a lazy table, but this climatology is computed over a bounded window — the WHERE prunes the read, so the query never scans the full record (0941e0b).
| FROM era5 a JOIN clim c | ||
| ON (a.latitude, a.longitude, hour(a.time)) = (c.latitude, c.longitude, c.hour) | ||
|
|
||
| The table is the *whole* ARCO-ERA5 archive, opened lazily: both the climatology |
There was a problem hiding this comment.
This is not the whole ERA5 archive.
There was a problem hiding this comment.
Reworded the same way — the table spans the archive lazily; the anomaly is computed over a bounded window the WHERE prunes to (0941e0b).
| horizon" curve (≈0.3 K at 6 h rising to ≈2.5 K at 9 days): | ||
|
|
||
| ``` | ||
| lead (days) Pangu GraphCast |
There was a problem hiding this comment.
If we use the dataframe code above, then let's print that repr here.
There was a problem hiding this comment.
Done — the doc now shows the got.to_pandas() repr (RMSE by lead × model), matching the script's new headline (1cd2c93).
| relational, and that for a surprising share of geoscience, the operation is | ||
| relational. | ||
|
|
||
| ## Running the suite |
There was a problem hiding this comment.
Let's put this section above results.
There was a problem hiding this comment.
Done — “Running the suite” now sits above Results (1cd2c93).
… fixes Fairness: case 05's reference used `.load()`, which caches data in place on the forecasts/truth objects the SQL table also reads from — so running the reference after the SQL query (as the harness does) could serve it a warm read. Switch to `.compute()`, which returns a fresh array and leaves the inputs lazy. Verified: reading a window repeatedly in one process stays flat, and neither side warms the other. The other cases either reopen their data (06) or recompute eagerly with `chunks=None` (02-04), so they were already cold; only 05 leaked. Case 05: render the headline RMSE-by-lead table with `to_pandas()` instead of a hand-rolled print loop; clarify that the `chunks=` arg is the Arrow batch size, not a filter (no data dropped). Case 06: the docstring claimed "full ARCO-ERA5" as the dataset; the query aggregates one day's window (the WHERE prunes the read). Make that precise. Case 07: drop the "graduate into the package" paragraph; clarify that there is one dataset (UTM x/y as SQL input, EE's pixelLonLat as the independent reference), not the same image opened twice in two CRS. Docs: correct the profiling methodology note (the caching trap is in-place `.load()`, not `open_zarr` auto-caching); point the "earns its keep" section forward to Results/Analysis/Conclusion; move "Running the suite" above Results; link James Bourbeau's profile; show the to_pandas headline repr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
The ARCO-ERA5 cases (02, 04, README) said the query runs over the *whole* archive; each query is in fact bounded by its WHERE window. Reword to say the table spans the archive (lazy) while the query reads only its slice — and drop the unverified "~1.3M hourly steps" framing from the README. Fold perf_summary.py into run_perf.sh (an inline aggregation step) and remove the separate script, so the perf tooling is one self-contained driver. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Fresh in-region e2-standard-8 run of the full suite. The key change is case 05's reference: now genuinely cold (~0.34s, stable across reps) instead of the ~0.22s warmed number — confirming the .load()->.compute() fix closed the leak. Update the Results table (cases 01-06), note case 05's CPU/hardware sensitivity (~12s vs ~23s across two VMs), and correct the NDVI ratio (~8x). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
| ctx.from_dataset("forecasts", forecasts, chunks={"time": 6}) | ||
| ctx.from_dataset("era5", truth, chunks={"time": 100}) |
There was a problem hiding this comment.
This doesn't make sense to me. This ERA5 dataset is a 6 hourly step version at a small grid designed to be comparable to the forecasts. By throw away data, I mean we load 100 timesteps in the chunk and then discard the remainder that doesn't match the forecast chunk (as they divide together).
Can you write a small prototype (that's discarded) that tries 6 and 6 chunks ve 6 and 100 chunks to see what the empirical answer is? I'm still not convinced.
| "2m_temperature" | ||
| ] | ||
| .sel(time=_DAY) | ||
| .load() |
There was a problem hiding this comment.
Since we're not using dask anyway, let's get rid of this load. (Is this essentially for any reason?)
| rename[d] = "lat" | ||
| elif dl in ("x", "lon", "longitude"): | ||
| rename[d] = "lon" | ||
| return da.rename(rename).sortby("lat").sortby("lon").load() |
There was a problem hiding this comment.
I want this to be a valid perf comparison test of laziness for the xr ref too, so let's get rid of the load here.
| # Aggregate the per-process cold runs (one row each) into a per-step summary and | ||
| # a markdown table. Each raw row is one cold sample (reps=1), so we collect the | ||
| # samples per (case, title, step) and report median/spread/peak across them. | ||
| python3 - "$RAW" "$SUMMARY" <<'PY' |
There was a problem hiding this comment.
At this point, fine, let's make this a second script.
There was a problem hiding this comment.
Done — moved the aggregation back into perf_summary.py as a second script (the inline heredoc was worse to read) (7bfd56f).
Case 06: drop the reference's `.load()`. The day's field is now read inside the timed block (like the SQL side) via a single masked reduction over stacked region masks — one lazy read for all regions, instead of pre-materializing. Verified against the SQL result. perf: move the aggregation back out of run_perf.sh into perf_summary.py (a second script reads cleaner than the inline heredoc). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Previously the SQL side read SRTM into an in-memory (cell_id, value) table while the reference read lazily — asymmetric. Now both read the source lazily: register the Xee field itself as the SQL `src` table and key the weight join on the source coordinates (src_lat, src_lon) instead of a pre-raveled cell_id, and drop the `.load()` in `_open_srtm`. Coords are forced to float64 so the join matches the source grid exactly. The regrid math (the new weight table + coordinate join) is validated against xarray `.interp` on a synthetic source to ~1e-12; the Earth Engine end-to-end run still needs confirmation in an EE-authenticated environment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
…ding
A 2×2 sweep of (forecasts, era5) time-chunks ∈ {6,100} shows the truth/era5 chunk
is what matters — small splits it into many partitions and costs ~3× — while the
forecasts chunk is in the noise and a mismatch costs nothing. So set both to one
partition (time:100) for these small windows, and document why.
Also: cases 07 and 08 now validated end-to-end against real Earth Engine (08's
symmetric-lazy regrid matches xarray .interp over 4,320 target cells; 07's PROJ
UDF matches EE pixelLonLat).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Case 09 composes 07 (the reproject UDF) and 08 (the regrid weight-table JOIN) into a full warp — GDAL/rasterio `reproject`: change the CRS *and* resample onto the new grid. The 07 UDF carries the target lon/lat grid back into source UTM space, arrays turn those reprojected points into bilinear weights, and the 08 JOIN applies them. Validated against xarray `.interp()` at the reprojected points (exact), with Earth Engine's own lon/lat SRTM as a cross-CRS sanity check. Wires 09 into the README case table + dataset note, the docs mapping table and the "hard cases" narrative, and the run_perf.sh glob. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
Re-ran the whole suite (01-09) on a single e2-standard-8 in us-central1, with Earth Engine reached from the same VM, so every case shares one machine and one release build. Updates the Results table and the analysis to match: - Case 05 came in at 10.7 s (not 23 s) — the hardware-sensitivity caveat borne out; 23 s was the high outlier across three runs. - SQL is no longer slower in *every* case: 08 is at parity (0.875 vs 0.850 s) and 09 is faster (0.281 vs 0.817 s), because the geometry (the weights) is precomputed and the relational half just applies it. The analysis and conclusion now reflect this instead of a blanket "slower." - Case 08 is no longer the stale eager-load number; both sides read SRTM lazily, so the comparison is symmetric. Adds the new case 09 row. - Marks the cProfile decomposition explicitly as a separate laptop profile (the intended hardware contrast), resolving the apparent total-vs-table mismatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9
What & why
A suite of worked examples in
benchmarks/geospatial/arguing a deliberately radical thesis: the core operations of geospatial and climate analysis — the ones we reach for an array library to perform — are, underneath, relational operations. Climatologies, anomalies, zonal means, spectral indices, forecast skill, even reprojection and regridding map onto ordinary SQL (GROUP BY,JOIN, window functions,CASE, and the occasional scalar UDF).Every case poses the operation in SQL against
xarray-sql, round-trips the result back to anxarray.Dataset/DataArray(viato_dataset), and asserts it matches an xarray reference (coordinate-alignedxr.testing.assert_allclose). SQL in, an array out, verified — and each run prints the result's standard xarray repr. The headline is correctness + clarity of the SQL, not raw speed.The README gains a "Does it work?" section summarizing the findings (addresses #46), and a full write-up lives in
docs/geospatial.md(wired into the docs nav).The cases (all validated end-to-end on real data)
pystac-client+xr.open_datatree)GROUP BY lat, lon, hourGROUP BY latitudeWHERE-prunedJOINJOINonvalid_time = init + leadJOINST_Transform-style)pixelLonLatJOIN+ weightedGROUP BY.interp()Highlights
Running
Run a single case with
uv run benchmarks/geospatial/<case>.py, or the whole suite withbenchmarks/geospatial/run_all.sh— from any directory. Cases that need data or credentials you do not have skip cleanly with a clear message: the cloud datasets (ERA5, WeatherBench 2) are read anonymously, and the Earth Engine cases reuse your existinggcloudlogin. (Until the nextxarray-sqlrelease, the scripts run against the version in this branch.)Performance
Profiled timings — 8 cases on an
e2-standard-8VM inus-central1Measured cold vs cold with
benchmarks/geospatial/run_perf.sh: each case runsonce per fresh process, with no warmup, repeated 5 times. Fairness took some care
— the trap is caching. A reference that calls
.load()caches its data in placeon the very object the SQL table also reads from, so a later read (even just running
the reference after the SQL query in the same process) can be served warm. Case 05
loaded shared objects and now uses
.compute()instead (fresh array, inputs staylazy); the other references either reopen their data or recompute eagerly with
chunks=None(NumPy, no graph to keep warm). Verified directly: reading a windowrepeatedly in one process stays flat, and neither side warms the other. Reported are
the median/spread of wall-clock and peak Python-allocator memory across the cold
runs. Hardware: Google Compute Engine
e2-standard-8(8 vCPU, 32 GB),us-central1— in-region with the ARCO-ERA5 and WeatherBench 2 buckets.
ctx.sql(...).to_dataset(...)— the cold lazy read from cloud storage plus the relational compute; xarray reference is the array-paradigm computation each case is validated against. SQL is slower because the paradigm has real cost: the grid is exploded to rows, joined with a hash join, and shuttled through Arrow — work the array path does in-place over contiguous buffers.e2-standard-8run measured ≈12 s rather than ≈23 s. The reference is read-bound and stable, so read the 05 ratio as "the relational form costs real CPU," not a fixed multiplier.reps=1).Notes for review
__init__.py), so it is not collected by pytest or shipped in the wheel.pyproject.toml.docs/geospatial.md.Follow-ups
cftimeUDF.🤖 Generated with Claude Code