Geospatial SQL benchmark suite: prove core geo/climate ops are relational by alxmrs · Pull Request #177 · alxmrs/xarray-sql

alxmrs · 2026-06-23T15:32:37Z

What & why

A suite of worked examples in benchmarks/geospatial/ arguing a deliberately radical thesis: the core operations of geospatial and climate analysis — the ones we reach for an array library to perform — are, underneath, relational operations. Climatologies, anomalies, zonal means, spectral indices, forecast skill, even reprojection and regridding map onto ordinary SQL (GROUP BY, JOIN, window functions, CASE, and the occasional scalar UDF).

Every case poses the operation in SQL against xarray-sql, round-trips the result back to an xarray.Dataset/DataArray (via to_dataset), and asserts it matches an xarray reference (coordinate-aligned xr.testing.assert_allclose). SQL in, an array out, verified — and each run prints the result's standard xarray repr. The headline is correctness + clarity of the SQL, not raw speed.

The README gains a "Does it work?" section summarizing the findings (addresses #46), and a full write-up lives in docs/geospatial.md (wired into the docs nav).

The cases (all validated end-to-end on real data)

#	Operation	Relational form	Data + reference
01	NDVI	column arithmetic (nodata auto-masked on decode)	real Sentinel-2 L2A Zarr (ESA EOPF, via `pystac-client` + `xr.open_datatree`)
02	Diurnal climatology	`GROUP BY lat, lon, hour`	ARCO-ERA5
03	Zonal mean	`GROUP BY latitude`	full ARCO-ERA5, `WHERE`-pruned
04	Anomaly	climatology CTE self-`JOIN`	ARCO-ERA5
05	Forecast skill	forecast↔truth `JOIN` on `valid_time = init + lead`	WeatherBench 2 Pangu & GraphCast vs ERA5
06	Zonal stats over regions	raster × vector range `JOIN`	full ARCO-ERA5 + continental boxes
07	Reprojection	PROJ scalar UDF (`ST_Transform`-style)	Earth Engine via Xee, validated against EE's own `pixelLonLat`
08	Regridding	sparse weight-table `JOIN` + weighted `GROUP BY`	real SRTM elevation via Xee, vs xarray `.interp()`

Highlights

It reads naturally to the people who would use it. The array side is idiomatic Xarray; the query side is idiomatic, lazy SQL. You can point a query at the entire ERA5 archive — decades of global hourly data — and it reads only the slice the query asks for.
Real models, checked against the literature. Case 05 scores the Pangu-Weather and GraphCast forecast models against ERA5 and reproduces the published WeatherBench 2 result that GraphCast leads at every forecast horizon — written as a single JOIN that lines each forecast up with its valid time.
Validated against independent implementations. Case 07 checks its SQL reprojection against Earth Engine's own coordinate geometry (through Xee) — a genuinely different implementation, not PROJ compared with itself. Case 08 regrids real SRTM terrain and matches Xarray's own interpolation.
Honest about the limits. Reprojection and regridding are the operations most tied to arrays; the write-up shows where SQL expresses them cleanly and where arrays still earn their keep — computing the interpolation weights (the geometry), which SQL then applies as a join.

Running

Run a single case with uv run benchmarks/geospatial/<case>.py, or the whole suite with benchmarks/geospatial/run_all.sh — from any directory. Cases that need data or credentials you do not have skip cleanly with a clear message: the cloud datasets (ERA5, WeatherBench 2) are read anonymously, and the Earth Engine cases reuse your existing gcloud login. (Until the next xarray-sql release, the scripts run against the version in this branch.)

Performance

Profiled timings — 8 cases on an e2-standard-8 VM in us-central1

Measured cold vs cold with benchmarks/geospatial/run_perf.sh: each case runs
once per fresh process, with no warmup, repeated 5 times. Fairness took some care
— the trap is caching. A reference that calls .load() caches its data in place
on the very object the SQL table also reads from, so a later read (even just running
the reference after the SQL query in the same process) can be served warm. Case 05
loaded shared objects and now uses .compute() instead (fresh array, inputs stay
lazy); the other references either reopen their data or recompute eagerly with
chunks=None (NumPy, no graph to keep warm). Verified directly: reading a window
repeatedly in one process stays flat, and neither side warms the other. Reported are
the median/spread of wall-clock and peak Python-allocator memory across the cold
runs. Hardware: Google Compute Engine e2-standard-8 (8 vCPU, 32 GB), us-central1
— in-region with the ARCO-ERA5 and WeatherBench 2 buckets.

Case	Step	reps	median (s)	stdev (s)	min (s)	max (s)	peak (MB)
01 · NDVI (per-pixel arithmetic)	SQL	5	3.093	0.050	2.998	3.128	98.5
	xarray reference	5	0.402	0.060	0.372	0.507	42.0
02 · Climatology (GROUP BY lat, lon, hour)	SQL	5	5.987	1.328	5.901	8.942	513.8
	xarray reference	5	2.449	0.401	2.332	3.312	43.7
03 · Zonal mean (GROUP BY latitude)	SQL	5	3.224	0.048	3.175	3.288	245.3
	xarray reference	5	0.552	0.020	0.529	0.576	249.5
04 · Anomaly (climatology self-JOIN)	SQL	5	9.479	0.413	9.382	10.280	524.4
	xarray reference	5	3.013	0.584	2.875	4.199	80.5
05 · Forecast skill (forecast↔truth JOIN)	SQL	5	23.241	0.101	23.114	23.348	34.2
	xarray reference	5	0.336	0.014	0.314	0.352	2.2
06 · Zonal stats (raster × vector JOIN)	SQL	5	6.285	0.052	6.191	6.317	509.9
	xarray reference	5	2.235	0.216	2.195	2.703	359.9
07 · Reprojection (PROJ scalar UDF)	SQL	1	0.039	0.000	0.039	0.039	0.3
08 · Regridding (weight-table JOIN)	SQL	5	0.061	0.001	0.060	0.063	0.8
	xarray reference	5	0.018	0.001	0.018	0.020	0.2

SQL is ctx.sql(...).to_dataset(...) — the cold lazy read from cloud storage plus the relational compute; xarray reference is the array-paradigm computation each case is validated against. SQL is slower because the paradigm has real cost: the grid is exploded to rows, joined with a hash join, and shuttled through Arrow — work the array path does in-place over contiguous buffers.
Case 05's SQL time is CPU-bound (the join + GIL-held row production) and the most hardware-sensitive number here — a second e2-standard-8 run measured ≈12 s rather than ≈23 s. The reference is read-bound and stable, so read the 05 ratio as "the relational form costs real CPU," not a fixed multiplier.
Case 01 reads Sentinel-2 from EODC (Europe), the only non-US source, so its time includes a cross-region read.
Cases 07–08 load their Earth Engine inputs into memory once and then compute, so they are methodology-agnostic (cold ≡ warm) and are reported from a single in-memory run. Case 07's PROJ UDF is not re-entrant across repeated in-process executions, so it is measured once (reps=1).

Notes for review

ruff and mypy pass via pre-commit. The suite lives outside the package (no __init__.py), so it is not collected by pytest or shipped in the wheel.
Each script declares its own dependencies inline (PEP 723); nothing is added to pyproject.toml.
Adds a "Does it work?" section to the README (addresses Add a "Does it work?" section to the README #46) and a longer write-up at docs/geospatial.md.

Follow-ups

Promote the reprojection UDF (case 07) into the package, alongside the existing cftime UDF.
Once a release includes the SQL→Xarray round-trip, the cases run straight from PyPI.

🤖 Generated with Claude Code

A suite of worked examples arguing that core geospatial/climate operations we associate with the array paradigm are, underneath, relational operations. Each case poses the operation in SQL against xarray-sql and asserts the result matches an xarray/array reference to floating-point tolerance. Cases (all validated): - 01 NDVI: per-pixel column algebra + CASE mask, on a real Sentinel-2 L2A scene in Zarr (ESA EOPF sample service). - 02 climatology: diurnal cycle as GROUP BY lat, lon, hour (ARCO-ERA5). - 03 zonal mean: GROUP BY latitude over the full ARCO-ERA5 archive, WHERE-pruned to one day. - 04 anomaly: climatology CTE self-JOIN (ARCO-ERA5). - 05 forecast skill: Pangu-Weather and GraphCast vs ERA5 (WeatherBench2), a forecast<->truth JOIN on valid_time = init + lead; reproduces GraphCast beating Pangu at every lead. - 06 zonal stats: raster x vector range JOIN of full ERA5 against continental bounding boxes. - 07 reprojection: PROJ-backed scalar UDF (ST_Transform-style), returning a {lon, lat} struct to keep PROJ single-threaded. - 08 regridding: bilinear regrid as a sparse weight-table JOIN, validated against xarray .interp(). Includes a shared harness (_harness.py: timing, peak memory, assert_allclose, SQL echo), ERA5 helpers (_era5.py), a suite README, and a narrative write-up (docs/geospatial.md) wired into the docs nav. Scripts carry PEP 723 inline metadata so they are runnable standalone via `uv run`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs

Review of first example

alxmrs · 2026-06-23T16:41:42Z

+        b04 = zarr.open_array(_R10M + "/b04", mode="r")
+        b08 = zarr.open_array(_R10M + "/b08", mode="r")
+        rows = slice(_ROW0, _ROW0 + _N)
+        cols = slice(_COL0, _COL0 + _N)
+        red = np.asarray(b04[rows, cols])
+        nir = np.asarray(b08[rows, cols])


Is this really the canonical way to open the dataset in Xarray? Please review the docs and sentinel example to see if we can make this more typical. Is this fixed if we use pystack?

alxmrs · 2026-06-23T16:42:22Z

+    # Pixel-center coordinates from the affine transform.
+    x = _X0 + (np.arange(_COL0, _COL0 + _N) + 0.5) * _RES
+    y = _Y0 - (np.arange(_ROW0, _ROW0 + _N) + 0.5) * _RES
+    return xr.Dataset(
+        {"red": (["y", "x"], red), "nir": (["y", "x"], nir)},
+        coords={"y": y.astype("float64"), "x": x.astype("float64")},
+    ).chunk({"y": 256, "x": 256})


We should be able to calculate NDVI in pure xarray after opening the dataset.

alxmrs · 2026-06-23T16:45:06Z

+    got = got.sortby(["y", "x"])
+    ref = ref.sortby(["y", "x"])


I belive we can omit this if we use ORDER BY in the SQL statement.

…nual scaling/CASE/sort Addresses PR review of case 01: - Discover the Sentinel-2 L2A product with pystac-client instead of a hard-coded object-store href. - Open it the canonical xarray way: xr.open_datatree on the EOPF Zarr, then the measurements/reflectance/r10m node. This yields B04/B08 already scaled to reflectance with x/y coordinates, so the manual zarr.open_array reads and proj:transform coordinate reconstruction are gone. - NDVI is now pure column arithmetic on both sides: xarray decodes the band _FillValue to NaN on open, so nodata masking is free and the CASE WHEN is dropped. The reference is a clean one-line xarray expression. - Use ORDER BY y, x in SQL and compare via coordinate-aligned xr.testing.assert_allclose (reindex_like) instead of manual .sortby() on both sides. Docs (suite README, docs/geospatial.md) updated to match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-23T17:01:01Z

Thanks for the review of case 01 — all four points addressed in 6bc41aa:

Use pystac for discovery — now uses pystac-client to search the EOPF STAC catalog and resolve the product, instead of a hard-coded object-store href. Added to the script's PEP 723 deps.
Idiomatic xarray open — replaced zarr.open_array + manual proj:transform coordinate reconstruction with xr.open_datatree(item.assets["product"].href, engine="zarr") and the measurements/reflectance/r10m node. xarray decodes the bands to reflectance and brings x/y coordinates for free.
NDVI in pure xarray — the reference is now a one-liner: (scene.nir - scene.red) / (scene.nir + scene.red). Because xarray decodes the band _FillValue to NaN on open, nodata masking is automatic and the CASE WHEN is gone on both sides.
ORDER BY instead of sortby — the SQL uses ORDER BY y, x, and the comparison aligns by coordinate label via xr.testing.assert_allclose(got, ref.reindex_like(got)), so the manual .sortby() on both sides is removed.

Verified end-to-end: 1,048,576 pixels match the xarray reference; mean NDVI 0.235 over the window. Docs (suite README + docs/geospatial.md) updated to match.

One note: xarray-eopf's eopf-zarr engine would let you open the group directly, but it pins zarr<3, which conflicts with this project's zarr 3 — so I kept the plain engine="zarr" + open_datatree path, which needs no extra engine.

alxmrs · 2026-06-23T19:02:04Z

+        ds = load_window(
+            full, "2m_temperature", time=_TIME, latitude=_LAT, longitude=_LON
+        )


I think repeating the sel command here is instructive.

I recommend a fluent interface

In fact, the open_zarr call and the slicing could be one fluent chain.

alxmrs · 2026-06-23T19:03:27Z

+    show_sql(sql)
+
+    with timed("SQL diurnal climatology"):
+        got = ctx.sql(sql).to_pandas()


This should be loaded to a dataset. That's how climatologies are typically used.

alxmrs · 2026-06-23T19:03:53Z

+    # Array reference: the textbook groupby-over-the-cycle reduction.
+    with timed("xarray reference"):
+        clim = ds["2m_temperature"].groupby("time.hour").mean("time") - 273.15
+        ref = clim.to_dataframe(name="clim_c").reset_index()


Omit transforming this to a dataframe.

alxmrs · 2026-06-23T19:06:01Z

+
+    ctx = xql.XarrayContext()
+    with timed("register full ERA5"):
+        register_era5(ctx, ds)


I don't like this helper either since repetition here serves as useful documentation.

alxmrs · 2026-06-23T19:07:02Z

+    show_sql(sql)
+
+    with timed("SQL zonal mean (WHERE-pruned to one day)"):
+        got = ctx.sql(sql).to_pandas()


I believe to_pandas() invokes a copy. Is it needed?

… coordinate-aligned comparison Addresses round-2 PR review (cases 02, 03) and applies the same patterns across the suite: - Remove the _era5.py helper module; each case now shows the xr.open_zarr(...).sel(...).load() fluent chain inline (and registration inline), since the repetition documents the canonical xarray usage. - Round-trip gridded SQL results back to xarray Datasets via to_dataset(...) instead of to_pandas() — a climatology/zonal-mean/anomaly/regrid is gridded data and is used as such. - Keep references in pure xarray (no to_dataframe); compare with a new assert_grid_close helper that aligns by coordinate label (xr.testing.assert_allclose), so no manual .sortby() is needed. - Add ORDER BY to the climatology and anomaly queries. - Apply the same Dataset-return + xarray-comparison treatment to the forecast skill (RMSE-by-lead as a DataArray over lead), reprojection (lon/lat grid), and regrid (reshaped to the target grid) cases. - Drop the now-unused check_close harness helper. docs/geospatial.md notes that every query round-trips its answer back to an xarray.Dataset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-23T19:44:54Z

Round-2 feedback addressed in e1cbc8b (cases 02/03 explicitly, plus the same treatment swept across the suite).

Your comments on 02 / 03:

Inline the open/slice; the helper hides it — deleted _era5.py. Each case now shows the fluent chain inline, e.g.
```
ds = (
    xr.open_zarr(_URL, chunks=None, storage_options={"token": "anon"})[["2m_temperature"]]
    .sel(time=_TIME, latitude=_LAT, longitude=_LON)
    .load()
)
```
and registration is inline too (03/06 show the full from_dataset(..., table_names=...)).
ORDER BY latitude DESC, longitude, … — added to the climatology (… , hour) and anomaly queries.
Time is already hourly — right; the docstring now states ERA5 is hourly, so grouping by hour-of-day gives a clean 24-bin diurnal cycle (one sample per day in the window).
Load the climatology to a Dataset — the result now round-trips via to_dataset(dims=["latitude","longitude","hour"]) and is returned/used as a gridded xr.Dataset.
Keep the reference in pure xarray (no to_dataframe) — done: ds["2m_temperature"].groupby("time.hour").mean("time"), compared by coordinate label (no merge, no sortby).
to_pandas() copy in zonal mean — gone; it returns to_dataset(dims=["latitude"]).

Swept across the suite (anticipating the same notes):

All gridded results (NDVI, climatology, zonal mean, anomaly, forecast RMSE-by-lead, reprojection, regrid) now round-trip to xr.Dataset/DataArray and are compared with a new assert_grid_close helper that aligns by coordinate label via xr.testing.assert_allclose — so no case sorts manually.
Forecast skill returns each model's skill curve as a DataArray over lead (a real timedelta coordinate), with an xarray reference.
Dropped the now-unused check_close harness helper.

Full suite re-run is green (NDVI 1.05M px; climatology 535,704 cells; anomaly 1.6M cells; GraphCast still beats Pangu at every lead). Background note: I evaluated xarray-eopf for case 01's open but it pins zarr<3 (conflicts with this project), so case 01 stays on pystac-client + xr.open_datatree.

Thanks for the careful review — the cases read much more like idiomatic xarray now.

alxmrs · 2026-06-23T19:53:32Z

+                    _URL, chunks=None, storage_options={"token": "anon"}
+                )[["2m_temperature"]]
+                .sel(time=_TIME, latitude=_LAT, longitude=_LON)
+                .load()


I think we should omit the load calls to test out the lazy property of xql. What do you think?

alxmrs · 2026-06-24T04:46:12Z

+            ds = (
+                xr.open_zarr(
+                    _URL, chunks=None, storage_options={"token": "anon"}
+                )[["2m_temperature"]]


I don't think we need this. Xarray-SQL with have a pushdown opt to only select columns we care about.

alxmrs · 2026-06-24T04:49:42Z

+# ///
+"""Forecast skill — scoring ML weather models against ERA5 is a JOIN + aggregate.
+
+This is the real thing: scoring the **Pangu-Weather** and **GraphCast** machine-


I don't like "this is that real thing"

alxmrs · 2026-06-24T04:52:43Z

+    Snaps lat/lon onto ERA5's exact coordinate arrays (identical 64×32 grid) so
+    the equality JOIN on coordinates is bit-safe across the two Zarr stores.
+    """
+    da = _open(url)[_VAR].sel(time=_INIT)


I think at can omit the var selection here too, and elsewhere.

alxmrs · 2026-06-24T04:54:15Z

+    return da.to_dataset().load()
+
+
+def _rmse_sql(table: str) -> str:


This is vulnerable to a SQL injection IIUC. Can we demo this in a way that puts forth best SQL practices?

alxmrs · 2026-06-24T04:56:45Z

+    era5_full = _open(_ERA5)
+
+    # Load the forecasts first; snap their grid to ERA5's exact coordinates.
+    with timed("read Pangu + GraphCast forecasts"):


If we don't load up front, then I don't think these timed blocks for measuring loading are necessary. This is good, it demos laziness.

…-3 review Addresses round-3 PR review (cases 04, 05) and extends the patterns PR-wide: - Demonstrate laziness + pushdown: ERA5 cases (02,03,04,06) now register the whole archive lazily (no .load(), no variable pre-selection) and let projection pushdown read only 2m_temperature and partition pruning read only the queried window. References use lazy .sel(); the up-front "read into memory" steps are gone. - Parameterized queries (no SQL-string interpolation of values): time/region bounds are passed via param_values rather than f-string-formatted into the SQL. Verified pruning still fires with bound parameters. - Forecast skill (05): stack the two models along a `model` dimension into one forecast table and GROUP BY the model column, eliminating the `FROM {table}` identifier interpolation (the injection smell) entirely. The 2m_temperature selection stays because the models store different pressure-level sets (Pangu 13, GraphCast 37) and must be aligned on the common surface field. Reword the "this is the real thing" docstring opening. docs/geospatial.md and the suite README updated to describe lazy registration, parameterized filters, and the model-column forecast query. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-24T05:28:56Z

Round-3 feedback addressed in a971f2b (cases 04/05 explicitly, extended across the suite).

Your comments:

04:54 / 05:90 — don't pre-select the variable; pushdown handles columns. The ERA5 cases (02, 03, 04, 06) now register the whole archive lazily with no variable selection; SELECT "2m_temperature" FROM era5.surface lets projection pushdown read just that column. (05 is the one exception, see below.)
05:136 — don't load up front; demo laziness. Nothing is .load()-ed up front anymore — datasets are registered lazily and the query pulls only the partitions/columns it needs at read time. The "read into memory" timed blocks are gone; the references use lazy .sel().
05:97 — FROM {table} is a SQL-injection smell. Eliminated the dynamic identifier entirely: the two models are stacked along a model dimension into one forecast table, and the query does GROUP BY f.model, f.prediction_timedelta — no table name formatted into SQL. More broadly, all value filters (dates, region bounds) are now passed as bound query parameters (param_values=...) instead of f-string interpolation. I verified partition pruning still fires with bound parameters (a parameterized one-day WHERE reads one day, not the 1.3M-step archive).
05:14 — "this is the real thing". Reworded.

One deliberate exception: 05 still selects 2m_temperature in xarray. Pangu stores 13 pressure levels and GraphCast stores 37, so the full datasets can't be concatenated over model; selecting the common surface field is semantically required to align them, not just an optimization. Noted in a code comment.

Full suite re-run is green (NDVI 1.05M px; climatology 535,704 cells; anomaly 1.6M cells; Sahara 33 °C vs Greenland −8 °C; GraphCast still beats Pangu at every lead). The lazy cases are slower now (each ERA5 case reads its window for the SQL and again for the reference), which is the expected cost of demonstrating laziness rather than pre-loading.

Background note (carried over): xarray-eopf's eopf-zarr engine for case 01 pins zarr<3, conflicting with this project, so case 01 stays on pystac-client + xr.open_datatree.

alxmrs

A few last thoughts. The geospatial docs are excellent.

alxmrs · 2026-06-24T05:04:17Z

+For worked examples of common geospatial operations expressed in SQL — NDVI,
+climatology, anomalies, zonal statistics, forecast skill (Pangu-Weather and
+GraphCast vs ERA5), reprojection, and regridding, each checked against an xarray
+reference — see [`benchmarks/geospatial/`](benchmarks/geospatial/) and the
+[Geospatial operations are relational operations](docs/geospatial.md) write-up.
+


I think this paragraph should live in a new section below called "does it work?" and be framed to answer that question. Then add can close this issue: #46

I think "does it work" should be between "how does this work" and "why does this work".

alxmrs · 2026-06-24T05:22:29Z

+interpolation — which is case 08, where regridding turns out to be a JOIN
+against a weight table rather than a scalar UDF.
+
+**An honest caveat on threading:** PROJ's transformation context is not


Omit the word "honest".

…real SRTM regrid) - 07 reprojection: open a UTM grid through Xee carrying ee.Image.pixelLonLat(), so Earth Engine's own geodesy reports each pixel's true lon/lat. The SQL PROJ UDF is now validated against an *independent* implementation (EE), not against PROJ via pyproj. Grid built with xee.helpers.fit_geometry; compared at ~1e-5°. - 08 regridding: regrid real SRTM elevation (USGS/SRTMGL1_003, Sierra Nevada) opened via Xee, validated against xarray's bilinear .interp() on the same source field — a new (DEM) data type for the suite. - Both require Earth Engine access (earthengine authenticate + EARTHENGINE_PROJECT) and skip cleanly when EE is not initialized, like the other network cases. Deps add xee, earthengine-api, shapely. Verified offline: imports, the graceful EE-not-initialized skip, ruff/mypy, and the xee open_dataset / fit_geometry API surface. The live EE round-trip was not run here (no credentials in this environment). docs/geospatial.md and the suite README updated for the EE/Xee sources. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

- Move the geospatial-suite pointer out of "What is this?" into a new "Does it work?" section, placed between "How does it work?" and "Why does this work?", reframed to answer that question with a findings summary (validated operations, real datasets, the GraphCast-beats-Pangu result). Addresses #46. - Reword the two remaining uses of "honest" in docs/geospatial.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-24T06:37:42Z

Round-4 feedback addressed in df67d98:

README:116 — "Does it work?" section. Moved the geospatial-suite pointer out of "What is this?" into a new ## Does it work? section, placed between "How does it work?" and "Why does this work?", and reframed it to answer that question — a findings summary (which operations were validated, on which real datasets, and the headline GraphCast-beats-Pangu result). This addresses Add a "Does it work?" section to the README #46; I'll leave the issue for you to close on merge.
07:34 — drop "honest". The 07 docstring already lost that wording in the Xee rewrite; I also reworded the two remaining uses ("Two honest caveats" → "Two caveats", "The honest boundary" → "The boundary") in docs/geospatial.md. No "honest" left in the suite or docs.

README section order is now: How does it work? → Does it work? → Why does this work?

… both pass live - Add _harness.initialize_earth_engine(): initialize EE from Application Default Credentials (gcloud auth application-default login) with the Earth Engine scope, using the ADC/EARTHENGINE_PROJECT project. This avoids the separate `earthengine authenticate` OAuth flow (and the "this app is blocked" org-policy error), and needs no service account. - 07/08 use the helper instead of bare ee.Initialize. - 07: Xee names the projected grid coords x/y (not X/Y), so drop the erroneous rename. Verified end-to-end against Earth Engine: - 07 reprojection: PROJ UDF matches EE's own ee.Image.pixelLonLat() to ~1e-5° over 2,565 UTM pixels (independent geodesy engines agree; pixel centers align). - 08 regrid: real SRTM elevation (Sierra Nevada, 1206–3439 m) regridded 50×50 → 60×72 matches xarray bilinear .interp() exactly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

…xarray-sql - Resolve the script's own directory (BASH_SOURCE) so it works from any cwd — the old `uv run ./01_nvdi.py` only worked from benchmarks/geospatial/ (and had a typo: 01_nvdi → 01_ndvi). - Glob all NN_*.py cases instead of an incomplete hand-written list; report a per-case PASS/FAIL summary and exit non-zero if any case fails. - Hand `uv` the local checkout via `--with-editable $REPO_ROOT`: these cases use xarray-sql features (XarrayDataFrame.to_dataset) newer than the latest PyPI release, so plain `uv run` would resolve an xarray-sql without `to_dataset`. - 01_ndvi.py: declare aiohttp + requests (needed by pystac-client / zarr-over- HTTPS under uv's isolated environment). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

The latest PyPI xarray-sql predates XarrayDataFrame.to_dataset, so plain `uv run <script>` resolved a build without it. Each script now declares [tool.uv.sources] xarray-sql = { path = "../../", editable = true } in its PEP 723 metadata. uv resolves this path relative to the *script*, so `uv run <script>` uses the in-repo build from any working directory. run_all.sh drops the `--with-editable` workaround and is back to plain `uv run`. Verified `uv run 08_regrid_weights.py` from /tmp uses the local xarray_sql (with to_dataset) and passes. Once a release ships these features, the [tool.uv.sources] block can be removed for standalone PyPI use. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Add _harness.show_result(), and call it in every case right after assert_grid_close passes, so a run shows *what* it computed — the gridded answer round-tripped back out of SQL as an xarray Dataset/DataArray, via the standard repr. Drop the now-redundant "<case> Dataset: {sizes}" one-liners in 02/04 (the repr already shows sizes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Revise the prose across the suite for the future reader/user — someone from the SQL or Xarray world — rather than for the PR conversation: - Frame features by what that audience values (lazy, idiomatic) instead of the mechanics that justified review decisions ("nothing loaded or column-selected up front", "bounds passed as query parameters, not formatted into the SQL", "no table name formatted into the SQL"). - Fix stale claims: the suite compares with xarray (not `numpy.assert_allclose`), Earth Engine uses your `gcloud` login (not `earthengine authenticate`), and the run instructions cover `run_all.sh` and the in-repo build. - Trim jargon from runtime labels and tighten a couple of code comments. Touches README ("Does it work?"), docs/geospatial.md, the suite README, and the 02–06 docstrings/labels. No code behavior changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Add a "Where this list comes from" section linking the Large Scale Geospatial Benchmarks discussion (coiled/benchmarks #1545, opened by James Bourbeau, 2024) and explaining what it asked: the end-to-end geospatial/climate workflows the Xarray/Dask ecosystem needs at the 100-TB scale. A table maps the discussion's seven workflow categories to the cases here — the suite covers six of the seven (all but vector polygon-to-polygon spatial joins) — underscoring that the operations are an independent community survey, not a SQL-friendly cherry-pick. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

ahuang11 · 2026-06-24T14:58:47Z

Skimmed the PR; wondering if the benchmark timings can be documented (e.g. how long it took on the last run)

With GEOBENCH_PROFILE set, run_case runs a case's main() a warmup plus GEOBENCH_REPS times, suppressing the repeated output, and writes one row of summary stats (min/median/mean/stdev/max seconds + peak MB) per timed step to GEOBENCH_CSV — a single shared table across all cases. Because each main() is self-contained (builds its own context, reads its own data), re-running it samples the operations with no change to the case scripts: all eight remain byte-identical, and runs without the flag are unchanged. Documented in the suite README (Profiling section). Verified end-to-end on case 08 (normal run unchanged; profiled run emits a clean per-step summary and CSV). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

…ain() Cleaner than the previous approach (which re-ran each case's main() and suppressed its output): a repeatable step is now wrapped in `for _ in measured(label): ...` — one line different from `with timed(...)`, body unchanged. Without profiling it runs once; under GEOBENCH_PROFILE it runs a warmup plus GEOBENCH_REPS measured passes and writes one summary row per step to GEOBENCH_CSV. The harness no longer re-runs main() or redirects stdout, and the perf table now carries both the SQL query and the xarray reference per case (a built-in side-by-side). Registration/opens stay in `with timed(...)`. The case bodies remain idiomatic xarray/SQL; only the SQL-query and reference blocks gained the measured() wrapper. Verified normal runs are unchanged and profiled runs of 08 (EE) and 05 (WB2 JOIN) emit clean summaries + CSV. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

ghostiee-11

Read through the suite end to end, really like the relational framing and that every case round-trips back to xarray and asserts against a reference. Two small things on the harness and one idea below.

ghostiee-11 · 2026-06-24T17:15:40Z

+        t0 = time.perf_counter()
+        yield
+        elapsed = time.perf_counter() - t0
+        _, peak = tracemalloc.get_traced_memory()
+        tracemalloc.stop()


nit: if the body raises, we skip tracemalloc.stop() and leave it running. small try/finally like timed():

Suggested change

t0 = time.perf_counter()

yield

elapsed = time.perf_counter() - t0

_, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

t0 = time.perf_counter()

try:

yield

finally:

elapsed = time.perf_counter() - t0

_, peak = tracemalloc.get_traced_memory()

tracemalloc.stop()

alxmrs

Latest round of feedback

alxmrs · 2026-06-24T17:14:57Z

+        r10m[["b04", "b08"]]
+        .rename(b04="red", b08="nir")
+        .isel(y=slice(_Y0, _Y0 + _N), x=slice(_X0, _X0 + _N))
+        .load()


Why do we have this load op here still?

alxmrs · 2026-06-24T17:19:25Z

+        # decode_timedelta=True: forecasts store prediction_timedelta as a
+        # real duration (and it silences xarray's decode-timedelta warning).
+        return xr.open_zarr(
+            url, chunks=None, storage_options=_SO, decode_timedelta=True


Does this storage options value have to be extracted into a constant if we are already extracting the open call to a function?

alxmrs · 2026-06-24T17:23:17Z

+        got = xr.DataArray(
+            flat.values.reshape(len(tlat), len(tlon)),
+            dims=["lat", "lon"],
+            coords={"lat": tlat, "lon": tlon},
+        )


So we need to reorganize the script as a data array? Why do we do this copy? Isn't the output + da access enough?

ghostiee-11 · 2026-06-24T18:06:07Z

built the lazy round-trip case I mentioned and opened it against this branch as #178 (also carries the two harness tweaks). same slab is 1,325 rows / ~0.7 MB lazy vs 3.86M / ~161 MB eager, asserted equal.

ghostiee-11 · 2026-06-24T20:13:36Z

@alxmrs a few cases I'd like to add next, all pulled from the #1545 workflow list so they're community-asked rather than SQL-friendly cherry-picks:

TEM diagnostic: zonal mean + eddy covariances (U'V', V'T', U'W'), which is just GROUP BY (time, level, lat) plus AVG(x*y) - AVG(x)*AVG(y). dcherian posted the exact xarray for this in #1545; validates against aostools/pyzome on ERA5.
forecast skill, ACC + MAE/bias: extends case 05 with anomaly correlation, which needs a JOIN to a climatology table to form the anomalies. reuses the gs://weatherbench2 data already in the suite.
conservative region aggregation: raster onto polygons as a sparse weight-table JOIN + weighted GROUP BY (the carbonplan/extreme-heat workflow). same shape as the regrid case, different weights.
building-footprint spatial join: bbox range-join + ST_Intersects refinement on Overture. the one #1545 category the suite doesn't cover yet, though it needs spatial UDFs since DataFusion has none natively.
spectral-index time series (NDWI/NDSI): scales the NDVI case from one scene to a GROUP BY over time on Sentinel-2.

- assert_grid_close: fail when the SQL result misses grid cells. reindex_like aligned a short result onto the reference and silently passed; now a per-dim size mismatch raises, since the suite's whole point is "same numbers". - measured(): wrap the profiled yield in try/finally so tracemalloc.stop() always runs even if the body raises. - 01 NDVI: register the scene lazily (drop the eager .load()); the xarray reference now .compute()s in its block so its perf row measures the same read-and-compute the SQL side does, not just graph construction. - 05: inline storage_options={"token": "anon"} into the single _open() that uses it; drop the _SO module constant. - 08 regrid: name each target cell by its (dst_lat, dst_lon) in the weight table and GROUP BY those, so the result round-trips straight to a (lat, lon) grid via to_dataset — no manual reshape/copy into a new DataArray. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

The in-process warm loop was unfair: xr.open_zarr(chunks=None) caches each variable in memory after the first read, so the xarray reference served its later repetitions from RAM while the SQL side re-read the store every query — inflating the gap (case 05 read as 23.5s SQL vs 0.29s reference; cold-vs-cold it's ~0.8s vs ~0.6s). run_perf.sh runs each case once per fresh process with no warmup, repeated GEOBENCH_REPS times, so both the SQL operation and the xarray reference pay a cold read on every measurement. perf_summary.py aggregates the independent cold runs into a per-step table. The cases and the measured() harness are unchanged; fairness comes from process isolation, not in-process repetition. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Add Results, Analysis, and Conclusion sections to docs/geospatial.md. Results publishes the cold-vs-cold timings (run_perf.sh on an e2-standard-8 in-region), with the methodology note on why a fresh process per repetition is the only fair comparison. Analysis decomposes case 05 into read vs compute vs round-trip (cProfile): the read is comparable on both sides, the ~30x gap is compute (explode to rows, hash JOIN, aggregate vs a vectorized NumPy reduction), and the SQL->xarray round-trip is sub-millisecond. It explains the table's shape — JOINs widest, GROUP BYs narrowest — and why the ratio shifts with hardware. Conclusion states the tradeoff plainly: arrays for dense, grid-aligned numerics; SQL for relationally shaped work and SQL-fluent audiences; and the distributed query-engine payoff the single-node benchmark can only point toward. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-25T10:52:18Z

@@ -0,0 +1,113 @@
+#!/usr/bin/env python3
+"""Aggregate the cold-run perf CSV into a per-step summary and a markdown table.


Let's not check in this script.

Done — removed perf_summary.py and folded its aggregation into run_perf.sh as an inline step, so the perf tooling is one self-contained driver (0941e0b).

alxmrs · 2026-06-25T13:37:35Z

+
+* **Spectral indices** (NDVI) — column arithmetic over a real Sentinel-2 scene.
+* **Climatology, anomalies, zonal means** — `GROUP BY` and self-`JOIN` over the
+  full 0.25° **ARCO-ERA5** archive (≈1.3M hourly steps), read lazily so each


Is it really this many hourly steps?

Did we really do this broad of a query?

Good catch — I dropped the “≈1.3M hourly steps” figure; it overstated what the queries touch. Reworded to say the archive is multi-decade and registered as a lazy table, while each query reads only its small window (0941e0b).

Right — the queries are bounded to a small window (a few days over a region), not a scan of the archive. Reworded so the README says exactly that (0941e0b).

alxmrs · 2026-06-25T13:43:08Z

+``da.groupby("time.hour").mean()``. ERA5 is hourly, so grouping by hour of day
+gives a clean 24-bin **diurnal cycle**, one sample per day in the window.
+
+The table is the *whole* ARCO-ERA5 archive, opened lazily: the query reads only


This doesn't appear to be the whole ERA5 archive, because we have a time range. What is the truth?

Agreed. Reworded: we register the whole archive as a lazy table, but this climatology is computed over a bounded window — the WHERE prunes the read, so the query never scans the full record (0941e0b).

alxmrs · 2026-06-25T13:53:58Z

+    FROM era5 a JOIN clim c
+      ON (a.latitude, a.longitude, hour(a.time)) = (c.latitude, c.longitude, c.hour)
+
+The table is the *whole* ARCO-ERA5 archive, opened lazily: both the climatology


This is not the whole ERA5 archive.

Reworded the same way — the table spans the archive lazily; the anomaly is computed over a bounded window the WHERE prunes to (0941e0b).

alxmrs · 2026-06-25T14:44:36Z

+horizon" curve (≈0.3 K at 6 h rising to ≈2.5 K at 9 days):
+
+```
+  lead (days)     Pangu   GraphCast


If we use the dataframe code above, then let's print that repr here.

Done — the doc now shows the got.to_pandas() repr (RMSE by lead × model), matching the script's new headline (1cd2c93).

alxmrs · 2026-06-25T14:52:08Z

+relational, and that for a surprising share of geoscience, the operation is
+relational.
+
+## Running the suite


Let's put this section above results.

Done — “Running the suite” now sits above Results (1cd2c93).

… fixes Fairness: case 05's reference used `.load()`, which caches data in place on the forecasts/truth objects the SQL table also reads from — so running the reference after the SQL query (as the harness does) could serve it a warm read. Switch to `.compute()`, which returns a fresh array and leaves the inputs lazy. Verified: reading a window repeatedly in one process stays flat, and neither side warms the other. The other cases either reopen their data (06) or recompute eagerly with `chunks=None` (02-04), so they were already cold; only 05 leaked. Case 05: render the headline RMSE-by-lead table with `to_pandas()` instead of a hand-rolled print loop; clarify that the `chunks=` arg is the Arrow batch size, not a filter (no data dropped). Case 06: the docstring claimed "full ARCO-ERA5" as the dataset; the query aggregates one day's window (the WHERE prunes the read). Make that precise. Case 07: drop the "graduate into the package" paragraph; clarify that there is one dataset (UTM x/y as SQL input, EE's pixelLonLat as the independent reference), not the same image opened twice in two CRS. Docs: correct the profiling methodology note (the caching trap is in-place `.load()`, not `open_zarr` auto-caching); point the "earns its keep" section forward to Results/Analysis/Conclusion; move "Running the suite" above Results; link James Bourbeau's profile; show the to_pandas headline repr. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

The ARCO-ERA5 cases (02, 04, README) said the query runs over the *whole* archive; each query is in fact bounded by its WHERE window. Reword to say the table spans the archive (lazy) while the query reads only its slice — and drop the unverified "~1.3M hourly steps" framing from the README. Fold perf_summary.py into run_perf.sh (an inline aggregation step) and remove the separate script, so the perf tooling is one self-contained driver. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Fresh in-region e2-standard-8 run of the full suite. The key change is case 05's reference: now genuinely cold (~0.34s, stable across reps) instead of the ~0.22s warmed number — confirming the .load()->.compute() fix closed the leak. Update the Results table (cases 01-06), note case 05's CPU/hardware sensitivity (~12s vs ~23s across two VMs), and correct the NDVI ratio (~8x). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs · 2026-06-25T19:38:39Z

+    ctx.from_dataset("forecasts", forecasts, chunks={"time": 6})
+    ctx.from_dataset("era5", truth, chunks={"time": 100})


This doesn't make sense to me. This ERA5 dataset is a 6 hourly step version at a small grid designed to be comparable to the forecasts. By throw away data, I mean we load 100 timesteps in the chunk and then discard the remainder that doesn't match the forecast chunk (as they divide together).

Can you write a small prototype (that's discarded) that tries 6 and 6 chunks ve 6 and 100 chunks to see what the empirical answer is? I'm still not convinced.

alxmrs · 2026-06-25T19:40:06Z

+                "2m_temperature"
+            ]
+            .sel(time=_DAY)
+            .load()


Since we're not using dask anyway, let's get rid of this load. (Is this essentially for any reason?)

alxmrs · 2026-06-25T19:44:10Z

+            rename[d] = "lat"
+        elif dl in ("x", "lon", "longitude"):
+            rename[d] = "lon"
+    return da.rename(rename).sortby("lat").sortby("lon").load()


I want this to be a valid perf comparison test of laziness for the xr ref too, so let's get rid of the load here.

alxmrs · 2026-06-25T19:45:47Z

+# Aggregate the per-process cold runs (one row each) into a per-step summary and
+# a markdown table. Each raw row is one cold sample (reps=1), so we collect the
+# samples per (case, title, step) and report median/spread/peak across them.
+python3 - "$RAW" "$SUMMARY" <<'PY'


At this point, fine, let's make this a second script.

Done — moved the aggregation back into perf_summary.py as a second script (the inline heredoc was worse to read) (7bfd56f).

Case 06: drop the reference's `.load()`. The day's field is now read inside the timed block (like the SQL side) via a single masked reduction over stacked region masks — one lazy read for all regions, instead of pre-materializing. Verified against the SQL result. perf: move the aggregation back out of run_perf.sh into perf_summary.py (a second script reads cleaner than the inline heredoc). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Previously the SQL side read SRTM into an in-memory (cell_id, value) table while the reference read lazily — asymmetric. Now both read the source lazily: register the Xee field itself as the SQL `src` table and key the weight join on the source coordinates (src_lat, src_lon) instead of a pre-raveled cell_id, and drop the `.load()` in `_open_srtm`. Coords are forced to float64 so the join matches the source grid exactly. The regrid math (the new weight table + coordinate join) is validated against xarray `.interp` on a synthetic source to ~1e-12; the Earth Engine end-to-end run still needs confirmation in an EE-authenticated environment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

…ding A 2×2 sweep of (forecasts, era5) time-chunks ∈ {6,100} shows the truth/era5 chunk is what matters — small splits it into many partitions and costs ~3× — while the forecasts chunk is in the noise and a mismatch costs nothing. So set both to one partition (time:100) for these small windows, and document why. Also: cases 07 and 08 now validated end-to-end against real Earth Engine (08's symmetric-lazy regrid matches xarray .interp over 4,320 target cells; 07's PROJ UDF matches EE pixelLonLat). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Case 09 composes 07 (the reproject UDF) and 08 (the regrid weight-table JOIN) into a full warp — GDAL/rasterio `reproject`: change the CRS *and* resample onto the new grid. The 07 UDF carries the target lon/lat grid back into source UTM space, arrays turn those reprojected points into bilinear weights, and the 08 JOIN applies them. Validated against xarray `.interp()` at the reprojected points (exact), with Earth Engine's own lon/lat SRTM as a cross-CRS sanity check. Wires 09 into the README case table + dataset note, the docs mapping table and the "hard cases" narrative, and the run_perf.sh glob. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

Re-ran the whole suite (01-09) on a single e2-standard-8 in us-central1, with Earth Engine reached from the same VM, so every case shares one machine and one release build. Updates the Results table and the analysis to match: - Case 05 came in at 10.7 s (not 23 s) — the hardware-sensitivity caveat borne out; 23 s was the high outlier across three runs. - SQL is no longer slower in *every* case: 08 is at parity (0.875 vs 0.850 s) and 09 is faster (0.281 vs 0.817 s), because the geometry (the weights) is precomputed and the relational half just applies it. The analysis and conclusion now reflect this instead of a blanket "slower." - Case 08 is no longer the stale eager-load number; both sides read SRTM lazily, so the comparison is symmetric. Adds the new case 09 row. - Marks the cProfile decomposition explicitly as a separate laptop profile (the intended hardware contrast), resolving the apparent total-vs-table mismatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01AWvrZYAT2NbuETBqNAN3o9

alxmrs commented Jun 23, 2026

View reviewed changes

alxmrs commented Jun 24, 2026

View reviewed changes

alxmrs and others added 2 commits June 24, 2026 09:14

alxmrs and others added 6 commits June 24, 2026 10:19

alxmrs and others added 2 commits June 24, 2026 19:29

ghostiee-11 reviewed Jun 24, 2026

View reviewed changes

alxmrs commented Jun 24, 2026

View reviewed changes

ghostiee-11 mentioned this pull request Jun 24, 2026

Add lazy round-trip benchmark (case 09) #178

Open

alxmrs and others added 2 commits June 25, 2026 00:15

alxmrs and others added 2 commits June 25, 2026 13:38

simpler conclusion title.

dfd0039

alxmrs commented Jun 25, 2026

View reviewed changes

alxmrs and others added 3 commits June 25, 2026 18:41

alxmrs commented Jun 25, 2026

View reviewed changes

alxmrs and others added 5 commits June 25, 2026 23:03

		return da.to_dataset().load()


		def _rmse_sql(table: str) -> str:

		@@ -0,0 +1,113 @@
		#!/usr/bin/env python3
		"""Aggregate the cold-run perf CSV into a per-step summary and a markdown table.

		ctx.from_dataset("forecasts", forecasts, chunks={"time": 6})
		ctx.from_dataset("era5", truth, chunks={"time": 100})

Conversation

alxmrs commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

The cases (all validated end-to-end on real data)

Highlights

Running

Performance

Notes for review

Follow-ups

Uh oh!

alxmrs left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alxmrs commented Jun 23, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alxmrs commented Jun 23, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alxmrs commented Jun 24, 2026

Uh oh!

alxmrs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alxmrs commented Jun 24, 2026

Uh oh!

ahuang11 commented Jun 24, 2026

Uh oh!

ghostiee-11 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alxmrs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

alxmrs commented Jun 23, 2026 •

edited

Loading

ghostiee-11 commented Jun 24, 2026 •

edited

Loading