alxmrs · alxmrs · Jun 23, 2026 · Jun 23, 2026 · Jun 23, 2026 · Jun 24, 2026
diff --git a/README.md b/README.md
@@ -142,6 +142,35 @@ that lets the DB engine translate the underlying Dataset arrays into DataFusion
 Ultimately, the initial insight of the `pivot()` function -- that any ndarray can be
 translated into a 2D table -- underlies this performant query mechanism. 
 
+## Does it work?
+
+Yes. The recurring worry is that the SQL interface is a toy — fine for `SELECT`s,
+but not for the operations geoscience actually runs. So we wrote a suite that
+takes the staples of geospatial and climate analysis — the ones we assume *need*
+an array library — and expresses each one in SQL, then **checks the SQL answer
+against an xarray/array reference** to floating-point tolerance:
+
+* **Spectral indices** (NDVI) — column arithmetic over a real Sentinel-2 scene.
+* **Climatology, anomalies, zonal means** — `GROUP BY` and self-`JOIN` against
+  the 0.25° **ARCO-ERA5** archive registered as a lazy table. Each query is
+  bounded to a small window (a few days over a region) and reads only that
+  slice — the point is that you can aim a query at a multi-decade archive and
+  pay only for the data it asks for, not that the query scans the whole record.
+* **Forecast skill** — scoring the **Pangu-Weather** and **GraphCast** ML models
+  against ERA5 (WeatherBench 2) as a `JOIN` on `valid_time = init + lead`; it
+  reproduces the published result that GraphCast beats Pangu at every lead.
+* **Raster × vector zonal stats** — a range `JOIN` of the ERA5 grid against a
+  table of regions.
+* **Reprojection and regridding** — a scalar PROJ UDF (validated against Earth
+  Engine's own geodesy via [Xee](https://github.com/google/Xee)) and a
+  sparse-weight-table `JOIN` (regridding real SRTM terrain).
+
+Every case matches its array reference. The headline finding: these operations
+are not really "array" operations at all — they are `GROUP BY`, `JOIN`, window
+functions, and `CASE` in disguise, and a query engine runs them at scale. See
+[`benchmarks/geospatial/`](benchmarks/geospatial/) and the write-up,
+[Geospatial operations are relational operations](docs/geospatial.md).
+
 ## Why does this work?
 
 Underneath Xarray, Dask, and Pandas, there are NumPy arrays. These are paged in

diff --git a/benchmarks/geospatial/01_ndvi.py b/benchmarks/geospatial/01_ndvi.py
@@ -0,0 +1,149 @@
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#   "xarray-sql",
+#   "xarray",
+#   "aiohttp",
+#   "requests",
+#   "pystac-client",
+#   "zarr>=3",
+#   "numpy",
+# ]
+#
+# [tool.uv.sources]
+# xarray-sql = { path = "../../", editable = true }
+# ///
+"""NDVI — "apply_ufunc over a raster" is just column arithmetic.
+
+The Normalized Difference Vegetation Index is the workhorse of optical remote
+sensing: ``NDVI = (NIR - Red) / (NIR + Red)``, computed per pixel. The array
+paradigm reaches for ``xarray.apply_ufunc`` (the coiled/benchmarks #1545
+"vectorized operations" case) to broadcast this over a whole scene.
+
+But a per-pixel formula over two bands is just *column arithmetic over two
+columns*::
+
+    SELECT x, y, (nir - red) / (nir + red) AS ndvi
+    FROM scene
+    ORDER BY y, x
+
+Each pixel is one row; the ufunc is the SELECT expression. Invalid pixels are
+already NaN (xarray decodes the band's ``_FillValue`` on open), and NaN
+propagates through the arithmetic on both sides — so the masking is free, no
+``CASE`` required.
+
+Dataset: a real Sentinel-2 L2A scene in **Zarr** from the ESA EOPF sample
+service, discovered with ``pystac-client`` and opened the canonical way with
+``xarray`` — ``xr.open_datatree`` yields the reflectance bands (B04=red,
+B08=NIR at 10 m) already scaled to reflectance and carrying their ``x``/``y``
+coordinates. We read one window so the case stays bounded. Requires network;
+skips cleanly if the service is offline.
+"""
+
+from __future__ import annotations
+
+import xarray as xr
+
+import xarray_sql as xql
+
+from _harness import (
+    CaseSkipped,
+    assert_grid_close,
+    measured,
+    run_case,
+    show_result,
+    show_sql,
+)
+
+# EOPF sample-service STAC catalog; an agricultural AOI near Torino, Italy, in
+# early May (peak spring growth). The search is deterministic — it resolves to
+# a specific archived Sentinel-2 product.
+_STAC = "https://stac.core.eopf.eodc.eu"
+_BBOX = [7.2, 44.5, 7.4, 44.7]
+_DATETIME = "2025-04-25/2025-05-05"
+
+# A 1024×1024 (~105 km²) window over vegetated valley floor.
+_Y0, _X0, _N = 4_000, 6_000, 1_024
+
+
+def _load_scene() -> tuple[xr.Dataset, str]:
+    """Discover a Sentinel-2 L2A product and open its 10 m red/NIR bands.
+
+    Idiomatic end to end: ``pystac-client`` finds the product, ``open_datatree``
+    opens the hierarchical EOPF Zarr, and the ``reflectance/r10m`` node already
+    carries B04/B08 scaled to reflectance (nodata decoded to NaN) with
+    ``x``/``y`` coordinates — no manual scaling or coordinate reconstruction.
+    """
+    try:
+        from pystac_client import Client
+
+        catalog = Client.open(_STAC)
+        search = catalog.search(
+            collections=["sentinel-2-l2a"],
+            bbox=_BBOX,
+            datetime=_DATETIME,
+            max_items=1,
+        )
+        item = next(search.items())
+        tree = xr.open_datatree(
+            item.assets["product"].href, engine="zarr", chunks={}
+        )
+    except StopIteration as exc:
+        raise CaseSkipped("no Sentinel-2 product found for the query") from exc
+    except Exception as exc:  # noqa: BLE001 — any failure → skip, not crash
+        raise CaseSkipped(f"EOPF Sentinel-2 unavailable ({exc})") from exc
+
+    r10m = tree["measurements/reflectance/r10m"].to_dataset()
+    scene = (
+        r10m[["b04", "b08"]]
+        .rename(b04="red", b08="nir")
+        .isel(y=slice(_Y0, _Y0 + _N), x=slice(_X0, _X0 + _N))
+    )
+    return scene, item.id
+
+
+def main() -> None:
+    scene, item_id = _load_scene()
+    n = scene.sizes["y"] * scene.sizes["x"]
+    print(f"  Sentinel-2 L2A {item_id}")
+    print(
+        f"  scene window: {dict(scene.sizes)}  ({n:,} pixels, B04=red/B08=NIR)"
+    )
+
+    ctx = xql.XarrayContext()
+    ctx.from_dataset("scene", scene, chunks={"y": 256, "x": 256})
+
+    sql = """
+        SELECT x, y, (nir - red) / (nir + red) AS ndvi
+        FROM scene
+        ORDER BY y, x
+    """
+    show_sql(sql)
+
+    for _ in measured("SQL NDVI"):
+        got = ctx.sql(sql).to_dataset(dims=["y", "x"]).ndvi
+
+    # Array reference: the same formula in pure xarray. ``.compute()`` reads the
+    # window and evaluates it here (the scene is lazy), so this measures the same
+    # read-and-compute the SQL side does — not just graph construction.
+    for _ in measured("xarray reference"):
+        ref = ((scene.nir - scene.red) / (scene.nir + scene.red)).compute()
+
+    # Compare the xarray way — aligned by coordinate label, so the ORDER BY
+    # above is enough and neither side needs an explicit sort.
+    assert_grid_close("NDVI (per-pixel)", got, ref, rtol=1e-6)
+
+    show_result(got)
+
+    valid = ref.notnull()
+    print(
+        f"\n  NDVI over {int(valid.sum()):,} valid pixels: "
+        f"min {float(ref.min()):.3f}, "
+        f"mean {float(ref.mean()):.3f}, "
+        f"max {float(ref.max()):.3f}"
+    )
+
+
+if __name__ == "__main__":
+    raise SystemExit(run_case(main, "NDVI: per-pixel column arithmetic"))
diff --git a/benchmarks/geospatial/02_climatology.py b/benchmarks/geospatial/02_climatology.py
@@ -0,0 +1,137 @@
+# /// script
+# requires-python = ">=3.11"
+# dependencies = [
+#   "xarray-sql",
+#   "xarray",
+#   "gcsfs",
+#   "zarr>=3",
+# ]
+#
+# [tool.uv.sources]
+# xarray-sql = { path = "../../", editable = true }
+# ///
+"""Diurnal climatology — the "rechunk + grouped reduction" that is a GROUP BY.
+
+A *climatology* is the average value for each time-of-cycle, computed
+independently at every location: "what is the typical temperature here at
+06:00?" In the array paradigm (and in the coiled/benchmarks #1545 write-up)
+this is the canonical painful workload — load native Zarr chunks, *rechunk* to
+put all of time in one chunk ("pencils"), run a grouped reduction over the
+calendar, then rechunk back to "pancakes" for output.
+
+The rechunking exists only to serve the array layout. The *operation* is::
+
+    SELECT latitude, longitude, hour_of_day, AVG("2m_temperature")
+    GROUP BY latitude, longitude, hour_of_day
+
+Group by location and time-of-cycle, average the rest — the same answer as
+``da.groupby("time.hour").mean()``. ERA5 is hourly, so grouping by hour of day
+gives a clean 24-bin **diurnal cycle**, one sample per day in the window.
+
+We register the full ARCO-ERA5 archive as a lazy table, but the climatology here
+is computed over a *bounded window* — a few summer days over a CONUS-ish box. The
+``WHERE`` prunes the read, so the query touches only ``2m_temperature`` over that
+window and never scans the rest of the archive. The point is not that we reduce
+the whole record; it is that you can aim a query at a multi-decade archive and pay
+only for the slice it asks for.
+"""
+
+from __future__ import annotations
+
+import datetime
+
+import xarray as xr
+
+import xarray_sql as xql
+
+from _harness import (
+    CaseSkipped,
+    assert_grid_close,
+    measured,
+    run_case,
+    show_result,
+    show_sql,
+    timed,
+)
+
+_URL = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
+# A few days over a CONUS-ish box (ERA5 latitude descends; lon is 0–360°E).
+_START, _END = datetime.datetime(2020, 6, 1), datetime.datetime(2020, 6, 3, 23)
+_LAT_N, _LAT_S = 50.0, 25.0
+_LON_W, _LON_E = 235.0, 290.0
+_PARAMS = {
+    "start": _START,
+    "end": _END,
+    "lat_s": _LAT_S,
+    "lat_n": _LAT_N,
+    "lon_w": _LON_W,
+    "lon_e": _LON_E,
+}
+
+
+def main() -> None:
+    # Open the full ARCO-ERA5 archive lazily — no data is read here. ERA5 mixes
+    # surface (time, lat, lon) and atmospheric (… level …) variables, so register
+    # it as two tables under an ``era5`` schema; the query below touches only the
+    # surface table's 2m_temperature.
+    try:
+        import gcsfs  # noqa: F401 — required by the gs:// protocol
+
+        ds = xr.open_zarr(_URL, chunks=None, storage_options={"token": "anon"})
+    except Exception as exc:  # noqa: BLE001 — any failure → skip, not crash
+        raise CaseSkipped(f"ARCO-ERA5 unavailable ({exc})") from exc
+
+    ctx = xql.XarrayContext()
+    with timed("register full ERA5 (lazy)"):
+        ctx.from_dataset(
+            "era5",
+            ds,
+            chunks={"time": 6},
+            table_names={
+                ("time", "latitude", "longitude"): "surface",
+                ("time", "level", "latitude", "longitude"): "atmosphere",
+            },
+        )
+
+    sql = """
+        SELECT latitude,
+               longitude,
+               date_part('hour', time) AS hour,
+               AVG("2m_temperature") - 273.15 AS clim_c
+        FROM era5.surface
+        WHERE time      BETWEEN $start AND $end
+          AND latitude  BETWEEN $lat_s AND $lat_n
+          AND longitude BETWEEN $lon_w AND $lon_e
+        GROUP BY latitude, longitude, date_part('hour', time)
+        ORDER BY latitude DESC, longitude, hour
+    """
+    show_sql(sql)
+
+    # A climatology is a gridded product: round-trip the result back to an
+    # xarray Dataset keyed by (latitude, longitude, hour) — how it is used.
+    for _ in measured("SQL diurnal climatology (lazy read)"):
+        got = ctx.sql(sql, param_values=_PARAMS).to_dataset(
+            dims=["latitude", "longitude", "hour"]
+        )
+
+    # Array reference: the textbook groupby-over-the-cycle reduction, in °C —
+    # the same lazy window, materialized only on demand.
+    for _ in measured("xarray reference"):
+        window = ds["2m_temperature"].sel(
+            time=slice(_START, _END),
+            latitude=slice(_LAT_N, _LAT_S),
+            longitude=slice(_LON_W, _LON_E),
+        )
+        ref = window.groupby("time.hour").mean("time") - 273.15
+
+    assert_grid_close(
+        "diurnal climatology (°C)", got.clim_c, ref, rtol=1e-4, atol=1e-2
+    )
+
+    show_result(got)
+
+
+if __name__ == "__main__":
+    raise SystemExit(
+        run_case(main, "Climatology: GROUP BY lat, lon, hour (ARCO-ERA5)")
+    )