Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
a44d677
Add geospatial SQL benchmark suite (benchmarks/geospatial/)
alxmrs Jun 23, 2026
6bc41aa
NDVI: open Sentinel-2 idiomatically (pystac + open_datatree), drop ma…
alxmrs Jun 23, 2026
e1cbc8b
Make benchmark cases idiomatic xarray: fluent opens, Dataset results,…
alxmrs Jun 23, 2026
a971f2b
Lazy registration, parameterized queries, model-concat: address round…
alxmrs Jun 24, 2026
3bd17a7
Cases 07/08: use Earth Engine via Xee (independent reproj reference; …
alxmrs Jun 24, 2026
df67d98
README: add "Does it work?" section (addresses #46); drop "honest"
alxmrs Jun 24, 2026
8d7ea6a
EE cases: initialize via ADC (no blocked OAuth); fix 07 coord names —…
alxmrs Jun 24, 2026
866d055
Fix run_all.sh: run from any directory, run all cases, use the local …
alxmrs Jun 24, 2026
05b3466
Point each case's uv metadata at the local xarray-sql checkout
alxmrs Jun 24, 2026
7e3dbd1
Print each case's result (xarray repr) after the match is verified
alxmrs Jun 24, 2026
f1a2c03
Docs: write for the library's reader, not the review thread
alxmrs Jun 24, 2026
0e8b15c
docs: cite the coiled/benchmarks #1545 survey the operations come from
alxmrs Jun 24, 2026
be07550
Add opt-in performance profiling to the harness (CSV perf table)
alxmrs Jun 24, 2026
e0a2b03
Profiling via a `for _ in measured(...)` loop instead of re-running m…
alxmrs Jun 24, 2026
c7454b0
Address review: partial-result guard, lazy NDVI, no-reshape regrid
alxmrs Jun 24, 2026
7b199f1
Add cold-vs-cold perf driver (fair SQL-vs-xarray timing)
alxmrs Jun 24, 2026
0bd1e05
docs: publish perf results, analysis, and the SQL-vs-arrays tradeoff
alxmrs Jun 25, 2026
dfd0039
simpler conclusion title.
alxmrs Jun 25, 2026
1cd2c93
Address review: fair cold-vs-cold (.compute), to_pandas headline, doc…
alxmrs Jun 25, 2026
0941e0b
Address review: drop "whole archive" overclaim, inline perf summary
alxmrs Jun 25, 2026
3f9c295
Re-collect cold-vs-cold timings with the .compute() fix
alxmrs Jun 25, 2026
7bfd56f
Address review: lazy zonal-stats reference, split perf summary back out
alxmrs Jun 25, 2026
5ccddae
Address review: make case 08 a symmetric laziness test
alxmrs Jun 25, 2026
1b03bec
Case 05: one partition each (both chunks=100); document the chunk fin…
alxmrs Jun 25, 2026
49356b5
Add case 09: warp (reproject + resample) as UDF → weight-table JOIN
alxmrs Jun 25, 2026
b7c581c
docs: refresh perf table — full 9-case run on one in-region VM with EE
alxmrs Jun 25, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 29 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -142,6 +142,35 @@ that lets the DB engine translate the underlying Dataset arrays into DataFusion
Ultimately, the initial insight of the `pivot()` function -- that any ndarray can be
translated into a 2D table -- underlies this performant query mechanism.

## Does it work?

Yes. The recurring worry is that the SQL interface is a toy — fine for `SELECT`s,
but not for the operations geoscience actually runs. So we wrote a suite that
takes the staples of geospatial and climate analysis — the ones we assume *need*
an array library — and expresses each one in SQL, then **checks the SQL answer
against an xarray/array reference** to floating-point tolerance:

* **Spectral indices** (NDVI) — column arithmetic over a real Sentinel-2 scene.
* **Climatology, anomalies, zonal means** — `GROUP BY` and self-`JOIN` against
the 0.25° **ARCO-ERA5** archive registered as a lazy table. Each query is
bounded to a small window (a few days over a region) and reads only that
slice — the point is that you can aim a query at a multi-decade archive and
pay only for the data it asks for, not that the query scans the whole record.
* **Forecast skill** — scoring the **Pangu-Weather** and **GraphCast** ML models
against ERA5 (WeatherBench 2) as a `JOIN` on `valid_time = init + lead`; it
reproduces the published result that GraphCast beats Pangu at every lead.
* **Raster × vector zonal stats** — a range `JOIN` of the ERA5 grid against a
table of regions.
* **Reprojection and regridding** — a scalar PROJ UDF (validated against Earth
Engine's own geodesy via [Xee](https://github.com/google/Xee)) and a
sparse-weight-table `JOIN` (regridding real SRTM terrain).

Every case matches its array reference. The headline finding: these operations
are not really "array" operations at all — they are `GROUP BY`, `JOIN`, window
functions, and `CASE` in disguise, and a query engine runs them at scale. See
[`benchmarks/geospatial/`](benchmarks/geospatial/) and the write-up,
[Geospatial operations are relational operations](docs/geospatial.md).

## Why does this work?

Underneath Xarray, Dask, and Pandas, there are NumPy arrays. These are paged in
Expand Down
149 changes: 149 additions & 0 deletions benchmarks/geospatial/01_ndvi.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,149 @@
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "xarray-sql",
# "xarray",
# "aiohttp",
# "requests",
# "pystac-client",
# "zarr>=3",
# "numpy",
Comment thread
alxmrs marked this conversation as resolved.
# ]
#
# [tool.uv.sources]
# xarray-sql = { path = "../../", editable = true }
# ///
"""NDVI — "apply_ufunc over a raster" is just column arithmetic.

The Normalized Difference Vegetation Index is the workhorse of optical remote
sensing: ``NDVI = (NIR - Red) / (NIR + Red)``, computed per pixel. The array
paradigm reaches for ``xarray.apply_ufunc`` (the coiled/benchmarks #1545
"vectorized operations" case) to broadcast this over a whole scene.

But a per-pixel formula over two bands is just *column arithmetic over two
columns*::

SELECT x, y, (nir - red) / (nir + red) AS ndvi
FROM scene
ORDER BY y, x

Each pixel is one row; the ufunc is the SELECT expression. Invalid pixels are
already NaN (xarray decodes the band's ``_FillValue`` on open), and NaN
propagates through the arithmetic on both sides — so the masking is free, no
``CASE`` required.

Dataset: a real Sentinel-2 L2A scene in **Zarr** from the ESA EOPF sample
service, discovered with ``pystac-client`` and opened the canonical way with
``xarray`` — ``xr.open_datatree`` yields the reflectance bands (B04=red,
B08=NIR at 10 m) already scaled to reflectance and carrying their ``x``/``y``
coordinates. We read one window so the case stays bounded. Requires network;
skips cleanly if the service is offline.
"""

from __future__ import annotations

import xarray as xr

import xarray_sql as xql

from _harness import (
CaseSkipped,
assert_grid_close,
measured,
run_case,
show_result,
show_sql,
)

# EOPF sample-service STAC catalog; an agricultural AOI near Torino, Italy, in
# early May (peak spring growth). The search is deterministic — it resolves to
# a specific archived Sentinel-2 product.
_STAC = "https://stac.core.eopf.eodc.eu"
_BBOX = [7.2, 44.5, 7.4, 44.7]
_DATETIME = "2025-04-25/2025-05-05"

# A 1024×1024 (~105 km²) window over vegetated valley floor.
_Y0, _X0, _N = 4_000, 6_000, 1_024


def _load_scene() -> tuple[xr.Dataset, str]:
"""Discover a Sentinel-2 L2A product and open its 10 m red/NIR bands.

Idiomatic end to end: ``pystac-client`` finds the product, ``open_datatree``
opens the hierarchical EOPF Zarr, and the ``reflectance/r10m`` node already
carries B04/B08 scaled to reflectance (nodata decoded to NaN) with
``x``/``y`` coordinates — no manual scaling or coordinate reconstruction.
"""
try:
from pystac_client import Client

catalog = Client.open(_STAC)
search = catalog.search(
collections=["sentinel-2-l2a"],
bbox=_BBOX,
datetime=_DATETIME,
max_items=1,
)
item = next(search.items())
tree = xr.open_datatree(
item.assets["product"].href, engine="zarr", chunks={}
)
except StopIteration as exc:
raise CaseSkipped("no Sentinel-2 product found for the query") from exc
except Exception as exc: # noqa: BLE001 — any failure → skip, not crash
raise CaseSkipped(f"EOPF Sentinel-2 unavailable ({exc})") from exc

r10m = tree["measurements/reflectance/r10m"].to_dataset()
scene = (
r10m[["b04", "b08"]]
.rename(b04="red", b08="nir")
.isel(y=slice(_Y0, _Y0 + _N), x=slice(_X0, _X0 + _N))
)
return scene, item.id


def main() -> None:
scene, item_id = _load_scene()
n = scene.sizes["y"] * scene.sizes["x"]
print(f" Sentinel-2 L2A {item_id}")
print(
f" scene window: {dict(scene.sizes)} ({n:,} pixels, B04=red/B08=NIR)"
)

ctx = xql.XarrayContext()
ctx.from_dataset("scene", scene, chunks={"y": 256, "x": 256})

sql = """
SELECT x, y, (nir - red) / (nir + red) AS ndvi
FROM scene
ORDER BY y, x
"""
show_sql(sql)

for _ in measured("SQL NDVI"):
got = ctx.sql(sql).to_dataset(dims=["y", "x"]).ndvi

# Array reference: the same formula in pure xarray. ``.compute()`` reads the
# window and evaluates it here (the scene is lazy), so this measures the same
# read-and-compute the SQL side does — not just graph construction.
for _ in measured("xarray reference"):
ref = ((scene.nir - scene.red) / (scene.nir + scene.red)).compute()

# Compare the xarray way — aligned by coordinate label, so the ORDER BY
# above is enough and neither side needs an explicit sort.
assert_grid_close("NDVI (per-pixel)", got, ref, rtol=1e-6)

show_result(got)

valid = ref.notnull()
print(
f"\n NDVI over {int(valid.sum()):,} valid pixels: "
f"min {float(ref.min()):.3f}, "
f"mean {float(ref.mean()):.3f}, "
f"max {float(ref.max()):.3f}"
)


if __name__ == "__main__":
raise SystemExit(run_case(main, "NDVI: per-pixel column arithmetic"))
137 changes: 137 additions & 0 deletions benchmarks/geospatial/02_climatology.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "xarray-sql",
# "xarray",
# "gcsfs",
# "zarr>=3",
# ]
#
# [tool.uv.sources]
# xarray-sql = { path = "../../", editable = true }
# ///
"""Diurnal climatology — the "rechunk + grouped reduction" that is a GROUP BY.

A *climatology* is the average value for each time-of-cycle, computed
independently at every location: "what is the typical temperature here at
06:00?" In the array paradigm (and in the coiled/benchmarks #1545 write-up)
this is the canonical painful workload — load native Zarr chunks, *rechunk* to
put all of time in one chunk ("pencils"), run a grouped reduction over the
calendar, then rechunk back to "pancakes" for output.

The rechunking exists only to serve the array layout. The *operation* is::

SELECT latitude, longitude, hour_of_day, AVG("2m_temperature")
GROUP BY latitude, longitude, hour_of_day

Group by location and time-of-cycle, average the rest — the same answer as
``da.groupby("time.hour").mean()``. ERA5 is hourly, so grouping by hour of day
gives a clean 24-bin **diurnal cycle**, one sample per day in the window.

We register the full ARCO-ERA5 archive as a lazy table, but the climatology here
is computed over a *bounded window* — a few summer days over a CONUS-ish box. The
``WHERE`` prunes the read, so the query touches only ``2m_temperature`` over that
window and never scans the rest of the archive. The point is not that we reduce
the whole record; it is that you can aim a query at a multi-decade archive and pay
only for the slice it asks for.
"""

from __future__ import annotations

import datetime

import xarray as xr

import xarray_sql as xql

from _harness import (
CaseSkipped,
assert_grid_close,
measured,
run_case,
show_result,
show_sql,
timed,
)

_URL = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3"
# A few days over a CONUS-ish box (ERA5 latitude descends; lon is 0–360°E).
_START, _END = datetime.datetime(2020, 6, 1), datetime.datetime(2020, 6, 3, 23)
_LAT_N, _LAT_S = 50.0, 25.0
_LON_W, _LON_E = 235.0, 290.0
_PARAMS = {
"start": _START,
"end": _END,
"lat_s": _LAT_S,
"lat_n": _LAT_N,
"lon_w": _LON_W,
"lon_e": _LON_E,
}


def main() -> None:
# Open the full ARCO-ERA5 archive lazily — no data is read here. ERA5 mixes
# surface (time, lat, lon) and atmospheric (… level …) variables, so register
# it as two tables under an ``era5`` schema; the query below touches only the
# surface table's 2m_temperature.
try:
import gcsfs # noqa: F401 — required by the gs:// protocol

ds = xr.open_zarr(_URL, chunks=None, storage_options={"token": "anon"})
except Exception as exc: # noqa: BLE001 — any failure → skip, not crash
raise CaseSkipped(f"ARCO-ERA5 unavailable ({exc})") from exc

ctx = xql.XarrayContext()
with timed("register full ERA5 (lazy)"):
ctx.from_dataset(
"era5",
ds,
chunks={"time": 6},
table_names={
("time", "latitude", "longitude"): "surface",
("time", "level", "latitude", "longitude"): "atmosphere",
},
)

sql = """
SELECT latitude,
longitude,
date_part('hour', time) AS hour,
AVG("2m_temperature") - 273.15 AS clim_c
FROM era5.surface
WHERE time BETWEEN $start AND $end
Comment thread
alxmrs marked this conversation as resolved.
AND latitude BETWEEN $lat_s AND $lat_n
AND longitude BETWEEN $lon_w AND $lon_e
GROUP BY latitude, longitude, date_part('hour', time)
Comment thread
alxmrs marked this conversation as resolved.
ORDER BY latitude DESC, longitude, hour
"""
show_sql(sql)

# A climatology is a gridded product: round-trip the result back to an
# xarray Dataset keyed by (latitude, longitude, hour) — how it is used.
for _ in measured("SQL diurnal climatology (lazy read)"):
got = ctx.sql(sql, param_values=_PARAMS).to_dataset(
dims=["latitude", "longitude", "hour"]
)

# Array reference: the textbook groupby-over-the-cycle reduction, in °C —
# the same lazy window, materialized only on demand.
for _ in measured("xarray reference"):
window = ds["2m_temperature"].sel(
time=slice(_START, _END),
latitude=slice(_LAT_N, _LAT_S),
longitude=slice(_LON_W, _LON_E),
)
ref = window.groupby("time.hour").mean("time") - 273.15
Comment thread
alxmrs marked this conversation as resolved.

assert_grid_close(
"diurnal climatology (°C)", got.clim_c, ref, rtol=1e-4, atol=1e-2
)

show_result(got)


if __name__ == "__main__":
raise SystemExit(
run_case(main, "Climatology: GROUP BY lat, lon, hour (ARCO-ERA5)")
)
Loading
Loading