-
Notifications
You must be signed in to change notification settings - Fork 13
Geospatial SQL benchmark suite: prove core geo/climate ops are relational #177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
alxmrs
wants to merge
26
commits into
main
Choose a base branch
from
geospatial-sql-benchmarks
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
26 commits
Select commit
Hold shift + click to select a range
a44d677
Add geospatial SQL benchmark suite (benchmarks/geospatial/)
alxmrs 6bc41aa
NDVI: open Sentinel-2 idiomatically (pystac + open_datatree), drop ma…
alxmrs e1cbc8b
Make benchmark cases idiomatic xarray: fluent opens, Dataset results,…
alxmrs a971f2b
Lazy registration, parameterized queries, model-concat: address round…
alxmrs 3bd17a7
Cases 07/08: use Earth Engine via Xee (independent reproj reference; …
alxmrs df67d98
README: add "Does it work?" section (addresses #46); drop "honest"
alxmrs 8d7ea6a
EE cases: initialize via ADC (no blocked OAuth); fix 07 coord names —…
alxmrs 866d055
Fix run_all.sh: run from any directory, run all cases, use the local …
alxmrs 05b3466
Point each case's uv metadata at the local xarray-sql checkout
alxmrs 7e3dbd1
Print each case's result (xarray repr) after the match is verified
alxmrs f1a2c03
Docs: write for the library's reader, not the review thread
alxmrs 0e8b15c
docs: cite the coiled/benchmarks #1545 survey the operations come from
alxmrs be07550
Add opt-in performance profiling to the harness (CSV perf table)
alxmrs e0a2b03
Profiling via a `for _ in measured(...)` loop instead of re-running m…
alxmrs c7454b0
Address review: partial-result guard, lazy NDVI, no-reshape regrid
alxmrs 7b199f1
Add cold-vs-cold perf driver (fair SQL-vs-xarray timing)
alxmrs 0bd1e05
docs: publish perf results, analysis, and the SQL-vs-arrays tradeoff
alxmrs dfd0039
simpler conclusion title.
alxmrs 1cd2c93
Address review: fair cold-vs-cold (.compute), to_pandas headline, doc…
alxmrs 0941e0b
Address review: drop "whole archive" overclaim, inline perf summary
alxmrs 3f9c295
Re-collect cold-vs-cold timings with the .compute() fix
alxmrs 7bfd56f
Address review: lazy zonal-stats reference, split perf summary back out
alxmrs 5ccddae
Address review: make case 08 a symmetric laziness test
alxmrs 1b03bec
Case 05: one partition each (both chunks=100); document the chunk fin…
alxmrs 49356b5
Add case 09: warp (reproject + resample) as UDF → weight-table JOIN
alxmrs b7c581c
docs: refresh perf table — full 9-case run on one in-region VM with EE
alxmrs File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,149 @@ | ||
| #!/usr/bin/env python3 | ||
| # /// script | ||
| # requires-python = ">=3.11" | ||
| # dependencies = [ | ||
| # "xarray-sql", | ||
| # "xarray", | ||
| # "aiohttp", | ||
| # "requests", | ||
| # "pystac-client", | ||
| # "zarr>=3", | ||
| # "numpy", | ||
| # ] | ||
| # | ||
| # [tool.uv.sources] | ||
| # xarray-sql = { path = "../../", editable = true } | ||
| # /// | ||
| """NDVI — "apply_ufunc over a raster" is just column arithmetic. | ||
|
|
||
| The Normalized Difference Vegetation Index is the workhorse of optical remote | ||
| sensing: ``NDVI = (NIR - Red) / (NIR + Red)``, computed per pixel. The array | ||
| paradigm reaches for ``xarray.apply_ufunc`` (the coiled/benchmarks #1545 | ||
| "vectorized operations" case) to broadcast this over a whole scene. | ||
|
|
||
| But a per-pixel formula over two bands is just *column arithmetic over two | ||
| columns*:: | ||
|
|
||
| SELECT x, y, (nir - red) / (nir + red) AS ndvi | ||
| FROM scene | ||
| ORDER BY y, x | ||
|
|
||
| Each pixel is one row; the ufunc is the SELECT expression. Invalid pixels are | ||
| already NaN (xarray decodes the band's ``_FillValue`` on open), and NaN | ||
| propagates through the arithmetic on both sides — so the masking is free, no | ||
| ``CASE`` required. | ||
|
|
||
| Dataset: a real Sentinel-2 L2A scene in **Zarr** from the ESA EOPF sample | ||
| service, discovered with ``pystac-client`` and opened the canonical way with | ||
| ``xarray`` — ``xr.open_datatree`` yields the reflectance bands (B04=red, | ||
| B08=NIR at 10 m) already scaled to reflectance and carrying their ``x``/``y`` | ||
| coordinates. We read one window so the case stays bounded. Requires network; | ||
| skips cleanly if the service is offline. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import xarray as xr | ||
|
|
||
| import xarray_sql as xql | ||
|
|
||
| from _harness import ( | ||
| CaseSkipped, | ||
| assert_grid_close, | ||
| measured, | ||
| run_case, | ||
| show_result, | ||
| show_sql, | ||
| ) | ||
|
|
||
| # EOPF sample-service STAC catalog; an agricultural AOI near Torino, Italy, in | ||
| # early May (peak spring growth). The search is deterministic — it resolves to | ||
| # a specific archived Sentinel-2 product. | ||
| _STAC = "https://stac.core.eopf.eodc.eu" | ||
| _BBOX = [7.2, 44.5, 7.4, 44.7] | ||
| _DATETIME = "2025-04-25/2025-05-05" | ||
|
|
||
| # A 1024×1024 (~105 km²) window over vegetated valley floor. | ||
| _Y0, _X0, _N = 4_000, 6_000, 1_024 | ||
|
|
||
|
|
||
| def _load_scene() -> tuple[xr.Dataset, str]: | ||
| """Discover a Sentinel-2 L2A product and open its 10 m red/NIR bands. | ||
|
|
||
| Idiomatic end to end: ``pystac-client`` finds the product, ``open_datatree`` | ||
| opens the hierarchical EOPF Zarr, and the ``reflectance/r10m`` node already | ||
| carries B04/B08 scaled to reflectance (nodata decoded to NaN) with | ||
| ``x``/``y`` coordinates — no manual scaling or coordinate reconstruction. | ||
| """ | ||
| try: | ||
| from pystac_client import Client | ||
|
|
||
| catalog = Client.open(_STAC) | ||
| search = catalog.search( | ||
| collections=["sentinel-2-l2a"], | ||
| bbox=_BBOX, | ||
| datetime=_DATETIME, | ||
| max_items=1, | ||
| ) | ||
| item = next(search.items()) | ||
| tree = xr.open_datatree( | ||
| item.assets["product"].href, engine="zarr", chunks={} | ||
| ) | ||
| except StopIteration as exc: | ||
| raise CaseSkipped("no Sentinel-2 product found for the query") from exc | ||
| except Exception as exc: # noqa: BLE001 — any failure → skip, not crash | ||
| raise CaseSkipped(f"EOPF Sentinel-2 unavailable ({exc})") from exc | ||
|
|
||
| r10m = tree["measurements/reflectance/r10m"].to_dataset() | ||
| scene = ( | ||
| r10m[["b04", "b08"]] | ||
| .rename(b04="red", b08="nir") | ||
| .isel(y=slice(_Y0, _Y0 + _N), x=slice(_X0, _X0 + _N)) | ||
| ) | ||
| return scene, item.id | ||
|
|
||
|
|
||
| def main() -> None: | ||
| scene, item_id = _load_scene() | ||
| n = scene.sizes["y"] * scene.sizes["x"] | ||
| print(f" Sentinel-2 L2A {item_id}") | ||
| print( | ||
| f" scene window: {dict(scene.sizes)} ({n:,} pixels, B04=red/B08=NIR)" | ||
| ) | ||
|
|
||
| ctx = xql.XarrayContext() | ||
| ctx.from_dataset("scene", scene, chunks={"y": 256, "x": 256}) | ||
|
|
||
| sql = """ | ||
| SELECT x, y, (nir - red) / (nir + red) AS ndvi | ||
| FROM scene | ||
| ORDER BY y, x | ||
| """ | ||
| show_sql(sql) | ||
|
|
||
| for _ in measured("SQL NDVI"): | ||
| got = ctx.sql(sql).to_dataset(dims=["y", "x"]).ndvi | ||
|
|
||
| # Array reference: the same formula in pure xarray. ``.compute()`` reads the | ||
| # window and evaluates it here (the scene is lazy), so this measures the same | ||
| # read-and-compute the SQL side does — not just graph construction. | ||
| for _ in measured("xarray reference"): | ||
| ref = ((scene.nir - scene.red) / (scene.nir + scene.red)).compute() | ||
|
|
||
| # Compare the xarray way — aligned by coordinate label, so the ORDER BY | ||
| # above is enough and neither side needs an explicit sort. | ||
| assert_grid_close("NDVI (per-pixel)", got, ref, rtol=1e-6) | ||
|
|
||
| show_result(got) | ||
|
|
||
| valid = ref.notnull() | ||
| print( | ||
| f"\n NDVI over {int(valid.sum()):,} valid pixels: " | ||
| f"min {float(ref.min()):.3f}, " | ||
| f"mean {float(ref.mean()):.3f}, " | ||
| f"max {float(ref.max()):.3f}" | ||
| ) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| raise SystemExit(run_case(main, "NDVI: per-pixel column arithmetic")) | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,137 @@ | ||
| # /// script | ||
| # requires-python = ">=3.11" | ||
| # dependencies = [ | ||
| # "xarray-sql", | ||
| # "xarray", | ||
| # "gcsfs", | ||
| # "zarr>=3", | ||
| # ] | ||
| # | ||
| # [tool.uv.sources] | ||
| # xarray-sql = { path = "../../", editable = true } | ||
| # /// | ||
| """Diurnal climatology — the "rechunk + grouped reduction" that is a GROUP BY. | ||
|
|
||
| A *climatology* is the average value for each time-of-cycle, computed | ||
| independently at every location: "what is the typical temperature here at | ||
| 06:00?" In the array paradigm (and in the coiled/benchmarks #1545 write-up) | ||
| this is the canonical painful workload — load native Zarr chunks, *rechunk* to | ||
| put all of time in one chunk ("pencils"), run a grouped reduction over the | ||
| calendar, then rechunk back to "pancakes" for output. | ||
|
|
||
| The rechunking exists only to serve the array layout. The *operation* is:: | ||
|
|
||
| SELECT latitude, longitude, hour_of_day, AVG("2m_temperature") | ||
| GROUP BY latitude, longitude, hour_of_day | ||
|
|
||
| Group by location and time-of-cycle, average the rest — the same answer as | ||
| ``da.groupby("time.hour").mean()``. ERA5 is hourly, so grouping by hour of day | ||
| gives a clean 24-bin **diurnal cycle**, one sample per day in the window. | ||
|
|
||
| We register the full ARCO-ERA5 archive as a lazy table, but the climatology here | ||
| is computed over a *bounded window* — a few summer days over a CONUS-ish box. The | ||
| ``WHERE`` prunes the read, so the query touches only ``2m_temperature`` over that | ||
| window and never scans the rest of the archive. The point is not that we reduce | ||
| the whole record; it is that you can aim a query at a multi-decade archive and pay | ||
| only for the slice it asks for. | ||
| """ | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import datetime | ||
|
|
||
| import xarray as xr | ||
|
|
||
| import xarray_sql as xql | ||
|
|
||
| from _harness import ( | ||
| CaseSkipped, | ||
| assert_grid_close, | ||
| measured, | ||
| run_case, | ||
| show_result, | ||
| show_sql, | ||
| timed, | ||
| ) | ||
|
|
||
| _URL = "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3" | ||
| # A few days over a CONUS-ish box (ERA5 latitude descends; lon is 0–360°E). | ||
| _START, _END = datetime.datetime(2020, 6, 1), datetime.datetime(2020, 6, 3, 23) | ||
| _LAT_N, _LAT_S = 50.0, 25.0 | ||
| _LON_W, _LON_E = 235.0, 290.0 | ||
| _PARAMS = { | ||
| "start": _START, | ||
| "end": _END, | ||
| "lat_s": _LAT_S, | ||
| "lat_n": _LAT_N, | ||
| "lon_w": _LON_W, | ||
| "lon_e": _LON_E, | ||
| } | ||
|
|
||
|
|
||
| def main() -> None: | ||
| # Open the full ARCO-ERA5 archive lazily — no data is read here. ERA5 mixes | ||
| # surface (time, lat, lon) and atmospheric (… level …) variables, so register | ||
| # it as two tables under an ``era5`` schema; the query below touches only the | ||
| # surface table's 2m_temperature. | ||
| try: | ||
| import gcsfs # noqa: F401 — required by the gs:// protocol | ||
|
|
||
| ds = xr.open_zarr(_URL, chunks=None, storage_options={"token": "anon"}) | ||
| except Exception as exc: # noqa: BLE001 — any failure → skip, not crash | ||
| raise CaseSkipped(f"ARCO-ERA5 unavailable ({exc})") from exc | ||
|
|
||
| ctx = xql.XarrayContext() | ||
| with timed("register full ERA5 (lazy)"): | ||
| ctx.from_dataset( | ||
| "era5", | ||
| ds, | ||
| chunks={"time": 6}, | ||
| table_names={ | ||
| ("time", "latitude", "longitude"): "surface", | ||
| ("time", "level", "latitude", "longitude"): "atmosphere", | ||
| }, | ||
| ) | ||
|
|
||
| sql = """ | ||
| SELECT latitude, | ||
| longitude, | ||
| date_part('hour', time) AS hour, | ||
| AVG("2m_temperature") - 273.15 AS clim_c | ||
| FROM era5.surface | ||
| WHERE time BETWEEN $start AND $end | ||
|
alxmrs marked this conversation as resolved.
|
||
| AND latitude BETWEEN $lat_s AND $lat_n | ||
| AND longitude BETWEEN $lon_w AND $lon_e | ||
| GROUP BY latitude, longitude, date_part('hour', time) | ||
|
alxmrs marked this conversation as resolved.
|
||
| ORDER BY latitude DESC, longitude, hour | ||
| """ | ||
| show_sql(sql) | ||
|
|
||
| # A climatology is a gridded product: round-trip the result back to an | ||
| # xarray Dataset keyed by (latitude, longitude, hour) — how it is used. | ||
| for _ in measured("SQL diurnal climatology (lazy read)"): | ||
| got = ctx.sql(sql, param_values=_PARAMS).to_dataset( | ||
| dims=["latitude", "longitude", "hour"] | ||
| ) | ||
|
|
||
| # Array reference: the textbook groupby-over-the-cycle reduction, in °C — | ||
| # the same lazy window, materialized only on demand. | ||
| for _ in measured("xarray reference"): | ||
| window = ds["2m_temperature"].sel( | ||
| time=slice(_START, _END), | ||
| latitude=slice(_LAT_N, _LAT_S), | ||
| longitude=slice(_LON_W, _LON_E), | ||
| ) | ||
| ref = window.groupby("time.hour").mean("time") - 273.15 | ||
|
alxmrs marked this conversation as resolved.
|
||
|
|
||
| assert_grid_close( | ||
| "diurnal climatology (°C)", got.clim_c, ref, rtol=1e-4, atol=1e-2 | ||
| ) | ||
|
|
||
| show_result(got) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| raise SystemExit( | ||
| run_case(main, "Climatology: GROUP BY lat, lon, hour (ARCO-ERA5)") | ||
| ) | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.