feat: add NGWMN getters as an `ogc` sibling; extract a shared OGC engine by thodson-usgs · Pull Request #324 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-06-12T21:10:06Z

Ports the NGWMN functions from the R dataRetrieval PR (DOI-USGS/dataRetrieval#904) and, per review, refactors the Water Data OGC machinery into a shared engine so NGWMN and Water Data are sibling layers on top of it — NGWMN does not depend on Water Data.

Architecture

dataretrieval/
  ogc/                   generic, API-agnostic OGC engine (no service-specific config)
    engine.py            request build · paginate · parse · finalize · get_ogc_data
    chunking.py          URL-byte multi-value chunker   (moved from waterdata/)
    filters.py           CQL `filter` splitting         (moved from waterdata/)
    progress.py          self-updating status line      (moved from waterdata/_progress.py)
  waterdata/             Water Data layer on the engine
    api.py               typed OGC getters: monitoring-locations, daily, continuous, peaks,
                         time-series/combined metadata, field/channel measurements, samples, …
    stats.py             Statistics API client (get_stats_por / get_stats_date_range)
    ratings.py           rating-curve retrieval via the Water Data STAC catalog
    nearest.py           get_nearest_continuous convenience on top of the getters
    utils.py             service→id map · WATERDATA_DIALECT · get_ogc_data wrapper · engine re-exports
    types.py             service/profile Literals + lookup tables
  ngwmn.py               sibling getters (imports only dataretrieval.ogc):
                         get_sites / get_water_level / get_lithology /
                         get_well_construction / get_providers
  codes/states.py        to_state(): shared name/postal/FIPS state normalizer used by both layers

Layering / import rules (the organization a reviewer can check):

ogc/* is fully API-agnostic — it imports no dataretrieval.waterdata or dataretrieval.ngwmn.
waterdata/* and ngwmn.py are siblings on the engine; ngwmn.py imports only dataretrieval.ogc (+ codes.states), never dataretrieval.waterdata.
api.py / stats.py / ratings.py / nearest.py are thin getter layers; the shared Water-Data state (service→id map, dialect, the get_ogc_data wrapper) lives once in utils.py.
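One way a reviewer could check these import rules mechanically is a small AST scan. This is a sketch, not something the PR ships: `imported_modules`, `violations`, and `FORBIDDEN_FOR_OGC` are hypothetical names, and relative imports are deliberately ignored.

```python
import ast

# Hypothetical rule set: packages that ogc/* must never import.
FORBIDDEN_FOR_OGC = ("dataretrieval.waterdata", "dataretrieval.ngwmn")


def imported_modules(source: str) -> set[str]:
    """Return absolute dotted names imported by a piece of Python source.

    For `from pkg import name`, both `pkg` and `pkg.name` are recorded so that
    `from dataretrieval import waterdata` is caught too. Relative imports
    (level > 0) are skipped in this sketch.
    """
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
            modules.update(f"{node.module}.{alias.name}" for alias in node.names)
    return modules


def violations(source: str, forbidden: tuple[str, ...]) -> list[str]:
    """List imported names that fall inside any forbidden package."""
    return sorted(
        name
        for name in imported_modules(source)
        if any(name == pkg or name.startswith(pkg + ".") for pkg in forbidden)
    )
```

Running `violations(path.read_text(), FORBIDDEN_FOR_OGC)` over every file under `dataretrieval/ogc/` and asserting that each result is empty would turn the first rule into a unit test.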

The engine is parameterized, not branched: get_ogc_data(args, service, output_id, *, base_url, extra_id_cols, dialect). An OgcDialect(cql2_services, date_only_services, …) (threaded via a context variable, like the base-url context) carries per-API quirks — Water Data POSTs CQL2 for monitoring-locations and renders daily time args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits.
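To illustrate "parameterized, not branched", here is a minimal sketch of a dialect carried by a context variable. Everything in it is an assumption made for illustration: `DialectSketch` stands in for `OgcDialect` and models only the two fields named above, `use_dialect` and `plan_request` are hypothetical helpers, and the rule "POST whenever the service is in `cql2_services`" simplifies whatever conditions the real engine applies.

```python
from __future__ import annotations

from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class DialectSketch:
    """Stand-in for OgcDialect; only the two fields named in the PR are modelled."""

    cql2_services: frozenset[str] = frozenset()
    date_only_services: frozenset[str] = frozenset()


# The default dialect has no quirks, which is what the PR says NGWMN needs.
_dialect: ContextVar[DialectSketch] = ContextVar("dialect", default=DialectSketch())


@contextmanager
def use_dialect(dialect: DialectSketch) -> Iterator[None]:
    """Activate a dialect for the duration of a block, then restore the old one."""
    token = _dialect.set(dialect)
    try:
        yield
    finally:
        _dialect.reset(token)


def plan_request(service: str) -> dict[str, str]:
    """Engine-side decision that consults the active dialect, not service names."""
    dialect = _dialect.get()
    return {
        "method": "POST" if service in dialect.cql2_services else "GET",
        "time_format": "date" if service in dialect.date_only_services else "datetime",
    }


# Hypothetical Water Data dialect built from the two quirks described above.
WATERDATA_SKETCH = DialectSketch(
    cql2_services=frozenset({"monitoring-locations"}),
    date_only_services=frozenset({"daily"}),
)
```

Inside `with use_dialect(WATERDATA_SKETCH):` the same `plan_request("daily")` call renders date-only times; outside it, the quirk-free default applies. The engine function itself contains no service names.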

from dataretrieval import ngwmn

df, md = ngwmn.get_sites(state="Wisconsin")      # name, postal ("WI"), or FIPS ("55")
df, md = ngwmn.get_water_level(
    monitoring_location_id="USGS-272838082142201",
    datetime=["2022-01-01", "2024-01-01"],
)
df, md = ngwmn.get_water_level(                   # NGWMN ids aren't all USGS-prefixed
    monitoring_location_id=["USGS-272838082142201", "MBMG-702934"]
)
df, md = ngwmn.get_providers(state="WI")

The multi-value chunker (recently fixed in #322) is generic and applies to NGWMN unchanged — verified that a forced-small-budget multi-site NGWMN query chunks and unions correctly.
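For readers unfamiliar with the idea, a URL-byte chunker can be sketched as a greedy packer over the percent-encoded values. This is not the code in `ogc/chunking.py`: `chunk_by_url_bytes` and its `budget` parameter are hypothetical, and the sketch assumes values are joined with an encoded comma.

```python
from urllib.parse import quote


def chunk_by_url_bytes(values: list[str], budget: int, sep: str = ",") -> list[list[str]]:
    """Greedily pack values so each joined, percent-encoded chunk fits the budget."""
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    sep_len = len(quote(sep, safe=""))  # "," encodes to "%2C", i.e. 3 bytes
    for value in values:
        cost = len(quote(value, safe=""))
        if cost > budget:
            raise ValueError(f"{value!r} alone exceeds the {budget}-byte budget")
        extra = cost + (sep_len if current else 0)
        if current and size + extra > budget:
            # Close the full chunk and start a new one with this value.
            chunks.append(current)
            current, size, extra = [], 0, cost
        current.append(value)
        size += extra
    if current:
        chunks.append(current)
    return chunks
```

With a deliberately small budget, `chunk_by_url_bytes(["USGS-272838082142201", "MBMG-702934"], budget=30)` splits the two ids into separate requests, and concatenating the chunks returns the original list. That is the "chunks and unions correctly" property in miniature.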

Unified `state` parameter

A single, canonical state parameter spans the modern getters and accepts a full name ("Wisconsin"), a two-letter postal code ("WI"), or a two-digit ANSI/FIPS code ("55"). It is normalized once by codes.states.to_state() (50 states + DC; fails fast on a typo) and resolved at the getter layer into whatever native queryable each endpoint wants — state_name for the OGC getters, US:XX state_code for stats, the per-collection queryable for NGWMN. The native state_code / state_name parameters remain as an escape hatch (e.g. non-US FIPS codes); passing state together with either raises.
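The normalize-once, resolve-at-the-getter flow might look roughly like the sketch below. The names `to_state_sketch` and `resolve_state_name` are hypothetical, the lookup table holds three illustrative rows rather than 50 states + DC, and the return shapes are assumptions; the actual signature of `codes.states.to_state()` is not shown in this PR description.

```python
from typing import Optional

# Illustrative rows only; the real normalizer covers 50 states + DC.
_STATES = [
    ("Wisconsin", "WI", "55"),
    ("Florida", "FL", "12"),
    ("District of Columbia", "DC", "11"),
]

# One lookup keyed by every accepted spelling (case-insensitive for text).
_LOOKUP: dict[str, dict[str, str]] = {}
for _name, _postal, _fips in _STATES:
    _record = {"name": _name, "postal": _postal, "fips": _fips}
    for _key in (_name.casefold(), _postal.casefold(), _fips):
        _LOOKUP[_key] = _record


def to_state_sketch(value: str) -> dict[str, str]:
    """Normalize a name, postal code, or FIPS code; fail fast on anything else."""
    try:
        return _LOOKUP[value.strip().casefold()]
    except KeyError:
        raise ValueError(f"unrecognized state: {value!r}") from None


def resolve_state_name(
    state: Optional[str] = None, state_name: Optional[str] = None
) -> Optional[str]:
    """Getter-layer resolution for an endpoint whose native queryable is state_name."""
    if state is not None and state_name is not None:
        raise ValueError("pass either `state` or `state_name`, not both")
    if state is not None:
        return to_state_sketch(state)["name"]
    return state_name  # escape hatch: forwarded untouched
```

Under these assumptions, `"Wisconsin"`, `"WI"`, and `"55"` all resolve to the same record, a typo such as `"Wisconsn"` raises immediately, and the native parameter passes through unvalidated when `state` is not given.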

Engine fixes (NGWMN's API differs from the main one)

Key the empty-result short-circuit off features rather than the numberReturned that NGWMN omits (otherwise pages with data were silently dropped).
Tolerate observation features with no geometry key (GeoDataFrame.from_features can't index a missing key).
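Both fixes can be pictured with plain dictionaries. The helper names below (`page_has_data`, `ensure_geometry_key`) are hypothetical, and filling in `"geometry": None` is only one plausible way to tolerate a missing key; the PR description does not say how the engine actually handles it.

```python
from typing import Any


def page_has_data(page: dict[str, Any]) -> bool:
    """Decide emptiness from `features`, ignoring `numberReturned` entirely."""
    return bool(page.get("features"))


def ensure_geometry_key(feature: dict[str, Any]) -> dict[str, Any]:
    """Return the feature as-is if it has a geometry key, else a copy with None."""
    if "geometry" in feature:
        return feature
    return {**feature, "geometry": None}


# A page shaped like the case described above: data present, but no
# `numberReturned` field and no `geometry` key on the feature.
page = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature", "properties": {"value": 1.5}}],
}
features = [ensure_geometry_key(f) for f in page["features"]]
```

A check keyed off `page.get("numberReturned", 0)` would treat this page as empty and drop it, whereas `page_has_data(page)` keeps it. After normalization, every feature can be indexed by `"geometry"` without a `KeyError`.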

PEP naming

The engine snake_cases any non-snake column in finalize, so the package always returns PEP-8 column names regardless of the upstream API — a no-op today (both APIs are already snake_case) but enforced going forward.
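A column normalizer of this kind is commonly written with two regex passes. The sketch below is an assumption about the general technique, not the engine's `_to_snake_case`; `to_snake_case` is a hypothetical name.

```python
import re

_WORD_START = re.compile(r"(.)([A-Z][a-z]+)")   # "...XYoo" -> "...X_Yoo"
_LOWER_UPPER = re.compile(r"([a-z0-9])([A-Z])")  # "aB" -> "a_B"
_SEPARATORS = re.compile(r"[\s\-.]+")            # spaces, hyphens, dots


def to_snake_case(name: str) -> str:
    """Convert camelCase, PascalCase, or separator-delimited names to snake_case."""
    s = _SEPARATORS.sub("_", name.strip())
    s = _WORD_START.sub(r"\1_\2", s)
    s = _LOWER_UPPER.sub(r"\1_\2", s)
    # Collapse any doubled underscores introduced by the passes above.
    return re.sub(r"_+", "_", s).lower()
```

Names that are already snake_case pass through unchanged, which matches the "no-op today" observation. With pandas, `df.rename(columns=to_snake_case)` would apply the function to every column label.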

Tests & checks

Live NGWMN tests for all five getters (tests/ngwmn_test.py); unit tests for to_state, the _with_state / getter-layer state resolution, and _to_snake_case; mock.patch sites repointed to ogc.engine; a module fixture activates WATERDATA_DIALECT for the direct _construct_api_requests unit tests. The unit suite passes and mypy --strict + ruff are clean. A pre-commit mypy hook (pinned to the same mypy<2 major as CI, with httpx/anyio in the hook env) now mirrors the CI type-check locally.

Note

CI will show 3 pre-existing failures (test_get_daily_properties/_id, test_get_continuous) — the live-API drift fixed by #323, not introduced here (branch is off main). They go green once #323 merges.

🤖 Generated with Claude Code

Port the NGWMN functions from the R `dataRetrieval` package (DOI-USGS/dataRetrieval#904) and refactor the Water Data OGC machinery into a generic, API-agnostic engine, so NGWMN and Water Data are sibling layers on top of it -- NGWMN does not depend on Water Data.

dataretrieval/ogc/       generic OGC engine (no service-specific config)
  engine.py              request build, pagination, parse/finalize, get_ogc_data
  chunking.py            URL-byte multi-value chunker
  filters.py             CQL `filter` splitting
  progress.py            self-updating status line

The engine is parameterized by an `OgcDialect` and a base-url context variable rather than branching on service names: Water Data POSTs CQL2 for `monitoring-locations` and renders `daily` time args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits.

dataretrieval/ngwmn.py   sibling getters that import only dataretrieval.ogc:
                         get_sites, get_water_level, get_lithology,
                         get_well_construction, get_providers

dataretrieval/waterdata/ thin Water Data layer on the engine; the Statistics API lives in its own waterdata/stats.py module.

Unified `state` parameter across the modern getters, accepting a full name, a two-letter postal code, or a two-digit ANSI/FIPS code; normalized by codes.states.to_state (50 states + DC, fails fast on a typo) and resolved at the getter layer. The native state_code/state_name parameters remain as an escape hatch.

Also: export ChunkInterrupted at the package top level; key the empty-result short-circuit off `features` (NGWMN omits `numberReturned`) and tolerate geometry-less features; always return PEP-8 snake_case columns; and add a pre-commit mypy hook mirroring the CI type-check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thodson-usgs · 2026-06-15T14:35:57Z

@ehinman,
Besides adding the NGWMN module, this PR creates a shared OGC engine used by both NGWMN and Water Data. In the architecture outlined above, ngwmn and waterdata become sibling modules, whereas ratings and stats still live within waterdata. The logical distinction is a little murky, but I believe that's better than putting everything within waterdata. What organization makes the most sense to you?

Every getter that routes through the chunked fan-out can raise a resumable `ChunkInterrupted` when a transient failure outlasts the built-in retries, but nothing said so at the point of use -- a `help(get_daily)` reader would only discover the catch-and-resume affordance via the separate userguide.

Add a one-line `Raises: ChunkInterrupted` entry (pointing to :doc:`/userguide/errors` for the full resume example -- the single source of truth) to the 12 chunker-backed getters. `get_cql` (own `_run_sync` path), the stats getters, and the Samples getters don't go through the chunker, so they're intentionally left out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

`ogc/engine.py` had absorbed ~1940 LOC spanning six unrelated concerns -- constants/config, HTTP error handling, arg validation, request building, response shaping, and the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing -- this is DOI-USGS#318's original intent, retargeted onto the post-DOI-USGS#324 layout.)

Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change):

  _constants.py   URLs, OgcDialect, regexes, param sets, context vars, gpd probe
  _http.py        headers, error-body, _raise_for_non_200, retry-after
  _validate.py    arg normalization/validation, id switching
  _requests.py    request building (GET/CQL2, date formatting, pagination URLs)
  _responses.py   wire response -> DataFrame (parse + shape + finalize)
  engine.py       async pagination driver + get_ogc_data, re-exporting the above

engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The geopandas parse chain lives in _responses.py (its boundary is the domain, not a test patch target); the single gpd monkeypatch seam was relocated to `_responses` -- the only test change.

Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

thodson-usgs force-pushed the feat/ngwmn-ogc branch from 35500b3 to 9e4c0ca Compare June 12, 2026 22:12

thodson-usgs changed the title ~~feat(waterdata): add NGWMN OGC getters (sites, water level, lithology, construction, providers)~~ feat: add NGWMN getters as an ogc sibling; extract a shared OGC engine Jun 12, 2026

thodson-usgs force-pushed the feat/ngwmn-ogc branch 2 times, most recently from 1c2ab7a to 3ba001e Compare June 14, 2026 17:51

thodson-usgs force-pushed the feat/ngwmn-ogc branch from 3ba001e to 803a15d Compare June 15, 2026 14:35

thodson-usgs mentioned this pull request Jun 15, 2026

test(waterdata): rerun flaky transient 5xx/429 from the chunked fan-out #325

Merged

thodson-usgs wants to merge 2 commits into DOI-USGS:main from thodson-usgs:feat/ngwmn-ogc
