
feat: add NGWMN getters as an ogc sibling; extract a shared OGC engine #324

Draft
thodson-usgs wants to merge 2 commits into DOI-USGS:main from thodson-usgs:feat/ngwmn-ogc

@thodson-usgs thodson-usgs commented Jun 12, 2026

Collaborator

Ports the NGWMN functions from the R dataRetrieval PR (DOI-USGS/dataRetrieval#904) and, per review, refactors the Water Data OGC machinery into a shared engine so NGWMN and Water Data are sibling layers on top of it — NGWMN does not depend on Water Data.

Architecture

dataretrieval/
  ogc/                   generic, API-agnostic OGC engine (no service-specific config)
    engine.py            request build · paginate · parse · finalize · get_ogc_data
    chunking.py          URL-byte multi-value chunker   (moved from waterdata/)
    filters.py           CQL `filter` splitting         (moved from waterdata/)
    progress.py          self-updating status line      (moved from waterdata/_progress.py)
  waterdata/             Water Data layer on the engine
    api.py               typed OGC getters: monitoring-locations, daily, continuous, peaks,
                         time-series/combined metadata, field/channel measurements, samples, …
    stats.py             Statistics API client (get_stats_por / get_stats_date_range)
    ratings.py           rating-curve retrieval via the Water Data STAC catalog
    nearest.py           get_nearest_continuous convenience on top of the getters
    utils.py             service→id map · WATERDATA_DIALECT · get_ogc_data wrapper · engine re-exports
    types.py             service/profile Literals + lookup tables
  ngwmn.py               sibling getters (imports only dataretrieval.ogc):
                         get_sites / get_water_level / get_lithology /
                         get_well_construction / get_providers
  codes/states.py        to_state(): shared name/postal/FIPS state normalizer used by both layers

Layering / import rules (the organization a reviewer can check):

  • ogc/* is fully API-agnostic: it imports neither dataretrieval.waterdata nor dataretrieval.ngwmn.
  • waterdata/* and ngwmn.py are siblings on the engine; ngwmn.py imports only dataretrieval.ogc (+ codes.states), never dataretrieval.waterdata.
  • api.py / stats.py / ratings.py / nearest.py are thin getter layers; the shared Water-Data state (service→id map, dialect, the get_ogc_data wrapper) lives once in utils.py.
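One way a reviewer might check these import rules mechanically is to walk a module's AST and flag imports that fall under a banned package. The following is a minimal sketch, not something the PR ships: `forbidden_imports` is a hypothetical helper, the source strings are illustrative rather than the real module contents, and relative imports are ignored.

```python
import ast


def forbidden_imports(source: str, banned_prefixes: tuple[str, ...]) -> list[str]:
    """Return absolute imports in `source` that fall under a banned package."""

    def is_banned(name: str) -> bool:
        return any(name == p or name.startswith(p + ".") for p in banned_prefixes)

    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            candidates = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Cover both `from pkg.banned import x` and `from pkg import banned`.
            candidates = [node.module] + [
                f"{node.module}.{alias.name}" for alias in node.names
            ]
        else:
            continue
        # Report at most one hit per import statement.
        hit = next((name for name in candidates if is_banned(name)), None)
        if hit is not None:
            found.append(hit)
    return found


# Illustrative source text only (not the real ngwmn.py).
ngwmn_like = (
    "from dataretrieval.ogc import engine\n"
    "from dataretrieval.codes.states import to_state\n"
)
violating = "from dataretrieval.waterdata.utils import get_ogc_data\n"

print(forbidden_imports(ngwmn_like, ("dataretrieval.waterdata",)))
print(forbidden_imports(violating, ("dataretrieval.waterdata",)))
```

Running the same helper over the files under ogc/ with both `dataretrieval.waterdata` and `dataretrieval.ngwmn` banned would cover the first rule as well.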

The engine is parameterized, not branched: get_ogc_data(args, service, output_id, *, base_url, extra_id_cols, dialect). An OgcDialect(cql2_services, date_only_services, …) (threaded via a context variable, like the base-url context) carries per-API quirks — Water Data POSTs CQL2 for monitoring-locations and renders daily time args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits.
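To illustrate "parameterized, not branched", here is a minimal sketch of a dialect carried by a context variable. The `OgcDialect` field names come from the description above; everything else (`_DIALECT`, `use_dialect`, `plan_request`, the default values) is assumed for illustration and is not the engine's actual code.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass


@dataclass(frozen=True)
class OgcDialect:
    # Field names follow the PR text; defaults and semantics are assumptions.
    cql2_services: frozenset[str] = frozenset()
    date_only_services: frozenset[str] = frozenset()


# Hypothetical context variable; the default dialect has no quirks.
_DIALECT: ContextVar[OgcDialect] = ContextVar("ogc_dialect", default=OgcDialect())


@contextmanager
def use_dialect(dialect: OgcDialect):
    """Activate `dialect` for the duration of the block (hypothetical helper)."""
    token = _DIALECT.set(dialect)
    try:
        yield
    finally:
        _DIALECT.reset(token)


def plan_request(service: str) -> dict[str, str]:
    """Pick method and time rendering from the active dialect, not from hard-coded names."""
    dialect = _DIALECT.get()
    return {
        "method": "POST" if service in dialect.cql2_services else "GET",
        "time_format": "date" if service in dialect.date_only_services else "datetime",
    }


WATERDATA = OgcDialect(
    cql2_services=frozenset({"monitoring-locations"}),
    date_only_services=frozenset({"daily"}),
)

with use_dialect(WATERDATA):
    print(plan_request("monitoring-locations"))  # POST under the Water Data dialect
print(plan_request("monitoring-locations"))  # GET under the quirk-free default
```

In this shape the engine function never names a service itself, so a new sibling API only needs to supply its own dialect value.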

from dataretrieval import ngwmn

df, md = ngwmn.get_sites(state="Wisconsin")      # name, postal ("WI"), or FIPS ("55")
df, md = ngwmn.get_water_level(
    monitoring_location_id="USGS-272838082142201",
    datetime=["2022-01-01", "2024-01-01"],
)
df, md = ngwmn.get_water_level(                   # NGWMN ids aren't all USGS-prefixed
    monitoring_location_id=["USGS-272838082142201", "MBMG-702934"]
)
df, md = ngwmn.get_providers(state="WI")

The multi-value chunker (recently fixed in #322) is generic and applies to NGWMN unchanged — verified that a forced-small-budget multi-site NGWMN query chunks and unions correctly.
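For readers unfamiliar with the idea, a URL-byte chunker can be sketched as a greedy grouping of values under an encoded-length budget. This is an illustration of the technique only, not the code in ogc/chunking.py; `chunk_by_url_bytes` and its budget semantics are assumptions.

```python
from urllib.parse import quote


def chunk_by_url_bytes(values: list[str], budget: int) -> list[list[str]]:
    """Greedily group values so each comma-joined, percent-encoded chunk fits `budget` bytes.

    A single value longer than the budget still gets its own chunk, since it
    cannot be split further.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    for value in values:
        candidate = current + [value]
        # Percent-encoded output is ASCII, so len() is the byte count.
        encoded = quote(",".join(candidate), safe="")
        if current and len(encoded) > budget:
            chunks.append(current)
            current = [value]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


ids = ["USGS-272838082142201", "MBMG-702934", "USGS-01"]
for chunk in chunk_by_url_bytes(ids, budget=30):
    print(chunk)
```

Each chunk would then be requested separately and the resulting frames concatenated, which is the "chunks and unions" behavior the verification refers to.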

Unified state parameter

A single, canonical state parameter spans the modern getters and accepts a full name ("Wisconsin"), a two-letter postal code ("WI"), or a two-digit ANSI/FIPS code ("55"). It is normalized once by codes.states.to_state() (50 states + DC; fails fast on a typo) and resolved at the getter layer into whatever native queryable each endpoint wants: state_name for the OGC getters, US:XX state_code for stats, the per-collection queryable for NGWMN. The native state_code / state_name parameters remain as an escape hatch (e.g. non-US FIPS codes); passing state together with either of them raises an error.
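The normalize-once, resolve-at-the-getter flow might look roughly like the sketch below. It is not the code in codes/states.py: the three-row table, the case-insensitive matching, the `ValueError` type, and the names `to_state_sketch` and `resolve_state_name` are all assumptions made for illustration.

```python
# Tiny illustrative table of (name, postal, FIPS); the real normalizer covers 50 states + DC.
_STATES = [
    ("Wisconsin", "WI", "55"),
    ("Florida", "FL", "12"),
    ("District of Columbia", "DC", "11"),
]

# One lookup keyed by every accepted spelling of a state.
_LOOKUP: dict[str, dict[str, str]] = {}
for _name, _postal, _fips in _STATES:
    _record = {"name": _name, "postal": _postal, "fips": _fips}
    for _key in (_name.casefold(), _postal.casefold(), _fips):
        _LOOKUP[_key] = _record


def to_state_sketch(value: str) -> dict[str, str]:
    """Normalize a name, postal code, or FIPS code; fail fast on anything else."""
    try:
        return _LOOKUP[value.strip().casefold()]
    except KeyError:
        raise ValueError(f"Unrecognized state: {value!r}") from None


def resolve_state_name(state=None, state_name=None):
    """Getter-layer resolution for an endpoint whose native queryable is state_name."""
    if state is not None and state_name is not None:
        # The PR says this combination raises; the exception type is an assumption.
        raise ValueError("pass `state` or `state_name`, not both")
    if state is not None:
        return to_state_sketch(state)["name"]
    return state_name  # escape hatch: passed through untouched


print(resolve_state_name(state="WI"))
print(resolve_state_name(state="55"))
```

Keeping the table lookup separate from the per-endpoint resolution is what lets each layer map the same normalized record onto a different native parameter.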

Engine fixes (NGWMN's API differs from the main one)

  • Key the empty-result short-circuit off features rather than the numberReturned that NGWMN omits (otherwise pages with data were silently dropped).
  • Tolerate observation features with no geometry key (GeoDataFrame.from_features can't index a missing key).
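Both fixes can be pictured with plain dictionaries. The sketch below is illustrative rather than the engine's code: `page_is_empty` and `with_geometry_key` are hypothetical names, and the sample page is made up.

```python
def page_is_empty(page: dict) -> bool:
    """Treat a page as empty only when it carries no features; ignore numberReturned."""
    return not page.get("features")


def with_geometry_key(features: list[dict]) -> list[dict]:
    """Return copies of the features with a `geometry` key present (None if it was omitted)."""
    return [{**feature, "geometry": feature.get("geometry")} for feature in features]


# A made-up NGWMN-style page: no numberReturned, and a feature with no geometry key.
page = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature", "properties": {"value": 1.5}}],
}

print(page_is_empty(page))
print(with_geometry_key(page["features"]))
```

After normalization every feature can be indexed by `geometry`, so a later `GeoDataFrame.from_features` call should no longer hit a missing key.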

PEP 8 naming

The engine snake_cases any non-snake column in finalize, so the package always returns PEP-8 column names regardless of the upstream API — a no-op today (both APIs are already snake_case) but enforced going forward.
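A column-name normalizer of this kind is commonly written with two regex passes. The sketch below shows that general technique; it is not the package's `_to_snake_case`, and the sample column names are made up.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert camelCase, PascalCase, or separator-delimited names to snake_case."""
    s = re.sub(r"[\s\-.]+", "_", name.strip())
    # Split an acronym from a following capitalized word: "HTTPResponse" -> "HTTP_Response".
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    # Split a lowercase letter or digit from a following capital: "numberReturned" -> "number_Returned".
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return s.lower()


columns = ["monitoring_location_id", "numberReturned", "Site Name"]
print([to_snake_case(c) for c in columns])
```

Names that are already snake_case pass through unchanged, which matches the "no-op today" behavior described above. With pandas, the same function can be applied via `df.rename(columns=to_snake_case)`.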

Tests & checks

Live NGWMN tests for all five getters (tests/ngwmn_test.py); unit tests for to_state, the _with_state / getter-layer state resolution, and _to_snake_case; mock.patch sites repointed to ogc.engine; a module fixture activates WATERDATA_DIALECT for the direct _construct_api_requests unit tests. The unit suite passes and mypy --strict + ruff are clean. A pre-commit mypy hook (pinned to the same mypy<2 major as CI, with httpx/anyio in the hook env) now mirrors the CI type-check locally.

Note

CI will show 3 pre-existing failures (test_get_daily_properties/_id, test_get_continuous) — the live-API drift fixed by #323, not introduced here (branch is off main). They go green once #323 merges.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs changed the title from "feat(waterdata): add NGWMN OGC getters (sites, water level, lithology, construction, providers)" to "feat: add NGWMN getters as an ogc sibling; extract a shared OGC engine" on Jun 12, 2026
@thodson-usgs thodson-usgs force-pushed the feat/ngwmn-ogc branch 2 times, most recently from 1c2ab7a to 3ba001e on June 14, 2026 17:51
Port the NGWMN functions from the R `dataRetrieval` package
(DOI-USGS/dataRetrieval#904) and refactor the Water Data OGC machinery into a
generic, API-agnostic engine, so NGWMN and Water Data are sibling layers on
top of it -- NGWMN does not depend on Water Data.

dataretrieval/ogc/  generic OGC engine (no service-specific config)
  engine.py    request build, pagination, parse/finalize, get_ogc_data
  chunking.py  URL-byte multi-value chunker
  filters.py   CQL `filter` splitting
  progress.py  self-updating status line

The engine is parameterized by an `OgcDialect` and a base-url context
variable rather than branching on service names: Water Data POSTs CQL2 for
`monitoring-locations` and renders `daily` time args date-only; NGWMN needs
neither. Adding a sibling API is a new dialect + base URL, not engine edits.

dataretrieval/ngwmn.py  sibling getters that import only dataretrieval.ogc:
  get_sites, get_water_level, get_lithology, get_well_construction,
  get_providers

dataretrieval/waterdata/  thin Water Data layer on the engine; the Statistics
API lives in its own waterdata/stats.py module.

Unified `state` parameter across the modern getters, accepting a full name, a
two-letter postal code, or a two-digit ANSI/FIPS code; normalized by
codes.states.to_state (50 states + DC, fails fast on a typo) and resolved at
the getter layer. The native state_code/state_name parameters remain as an
escape hatch.

Also: export ChunkInterrupted at the package top level; key the empty-result
short-circuit off `features` (NGWMN omits `numberReturned`) and tolerate
geometry-less features; always return PEP-8 snake_case columns; and add a
pre-commit mypy hook mirroring the CI type-check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@thodson-usgs

Collaborator Author

@ehinman,
Besides adding the NGWMN module, this PR creates a shared OGC engine used by both NGWMN and Water Data. In the architecture outlined above, ngwmn and waterdata become sibling modules, whereas ratings and stats still live within waterdata. The logical distinction is a little murky, but I believe that's better than putting everything within waterdata. What organization makes the most sense to you?

Every getter that routes through the chunked fan-out can raise a resumable
`ChunkInterrupted` when a transient failure outlasts the built-in retries, but
nothing said so at the point of use -- a `help(get_daily)` reader would only
discover the catch-and-resume affordance via the separate userguide.

Add a one-line `Raises: ChunkInterrupted` entry (pointing to
:doc:`/userguide/errors` for the full resume example -- the single source of
truth) to the 12 chunker-backed getters. `get_cql` (own `_run_sync` path),
the stats getters, and the Samples getters don't go through the chunker, so
they're intentionally left out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request Jun 15, 2026
`ogc/engine.py` had absorbed ~1940 LOC spanning six unrelated concerns --
constants/config, HTTP error handling, arg validation, request building,
response shaping, and the async pagination driver. (DOI-USGS#324 extracted this generic
engine out of waterdata/utils.py, so the god-module relocated here rather than
disappearing -- this is DOI-USGS#318's original intent, retargeted onto the post-DOI-USGS#324
layout.)

Split it into cohesive private siblings under `ogc/`, moving every definition
AST-identically (no signature/logic change):

  _constants.py  URLs, OgcDialect, regexes, param sets, context vars, gpd probe
  _http.py       headers, error-body, _raise_for_non_200, retry-after
  _validate.py   arg normalization/validation, id switching
  _requests.py   request building (GET/CQL2, date formatting, pagination URLs)
  _responses.py  wire response -> DataFrame (parse + shape + finalize)
  engine.py      async pagination driver + get_ogc_data, re-exporting the above

engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so
existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn,
and the tests keep working unchanged. The geopandas parse chain lives in
_responses.py (its boundary is the domain, not a test patch target); the single
gpd monkeypatch seam was relocated to `_responses` -- the only test change.

Behavior-preserving: all 42 top-level symbols moved with byte-identical AST
bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>