Skip to content

refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318

Draft
thodson-usgs wants to merge 3 commits into
DOI-USGS:mainfrom
thodson-usgs:refactor/split-waterdata-utils
Draft

refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318
thodson-usgs wants to merge 3 commits into
DOI-USGS:mainfrom
thodson-usgs:refactor/split-waterdata-utils

Conversation

@thodson-usgs

@thodson-usgs thodson-usgs commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Rewritten and rebased onto #324. The original PR split waterdata/utils.py; #324 has since made that file thin (it extracted the generic engine into ogc/). This PR retargets #318's original intent — decompose the OGC god-module by cohesive domain — onto the post-#324 layout. Stacked on #324: review/merge after it lands. Until #324 merges, the diff below includes #324's commits; it collapses to just the engine decomposition once #324 is in main.

Why

After #324, the god-module is dataretrieval/ogc/engine.py (~1940 LOC) — it absorbed several unrelated concerns (constants/config, per-call request context, HTTP error handling, arg validation, request building, response shaping) on top of the async pagination driver. Changing how requests are built means navigating the same file as datetime parsing, type coercion, and the async driver.

What

Split it into cohesive private sibling modules under dataretrieval/ogc/, moving every definition AST-identically (no signature/logic change):

module holds
_constants.py URLs, OgcDialect, regexes, param sets, the geopandas probe (pure-data leaf)
_context.py per-call request context — the base-url / dialect / row-cap context vars + their managers
_http.py headers, _error_body, _raise_for_non_200, retry-after
_validate.py pure arg normalization/validation + id switching
_requests.py request building (GET / CQL2, date formatting, pagination URLs, the queryables probe)
_responses.py the whole wire-response → DataFrame domain (parse + shape + finalize)
engine.py async pagination driver + get_ogc_data; re-exports the above

engine.py (1937 → 570 LOC) stays the public façade: it re-exports every name (explicit __all__) so every existing from dataretrieval.ogc.engine import … site in waterdata, ngwmn, and the tests keeps working — no import sites touched. The dependency DAG is acyclic and leaf-first: _constants ← _context ← {_http, _validate, _requests, _responses} ← engine. _validate is a genuine pure leaf — the one live HTTP probe (_check_ogc_requests) lives in _requests with the other request builders.

Behavior-preserving

  • All 42 top-level symbols moved with byte-identical AST bodies (verified mechanically vs the pre-split file) — none lost, changed, or duplicated.
  • One test change: the single gpd monkeypatch seam was relocated from engine to _responses (where the geopandas parse functions now live), so the response domain coheres by domain rather than by a patch target.
  • 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Scope

ogc/chunking.py (1776 LOC) was evaluated and deliberately left intact — unlike engine.py's unrelated concerns, it is one cohesive algorithm orbiting ChunkedCall (the exception taxonomy, retry policy, and executor share state and a cross-cutting test seam), so splitting would fragment it and add a mandatory façade for little gain.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the refactor/split-waterdata-utils branch from df8943e to cca750e Compare June 2, 2026 00:50
thodson-usgs and others added 2 commits June 15, 2026 09:34
Port the NGWMN functions from the R `dataRetrieval` package
(DOI-USGS/dataRetrieval#904) and refactor the Water Data OGC machinery into a
generic, API-agnostic engine, so NGWMN and Water Data are sibling layers on
top of it -- NGWMN does not depend on Water Data.

dataretrieval/ogc/  generic OGC engine (no service-specific config)
  engine.py    request build, pagination, parse/finalize, get_ogc_data
  chunking.py  URL-byte multi-value chunker
  filters.py   CQL `filter` splitting
  progress.py  self-updating status line
The engine is parameterized by an `OgcDialect` and a base-url context
variable rather than branching on service names: Water Data POSTs CQL2 for
`monitoring-locations` and renders `daily` time args date-only; NGWMN needs
neither. Adding a sibling API is a new dialect + base URL, not engine edits.

dataretrieval/ngwmn.py  sibling getters that import only dataretrieval.ogc:
  get_sites, get_water_level, get_lithology, get_well_construction,
  get_providers

dataretrieval/waterdata/  thin Water Data layer on the engine; the Statistics
API lives in its own waterdata/stats.py module.

Unified `state` parameter across the modern getters, accepting a full name, a
two-letter postal code, or a two-digit ANSI/FIPS code; normalized by
codes.states.to_state (50 states + DC, fails fast on a typo) and resolved at
the getter layer. The native state_code/state_name parameters remain as an
escape hatch.

Also: export ChunkInterrupted at the package top level; key the empty-result
short-circuit off `features` (NGWMN omits `numberReturned`) and tolerate
geometry-less features; always return PEP-8 snake_case columns; and add a
pre-commit mypy hook mirroring the CI type-check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Every getter that routes through the chunked fan-out can raise a resumable
`ChunkInterrupted` when a transient failure outlasts the built-in retries, but
nothing said so at the point of use -- a `help(get_daily)` reader would only
discover the catch-and-resume affordance via the separate userguide.

Add a one-line `Raises: ChunkInterrupted` entry (pointing to
:doc:`/userguide/errors` for the full resume example -- the single source of
truth) to the 12 chunker-backed getters. `get_cql` (own `_run_sync` path),
the stats getters, and the Samples getters don't go through the chunker, so
they're intentionally left out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request Jun 15, 2026
`ogc/engine.py` had absorbed ~1940 LOC spanning six unrelated concerns --
constants/config, HTTP error handling, arg validation, request building,
response shaping, and the async pagination driver. (DOI-USGS#324 extracted this generic
engine out of waterdata/utils.py, so the god-module relocated here rather than
disappearing -- this is DOI-USGS#318's original intent, retargeted onto the post-DOI-USGS#324
layout.)

Split it into cohesive private siblings under `ogc/`, moving every definition
AST-identically (no signature/logic change):

  _constants.py  URLs, OgcDialect, regexes, param sets, context vars, gpd probe
  _http.py       headers, error-body, _raise_for_non_200, retry-after
  _validate.py   arg normalization/validation, id switching
  _requests.py   request building (GET/CQL2, date formatting, pagination URLs)
  _responses.py  wire response -> DataFrame (parse + shape + finalize)
  engine.py      async pagination driver + get_ogc_data, re-exporting the above

engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so
existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn,
and the tests keep working unchanged. The geopandas parse chain lives in
_responses.py (its boundary is the domain, not a test patch target); the single
gpd monkeypatch seam was relocated to `_responses` -- the only test change.

Behavior-preserving: all 42 top-level symbols moved with byte-identical AST
bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the refactor/split-waterdata-utils branch from cca750e to 651bb56 Compare June 15, 2026 18:30
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request Jun 15, 2026
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns --
constants/config, per-call request context, HTTP error handling, arg
validation, request building, and response shaping -- on top of the async
pagination driver. (DOI-USGS#324 extracted this generic engine out of
waterdata/utils.py, so the god-module relocated here rather than disappearing;
this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.)

Split it into cohesive private siblings under `ogc/`, moving every definition
AST-identically (no signature/logic change):

  _constants.py  URLs, OgcDialect, regexes, param sets, the geopandas probe
  _context.py    per-call request context (base-url / dialect / row-cap vars)
  _http.py       headers, error-body, _raise_for_non_200, retry-after
  _validate.py   pure arg normalization/validation + id switching
  _requests.py   request building (GET/CQL2, dates, pagination URLs, queryables)
  _responses.py  wire response -> DataFrame (parse + shape + finalize)
  engine.py      async pagination driver + get_ogc_data, re-exporting the above

engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so
existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn,
and the tests keep working unchanged. The DAG is acyclic and leaf-first
(_constants <- _context <- _http/_validate/_requests/_responses <- engine);
_validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in
_requests with the other request builders. The geopandas parse chain lives in
_responses; its single gpd monkeypatch seam was relocated there -- the only test
change.

Behavior-preserving: all 42 top-level symbols moved with byte-identical AST
bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the refactor/split-waterdata-utils branch from 651bb56 to 4d38946 Compare June 15, 2026 22:26
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns --
constants/config, per-call request context, HTTP error handling, arg
validation, request building, and response shaping -- on top of the async
pagination driver. (DOI-USGS#324 extracted this generic engine out of
waterdata/utils.py, so the god-module relocated here rather than disappearing;
this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.)

Split it into cohesive private siblings under `ogc/`, moving every definition
AST-identically (no signature/logic change):

  _constants.py  URLs, OgcDialect, regexes, param sets, the geopandas probe
  _context.py    per-call request context (base-url / dialect / row-cap vars)
  _http.py       headers, error-body, _raise_for_non_200, retry-after
  _validate.py   pure arg normalization/validation + id switching
  _requests.py   request building (GET/CQL2, dates, pagination URLs, queryables)
  _responses.py  wire response -> DataFrame (parse + shape + finalize)
  engine.py      async pagination driver + get_ogc_data, re-exporting the above

engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so
existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn,
and the tests keep working unchanged. The DAG is acyclic and leaf-first
(_constants <- _context <- _http/_validate/_requests/_responses <- engine);
_validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in
_requests with the other request builders. The geopandas parse chain lives in
_responses; its single gpd monkeypatch seam was relocated there -- the only test
change.

Behavior-preserving: all 42 top-level symbols moved with byte-identical AST
bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the refactor/split-waterdata-utils branch from 4d38946 to 3240953 Compare June 15, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant