refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318
Draft
thodson-usgs wants to merge 3 commits into
Draft
refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules#318thodson-usgs wants to merge 3 commits into
thodson-usgs wants to merge 3 commits into
Conversation
df8943e to
cca750e
Compare
Port the NGWMN functions from the R `dataRetrieval` package (DOI-USGS/dataRetrieval#904) and refactor the Water Data OGC machinery into a generic, API-agnostic engine, so NGWMN and Water Data are sibling layers on top of it -- NGWMN does not depend on Water Data. dataretrieval/ogc/ generic OGC engine (no service-specific config) engine.py request build, pagination, parse/finalize, get_ogc_data chunking.py URL-byte multi-value chunker filters.py CQL `filter` splitting progress.py self-updating status line The engine is parameterized by an `OgcDialect` and a base-url context variable rather than branching on service names: Water Data POSTs CQL2 for `monitoring-locations` and renders `daily` time args date-only; NGWMN needs neither. Adding a sibling API is a new dialect + base URL, not engine edits. dataretrieval/ngwmn.py sibling getters that import only dataretrieval.ogc: get_sites, get_water_level, get_lithology, get_well_construction, get_providers dataretrieval/waterdata/ thin Water Data layer on the engine; the Statistics API lives in its own waterdata/stats.py module. Unified `state` parameter across the modern getters, accepting a full name, a two-letter postal code, or a two-digit ANSI/FIPS code; normalized by codes.states.to_state (50 states + DC, fails fast on a typo) and resolved at the getter layer. The native state_code/state_name parameters remain as an escape hatch. Also: export ChunkInterrupted at the package top level; key the empty-result short-circuit off `features` (NGWMN omits `numberReturned`) and tolerate geometry-less features; always return PEP-8 snake_case columns; and add a pre-commit mypy hook mirroring the CI type-check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Every getter that routes through the chunked fan-out can raise a resumable `ChunkInterrupted` when a transient failure outlasts the built-in retries, but nothing said so at the point of use -- a `help(get_daily)` reader would only discover the catch-and-resume affordance via the separate userguide. Add a one-line `Raises: ChunkInterrupted` entry (pointing to :doc:`/userguide/errors` for the full resume example -- the single source of truth) to the 12 chunker-backed getters. `get_cql` (own `_run_sync` path), the stats getters, and the Samples getters don't go through the chunker, so they're intentionally left out. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
Jun 15, 2026
`ogc/engine.py` had absorbed ~1940 LOC spanning six unrelated concerns -- constants/config, HTTP error handling, arg validation, request building, response shaping, and the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing -- this is DOI-USGS#318's original intent, retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, context vars, gpd probe _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py arg normalization/validation, id switching _requests.py request building (GET/CQL2, date formatting, pagination URLs) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The geopandas parse chain lives in _responses.py (its boundary is the domain, not a test patch target); the single gpd monkeypatch seam was relocated to `_responses` -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cca750e to
651bb56
Compare
thodson-usgs
added a commit
to thodson-usgs/dataretrieval-python
that referenced
this pull request
Jun 15, 2026
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns -- constants/config, per-call request context, HTTP error handling, arg validation, request building, and response shaping -- on top of the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing; this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, the geopandas probe _context.py per-call request context (base-url / dialect / row-cap vars) _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py pure arg normalization/validation + id switching _requests.py request building (GET/CQL2, dates, pagination URLs, queryables) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The DAG is acyclic and leaf-first (_constants <- _context <- _http/_validate/_requests/_responses <- engine); _validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in _requests with the other request builders. The geopandas parse chain lives in _responses; its single gpd monkeypatch seam was relocated there -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
651bb56 to
4d38946
Compare
`ogc/engine.py` had absorbed ~1940 LOC spanning unrelated concerns -- constants/config, per-call request context, HTTP error handling, arg validation, request building, and response shaping -- on top of the async pagination driver. (DOI-USGS#324 extracted this generic engine out of waterdata/utils.py, so the god-module relocated here rather than disappearing; this is DOI-USGS#318's original intent retargeted onto the post-DOI-USGS#324 layout.) Split it into cohesive private siblings under `ogc/`, moving every definition AST-identically (no signature/logic change): _constants.py URLs, OgcDialect, regexes, param sets, the geopandas probe _context.py per-call request context (base-url / dialect / row-cap vars) _http.py headers, error-body, _raise_for_non_200, retry-after _validate.py pure arg normalization/validation + id switching _requests.py request building (GET/CQL2, dates, pagination URLs, queryables) _responses.py wire response -> DataFrame (parse + shape + finalize) engine.py async pagination driver + get_ogc_data, re-exporting the above engine.py (1937 -> 570 LOC) stays the public facade: it re-exports every name so existing `from dataretrieval.ogc.engine import ...` sites in waterdata, ngwmn, and the tests keep working unchanged. The DAG is acyclic and leaf-first (_constants <- _context <- _http/_validate/_requests/_responses <- engine); _validate is a pure leaf -- the lone HTTP probe (_check_ogc_requests) lives in _requests with the other request builders. The geopandas parse chain lives in _responses; its single gpd monkeypatch seam was relocated there -- the only test change. Behavior-preserving: all 42 top-level symbols moved with byte-identical AST bodies; 296 offline tests pass; ruff + mypy --strict clean; no import cycles. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4d38946 to
3240953
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
After #324, the god-module is
dataretrieval/ogc/engine.py(~1940 LOC) — it absorbed several unrelated concerns (constants/config, per-call request context, HTTP error handling, arg validation, request building, response shaping) on top of the async pagination driver. Changing how requests are built means navigating the same file as datetime parsing, type coercion, and the async driver.What
Split it into cohesive private sibling modules under
dataretrieval/ogc/, moving every definition AST-identically (no signature/logic change):_constants.pyOgcDialect, regexes, param sets, the geopandas probe (pure-data leaf)_context.py_http.py_error_body,_raise_for_non_200, retry-after_validate.py_requests.py_responses.pyengine.pyget_ogc_data; re-exports the aboveengine.py(1937 → 570 LOC) stays the public façade: it re-exports every name (explicit__all__) so every existingfrom dataretrieval.ogc.engine import …site inwaterdata,ngwmn, and the tests keeps working — no import sites touched. The dependency DAG is acyclic and leaf-first:_constants ← _context ← {_http, _validate, _requests, _responses} ← engine._validateis a genuine pure leaf — the one live HTTP probe (_check_ogc_requests) lives in_requestswith the other request builders.Behavior-preserving
gpdmonkeypatch seam was relocated fromengineto_responses(where the geopandas parse functions now live), so the response domain coheres by domain rather than by a patch target.ruff+mypy --strictclean; no import cycles.Scope
ogc/chunking.py(1776 LOC) was evaluated and deliberately left intact — unlikeengine.py's unrelated concerns, it is one cohesive algorithm orbitingChunkedCall(the exception taxonomy, retry policy, and executor share state and a cross-cutting test seam), so splitting would fragment it and add a mandatory façade for little gain.🤖 Generated with Claude Code