test(waterdata): rerun flaky transient 5xx/429 from the chunked fan-out #325

Merged: thodson-usgs merged 1 commit into DOI-USGS:main from thodson-usgs:fix/flaky-rerun-chunker-5xx on Jun 15, 2026

@thodson-usgs

Collaborator

Problem

The live tests/waterdata_test.py suite reruns transient HTTP failures via pytest.mark.flaky(only_rerun=[...]), but the patterns only match the direct-path shapes RateLimited:/RuntimeError: 5xx:. Two transient sources are no longer covered:

  • Chunked fan-out: a transient 5xx/429 in a sub-request is wrapped as a ChunkInterrupted subclass — ServiceInterrupted (5xx) / QuotaExhausted (429) — from the #322 chunker work.
  • Current error taxonomy (#313/#319): a direct 5xx now raises ServiceUnavailable (a TransientError), not RuntimeError.

None of these exception names matches an only_rerun pattern, so a transient upstream 502 during a multi-value (chunked) get_monitoring_locations call fails CI outright instead of being retried. This surfaced on PR #324, where the test (ubuntu-latest, 3.13) job went red with ServiceInterrupted: ... Cause: ServiceUnavailable: 502: Bad Gateway while the other five matrix cells passed. The gap predates that PR and already exists on main; PR #324 simply happened to hit it.

Fix

Add one only_rerun pattern covering ServiceUnavailable / QuotaExhausted / ServiceInterrupted.

Verified by simulation that it matches those transient exceptions (including a ServiceInterrupted wrapping a 502) but not deterministic failures (HTTPError 404, AssertionError); RateLimited (429) stays covered by the existing pattern.
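
For readers who want to reproduce that kind of check, here is a minimal sketch of the simulation. It assumes the only_rerun entries are applied as regular-expression searches against the failure text. The pattern strings, the `would_rerun` helper, and the sample messages are illustrative stand-ins, not the exact values in tests/waterdata_test.py.

```python
import re

# Hypothetical stand-ins for the existing direct-path patterns.
EXISTING_PATTERNS = [r"RateLimited:", r"RuntimeError: 5\d\d:"]

# Hypothetical form of the added pattern: one alternation covering the
# three exception names listed in the Fix section.
NEW_PATTERN = r"ServiceUnavailable|QuotaExhausted|ServiceInterrupted"

ONLY_RERUN = EXISTING_PATTERNS + [NEW_PATTERN]


def would_rerun(message: str) -> bool:
    """Return True if any pattern is found anywhere in the failure text."""
    return any(re.search(pattern, message) for pattern in ONLY_RERUN)


# Illustrative failure messages (assumed shapes, not captured logs).
transient = [
    "ServiceUnavailable: 502: Bad Gateway",
    "ServiceInterrupted: ... Cause: ServiceUnavailable: 502: Bad Gateway",
    "QuotaExhausted: ...",
    "RateLimited: 429: Too Many Requests",
]
deterministic = [
    "HTTPError: 404: Not Found",
    "AssertionError: assert 0 == 1",
]

for msg in transient + deterministic:
    print(f"{'rerun' if would_rerun(msg) else 'fail ':5}  {msg}")
```

Matching on the exception class name rather than the status code keeps deterministic errors such as a 404 out of the rerun set, provided their messages do not themselves mention one of the three names.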

Notes

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the fix/flaky-rerun-chunker-5xx branch from dd307dc to e2d32d3 Compare June 15, 2026 15:02
@thodson-usgs thodson-usgs marked this pull request as ready for review June 15, 2026 15:40
The suite already retries transient HTTP failures (flaky's `only_rerun`), but
the patterns missed two kinds, so a transient upstream 502 failed CI instead of
retrying:

- a direct 5xx is raised as `ServiceUnavailable`, and
- a chunked request wraps a transient 429/5xx as `QuotaExhausted` /
  `ServiceInterrupted`.

Add patterns for both. Verified they retry these transient errors but not
deterministic ones (e.g. a 404 or an assertion failure).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the fix/flaky-rerun-chunker-5xx branch from e2d32d3 to ece83da Compare June 15, 2026 15:43
@thodson-usgs thodson-usgs merged commit dff162c into DOI-USGS:main Jun 15, 2026
9 checks passed
@thodson-usgs thodson-usgs deleted the fix/flaky-rerun-chunker-5xx branch June 15, 2026 15:49