Skip to content

fix(airflow-client): retry writes on transient 5xx and make copy operations idempotent#2171

Open
sakethsomaraju wants to merge 5 commits into
mainfrom
add-retries
Open

fix(airflow-client): retry writes on transient 5xx and make copy operations idempotent#2171
sakethsomaraju wants to merge 5 commits into
mainfrom
add-retries

Conversation

@sakethsomaraju

@sakethsomaraju sakethsomaraju commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes two related issues that could leave a preview deployment in a permanently broken, partially-configured state after a transient network failure during airflow-variable copy, connection copy, or pool copy.

Problem 1: No retry on transient write failures
The CLI used retryablehttp for GET requests but not for write operations. A single transient 503 mid-copy caused the entire command to fail with no recovery, leaving the destination with only a partial set of variables/connections/pools.
Problem 2: Copy command not idempotent on retry
On manual retry after a partial failure, the CLI encountered already-copied resources in the destination and attempted to PATCH them. For variables with a leading / in the key, this produced a malformed URL (/api/v1/variables//var-name), causing Airflow to return 404. The command would then fail permanently on every subsequent retry attempt.

Changes
Retry policy (checkRetryPolicy):

GET requests: unchanged (default policy)
Non-GET (writes): retry only on 502, 503, 429, cases where the server provably did not apply the write. Transport errors, 500, and 504 are not retried as these are ambiguous (write may have already been applied)
Write retries use a separate budget: 3 attempts with 5s backoff

Operation-layer idempotency (CreateVariable, CreateConnection, CreatePool):

Each create operation is now an upsert: on 409 Conflict, falls back to the corresponding PATCH update
Makes the entire copy command safe to re-run after any partial failure, existing resources get updated, missing ones get created

🎟 Issue(s)

Closes #2172
Reduces the likelyhood of encountering, but not fully remediates #2165

🧪 Functional Testing

  1. Run astro deployment airflow-variable copy against a source deployment with variables including at least one with a leading / in the key
  2. Confirm all variables copy successfully on a clean first run
  3. Manually interrupt mid-copy (or simulate a 503), then re-run — confirm the command completes successfully without 404 errors
  4. Run astro deployment connection copy and astro deployment pool copy and confirm the same retry and idempotency behavior
  5. Confirm that a 500 or transport error on a write does NOT trigger a retry

📸 Screenshots

Add screenshots to illustrate the validity of these changes.

% ./astro deployment pool copy --source-id cmqen5t798vwa01ouv1aybr0q --target-id cmqen6j2d8z1101l4upvwcpsc --verbosity=debug
Copying Pool default_pool
 astro-cli % ./astro deployment connections copy --source-id cmqen5t798vwa01ouv1aybr0q --target-id cmqen6j2d8z1101l4upvwcpsc --verbosity=debug
WARNING! The password and extra field are not copied over. You will need to manually add these values
Copying Connecton test-conn

To reproduce the 503 responses, I force-deleted the webserver and confirmed that the webserver container came back online within 15 seconds.

In the 1.42.1 version.

 astro deployment airflow-variable copy --source-id cmqen5t798vwa01ouv1aybr0q --target-id cmqen6j2d8z1101l4upvwcpsc --verbosity=debug 

WARNING! Secret values are not copied over. You will need to manually add these values
Error: API error (503): upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused

After fix:

./astro deployment airflow-variable copy --source-id cmqen5t798vwa01ouv1aybr0q --target-id cmqen6j2d8z1101l4upvwcpsc
WARNING! Secret values are not copied over. You will need to manually add these values 
> --There is some delay here--
Copying Variable test-var
Copying Variable /tes
Copying Variable hhh
Copying Variable jjj

📋 Checklist

  • Rebased from the main (or release if patching) branch (before testing)
  • Ran make test before taking out of draft
  • Ran make lint before taking out of draft
  • Added/updated applicable tests
  • Tested against Astro-API (if necessary).
  • Tested against Houston-API and Astronomer (if necessary).
  • Communicated to/tagged owners of respective clients potentially impacted by these changes.
  • Updated any related documentation

@coveralls-official

coveralls-official Bot commented Jun 15, 2026

Copy link
Copy Markdown

Coverage Report for CI Build 28068617702

Coverage increased (+0.02%) to 45.301%

Details

  • Coverage increased (+0.02%) from the base build.
  • Patch coverage: 4 uncovered changes across 1 file (42 of 46 lines covered, 91.3%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
airflow-client/airflow-client.go 46 42 91.3%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 55506
Covered Lines: 25145
Line Coverage: 45.3%
Coverage Strength: 8.12 hits per line

💛 - Coveralls

@sakethsomaraju sakethsomaraju changed the title add retries for transient 503, 502 and 429 err codes fix(airflow-client): retry writes on transient 5xx and make copy operations idempotent Jun 15, 2026
@sakethsomaraju sakethsomaraju marked this pull request as ready for review June 15, 2026 05:15
@sakethsomaraju sakethsomaraju requested a review from a team as a code owner June 15, 2026 05:15
Comment thread airflow-client/airflow-client.go Outdated
Comment thread airflow-client/airflow-client.go Outdated
Comment thread airflow-client/airflow-client.go Outdated
Comment thread airflow-client/airflow-client.go

@wendtek wendtek left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

astro deployment airflow-variable/connection/pool copy fails permanently after transient 503 mid-copy

3 participants