Retry SSH connection setup to fix intermittent publickey errors #2282

Merged: berendt merged 2 commits into main from fix/ssh-retries-intermittent-publickey on May 19, 2026

berendt (Member) commented May 19, 2026

Since #2134 the SSH ControlPath directory is unique per task and removed in the finally block, so there is no persistent ControlMaster socket reuse across runs anymore. Every run now cold-establishes the SSH connection to every host.

With Ansible's default ANSIBLE_SSH_RETRIES=0, a single transient failure during this cold connection setup makes the affected hosts fail immediately as UNREACHABLE with "Permission denied (publickey)". This typically happens on the very first task that contacts the nodes (e.g. module-load) and for all hosts of the batch at once, while an immediate re-run succeeds. This is a regression from #2134 for the sequential single-playbook case, where persistent ControlMaster socket reuse previously masked such failures.

Set ANSIBLE_SSH_RETRIES=3 (unless already provided via the environment) right next to the #2134 logic so transient first-contact glitches no longer abort the whole run.

AI-assisted: Claude Code
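
For readers who want to see the shape of the "unless already provided via the environment" rule, here is a minimal sketch. It is not the code from this PR: the helper name `build_ansible_env` and the constant `DEFAULT_SSH_RETRIES` are hypothetical, and the sketch only assumes that the environment passed to Ansible is a plain mapping of strings.

```python
import os

# Hypothetical constant; "3" is the value named in this PR description.
DEFAULT_SSH_RETRIES = "3"


def build_ansible_env(base_env=None):
    """Return a copy of the environment with ANSIBLE_SSH_RETRIES defaulted.

    An existing ANSIBLE_SSH_RETRIES value is left untouched, so an
    operator-provided setting always takes precedence.
    """
    env = dict(os.environ if base_env is None else base_env)
    # setdefault only writes the key when it is missing.
    env.setdefault("ANSIBLE_SSH_RETRIES", DEFAULT_SSH_RETRIES)
    return env
```

Using `setdefault` on a copy keeps the caller's environment unmodified and makes the precedence explicit: an exported `ANSIBLE_SSH_RETRIES=0` would still disable retries.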

Signed-off-by: Christian Berendt <berendt@osism.tech>
berendt (Member, Author) commented May 19, 2026

@janhorstmann Can you check this? Maybe this could resolve the unreachable issues.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • Consider making the retry count configurable (e.g. via a setting or constant) instead of hardcoding "3" so that operators can tune behavior without code changes.
  • It may be useful to log when ANSIBLE_SSH_RETRIES is being set implicitly so users can understand why retry behavior changed when troubleshooting SSH issues.
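
The logging suggestion above could look roughly like the following sketch. It is an illustration only: the function name `default_ssh_retries` is hypothetical, and the standard library `logging` module is used here as an assumption, since the thread does not say which logger the project uses or whether this suggestion was adopted.

```python
import logging

logger = logging.getLogger(__name__)


def default_ssh_retries(env, retries=3):
    """Set ANSIBLE_SSH_RETRIES in env if absent and log what happened.

    Returns True when the default was applied, False when an existing
    value was kept.
    """
    if "ANSIBLE_SSH_RETRIES" in env:
        logger.debug(
            "ANSIBLE_SSH_RETRIES already set to %s, leaving it unchanged",
            env["ANSIBLE_SSH_RETRIES"],
        )
        return False
    env["ANSIBLE_SSH_RETRIES"] = str(retries)
    logger.info("ANSIBLE_SSH_RETRIES not set, defaulting to %s", retries)
    return True
```

Logging the implicit default at info level and the untouched case at debug level is one possible choice; it keeps normal runs quiet while still leaving a trace when retry behaviour differs from Ansible's own default.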

The ANSIBLE_SSH_RETRIES value introduced in d808da1 was hardcoded to
3. Expose it as a run_ansible_in_environment(ssh_retries=3) default
parameter so callers can override the retry count while keeping the
existing behaviour as the default.

AI-assisted: Claude Code

Signed-off-by: Christian Berendt <berendt@osism.tech>
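
A minimal sketch of what a `ssh_retries=3` default parameter could look like follows. The real `run_ansible_in_environment` signature is not shown in this thread, so the function below is a hypothetical stand-in with made-up `playbook` and `environ` parameters. It also assumes that a value already present in the environment still wins over the parameter, as described for the first commit.

```python
import os


def run_ansible_in_environment_sketch(playbook, ssh_retries=3, environ=None):
    """Hypothetical stand-in that only prepares the environment.

    The real function would go on to launch Ansible; this sketch returns
    the environment mapping that would be handed to that process.
    """
    env = dict(os.environ if environ is None else environ)
    # Assumption: an explicit ANSIBLE_SSH_RETRIES in the environment
    # takes precedence over the ssh_retries argument.
    env.setdefault("ANSIBLE_SSH_RETRIES", str(ssh_retries))
    return env
```

A caller that wants more attempts on a flaky network could pass `ssh_retries=5`, while callers that pass nothing keep the previous behaviour of three retries.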
@berendt berendt merged commit 647c96a into main May 19, 2026
3 checks passed
@berendt berendt deleted the fix/ssh-retries-intermittent-publickey branch May 19, 2026 14:50
@github-project-automation github-project-automation Bot moved this from Ready to Done in Human Board May 19, 2026
3 participants