Skip to content

Make applyClusterTLSProfile() non-fatal to prevent crash-loops in HyperShift hosted clusters#3859

Closed
redhat-chai-bot wants to merge 1 commit into
operator-framework:masterfrom
redhat-chai-bot:fix-packageserver-tls-crash-hypershift
Closed

Make applyClusterTLSProfile() non-fatal to prevent crash-loops in HyperShift hosted clusters#3859
redhat-chai-bot wants to merge 1 commit into
operator-framework:masterfrom
redhat-chai-bot:fix-packageserver-tls-crash-hypershift

Conversation

@redhat-chai-bot

Copy link
Copy Markdown

Problem

applyClusterTLSProfile() in packageserver startup (pkg/package-server/server/server.go) fetches the cluster APIServer CR with a 30-second hard timeout. If the API server isn't reachable — which happens during HyperShift hosted cluster creation when the API server may not be ready yet — the error is fatal: the entire container crashes and restarts in a loop.

This is a probabilistic race condition. In OpenShift CI, this caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking hypershift-e2e-aws (3 consecutive failures) and hypershift-e2e-aks (2 consecutive failures).

Fix

  • Make the error non-fatal: Log a warning and continue with defaults instead of returning a fatal error. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation anyway, so crashing here is unnecessary.
  • Reduce timeout from 30s to 10s: Best-effort lookup that doesn't block startup.

Changes

pkg/package-server/server/server.go | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Validation

  • All 5 existing TLS-related tests pass
  • Downstream proof-of-concept (openshift/operator-framework-olm#1333) was validated with /payload-job — the primary failing HyperShift e2e job (e2e-aws-ovn) passed after a 3-payload failure streak
  • Tracking: TRT-2761

…erShift hosted clusters

applyClusterTLSProfile() fetches the cluster APIServer CR on startup
with a hard timeout. In HyperShift hosted clusters the API server may
not be ready when packageserver starts, causing the fatal error path
to crash-loop the container indefinitely.

Make the error non-fatal: log a warning and continue with defaults
instead of returning a fatal error. The Package Server Manager (PSM)
will inject the correct TLS flags on the next reconciliation anyway,
so crashing is unnecessary.

Also reduce the lookup timeout from 30s to 10s so we don't stall
startup when the API is genuinely unreachable.

Validated in downstream openshift/operator-framework-olm PR operator-framework#1333,
where the payload job e2e-aws-ovn passed with this fix.

Downstream tracking: https://redhat.atlassian.net/browse/TRT-2761

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 29, 2026 02:23
@openshift-ci openshift-ci Bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 29, 2026
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

Hi @redhat-chai-bot. Thanks for your PR.

I'm waiting for a operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents package-server crash-loops during startup by making the “best-effort” OpenShift APIServer TLS profile lookup non-fatal when the API server/config API isn’t reachable yet (notably during HyperShift hosted cluster bring-up), and by shortening the lookup timeout to reduce startup blocking.

Changes:

  • Convert applyClusterTLSProfile() failure during startup from a hard error to a warning, continuing with defaults.
  • Reduce the APIServer TLS profile lookup timeout from 30s to 10s.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 229 to +231
if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil {
return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err)
log.WithError(err).Warn("Failed to apply cluster TLS profile to serving options, continuing with defaults. " +
"PSM will inject the correct TLS flags on next reconciliation.")
@tmshort

tmshort commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Closing in favor of #3860

@tmshort tmshort closed this Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants