Make applyClusterTLSProfile() non-fatal to prevent crash-loops in HyperShift hosted clusters#3859
Conversation
…erShift hosted clusters applyClusterTLSProfile() fetches the cluster APIServer CR on startup with a hard timeout. In HyperShift hosted clusters the API server may not be ready when packageserver starts, causing the fatal error path to crash-loop the container indefinitely. Make the error non-fatal: log a warning and continue with defaults instead of returning a fatal error. The Package Server Manager (PSM) will inject the correct TLS flags on the next reconciliation anyway, so crashing is unnecessary. Also reduce the lookup timeout from 30s to 10s so we don't stall startup when the API is genuinely unreachable. Validated in downstream openshift/operator-framework-olm PR operator-framework#1333, where the payload job e2e-aws-ovn passed with this fix. Downstream tracking: https://redhat.atlassian.net/browse/TRT-2761 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Hi @redhat-chai-bot. Thanks for your PR. I'm waiting for a operator-framework member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR prevents package-server crash-loops during startup by making the “best-effort” OpenShift APIServer TLS profile lookup non-fatal when the API server/config API isn’t reachable yet (notably during HyperShift hosted cluster bring-up), and by shortening the lookup timeout to reduce startup blocking.
Changes:
- Convert
applyClusterTLSProfile()failure during startup from a hard error to a warning, continuing with defaults. - Reduce the APIServer TLS profile lookup timeout from 30s to 10s.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil { | ||
| return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err) | ||
| log.WithError(err).Warn("Failed to apply cluster TLS profile to serving options, continuing with defaults. " + | ||
| "PSM will inject the correct TLS flags on next reconciliation.") |
|
Closing in favor of #3860 |
Problem
applyClusterTLSProfile()in packageserver startup (pkg/package-server/server/server.go) fetches the clusterAPIServerCR with a 30-second hard timeout. If the API server isn't reachable — which happens during HyperShift hosted cluster creation when the API server may not be ready yet — the error is fatal: the entire container crashes and restarts in a loop.This is a probabilistic race condition. In OpenShift CI, this caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking
hypershift-e2e-aws(3 consecutive failures) andhypershift-e2e-aks(2 consecutive failures).Fix
--tls-min-version,--tls-cipher-suites) on its next reconciliation anyway, so crashing here is unnecessary.Changes
Validation
/payload-job— the primary failing HyperShift e2e job (e2e-aws-ovn) passed after a 3-payload failure streak