TRT-2761: Make applyClusterTLSProfile() non-fatal to prevent packageserver crash-loops in HyperShift#1333
Conversation
…erver crash-loops in HyperShift PR openshift#1330 introduced applyClusterTLSProfile() in the packageserver startup path that makes API calls with a hard timeout. If any call fails (e.g. because the API server is not yet ready during HyperShift hosted cluster bootstrap), the function returns a fatal error that crashes the container. This causes an infinite crash-loop that blocks cluster creation. Make the failure non-fatal by logging a warning instead of returning an error. This is safe because: - The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation - Default TLS settings are safe and compliant - The packageserver must be able to start even when the API server is temporarily unavailable Also reduce the best-effort lookup timeout from 30s to 10s to avoid blocking startup unnecessarily. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@redhat-chai-bot: This pull request references TRT-2761 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (1)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including ⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughIn ChangesTLS Profile Startup Handling
Estimated code review effort🎯 2 (Simple) | ⏱️ ~5 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: redhat-chai-bot The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/payload-job periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn |
|
@redhat-chai-bot: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/09b2fb40-733d-11f1-9de9-5a87bc1152af-0 |
Update the vendored copy to match the staging changes that make applyClusterTLSProfile() non-fatal and reduce the lookup timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@redhat-chai-bot: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@tmshort: Closed this PR. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Problem
PR #1330 introduced
applyClusterTLSProfile()in packageserver startup which fetches the cluster APIServer CR with a 30-second timeout. If the API server is unreachable, the error is fatal — the packageserver container crashes and enters a restart loop.In HyperShift hosted clusters, the API server may not be ready when the packageserver starts during initial cluster creation. This race condition has caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking
hypershift-e2e-aws-ovn(streak of 3) andhypershift-e2e-aks(streak of 2).Root Cause
In
staging/operator-lifecycle-manager/pkg/package-server/server/server.go, theConfig()function calls:The
return fmt.Errorf(...)causes a fatal startup failure when the API server is temporarily unavailable.Fix
Two changes:
Make failure non-fatal: Replace the fatal
return fmt.Errorf(...)withlog.Warningf(...). The packageserver continues with default TLS settings, which are safe and compliant. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version,--tls-cipher-suites) on its next reconciliation cycle.Reduce timeout from 30s to 10s: Since this is now best-effort, there's no reason to block startup for 30 seconds. 10 seconds is sufficient for a healthy API server and avoids unnecessary delay.
Testing
All existing TLS profile tests pass:
TestApplyClusterTLSProfileWithClients_NonOpenShift✅TestApplyClusterTLSProfileWithClients_IntermediateProfile✅TestApplyClusterTLSProfileWithClients_ModernProfile✅TestApplyClusterTLSProfileWithClients_FlagsTakePrecedence✅TestApplyClusterTLSProfileWithClients_MissingAPIServerCR✅Related
Summary by CodeRabbit