Skip to content

TRT-2761: Make applyClusterTLSProfile() non-fatal to prevent packageserver crash-loops in HyperShift#1333

Closed
redhat-chai-bot wants to merge 2 commits into
openshift:mainfrom
redhat-chai-bot:fix-packageserver-tls-crash-hypershift
Closed

TRT-2761: Make applyClusterTLSProfile() non-fatal to prevent packageserver crash-loops in HyperShift#1333
redhat-chai-bot wants to merge 2 commits into
openshift:mainfrom
redhat-chai-bot:fix-packageserver-tls-crash-hypershift

Conversation

@redhat-chai-bot

@redhat-chai-bot redhat-chai-bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Problem

PR #1330 introduced applyClusterTLSProfile() in packageserver startup which fetches the cluster APIServer CR with a 30-second timeout. If the API server is unreachable, the error is fatal — the packageserver container crashes and enters a restart loop.

In HyperShift hosted clusters, the API server may not be ready when the packageserver starts during initial cluster creation. This race condition has caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking hypershift-e2e-aws-ovn (streak of 3) and hypershift-e2e-aks (streak of 2).

Root Cause

In staging/operator-lifecycle-manager/pkg/package-server/server/server.go, the Config() function calls:

if o.SecureServing.MinTLSVersion == "" {
    if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil {
        return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err)
    }
}

The return fmt.Errorf(...) causes a fatal startup failure when the API server is temporarily unavailable.

Fix

Two changes:

  1. Make failure non-fatal: Replace the fatal return fmt.Errorf(...) with log.Warningf(...). The packageserver continues with default TLS settings, which are safe and compliant. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation cycle.

  2. Reduce timeout from 30s to 10s: Since this is now best-effort, there's no reason to block startup for 30 seconds. 10 seconds is sufficient for a healthy API server and avoids unnecessary delay.

Testing

All existing TLS profile tests pass:

  • TestApplyClusterTLSProfileWithClients_NonOpenShift
  • TestApplyClusterTLSProfileWithClients_IntermediateProfile
  • TestApplyClusterTLSProfileWithClients_ModernProfile
  • TestApplyClusterTLSProfileWithClients_FlagsTakePrecedence
  • TestApplyClusterTLSProfileWithClients_MissingAPIServerCR

Related

Summary by CodeRabbit

  • Bug Fixes
    • Improved startup resilience when TLS settings are not yet available, allowing the Package Server to continue running instead of failing immediately.
    • Added a warning when cluster TLS profile application cannot be completed right away.
    • Reduced the wait time for TLS profile lookup, so fallback handling happens sooner.

…erver crash-loops in HyperShift

PR openshift#1330 introduced applyClusterTLSProfile() in the packageserver startup
path that makes API calls with a hard timeout. If any call fails (e.g.
because the API server is not yet ready during HyperShift hosted cluster
bootstrap), the function returns a fatal error that crashes the container.
This causes an infinite crash-loop that blocks cluster creation.

Make the failure non-fatal by logging a warning instead of returning an
error. This is safe because:
- The Package Server Manager (PSM) will inject the correct TLS flags
  (--tls-min-version, --tls-cipher-suites) on its next reconciliation
- Default TLS settings are safe and compliant
- The packageserver must be able to start even when the API server is
  temporarily unavailable

Also reduce the best-effort lookup timeout from 30s to 10s to avoid
blocking startup unnecessarily.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 28, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 28, 2026

Copy link
Copy Markdown

@redhat-chai-bot: This pull request references TRT-2761 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

PR #1330 introduced applyClusterTLSProfile() in packageserver startup which fetches the cluster APIServer CR with a 30-second timeout. If the API server is unreachable, the error is fatal — the packageserver container crashes and enters a restart loop.

In HyperShift hosted clusters, the API server may not be ready when the packageserver starts during initial cluster creation. This race condition has caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking hypershift-e2e-aws-ovn (streak of 3) and hypershift-e2e-aks (streak of 2).

Root Cause

In staging/operator-lifecycle-manager/pkg/package-server/server/server.go, the Config() function calls:

if o.SecureServing.MinTLSVersion == "" {
   if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil {
       return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err)
   }
}

The return fmt.Errorf(...) causes a fatal startup failure when the API server is temporarily unavailable.

Fix

Two changes:

  1. Make failure non-fatal: Replace the fatal return fmt.Errorf(...) with log.Warningf(...). The packageserver continues with default TLS settings, which are safe and compliant. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation cycle.

  2. Reduce timeout from 30s to 10s: Since this is now best-effort, there's no reason to block startup for 30 seconds. 10 seconds is sufficient for a healthy API server and avoids unnecessary delay.

Testing

All existing TLS profile tests pass:

  • TestApplyClusterTLSProfileWithClients_NonOpenShift
  • TestApplyClusterTLSProfileWithClients_IntermediateProfile
  • TestApplyClusterTLSProfileWithClients_ModernProfile
  • TestApplyClusterTLSProfileWithClients_FlagsTakePrecedence
  • TestApplyClusterTLSProfileWithClients_MissingAPIServerCR

Related

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)
  • vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/server/server.go is excluded by !**/vendor/**, !vendor/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49c8848e-97ca-45b8-98c9-56a1c5d8f12e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

In PackageServerOptions.Run, failure to apply the cluster TLS security profile is now non-fatal when SecureServing.MinTLSVersion is unset: a warning is logged and startup continues. Additionally, the lookupTimeout in applyClusterTLSProfile is reduced from 30s to 10s.

Changes

TLS Profile Startup Handling

Layer / File(s) Summary
Non-fatal TLS profile error and reduced lookup timeout
staging/operator-lifecycle-manager/pkg/package-server/server/server.go
Run now logs a warning and continues instead of returning an error when applyClusterTLSProfile fails with MinTLSVersion unset. lookupTimeout in applyClusterTLSProfile is reduced from 30s to 10s.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error The new warning logs %v from applyClusterTLSProfile(), whose network/API errors can include internal API hostnames or URLs. Log a generic message without the raw error, or redact host/URL details before emitting the warning.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: making applyClusterTLSProfile() non-fatal to avoid packageserver crash-loops.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PASS: The PR only changes server.go; no Ginkgo test titles were added or modified. The package-server tests inspected use static, descriptive names.
Test Structure And Quality ✅ Passed No Ginkgo tests were added here; the package uses small isolated unit tests with explicit failure messages, so the checklist issues don’t apply.
Microshift Test Compatibility ✅ Passed HEAD diff adds no new Ginkgo It/Describe/Context/When tests; the change is server.go TLS startup logic only, so MicroShift compatibility isn't implicated.
Single Node Openshift (Sno) Test Compatibility ✅ Passed The PR only changes server.go startup/TLS logic; no Ginkgo e2e tests or SNO-sensitive test assumptions were added.
Topology-Aware Scheduling Compatibility ✅ Passed The PR only changes packageserver TLS-profile startup/timeout logic; no affinity, nodeSelector, spread, PDB, or topology-aware scheduling code was added.
Ote Binary Stdout Contract ✅ Passed New startup warning uses logrus, which defaults to stderr; no fmt.Print/klog stdout writes or stdout-directed logger config were added in process-level startup.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests were added or changed; this PR only updates package-server startup logic in server.go, so the IPv6/disconnected-network test check is not applicable.
No-Weak-Crypto ✅ Passed Changed code only logs and shortens a TLS-profile lookup timeout; scan found no md5/sha1/des/rc4/3DES/blowfish/ECB or secret comparisons.
Container-Privileges ✅ Passed Changed workload manifests keep restricted securityContext (allowPrivilegeEscalation false, drop ALL, runAsNonRoot true); no privileged/host* settings were found.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from bentito and pedjak June 28, 2026 22:02
@openshift-ci

openshift-ci Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: redhat-chai-bot
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@redhat-chai-bot

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn
/payload-job periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aks

@openshift-ci

openshift-ci Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

@redhat-chai-bot: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn
  • periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aks

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/09b2fb40-733d-11f1-9de9-5a87bc1152af-0

Update the vendored copy to match the staging changes that make
applyClusterTLSProfile() non-fatal and reduce the lookup timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@redhat-chai-bot: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tmshort

tmshort commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

/close

Closing in favor of #1335, since #1334 (revert) was done.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2026
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot closed this Jun 29, 2026
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@tmshort: Closed this PR.

Details

In response to this:

/close

Closing in favor of #1335, since #1334 (revert) was done.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants