TRT-2761: Make applyClusterTLSProfile() non-fatal to prevent packageserver crash-loops in HyperShift by redhat-chai-bot · Pull Request #1333 · openshift/operator-framework-olm

redhat-chai-bot · 2026-06-28T22:01:53Z

Problem

PR #1330 introduced applyClusterTLSProfile() in packageserver startup which fetches the cluster APIServer CR with a 30-second timeout. If the API server is unreachable, the error is fatal — the packageserver container crashes and enters a restart loop.

In HyperShift hosted clusters, the API server may not be ready when the packageserver starts during initial cluster creation. This race condition has caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking hypershift-e2e-aws-ovn (streak of 3) and hypershift-e2e-aks (streak of 2).

Root Cause

In staging/operator-lifecycle-manager/pkg/package-server/server/server.go, the Config() function calls:

if o.SecureServing.MinTLSVersion == "" {
    if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil {
        return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err)
    }
}

The return fmt.Errorf(...) causes a fatal startup failure when the API server is temporarily unavailable.

Fix

Two changes:

Make failure non-fatal: Replace the fatal return fmt.Errorf(...) with log.Warningf(...). The packageserver continues with default TLS settings, which are safe and compliant. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation cycle.
Reduce timeout from 30s to 10s: Since this is now best-effort, there's no reason to block startup for 30 seconds. 10 seconds is sufficient for a healthy API server and avoids unnecessary delay.

Testing

All existing TLS profile tests pass:

TestApplyClusterTLSProfileWithClients_NonOpenShift ✅
TestApplyClusterTLSProfileWithClients_IntermediateProfile ✅
TestApplyClusterTLSProfileWithClients_ModernProfile ✅
TestApplyClusterTLSProfileWithClients_FlagsTakePrecedence ✅
TestApplyClusterTLSProfileWithClients_MissingAPIServerCR ✅

Summary by CodeRabbit

Bug Fixes
- Improved startup resilience when TLS settings are not yet available, allowing the Package Server to continue running instead of failing immediately.
- Added a warning when cluster TLS profile application cannot be completed right away.
- Reduced the wait time for TLS profile lookup, so fallback handling happens sooner.

…erver crash-loops in HyperShift PR openshift#1330 introduced applyClusterTLSProfile() in the packageserver startup path that makes API calls with a hard timeout. If any call fails (e.g. because the API server is not yet ready during HyperShift hosted cluster bootstrap), the function returns a fatal error that crashes the container. This causes an infinite crash-loop that blocks cluster creation. Make the failure non-fatal by logging a warning instead of returning an error. This is safe because: - The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation - Default TLS settings are safe and compliant - The packageserver must be able to start even when the API server is temporarily unavailable Also reduce the best-effort lookup timeout from 30s to 10s to avoid blocking startup unnecessarily. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot · 2026-06-28T22:01:57Z

@redhat-chai-bot: This pull request references TRT-2761 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

PR #1330 introduced applyClusterTLSProfile() in packageserver startup which fetches the cluster APIServer CR with a 30-second timeout. If the API server is unreachable, the error is fatal — the packageserver container crashes and enters a restart loop.

In HyperShift hosted clusters, the API server may not be ready when the packageserver starts during initial cluster creation. This race condition has caused 5 consecutive payload rejections over 36 hours in the 5.0 CI stream, blocking hypershift-e2e-aws-ovn (streak of 3) and hypershift-e2e-aks (streak of 2).

Root Cause

In staging/operator-lifecycle-manager/pkg/package-server/server/server.go, the Config() function calls:
if o.SecureServing.MinTLSVersion == "" {
   if err := applyClusterTLSProfile(ctx, clientConfig, o.SecureServing); err != nil {
       return fmt.Errorf("failed to apply cluster TLS profile to serving options: %w", err)
   }
}
The return fmt.Errorf(...) causes a fatal startup failure when the API server is temporarily unavailable.

Fix

Two changes:

Make failure non-fatal: Replace the fatal return fmt.Errorf(...) with log.Warningf(...). The packageserver continues with default TLS settings, which are safe and compliant. The Package Server Manager (PSM) will inject the correct TLS flags (--tls-min-version, --tls-cipher-suites) on its next reconciliation cycle.

Reduce timeout from 30s to 10s: Since this is now best-effort, there's no reason to block startup for 30 seconds. 10 seconds is sufficient for a healthy API server and avoids unnecessary delay.

Testing

All existing TLS profile tests pass:

TestApplyClusterTLSProfileWithClients_NonOpenShift ✅

TestApplyClusterTLSProfileWithClients_IntermediateProfile ✅

TestApplyClusterTLSProfileWithClients_ModernProfile ✅

TestApplyClusterTLSProfileWithClients_FlagsTakePrecedence ✅

TestApplyClusterTLSProfileWithClients_MissingAPIServerCR ✅

Related

Tracking issue: TRT-2761

Offending PR: NO-ISSUE: Synchronize From Upstream Repositories #1330

Complementary PR: OCPBUGS-88737: Update PSM/package server to support TLS profiles #1328 (adds PSM-side TLS injection, does not fix the startup crash)

AI analysis: Payload analysis report

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-28T22:02:10Z

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (1)

vendor/github.com/operator-framework/operator-lifecycle-manager/pkg/package-server/server/server.go is excluded by !**/vendor/**, !vendor/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory, by removing the pattern from both the lists.

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 49c8848e-97ca-45b8-98c9-56a1c5d8f12e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

In PackageServerOptions.Run, failure to apply the cluster TLS security profile is now non-fatal when SecureServing.MinTLSVersion is unset: a warning is logged and startup continues. Additionally, the lookupTimeout in applyClusterTLSProfile is reduced from 30s to 10s.

Changes

TLS Profile Startup Handling

Layer / File(s)	Summary
Non-fatal TLS profile error and reduced lookup timeout `staging/operator-lifecycle-manager/pkg/package-server/server/server.go`	`Run` now logs a warning and continues instead of returning an error when `applyClusterTLSProfile` fails with `MinTLSVersion` unset. `lookupTimeout` in `applyClusterTLSProfile` is reduced from 30s to 10s.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~5 minutes

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
No-Sensitive-Data-In-Logs	❌ Error	The new warning logs `%v` from `applyClusterTLSProfile()`, whose network/API errors can include internal API hostnames or URLs.	Log a generic message without the raw error, or redact host/URL details before emitting the warning.

✅ Passed checks (14 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: making applyClusterTLSProfile() non-fatal to avoid packageserver crash-loops.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PASS: The PR only changes server.go; no Ginkgo test titles were added or modified. The package-server tests inspected use static, descriptive names.
Test Structure And Quality	✅ Passed	No Ginkgo tests were added here; the package uses small isolated unit tests with explicit failure messages, so the checklist issues don’t apply.
Microshift Test Compatibility	✅ Passed	HEAD diff adds no new Ginkgo It/Describe/Context/When tests; the change is server.go TLS startup logic only, so MicroShift compatibility isn't implicated.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The PR only changes server.go startup/TLS logic; no Ginkgo e2e tests or SNO-sensitive test assumptions were added.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR only changes packageserver TLS-profile startup/timeout logic; no affinity, nodeSelector, spread, PDB, or topology-aware scheduling code was added.
Ote Binary Stdout Contract	✅ Passed	New startup warning uses logrus, which defaults to stderr; no fmt.Print/klog stdout writes or stdout-directed logger config were added in process-level startup.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests were added or changed; this PR only updates package-server startup logic in server.go, so the IPv6/disconnected-network test check is not applicable.
No-Weak-Crypto	✅ Passed	Changed code only logs and shortens a TLS-profile lookup timeout; scan found no md5/sha1/des/rc4/3DES/blowfish/ECB or secret comparisons.
Container-Privileges	✅ Passed	Changed workload manifests keep restricted securityContext (allowPrivilegeEscalation false, drop ALL, runAsNonRoot true); no privileged/host* settings were found.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

openshift-ci · 2026-06-28T22:02:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: redhat-chai-bot
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

DOWNSTREAM_OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

redhat-chai-bot · 2026-06-28T22:02:21Z

/payload-job periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn
/payload-job periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aks

openshift-ci · 2026-06-28T22:02:23Z

@redhat-chai-bot: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aws-ovn
periodic-ci-openshift-hypershift-release-5.0-periodics-e2e-aks

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/09b2fb40-733d-11f1-9de9-5a87bc1152af-0

Update the vendored copy to match the staging changes that make applyClusterTLSProfile() non-fatal and reduce the lookup timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci · 2026-06-29T02:09:12Z

@redhat-chai-bot: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

tmshort · 2026-06-29T15:00:40Z

/close

Closing in favor of #1335, since #1334 (revert) was done.

openshift-ci · 2026-06-29T15:01:29Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2026-06-29T15:04:54Z

@tmshort: Closed this PR.

Details

In response to this:

/close

Closing in favor of #1335, since #1334 (revert) was done.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 28, 2026

openshift-ci Bot requested review from bentito and pedjak June 28, 2026 22:02

TRT-2761: Sync vendor copy of packageserver server.go

d9bd919

Update the vendored copy to match the staging changes that make applyClusterTLSProfile() non-fatal and reduce the lookup timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

redhat-chai-bot mentioned this pull request Jun 28, 2026

TRT-2761: Revert #1330 "NO-ISSUE: Synchronize From Upstream Repositories" #1334

Merged

redhat-chai-bot mentioned this pull request Jun 29, 2026

Make applyClusterTLSProfile() non-fatal to prevent crash-loops in HyperShift hosted clusters operator-framework/operator-lifecycle-manager#3859

Closed

tmshort mentioned this pull request Jun 29, 2026

TRT-2761: Undo revert and fix applyClusterTLSProfile() #1335

Open

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 29, 2026

openshift-ci Bot closed this Jun 29, 2026

Uh oh!

Conversation

redhat-chai-bot commented Jun 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Testing

Related

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Jun 28, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Testing

Related

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Pre-merge checks failed

❌ Failed checks (1 error)

Uh oh!

openshift-ci Bot commented Jun 28, 2026

Uh oh!

redhat-chai-bot commented Jun 28, 2026

Uh oh!

openshift-ci Bot commented Jun 28, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

tmshort commented Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

openshift-ci Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

redhat-chai-bot commented Jun 28, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Jun 28, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading