Skip to content

OCPBUGS-77056: Make external cert validation asynchronous#745

Open
bentito wants to merge 6 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar
Open

OCPBUGS-77056: Make external cert validation asynchronous#745
bentito wants to merge 6 commits into
openshift:masterfrom
bentito:OCPBUGS-77056-async-sar

Conversation

@bentito

@bentito bentito commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

    • Cache-backed, semaphore-throttled validation for external certificates with configurable concurrent SAR checks.
    • Configurable parallelism for route status writer lease workers.
  • Improvements

    • Hybrid secret informer strategy (faster where allowed; more selective when listing is forbidden).
    • Higher Kubernetes client request throughput for router startup.
    • Architecture-aware debug image builds and more robust route condition/status comparisons.
  • Bug Fixes

    • Preserve “ignored” admission condition details and record completion of external-certificate access checks.
  • Tests

    • Updated and added unit tests for secret/SAR caching behavior and status logic.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@openshift-ci

openshift-ci Bot commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Mar 3, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@openshift-ci openshift-ci Bot requested a review from lihongan March 3, 2026 23:38
@bentito bentito changed the title WIP: OCPBUGS-77056: Make external cert validation asynchronous OCPBUGS-77056: Make external cert validation asynchronous Mar 3, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 325f313 to 214b603 Compare March 3, 2026 23:41
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2026
@bentito bentito marked this pull request as ready for review March 3, 2026 23:41
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 3, 2026
@bentito

bentito commented Mar 3, 2026

Copy link
Copy Markdown
Contributor Author

/hold

@openshift-ci openshift-ci Bot requested review from Miciah and grzpiotrowski March 3, 2026 23:42
@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2026
@coderabbitai

coderabbitai Bot commented Mar 9, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Semaphore-throttled, cached SAR validation for external TLS; new RBAC-aware SharedSecretManager with hybrid informer strategy; SAR-completed route status events and condition equality fixes; WriterLease gains worker concurrency; synchronous test refactors and small build/client wiring tweaks.

Changes

External Certificate Validation, Secret Management, and Route Status

Layer / File(s) Summary
SAR validation with caching and throttling
pkg/router/routeapihelpers/validation.go, pkg/router/routeapihelpers/validation_test.go
Reimplement ValidateTLSExternalCertificate as synchronous, semaphore-throttled, short-TTL-cached SubjectAccessReview checks (get, watch, list); simplify ExtendedValidateRoute to use tls.X509KeyPair directly; remove internal PEM validation helper; add MaxConcurrentSARChecks, cache invalidation, and test helpers; adjust test expectations.
SharedSecretManager RBAC-aware informer strategy
pkg/router/controller/shared_secret_manager.go, pkg/router/controller/shared_secret_manager_test.go
Probe Secrets(list) per namespace to determine RBAC restrictions; create namespace-wide or per-secret informers with matching selectors; support idempotent route registration; normalize and dispatch secret events to matching routes; deduplicate informers; fall back to direct API reads when cache unavailable; add unit tests for hybrid and multi-route scenarios.
RouteSecretManager cache invalidation and status events
pkg/router/controller/route_secret_manager.go, pkg/router/controller/route_secret_manager_test.go
Record ExternalCertificateSARCompleted status after successful downstream handling; synchronously revalidate and repopulate on certain watch.Modified events; invalidate async SAR cache on secret Add/Update/Delete with proper tombstone extraction; refactor tests to use mutex-protected statusRecorder and direct handler invocation instead of informer/channel waiting.
Status equality and ignored condition reasons
pkg/router/controller/contention.go, pkg/router/controller/status.go, pkg/router/controller/status_test.go
Add ExternalCertificateSARCompleted to ignored ingress condition reasons; fix conditionsEqual to require Type/Status match before applying ignore-reason short-circuit; preserve existing ignored Reason/Message in recordIngressCondition when incoming Reason is empty; remove lock from HandleRoute admission path; change conflict handling to return writerlease.None; add test case for ignored reason preservation.
WriterLease multi-worker support
pkg/router/writerlease/writerlease.go, pkg/router/writerlease/writerlease_test.go
Add workers parameter (minimum 1) to New and NewWithBackoff constructors; launch l.workers goroutines in Run coordinated by sync.WaitGroup with graceful shutdown; replace follower-mode sleep with AddAfter requeue; update all tests to pass workers argument.
Wiring and build configuration
go.mod, hack/Makefile.debug, pkg/cmd/infra/router/clientcmd.go, pkg/cmd/infra/router/template.go, pkg/router/router_test.go, pkg/router/template/configmanager/haproxy/testing/haproxy.go
Add go.mod replace for bentito/library-go fork; export GOARCH make variable and pass to image builder; increase Kubernetes client QPS to 50 and Burst to 100; wire SharedSecretManager in template.go and pass MaxConcurrentSARChecks to WriterLease; update test harness; adjust fake HAProxy socket path and error handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Lines 332 and 337 in pkg/router/controller/route_secret_manager.go log "obj" and "tombstone.Obj" parameters which contain entire Kubernetes Secret objects including sensitive Data field with creden... Log only safe metadata like dynamic type name (e.g., fmt.Sprintf("%T", obj)) instead of the entire Secret object; follow the proposed fix in review comments lines 327-339.
Docstring Coverage ⚠️ Warning Docstring coverage is 48.28% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'OCPBUGS-77056: Make external cert validation asynchronous' accurately reflects the primary objective of the changeset—implementing asynchronous SubjectAccessReview checks for external certificate validation to improve router startup performance.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in modified test files use static, descriptive strings. Table-driven tests use static struct field names from hardcoded scenarios, with no dynamic content like timestamps, UUIDs, or...
Test Structure And Quality ✅ Passed The custom check targets Ginkgo tests (It/Describe/BeforeEach/AfterEach patterns), but this PR contains only standard Go unit tests using testing.T. The tests follow good practices: single responsi...
Microshift Test Compatibility ✅ Passed PR adds only standard Go unit tests (testing package), not Ginkgo e2e tests. Custom check for Ginkgo e2e test MicroShift compatibility is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. All test modifications use standard Go testing (testing.T), not Ginkgo framework. The check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed No topology-aware scheduling constraints introduced. PR contains only Go source code changes (controllers, validation, lease management) with no pod specs, affinity rules, replicas calculations, no...
Ote Binary Stdout Contract ✅ Passed This PR contains only library code (no main() functions) and introduces no new process-level code (init(), TestMain(), BeforeSuite(), AfterSuite(), or top-level var initializers) that writes to std...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR does not add any Ginkgo e2e tests. All changes are to production code and standard Go unit tests, which are outside the scope of this IPv6/disconnected network compatibility check.
No-Weak-Crypto ✅ Passed PR introduces no weak cryptography. All cryptographic operations use secure standard library (crypto/tls, crypto/x509, crypto/rsa, crypto/ecdsa). Test data with weak certs (MD5/SHA1) is used approp...
Container-Privileges ✅ Passed PR introduces no new privileged container settings. deploy/router.yaml's existing hostNetwork: true is justified (ingress router binds ports 80/443); container runs as non-root (USER 1001); no priv...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 288-298: The current Add handler unconditionally calls
RecordRouteRejection when a secret is replayed; change it to mirror the
UpdateFunc/SAR-completion logic: check the route's admission (via
populateRouteTLSFromSecret result or route.Status.Conditions) and if the route
is already admitted emit a normal update event instead of calling
RecordRouteRejection, otherwise record the rejection; extract the string
"ExternalCertificateSecretLoaded" into a shared constant (e.g.,
ExtCrtStatusReasonSecretLoaded) and add that constant to the
ignoreIngressConditionReason set in contention.go so the admission-ignoring
logic treats this replay reason as benign. Ensure you update references to
RecordRouteRejection, populateRouteTLSFromSecret, UpdateFunc,
ExternalCertificateSecretLoaded, and ignoreIngressConditionReason accordingly.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 503-556: The async SAR flow stores only a single pending error
(pendingErr) in asyncSARCache for a cacheKey, so subsequent calls (in validate
flow) that hit the cache drop their onComplete callback and never get
re-enqueued; change the cache value to include a list of waiting callbacks (or a
small struct with errs + []onComplete callbacks) keyed by asyncSARCache so that
when you detect a pending entry you append the current onComplete for
route.Namespace/secretName, and in the goroutine after
asyncSARCache.Store(cacheKey, errs) iterate and invoke all stored callbacks (not
just the first) and then replace the cache entry with errs only; update code
paths that read the cache (the cache-hit branch and the goroutine completion) to
handle this new struct and ensure onComplete is invoked for every waiting route.
- Around line 485-490: The global asyncSARCache currently stores completed
validation results forever; change it to avoid permanent sticky entries by (a)
replacing the raw sync.Map value with a small struct that includes the cached
field.ErrorList plus an expiration timestamp or a source type flag, (b) only
writing non-transient successful validation results (or results with a short
TTL) into asyncSARCache, and (c) adding explicit invalidation helpers (e.g.,
InvalidateAsyncSARCache(namespace, secretName) and
InvalidateAllAsyncSARCacheForSecret(namespace, secretName)) and call them from
the secret add/update/delete/recreate paths in the secret manager
(route_secret_manager.go) so secret changes force revalidation; also update
ClearAsyncSARCacheForTest to clear the new structure. Ensure references to
asyncSARCache and ClearAsyncSARCacheForTest (and the code paths mentioned around
the 503-556 region) are updated to use the new value type and invalidation APIs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e07f375c-945a-4e58-b5ec-3347eaff109c

📥 Commits

Reviewing files that changed from the base of the PR and between d8ed355 and 093ad83.

📒 Files selected for processing (5)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go

Comment thread pkg/router/controller/route_secret_manager.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/router/routeapihelpers/validation.go (1)

485-497: ⚠️ Potential issue | 🟠 Major

Completed SAR entries are still permanent.

Only secret add/update/delete clears this cache. A finished allow/deny result for namespace/secretName survives route deletion, later RBAC grants/revocations, and transient SAR/API failures, so a new route that reuses the same secret name can inherit a stale decision without performing a fresh check. Please put completed entries behind a TTL/generation and avoid caching transient internal errors indefinitely.

Also applies to: 593-605

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 485 - 497, The cache
currently stores completed asyncSARResult entries forever; update asyncSARResult
(and uses of asyncSARCache) to include a timestamp or generation counter and
enforce a TTL/generation check before returning cached results so that completed
allow/deny decisions expire after a short interval or when generation changes.
Ensure only stable allow/deny outcomes are cached long-term; if
asyncSARResult.errs contains transient/internal errors, set a much shorter TTL
(or do not mark done) so they trigger fresh SARs. Update the cache read path to
validate timestamp/generation and the write path to record it, and keep
InvalidateAsyncSARCache(namespace, secretName) as-is to support manual
invalidation.
pkg/router/controller/route_secret_manager.go (1)

262-304: ⚠️ Potential issue | 🟠 Major

Don't clear the fresh SAR result on the initial SecretLoaded replay.

RegisterRoute() is only reached after validate() has already completed successfully, so this non-deleted AddFunc path is replaying an already-validated secret. Clearing asyncSARCache here throws away that fresh result, defeats the per-secret cache for shared secrets, and can bounce the route back into a second "authorization check pending" / ExternalCertificateValidationFailed cycle on the SecretLoaded requeue. Keep invalidation on actual secret changes (recreated/updated/deleted), but not on the initial load replay.

Suggested shape
 		AddFunc: func(obj interface{}) {
 			secret := obj.(*kapi.Secret)
 			log.V(4).Info("Secret added for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
-			routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 
 			// Secret re-creation scenario
 			// Check if the route key exists in the deletedSecrets map, indicating that the secret was previously deleted for this route.
 			// If it exists, it means the secret is being recreated. Remove the key from the map and proceed with handling the route.
@@
 			key := generateKey(namespace, routeName)
 			if _, deleted := p.deletedSecrets.LoadAndDelete(key); deleted {
+				routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name)
 				log.V(4).Info("Secret recreated for route", "namespace", namespace, "secret", secret.Name, "route", routeName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager.go` around lines 262 - 304, The
AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.
🧹 Nitpick comments (1)
pkg/router/controller/route_secret_manager_test.go (1)

1355-1372: Put a timeout around these async waits.

These bare channel receives will hang the whole suite on any missed informer callback. Now that the tests intentionally depend on async completion, a tiny helper with select + time.After will turn those deadlocks into normal test failures.

Suggested helper
+func waitForRecorderEvent(t *testing.T, ch <-chan struct{}) {
+	t.Helper()
+	select {
+	case <-ch:
+	case <-time.After(2 * time.Second):
+		t.Fatal("timeout waiting for recorder event")
+	}
+}

Then replace each bare receive with:

-<-recorder.doneCh
+waitForRecorderEvent(t, recorder.doneCh)

Also applies to: 1434-1444, 1496-1516

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/controller/route_secret_manager_test.go` around lines 1355 - 1372,
Replace bare channel receives on recorder.doneCh with a timed wait helper to
avoid test hangs: add a small helper (e.g., waitForDone(ctxTimeout, ch) or
waitWithTimeout(t, ch, duration)) that uses select with the channel and
time.After to fail the test on timeout, then update every occurrence of
"<-recorder.doneCh" in route_secret_manager_test.go (including the occurrences
around the Secret Update flow that use kubeClient.CoreV1().Secrets(...).Update
and the initial Add wait) to call that helper instead so missed informer
callbacks produce a deterministic test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 262-304: The AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.

In `@pkg/router/routeapihelpers/validation.go`:
- Around line 485-497: The cache currently stores completed asyncSARResult
entries forever; update asyncSARResult (and uses of asyncSARCache) to include a
timestamp or generation counter and enforce a TTL/generation check before
returning cached results so that completed allow/deny decisions expire after a
short interval or when generation changes. Ensure only stable allow/deny
outcomes are cached long-term; if asyncSARResult.errs contains
transient/internal errors, set a much shorter TTL (or do not mark done) so they
trigger fresh SARs. Update the cache read path to validate timestamp/generation
and the write path to record it, and keep InvalidateAsyncSARCache(namespace,
secretName) as-is to support manual invalidation.

---

Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 1355-1372: Replace bare channel receives on recorder.doneCh with a
timed wait helper to avoid test hangs: add a small helper (e.g.,
waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses
select with the channel and time.After to fail the test on timeout, then update
every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go
(including the occurrences around the Secret Update flow that use
kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that
helper instead so missed informer callbacks produce a deterministic test
failure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c800f527-b5e1-4386-b6cf-e20eb4726a31

📥 Commits

Reviewing files that changed from the base of the PR and between 093ad83 and 8c6f16a.

📒 Files selected for processing (4)
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/routeapihelpers/validation.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/router/controller/contention.go

@lihongan

Copy link
Copy Markdown

/retest-required

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 561-568: The semaphore (asyncSARSemaphore) is being acquired on
the caller path before launching the goroutine, causing the caller to block when
the limit is reached; move the acquire into the goroutine so the caller can
spawn all per-secret goroutines without blocking, then release the token inside
that goroutine (use defer to ensure release). Concretely: start the goroutine
immediately, perform asyncSARSemaphore <- struct{}{} at the top of the goroutine
body, and keep the existing defer func() { <-asyncSARSemaphore }() to guarantee
release; ensure any early returns in the goroutine still release the token.
- Around line 532-535: The current guard that returns nil when sarc or
secretsGetter is nil hides wiring bugs and skips RBAC/secret validation;
instead, change the conditional in validation.go so that if sarc == nil ||
secretsGetter == nil you return an internal error (e.g., fmt.Errorf or the
package's internal/server error type) describing "missing validation clients"
rather than nil, or alternatively gate this behind a test-only seam/flag so
production never silently succeeds; update callers to propagate/handle the
returned error as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 31462f9b-48bb-466f-8d90-5c399b5fab8b

📥 Commits

Reviewing files that changed from the base of the PR and between 8c6f16a and 871f184.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 526-529: The code currently returns a synthetic pending error on
cache miss without registering a callback if onComplete is nil; change the logic
in the validation path (the block that manipulates cached.callbacks and returns
cached.errs) so that onComplete must never be nil: either (A) make onComplete a
required parameter at the API boundary, or (B) when onComplete == nil execute
the async validation synchronously (drive the same validation routine on the
current goroutine, wait for completion, populate cached.errs, and then return
the real errors) instead of just returning the pending placeholder; if you
choose (B) ensure you reuse the same validation function that enqueues callbacks
(so cached state is updated) and that cached.callbacks is appended only when
onComplete is non-nil or after synchronous completion you return the finalized
errors.
- Around line 572-599: Wrap each SAR and secret GET with a per-request timeout
context (use context.WithTimeout and defer cancel) instead of context.TODO();
for the SAR checks stop using authorizationutil.Authorize and call the SAR
client Create() directly (e.g., sarClient.Create) with that timeout context so
you can inspect both (response, err) separately, treat API/transport errors as
retryable/internal (append field.InternalError or a distinct cached error) and
only append field.Forbidden when the SAR response explicitly denies, and call
secretsGetter.Secrets(route.Namespace).Get with the same bounded context and map
Get errors to NotFound vs InternalError accordingly. Ensure you still reference
routerServiceAccount, secretName, fldPath and preserve existing
field.Forbidden/field.InternalError/field.NotFound usages when classifying the
outcome.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bbebc97e-5cff-40e0-80d7-bff7bf16a68c

📥 Commits

Reviewing files that changed from the base of the PR and between 871f184 and c2f6381.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

Comment thread pkg/router/routeapihelpers/validation.go Outdated
Comment thread pkg/router/routeapihelpers/validation.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/router/routeapihelpers/validation.go (1)

637-639: Consider recovering from callback panics to ensure all callbacks are invoked.

If a callback panics, the remaining callbacks in the slice won't be invoked. This is unlikely but could leave some routes stuck in a pending state if one route's callback handler has a bug.

🛡️ Proposed defensive callback invocation
 		for _, cb := range callbacks {
-			cb(route.Namespace, secretName)
+			func() {
+				defer func() {
+					if r := recover(); r != nil {
+						// Log panic but continue invoking remaining callbacks
+					}
+				}()
+				cb(route.Namespace, secretName)
+			}()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/router/routeapihelpers/validation.go` around lines 637 - 639, Wrap each
callback invocation in a deferred recover to prevent a single panic from
aborting the loop: inside the loop over callbacks (callbacks, cb) call each cb
via a small closure that uses defer + recover to catch any panic, log or report
the panic along with the callback context (e.g., route.Namespace and secretName)
and continue to the next cb; ensure the recovered panic does not re-panic so all
callbacks are invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 637-639: Wrap each callback invocation in a deferred recover to
prevent a single panic from aborting the loop: inside the loop over callbacks
(callbacks, cb) call each cb via a small closure that uses defer + recover to
catch any panic, log or report the panic along with the callback context (e.g.,
route.Namespace and secretName) and continue to the next cb; ensure the
recovered panic does not re-panic so all callbacks are invoked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c9c78ea5-dba5-483e-8d75-13dc6f9905d7

📥 Commits

Reviewing files that changed from the base of the PR and between c2f6381 and dc6b383.

📒 Files selected for processing (1)
  • pkg/router/routeapihelpers/validation.go

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 716e042 to 0a99ee0 Compare March 17, 2026 20:15
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@bentito bentito force-pushed the OCPBUGS-77056-async-sar branch from 0a99ee0 to 4e1eae4 Compare March 23, 2026 13:51
@bentito

bentito commented Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

Depends-On: openshift/library-go#2132

@bentito

bentito commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@bentito

bentito commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/retest

@bentito bentito changed the base branch from master to release-4.19 June 1, 2026 22:26
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:

  • expected the bug to target either version "4.19." or "openshift-4.19.", but it targets "4.22.0" instead
  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-77056 to depend on a bug targeting a version in 4.20.0, 4.20.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

  • Caching and throttling for external-certificate validation

  • Configurable parallelism for route status updates

  • Improvements

  • More reliable, faster secret handling with hybrid informer strategy

  • Increased Kubernetes client request throughput

  • Architecture-aware debug image builds

  • Improved route status comparison to avoid spurious updates

  • Bug Fixes

  • Record completed access-check status for external certificates

  • Tests

  • Revised and added unit tests for secret manager, SAR caching, and status logic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bentito bentito changed the base branch from release-4.19 to master June 1, 2026 22:28
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

This PR implements asynchronous and cached SubjectAccessReview checks for spec.tls.externalCertificate during router startup.

Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit O(N * API_latency) delays and would hit CrashLoopBackOff from failing liveness probes.

This fix maintains full RBAC security checks but executes them asynchronously in the background. A global sync.Map acts as an asyncSARCache to drastically speed up repeated checks for the same underlying secret.

Fixes: OCPBUGS-77056

Summary by CodeRabbit

  • New Features

  • Caching and throttling for external-certificate validation

  • Configurable parallelism for route status updates

  • Improvements

  • More reliable, faster secret handling with hybrid informer strategy

  • Increased Kubernetes client request throughput

  • Architecture-aware debug image builds

  • Improved route status comparison to avoid spurious updates

  • Bug Fixes

  • Record completed access-check status for external certificates

  • Tests

  • Revised and added unit tests for secret manager, SAR caching, and status logic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bentito

bentito commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/retest

This prevents the router from dropping its leader lease under high concurrency when updating route statuses, which was causing a 60-second stall in scale tests.
@bentito

bentito commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

/retest

Comment thread hack/Makefile.debug
new-openshift-router-image:
GO111MODULE=on CGO_ENABLED=0 GOFLAGS=-mod=vendor go build -o openshift-router -gcflags=all="-N -l" ./cmd/openshift-router
$(IMAGEBUILDER) build -t $(LOCAL_IMAGE) -f hack/Dockerfile.debug .
GO111MODULE=on CGO_ENABLED=0 GOOS=$(GOOS) GOARCH=$(GOARCH) GOFLAGS=-mod=vendor go build -o openshift-router -gcflags=all="-N -l" ./cmd/openshift-router

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as on the GOARCH definition

@bentito

bentito commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 16, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from melvinjoseph86 June 16, 2026 15:09
@bentito

bentito commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai can you clean up your old comments and re-review this PR comprehensively?

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# List all review comments made by coderabbitai on PR `#745`
gh api repos/openshift/router/pulls/745/comments \
  --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: (.body | .[0:80])}' 2>/dev/null

echo "---REVIEW COMMENTS ABOVE---"

# Also check issue comments (top-level PR comments)
gh api repos/openshift/router/issues/745/comments \
  --jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: (.body | .[0:80])}' 2>/dev/null

Length of output: 3352


🏁 Script executed:

#!/bin/bash
# Get all review threads for PR `#745` and resolve the ones from coderabbitai
THREADS=$(gh api graphql -f query='
{
  repository(owner: "openshift", name: "router") {
    pullRequest(number: 745) {
      reviewThreads(first: 100) {
        nodes {
          id
          isResolved
          comments(first: 1) {
            nodes {
              author { login }
              body
            }
          }
        }
      }
    }
  }
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.isResolved == false) | select(.comments.nodes[0].author.login == "coderabbitai[bot]") | .id')

echo "Unresolved coderabbitai threads:"
echo "$THREADS"
echo "---"

# Resolve each thread
for thread_id in $THREADS; do
  echo "Resolving thread: $thread_id"
  gh api graphql -f query='
  mutation($threadId: ID!) {
    resolveReviewThread(input: {threadId: $threadId}) {
      thread { id isResolved }
    }
  }' -f threadId="$thread_id" --jq '.data.resolveReviewThread.thread' 2>/dev/null || echo "Failed to resolve $thread_id"
done

Length of output: 1128



@bentito All previous review threads are already resolved (or were previously resolved). I'll now kick off a comprehensive re-review of this PR with the latest changes.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/router/template/configmanager/haproxy/testing/haproxy.go (1)

110-127: ⚠️ Potential issue | 🟠 Major

Propagate fake HAProxy listener failures explicitly instead of panicking/silently exiting.

Line 113 panics inside a goroutine on bind failure, which crashes the test process abruptly and prevents proper error handling. Line 126 returns on any Accept error without surfacing the cause, making test failures flaky and hard to diagnose. The Start() method returns void, so callers cannot detect failures at all.

Convert Start() to return error, replace the panic with explicit error propagation through an error channel, and handle Accept failures by differentiating between shutdown and unexpected errors. Use context.Background() with net.ListenConfig for consistency with Go best practices.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go` around lines
110 - 127, The fakeHAProxy Start() method needs to be refactored to return an
error instead of void, and the goroutine it spawns should propagate errors
explicitly rather than panicking or silently exiting. Replace the panic call on
the net.Listen failure with sending the error through an error channel that
Start() can read from before returning. Update the listener.Accept() error
handling to differentiate between expected errors during shutdown (when
p.shutdown is true) and unexpected errors, propagating unexpected ones through
the error channel or logging mechanism. Additionally, modernize the net.Listen
call to use net.ListenConfig with context.Background() for consistency with
current Go best practices. Ensure the error channel is created, read from in
Start() to detect initialization failures, and properly closed to prevent
goroutine leaks.

Sources: Coding guidelines, Linters/SAST tools

pkg/router/routeapihelpers/validation_test.go (1)

2512-2523: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep this case as a rejection unless the API contract intentionally changed.

The test still says a key field containing certificate/CA-chain material “should be rejected”, but expectedErrors: 0 now accepts that malformed key field and masks the validation regression. Either restore the rejection expectation and fix the validator, or rename/comment this case with an explicit contract-change rationale.

🧪 Proposed test expectation if rejection remains the contract
-			expectedErrors: 0,
+			expectedErrors: 1,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/routeapihelpers/validation_test.go` around lines 2512 - 2523, The
test case in the validation_test.go file has an inconsistency where the test
comment states that a key field containing Public Key/certificate attributes
"should be rejected", but expectedErrors is set to 0, which means the test is
accepting the malformed input rather than rejecting it. Either update the
expectedErrors value to a positive number (likely 1) to match the stated
rejection behavior and ensure the validator in the routeapihelpers package
actually rejects keys containing CA chain material, or update the test comment
and add explicit documentation explaining that the API contract has
intentionally changed to allow this case. Choose the appropriate fix based on
the actual intended API contract.
🧹 Nitpick comments (3)
pkg/router/writerlease/writerlease_test.go (1)

1-148: 💤 Low value

Consider adding multi-worker concurrency tests.

All tests pass workers=1, which maintains deterministic single-worker behavior. Since production usage passes MaxConcurrentSARChecks=50 workers, consider adding a test that exercises concurrent worker behavior—for example, verifying that multiple queued items are processed in parallel and that shutdown cleanly waits for all workers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/writerlease/writerlease_test.go` around lines 1 - 148, Add new
test functions to verify concurrent worker behavior with multiple workers, since
all existing tests use a single worker (third parameter set to 1 in the New
constructor calls). Create tests that instantiate a writerlease with multiple
workers (e.g., 5 or 10 for testing purposes) instead of 1, and verify that
multiple queued work items are processed in parallel and that shutdown
gracefully waits for all workers to complete their tasks. This will ensure the
code behaves correctly under production conditions where MaxConcurrentSARChecks
is 50 workers.
pkg/router/controller/status_test.go (1)

1502-1538: ⚡ Quick win

Assert message preservation too.

This case verifies the ignored Reason is preserved, but it leaves Message empty on both sides. Add a non-empty message to catch regressions that keep Reason while clearing the user-facing detail.

🧪 Proposed test tightening
 					Conditions: []routev1.RouteIngressCondition{{
 						Type:   routev1.RouteAdmitted,
 						Status: corev1.ConditionTrue,
 						Reason: ExtCrtStatusReasonSARCompleted,
+						Message: "SAR check and secret load completed for secret \"tls-secret\"",
 					}}},
 				}},
 			},
@@
 					Conditions: []routev1.RouteIngressCondition{{
 						Type:   routev1.RouteAdmitted,
 						Status: corev1.ConditionTrue,
 						Reason: ExtCrtStatusReasonSARCompleted,
+						Message: "SAR check and secret load completed for secret \"tls-secret\"",
 					}}},
 				}},
 			},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/status_test.go` around lines 1502 - 1538, In the test
case "do not overwrite existing ignored reason with empty reason", add a
non-empty Message field to both the existing condition (in the
route.Status.Ingress[0].Conditions[0]) and the expected condition (in
expectedRoute.Status.Ingress[0].Conditions[0]) with the same value. This ensures
the test verifies that the Message field is also preserved alongside the Reason
field, preventing regressions where the Reason is kept but the Message is
inadvertently cleared.
pkg/router/controller/route_secret_manager_test.go (1)

112-123: ⚡ Quick win

Add a regression case where the downstream plugin rejects the route.

The production fix depends on RecordRouteUpdate(...SARCompleted...) not firing when plugin.HandleRoute returns an error, but this table always uses a successful fakePlugin. Add a scenario with p.err != nil and assert expectedUpdates == nil.

Also applies to: 1114-1118, 1129-1134

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/route_secret_manager_test.go` around lines 112 - 123,
Add a new test scenario to the scenarios table in the route secret manager test
that validates the regression case where the downstream plugin rejects the
route. Create a scenario where the fakePlugin is configured with a non-nil error
value (p.err != nil) to simulate HandleRoute returning an error, and set
expectedUpdates to nil to assert that RecordRouteUpdate is not invoked when the
plugin rejects the route. This ensures the production fix where
RecordRouteUpdate does not fire on plugin errors is properly covered by
regression testing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go.mod`:
- Line 121: Add a comment in go.mod above the replace directive for
github.com/openshift/library-go (the line replacing it with
github.com/bentito/library-go) to clarify the intent and provenance of this
fork. The comment should explicitly document whether this fork is temporary
(pending upstream acceptance of changes) or permanent (intentional architectural
split), and explain the justification such as the HAProxy API improvements (DCM
add/del functionality) that necessitate this divergence from upstream. This
visibility helps future maintainers understand the fork's lifecycle and when it
can potentially be reconciled with upstream openshift/library-go.

In `@pkg/router/controller/route_secret_manager.go`:
- Around line 327-339: In the DeleteFunc handler for the route secret manager,
the error logs are logging entire objects which could expose sensitive data from
Secret Data fields. Replace the logging of whole objects in both error cases:
for the first log.Error call, instead of logging the full obj parameter, log
only the dynamic type or safe metadata; for the second log.Error call, instead
of logging the full tombstone.Obj, log only safe metadata such as the type or
name, not the entire object that may contain sensitive Data fields.

In `@pkg/router/controller/shared_secret_manager_test.go`:
- Around line 42-55: The test creates a SharedSecretManager that uses informers
internally but never verifies cache synchronization or tests the actual
GetSecret behavior that depends on HasSynced(). Either modify the
SharedSecretManager to accept an injectable informer factory so tests can
provide a mock source instead of the real REST client, or extend the test to
call WaitForCacheSync after setting up the manager and then verify informer
functionality by calling GetSecret to ensure the cache lookups work correctly.

In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 107-109: The variable infState is assigned from the map lookup
m.informers[infKey] on the same line as the exists check, but this value is
never used before being potentially overwritten. Remove the infState assignment
from the initial map lookup statement, and only assign infState after confirming
the key exists (after the if !exists block completes), or restructure the code
to avoid the ineffectual assignment by directly using the map lookup result only
when needed.
- Around line 131-141: The AddEventHandler call is ignoring the error return
value, which in k8s.io/client-go v0.35.0 now returns both a
ResourceEventHandlerRegistration and an error. Capture the error return value
from the AddEventHandler method call and check if it is non-nil. If the error
occurs, the route that was previously registered (referenced in the comment as
being registered at lines 101-105) must be rolled back to maintain consistency,
since the handler registration failure means the route cannot process events
properly.

---

Outside diff comments:
In `@pkg/router/routeapihelpers/validation_test.go`:
- Around line 2512-2523: The test case in the validation_test.go file has an
inconsistency where the test comment states that a key field containing Public
Key/certificate attributes "should be rejected", but expectedErrors is set to 0,
which means the test is accepting the malformed input rather than rejecting it.
Either update the expectedErrors value to a positive number (likely 1) to match
the stated rejection behavior and ensure the validator in the routeapihelpers
package actually rejects keys containing CA chain material, or update the test
comment and add explicit documentation explaining that the API contract has
intentionally changed to allow this case. Choose the appropriate fix based on
the actual intended API contract.

In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go`:
- Around line 110-127: The fakeHAProxy Start() method needs to be refactored to
return an error instead of void, and the goroutine it spawns should propagate
errors explicitly rather than panicking or silently exiting. Replace the panic
call on the net.Listen failure with sending the error through an error channel
that Start() can read from before returning. Update the listener.Accept() error
handling to differentiate between expected errors during shutdown (when
p.shutdown is true) and unexpected errors, propagating unexpected ones through
the error channel or logging mechanism. Additionally, modernize the net.Listen
call to use net.ListenConfig with context.Background() for consistency with
current Go best practices. Ensure the error channel is created, read from in
Start() to detect initialization failures, and properly closed to prevent
goroutine leaks.

---

Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 112-123: Add a new test scenario to the scenarios table in the
route secret manager test that validates the regression case where the
downstream plugin rejects the route. Create a scenario where the fakePlugin is
configured with a non-nil error value (p.err != nil) to simulate HandleRoute
returning an error, and set expectedUpdates to nil to assert that
RecordRouteUpdate is not invoked when the plugin rejects the route. This ensures
the production fix where RecordRouteUpdate does not fire on plugin errors is
properly covered by regression testing.

In `@pkg/router/controller/status_test.go`:
- Around line 1502-1538: In the test case "do not overwrite existing ignored
reason with empty reason", add a non-empty Message field to both the existing
condition (in the route.Status.Ingress[0].Conditions[0]) and the expected
condition (in expectedRoute.Status.Ingress[0].Conditions[0]) with the same
value. This ensures the test verifies that the Message field is also preserved
alongside the Reason field, preventing regressions where the Reason is kept but
the Message is inadvertently cleared.

In `@pkg/router/writerlease/writerlease_test.go`:
- Around line 1-148: Add new test functions to verify concurrent worker behavior
with multiple workers, since all existing tests use a single worker (third
parameter set to 1 in the New constructor calls). Create tests that instantiate
a writerlease with multiple workers (e.g., 5 or 10 for testing purposes) instead
of 1, and verify that multiple queued work items are processed in parallel and
that shutdown gracefully waits for all workers to complete their tasks. This
will ensure the code behaves correctly under production conditions where
MaxConcurrentSARChecks is 50 workers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eebcedd7-463a-43a4-832b-5884254e1cc0

📥 Commits

Reviewing files that changed from the base of the PR and between 6761134 and 0212ba6.

⛔ Files ignored due to path filters (6)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/subject.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/util.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/route/secretmanager/manager.go is excluded by !**/vendor/**, !vendor/**
  • vendor/github.com/openshift/library-go/pkg/secret/secret_monitor.go is excluded by !**/vendor/**, !vendor/**
  • vendor/modules.txt is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (17)
  • go.mod
  • hack/Makefile.debug
  • pkg/cmd/infra/router/clientcmd.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/contention.go
  • pkg/router/controller/route_secret_manager.go
  • pkg/router/controller/route_secret_manager_test.go
  • pkg/router/controller/shared_secret_manager.go
  • pkg/router/controller/shared_secret_manager_test.go
  • pkg/router/controller/status.go
  • pkg/router/controller/status_test.go
  • pkg/router/routeapihelpers/validation.go
  • pkg/router/routeapihelpers/validation_test.go
  • pkg/router/router_test.go
  • pkg/router/template/configmanager/haproxy/testing/haproxy.go
  • pkg/router/writerlease/writerlease.go
  • pkg/router/writerlease/writerlease_test.go

Comment thread go.mod Outdated
Comment on lines 327 to +339
DeleteFunc: func(obj interface{}) {
secret := obj.(*kapi.Secret)
secret, ok := obj.(*kapi.Secret)
if !ok {
tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
if !ok {
log.Error(nil, "Couldn't get object from tombstone", "obj", obj)
return
}
secret, ok = tombstone.Obj.(*kapi.Secret)
if !ok {
log.Error(nil, "Tombstone contained object that is not a secret", "obj", tombstone.Obj)
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid logging whole objects from the Secret delete handler.

These fallback logs can serialize unexpected Secret-like/tombstone objects, including Data fields. Log only the dynamic type or safe metadata.

🛡️ Proposed fix
-					log.Error(nil, "Couldn't get object from tombstone", "obj", obj)
+					log.Error(nil, "Couldn't get object from tombstone", "type", fmt.Sprintf("%T", obj))
 					return
 				}
 				secret, ok = tombstone.Obj.(*kapi.Secret)
 				if !ok {
-					log.Error(nil, "Tombstone contained object that is not a secret", "obj", tombstone.Obj)
+					log.Error(nil, "Tombstone contained object that is not a secret", "type", fmt.Sprintf("%T", tombstone.Obj))
 					return
 				}

As per coding guidelines, flag logging that may expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/route_secret_manager.go` around lines 327 - 339, In the
DeleteFunc handler for the route secret manager, the error logs are logging
entire objects which could expose sensitive data from Secret Data fields.
Replace the logging of whole objects in both error cases: for the first
log.Error call, instead of logging the full obj parameter, log only the dynamic
type or safe metadata; for the second log.Error call, instead of logging the
full tombstone.Obj, log only safe metadata such as the type or name, not the
entire object that may contain sensitive Data fields.

Source: Coding guidelines

Comment on lines +42 to +55
client := fake.NewSimpleClientset()
if !s.allowList {
client.PrependReactor("list", "secrets", func(action testing2.Action) (handled bool, ret runtime.Object, err error) {
return true, nil, apierrors.NewForbidden(schema.GroupResource{Resource: "secrets"}, "", fmt.Errorf("restricted"))
})
}

mgr := NewSharedSecretManager(client, nil)
ctx := context.Background()
routeName := "my-route"
secretName := "my-secret"
handler := cache.ResourceEventHandlerFuncs{}

err := mgr.RegisterRoute(ctx, s.namespace, routeName, secretName, handler)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "client-go version declared by this repo:"
rg -n '^\s*k8s\.io/client-go\s+' go.mod

echo "test sites using fake clients with RegisterRoute:"
rg -n --type go -C3 'fake\.NewSimpleClientset\(\)|RegisterRoute\(' pkg/router/controller/shared_secret_manager_test.go

echo "manager informer construction path:"
rg -n --type go -C5 'NewListWatchFromClient|RESTClient\(\)|go inf\.Run' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 2084


🏁 Script executed:

# Read the full test function around line 42-55
sed -n '38,65p' pkg/router/controller/shared_secret_manager_test.go

# Read the second test function around line 95-108
sed -n '92,125p' pkg/router/controller/shared_secret_manager_test.go

# Check if there's any WaitForCacheSync or similar synchronization
rg -n 'WaitForCacheSync|WaitFor|Eventually|sync\.' pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 1761


🏁 Script executed:

# Get the complete first test function
sed -n '33,70p' pkg/router/controller/shared_secret_manager_test.go

# Get the complete second test function  
sed -n '92,140p' pkg/router/controller/shared_secret_manager_test.go

# Also check what RegisterRoute actually does with the informer
sed -n '100,160p' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 3998


🏁 Script executed:

# Check if there are any known issues or failing tests mentioned
git log --oneline --all -S 'shared_secret_manager_test' | head -20

# Check the actual behavior of fake clientset's RESTClient
# First, let's see what the fake clientset provides
rg -n 'CoreV1.*RESTClient' --type go pkg/

# Check if tests are actually passing or if there are CI issues
git log --oneline pkg/router/controller/shared_secret_manager_test.go | head -10

# Inspect what the test assertions are actually checking
sed -n '55,90p' pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 146


🏁 Script executed:

# Read the complete first test function to see all assertions
sed -n '1,100p' pkg/router/controller/shared_secret_manager_test.go | tail -70

# Check if the tests actually wait or just check state immediately
sed -n '55,90p' pkg/router/controller/shared_secret_manager_test.go

# Look at how the informer is used - are events actually expected?
rg -n 'notify\|AddFunc\|UpdateFunc\|DeleteFunc' pkg/router/controller/shared_secret_manager.go -A3

Repository: openshift/router

Length of output: 3254


🏁 Script executed:

# Check if there's any test infrastructure for fake informers or mocking
fd -t f 'informer|mock|fake' pkg/router/controller/ --type f

# Look for any comments or TODOs about this issue
rg -n 'TODO|FIXME|BUG|HACK' pkg/router/controller/shared_secret_manager_test.go

# Check if tests actually run without issues - look for skip markers
rg -n 'Skip\|t\.Skip' pkg/router/controller/shared_secret_manager_test.go

# Understand what happens when fake clientset REST client is called with NewListWatchFromClient
# Check imports to see where these come from
head -30 pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 782


🏁 Script executed:

# Let's check if there's an actual problem reported or if tests pass
# Look at recent commits to see if tests are stable
wc -l pkg/router/controller/shared_secret_manager_test.go

# Check the complete test file to understand full context
cat pkg/router/controller/shared_secret_manager_test.go | head -150

# Check if the manager actually has tests that verify informer behavior (cache syncing)
rg -n 'cache\|sync\|watch\|event' pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 3853


🏁 Script executed:

# Check if tests have any runtime assertions that depend on informer behavior
rg -n 'RangeReactor\|WatchReactor\|PrependWatchReactor' pkg/router/controller/shared_secret_manager_test.go

# Check what happens when the informer tries to list from the fake client
# by looking at what the notify function does
rg -n 'func.*notify' pkg/router/controller/shared_secret_manager.go -A 10

# Check if there are any actual runtime dependencies on the informer working
rg -n 'GetStore\|GetIndexer\|ByIndex' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 412


🏁 Script executed:

# Check if there's documentation or examples of using fake clients with informers
# Look for how the fake client is used elsewhere in the codebase
rg -n 'fake.NewSimpleClientset' pkg/ -A 5 | head -50

# Check if the manager itself has any other usages that might expose issues
rg -n 'NewSharedSecretManager' --type go pkg/

# Look at the complete implementation to understand if REST client is actually needed
sed -n '1,50p' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 6234


🏁 Script executed:

# Check if the fake clientset provides a working REST client
# Look for any comments or documentation about this
rg -n 'RESTClient\|CoreV1' pkg/router/controller/shared_secret_manager.go -B2 -A2

# Check if there are actual watch/list operations happening during tests
# that would expose the problem
rg -n 'Watch\|List' pkg/router/controller/shared_secret_manager_test.go

# Look at whether the tests are integration-style or unit-style
# Check if they verify actual data flow from the informer
rg -n 'secret\|cache\|informer' pkg/router/controller/shared_secret_manager_test.go | grep -v 'SharedSecretManager'

Repository: openshift/router

Length of output: 42


🏁 Script executed:

# Let's check if the tests actually pass or if they're known to be flaky
# Look for CI configuration or test results
fd -t f '.github|.gitlab|Makefile' -x cat {} \; 2>/dev/null | grep -i 'test\|shared_secret' | head -20

# Check the most basic question: Does the fake client's RESTClient actually fail?
# by looking at what happens if you call it
rg -n 'error\|panic' pkg/router/controller/shared_secret_manager.go | head -20

# Check if RegisterRoute or the informer start could actually fail
sed -n '100,165p' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 2641


🏁 Script executed:

# Check if the fake clientset actually provides a usable RESTClient
# by checking the client-go source or documentation
python3 <<'PY'
# Let's check what the actual behavior is when you try to use
# fake clientset's RESTClient with NewListWatchFromClient
# by checking if there are any test utilities for this

import subprocess
import json

# Check if there are any other tests that use informers with fake clients
result = subprocess.run(
    ['rg', '-n', 'NewListWatchFromClient|SharedIndexInformer', '--type', 'go', 'pkg/'],
    capture_output=True,
    text=True
)

print("Usage of NewListWatchFromClient and SharedIndexInformer in tests:")
print(result.stdout)
print("---")
print("Stderr:", result.stderr)
PY

Repository: openshift/router

Length of output: 1408


🏁 Script executed:

# Look at the actual test execution - does it work or fail?
# Check if there are any known issues or warnings
rg -n 'goroutine\|panic\|race\|flaky' pkg/router/controller/shared_secret_manager_test.go

# The key question: Do tests actually verify informer functionality?
# or just the manager's state management?
grep -n 'notify\|eventHandler\|AddFunc\|UpdateFunc' pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 42


🏁 Script executed:

# The key insight: tests only check state, not informer behavior
# Let's verify if tests actually pass in CI
# Look for test execution or known failures

# Check if there's a go.mod and try to understand the test setup better
head -40 go.mod

# Most importantly: do the tests actually call anything that would require
# the informer to work? Let's see if there's any actual watch/list operation
rg -n 'handler\|notify' pkg/router/controller/shared_secret_manager_test.go

# Check if RegisterRoute is supposed to do anything async that tests would miss
sed -n '65,95p' pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 2821


🏁 Script executed:

# Final verification: Let's understand what the tests actually check
# and whether the informer needs to work for these specific tests

# Test 1: What assertions does it make?
sed -n '42,90p' pkg/router/controller/shared_secret_manager_test.go | grep -E 'if |Errorf|Fatalf'

# Test 2: What assertions does it make?
sed -n '95,142p' pkg/router/controller/shared_secret_manager_test.go | grep -E 'if |Errorf|Fatal'

# Check if there are any integration tests that would catch informer issues
fd -t f '*integration*test*' pkg/router/controller/

# Check if the production code is ever tested with real informers
rg -n 'WaitForCacheSync|HasSynced' pkg/router/controller/ --type go

Repository: openshift/router

Length of output: 1847


🏁 Script executed:

# Key finding: Production code uses HasSynced(), but tests don't wait for it
# Let's see where HasSynced is called in production
sed -n '260,270p' pkg/router/controller/shared_secret_manager.go

# Check if the tests would ever reach code that calls HasSynced
rg -n 'GetSecret\|GetSecrets\|informer\.informer\.' pkg/router/controller/shared_secret_manager.go

# Most importantly: verify if fake clientset REST client actually fails
# by checking if there are any actual runtime failures reported
git log --oneline --all --grep='shared_secret' 2>/dev/null | head -10

# Check if there's any test output or CI results
rg -n 'error\|FAIL\|PASS' pkg/router/controller/shared_secret_manager_test.go

Repository: openshift/router

Length of output: 361


🏁 Script executed:

# Check what GetSecret actually does - this is the code path that needs informers
rg -n 'func.*GetSecret' pkg/router/controller/shared_secret_manager.go -A 20 | head -40

# Verify: the tests never call GetSecret, right?
rg -n 'GetSecret' pkg/router/controller/shared_secret_manager_test.go

# Check if there are any production usages of this manager that would fail
rg -n 'GetSecret' pkg/ --type go | grep -v test | grep -v factory

# Most important: Does the fake clientset's RESTClient actually work or not?
# Check the client-go version and known issues
cat go.mod | grep client-go

Repository: openshift/router

Length of output: 1256


Tests don't verify informer synchronization; consider using a test-only informer factory or mocking the list/watch source.

These tests create informers with cache.NewListWatchFromClient but never verify cache synchronization or call GetSecret (which depends on inf.informer.HasSynced()). Tests only check that informers are created and registered in the map. Either inject an informer factory to allow tests to use a mock/fake source instead of the REST client, or add cache sync assertions if informer behavior is in scope.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager_test.go` around lines 42 - 55,
The test creates a SharedSecretManager that uses informers internally but never
verifies cache synchronization or tests the actual GetSecret behavior that
depends on HasSynced(). Either modify the SharedSecretManager to accept an
injectable informer factory so tests can provide a mock source instead of the
real REST client, or extend the test to call WaitForCacheSync after setting up
the manager and then verify informer functionality by calling GetSecret to
ensure the cache lookups work correctly.

Comment thread pkg/router/controller/shared_secret_manager.go
Comment on lines +131 to +141
inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
m.notify(namespace, obj, "Add", nil)
},
UpdateFunc: func(oldObj, newObj interface{}) {
m.notify(namespace, newObj, "Update", oldObj)
},
DeleteFunc: func(obj interface{}) {
m.notify(namespace, obj, "Delete", nil)
},
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd pkg/router/controller && head -200 shared_secret_manager.go | tail -100

Repository: openshift/router

Length of output: 2544


🏁 Script executed:

cat -n pkg/router/controller/shared_secret_manager.go | sed -n '100,160p'

Repository: openshift/router

Length of output: 2055


🏁 Script executed:

rg "AddEventHandler" --type go -B 5 -A 10 pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 451


🏁 Script executed:

head -30 pkg/router/controller/shared_secret_manager.go

Repository: openshift/router

Length of output: 985


🌐 Web query:

Kubernetes client-go cache.SharedIndexInformer AddEventHandler signature error return

💡 Result:

In the Go Kubernetes client-go library, the AddEventHandler method on SharedIndexInformer (and the SharedInformer interface) returns both a ResourceEventHandlerRegistration and an error [1][2][3]. The signature is defined as: AddEventHandler(handler ResourceEventHandler) (ResourceEventHandlerRegistration, error) [1][3] Historical Context and Error Handling: Prior to client-go v1.26.0, AddEventHandler did not return an error [4]. The change to include an error return type was introduced to allow for better error reporting and cleanup in controller implementations [5]. If you are working with an older codebase that expects the previous signature (where it returned nothing), you will need to update your code to handle the returned error [5][6]. Key Points: - Return Value: The method returns a ResourceEventHandlerRegistration (which can be used to remove the handler later) and an error [1][2]. - Compiler Errors: If your code fails to compile because it expects only one return value, it is likely because it is using an outdated signature pattern or is not accounting for the error object returned by the modern API [5]. - Best Practice: Always check the returned error [5][6]. If an error occurs, it indicates that the handler could not be added to the informer [1][2]. Note: This applies specifically to the official Go k8s.io/client-go library. Other language clients, such as client-java or fabric8, maintain different method signatures (often returning void or a fluent builder object) [7][8][9][10].

Citations:


🏁 Script executed:

cat go.mod | grep client-go

Repository: openshift/router

Length of output: 153


🏁 Script executed:

# Let's check the actual current state of the file more carefully around line 131
cat -n pkg/router/controller/shared_secret_manager.go | sed -n '125,160p'

Repository: openshift/router

Length of output: 1200


🏁 Script executed:

# Try to check if there are any build/lint logs or if this would compile
cd pkg/router/controller && git log --oneline shared_secret_manager.go | head -10

Repository: openshift/router

Length of output: 138


🏁 Script executed:

# Check if there's a blank assignment on line 131 that might be escaped in the output
od -c pkg/router/controller/shared_secret_manager.go | sed -n '5500,5700p'

Repository: openshift/router

Length of output: 101


🏁 Script executed:

# Let's also verify the exact signature by checking client-go source
cat go.mod | grep -A 2 "k8s.io/client-go"

Repository: openshift/router

Length of output: 160


🌐 Web query:

k8s.io/client-go v0.35.0 AddEventHandler return signature documentation

💡 Result:

In k8s.io/client-go v0.35.0, the AddEventHandler method in the SharedInformer interface has the following signature [1]: func (s *sharedIndexInformer) AddEventHandler(handler ResourceEventHandler) (ResourceEventHandlerRegistration, error) This method returns two values: 1. ResourceEventHandlerRegistration: A handle for the registered handler, which can be used to remove the handler or check its sync status [1]. 2. error: An error value that will be non-nil if the handler cannot be successfully added to the informer [1]. This return signature was introduced in earlier versions of client-go (beginning with v1.26.0) to allow for error handling during handler registration [2]. Prior versions of the interface did not return these values [3][4][5].

Citations:


Handle AddEventHandler failure before registering the informer.

In k8s.io/client-go v0.35.0, AddEventHandler returns (ResourceEventHandlerRegistration, error). The code ignores this error return at line 131, violating the Go security guideline to never ignore error returns. Since the route is registered at lines 101–105 before AddEventHandler is called, a handler registration failure leaves the route registered without any event processing capability. Capture the error and roll back the route registration.

🛠️ Proposed error handling
-		inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
+		if _, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
 			AddFunc: func(obj interface{}) {
 				m.notify(namespace, obj, "Add", nil)
 			},
 			UpdateFunc: func(oldObj, newObj interface{}) {
 				m.notify(namespace, newObj, "Update", oldObj)
 			},
 			DeleteFunc: func(obj interface{}) {
 				m.notify(namespace, obj, "Delete", nil)
 			},
-		})
+		}); err != nil {
+			delete(m.registeredRoutes, key)
+			m.lock.Unlock()
+			return fmt.Errorf("failed to add secret informer handler for key %s: %w", infKey, err)
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
m.notify(namespace, obj, "Add", nil)
},
UpdateFunc: func(oldObj, newObj interface{}) {
m.notify(namespace, newObj, "Update", oldObj)
},
DeleteFunc: func(obj interface{}) {
m.notify(namespace, obj, "Delete", nil)
},
})
if _, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
m.notify(namespace, obj, "Add", nil)
},
UpdateFunc: func(oldObj, newObj interface{}) {
m.notify(namespace, newObj, "Update", oldObj)
},
DeleteFunc: func(obj interface{}) {
m.notify(namespace, obj, "Delete", nil)
},
}); err != nil {
delete(m.registeredRoutes, key)
m.lock.Unlock()
return fmt.Errorf("failed to add secret informer handler for key %s: %w", infKey, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/shared_secret_manager.go` around lines 131 - 141, The
AddEventHandler call is ignoring the error return value, which in
k8s.io/client-go v0.35.0 now returns both a ResourceEventHandlerRegistration and
an error. Capture the error return value from the AddEventHandler method call
and check if it is non-nil. If the error occurs, the route that was previously
registered (referenced in the comment as being registered at lines 101-105) must
be rolled back to maintain consistency, since the handler registration failure
means the route cannot process events properly.

Source: Coding guidelines

@jcmoraisjr

Copy link
Copy Markdown
Member

/assign

@jcmoraisjr jcmoraisjr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job. A few suggestions and some nitpicking comments, just ignore the ones that does not apply.

Comment on lines +85 to +89
// Release lock during API probe to avoid blocking other registrations.
m.lock.Unlock()
// We use a metadata-only List check to see if we have namespace-wide permissions.
_, err := m.kubeClient.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{Limit: 1})
m.lock.Lock()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this Unlock() is the only reason a single Unlock() is not as a defer in the start of the method. From the comments this is not to block other registrations. Is there a fair decrease of the time implementing this way? Unlocking here is giving a chance for other concurrent calls to grab the lock and create the same registeredRoute as this goroutine is already creating. Also, it seems that this List would only be called once per namespace since it's under the !known branch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The List check only happens once per namespace. However, in high-scale environments (similar to the load testing done by the Performance and Scale team), we might see hundreds of routes created simultaneously in a newly provisioned namespace. Because m.lock is a global lock across the entire SharedSecretManager, if we hold it during the kubeClient.CoreV1().Secrets().List() network call, we would block all other route registrations across all namespaces while waiting for the API server to respond.

By dropping the lock during the network call, we allow routes in other (already known) namespaces to continue registering without being blocked by this API latency. When we re-acquire the lock, we re-evaluate if restricted, known = m.restrictedNamespaces[namespace]; !known to handle the case where another concurrent goroutine completed the API call for this namespace while we were unlocked. Felt this complexity was worth it to prevent a single slow API call from stalling the entire secret management pipeline.

Comment on lines +225 to +226
// Only check routes in the same namespace
if !strings.HasPrefix(k, namespace+"/") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: using types.NamespacedName as the key gives the advantage of not need to concat string when building the key, and a direct comparison here with the Namespace field of the struct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, changing in upcoming commit

Comment thread pkg/router/controller/shared_secret_manager.go
Comment on lines 109 to -111
func (a *StatusAdmitter) HandleRoute(eventType watch.EventType, route *routev1.Route) error {
a.lock.Lock()
defer a.lock.Unlock()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see the Lock() being removed here but not removed from other methods. On a quick look, this method in fact seems thread safe, but wondering if all the other methods should follow the same approach and the lock field being removed altogether. If it is in fact safe here, it should be safe on the others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, we should remove the lock field from StatusAdmitter, remove its initialization in NewStatusAdmitter, and remove the a.lock.Lock() and defer a.lock.Unlock() calls from all RecordRoute* methods. Doing in upcoming commit

Comment on lines -255 to -262
// Validate that final key contains only private key, and cert contains only public keys
if err := validatePEMContent(keyBytes, "PRIVATE KEY"); err != nil {
result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", err.Error()))
}
if err := validatePEMContent(certBytes, "CERTIFICATE"); err != nil {
result = append(result, field.Invalid(tlsFieldPath.Child("certificate"), "redacted certificate data", err.Error()))
}
// Validate if the keypair is valid (eg.: the leaf certificate should be the first on certBytes)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this code be removed? If so, maybe the last line could be preserved since it is the comment of the code below it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, removing and keeping comment

Comment on lines +501 to +503
// sarCache stores successful SAR validation results keyed by "namespace/secretName".
// Only successful validations are cached; failures always trigger fresh checks.
var sarCache sync.Map

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It sounds like this code is claiming for a struct to organize the fields related with the ValidateTLSExternalCertificate() func. This should also allow the removal of the ClearAsyncSARCacheForTest().

Just thinking out loud. The test related method aside, all the rest seems ok in the global area as this is a small package.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be slightly cleaner in a struct, but as you mentioned, given the small footprint here, we opted to keep it at the package level for simplicity and to minimize structural churn on this critical path

Comment on lines +560 to +563
// Acquire semaphore slot — blocks if all slots are in use.
// This throttles concurrent SAR API calls to MaxConcurrentSARChecks.
sarSemaphore <- struct{}{}
defer func() { <-sarSemaphore }()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never had that idea before, nice trick! Wondering if you know and considered Acquire() from "golang.org/x/sync/semaphore". Just curious, no action needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, the channel approach here is pretty standard, dependency-free pattern that covers our exact use case without needing the extra features from x/sync

Comment on lines +48 to +49
func startFakeHAProxyServer(prefix string) (*fakeHAProxy, error) {
f, err := os.CreateTemp(os.TempDir(), prefix)
f, err := os.CreateTemp("/tmp", prefix)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.TempDir() is more idiomatic, did you find some issue using it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in coming commit to use the default temp directory via os.CreateTemp

Comment on lines +574 to +575
timeoutCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the timeout for the API calls, I thought about 10s or so, wondering if the 30s timeout consider some scenario or it is the result of your performance tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the 30s timeout was specifically chosen based on our results during the P&S scale testing. Under high load and concurrency, we observed that API calls could occasionally queue up and take longer than 10s to complete. Setting it to 30s provided enough buffer to prevent spurious timeouts during these peak load events without compromising the overall responsiveness of the router.

Comment on lines 2511 to 2524
{
// Private key containing the CA chain should be rejected
name: "A key field containing Public Key/certificate attributes should be rejected",
route: &routev1.Route{
Spec: routev1.RouteSpec{
TLS: &routev1.TLSConfig{
Termination: routev1.TLSTerminationEdge,
Certificate: leafCertExtractedFromP12,
Key: leafKeyExtractedFromP12 + caChainExtractedFromP12,
},
},
},
expectedErrors: 1,
expectedErrors: 0,
},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit strange, since the test says it's checking a rejection but reporting zero failures. This should be related with the removed validations, so wondering if either the validation is added again or instead this test refactored or renamed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated (in coming commit) the test case name and inline comment to reflect that the condition is now expected to be "allowed" since the strict validation was removed.

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jcmoraisjr. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@bentito: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-serial 2611961 link true /test e2e-aws-serial

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278

gcs278 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

/assign @Miciah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants