OCPBUGS-77056: Make external cert validation asynchronous#745
Conversation
|
Skipping CI for Draft Pull Request. |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
325f313 to
214b603
Compare
|
/hold |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughSemaphore-throttled, cached SAR validation for external TLS; new RBAC-aware SharedSecretManager with hybrid informer strategy; SAR-completed route status events and condition equality fixes; WriterLease gains worker concurrency; synchronous test refactors and small build/client wiring tweaks. ChangesExternal Certificate Validation, Secret Management, and Route Status
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 288-298: The current Add handler unconditionally calls
RecordRouteRejection when a secret is replayed; change it to mirror the
UpdateFunc/SAR-completion logic: check the route's admission (via
populateRouteTLSFromSecret result or route.Status.Conditions) and if the route
is already admitted emit a normal update event instead of calling
RecordRouteRejection, otherwise record the rejection; extract the string
"ExternalCertificateSecretLoaded" into a shared constant (e.g.,
ExtCrtStatusReasonSecretLoaded) and add that constant to the
ignoreIngressConditionReason set in contention.go so the admission-ignoring
logic treats this replay reason as benign. Ensure you update references to
RecordRouteRejection, populateRouteTLSFromSecret, UpdateFunc,
ExternalCertificateSecretLoaded, and ignoreIngressConditionReason accordingly.
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 503-556: The async SAR flow stores only a single pending error
(pendingErr) in asyncSARCache for a cacheKey, so subsequent calls (in validate
flow) that hit the cache drop their onComplete callback and never get
re-enqueued; change the cache value to include a list of waiting callbacks (or a
small struct with errs + []onComplete callbacks) keyed by asyncSARCache so that
when you detect a pending entry you append the current onComplete for
route.Namespace/secretName, and in the goroutine after
asyncSARCache.Store(cacheKey, errs) iterate and invoke all stored callbacks (not
just the first) and then replace the cache entry with errs only; update code
paths that read the cache (the cache-hit branch and the goroutine completion) to
handle this new struct and ensure onComplete is invoked for every waiting route.
- Around line 485-490: The global asyncSARCache currently stores completed
validation results forever; change it to avoid permanent sticky entries by (a)
replacing the raw sync.Map value with a small struct that includes the cached
field.ErrorList plus an expiration timestamp or a source type flag, (b) only
writing non-transient successful validation results (or results with a short
TTL) into asyncSARCache, and (c) adding explicit invalidation helpers (e.g.,
InvalidateAsyncSARCache(namespace, secretName) and
InvalidateAllAsyncSARCacheForSecret(namespace, secretName)) and call them from
the secret add/update/delete/recreate paths in the secret manager
(route_secret_manager.go) so secret changes force revalidation; also update
ClearAsyncSARCacheForTest to clear the new structure. Ensure references to
asyncSARCache and ClearAsyncSARCacheForTest (and the code paths mentioned around
the 503-556 region) are updated to use the new value type and invalidation APIs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e07f375c-945a-4e58-b5ec-3347eaff109c
📒 Files selected for processing (5)
pkg/router/controller/contention.gopkg/router/controller/route_secret_manager.gopkg/router/controller/route_secret_manager_test.gopkg/router/routeapihelpers/validation.gopkg/router/routeapihelpers/validation_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/router/routeapihelpers/validation.go (1)
485-497:⚠️ Potential issue | 🟠 MajorCompleted SAR entries are still permanent.
Only secret add/update/delete clears this cache. A finished allow/deny result for
namespace/secretNamesurvives route deletion, later RBAC grants/revocations, and transient SAR/API failures, so a new route that reuses the same secret name can inherit a stale decision without performing a fresh check. Please put completed entries behind a TTL/generation and avoid caching transient internal errors indefinitely.Also applies to: 593-605
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/routeapihelpers/validation.go` around lines 485 - 497, The cache currently stores completed asyncSARResult entries forever; update asyncSARResult (and uses of asyncSARCache) to include a timestamp or generation counter and enforce a TTL/generation check before returning cached results so that completed allow/deny decisions expire after a short interval or when generation changes. Ensure only stable allow/deny outcomes are cached long-term; if asyncSARResult.errs contains transient/internal errors, set a much shorter TTL (or do not mark done) so they trigger fresh SARs. Update the cache read path to validate timestamp/generation and the write path to record it, and keep InvalidateAsyncSARCache(namespace, secretName) as-is to support manual invalidation.pkg/router/controller/route_secret_manager.go (1)
262-304:⚠️ Potential issue | 🟠 MajorDon't clear the fresh SAR result on the initial
SecretLoadedreplay.
RegisterRoute()is only reached aftervalidate()has already completed successfully, so this non-deletedAddFuncpath is replaying an already-validated secret. ClearingasyncSARCachehere throws away that fresh result, defeats the per-secret cache for shared secrets, and can bounce the route back into a second"authorization check pending"/ExternalCertificateValidationFailedcycle on theSecretLoadedrequeue. Keep invalidation on actual secret changes (recreated/updated/deleted), but not on the initial load replay.Suggested shape
AddFunc: func(obj interface{}) { secret := obj.(*kapi.Secret) log.V(4).Info("Secret added for route", "namespace", namespace, "secret", secret.Name, "route", routeName) - routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name) // Secret re-creation scenario // Check if the route key exists in the deletedSecrets map, indicating that the secret was previously deleted for this route. // If it exists, it means the secret is being recreated. Remove the key from the map and proceed with handling the route. @@ key := generateKey(namespace, routeName) if _, deleted := p.deletedSecrets.LoadAndDelete(key); deleted { + routeapihelpers.InvalidateAsyncSARCache(namespace, secret.Name) log.V(4).Info("Secret recreated for route", "namespace", namespace, "secret", secret.Name, "route", routeName)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/controller/route_secret_manager.go` around lines 262 - 304, The AddFunc for secret events currently calls routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh per-secret async SAR results during the initial cache-sync replay; to fix, remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from the top of the AddFunc in RegisterRoute's secret handler and only invoke routeapihelpers.InvalidateAsyncSARCache when the secret is actually changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete "recreated" branch and in the Update/Delete handlers), using generateKey(namespace, routeName) and p.deletedSecrets to determine real recreation so you don't invalidate the SAR cache on the initial load.
🧹 Nitpick comments (1)
pkg/router/controller/route_secret_manager_test.go (1)
1355-1372: Put a timeout around these async waits.These bare channel receives will hang the whole suite on any missed informer callback. Now that the tests intentionally depend on async completion, a tiny helper with
select+time.Afterwill turn those deadlocks into normal test failures.Suggested helper
+func waitForRecorderEvent(t *testing.T, ch <-chan struct{}) { + t.Helper() + select { + case <-ch: + case <-time.After(2 * time.Second): + t.Fatal("timeout waiting for recorder event") + } +}Then replace each bare receive with:
-<-recorder.doneCh +waitForRecorderEvent(t, recorder.doneCh)Also applies to: 1434-1444, 1496-1516
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/controller/route_secret_manager_test.go` around lines 1355 - 1372, Replace bare channel receives on recorder.doneCh with a timed wait helper to avoid test hangs: add a small helper (e.g., waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses select with the channel and time.After to fail the test on timeout, then update every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go (including the occurrences around the Secret Update flow that use kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that helper instead so missed informer callbacks produce a deterministic test failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 262-304: The AddFunc for secret events currently calls
routeapihelpers.InvalidateAsyncSARCache unconditionally, which discards fresh
per-secret async SAR results during the initial cache-sync replay; to fix,
remove the unconditional call to routeapihelpers.InvalidateAsyncSARCache from
the top of the AddFunc in RegisterRoute's secret handler and only invoke
routeapihelpers.InvalidateAsyncSARCache when the secret is actually
changed/recreated/deleted (e.g., inside the p.deletedSecrets.LoadAndDelete
"recreated" branch and in the Update/Delete handlers), using
generateKey(namespace, routeName) and p.deletedSecrets to determine real
recreation so you don't invalidate the SAR cache on the initial load.
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 485-497: The cache currently stores completed asyncSARResult
entries forever; update asyncSARResult (and uses of asyncSARCache) to include a
timestamp or generation counter and enforce a TTL/generation check before
returning cached results so that completed allow/deny decisions expire after a
short interval or when generation changes. Ensure only stable allow/deny
outcomes are cached long-term; if asyncSARResult.errs contains
transient/internal errors, set a much shorter TTL (or do not mark done) so they
trigger fresh SARs. Update the cache read path to validate timestamp/generation
and the write path to record it, and keep InvalidateAsyncSARCache(namespace,
secretName) as-is to support manual invalidation.
---
Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 1355-1372: Replace bare channel receives on recorder.doneCh with a
timed wait helper to avoid test hangs: add a small helper (e.g.,
waitForDone(ctxTimeout, ch) or waitWithTimeout(t, ch, duration)) that uses
select with the channel and time.After to fail the test on timeout, then update
every occurrence of "<-recorder.doneCh" in route_secret_manager_test.go
(including the occurrences around the Secret Update flow that use
kubeClient.CoreV1().Secrets(...).Update and the initial Add wait) to call that
helper instead so missed informer callbacks produce a deterministic test
failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c800f527-b5e1-4386-b6cf-e20eb4726a31
📒 Files selected for processing (4)
pkg/router/controller/contention.gopkg/router/controller/route_secret_manager.gopkg/router/controller/route_secret_manager_test.gopkg/router/routeapihelpers/validation.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/router/controller/contention.go
|
/retest-required |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 561-568: The semaphore (asyncSARSemaphore) is being acquired on
the caller path before launching the goroutine, causing the caller to block when
the limit is reached; move the acquire into the goroutine so the caller can
spawn all per-secret goroutines without blocking, then release the token inside
that goroutine (use defer to ensure release). Concretely: start the goroutine
immediately, perform asyncSARSemaphore <- struct{}{} at the top of the goroutine
body, and keep the existing defer func() { <-asyncSARSemaphore }() to guarantee
release; ensure any early returns in the goroutine still release the token.
- Around line 532-535: The current guard that returns nil when sarc or
secretsGetter is nil hides wiring bugs and skips RBAC/secret validation;
instead, change the conditional in validation.go so that if sarc == nil ||
secretsGetter == nil you return an internal error (e.g., fmt.Errorf or the
package's internal/server error type) describing "missing validation clients"
rather than nil, or alternatively gate this behind a test-only seam/flag so
production never silently succeeds; update callers to propagate/handle the
returned error as needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 31462f9b-48bb-466f-8d90-5c399b5fab8b
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 526-529: The code currently returns a synthetic pending error on
cache miss without registering a callback if onComplete is nil; change the logic
in the validation path (the block that manipulates cached.callbacks and returns
cached.errs) so that onComplete must never be nil: either (A) make onComplete a
required parameter at the API boundary, or (B) when onComplete == nil execute
the async validation synchronously (drive the same validation routine on the
current goroutine, wait for completion, populate cached.errs, and then return
the real errors) instead of just returning the pending placeholder; if you
choose (B) ensure you reuse the same validation function that enqueues callbacks
(so cached state is updated) and that cached.callbacks is appended only when
onComplete is non-nil or after synchronous completion you return the finalized
errors.
- Around line 572-599: Wrap each SAR and secret GET with a per-request timeout
context (use context.WithTimeout and defer cancel) instead of context.TODO();
for the SAR checks stop using authorizationutil.Authorize and call the SAR
client Create() directly (e.g., sarClient.Create) with that timeout context so
you can inspect both (response, err) separately, treat API/transport errors as
retryable/internal (append field.InternalError or a distinct cached error) and
only append field.Forbidden when the SAR response explicitly denies, and call
secretsGetter.Secrets(route.Namespace).Get with the same bounded context and map
Get errors to NotFound vs InternalError accordingly. Ensure you still reference
routerServiceAccount, secretName, fldPath and preserve existing
field.Forbidden/field.InternalError/field.NotFound usages when classifying the
outcome.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bbebc97e-5cff-40e0-80d7-bff7bf16a68c
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/router/routeapihelpers/validation.go (1)
637-639: Consider recovering from callback panics to ensure all callbacks are invoked.If a callback panics, the remaining callbacks in the slice won't be invoked. This is unlikely but could leave some routes stuck in a pending state if one route's callback handler has a bug.
🛡️ Proposed defensive callback invocation
for _, cb := range callbacks { - cb(route.Namespace, secretName) + func() { + defer func() { + if r := recover(); r != nil { + // Log panic but continue invoking remaining callbacks + } + }() + cb(route.Namespace, secretName) + }() }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/router/routeapihelpers/validation.go` around lines 637 - 639, Wrap each callback invocation in a deferred recover to prevent a single panic from aborting the loop: inside the loop over callbacks (callbacks, cb) call each cb via a small closure that uses defer + recover to catch any panic, log or report the panic along with the callback context (e.g., route.Namespace and secretName) and continue to the next cb; ensure the recovered panic does not re-panic so all callbacks are invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/router/routeapihelpers/validation.go`:
- Around line 637-639: Wrap each callback invocation in a deferred recover to
prevent a single panic from aborting the loop: inside the loop over callbacks
(callbacks, cb) call each cb via a small closure that uses defer + recover to
catch any panic, log or report the panic along with the callback context (e.g.,
route.Namespace and secretName) and continue to the next cb; ensure the
recovered panic does not re-panic so all callbacks are invoked.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c9c78ea5-dba5-483e-8d75-13dc6f9905d7
📒 Files selected for processing (1)
pkg/router/routeapihelpers/validation.go
716e042 to
0a99ee0
Compare
0a99ee0 to
4e1eae4
Compare
|
Depends-On: openshift/library-go#2132 |
|
/retest |
1 similar comment
|
/retest |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
This prevents the router from dropping its leader lease under high concurrency when updating route statuses, which was causing a 60-second stall in scale tests.
|
/retest |
| new-openshift-router-image: | ||
| GO111MODULE=on CGO_ENABLED=0 GOFLAGS=-mod=vendor go build -o openshift-router -gcflags=all="-N -l" ./cmd/openshift-router | ||
| $(IMAGEBUILDER) build -t $(LOCAL_IMAGE) -f hack/Dockerfile.debug . | ||
| GO111MODULE=on CGO_ENABLED=0 GOOS=$(GOOS) GOARCH=$(GOARCH) GOFLAGS=-mod=vendor go build -o openshift-router -gcflags=all="-N -l" ./cmd/openshift-router |
There was a problem hiding this comment.
same comment as on the GOARCH definition
|
/jira refresh |
|
@bentito: This pull request references Jira Issue OCPBUGS-77056, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@coderabbitai can you clean up your old comments and re-review this PR comprehensively? |
🧩 Analysis chain🏁 Script executed: #!/bin/bash
# List all review comments made by coderabbitai on PR `#745`
gh api repos/openshift/router/pulls/745/comments \
--jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: (.body | .[0:80])}' 2>/dev/null
echo "---REVIEW COMMENTS ABOVE---"
# Also check issue comments (top-level PR comments)
gh api repos/openshift/router/issues/745/comments \
--jq '.[] | select(.user.login == "coderabbitai[bot]") | {id: .id, body: (.body | .[0:80])}' 2>/dev/nullLength of output: 3352 🏁 Script executed: #!/bin/bash
# Get all review threads for PR `#745` and resolve the ones from coderabbitai
THREADS=$(gh api graphql -f query='
{
repository(owner: "openshift", name: "router") {
pullRequest(number: 745) {
reviewThreads(first: 100) {
nodes {
id
isResolved
comments(first: 1) {
nodes {
author { login }
body
}
}
}
}
}
}
}' --jq '.data.repository.pullRequest.reviewThreads.nodes[] | select(.isResolved == false) | select(.comments.nodes[0].author.login == "coderabbitai[bot]") | .id')
echo "Unresolved coderabbitai threads:"
echo "$THREADS"
echo "---"
# Resolve each thread
for thread_id in $THREADS; do
echo "Resolving thread: $thread_id"
gh api graphql -f query='
mutation($threadId: ID!) {
resolveReviewThread(input: {threadId: $threadId}) {
thread { id isResolved }
}
}' -f threadId="$thread_id" --jq '.data.resolveReviewThread.thread' 2>/dev/null || echo "Failed to resolve $thread_id"
doneLength of output: 1128
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/router/template/configmanager/haproxy/testing/haproxy.go (1)
110-127:⚠️ Potential issue | 🟠 MajorPropagate fake HAProxy listener failures explicitly instead of panicking/silently exiting.
Line 113 panics inside a goroutine on bind failure, which crashes the test process abruptly and prevents proper error handling. Line 126 returns on any
Accepterror without surfacing the cause, making test failures flaky and hard to diagnose. TheStart()method returns void, so callers cannot detect failures at all.Convert
Start()to returnerror, replace the panic with explicit error propagation through an error channel, and handleAcceptfailures by differentiating between shutdown and unexpected errors. Usecontext.Background()withnet.ListenConfigfor consistency with Go best practices.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go` around lines 110 - 127, The fakeHAProxy Start() method needs to be refactored to return an error instead of void, and the goroutine it spawns should propagate errors explicitly rather than panicking or silently exiting. Replace the panic call on the net.Listen failure with sending the error through an error channel that Start() can read from before returning. Update the listener.Accept() error handling to differentiate between expected errors during shutdown (when p.shutdown is true) and unexpected errors, propagating unexpected ones through the error channel or logging mechanism. Additionally, modernize the net.Listen call to use net.ListenConfig with context.Background() for consistency with current Go best practices. Ensure the error channel is created, read from in Start() to detect initialization failures, and properly closed to prevent goroutine leaks.Sources: Coding guidelines, Linters/SAST tools
pkg/router/routeapihelpers/validation_test.go (1)
2512-2523:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftKeep this case as a rejection unless the API contract intentionally changed.
The test still says a key field containing certificate/CA-chain material “should be rejected”, but
expectedErrors: 0now accepts that malformed key field and masks the validation regression. Either restore the rejection expectation and fix the validator, or rename/comment this case with an explicit contract-change rationale.🧪 Proposed test expectation if rejection remains the contract
- expectedErrors: 0, + expectedErrors: 1,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/routeapihelpers/validation_test.go` around lines 2512 - 2523, The test case in the validation_test.go file has an inconsistency where the test comment states that a key field containing Public Key/certificate attributes "should be rejected", but expectedErrors is set to 0, which means the test is accepting the malformed input rather than rejecting it. Either update the expectedErrors value to a positive number (likely 1) to match the stated rejection behavior and ensure the validator in the routeapihelpers package actually rejects keys containing CA chain material, or update the test comment and add explicit documentation explaining that the API contract has intentionally changed to allow this case. Choose the appropriate fix based on the actual intended API contract.
🧹 Nitpick comments (3)
pkg/router/writerlease/writerlease_test.go (1)
1-148: 💤 Low valueConsider adding multi-worker concurrency tests.
All tests pass
workers=1, which maintains deterministic single-worker behavior. Since production usage passesMaxConcurrentSARChecks=50workers, consider adding a test that exercises concurrent worker behavior—for example, verifying that multiple queued items are processed in parallel and that shutdown cleanly waits for all workers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/writerlease/writerlease_test.go` around lines 1 - 148, Add new test functions to verify concurrent worker behavior with multiple workers, since all existing tests use a single worker (third parameter set to 1 in the New constructor calls). Create tests that instantiate a writerlease with multiple workers (e.g., 5 or 10 for testing purposes) instead of 1, and verify that multiple queued work items are processed in parallel and that shutdown gracefully waits for all workers to complete their tasks. This will ensure the code behaves correctly under production conditions where MaxConcurrentSARChecks is 50 workers.pkg/router/controller/status_test.go (1)
1502-1538: ⚡ Quick winAssert message preservation too.
This case verifies the ignored
Reasonis preserved, but it leavesMessageempty on both sides. Add a non-empty message to catch regressions that keepReasonwhile clearing the user-facing detail.🧪 Proposed test tightening
Conditions: []routev1.RouteIngressCondition{{ Type: routev1.RouteAdmitted, Status: corev1.ConditionTrue, Reason: ExtCrtStatusReasonSARCompleted, + Message: "SAR check and secret load completed for secret \"tls-secret\"", }}}, }}, }, @@ Conditions: []routev1.RouteIngressCondition{{ Type: routev1.RouteAdmitted, Status: corev1.ConditionTrue, Reason: ExtCrtStatusReasonSARCompleted, + Message: "SAR check and secret load completed for secret \"tls-secret\"", }}}, }}, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/status_test.go` around lines 1502 - 1538, In the test case "do not overwrite existing ignored reason with empty reason", add a non-empty Message field to both the existing condition (in the route.Status.Ingress[0].Conditions[0]) and the expected condition (in expectedRoute.Status.Ingress[0].Conditions[0]) with the same value. This ensures the test verifies that the Message field is also preserved alongside the Reason field, preventing regressions where the Reason is kept but the Message is inadvertently cleared.pkg/router/controller/route_secret_manager_test.go (1)
112-123: ⚡ Quick winAdd a regression case where the downstream plugin rejects the route.
The production fix depends on
RecordRouteUpdate(...SARCompleted...)not firing whenplugin.HandleRoutereturns an error, but this table always uses a successfulfakePlugin. Add a scenario withp.err != niland assertexpectedUpdates == nil.Also applies to: 1114-1118, 1129-1134
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/route_secret_manager_test.go` around lines 112 - 123, Add a new test scenario to the scenarios table in the route secret manager test that validates the regression case where the downstream plugin rejects the route. Create a scenario where the fakePlugin is configured with a non-nil error value (p.err != nil) to simulate HandleRoute returning an error, and set expectedUpdates to nil to assert that RecordRouteUpdate is not invoked when the plugin rejects the route. This ensures the production fix where RecordRouteUpdate does not fire on plugin errors is properly covered by regression testing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go.mod`:
- Line 121: Add a comment in go.mod above the replace directive for
github.com/openshift/library-go (the line replacing it with
github.com/bentito/library-go) to clarify the intent and provenance of this
fork. The comment should explicitly document whether this fork is temporary
(pending upstream acceptance of changes) or permanent (intentional architectural
split), and explain the justification such as the HAProxy API improvements (DCM
add/del functionality) that necessitate this divergence from upstream. This
visibility helps future maintainers understand the fork's lifecycle and when it
can potentially be reconciled with upstream openshift/library-go.
In `@pkg/router/controller/route_secret_manager.go`:
- Around line 327-339: In the DeleteFunc handler for the route secret manager,
the error logs are logging entire objects which could expose sensitive data from
Secret Data fields. Replace the logging of whole objects in both error cases:
for the first log.Error call, instead of logging the full obj parameter, log
only the dynamic type or safe metadata; for the second log.Error call, instead
of logging the full tombstone.Obj, log only safe metadata such as the type or
name, not the entire object that may contain sensitive Data fields.
In `@pkg/router/controller/shared_secret_manager_test.go`:
- Around line 42-55: The test creates a SharedSecretManager that uses informers
internally but never verifies cache synchronization or tests the actual
GetSecret behavior that depends on HasSynced(). Either modify the
SharedSecretManager to accept an injectable informer factory so tests can
provide a mock source instead of the real REST client, or extend the test to
call WaitForCacheSync after setting up the manager and then verify informer
functionality by calling GetSecret to ensure the cache lookups work correctly.
In `@pkg/router/controller/shared_secret_manager.go`:
- Around line 107-109: The variable infState is assigned from the map lookup
m.informers[infKey] on the same line as the exists check, but this value is
never used before being potentially overwritten. Remove the infState assignment
from the initial map lookup statement, and only assign infState after confirming
the key exists (after the if !exists block completes), or restructure the code
to avoid the ineffectual assignment by directly using the map lookup result only
when needed.
- Around line 131-141: The AddEventHandler call is ignoring the error return
value, which in k8s.io/client-go v0.35.0 now returns both a
ResourceEventHandlerRegistration and an error. Capture the error return value
from the AddEventHandler method call and check if it is non-nil. If the error
occurs, the route that was previously registered (referenced in the comment as
being registered at lines 101-105) must be rolled back to maintain consistency,
since the handler registration failure means the route cannot process events
properly.
---
Outside diff comments:
In `@pkg/router/routeapihelpers/validation_test.go`:
- Around line 2512-2523: The test case in the validation_test.go file has an
inconsistency where the test comment states that a key field containing Public
Key/certificate attributes "should be rejected", but expectedErrors is set to 0,
which means the test is accepting the malformed input rather than rejecting it.
Either update the expectedErrors value to a positive number (likely 1) to match
the stated rejection behavior and ensure the validator in the routeapihelpers
package actually rejects keys containing CA chain material, or update the test
comment and add explicit documentation explaining that the API contract has
intentionally changed to allow this case. Choose the appropriate fix based on
the actual intended API contract.
In `@pkg/router/template/configmanager/haproxy/testing/haproxy.go`:
- Around line 110-127: The fakeHAProxy Start() method needs to be refactored to
return an error instead of void, and the goroutine it spawns should propagate
errors explicitly rather than panicking or silently exiting. Replace the panic
call on the net.Listen failure with sending the error through an error channel
that Start() can read from before returning. Update the listener.Accept() error
handling to differentiate between expected errors during shutdown (when
p.shutdown is true) and unexpected errors, propagating unexpected ones through
the error channel or logging mechanism. Additionally, modernize the net.Listen
call to use net.ListenConfig with context.Background() for consistency with
current Go best practices. Ensure the error channel is created, read from in
Start() to detect initialization failures, and properly closed to prevent
goroutine leaks.
---
Nitpick comments:
In `@pkg/router/controller/route_secret_manager_test.go`:
- Around line 112-123: Add a new test scenario to the scenarios table in the
route secret manager test that validates the regression case where the
downstream plugin rejects the route. Create a scenario where the fakePlugin is
configured with a non-nil error value (p.err != nil) to simulate HandleRoute
returning an error, and set expectedUpdates to nil to assert that
RecordRouteUpdate is not invoked when the plugin rejects the route. This ensures
the production fix where RecordRouteUpdate does not fire on plugin errors is
properly covered by regression testing.
In `@pkg/router/controller/status_test.go`:
- Around line 1502-1538: In the test case "do not overwrite existing ignored
reason with empty reason", add a non-empty Message field to both the existing
condition (in the route.Status.Ingress[0].Conditions[0]) and the expected
condition (in expectedRoute.Status.Ingress[0].Conditions[0]) with the same
value. This ensures the test verifies that the Message field is also preserved
alongside the Reason field, preventing regressions where the Reason is kept but
the Message is inadvertently cleared.
In `@pkg/router/writerlease/writerlease_test.go`:
- Around line 1-148: Add new test functions to verify concurrent worker behavior
with multiple workers, since all existing tests use a single worker (third
parameter set to 1 in the New constructor calls). Create tests that instantiate
a writerlease with multiple workers (e.g., 5 or 10 for testing purposes) instead
of 1, and verify that multiple queued work items are processed in parallel and
that shutdown gracefully waits for all workers to complete their tasks. This
will ensure the code behaves correctly under production conditions where
MaxConcurrentSARChecks is 50 workers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eebcedd7-463a-43a4-832b-5884254e1cc0
⛔ Files ignored due to path filters (6)
go.sumis excluded by!**/*.sumvendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/subject.gois excluded by!**/vendor/**,!vendor/**vendor/github.com/openshift/library-go/pkg/authorization/authorizationutil/util.gois excluded by!**/vendor/**,!vendor/**vendor/github.com/openshift/library-go/pkg/route/secretmanager/manager.gois excluded by!**/vendor/**,!vendor/**vendor/github.com/openshift/library-go/pkg/secret/secret_monitor.gois excluded by!**/vendor/**,!vendor/**vendor/modules.txtis excluded by!**/vendor/**,!vendor/**
📒 Files selected for processing (17)
go.modhack/Makefile.debugpkg/cmd/infra/router/clientcmd.gopkg/cmd/infra/router/template.gopkg/router/controller/contention.gopkg/router/controller/route_secret_manager.gopkg/router/controller/route_secret_manager_test.gopkg/router/controller/shared_secret_manager.gopkg/router/controller/shared_secret_manager_test.gopkg/router/controller/status.gopkg/router/controller/status_test.gopkg/router/routeapihelpers/validation.gopkg/router/routeapihelpers/validation_test.gopkg/router/router_test.gopkg/router/template/configmanager/haproxy/testing/haproxy.gopkg/router/writerlease/writerlease.gopkg/router/writerlease/writerlease_test.go
| DeleteFunc: func(obj interface{}) { | ||
| secret := obj.(*kapi.Secret) | ||
| secret, ok := obj.(*kapi.Secret) | ||
| if !ok { | ||
| tombstone, ok := obj.(cache.DeletedFinalStateUnknown) | ||
| if !ok { | ||
| log.Error(nil, "Couldn't get object from tombstone", "obj", obj) | ||
| return | ||
| } | ||
| secret, ok = tombstone.Obj.(*kapi.Secret) | ||
| if !ok { | ||
| log.Error(nil, "Tombstone contained object that is not a secret", "obj", tombstone.Obj) | ||
| return | ||
| } |
There was a problem hiding this comment.
Avoid logging whole objects from the Secret delete handler.
These fallback logs can serialize unexpected Secret-like/tombstone objects, including Data fields. Log only the dynamic type or safe metadata.
🛡️ Proposed fix
- log.Error(nil, "Couldn't get object from tombstone", "obj", obj)
+ log.Error(nil, "Couldn't get object from tombstone", "type", fmt.Sprintf("%T", obj))
return
}
secret, ok = tombstone.Obj.(*kapi.Secret)
if !ok {
- log.Error(nil, "Tombstone contained object that is not a secret", "obj", tombstone.Obj)
+ log.Error(nil, "Tombstone contained object that is not a secret", "type", fmt.Sprintf("%T", tombstone.Obj))
return
}As per coding guidelines, flag logging that may expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/router/controller/route_secret_manager.go` around lines 327 - 339, In the
DeleteFunc handler for the route secret manager, the error logs are logging
entire objects which could expose sensitive data from Secret Data fields.
Replace the logging of whole objects in both error cases: for the first
log.Error call, instead of logging the full obj parameter, log only the dynamic
type or safe metadata; for the second log.Error call, instead of logging the
full tombstone.Obj, log only safe metadata such as the type or name, not the
entire object that may contain sensitive Data fields.
Source: Coding guidelines
| client := fake.NewSimpleClientset() | ||
| if !s.allowList { | ||
| client.PrependReactor("list", "secrets", func(action testing2.Action) (handled bool, ret runtime.Object, err error) { | ||
| return true, nil, apierrors.NewForbidden(schema.GroupResource{Resource: "secrets"}, "", fmt.Errorf("restricted")) | ||
| }) | ||
| } | ||
|
|
||
| mgr := NewSharedSecretManager(client, nil) | ||
| ctx := context.Background() | ||
| routeName := "my-route" | ||
| secretName := "my-secret" | ||
| handler := cache.ResourceEventHandlerFuncs{} | ||
|
|
||
| err := mgr.RegisterRoute(ctx, s.namespace, routeName, secretName, handler) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "client-go version declared by this repo:"
rg -n '^\s*k8s\.io/client-go\s+' go.mod
echo "test sites using fake clients with RegisterRoute:"
rg -n --type go -C3 'fake\.NewSimpleClientset\(\)|RegisterRoute\(' pkg/router/controller/shared_secret_manager_test.go
echo "manager informer construction path:"
rg -n --type go -C5 'NewListWatchFromClient|RESTClient\(\)|go inf\.Run' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 2084
🏁 Script executed:
# Read the full test function around line 42-55
sed -n '38,65p' pkg/router/controller/shared_secret_manager_test.go
# Read the second test function around line 95-108
sed -n '92,125p' pkg/router/controller/shared_secret_manager_test.go
# Check if there's any WaitForCacheSync or similar synchronization
rg -n 'WaitForCacheSync|WaitFor|Eventually|sync\.' pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 1761
🏁 Script executed:
# Get the complete first test function
sed -n '33,70p' pkg/router/controller/shared_secret_manager_test.go
# Get the complete second test function
sed -n '92,140p' pkg/router/controller/shared_secret_manager_test.go
# Also check what RegisterRoute actually does with the informer
sed -n '100,160p' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 3998
🏁 Script executed:
# Check if there are any known issues or failing tests mentioned
git log --oneline --all -S 'shared_secret_manager_test' | head -20
# Check the actual behavior of fake clientset's RESTClient
# First, let's see what the fake clientset provides
rg -n 'CoreV1.*RESTClient' --type go pkg/
# Check if tests are actually passing or if there are CI issues
git log --oneline pkg/router/controller/shared_secret_manager_test.go | head -10
# Inspect what the test assertions are actually checking
sed -n '55,90p' pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 146
🏁 Script executed:
# Read the complete first test function to see all assertions
sed -n '1,100p' pkg/router/controller/shared_secret_manager_test.go | tail -70
# Check if the tests actually wait or just check state immediately
sed -n '55,90p' pkg/router/controller/shared_secret_manager_test.go
# Look at how the informer is used - are events actually expected?
rg -n 'notify\|AddFunc\|UpdateFunc\|DeleteFunc' pkg/router/controller/shared_secret_manager.go -A3Repository: openshift/router
Length of output: 3254
🏁 Script executed:
# Check if there's any test infrastructure for fake informers or mocking
fd -t f 'informer|mock|fake' pkg/router/controller/ --type f
# Look for any comments or TODOs about this issue
rg -n 'TODO|FIXME|BUG|HACK' pkg/router/controller/shared_secret_manager_test.go
# Check if tests actually run without issues - look for skip markers
rg -n 'Skip\|t\.Skip' pkg/router/controller/shared_secret_manager_test.go
# Understand what happens when fake clientset REST client is called with NewListWatchFromClient
# Check imports to see where these come from
head -30 pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 782
🏁 Script executed:
# Let's check if there's an actual problem reported or if tests pass
# Look at recent commits to see if tests are stable
wc -l pkg/router/controller/shared_secret_manager_test.go
# Check the complete test file to understand full context
cat pkg/router/controller/shared_secret_manager_test.go | head -150
# Check if the manager actually has tests that verify informer behavior (cache syncing)
rg -n 'cache\|sync\|watch\|event' pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 3853
🏁 Script executed:
# Check if tests have any runtime assertions that depend on informer behavior
rg -n 'RangeReactor\|WatchReactor\|PrependWatchReactor' pkg/router/controller/shared_secret_manager_test.go
# Check what happens when the informer tries to list from the fake client
# by looking at what the notify function does
rg -n 'func.*notify' pkg/router/controller/shared_secret_manager.go -A 10
# Check if there are any actual runtime dependencies on the informer working
rg -n 'GetStore\|GetIndexer\|ByIndex' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 412
🏁 Script executed:
# Check if there's documentation or examples of using fake clients with informers
# Look for how the fake client is used elsewhere in the codebase
rg -n 'fake.NewSimpleClientset' pkg/ -A 5 | head -50
# Check if the manager itself has any other usages that might expose issues
rg -n 'NewSharedSecretManager' --type go pkg/
# Look at the complete implementation to understand if REST client is actually needed
sed -n '1,50p' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 6234
🏁 Script executed:
# Check if the fake clientset provides a working REST client
# Look for any comments or documentation about this
rg -n 'RESTClient\|CoreV1' pkg/router/controller/shared_secret_manager.go -B2 -A2
# Check if there are actual watch/list operations happening during tests
# that would expose the problem
rg -n 'Watch\|List' pkg/router/controller/shared_secret_manager_test.go
# Look at whether the tests are integration-style or unit-style
# Check if they verify actual data flow from the informer
rg -n 'secret\|cache\|informer' pkg/router/controller/shared_secret_manager_test.go | grep -v 'SharedSecretManager'Repository: openshift/router
Length of output: 42
🏁 Script executed:
# Let's check if the tests actually pass or if they're known to be flaky
# Look for CI configuration or test results
fd -t f '.github|.gitlab|Makefile' -x cat {} \; 2>/dev/null | grep -i 'test\|shared_secret' | head -20
# Check the most basic question: Does the fake client's RESTClient actually fail?
# by looking at what happens if you call it
rg -n 'error\|panic' pkg/router/controller/shared_secret_manager.go | head -20
# Check if RegisterRoute or the informer start could actually fail
sed -n '100,165p' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 2641
🏁 Script executed:
# Check if the fake clientset actually provides a usable RESTClient
# by checking the client-go source or documentation
python3 <<'PY'
# Let's check what the actual behavior is when you try to use
# fake clientset's RESTClient with NewListWatchFromClient
# by checking if there are any test utilities for this
import subprocess
import json
# Check if there are any other tests that use informers with fake clients
result = subprocess.run(
['rg', '-n', 'NewListWatchFromClient|SharedIndexInformer', '--type', 'go', 'pkg/'],
capture_output=True,
text=True
)
print("Usage of NewListWatchFromClient and SharedIndexInformer in tests:")
print(result.stdout)
print("---")
print("Stderr:", result.stderr)
PYRepository: openshift/router
Length of output: 1408
🏁 Script executed:
# Look at the actual test execution - does it work or fail?
# Check if there are any known issues or warnings
rg -n 'goroutine\|panic\|race\|flaky' pkg/router/controller/shared_secret_manager_test.go
# The key question: Do tests actually verify informer functionality?
# or just the manager's state management?
grep -n 'notify\|eventHandler\|AddFunc\|UpdateFunc' pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 42
🏁 Script executed:
# The key insight: tests only check state, not informer behavior
# Let's verify if tests actually pass in CI
# Look for test execution or known failures
# Check if there's a go.mod and try to understand the test setup better
head -40 go.mod
# Most importantly: do the tests actually call anything that would require
# the informer to work? Let's see if there's any actual watch/list operation
rg -n 'handler\|notify' pkg/router/controller/shared_secret_manager_test.go
# Check if RegisterRoute is supposed to do anything async that tests would miss
sed -n '65,95p' pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 2821
🏁 Script executed:
# Final verification: Let's understand what the tests actually check
# and whether the informer needs to work for these specific tests
# Test 1: What assertions does it make?
sed -n '42,90p' pkg/router/controller/shared_secret_manager_test.go | grep -E 'if |Errorf|Fatalf'
# Test 2: What assertions does it make?
sed -n '95,142p' pkg/router/controller/shared_secret_manager_test.go | grep -E 'if |Errorf|Fatal'
# Check if there are any integration tests that would catch informer issues
fd -t f '*integration*test*' pkg/router/controller/
# Check if the production code is ever tested with real informers
rg -n 'WaitForCacheSync|HasSynced' pkg/router/controller/ --type goRepository: openshift/router
Length of output: 1847
🏁 Script executed:
# Key finding: Production code uses HasSynced(), but tests don't wait for it
# Let's see where HasSynced is called in production
sed -n '260,270p' pkg/router/controller/shared_secret_manager.go
# Check if the tests would ever reach code that calls HasSynced
rg -n 'GetSecret\|GetSecrets\|informer\.informer\.' pkg/router/controller/shared_secret_manager.go
# Most importantly: verify if fake clientset REST client actually fails
# by checking if there are any actual runtime failures reported
git log --oneline --all --grep='shared_secret' 2>/dev/null | head -10
# Check if there's any test output or CI results
rg -n 'error\|FAIL\|PASS' pkg/router/controller/shared_secret_manager_test.goRepository: openshift/router
Length of output: 361
🏁 Script executed:
# Check what GetSecret actually does - this is the code path that needs informers
rg -n 'func.*GetSecret' pkg/router/controller/shared_secret_manager.go -A 20 | head -40
# Verify: the tests never call GetSecret, right?
rg -n 'GetSecret' pkg/router/controller/shared_secret_manager_test.go
# Check if there are any production usages of this manager that would fail
rg -n 'GetSecret' pkg/ --type go | grep -v test | grep -v factory
# Most important: Does the fake clientset's RESTClient actually work or not?
# Check the client-go version and known issues
cat go.mod | grep client-goRepository: openshift/router
Length of output: 1256
Tests don't verify informer synchronization; consider using a test-only informer factory or mocking the list/watch source.
These tests create informers with cache.NewListWatchFromClient but never verify cache synchronization or call GetSecret (which depends on inf.informer.HasSynced()). Tests only check that informers are created and registered in the map. Either inject an informer factory to allow tests to use a mock/fake source instead of the REST client, or add cache sync assertions if informer behavior is in scope.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/router/controller/shared_secret_manager_test.go` around lines 42 - 55,
The test creates a SharedSecretManager that uses informers internally but never
verifies cache synchronization or tests the actual GetSecret behavior that
depends on HasSynced(). Either modify the SharedSecretManager to accept an
injectable informer factory so tests can provide a mock source instead of the
real REST client, or extend the test to call WaitForCacheSync after setting up
the manager and then verify informer functionality by calling GetSecret to
ensure the cache lookups work correctly.
| inf.AddEventHandler(cache.ResourceEventHandlerFuncs{ | ||
| AddFunc: func(obj interface{}) { | ||
| m.notify(namespace, obj, "Add", nil) | ||
| }, | ||
| UpdateFunc: func(oldObj, newObj interface{}) { | ||
| m.notify(namespace, newObj, "Update", oldObj) | ||
| }, | ||
| DeleteFunc: func(obj interface{}) { | ||
| m.notify(namespace, obj, "Delete", nil) | ||
| }, | ||
| }) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd pkg/router/controller && head -200 shared_secret_manager.go | tail -100Repository: openshift/router
Length of output: 2544
🏁 Script executed:
cat -n pkg/router/controller/shared_secret_manager.go | sed -n '100,160p'Repository: openshift/router
Length of output: 2055
🏁 Script executed:
rg "AddEventHandler" --type go -B 5 -A 10 pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 451
🏁 Script executed:
head -30 pkg/router/controller/shared_secret_manager.goRepository: openshift/router
Length of output: 985
🌐 Web query:
Kubernetes client-go cache.SharedIndexInformer AddEventHandler signature error return
💡 Result:
In the Go Kubernetes client-go library, the AddEventHandler method on SharedIndexInformer (and the SharedInformer interface) returns both a ResourceEventHandlerRegistration and an error [1][2][3]. The signature is defined as: AddEventHandler(handler ResourceEventHandler) (ResourceEventHandlerRegistration, error) [1][3] Historical Context and Error Handling: Prior to client-go v1.26.0, AddEventHandler did not return an error [4]. The change to include an error return type was introduced to allow for better error reporting and cleanup in controller implementations [5]. If you are working with an older codebase that expects the previous signature (where it returned nothing), you will need to update your code to handle the returned error [5][6]. Key Points: - Return Value: The method returns a ResourceEventHandlerRegistration (which can be used to remove the handler later) and an error [1][2]. - Compiler Errors: If your code fails to compile because it expects only one return value, it is likely because it is using an outdated signature pattern or is not accounting for the error object returned by the modern API [5]. - Best Practice: Always check the returned error [5][6]. If an error occurs, it indicates that the handler could not be added to the informer [1][2]. Note: This applies specifically to the official Go k8s.io/client-go library. Other language clients, such as client-java or fabric8, maintain different method signatures (often returning void or a fluent builder object) [7][8][9][10].
Citations:
- 1: https://github.com/kubernetes/client-go/blob/master/tools/cache/shared_informer.go
- 2: https://github.com/kubernetes/client-go/blob/v0.35.3/tools/cache/shared_informer.go
- 3: https://github.com/kubernetes/client-go/blob/3dc7fd5f4c1d8afaf5924c461eae2ab27db0045a/tools/cache/shared_informer.go
- 4: https://github.com/kubernetes/client-go/blob/ccd5becdffb7fd8006e31341baaaacd14db2dcb7/tools/cache/shared_informer.go
- 5: Job: Handle error returned from AddEventHandler function kubernetes/kubernetes#119917
- 6: https://github.com/traefik/traefik/blob/master/pkg/provider/kubernetes/ingress/client.go
- 7: https://javadoc.io/static/io.kubernetes/client-java/16.0.1/io/kubernetes/client/informer/SharedInformer.html
- 8: https://javadoc.io/static/io.kubernetes/client-java/11.0.4/io/kubernetes/client/informer/impl/DefaultSharedIndexInformer.html
- 9: https://javadoc.io/static/io.fabric8/kubernetes-client/5.10.1/io/fabric8/kubernetes/client/informers/impl/DefaultSharedIndexInformer.html
- 10: https://www.javadoc.io/static/io.fabric8/kubernetes-client-api/6.1.1/io/fabric8/kubernetes/client/informers/SharedIndexInformer.html
🏁 Script executed:
cat go.mod | grep client-goRepository: openshift/router
Length of output: 153
🏁 Script executed:
# Let's check the actual current state of the file more carefully around line 131
cat -n pkg/router/controller/shared_secret_manager.go | sed -n '125,160p'Repository: openshift/router
Length of output: 1200
🏁 Script executed:
# Try to check if there are any build/lint logs or if this would compile
cd pkg/router/controller && git log --oneline shared_secret_manager.go | head -10Repository: openshift/router
Length of output: 138
🏁 Script executed:
# Check if there's a blank assignment on line 131 that might be escaped in the output
od -c pkg/router/controller/shared_secret_manager.go | sed -n '5500,5700p'Repository: openshift/router
Length of output: 101
🏁 Script executed:
# Let's also verify the exact signature by checking client-go source
cat go.mod | grep -A 2 "k8s.io/client-go"Repository: openshift/router
Length of output: 160
🌐 Web query:
k8s.io/client-go v0.35.0 AddEventHandler return signature documentation
💡 Result:
In k8s.io/client-go v0.35.0, the AddEventHandler method in the SharedInformer interface has the following signature [1]: func (s *sharedIndexInformer) AddEventHandler(handler ResourceEventHandler) (ResourceEventHandlerRegistration, error) This method returns two values: 1. ResourceEventHandlerRegistration: A handle for the registered handler, which can be used to remove the handler or check its sync status [1]. 2. error: An error value that will be non-nil if the handler cannot be successfully added to the informer [1]. This return signature was introduced in earlier versions of client-go (beginning with v1.26.0) to allow for error handling during handler registration [2]. Prior versions of the interface did not return these values [3][4][5].
Citations:
- 1: https://github.com/kubernetes/client-go/blob/v0.35.0/tools/cache/shared_informer.go
- 2: Job: Handle error returned from AddEventHandler function kubernetes/kubernetes#119917
- 3: https://github.com/kubernetes/client-go/blob/87935b98dd4abdf6477f29baf934d6dd0aba82d4/tools/cache/shared_informer.go
- 4: https://github.com/kubernetes/client-go/blob/0c67a1f23ff9b6db00723652c1c45cdff65b5591/tools/cache/shared_informer.go
- 5: https://github.com/kubernetes/client-go/blob/b425fb657986de75a11b1844886c70b486728935/tools/cache/shared_informer.go
Handle AddEventHandler failure before registering the informer.
In k8s.io/client-go v0.35.0, AddEventHandler returns (ResourceEventHandlerRegistration, error). The code ignores this error return at line 131, violating the Go security guideline to never ignore error returns. Since the route is registered at lines 101–105 before AddEventHandler is called, a handler registration failure leaves the route registered without any event processing capability. Capture the error and roll back the route registration.
🛠️ Proposed error handling
- inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
+ if _, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
m.notify(namespace, obj, "Add", nil)
},
UpdateFunc: func(oldObj, newObj interface{}) {
m.notify(namespace, newObj, "Update", oldObj)
},
DeleteFunc: func(obj interface{}) {
m.notify(namespace, obj, "Delete", nil)
},
- })
+ }); err != nil {
+ delete(m.registeredRoutes, key)
+ m.lock.Unlock()
+ return fmt.Errorf("failed to add secret informer handler for key %s: %w", infKey, err)
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| inf.AddEventHandler(cache.ResourceEventHandlerFuncs{ | |
| AddFunc: func(obj interface{}) { | |
| m.notify(namespace, obj, "Add", nil) | |
| }, | |
| UpdateFunc: func(oldObj, newObj interface{}) { | |
| m.notify(namespace, newObj, "Update", oldObj) | |
| }, | |
| DeleteFunc: func(obj interface{}) { | |
| m.notify(namespace, obj, "Delete", nil) | |
| }, | |
| }) | |
| if _, err := inf.AddEventHandler(cache.ResourceEventHandlerFuncs{ | |
| AddFunc: func(obj interface{}) { | |
| m.notify(namespace, obj, "Add", nil) | |
| }, | |
| UpdateFunc: func(oldObj, newObj interface{}) { | |
| m.notify(namespace, newObj, "Update", oldObj) | |
| }, | |
| DeleteFunc: func(obj interface{}) { | |
| m.notify(namespace, obj, "Delete", nil) | |
| }, | |
| }); err != nil { | |
| delete(m.registeredRoutes, key) | |
| m.lock.Unlock() | |
| return fmt.Errorf("failed to add secret informer handler for key %s: %w", infKey, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/router/controller/shared_secret_manager.go` around lines 131 - 141, The
AddEventHandler call is ignoring the error return value, which in
k8s.io/client-go v0.35.0 now returns both a ResourceEventHandlerRegistration and
an error. Capture the error return value from the AddEventHandler method call
and check if it is non-nil. If the error occurs, the route that was previously
registered (referenced in the comment as being registered at lines 101-105) must
be rolled back to maintain consistency, since the handler registration failure
means the route cannot process events properly.
Source: Coding guidelines
|
/assign |
jcmoraisjr
left a comment
There was a problem hiding this comment.
Good job. A few suggestions and some nitpicking comments, just ignore the ones that does not apply.
| // Release lock during API probe to avoid blocking other registrations. | ||
| m.lock.Unlock() | ||
| // We use a metadata-only List check to see if we have namespace-wide permissions. | ||
| _, err := m.kubeClient.CoreV1().Secrets(namespace).List(ctx, metav1.ListOptions{Limit: 1}) | ||
| m.lock.Lock() |
There was a problem hiding this comment.
It seems this Unlock() is the only reason a single Unlock() is not as a defer in the start of the method. From the comments this is not to block other registrations. Is there a fair decrease of the time implementing this way? Unlocking here is giving a chance for other concurrent calls to grab the lock and create the same registeredRoute as this goroutine is already creating. Also, it seems that this List would only be called once per namespace since it's under the !known branch.
There was a problem hiding this comment.
The List check only happens once per namespace. However, in high-scale environments (similar to the load testing done by the Performance and Scale team), we might see hundreds of routes created simultaneously in a newly provisioned namespace. Because m.lock is a global lock across the entire SharedSecretManager, if we hold it during the kubeClient.CoreV1().Secrets().List() network call, we would block all other route registrations across all namespaces while waiting for the API server to respond.
By dropping the lock during the network call, we allow routes in other (already known) namespaces to continue registering without being blocked by this API latency. When we re-acquire the lock, we re-evaluate if restricted, known = m.restrictedNamespaces[namespace]; !known to handle the case where another concurrent goroutine completed the API call for this namespace while we were unlocked. Felt this complexity was worth it to prevent a single slow API call from stalling the entire secret management pipeline.
| // Only check routes in the same namespace | ||
| if !strings.HasPrefix(k, namespace+"/") { |
There was a problem hiding this comment.
nit: using types.NamespacedName as the key gives the advantage of not need to concat string when building the key, and a direct comparison here with the Namespace field of the struct.
There was a problem hiding this comment.
agreed, changing in upcoming commit
| func (a *StatusAdmitter) HandleRoute(eventType watch.EventType, route *routev1.Route) error { | ||
| a.lock.Lock() | ||
| defer a.lock.Unlock() |
There was a problem hiding this comment.
I can see the Lock() being removed here but not removed from other methods. On a quick look, this method in fact seems thread safe, but wondering if all the other methods should follow the same approach and the lock field being removed altogether. If it is in fact safe here, it should be safe on the others.
There was a problem hiding this comment.
yeah, we should remove the lock field from StatusAdmitter, remove its initialization in NewStatusAdmitter, and remove the a.lock.Lock() and defer a.lock.Unlock() calls from all RecordRoute* methods. Doing in upcoming commit
| // Validate that final key contains only private key, and cert contains only public keys | ||
| if err := validatePEMContent(keyBytes, "PRIVATE KEY"); err != nil { | ||
| result = append(result, field.Invalid(tlsFieldPath.Child("key"), "redacted key data", err.Error())) | ||
| } | ||
| if err := validatePEMContent(certBytes, "CERTIFICATE"); err != nil { | ||
| result = append(result, field.Invalid(tlsFieldPath.Child("certificate"), "redacted certificate data", err.Error())) | ||
| } | ||
| // Validate if the keypair is valid (eg.: the leaf certificate should be the first on certBytes) |
There was a problem hiding this comment.
Should this code be removed? If so, maybe the last line could be preserved since it is the comment of the code below it.
There was a problem hiding this comment.
yep, removing and keeping comment
| // sarCache stores successful SAR validation results keyed by "namespace/secretName". | ||
| // Only successful validations are cached; failures always trigger fresh checks. | ||
| var sarCache sync.Map |
There was a problem hiding this comment.
nit: It sounds like this code is claiming for a struct to organize the fields related with the ValidateTLSExternalCertificate() func. This should also allow the removal of the ClearAsyncSARCacheForTest().
Just thinking out loud. The test related method aside, all the rest seems ok in the global area as this is a small package.
There was a problem hiding this comment.
I agree it would be slightly cleaner in a struct, but as you mentioned, given the small footprint here, we opted to keep it at the package level for simplicity and to minimize structural churn on this critical path
| // Acquire semaphore slot — blocks if all slots are in use. | ||
| // This throttles concurrent SAR API calls to MaxConcurrentSARChecks. | ||
| sarSemaphore <- struct{}{} | ||
| defer func() { <-sarSemaphore }() |
There was a problem hiding this comment.
Never had that idea before, nice trick! Wondering if you know and considered Acquire() from "golang.org/x/sync/semaphore". Just curious, no action needed.
There was a problem hiding this comment.
eh, the channel approach here is pretty standard, dependency-free pattern that covers our exact use case without needing the extra features from x/sync
| func startFakeHAProxyServer(prefix string) (*fakeHAProxy, error) { | ||
| f, err := os.CreateTemp(os.TempDir(), prefix) | ||
| f, err := os.CreateTemp("/tmp", prefix) |
There was a problem hiding this comment.
os.TempDir() is more idiomatic, did you find some issue using it?
There was a problem hiding this comment.
Updated in coming commit to use the default temp directory via os.CreateTemp
| timeoutCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second) | ||
| defer cancel() |
There was a problem hiding this comment.
This is the timeout for the API calls, I thought about 10s or so, wondering if the 30s timeout consider some scenario or it is the result of your performance tests.
There was a problem hiding this comment.
Yes, the 30s timeout was specifically chosen based on our results during the P&S scale testing. Under high load and concurrency, we observed that API calls could occasionally queue up and take longer than 10s to complete. Setting it to 30s provided enough buffer to prevent spurious timeouts during these peak load events without compromising the overall responsiveness of the router.
| { | ||
| // Private key containing the CA chain should be rejected | ||
| name: "A key field containing Public Key/certificate attributes should be rejected", | ||
| route: &routev1.Route{ | ||
| Spec: routev1.RouteSpec{ | ||
| TLS: &routev1.TLSConfig{ | ||
| Termination: routev1.TLSTerminationEdge, | ||
| Certificate: leafCertExtractedFromP12, | ||
| Key: leafKeyExtractedFromP12 + caChainExtractedFromP12, | ||
| }, | ||
| }, | ||
| }, | ||
| expectedErrors: 1, | ||
| expectedErrors: 0, | ||
| }, |
There was a problem hiding this comment.
This looks a bit strange, since the test says it's checking a rejection but reporting zero failures. This should be related with the removed validations, so wondering if either the validation is added again or instead this test refactored or renamed.
There was a problem hiding this comment.
Updated (in coming commit) the test case name and inline comment to reflect that the condition is now expected to be "allowed" since the strict validation was removed.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@bentito: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/assign @Miciah |
This PR implements asynchronous and cached
SubjectAccessReviewchecks forspec.tls.externalCertificateduring router startup.Previously, each external certificate triggered synchronous API validations that blocked the main thread. When scaling to thousands of external cert routes, the router hit
O(N * API_latency)delays and would hitCrashLoopBackOfffrom failing liveness probes.This fix maintains full RBAC security checks but executes them asynchronously in the background. A global
sync.Mapacts as anasyncSARCacheto drastically speed up repeated checks for the same underlying secret.Fixes: OCPBUGS-77056
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Tests