Skip to content

OCPBUGS-87204: Prevent SSRF via FQDN-typed EndpointSlices#788

Open
bentito wants to merge 3 commits into
openshift:masterfrom
bentito:OCPBUGS-86729
Open

OCPBUGS-87204: Prevent SSRF via FQDN-typed EndpointSlices#788
bentito wants to merge 3 commits into
openshift:masterfrom
bentito:OCPBUGS-86729

Conversation

@bentito

@bentito bentito commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections.

This commit updates the ExtendedValidator to inspect all EndpointSlice addresses. If an address is an FQDN, it performs a DNS lookup and blocks the endpoint if it resolves to a restricted IP address.

Includes unit tests to verify malicious FQDNs are successfully blocked.

Joint effort of the author and gemini-pro-3.1-preview

Summary by CodeRabbit

  • New Features

    • Endpoint addresses are now validated and filtered to prevent invalid IPs from being processed
    • FQDN endpoint addresses are automatically resolved to IP addresses for better compatibility
    • Improved endpoint validation with individual error logging for rejected addresses
  • Tests

    • Added comprehensive test coverage for endpoint validation and FQDN resolution scenarios

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 4, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-86729, which is invalid:

  • expected the vulnerability to target either version "5.0." or "openshift-5.0.", but it targets "4.21.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections.

This commit updates the ExtendedValidator to inspect all EndpointSlice addresses. If an address is an FQDN, it performs a DNS lookup and blocks the endpoint if it resolves to a restricted IP address.

Includes unit tests to verify malicious FQDNs are successfully blocked.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The PR adds endpoint address validation and DNS resolution capabilities to the router controller. ExtendedValidator now filters endpoint IPs against restricted categories before forwarding events. RouterController optionally resolves FQDN endpoints to IPs using a new EndpointResolver. Factory wiring initializes both components unconditionally.

Changes

Endpoint Validation and FQDN Resolution

Layer / File(s) Summary
Endpoint address validation
pkg/router/controller/extended_validator.go, pkg/router/controller/extended_validator_test.go
ExtendedValidator.HandleEndpoints filters each subset's Addresses and NotReadyAddresses using strict IP parsing and semantic checks that reject unspecified, loopback, link-local, and cloud metadata IPs. Invalid addresses are logged and omitted. Tests cover restricted IP detection, individual address validation, and filtered endpoint forwarding.
DNS endpoint resolution
pkg/router/controller/resolver.go, pkg/router/controller/resolver_test.go
EndpointResolver interface and DNSEndpointResolver implementation resolve hostnames to IPs using net.Resolver.LookupIPAddr with a 10-second deadline, first 5-second timeout, and single retry (5 seconds). Results are sorted with IPv6 addresses before IPv4. Tests verify IP sorting behavior.
Router controller FQDN expansion
pkg/router/controller/router_controller.go, pkg/router/controller/router_controller_test.go
RouterController gains optional Resolver field. During HandleEndpointSlice, when resolver is set, resolveEndpointAddresses expands non-IP EndpointAddress values (FQDNs) by invoking Resolver.ResolveEndpointAddress, applying expansion to both Addresses and NotReadyAddresses subsets, and re-sorting. Resolution failures are logged and addresses omitted. Tests exercise resolver success, failure, and no-resolver passthrough cases across mixed address types.
Factory and template wiring
pkg/router/controller/factory/factory.go, pkg/cmd/infra/router/template.go, pkg/router/controller/factory/factory_endpointslices_test.go, pkg/router/template/plugin_test.go
Factory unconditionally initializes RouterController.Resolver to NewDNSEndpointResolver(). Template changes wrap plugin with NewExtendedValidator(..., o.ExtendedValidation) unconditionally (formerly conditional). Test fixtures and call sites are updated for consistency.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

approved, lgtm, verified

Suggested reviewers

  • frobware

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error Error logs expose internal FQDNs via resolver.go fmt.Errorf calls (lines 41, 50, 55) which include hostname. These errors logged in router_controller.go line 287 expose customer/internal hostnames. In resolver.go, sanitize error messages to exclude the hostname: replace fmt.Errorf("resolving %q: ...", hostname, ...) with generic fmt.Errorf("resolving endpoint address: ...") to prevent FQDN exposure in logs.
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (13 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main security objective: preventing SSRF attacks via FQDN-typed EndpointSlices. It accurately reflects the core changes across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Custom check targets Ginkgo tests but PR uses only standard Go tests. All test case names are static, descriptive strings with no dynamic values.
Test Structure And Quality ✅ Passed Custom check is for Ginkgo test code; PR contains only standard Go tests with *testing.T. No Ginkgo tests (Describe/It blocks) found in PR or repository.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All new tests are standard Go unit tests (using testing.T, not Ginkgo patterns), located in pkg/router/controller/ alongside other unit tests.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All new test files use standard Go testing package (func TestXxx(t *testing.T)), not Ginkgo framework. The check only applies to Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed PR contains only Go code changes for SSRF prevention; deployment manifest (router.yaml) lacks topology-incompatible constraints (no affinity, spread, CP nodeSelectors, or PDBs).
Ote Binary Stdout Contract ✅ Passed No OTE Binary Stdout Contract violations found. All logging uses klogr/stderr, no fmt.Print calls exist, and no process-level code writes to stdout.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added in this PR. All new tests are standard Go unit tests (testing.T), not Ginkgo-style tests (It/Describe/Context). Check not applicable.
No-Weak-Crypto ✅ Passed No weak cryptography (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in PR files.
Container-Privileges ✅ Passed PR contains only Go source code changes; no container/K8s manifests or Dockerfiles are modified, so the privileged-container check is not applicable.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from frobware and jcmoraisjr June 4, 2026 19:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/router/controller/extended_validator_test.go (2)

39-89: ⚡ Quick win

Add IPv6 loopback test case for complete coverage.

The test only covers IPv4 addresses. Add a test case for IPv6 loopback (::1) to verify the validation handles IPv6 correctly.

Proposed addition
{
	name: "restricted IPv6 loopback",
	endpoints: &kapi.Endpoints{
		Subsets: []kapi.EndpointSubset{
			{
				Addresses: []kapi.EndpointAddress{{IP: "::1"}},
			},
		},
	},
	expectBlocked: true,
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/extended_validator_test.go` around lines 39 - 89, Add
an IPv6 loopback case to the TestExtendedValidator_HandleEndpoints tests slice
so the validator is exercised for IPv6; specifically add a test entry named
"restricted IPv6 loopback" in the tests array (inside
TestExtendedValidator_HandleEndpoints) with endpoints containing a
kapi.EndpointAddress whose IP is "::1" and expectBlocked: true to mirror the
existing IPv4 loopback/blocked cases.

39-89: ⚡ Quick win

Add test case for NotReadyAddresses to match implementation coverage.

The HandleEndpoints implementation validates both Addresses and NotReadyAddresses (lines 45-58 of extended_validator.go), but the tests only cover Addresses.

Proposed addition
{
	name: "restricted IP in NotReadyAddresses",
	endpoints: &kapi.Endpoints{
		Subsets: []kapi.EndpointSubset{
			{
				NotReadyAddresses: []kapi.EndpointAddress{{IP: "169.254.169.254"}},
			},
		},
	},
	expectBlocked: true,
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/extended_validator_test.go` around lines 39 - 89, The
test suite TestExtendedValidator_HandleEndpoints covers Addresses but not
NotReadyAddresses used by HandleEndpoints; add a test case that mirrors the
existing restricted-IP cases but puts the IP (e.g. "169.254.169.254") into
EndpointSubset.NotReadyAddresses to assert expectBlocked == true so the tests
exercise validation of NotReadyAddresses in HandleEndpoints.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/extended_validator.go`:
- Around line 69-84: Replace the long stream-of-consciousness block around the
net.LookupIP(address) failure handling in extended_validator.go with a concise
comment that documents the TOCTOU/DNS-rebinding limitation (attacker can return
a benign IP now and 169.254.169.254 later when HAProxy resolves) and either mark
it as a known limitation with a TODO to investigate mitigation (e.g., pin/cached
resolution in HAProxy) or reference a config option to enforce resolution
caching; keep the actual behavior (return fmt.Errorf("failed to resolve FQDN %q:
%v", address, err)) unchanged but ensure the comment is one short sentence
calling out the risk and next steps.
- Around line 70-84: The code currently calls net.LookupIP(address) without a
context; replace that call with a context-aware resolver using
(*net.Resolver).LookupIPAddr(ctx, address) so you can control
timeouts/cancellation and inject a test resolver. Update the function that
performs this validation to accept or use a context.Context (or a resolver
parameter) and call resolver.LookupIPAddr(ctx, address), handle the returned
[]net.IPAddr, and preserve the existing error handling (returning
fmt.Errorf("failed to resolve FQDN %q: %v", address, err)). Ensure tests can
inject a custom *net.Resolver (or mock) and set a timeout on the context when
performing production lookups.
- Around line 96-105: The checkRestrictedIP function currently only blocks a
single metadata IP and loopback; update it to block the full link-local range
and Azure wireserver by using net.IP's IsLinkLocalUnicast() to reject
169.254.0.0/16 and fe80::/10 and explicitly check for 168.63.129.16 as well as
existing loopback checks; locate the function checkRestrictedIP and replace the
single-address comparison with an IsLinkLocalUnicast() check, add a direct
comparison for net.ParseIP("168.63.129.16"), and ensure all error messages
include the offending IP via ip.String().

---

Nitpick comments:
In `@pkg/router/controller/extended_validator_test.go`:
- Around line 39-89: Add an IPv6 loopback case to the
TestExtendedValidator_HandleEndpoints tests slice so the validator is exercised
for IPv6; specifically add a test entry named "restricted IPv6 loopback" in the
tests array (inside TestExtendedValidator_HandleEndpoints) with endpoints
containing a kapi.EndpointAddress whose IP is "::1" and expectBlocked: true to
mirror the existing IPv4 loopback/blocked cases.
- Around line 39-89: The test suite TestExtendedValidator_HandleEndpoints covers
Addresses but not NotReadyAddresses used by HandleEndpoints; add a test case
that mirrors the existing restricted-IP cases but puts the IP (e.g.
"169.254.169.254") into EndpointSubset.NotReadyAddresses to assert expectBlocked
== true so the tests exercise validation of NotReadyAddresses in
HandleEndpoints.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3de770d-618d-44fe-a4d4-c7ac39f1c6cb

📥 Commits

Reviewing files that changed from the base of the PR and between 6761134 and 30b2aff.

📒 Files selected for processing (2)
  • pkg/router/controller/extended_validator.go
  • pkg/router/controller/extended_validator_test.go

Comment thread pkg/router/controller/extended_validator.go Outdated
Comment thread pkg/router/controller/extended_validator.go Outdated
Comment thread pkg/router/controller/extended_validator.go
if ip.IsLoopback() {
return fmt.Errorf("IP address %s is a restricted loopback IP", ip.String())
}
if ip.IsLinkLocalUnicast() {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify for anyone reviewing this diff against the original PR description: we replaced the explicit literal check for 169.254.169.254 with ip.IsLinkLocalUnicast()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to add code comments here, and also here:

var (
azureMetadata = net.ParseIP("168.63.129.16")
awsIPv6IMDS = net.ParseIP("fd00:ec2::254")
)

Something like this:

	// The AWS IMDS endpoint 169.254.169.254 is covered by the ip.IsLinkLocalUnicast()
	// check in `checkRestrictedIP`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@bentito bentito changed the title OCPBUGS-86729: Prevent SSRF via FQDN-typed EndpointSlices OCPBUGS-87204: Prevent SSRF via FQDN-typed EndpointSlices Jun 5, 2026
@bentito

bentito commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

defer cancel()
addrs, err := resolver.LookupIPAddr(lookupCtx, address)
if err != nil {
// TOCTOU Limitation: An attacker can bypass this check by returning a benign IP now and

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is valid but a very edge case IMO. The attacker would need to either:

  • Control the DNS of the cluster, which brings a different problem
  • Be able to rename the "invalid" domain to a valid domain on a good TTL window (doable)

I think long term we can consider something like external-check command /usr/local/bin/check_ip_blacklist.sh where HAProxy can call a binary running on the same container of HAProxy passing the hostname or IP and leaving this binary to decide if it is safe or not.

log.Error(err, "skipping endpoints due to invalid configuration", "endpoints", fmt.Sprintf("%s/%s", endpoints.Namespace, endpoints.Name))
// We don't have a recordEndpointRejection mechanism, so we log and drop the event.
// This prevents the router from proxying to the malicious backend.
return fmt.Errorf("invalid endpoint configuration: %v", err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious and can't remember, what happens when we return an error? Does this block any further configuration of the router, or does this block just the specific route? (I think it is just the route, but I cannot remember).

Also, would it make sense to aggregate all of the errors on a []err and return at once?

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 5, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-87204, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections.

This commit updates the ExtendedValidator to inspect all EndpointSlice addresses. If an address is an FQDN, it performs a DNS lookup and blocks the endpoint if it resolves to a restricted IP address.

Includes unit tests to verify malicious FQDNs are successfully blocked.

Joint effort of the author and gemini-pro-3.1-preview

Summary by CodeRabbit

  • New Features

  • Added an option to enable endpoint address validation and a CLI flag to control it.

  • EndpointSlice handling can now filter addresses by type when validation is enabled.

  • Bug Fixes

  • Endpoint validation now checks all addresses in subsets and rejects unspecified, loopback, link-local, and cloud metadata IPs; invalid addresses cause endpoints to be dropped.

  • Tests

  • Added unit tests covering address validation, filtering, and forwarding behavior.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/cmd/infra/router/router.go (1)

110-110: ⚡ Quick win

Clarify the help text for FQDN handling.

The current help text states "FQDN-typed EndpointSlices are rejected," which implies all FQDN-typed EndpointSlices are blocked unconditionally. Based on the PR description, the actual behavior is that FQDNs are resolved via DNS lookup and rejected only if they resolve to restricted IP addresses (unspecified, loopback, link-local, or Azure metadata). Please update the text to reflect this nuance.

📝 Suggested help text revision
-	flag.BoolVar(&o.EndpointAddressValidation, "endpoint-address-validation", isTrue(env("ENDPOINT_ADDRESS_VALIDATION", "true")), "If set, endpoint addresses are validated against restricted IP ranges and FQDN-typed EndpointSlices are rejected. Defaults to true.")
+	flag.BoolVar(&o.EndpointAddressValidation, "endpoint-address-validation", isTrue(env("ENDPOINT_ADDRESS_VALIDATION", "true")), "If set, endpoint addresses are validated against restricted IP ranges (loopback, link-local, unspecified, and Azure metadata IPs). FQDN-typed EndpointSlice addresses are resolved and rejected if they resolve to restricted IPs. Defaults to true.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cmd/infra/router/router.go` at line 110, Update the help text for the
flag registration where flag.BoolVar is called with &o.EndpointAddressValidation
to clarify FQDN handling: change the message to state that FQDN-typed
EndpointSlices are resolved via DNS and only rejected if they resolve to
restricted IP addresses (unspecified, loopback, link-local, or Azure metadata)
rather than being unconditionally rejected; ensure the new string still mentions
validation against restricted IP ranges and that the default remains true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/cmd/infra/router/router.go`:
- Line 110: Update the help text for the flag registration where flag.BoolVar is
called with &o.EndpointAddressValidation to clarify FQDN handling: change the
message to state that FQDN-typed EndpointSlices are resolved via DNS and only
rejected if they resolve to restricted IP addresses (unspecified, loopback,
link-local, or Azure metadata) rather than being unconditionally rejected;
ensure the new string still mentions validation against restricted IP ranges and
that the default remains true.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b0de28c0-7411-416c-b5e0-00c08775a70f

📥 Commits

Reviewing files that changed from the base of the PR and between 1f474b4 and 5d3b8c9.

📒 Files selected for processing (9)
  • pkg/cmd/infra/router/router.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/extended_validator.go
  • pkg/router/controller/extended_validator_test.go
  • pkg/router/controller/factory/factory.go
  • pkg/router/controller/factory/factory_endpointslices_test.go
  • pkg/router/controller/router_controller.go
  • pkg/router/controller/router_controller_test.go
  • pkg/router/template/plugin_test.go

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@bentito: This pull request references Jira Issue OCPBUGS-87204, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections.

This commit updates the ExtendedValidator to inspect all EndpointSlice addresses. If an address is an FQDN, it performs a DNS lookup and blocks the endpoint if it resolves to a restricted IP address.

Includes unit tests to verify malicious FQDNs are successfully blocked.

Joint effort of the author and gemini-pro-3.1-preview

Summary by CodeRabbit

  • New Features

  • Added an option and CLI flag to enable endpoint address validation.

  • EndpointSlice handling can filter addresses by type when validation is enabled.

  • Bug Fixes

  • Validation now checks all endpoint addresses in subsets and rejects unspecified, loopback, link-local, and known cloud metadata IPs; invalid addresses cause endpoints to be dropped and the event rejected.

  • Tests

  • Added unit tests covering address validation, endpoint forwarding, and address-type filtering.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/router/controller/router_controller_test.go (1)

171-177: ⚡ Quick win

Count-only assertion is too weak for mixed-slice filtering.

This can miss regressions where the wrong address survives filtering. Assert exact retained addresses, not just total count.

Suggested strengthening
-       totalAddrs := 0
-       for _, subset := range event.endpoints.Subsets {
-           totalAddrs += len(subset.Addresses)
-       }
-       if totalAddrs != tc.expectedAddrCount {
-           t.Errorf("expected %d addresses, got %d", tc.expectedAddrCount, totalAddrs)
-       }
+       got := sets.NewString()
+       for _, subset := range event.endpoints.Subsets {
+           for _, addr := range subset.Addresses {
+               got.Insert(addr.IP)
+           }
+       }
+       if got.Len() != tc.expectedAddrCount {
+           t.Errorf("expected %d addresses, got %d", tc.expectedAddrCount, got.Len())
+       }
+       // Add per-test expected address set and compare exactly:
+       // if diff := cmp.Diff(tc.expectedAddresses, got.UnsortedList()); diff != "" { ... }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/router_controller_test.go` around lines 171 - 177,
Replace the weak count-only assertion for endpoint addresses with an
exact-address comparison: iterate event.endpoints.Subsets and collect all
subset.Addresses into a single slice (e.g., gotAddrs), then compare gotAddrs
against the test-case's expected addresses (add a tc.expectedAddrs field if
needed) using reflect.DeepEqual or cmp.Diff and call t.Errorf with the diff when
they mismatch instead of the current count check that uses tc.expectedAddrCount.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/router/controller/router_controller_test.go`:
- Around line 171-177: Replace the weak count-only assertion for endpoint
addresses with an exact-address comparison: iterate event.endpoints.Subsets and
collect all subset.Addresses into a single slice (e.g., gotAddrs), then compare
gotAddrs against the test-case's expected addresses (add a tc.expectedAddrs
field if needed) using reflect.DeepEqual or cmp.Diff and call t.Errorf with the
diff when they mismatch instead of the current count check that uses
tc.expectedAddrCount.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8a635ad1-bb51-4764-8cc0-b0b1b4efcaa8

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3b8c9 and f9adb1c.

📒 Files selected for processing (9)
  • pkg/cmd/infra/router/router.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/extended_validator.go
  • pkg/router/controller/extended_validator_test.go
  • pkg/router/controller/factory/factory.go
  • pkg/router/controller/factory/factory_endpointslices_test.go
  • pkg/router/controller/router_controller.go
  • pkg/router/controller/router_controller_test.go
  • pkg/router/template/plugin_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • pkg/router/controller/router_controller.go
  • pkg/router/template/plugin_test.go
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/factory/factory.go
  • pkg/router/controller/extended_validator.go
  • pkg/router/controller/extended_validator_test.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/router/controller/router_controller.go (1)

267-275: ⚖️ Poor tradeoff

Consider aggregating resolution errors instead of silently skipping failed resolutions.

When multiple FQDN addresses fail to resolve within a single endpoint subset, the current implementation logs each failure individually but provides no aggregate signal. For observability and debugging, consider counting failed resolutions per endpoint and emitting a structured log or metric at the subset level.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/router/controller/router_controller.go` around lines 267 - 275, The
resolveEndpointAddresses loop currently calls resolveAddressList for each
Endpoints.Subsets' Addresses and NotReadyAddresses and only logs individual DNS
failures; change resolveEndpointAddresses to aggregate resolution failures per
subset by having resolveAddressList return both the resolved list and a count
(or error slice) of failed resolutions, then sum failures for Addresses and
NotReadyAddresses for each endpoints.Subsets entry and emit a single structured
log/metric (including nsName, subset index/identifier, totalFailures) instead of
only per-address logs; update calls in resolveEndpointAddresses and adjust
endpointsubset.SortAddresses usage to use the resolved lists.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/router/controller/router_controller.go`:
- Line 285: Replace the use of context.TODO() when calling
c.Resolver.ResolveEndpointAddress with a cancellable context that has a finite
deadline (e.g., create a child context from the controller/request context via
context.WithTimeout(ctx, 15*time.Second)), pass that context into
ResolveEndpointAddress, and ensure you defer the cancel() so cancellation
propagates during shutdown; update the call site around resolved, err :=
c.Resolver.ResolveEndpointAddress(...) to use the new timed context instead of
context.TODO().

---

Nitpick comments:
In `@pkg/router/controller/router_controller.go`:
- Around line 267-275: The resolveEndpointAddresses loop currently calls
resolveAddressList for each Endpoints.Subsets' Addresses and NotReadyAddresses
and only logs individual DNS failures; change resolveEndpointAddresses to
aggregate resolution failures per subset by having resolveAddressList return
both the resolved list and a count (or error slice) of failed resolutions, then
sum failures for Addresses and NotReadyAddresses for each endpoints.Subsets
entry and emit a single structured log/metric (including nsName, subset
index/identifier, totalFailures) instead of only per-address logs; update calls
in resolveEndpointAddresses and adjust endpointsubset.SortAddresses usage to use
the resolved lists.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 57e3234f-44a1-403f-a813-3003e53113db

📥 Commits

Reviewing files that changed from the base of the PR and between f9adb1c and bb61fda.

📒 Files selected for processing (10)
  • pkg/cmd/infra/router/template.go
  • pkg/router/controller/extended_validator.go
  • pkg/router/controller/extended_validator_test.go
  • pkg/router/controller/factory/factory.go
  • pkg/router/controller/factory/factory_endpointslices_test.go
  • pkg/router/controller/resolver.go
  • pkg/router/controller/resolver_test.go
  • pkg/router/controller/router_controller.go
  • pkg/router/controller/router_controller_test.go
  • pkg/router/template/plugin_test.go
💤 Files with no reviewable changes (1)
  • pkg/router/controller/factory/factory_endpointslices_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/router/controller/factory/factory.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/router/template/plugin_test.go
  • pkg/router/controller/extended_validator_test.go

Comment thread pkg/router/controller/router_controller.go Outdated
}
}

func (r *DNSEndpointResolver) ResolveEndpointAddress(ctx context.Context, hostname string) ([]net.IP, error) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flagging for myself: add a better comment on what these timeouts are, make the timeout constants with a proper explanation

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


func NewDNSEndpointResolver() *DNSEndpointResolver {
return &DNSEndpointResolver{
resolver: net.DefaultResolver,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way to define timeouts and retries? Need to verify

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

// sortIPsIPv6First sorts IPs with IPv6 addresses before IPv4.
func sortIPsIPv6First(ips []net.IP) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ask Claude to make this reflect exactly the behavior of HAProxy

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{
name: "IPv6 before IPv4",
ips: []net.IP{net.ParseIP("10.0.0.1"), net.ParseIP("2001:db8::1")},
expected: []string{"2001:db8::1", "10.0.0.1"},

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens when we render IPv6? Will HAProxy break? Is this the right semantic? We probably need a better test here, or even some e2e test or some validation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the internal plugin is able to convert IPv6 to bracket address, this is fine

c.HandleEndpoints(eventType, endpoints)
}

// resolveEndpointAddresses resolves FQDN addresses in endpoint

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a flaw. needs to be verified. We must focus on any endpointslice of type FQDN, I need to check inside the conversion again if the endpointslice of type FQDN is converted to the IP field, but without validating it is an IP

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some checking confirmed this is correct. ConvertEndpointSlice will blindly convert EndpointSlices to EndpointSubset, without checking the Endpoint type and moving the Addresses to an IP.

So per the code here we are:

  • resolving the address list, when it is not an IP type (resolveEndpointAddress comes first)
  • Then call handleEndpoints that will filter out invalid addresses

@rikatz

rikatz commented Jun 10, 2026

Copy link
Copy Markdown
Member

@coderabbitai rise and shine. I want you to give me reasons not to merge this PR. Find broken behavior. Find concerning issues. If you don't find any concerning issue, we are on a happy path here, otherwise let me know what should be fixed

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== extended_validator.go ==="
cat pkg/router/controller/extended_validator.go

echo ""
echo "=== resolver.go ==="
cat pkg/router/controller/resolver.go

echo ""
echo "=== router_controller.go (FQDN-related changes) ==="
cat pkg/router/controller/router_controller.go

echo ""
echo "=== template.go (Run method) ==="
cat pkg/cmd/infra/router/template.go

echo ""
echo "=== factory.go ==="
cat pkg/router/controller/factory/factory.go

echo ""
echo "=== extended_validator_test.go ==="
cat pkg/router/controller/extended_validator_test.go

echo ""
echo "=== resolver_test.go ==="
cat pkg/router/controller/resolver_test.go

echo ""
echo "=== router_controller_test.go (new test) ==="
cat pkg/router/controller/router_controller_test.go

Length of output: 95222



@rikatz: Here are the concerning issues I found. Some are security-related, others are operational/correctness problems.


🔴 Serious Concerns

1. DNS Rebinding / TOCTOU — The Core Security Guarantee Is Weak

This is the most important issue. The router resolves FQDN→IP at event time and validates that snapshot. However, HAProxy performs its own independent DNS re-resolution on a timer (via its resolvers block). An attacker who controls a DNS server can:

  1. Initially resolve evil.example.com93.184.216.34 (passes the router's check)
  2. After the router stores the endpoint, drop the TTL and rebind evil.example.com169.254.169.254
  3. HAProxy re-resolves on its own schedule and starts sending traffic to the metadata service

The PR validates the IP once, but the protection evaporates on the next HAProxy DNS cycle. This undermines the entire security premise.


2. Synchronous DNS Blocks the Informer Event Loop

In factory.go, HandleEndpointSlice is called directly from the informer event handler, not in a goroutine:

f.registerSharedInformerEventHandlers(&discoveryv1.EndpointSlice{}, func(eventType watch.EventType, obj interface{}) {
    // ...
    rc.HandleEndpointSlice(eventType, *objMeta, f.aggregateEndpointSlice(eps.Namespace, serviceName))
})

And in resolver.go, a single FQDN resolution can take up to 10s + 5s + 10s = 25s before giving up. The cap is endpointResolutionTimeout = 30s. An EndpointSlice with multiple FQDNs exhausts that cap fully.

Result: All EndpointSlice events (for every service in the cluster) are stalled for up to 30 seconds per event while DNS is in flight. This is a denial-of-service vector against the router's event loop — and the attacker's required capability (EndpointSlice write access) is exactly the threat model of this PR.


3. No Kill Switch for the Resolver

In factory.go, the resolver is always assigned:

rc := &routercontroller.RouterController{
    Plugin:     plugin,
    WatchNodes: watchNodes,
    Resolver:   routercontroller.NewDNSEndpointResolver(),   // unconditional
    ...
}

There is no flag, environment variable, or configuration option to disable FQDN resolution. If this causes issues in production (e.g. DNS failures at startup, slow DNS, or compatibility problems), operators have no recovery path short of patching the binary.


4. Endpoint IP Validation Is Silently Always-On

The extendedRouteValidation bool parameter only gates route validation in HandleRoute. The HandleEndpoints path always filters IPs regardless of the flag value — even NewExtendedValidator(..., false) will still drop loopback/link-local endpoints. The constructor comment says "Endpoint address validation is always enabled" but this is not reflected in the --extended-validation flag's help text and was not like this before.

This will silently break any deployment that:

  • Routes to 127.x.x.x (loopback, e.g. sidecar proxies with host networking)
  • Routes to 169.254.x.x (some overlay/CNI implementations)
  • Routes to 0.0.0.0 (catch-all placeholders)

There is no flag to restore the old behavior.


5. Stale Backends When All FQDNs Fail to Resolve

In router_controller.go, the event type is decided by len(items) == 0:

if len(items) == 0 {
    eventType = watch.Deleted
} else {
    eventType = watch.Modified
}

If an EndpointSlice has items (the slice exists) but all FQDN addresses fail DNS resolution, the resulting kapi.Endpoints has empty address lists and the event type is watch.Modified. HAProxy receives a Modified event with empty backends instead of a Deleted event — depending on the template logic, stale backends may persist rather than being cleaned up.


🟡 Moderate Concerns

6. isRetryableDNSError Is Too Narrow

Only DNS timeout errors are retried. SERVFAIL (server failure), which HAProxy does retry, is not retried here. The comment claims HAProxy parity but the behavior diverges on transient server-side failures.

func isRetryableDNSError(err error) bool {
    var dnsErr *net.DNSError
    if !errors.As(err, &dnsErr) {
        return false
    }
    return dnsErr.IsTimeout && !dnsErr.IsNotFound  // SERVFAIL not covered
}

7. AWS IPv6 IMDS Hardcoded as a Single Address

fd00:ec2::254 is hardcoded as an exact match. AWS could extend its metadata service across the fd00:ec2::/56 ULA prefix. Blocking a single address is fragile compared to blocking the prefix.

awsIPv6IMDS = net.ParseIP("fd00:ec2::254")
// ...
if ip.Equal(awsIPv6IMDS) {  // only this exact IP, not the /56

8. Test Case Documents a Gap Without Failing

In router_controller_test.go, this test case:

{
    name: "FQDN resolving to restricted IP is resolved but will be filtered by ExtendedValidator",
    ...
    expectedAddrs: []string{"127.0.0.1"},   // loopback passes RouterController!
},

…explicitly verifies that RouterController passes restricted IPs through and relies on ExtendedValidator being in the chain. If someone wires up a RouterController without ExtendedValidator (e.g. in testing, a custom deployment, or after a future refactor), restricted IPs reach the backend. The defense-in-depth integration test TestFQDNToRestrictedIP_DefenseInDepth covers the combined case but the test of the individual component should arguably fail when a restricted IP escapes.


Summary

# Issue Severity
1 DNS rebinding bypasses the FQDN→IP validation (TOCTOU) 🔴 Critical
2 Synchronous DNS blocks informer loop (DoS by attacker with EndpointSlice write) 🔴 Critical
3 No operational kill switch for the resolver 🔴 High
4 Endpoint IP filtering silently always-on, no flag to disable 🔴 High
5 All-FQDN-fail results in Modified not Deleted; stale backends 🟡 Medium
6 isRetryableDNSError doesn't retry SERVFAIL like HAProxy does 🟡 Medium
7 AWS IPv6 IMDS blocked by exact IP only, not prefix 🟡 Medium
8 RouterController alone passes restricted IPs; relies on validator being in chain 🟡 Low

Issues 1 and 2 are the most pressing — issue 1 undermines the security guarantee of the entire PR, and issue 2 turns the attacker's required capability (EndpointSlice write access) directly into a DoS vector.

Comment on lines +83 to +86
var (
azureMetadata = net.ParseIP("168.63.129.16")
awsIPv6IMDS = net.ParseIP("fd00:ec2::254")
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth using a set instead? Of course, you would need to convert the net.IP value back to a string value.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the advantage? I mean, it is a 2 line check, I think we can make it not a set for now and make a set in case we get more IPs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. No need to use a set if you prefer the current approach.

if o.ExtendedValidation {
plugin = controller.NewExtendedValidator(plugin, recorder)
}
plugin = controller.NewExtendedValidator(plugin, recorder, o.ExtendedValidation)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, did you consider making the endpoints validation a separate plugin, or implementing it in createRouterEndpoints?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, mostly claude took this approach and I though was fine

Comment on lines +205 to +206
expectedAddrCount int
expectedNRCount int

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to check the count and not the actual addresses?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, added inner subset assertion/equality

}
func (r *fakeTestRecorder) RecordRouteUnservableInFutureVersionsClear(route *routev1.Route) {}

func TestCheckRestrictedIP(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func TestCheckRestrictedIP(t *testing.T) {
func Test_checkRestrictedIP(t *testing.T) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}
}

func TestValidateEndpointAddress(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func TestValidateEndpointAddress(t *testing.T) {
func Test_validateEndpointAddress(t *testing.T) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

name: "restricted loopback IP",
address: "127.0.0.1",
expectError: true,
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case for another loopback address.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Comment on lines +146 to +149
{
name: "valid public IPv4",
address: "10.0.0.1",
expectError: false,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling 10.0.0.1 a "public" address is confusing. Maybe it makes sense in the context of cloud networking and distinguishing between infrastructure and non-infrastructure? Still, I'd suggest saying "private network IPv4 address" or using some different term entirely to avoid confusion.

Also, it might make sense to add a test case with a public IPv4 address.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that "public" was changed, but I don't see a new test case with an actual public address. It was just a suggestion, but maybe it was overlooked.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I've read the first paragraph, not the second. Fixed here

Comment thread pkg/router/controller/resolver.go Outdated
if !errors.As(err, &dnsErr) {
return false
}
return dnsErr.IsTimeout && !dnsErr.IsNotFound

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition looks suspicious. Can dnsErr.IsTimeout and dnsErr.IsNotFound both be true? Should we check dnsErr.IsTimeout || dnsErr.IsTemporary instead?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, I have fixed here as suggested

Comment on lines +247 to +250
nsName := fmt.Sprintf("%s/%s", objMeta.Namespace, objMeta.Name)
if c.Resolver != nil {
c.resolveEndpointAddresses(endpoints, nsName)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to my question about adding validation to the existing plugin, did you consider adding the name resolution to createRouterEndpoints? I figured that we would have it both resolve and validate addresses. I see now that you separated those concerns between the router controller and the extended validation, and maybe it is better design to have those concerns separated, but I'm curious about your design choices.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

original approach was from @bentito and I agreed that the current design is good enough. It is part of extendedValidation but yet always executed (it is not behind the route validation flag)

Comment on lines +63 to +74
{
name: "IPv4 slices pass through unchanged",
items: []discoveryv1.EndpointSlice{{
ObjectMeta: metav1.ObjectMeta{Name: "slice-1", Namespace: "ns"},
AddressType: ipv4Type,
Endpoints: []discoveryv1.Endpoint{{
Addresses: []string{"10.0.0.1"},
}},
}},
expectedEventType: watch.Modified,
expectedAddrs: []string{"10.0.0.1"},
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the test case omits resolver, does the test actually cover what it is intended to cover? It seems that the controller doesn't actually attempt name resolution if it doesn't have a resolver:

if c.Resolver != nil {
c.resolveEndpointAddresses(endpoints, nsName)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this is something that while I was doing self review got me but I was "meh", but yes, we should never allow null resolver. I will add it to my fix list

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a default resolver always even if it is null, and removed this check

}},
}},
expectedEventType: watch.Modified,
expectedAddrs: []string{"10.0.0.1", "2001:db8::1"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the intention was that IPv6 addresses would come first?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, fixed in a next commit

Comment on lines +272 to +276
endpoints.Subsets[i].Addresses = c.resolveAddressList(ctx, endpoints.Subsets[i].Addresses, nsName)
endpoints.Subsets[i].NotReadyAddresses = c.resolveAddressList(ctx, endpoints.Subsets[i].NotReadyAddresses, nsName)

endpointsubset.SortAddresses(endpoints.Subsets[i].Addresses, endpointsubset.DefaultEndpointAddressOrderByFuncs())
endpointsubset.SortAddresses(endpoints.Subsets[i].NotReadyAddresses, endpointsubset.DefaultEndpointAddressOrderByFuncs())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveAddressList sorts IPv6 addresses first, but then endpointsubset.SortAddresses puts IPv4 addresses first — right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related question: Is the correct behavior to sort IPv6 addresses before IPv4 addresses, or is the correct behavior to drop IPv4 addresses if the the name resolves to both IPv6 addresses and IPv4 addresses?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my understanding, HAProxy prioritizes IPv6 and then IPv4 but does not drop it, so I have followed the same approach

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apparently and per unit tests I am wrong, let me check and fix here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my understanding, HAProxy prioritizes IPv6 and then IPv4 but does not drop it, so I have followed the same approach

I mean, if the controller still adds server lines for IPv4 addresses in the HAProxy configuration file, then HAProxy will still use them, right? Even if the IPv6 lines come first, HAProxy will use the IPv4 addresses when using roundrobin balancing, or with leastconn or random if the number of concurrent connections is greater than the number of IPv6 addresses. In contrast — given the description of the httpclient.resolvers.prefer option:

httpclient.resolvers.prefer <ipv4|ipv6>
  This option allows to chose which family of IP you want when resolving,
  which is convenient when IPv6 is not available on your network. Default
  option is "ipv6".

— this means that HAProxy only uses the IPv6 addresses if you give it a server line that has a DNS name that resolves to a mix of IPv4 and IPv6 addresses, right?

name string
resolver EndpointResolver
items []discoveryv1.EndpointSlice
expectedEventType watch.EventType

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have expectedEventType as a test input, or should the test runner unconditionally assert that it is watch.Modified?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I have removed this assertion and added an assertion for always being watch.Modified

Comment on lines +258 to +262
// TestFQDNToRestrictedIP_DefenseInDepth verifies that when a FQDN
// resolves to a restricted IP, the address is resolved by the
// RouterController but then filtered by the ExtendedValidator before
// reaching the inner plugin.
func TestFQDNToRestrictedIP_DefenseInDepth(t *testing.T) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test named "DefenseInDepth"? Wouldn't that imply that we had two checks for restricted addresses rather than the one check that we actually have?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the validation is deep on the code, so the test here is more like a broader unit test that verifies the whole workflow. It is not calling the validation functions, but rather the whole ExtendedValidation from route controller. I can change for a better name if you prefer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the validation is deep on the code, so the test here is more like a broader unit test that verifies the whole workflow. It is not calling the validation functions, but rather the whole ExtendedValidation from route controller.

The phrase "defense in depth" refers to having redundant security layers — for example, using SELinux labels plus Linux namespaces to contain processes, or using NetworkPolicy plus TLS to restrict access to network endpoints — so if one layer fails, the next layer provides protection.

I can change for a better name if you prefer

It isn't very creative, but maybe TestRouterController_EndpointsValidation? The name isn't terribly important though; I find the current name a little confusing, but it isn't worth spending much time on if you are happy with it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, sounds good, I am already making changes and changing this name is fine

@Miciah

Miciah commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

It looks like there has been some drift between the commit message and the current changes. The commit message should be updated:

This commit updates the Endpoints validation and addition, first
trying to resolve any non-IP endpoint address to a proper address,
and then validating against a forbidden list of IPs on Endpoints,
like multicast, link local, localhost and cloud metadata server

  • The router controller now resolves addresses, so this commit has two main parts: (1) updates to the extended validation and (2) updates to the router controller logic.
  • Is "addition" a typo for "admission", or do you mean it in the sense of handling watch.Added (and also watch.Modified) events? A more general term like "handling" might be less confusing.
  • What do you mean by "proper address"? Since the resolution and validation are now separate concerns, maybe it is more accurate just to say the controller resolves a hostname to "an IP address" (and the validation makes sure that it is a "proper"/valid/allowed address).
  • I'm not sure what the phrase "on Endpoints" is intended to convey. Can it be omitted?
  • You are missing a period at the end of the sentence (or at least I can only assume it is the end of the sentence!).

@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from miciah. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

if ip == nil {
t.Fatalf("failed to parse IP %q", tc.ip)
}
require.NotNil(t, ip, "failed to parse IP")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.NotNil(t, ip, "failed to parse IP")
require.NotNil(t, ip, "failed to parse IP address %s", tc.ip)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use also NotNilf

Comment on lines +129 to +130
if tc.expectError && err == nil {
t.Errorf("expected error for IP %s, got nil", tc.ip)
require.Error(t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& err == nil is redundant.

Suggested change
if tc.expectError && err == nil {
t.Errorf("expected error for IP %s, got nil", tc.ip)
require.Error(t, err)
if tc.expectError {
require.Error(t, err, "expected error for IP %s, got nil", tc.ip)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think my brain was melted when I wrote this on friday...

Comment on lines +132 to +133
if !tc.expectError && err != nil {
t.Errorf("expected no error for IP %s, got %v", tc.ip, err)
require.NoError(t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !tc.expectError && err != nil {
t.Errorf("expected no error for IP %s, got %v", tc.ip, err)
require.NoError(t, err)
if !tc.expectError {
require.NoError(t, err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +194 to +195
if tc.expectError && err == nil {
t.Errorf("expected error for address %q, got nil", tc.address)
require.Error(t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if tc.expectError && err == nil {
t.Errorf("expected error for address %q, got nil", tc.address)
require.Error(t, err)
if tc.expectError {
require.Error(t, err, "expected error for address %q, got nil", tc.address)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +197 to +198
if !tc.expectError && err != nil {
t.Errorf("expected no error for address %q, got %v", tc.address, err)
require.NoError(t, err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if !tc.expectError && err != nil {
t.Errorf("expected no error for address %q, got %v", tc.address, err)
require.NoError(t, err)
if !tc.expectError {
require.NoError(t, err)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines -275 to -276
endpointsubset.SortAddresses(endpoints.Subsets[i].Addresses, endpointsubset.DefaultEndpointAddressOrderByFuncs())
endpointsubset.SortAddresses(endpoints.Subsets[i].NotReadyAddresses, endpointsubset.DefaultEndpointAddressOrderByFuncs())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that removing SortAddresses entirely is the right solution. As https://github.com/openshift/enhancements/blob/master/enhancements/ingress/power-of-two-random-choices.md#alternatives-not-implemented notes, deterministic ordering has advantages. In particular, the source balancing algorithm depends on server ordering consistency to preserve session affinity across configuration updates and among router pods belonging to the same deployment.

How difficult would it be both to sort using the default sort funcs and to sort IPv6 addresses first? I think you could straightforwardly convert the logic in sortIPsIPv6First into an EndpointAddressLessFunc value and pass that as an additional func to SortAddresses. Then DNSEndpointResolver itself wouldn't need any sorting logic.

bentito and others added 3 commits June 15, 2026 10:27
The OpenShift Router previously did not validate backend destinations
resolved from FQDN-typed EndpointSlices. This allowed an attacker with
EndpointSlice write access to create a Service backed by an FQDN
pointing to a hostname that resolves to restricted IPs (like the cloud
metadata service at 169.254.169.254) and bypass SSRF protections.

This commit updates the Endpoints validation and addition, first
trying to resolve any non-IP endpoint address to a proper address,
and then validating against a forbidden list of IPs on Endpoints,
like multicast, link local, localhost and cloud metadata server
@rikatz

rikatz commented Jun 16, 2026

Copy link
Copy Markdown
Member

/test e2e-aws-fips

2 similar comments
@rikatz

rikatz commented Jun 16, 2026

Copy link
Copy Markdown
Member

/test e2e-aws-fips

@rikatz

rikatz commented Jun 19, 2026

Copy link
Copy Markdown
Member

/test e2e-aws-fips

@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@bentito: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants