OCPBUGS-87204: Prevent SSRF via FQDN-typed EndpointSlices#788
Conversation
|
@bentito: This pull request references Jira Issue OCPBUGS-86729, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThe PR adds endpoint address validation and DNS resolution capabilities to the router controller. ExtendedValidator now filters endpoint IPs against restricted categories before forwarding events. RouterController optionally resolves FQDN endpoints to IPs using a new EndpointResolver. Factory wiring initializes both components unconditionally. ChangesEndpoint Validation and FQDN Resolution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Important Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning)
✅ Passed checks (13 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/router/controller/extended_validator_test.go (2)
39-89: ⚡ Quick winAdd IPv6 loopback test case for complete coverage.
The test only covers IPv4 addresses. Add a test case for IPv6 loopback (
::1) to verify the validation handles IPv6 correctly.Proposed addition
{ name: "restricted IPv6 loopback", endpoints: &kapi.Endpoints{ Subsets: []kapi.EndpointSubset{ { Addresses: []kapi.EndpointAddress{{IP: "::1"}}, }, }, }, expectBlocked: true, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/extended_validator_test.go` around lines 39 - 89, Add an IPv6 loopback case to the TestExtendedValidator_HandleEndpoints tests slice so the validator is exercised for IPv6; specifically add a test entry named "restricted IPv6 loopback" in the tests array (inside TestExtendedValidator_HandleEndpoints) with endpoints containing a kapi.EndpointAddress whose IP is "::1" and expectBlocked: true to mirror the existing IPv4 loopback/blocked cases.
39-89: ⚡ Quick winAdd test case for
NotReadyAddressesto match implementation coverage.The
HandleEndpointsimplementation validates bothAddressesandNotReadyAddresses(lines 45-58 of extended_validator.go), but the tests only coverAddresses.Proposed addition
{ name: "restricted IP in NotReadyAddresses", endpoints: &kapi.Endpoints{ Subsets: []kapi.EndpointSubset{ { NotReadyAddresses: []kapi.EndpointAddress{{IP: "169.254.169.254"}}, }, }, }, expectBlocked: true, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/extended_validator_test.go` around lines 39 - 89, The test suite TestExtendedValidator_HandleEndpoints covers Addresses but not NotReadyAddresses used by HandleEndpoints; add a test case that mirrors the existing restricted-IP cases but puts the IP (e.g. "169.254.169.254") into EndpointSubset.NotReadyAddresses to assert expectBlocked == true so the tests exercise validation of NotReadyAddresses in HandleEndpoints.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/router/controller/extended_validator.go`:
- Around line 69-84: Replace the long stream-of-consciousness block around the
net.LookupIP(address) failure handling in extended_validator.go with a concise
comment that documents the TOCTOU/DNS-rebinding limitation (attacker can return
a benign IP now and 169.254.169.254 later when HAProxy resolves) and either mark
it as a known limitation with a TODO to investigate mitigation (e.g., pin/cached
resolution in HAProxy) or reference a config option to enforce resolution
caching; keep the actual behavior (return fmt.Errorf("failed to resolve FQDN %q:
%v", address, err)) unchanged but ensure the comment is one short sentence
calling out the risk and next steps.
- Around line 70-84: The code currently calls net.LookupIP(address) without a
context; replace that call with a context-aware resolver using
(*net.Resolver).LookupIPAddr(ctx, address) so you can control
timeouts/cancellation and inject a test resolver. Update the function that
performs this validation to accept or use a context.Context (or a resolver
parameter) and call resolver.LookupIPAddr(ctx, address), handle the returned
[]net.IPAddr, and preserve the existing error handling (returning
fmt.Errorf("failed to resolve FQDN %q: %v", address, err)). Ensure tests can
inject a custom *net.Resolver (or mock) and set a timeout on the context when
performing production lookups.
- Around line 96-105: The checkRestrictedIP function currently only blocks a
single metadata IP and loopback; update it to block the full link-local range
and Azure wireserver by using net.IP's IsLinkLocalUnicast() to reject
169.254.0.0/16 and fe80::/10 and explicitly check for 168.63.129.16 as well as
existing loopback checks; locate the function checkRestrictedIP and replace the
single-address comparison with an IsLinkLocalUnicast() check, add a direct
comparison for net.ParseIP("168.63.129.16"), and ensure all error messages
include the offending IP via ip.String().
---
Nitpick comments:
In `@pkg/router/controller/extended_validator_test.go`:
- Around line 39-89: Add an IPv6 loopback case to the
TestExtendedValidator_HandleEndpoints tests slice so the validator is exercised
for IPv6; specifically add a test entry named "restricted IPv6 loopback" in the
tests array (inside TestExtendedValidator_HandleEndpoints) with endpoints
containing a kapi.EndpointAddress whose IP is "::1" and expectBlocked: true to
mirror the existing IPv4 loopback/blocked cases.
- Around line 39-89: The test suite TestExtendedValidator_HandleEndpoints covers
Addresses but not NotReadyAddresses used by HandleEndpoints; add a test case
that mirrors the existing restricted-IP cases but puts the IP (e.g.
"169.254.169.254") into EndpointSubset.NotReadyAddresses to assert expectBlocked
== true so the tests exercise validation of NotReadyAddresses in
HandleEndpoints.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b3de770d-618d-44fe-a4d4-c7ac39f1c6cb
📒 Files selected for processing (2)
pkg/router/controller/extended_validator.gopkg/router/controller/extended_validator_test.go
| if ip.IsLoopback() { | ||
| return fmt.Errorf("IP address %s is a restricted loopback IP", ip.String()) | ||
| } | ||
| if ip.IsLinkLocalUnicast() { |
There was a problem hiding this comment.
To clarify for anyone reviewing this diff against the original PR description: we replaced the explicit literal check for 169.254.169.254 with ip.IsLinkLocalUnicast()
There was a problem hiding this comment.
It would be helpful to add code comments here, and also here:
router/pkg/router/controller/extended_validator.go
Lines 83 to 86 in fa5382c
Something like this:
// The AWS IMDS endpoint 169.254.169.254 is covered by the ip.IsLinkLocalUnicast()
// check in `checkRestrictedIP`.|
/jira refresh |
| defer cancel() | ||
| addrs, err := resolver.LookupIPAddr(lookupCtx, address) | ||
| if err != nil { | ||
| // TOCTOU Limitation: An attacker can bypass this check by returning a benign IP now and |
There was a problem hiding this comment.
This is valid but a very edge case IMO. The attacker would need to either:
- Control the DNS of the cluster, which brings a different problem
- Be able to rename the "invalid" domain to a valid domain on a good TTL window (doable)
I think long term we can consider something like external-check command /usr/local/bin/check_ip_blacklist.sh where HAProxy can call a binary running on the same container of HAProxy passing the hostname or IP and leaving this binary to decide if it is safe or not.
| log.Error(err, "skipping endpoints due to invalid configuration", "endpoints", fmt.Sprintf("%s/%s", endpoints.Namespace, endpoints.Name)) | ||
| // We don't have a recordEndpointRejection mechanism, so we log and drop the event. | ||
| // This prevents the router from proxying to the malicious backend. | ||
| return fmt.Errorf("invalid endpoint configuration: %v", err) |
There was a problem hiding this comment.
I am curious and can't remember, what happens when we return an error? Does this block any further configuration of the router, or does this block just the specific route? (I think it is just the route, but I cannot remember).
Also, would it make sense to aggregate all of the errors on a []err and return at once?
|
@bentito: This pull request references Jira Issue OCPBUGS-87204, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/cmd/infra/router/router.go (1)
110-110: ⚡ Quick winClarify the help text for FQDN handling.
The current help text states "FQDN-typed EndpointSlices are rejected," which implies all FQDN-typed EndpointSlices are blocked unconditionally. Based on the PR description, the actual behavior is that FQDNs are resolved via DNS lookup and rejected only if they resolve to restricted IP addresses (unspecified, loopback, link-local, or Azure metadata). Please update the text to reflect this nuance.
📝 Suggested help text revision
- flag.BoolVar(&o.EndpointAddressValidation, "endpoint-address-validation", isTrue(env("ENDPOINT_ADDRESS_VALIDATION", "true")), "If set, endpoint addresses are validated against restricted IP ranges and FQDN-typed EndpointSlices are rejected. Defaults to true.") + flag.BoolVar(&o.EndpointAddressValidation, "endpoint-address-validation", isTrue(env("ENDPOINT_ADDRESS_VALIDATION", "true")), "If set, endpoint addresses are validated against restricted IP ranges (loopback, link-local, unspecified, and Azure metadata IPs). FQDN-typed EndpointSlice addresses are resolved and rejected if they resolve to restricted IPs. Defaults to true.")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cmd/infra/router/router.go` at line 110, Update the help text for the flag registration where flag.BoolVar is called with &o.EndpointAddressValidation to clarify FQDN handling: change the message to state that FQDN-typed EndpointSlices are resolved via DNS and only rejected if they resolve to restricted IP addresses (unspecified, loopback, link-local, or Azure metadata) rather than being unconditionally rejected; ensure the new string still mentions validation against restricted IP ranges and that the default remains true.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/cmd/infra/router/router.go`:
- Line 110: Update the help text for the flag registration where flag.BoolVar is
called with &o.EndpointAddressValidation to clarify FQDN handling: change the
message to state that FQDN-typed EndpointSlices are resolved via DNS and only
rejected if they resolve to restricted IP addresses (unspecified, loopback,
link-local, or Azure metadata) rather than being unconditionally rejected;
ensure the new string still mentions validation against restricted IP ranges and
that the default remains true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b0de28c0-7411-416c-b5e0-00c08775a70f
📒 Files selected for processing (9)
pkg/cmd/infra/router/router.gopkg/cmd/infra/router/template.gopkg/router/controller/extended_validator.gopkg/router/controller/extended_validator_test.gopkg/router/controller/factory/factory.gopkg/router/controller/factory/factory_endpointslices_test.gopkg/router/controller/router_controller.gopkg/router/controller/router_controller_test.gopkg/router/template/plugin_test.go
|
@bentito: This pull request references Jira Issue OCPBUGS-87204, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/router/controller/router_controller_test.go (1)
171-177: ⚡ Quick winCount-only assertion is too weak for mixed-slice filtering.
This can miss regressions where the wrong address survives filtering. Assert exact retained addresses, not just total count.
Suggested strengthening
- totalAddrs := 0 - for _, subset := range event.endpoints.Subsets { - totalAddrs += len(subset.Addresses) - } - if totalAddrs != tc.expectedAddrCount { - t.Errorf("expected %d addresses, got %d", tc.expectedAddrCount, totalAddrs) - } + got := sets.NewString() + for _, subset := range event.endpoints.Subsets { + for _, addr := range subset.Addresses { + got.Insert(addr.IP) + } + } + if got.Len() != tc.expectedAddrCount { + t.Errorf("expected %d addresses, got %d", tc.expectedAddrCount, got.Len()) + } + // Add per-test expected address set and compare exactly: + // if diff := cmp.Diff(tc.expectedAddresses, got.UnsortedList()); diff != "" { ... }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/router_controller_test.go` around lines 171 - 177, Replace the weak count-only assertion for endpoint addresses with an exact-address comparison: iterate event.endpoints.Subsets and collect all subset.Addresses into a single slice (e.g., gotAddrs), then compare gotAddrs against the test-case's expected addresses (add a tc.expectedAddrs field if needed) using reflect.DeepEqual or cmp.Diff and call t.Errorf with the diff when they mismatch instead of the current count check that uses tc.expectedAddrCount.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/router/controller/router_controller_test.go`:
- Around line 171-177: Replace the weak count-only assertion for endpoint
addresses with an exact-address comparison: iterate event.endpoints.Subsets and
collect all subset.Addresses into a single slice (e.g., gotAddrs), then compare
gotAddrs against the test-case's expected addresses (add a tc.expectedAddrs
field if needed) using reflect.DeepEqual or cmp.Diff and call t.Errorf with the
diff when they mismatch instead of the current count check that uses
tc.expectedAddrCount.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8a635ad1-bb51-4764-8cc0-b0b1b4efcaa8
📒 Files selected for processing (9)
pkg/cmd/infra/router/router.gopkg/cmd/infra/router/template.gopkg/router/controller/extended_validator.gopkg/router/controller/extended_validator_test.gopkg/router/controller/factory/factory.gopkg/router/controller/factory/factory_endpointslices_test.gopkg/router/controller/router_controller.gopkg/router/controller/router_controller_test.gopkg/router/template/plugin_test.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/router/controller/router_controller.go
- pkg/router/template/plugin_test.go
- pkg/cmd/infra/router/template.go
- pkg/router/controller/factory/factory.go
- pkg/router/controller/extended_validator.go
- pkg/router/controller/extended_validator_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/router/controller/router_controller.go (1)
267-275: ⚖️ Poor tradeoffConsider aggregating resolution errors instead of silently skipping failed resolutions.
When multiple FQDN addresses fail to resolve within a single endpoint subset, the current implementation logs each failure individually but provides no aggregate signal. For observability and debugging, consider counting failed resolutions per endpoint and emitting a structured log or metric at the subset level.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/router/controller/router_controller.go` around lines 267 - 275, The resolveEndpointAddresses loop currently calls resolveAddressList for each Endpoints.Subsets' Addresses and NotReadyAddresses and only logs individual DNS failures; change resolveEndpointAddresses to aggregate resolution failures per subset by having resolveAddressList return both the resolved list and a count (or error slice) of failed resolutions, then sum failures for Addresses and NotReadyAddresses for each endpoints.Subsets entry and emit a single structured log/metric (including nsName, subset index/identifier, totalFailures) instead of only per-address logs; update calls in resolveEndpointAddresses and adjust endpointsubset.SortAddresses usage to use the resolved lists.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/router/controller/router_controller.go`:
- Line 285: Replace the use of context.TODO() when calling
c.Resolver.ResolveEndpointAddress with a cancellable context that has a finite
deadline (e.g., create a child context from the controller/request context via
context.WithTimeout(ctx, 15*time.Second)), pass that context into
ResolveEndpointAddress, and ensure you defer the cancel() so cancellation
propagates during shutdown; update the call site around resolved, err :=
c.Resolver.ResolveEndpointAddress(...) to use the new timed context instead of
context.TODO().
---
Nitpick comments:
In `@pkg/router/controller/router_controller.go`:
- Around line 267-275: The resolveEndpointAddresses loop currently calls
resolveAddressList for each Endpoints.Subsets' Addresses and NotReadyAddresses
and only logs individual DNS failures; change resolveEndpointAddresses to
aggregate resolution failures per subset by having resolveAddressList return
both the resolved list and a count (or error slice) of failed resolutions, then
sum failures for Addresses and NotReadyAddresses for each endpoints.Subsets
entry and emit a single structured log/metric (including nsName, subset
index/identifier, totalFailures) instead of only per-address logs; update calls
in resolveEndpointAddresses and adjust endpointsubset.SortAddresses usage to use
the resolved lists.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 57e3234f-44a1-403f-a813-3003e53113db
📒 Files selected for processing (10)
pkg/cmd/infra/router/template.gopkg/router/controller/extended_validator.gopkg/router/controller/extended_validator_test.gopkg/router/controller/factory/factory.gopkg/router/controller/factory/factory_endpointslices_test.gopkg/router/controller/resolver.gopkg/router/controller/resolver_test.gopkg/router/controller/router_controller.gopkg/router/controller/router_controller_test.gopkg/router/template/plugin_test.go
💤 Files with no reviewable changes (1)
- pkg/router/controller/factory/factory_endpointslices_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/router/controller/factory/factory.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/router/template/plugin_test.go
- pkg/router/controller/extended_validator_test.go
| } | ||
| } | ||
|
|
||
| func (r *DNSEndpointResolver) ResolveEndpointAddress(ctx context.Context, hostname string) ([]net.IP, error) { |
There was a problem hiding this comment.
flagging for myself: add a better comment on what these timeouts are, make the timeout constants with a proper explanation
|
|
||
| func NewDNSEndpointResolver() *DNSEndpointResolver { | ||
| return &DNSEndpointResolver{ | ||
| resolver: net.DefaultResolver, |
There was a problem hiding this comment.
Is there a better way to define timeouts and retries? Need to verify
| } | ||
|
|
||
| // sortIPsIPv6First sorts IPs with IPv6 addresses before IPv4. | ||
| func sortIPsIPv6First(ips []net.IP) { |
There was a problem hiding this comment.
ask Claude to make this reflect exactly the behavior of HAProxy
| { | ||
| name: "IPv6 before IPv4", | ||
| ips: []net.IP{net.ParseIP("10.0.0.1"), net.ParseIP("2001:db8::1")}, | ||
| expected: []string{"2001:db8::1", "10.0.0.1"}, |
There was a problem hiding this comment.
what happens when we render IPv6? Will HAProxy break? Is this the right semantic? We probably need a better test here, or even some e2e test or some validation.
There was a problem hiding this comment.
the internal plugin is able to convert IPv6 to bracket address, this is fine
| c.HandleEndpoints(eventType, endpoints) | ||
| } | ||
|
|
||
| // resolveEndpointAddresses resolves FQDN addresses in endpoint |
There was a problem hiding this comment.
this seems like a flaw. needs to be verified. We must focus on any endpointslice of type FQDN, I need to check inside the conversion again if the endpointslice of type FQDN is converted to the IP field, but without validating it is an IP
There was a problem hiding this comment.
some checking confirmed this is correct. ConvertEndpointSlice will blindly convert EndpointSlices to EndpointSubset, without checking the Endpoint type and moving the Addresses to an IP.
So per the code here we are:
- resolving the address list, when it is not an IP type (resolveEndpointAddress comes first)
- Then call handleEndpoints that will filter out invalid addresses
|
@coderabbitai rise and shine. I want you to give me reasons not to merge this PR. Find broken behavior. Find concerning issues. If you don't find any concerning issue, we are on a happy path here, otherwise let me know what should be fixed |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== extended_validator.go ==="
cat pkg/router/controller/extended_validator.go
echo ""
echo "=== resolver.go ==="
cat pkg/router/controller/resolver.go
echo ""
echo "=== router_controller.go (FQDN-related changes) ==="
cat pkg/router/controller/router_controller.go
echo ""
echo "=== template.go (Run method) ==="
cat pkg/cmd/infra/router/template.go
echo ""
echo "=== factory.go ==="
cat pkg/router/controller/factory/factory.go
echo ""
echo "=== extended_validator_test.go ==="
cat pkg/router/controller/extended_validator_test.go
echo ""
echo "=== resolver_test.go ==="
cat pkg/router/controller/resolver_test.go
echo ""
echo "=== router_controller_test.go (new test) ==="
cat pkg/router/controller/router_controller_test.goLength of output: 95222
🔴 Serious Concerns1. DNS Rebinding / TOCTOU — The Core Security Guarantee Is WeakThis is the most important issue. The router resolves FQDN→IP at event time and validates that snapshot. However, HAProxy performs its own independent DNS re-resolution on a timer (via its
The PR validates the IP once, but the protection evaporates on the next HAProxy DNS cycle. This undermines the entire security premise. 2. Synchronous DNS Blocks the Informer Event LoopIn f.registerSharedInformerEventHandlers(&discoveryv1.EndpointSlice{}, func(eventType watch.EventType, obj interface{}) {
// ...
rc.HandleEndpointSlice(eventType, *objMeta, f.aggregateEndpointSlice(eps.Namespace, serviceName))
})And in Result: All EndpointSlice events (for every service in the cluster) are stalled for up to 30 seconds per event while DNS is in flight. This is a denial-of-service vector against the router's event loop — and the attacker's required capability (EndpointSlice write access) is exactly the threat model of this PR. 3. No Kill Switch for the ResolverIn rc := &routercontroller.RouterController{
Plugin: plugin,
WatchNodes: watchNodes,
Resolver: routercontroller.NewDNSEndpointResolver(), // unconditional
...
}There is no flag, environment variable, or configuration option to disable FQDN resolution. If this causes issues in production (e.g. DNS failures at startup, slow DNS, or compatibility problems), operators have no recovery path short of patching the binary. 4. Endpoint IP Validation Is Silently Always-OnThe This will silently break any deployment that:
There is no flag to restore the old behavior. 5. Stale Backends When All FQDNs Fail to ResolveIn if len(items) == 0 {
eventType = watch.Deleted
} else {
eventType = watch.Modified
}If an EndpointSlice has 🟡 Moderate Concerns6.
|
| # | Issue | Severity |
|---|---|---|
| 1 | DNS rebinding bypasses the FQDN→IP validation (TOCTOU) | 🔴 Critical |
| 2 | Synchronous DNS blocks informer loop (DoS by attacker with EndpointSlice write) | 🔴 Critical |
| 3 | No operational kill switch for the resolver | 🔴 High |
| 4 | Endpoint IP filtering silently always-on, no flag to disable | 🔴 High |
| 5 | All-FQDN-fail results in Modified not Deleted; stale backends |
🟡 Medium |
| 6 | isRetryableDNSError doesn't retry SERVFAIL like HAProxy does |
🟡 Medium |
| 7 | AWS IPv6 IMDS blocked by exact IP only, not prefix | 🟡 Medium |
| 8 | RouterController alone passes restricted IPs; relies on validator being in chain |
🟡 Low |
Issues 1 and 2 are the most pressing — issue 1 undermines the security guarantee of the entire PR, and issue 2 turns the attacker's required capability (EndpointSlice write access) directly into a DoS vector.
| var ( | ||
| azureMetadata = net.ParseIP("168.63.129.16") | ||
| awsIPv6IMDS = net.ParseIP("fd00:ec2::254") | ||
| ) |
There was a problem hiding this comment.
Is it worth using a set instead? Of course, you would need to convert the net.IP value back to a string value.
There was a problem hiding this comment.
what is the advantage? I mean, it is a 2 line check, I think we can make it not a set for now and make a set in case we get more IPs
There was a problem hiding this comment.
Fair enough. No need to use a set if you prefer the current approach.
| if o.ExtendedValidation { | ||
| plugin = controller.NewExtendedValidator(plugin, recorder) | ||
| } | ||
| plugin = controller.NewExtendedValidator(plugin, recorder, o.ExtendedValidation) |
There was a problem hiding this comment.
Out of curiosity, did you consider making the endpoints validation a separate plugin, or implementing it in createRouterEndpoints?
There was a problem hiding this comment.
nope, mostly claude took this approach and I though was fine
| expectedAddrCount int | ||
| expectedNRCount int |
There was a problem hiding this comment.
Any reason to check the count and not the actual addresses?
There was a problem hiding this comment.
no reason, added inner subset assertion/equality
| } | ||
| func (r *fakeTestRecorder) RecordRouteUnservableInFutureVersionsClear(route *routev1.Route) {} | ||
|
|
||
| func TestCheckRestrictedIP(t *testing.T) { |
There was a problem hiding this comment.
| func TestCheckRestrictedIP(t *testing.T) { | |
| func Test_checkRestrictedIP(t *testing.T) { |
| } | ||
| } | ||
|
|
||
| func TestValidateEndpointAddress(t *testing.T) { |
There was a problem hiding this comment.
| func TestValidateEndpointAddress(t *testing.T) { | |
| func Test_validateEndpointAddress(t *testing.T) { |
| name: "restricted loopback IP", | ||
| address: "127.0.0.1", | ||
| expectError: true, | ||
| }, |
There was a problem hiding this comment.
Please add a test case for another loopback address.
| { | ||
| name: "valid public IPv4", | ||
| address: "10.0.0.1", | ||
| expectError: false, |
There was a problem hiding this comment.
Calling 10.0.0.1 a "public" address is confusing. Maybe it makes sense in the context of cloud networking and distinguishing between infrastructure and non-infrastructure? Still, I'd suggest saying "private network IPv4 address" or using some different term entirely to avoid confusion.
Also, it might make sense to add a test case with a public IPv4 address.
There was a problem hiding this comment.
I see that "public" was changed, but I don't see a new test case with an actual public address. It was just a suggestion, but maybe it was overlooked.
There was a problem hiding this comment.
yeah I've read the first paragraph, not the second. Fixed here
| if !errors.As(err, &dnsErr) { | ||
| return false | ||
| } | ||
| return dnsErr.IsTimeout && !dnsErr.IsNotFound |
There was a problem hiding this comment.
This condition looks suspicious. Can dnsErr.IsTimeout and dnsErr.IsNotFound both be true? Should we check dnsErr.IsTimeout || dnsErr.IsTemporary instead?
There was a problem hiding this comment.
indeed, I have fixed here as suggested
| nsName := fmt.Sprintf("%s/%s", objMeta.Namespace, objMeta.Name) | ||
| if c.Resolver != nil { | ||
| c.resolveEndpointAddresses(endpoints, nsName) | ||
| } |
There was a problem hiding this comment.
Related to my question about adding validation to the existing plugin, did you consider adding the name resolution to createRouterEndpoints? I figured that we would have it both resolve and validate addresses. I see now that you separated those concerns between the router controller and the extended validation, and maybe it is better design to have those concerns separated, but I'm curious about your design choices.
There was a problem hiding this comment.
original approach was from @bentito and I agreed that the current design is good enough. It is part of extendedValidation but yet always executed (it is not behind the route validation flag)
| { | ||
| name: "IPv4 slices pass through unchanged", | ||
| items: []discoveryv1.EndpointSlice{{ | ||
| ObjectMeta: metav1.ObjectMeta{Name: "slice-1", Namespace: "ns"}, | ||
| AddressType: ipv4Type, | ||
| Endpoints: []discoveryv1.Endpoint{{ | ||
| Addresses: []string{"10.0.0.1"}, | ||
| }}, | ||
| }}, | ||
| expectedEventType: watch.Modified, | ||
| expectedAddrs: []string{"10.0.0.1"}, | ||
| }, |
There was a problem hiding this comment.
If the test case omits resolver, does the test actually cover what it is intended to cover? It seems that the controller doesn't actually attempt name resolution if it doesn't have a resolver:
router/pkg/router/controller/router_controller.go
Lines 248 to 249 in fa5382c
There was a problem hiding this comment.
so, this is something that while I was doing self review got me but I was "meh", but yes, we should never allow null resolver. I will add it to my fix list
There was a problem hiding this comment.
added a default resolver always even if it is null, and removed this check
| }}, | ||
| }}, | ||
| expectedEventType: watch.Modified, | ||
| expectedAddrs: []string{"10.0.0.1", "2001:db8::1"}, |
There was a problem hiding this comment.
I thought the intention was that IPv6 addresses would come first?
| endpoints.Subsets[i].Addresses = c.resolveAddressList(ctx, endpoints.Subsets[i].Addresses, nsName) | ||
| endpoints.Subsets[i].NotReadyAddresses = c.resolveAddressList(ctx, endpoints.Subsets[i].NotReadyAddresses, nsName) | ||
|
|
||
| endpointsubset.SortAddresses(endpoints.Subsets[i].Addresses, endpointsubset.DefaultEndpointAddressOrderByFuncs()) | ||
| endpointsubset.SortAddresses(endpoints.Subsets[i].NotReadyAddresses, endpointsubset.DefaultEndpointAddressOrderByFuncs()) |
There was a problem hiding this comment.
resolveAddressList sorts IPv6 addresses first, but then endpointsubset.SortAddresses puts IPv4 addresses first — right?
There was a problem hiding this comment.
Related question: Is the correct behavior to sort IPv6 addresses before IPv4 addresses, or is the correct behavior to drop IPv4 addresses if the the name resolves to both IPv6 addresses and IPv4 addresses?
There was a problem hiding this comment.
from my understanding, HAProxy prioritizes IPv6 and then IPv4 but does not drop it, so I have followed the same approach
There was a problem hiding this comment.
apparently and per unit tests I am wrong, let me check and fix here
There was a problem hiding this comment.
from my understanding, HAProxy prioritizes IPv6 and then IPv4 but does not drop it, so I have followed the same approach
I mean, if the controller still adds server lines for IPv4 addresses in the HAProxy configuration file, then HAProxy will still use them, right? Even if the IPv6 lines come first, HAProxy will use the IPv4 addresses when using roundrobin balancing, or with leastconn or random if the number of concurrent connections is greater than the number of IPv6 addresses. In contrast — given the description of the httpclient.resolvers.prefer option:
httpclient.resolvers.prefer <ipv4|ipv6>
This option allows to chose which family of IP you want when resolving,
which is convenient when IPv6 is not available on your network. Default
option is "ipv6".
— this means that HAProxy only uses the IPv6 addresses if you give it a server line that has a DNS name that resolves to a mix of IPv4 and IPv6 addresses, right?
| name string | ||
| resolver EndpointResolver | ||
| items []discoveryv1.EndpointSlice | ||
| expectedEventType watch.EventType |
There was a problem hiding this comment.
Does it make sense to have expectedEventType as a test input, or should the test runner unconditionally assert that it is watch.Modified?
There was a problem hiding this comment.
yes, I have removed this assertion and added an assertion for always being watch.Modified
| // TestFQDNToRestrictedIP_DefenseInDepth verifies that when a FQDN | ||
| // resolves to a restricted IP, the address is resolved by the | ||
| // RouterController but then filtered by the ExtendedValidator before | ||
| // reaching the inner plugin. | ||
| func TestFQDNToRestrictedIP_DefenseInDepth(t *testing.T) { |
There was a problem hiding this comment.
Why is this test named "DefenseInDepth"? Wouldn't that imply that we had two checks for restricted addresses rather than the one check that we actually have?
There was a problem hiding this comment.
because the validation is deep on the code, so the test here is more like a broader unit test that verifies the whole workflow. It is not calling the validation functions, but rather the whole ExtendedValidation from route controller. I can change for a better name if you prefer
There was a problem hiding this comment.
because the validation is deep on the code, so the test here is more like a broader unit test that verifies the whole workflow. It is not calling the validation functions, but rather the whole ExtendedValidation from route controller.
The phrase "defense in depth" refers to having redundant security layers — for example, using SELinux labels plus Linux namespaces to contain processes, or using NetworkPolicy plus TLS to restrict access to network endpoints — so if one layer fails, the next layer provides protection.
I can change for a better name if you prefer
It isn't very creative, but maybe TestRouterController_EndpointsValidation? The name isn't terribly important though; I find the current name a little confusing, but it isn't worth spending much time on if you are happy with it.
There was a problem hiding this comment.
no, sounds good, I am already making changes and changing this name is fine
|
It looks like there has been some drift between the commit message and the current changes. The commit message should be updated:
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| if ip == nil { | ||
| t.Fatalf("failed to parse IP %q", tc.ip) | ||
| } | ||
| require.NotNil(t, ip, "failed to parse IP") |
There was a problem hiding this comment.
| require.NotNil(t, ip, "failed to parse IP") | |
| require.NotNil(t, ip, "failed to parse IP address %s", tc.ip) |
| if tc.expectError && err == nil { | ||
| t.Errorf("expected error for IP %s, got nil", tc.ip) | ||
| require.Error(t, err) |
There was a problem hiding this comment.
&& err == nil is redundant.
| if tc.expectError && err == nil { | |
| t.Errorf("expected error for IP %s, got nil", tc.ip) | |
| require.Error(t, err) | |
| if tc.expectError { | |
| require.Error(t, err, "expected error for IP %s, got nil", tc.ip) |
There was a problem hiding this comment.
yeah, I think my brain was melted when I wrote this on friday...
| if !tc.expectError && err != nil { | ||
| t.Errorf("expected no error for IP %s, got %v", tc.ip, err) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
| if !tc.expectError && err != nil { | |
| t.Errorf("expected no error for IP %s, got %v", tc.ip, err) | |
| require.NoError(t, err) | |
| if !tc.expectError { | |
| require.NoError(t, err) |
| if tc.expectError && err == nil { | ||
| t.Errorf("expected error for address %q, got nil", tc.address) | ||
| require.Error(t, err) |
There was a problem hiding this comment.
| if tc.expectError && err == nil { | |
| t.Errorf("expected error for address %q, got nil", tc.address) | |
| require.Error(t, err) | |
| if tc.expectError { | |
| require.Error(t, err, "expected error for address %q, got nil", tc.address) |
| if !tc.expectError && err != nil { | ||
| t.Errorf("expected no error for address %q, got %v", tc.address, err) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
| if !tc.expectError && err != nil { | |
| t.Errorf("expected no error for address %q, got %v", tc.address, err) | |
| require.NoError(t, err) | |
| if !tc.expectError { | |
| require.NoError(t, err) |
| endpointsubset.SortAddresses(endpoints.Subsets[i].Addresses, endpointsubset.DefaultEndpointAddressOrderByFuncs()) | ||
| endpointsubset.SortAddresses(endpoints.Subsets[i].NotReadyAddresses, endpointsubset.DefaultEndpointAddressOrderByFuncs()) |
There was a problem hiding this comment.
I'm not sure that removing SortAddresses entirely is the right solution. As https://github.com/openshift/enhancements/blob/master/enhancements/ingress/power-of-two-random-choices.md#alternatives-not-implemented notes, deterministic ordering has advantages. In particular, the source balancing algorithm depends on server ordering consistency to preserve session affinity across configuration updates and among router pods belonging to the same deployment.
How difficult would it be both to sort using the default sort funcs and to sort IPv6 addresses first? I think you could straightforwardly convert the logic in sortIPsIPv6First into an EndpointAddressLessFunc value and pass that as an additional func to SortAddresses. Then DNSEndpointResolver itself wouldn't need any sorting logic.
The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections. This commit updates the Endpoints validation and addition, first trying to resolve any non-IP endpoint address to a proper address, and then validating against a forbidden list of IPs on Endpoints, like multicast, link local, localhost and cloud metadata server
|
/test e2e-aws-fips |
2 similar comments
|
/test e2e-aws-fips |
|
/test e2e-aws-fips |
|
@bentito: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
The OpenShift Router previously did not validate backend destinations resolved from FQDN-typed EndpointSlices. This allowed an attacker with EndpointSlice write access to create a Service backed by an FQDN pointing to a hostname that resolves to restricted IPs (like the cloud metadata service at 169.254.169.254) and bypass SSRF protections.
This commit updates the ExtendedValidator to inspect all EndpointSlice addresses. If an address is an FQDN, it performs a DNS lookup and blocks the endpoint if it resolves to a restricted IP address.
Includes unit tests to verify malicious FQDNs are successfully blocked.
Joint effort of the author and gemini-pro-3.1-preview
Summary by CodeRabbit
New Features
Tests