Skip to content

[Fix] Add timeout to Ray HTTP proxy client for k8s proxy mode#4680

Merged
andrewsykim merged 3 commits intoray-project:masterfrom
JiangJiaWei1103:add-timeout-to-ray-http-proxy-client
Apr 10, 2026
Merged

[Fix] Add timeout to Ray HTTP proxy client for k8s proxy mode#4680
andrewsykim merged 3 commits intoray-project:masterfrom
JiangJiaWei1103:add-timeout-to-ray-http-proxy-client

Conversation

@JiangJiaWei1103
Copy link
Copy Markdown
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Apr 7, 2026

Why are these changes needed?

Previously, no timeout was configured for KubeRay operator requests made through the Kubernetes proxy subresource (useKubernetesProxy == true). As a result, TCP connections could hang indefinitely, blocking reconciliation silently until the Kubernetes API server's default timeout (60s) was reached.

This PR introduces an HTTP client timeout for Kubernetes proxy mode, aligning its behavior with non-proxy mode, which enforces a 2-second timeout.

Related issue number

Closes #4679.

Related to #4660. For those who run the operator in K8s proxy mode, the e2e test will fail since the ray.io/serve label is never updated.

Test Results

The following shows the local e2e test result of TestOldHeadPodFailDuringUpgrade:

Before After
Screenshot 2026-04-07 at 4 31 53 PM Screenshot 2026-04-07 at 4 26 56 PM
Failed since label will never be updated Succeed 20 times in a row

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit e643c93. Configure here.

Comment thread ray-operator/controllers/ray/utils/util.go Outdated
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Comment thread ray-operator/controllers/ray/utils/util.go Outdated
@JiangJiaWei1103
Copy link
Copy Markdown
Contributor Author

cc @machichima @Future-Outlier to take a look if you have time, thx.

DefaultLivenessProbeFailureThreshold = 120

// Timeout for Ray HTTP proxy client
RayHTTPProxyClientTimeoutSeconds = 2
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While in proxy mode, we need to pass through:
Kuberay Operator -> k8s API Server -> kubelet -> Pod

and in non-proxy mode, only:
Kuberay Operator -> Pod

Maybe we can consider setting a higher value for proxy mode as it need to pass through more components? Maybe 5 or 10 second?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing up this concern.

We make it 10 sec to take additional latency (2 more hops) into account.

Copy link
Copy Markdown
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor the function to this?

func GetRayHttpProxyClientFunc(mgr manager.Manager, useKubernetesProxy bool) func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
	return func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
		httpClient := &http.Client{
			Timeout: RayHTTPProxyClientTimeoutSeconds * time.Second,
		}
		httpProxyURL := fmt.Sprintf("http://%s:%d/", hostIp, port)

		if useKubernetesProxy {
			// Use the manager's transport for TLS and API server authentication.
			httpClient.Transport = mgr.GetHTTPClient().Transport
			httpProxyURL = fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s:%d/proxy/", mgr.GetConfig().Host, podNamespace, podName, port)
		}

		return &RayHttpProxyClient{
			client:       httpClient,
			httpProxyURL: httpProxyURL,
		}
	}
}

Copy link
Copy Markdown
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern: this changes the effective timeout in k8s proxy mode from ~60s (API server default) to 2s. Could this break backward compatibility for users who rely on the longer timeout?

cc @andrewsykim @rueian

@andrewsykim
Copy link
Copy Markdown
Member

I feel like 60s is overly generous, if it is blocking reconcilers then we need to time out faster. I do worry that 2s can lead to time outs since there is inherently more latency due to additional network hops. Maybe we can start with 5s timeout instead?

@rueian
Copy link
Copy Markdown
Collaborator

rueian commented Apr 10, 2026

I think 5s is also a bit risky. Let's do 10s?

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@JiangJiaWei1103
Copy link
Copy Markdown
Contributor Author

Can we refactor the function to this?

func GetRayHttpProxyClientFunc(mgr manager.Manager, useKubernetesProxy bool) func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
	return func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
		httpClient := &http.Client{
			Timeout: RayHTTPProxyClientTimeoutSeconds * time.Second,
		}
		httpProxyURL := fmt.Sprintf("http://%s:%d/", hostIp, port)

		if useKubernetesProxy {
			// Use the manager's transport for TLS and API server authentication.
			httpClient.Transport = mgr.GetHTTPClient().Transport
			httpProxyURL = fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s:%d/proxy/", mgr.GetConfig().Host, podNamespace, podName, port)
		}

		return &RayHttpProxyClient{
			client:       httpClient,
			httpProxyURL: httpProxyURL,
		}
	}
}

Good suggestion! Fixed at 9f6164f and GetRayDashboardClientFunc follows the same pattern for maintainability.

@JiangJiaWei1103
Copy link
Copy Markdown
Contributor Author

Thanks @andrewsykim and @rueian,

The Kubernetes proxy mode now uses RayHTTPClientProxyTimeoutSeconds (10 sec).

Copy link
Copy Markdown
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you

Copy link
Copy Markdown
Member

@andrewsykim andrewsykim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a side note, I think it would be helpful to have a KubeRay metric for these requests so we can track p95 / p99 lantecy in the future.

@andrewsykim andrewsykim merged commit 65364a1 into ray-project:master Apr 10, 2026
31 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in My Kuberay & Ray Apr 10, 2026
@JiangJiaWei1103
Copy link
Copy Markdown
Contributor Author

As a side note, I think it would be helpful to have a KubeRay metric for these requests so we can track p95 / p99 lantecy in the future.

Thanks Andrew, I will open an issue to track it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Hung TCP connection blocks reconciliation indefinitely through proxy subresource

5 participants