[Fix] Add timeout to Ray HTTP proxy client for k8s proxy mode by JiangJiaWei1103 · Pull Request #4680 · ray-project/kuberay

JiangJiaWei1103 · 2026-04-07T08:34:40Z

Why are these changes needed?

Previously, no timeout was configured for requests the KubeRay operator makes through the Kubernetes proxy subresource (useKubernetesProxy == true). As a result, a stalled connection had no client-side limit and could silently block reconciliation until the Kubernetes API server's default timeout (60s) was reached.

This PR introduces an HTTP client timeout for Kubernetes proxy mode, aligning its behavior with non-proxy mode, which enforces a 2-second timeout.

Related issue number

Closes #4679.

Related to #4660. For those who run the operator in K8s proxy mode, the e2e test will fail since the ray.io/serve label is never updated.

Test Results

The following shows the local e2e test result of TestOldHeadPodFailDuringUpgrade:

| Before | After |
| --- | --- |
| Failed because the label is never updated | Succeeded 20 times in a row |

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e643c93.

JiangJiaWei1103 · 2026-04-07T08:49:36Z

cc @machichima @Future-Outlier to take a look if you have time, thx.

machichima · 2026-04-09T14:02:55Z

 	DefaultLivenessProbeFailureThreshold   = 120

+	// Timeout for Ray HTTP proxy client
+	RayHTTPProxyClientTimeoutSeconds = 2


While in proxy mode, we need to pass through:
Kuberay Operator -> k8s API Server -> kubelet -> Pod

and in non-proxy mode, only:
Kuberay Operator -> Pod

Maybe we can consider setting a higher value for proxy mode, since it needs to pass through more components? Maybe 5 or 10 seconds?

Thanks for bringing up this concern.

We set it to 10 seconds to account for the additional latency of the two extra hops.

Future-Outlier

Can we refactor the function to this?

func GetRayHttpProxyClientFunc(mgr manager.Manager, useKubernetesProxy bool) func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
	return func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
		httpClient := &http.Client{
			Timeout: RayHTTPProxyClientTimeoutSeconds * time.Second,
		}
		httpProxyURL := fmt.Sprintf("http://%s:%d/", hostIp, port)

		if useKubernetesProxy {
			// Use the manager's transport for TLS and API server authentication.
			httpClient.Transport = mgr.GetHTTPClient().Transport
			httpProxyURL = fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s:%d/proxy/", mgr.GetConfig().Host, podNamespace, podName, port)
		}

		return &RayHttpProxyClient{
			client:       httpClient,
			httpProxyURL: httpProxyURL,
		}
	}
}

Future-Outlier

One concern: this changes the effective timeout in k8s proxy mode from ~60s (API server default) to 2s. Could this break backward compatibility for users who rely on the longer timeout?

cc @andrewsykim @rueian

andrewsykim · 2026-04-10T03:15:56Z

I feel like 60s is overly generous; if it is blocking reconcilers, then we need to time out faster. I do worry that 2s can lead to timeouts, since there is inherently more latency due to the additional network hops. Maybe we can start with a 5s timeout instead?

rueian · 2026-04-10T04:04:48Z

I think 5s is also a bit risky. Let's do 10s?

JiangJiaWei1103 · 2026-04-10T05:06:05Z

Good suggestion! Fixed in 9f6164f; GetRayDashboardClientFunc now follows the same pattern for maintainability.

JiangJiaWei1103 · 2026-04-10T05:07:56Z

Thanks @andrewsykim and @rueian,

The Kubernetes proxy mode now uses RayHTTPProxyClientTimeoutSeconds (10 sec).

Future-Outlier

LGTM, thank you

andrewsykim

As a side note, I think it would be helpful to have a KubeRay metric for these requests so we can track p95 / p99 latency in the future.

JiangJiaWei1103 · 2026-04-11T01:11:14Z

Thanks Andrew, I will open an issue to track it.

fix: Add timeout to Ray HTTP proxy client for k8s proxy mode

e643c93

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

JiangJiaWei1103 requested review from MortalHappiness, andrewsykim, kevin85421 and rueian as code owners April 7, 2026 08:34

cursor bot reviewed Apr 7, 2026

Comment thread ray-operator/controllers/ray/utils/util.go Outdated

fix: Avoid mutating shared client

274aee2

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

JiangJiaWei1103 moved this to In review in My Kuberay & Ray Apr 7, 2026

JiangJiaWei1103 added this to My Kuberay & Ray Apr 7, 2026

JiangJiaWei1103 commented Apr 7, 2026

Comment thread ray-operator/controllers/ray/utils/util.go Outdated

machichima reviewed Apr 9, 2026

Future-Outlier reviewed Apr 10, 2026

refactor: Increase timeout and clean up client init

9f6164f

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>

Future-Outlier approved these changes Apr 10, 2026

andrewsykim approved these changes Apr 10, 2026

andrewsykim merged commit 65364a1 into ray-project:master Apr 10, 2026
31 checks passed

github-project-automation bot moved this from In review to Done in My Kuberay & Ray Apr 10, 2026

JiangJiaWei1103 mentioned this pull request Apr 11, 2026

[Feature] [observability] Add latency metrics (p95, p99) for Ray HTTP clients #4697

Open

2 tasks
