[Fix] Add timeout to Ray HTTP proxy client for k8s proxy mode#4680
Conversation
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
cc @machichima @Future-Outlier to take a look if you have time, thx.
DefaultLivenessProbeFailureThreshold = 120

// Timeout for Ray HTTP proxy client
RayHTTPProxyClientTimeoutSeconds = 2
In proxy mode, requests need to pass through:
Kuberay Operator -> k8s API Server -> kubelet -> Pod
whereas in non-proxy mode, only:
Kuberay Operator -> Pod
Maybe we can consider setting a higher value for proxy mode since it needs to pass through more components? Maybe 5 or 10 seconds?
Thanks for bringing up this concern.
We made it 10 seconds to account for the additional latency (2 more hops).
Future-Outlier
left a comment
Can we refactor the function to this?
func GetRayHttpProxyClientFunc(mgr manager.Manager, useKubernetesProxy bool) func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
return func(hostIp, podNamespace, podName string, port int) RayHttpProxyClientInterface {
httpClient := &http.Client{
Timeout: RayHTTPProxyClientTimeoutSeconds * time.Second,
}
httpProxyURL := fmt.Sprintf("http://%s:%d/", hostIp, port)
if useKubernetesProxy {
// Use the manager's transport for TLS and API server authentication.
httpClient.Transport = mgr.GetHTTPClient().Transport
httpProxyURL = fmt.Sprintf("%s/api/v1/namespaces/%s/pods/%s:%d/proxy/", mgr.GetConfig().Host, podNamespace, podName, port)
}
return &RayHttpProxyClient{
client: httpClient,
httpProxyURL: httpProxyURL,
}
}
}
Future-Outlier
left a comment
One concern: this changes the effective timeout in k8s proxy mode from ~60s (API server default) to 2s. Could this break backward compatibility for users who rely on the longer timeout?
I feel like 60s is overly generous; if it is blocking reconcilers, then we need to time out faster. I do worry that 2s can lead to timeouts since there is inherently more latency due to the additional network hops. Maybe we can start with a 5s timeout instead?

I think 5s is also a bit risky. Let's do 10s?
Good suggestion! Fixed at 9f6164f.

Thanks @andrewsykim and @rueian, the Kubernetes proxy mode now uses a 10-second timeout.
andrewsykim
left a comment
As a side note, I think it would be helpful to have a KubeRay metric for these requests so we can track p95 / p99 latency in the future.
Thanks Andrew, I will open an issue to track it.
Why are these changes needed?
Previously, no timeout was configured for KubeRay operator requests made through the Kubernetes proxy subresource (useKubernetesProxy == true). As a result, TCP connections could hang indefinitely, silently blocking reconciliation until the Kubernetes API server's default timeout (60s) was reached. This PR introduces an HTTP client timeout for Kubernetes proxy mode, aligning its behavior with non-proxy mode, which enforces a 2-second timeout.
Related issue number
Closes #4679.
Related to #4660. For those who run the operator in K8s proxy mode, the e2e test will fail since the ray.io/serve label is never updated.

Test Results
The following shows the local e2e test result of TestOldHeadPodFailDuringUpgrade:

Checks