fix: decouple pull and mount from kubelet RPC deadlines #36
tjandy98 wants to merge 2 commits into modelpack:main
Conversation
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>
Code Review
This pull request decouples model pulling and mounting operations from the parent context's deadline to prevent interruptions caused by kubelet timeouts. Specifically, it introduces a 30-second timeout for mounting in node_static_inline.go and decouples the pull operation in worker.go. Feedback was provided regarding a potential context leak in worker.go due to a missing call to the cancellation function.
```diff
  // when kubelet times out and retries
  var cancel context.CancelFunc
- ctx, cancel = context.WithCancel(ctx)
+ ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
```
The cancel function returned by context.WithCancel should be called to release resources as soon as the operation is complete. Since this context is decoupled from the parent and has no deadline, failing to call cancel will result in a context leak until the garbage collector cleans it up. Adding a defer cancel() ensures that resources are released regardless of how the function exits.
```suggestion
ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
defer cancel()
```
imeoer left a comment
Thanks for this PR, but it introduces an issue: if the model pull is no longer subject to kubelet's 2-minute mount-volume timeout, the model volume directory may be mounted into the pod prematurely, before the model weight files are fully downloaded.
```diff
  // when kubelet times out and retries
  var cancel context.CancelFunc
- ctx, cancel = context.WithCancel(ctx)
+ ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
```
Thanks for the PR! Should we only use WithoutCancel for the s.worker.PullModel call in nodePublishVolumeStaticInlineVolume?
Thanks for the feedback. From my understanding, the mount only runs if
Okay, so will the pod likely see an empty directory initially and only see the model contents later? If so, we need a clear mechanism for the application process in the pod to know when the model directory is ready. This is no different from dynamic API mounting: the business logic must explicitly adapt.
The pod will not be marked as ready until all of the layers have been pulled completely.
Do you mean the inference engine process will initially fail and then eventually reach the ready state via the container's restart backoff mechanism? That seems viable; could we control this behavior via a parameter, e.g.,
The CSI driver's NodePublishVolume was canceling in-flight model pulls whenever kubelet's RPC deadline expired, making it impossible to mount any model large enough to exceed that budget.