fix: decouple pull and mount from kubelet RPC deadlines #36

Open
tjandy98 wants to merge 2 commits into modelpack:main from tjandy98:context-fix

tjandy98 commented May 7, 2026

The CSI driver's NodePublishVolume was canceling in-flight model pulls whenever kubelet's RPC deadline expired, making it impossible to mount any model large enough to exceed that budget.

tjandy98 added 2 commits May 7, 2026 18:27
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>

gemini-code-assist Bot left a comment

Code Review

This pull request decouples model pulling and mounting operations from the parent context's deadline to prevent interruptions caused by kubelet timeouts. Specifically, it introduces a 30-second timeout for mounting in node_static_inline.go and decouples the pull operation in worker.go. Feedback was provided regarding a potential context leak in worker.go due to a missing call to the cancellation function.

Comment thread pkg/service/worker.go
  // when kubelet times out and retries
  var cancel context.CancelFunc
- ctx, cancel = context.WithCancel(ctx)
+ ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))

medium

The cancel function returned by context.WithCancel should be called to release resources as soon as the operation is complete. Since this context is decoupled from the parent and has no deadline, failing to call cancel will result in a context leak until the garbage collector cleans it up. Adding a defer cancel() ensures that resources are released regardless of how the function exits.

Suggested change
- ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
+ ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
+ defer cancel()

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines            Patch %   Lines
pkg/service/node_static_inline.go   0.00%     3 Missing ⚠️

github-actions Bot commented May 7, 2026

📊 Code Coverage Report

Metric          Coverage   Threshold   Status
Overall         71.4%      70%
Changed lines   50%        90%
📦 Per-package breakdown
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:104:                    80.0%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:116:                    80.0%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:31:                     85.7%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:60:                     75.0%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:69:                     80.0%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:81:                     80.0%
github.com/modelpack/model-csi-driver/pkg/client/grpc.go:92:                     80.0%
github.com/modelpack/model-csi-driver/pkg/client/http.go:105:                    100.0%
github.com/modelpack/model-csi-driver/pkg/client/http.go:23:                     80.0%
github.com/modelpack/model-csi-driver/pkg/client/http.go:49:                     74.3%
github.com/modelpack/model-csi-driver/pkg/client/request.go:12:                  100.0%
github.com/modelpack/model-csi-driver/pkg/client/request.go:34:                  75.0%
github.com/modelpack/model-csi-driver/pkg/client/request.go:50:                  66.7%
github.com/modelpack/model-csi-driver/pkg/client/request.go:65:                  75.0%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:112:             87.5%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:26:              100.0%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:32:              100.0%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:55:              100.0%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:66:              92.9%
github.com/modelpack/model-csi-driver/pkg/config/auth/docker.go:88:              92.3%
github.com/modelpack/model-csi-driver/pkg/config/auth/keychain.go:20:            100.0%
github.com/modelpack/model-csi-driver/pkg/config/auth/keychain.go:31:            100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:102:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:107:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:112:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:117:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:122:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:127:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:132:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:137:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:142:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:147:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:151:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:155:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:159:                  61.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:17:                   87.5%
github.com/modelpack/model-csi-driver/pkg/config/config.go:236:                  83.3%
github.com/modelpack/model-csi-driver/pkg/config/config.go:249:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:257:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:261:                  100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:70:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:74:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:78:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:82:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:86:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:90:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:94:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/config.go:98:                   100.0%
github.com/modelpack/model-csi-driver/pkg/config/watcher.go:13:                  65.2%
github.com/modelpack/model-csi-driver/pkg/logger/logger.go:19:                   100.0%
github.com/modelpack/model-csi-driver/pkg/logger/logger.go:29:                   100.0%
github.com/modelpack/model-csi-driver/pkg/logger/logger.go:41:                   100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/mount_collector.go:21:         100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/mount_collector.go:34:         100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/mount_collector.go:38:         100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/mount_collector.go:42:         83.3%
github.com/modelpack/model-csi-driver/pkg/metrics/registry.go:117:               100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/registry.go:126:               100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/registry.go:135:               100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/registry.go:146:               100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/registry.go:24:                100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/serve.go:26:                   100.0%
github.com/modelpack/model-csi-driver/pkg/metrics/serve.go:37:                   90.9%
github.com/modelpack/model-csi-driver/pkg/metrics/serve.go:59:                   83.3%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:39:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:50:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:54:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:59:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:64:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:69:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:74:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:82:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/builder.go:88:                 80.0%
github.com/modelpack/model-csi-driver/pkg/mounter/mounter.go:15:                 100.0%
github.com/modelpack/model-csi-driver/pkg/mounter/mounter.go:26:                 66.7%
github.com/modelpack/model-csi-driver/pkg/mounter/mounter.go:37:                 90.9%
github.com/modelpack/model-csi-driver/pkg/mounter/mounter.go:57:                 71.4%
github.com/modelpack/model-csi-driver/pkg/mounter/mounter.go:81:                 83.3%
github.com/modelpack/model-csi-driver/pkg/provider/provider.go:15:               100.0%
github.com/modelpack/model-csi-driver/pkg/service/artifact.go:11:                100.0%
github.com/modelpack/model-csi-driver/pkg/service/cache.go:119:                  71.4%
github.com/modelpack/model-csi-driver/pkg/service/cache.go:135:                  85.7%
github.com/modelpack/model-csi-driver/pkg/service/cache.go:28:                   75.0%
github.com/modelpack/model-csi-driver/pkg/service/cache.go:37:                   73.9%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:101:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:115:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:122:             76.2%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:157:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:164:             50.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:189:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:196:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:203:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:210:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:217:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:22:              84.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:249:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:256:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:263:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:270:             100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller.go:62:              91.3%
github.com/modelpack/model-csi-driver/pkg/service/controller_local.go:143:       79.3%
github.com/modelpack/model-csi-driver/pkg/service/controller_local.go:184:       28.1%
github.com/modelpack/model-csi-driver/pkg/service/controller_local.go:25:        57.3%
github.com/modelpack/model-csi-driver/pkg/service/controller_remote.go:134:      0.0%
github.com/modelpack/model-csi-driver/pkg/service/controller_remote.go:199:      0.0%
github.com/modelpack/model-csi-driver/pkg/service/controller_remote.go:34:       100.0%
github.com/modelpack/model-csi-driver/pkg/service/controller_remote.go:46:       0.0%
github.com/modelpack/model-csi-driver/pkg/service/controller_remote.go:60:       0.0%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:107:         66.7%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:151:         71.4%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:190:         88.9%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:46:          100.0%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:54:          82.4%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server.go:84:          83.3%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:124: 83.3%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:156: 90.9%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:185: 85.7%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:203: 100.0%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:27:  83.3%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:38:  100.0%
github.com/modelpack/model-csi-driver/pkg/service/dynamic_server_handler.go:56:  80.0%
github.com/modelpack/model-csi-driver/pkg/service/identity.go:21:                100.0%
github.com/modelpack/model-csi-driver/pkg/service/identity.go:41:                100.0%
github.com/modelpack/model-csi-driver/pkg/service/identity.go:9:                 100.0%
github.com/modelpack/model-csi-driver/pkg/service/kube.go:14:                    0.0%
github.com/modelpack/model-csi-driver/pkg/service/kube.go:25:                    0.0%
github.com/modelpack/model-csi-driver/pkg/service/kube.go:34:                    0.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:114:                  89.5%
github.com/modelpack/model-csi-driver/pkg/service/model.go:157:                  91.7%
github.com/modelpack/model-csi-driver/pkg/service/model.go:177:                  100.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:27:                   80.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:48:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:60:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:68:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/model.go:96:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:123:                   94.4%
github.com/modelpack/model-csi-driver/pkg/service/node.go:154:                   66.7%
github.com/modelpack/model-csi-driver/pkg/service/node.go:202:                   94.4%
github.com/modelpack/model-csi-driver/pkg/service/node.go:233:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:241:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:249:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:269:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:26:                    100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:34:                    100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:42:                    100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:46:                    100.0%
github.com/modelpack/model-csi-driver/pkg/service/node.go:50:                    48.8%
github.com/modelpack/model-csi-driver/pkg/service/node_dynamic.go:18:            73.3%
github.com/modelpack/model-csi-driver/pkg/service/node_dynamic.go:52:            68.4%
github.com/modelpack/model-csi-driver/pkg/service/node_static.go:16:             81.8%
github.com/modelpack/model-csi-driver/pkg/service/node_static.go:42:             69.2%
github.com/modelpack/model-csi-driver/pkg/service/node_static_inline.go:18:      0.0%
github.com/modelpack/model-csi-driver/pkg/service/node_static_inline.go:56:      57.1%
github.com/modelpack/model-csi-driver/pkg/service/puller.go:42:                  26.9%
github.com/modelpack/model-csi-driver/pkg/service/quota.go:21:                   94.1%
github.com/modelpack/model-csi-driver/pkg/service/quota.go:49:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/quota.go:55:                   100.0%
github.com/modelpack/model-csi-driver/pkg/service/quota.go:67:                   84.2%
github.com/modelpack/model-csi-driver/pkg/service/service.go:41:                 100.0%
github.com/modelpack/model-csi-driver/pkg/service/service.go:45:                 0.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:112:                 100.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:121:                 87.5%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:146:                 71.7%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:237:                 77.1%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:28:                  100.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:34:                  100.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:46:                  100.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:62:                  100.0%
github.com/modelpack/model-csi-driver/pkg/service/worker.go:73:                  77.3%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:100:                    100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:107:                    86.7%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:139:                    100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:174:                    88.9%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:195:                    100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:28:                     100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:34:                     100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:46:                     100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:53:                     100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:69:                     100.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:76:                     80.0%
github.com/modelpack/model-csi-driver/pkg/status/hook.go:87:                     80.0%
github.com/modelpack/model-csi-driver/pkg/status/status.go:110:                  87.5%
github.com/modelpack/model-csi-driver/pkg/status/status.go:125:                  83.3%
github.com/modelpack/model-csi-driver/pkg/status/status.go:136:                  100.0%
github.com/modelpack/model-csi-driver/pkg/status/status.go:51:                   75.0%
github.com/modelpack/model-csi-driver/pkg/status/status.go:68:                   100.0%
github.com/modelpack/model-csi-driver/pkg/status/status.go:74:                   66.7%
github.com/modelpack/model-csi-driver/pkg/status/status.go:92:                   100.0%
github.com/modelpack/model-csi-driver/pkg/tracing/tracing.go:22:                 85.7%
github.com/modelpack/model-csi-driver/pkg/tracing/tracing.go:36:                 55.6%
github.com/modelpack/model-csi-driver/pkg/tracing/tracing.go:72:                 100.0%
github.com/modelpack/model-csi-driver/pkg/tracing/tracing.go:79:                 81.8%
github.com/modelpack/model-csi-driver/pkg/utils/utils.go:16:                     100.0%
github.com/modelpack/model-csi-driver/pkg/utils/utils.go:34:                     75.0%
github.com/modelpack/model-csi-driver/pkg/utils/utils.go:54:                     81.8%

total:											(statements)				71.4%

imeoer (Collaborator) left a comment

Thanks for this PR, but it introduces an issue: if pulling the model is no longer subject to kubelet’s 2-minute volume mount timeout, the model volume directory may be mounted into the pod prematurely, before the model weight files are fully downloaded.

Comment thread pkg/service/worker.go
  // when kubelet times out and retries
  var cancel context.CancelFunc
- ctx, cancel = context.WithCancel(ctx)
+ ctx, cancel = context.WithCancel(context.WithoutCancel(ctx))
Collaborator

Thanks for the PR! Should we use WithoutCancel only for the s.worker.PullModel call in nodePublishVolumeStaticInlineVolume?

tjandy98 (Author) commented May 7, 2026

> Thanks for this PR, but it introduces an issue: if pulling the model is no longer subject to kubelet’s 2-minute volume mount timeout, the model volume directory may be mounted into the pod prematurely, before the model weight files are fully downloaded.

Thanks for the feedback. From my understanding, mount only runs if PullModel succeeds, which means all the files are on disk.

imeoer (Collaborator) commented May 8, 2026

> Thanks for the feedback. From my understanding, mount only runs if PullModel succeeds, which means all the files are on disk.

Okay, so will the pod likely see an empty directory initially and then see the model contents later?

If so, we need a clear mechanism for the application process in the pod to know when the model directory is ready. This is no different from dynamic API mounting, where the business logic must explicitly adapt:

s.echo.POST("/api/v1/volumes/:volume_name/mounts", handler.CreateVolume)

tjandy98 (Author) commented May 8, 2026

> > Thanks for the feedback. From my understanding, mount only runs if PullModel succeeds, which means all the files are on disk.
>
> Okay, so will the pod likely see an empty directory initially and then see the model contents later?
>
> If so, we need a clear mechanism for the application process in the pod to know when the model directory is ready. This is no different from dynamic API mounting, where the business logic must explicitly adapt:
>
> s.echo.POST("/api/v1/volumes/:volume_name/mounts", handler.CreateVolume)

The pod will not be marked as ready until all the layers have been pulled completely.

imeoer (Collaborator) commented May 8, 2026

> The pod will not be marked as ready until all the layers have been pulled completely.

Do you mean the inference engine process will initially fail and then eventually reach the ready state via the container’s restart backoff retry mechanism?

That seems viable. Could we control this behavior via a parameter, e.g., model.csi.modelpack.org/async-mount?

  volumes:
  - name: model-volume
    csi:
      driver: model.csi.modelpack.org
      volumeAttributes:
        model.csi.modelpack.org/reference: "registry.example.com/models/qwen3-0.6b:latest"
        model.csi.modelpack.org/async-mount: "true"
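
A minimal sketch of how the proposed attribute might be read on the driver side, assuming the volume attributes arrive as a plain `map[string]string` (as the CSI volume context does in NodePublishVolume). The `async-mount` key is only a proposal in this thread, and `asyncMountEnabled` is a hypothetical helper, not existing driver code.

```go
package main

import (
	"fmt"
	"strconv"
)

// asyncMountKey is the attribute name proposed in this thread; it is not
// known to be implemented.
const asyncMountKey = "model.csi.modelpack.org/async-mount"

// asyncMountEnabled (hypothetical) reports whether the attribute is set to a
// true value. A missing key means false; an unparsable value is an error.
func asyncMountEnabled(attrs map[string]string) (bool, error) {
	raw, ok := attrs[asyncMountKey]
	if !ok {
		return false, nil
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("invalid %s value %q: %w", asyncMountKey, raw, err)
	}
	return enabled, nil
}

func main() {
	attrs := map[string]string{
		"model.csi.modelpack.org/reference": "registry.example.com/models/qwen3-0.6b:latest",
		asyncMountKey:                       "true",
	}
	enabled, err := asyncMountEnabled(attrs)
	fmt.Println(enabled, err) // true <nil>
}
```

Defaulting to false when the key is absent would keep the current synchronous behavior for existing pod specs.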
