Skip to content

fix: bind file resource refCnt to collection lifecycle to prevent panic#48893

Merged
sre-ci-robot merged 3 commits intomilvus-io:masterfrom
aoiasd:file_resource_panic
Apr 22, 2026
Merged

fix: bind file resource refCnt to collection lifecycle to prevent panic#48893
sre-ci-robot merged 3 commits intomilvus-io:masterfrom
aoiasd:file_resource_panic

Conversation

@aoiasd
Copy link
Copy Markdown
Contributor

@aoiasd aoiasd commented Apr 9, 2026

relate: #48612

Summary

  • Increment fileResourceRefCnt during validateSchema instead of in the async ack callback's AddCollection, closing the TOCTOU race where RemoveFileResource could delete a resource between validation and AddCollection
  • On failure before Broadcast, refCnt is decremented immediately; on restart, refCnt for pending broadcast tasks is recovered from etcd before rootcoord becomes Healthy
  • Remove refCnt++ from addCollectionMeta since it's now done at validation time (reload path unchanged)

@sre-ci-robot sre-ci-robot requested review from chyezh and yhmo April 9, 2026 11:50
@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Apr 9, 2026
@mergify mergify Bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels Apr 9, 2026
@sre-ci-robot
Copy link
Copy Markdown
Contributor

[ci-v2-notice]
Notice: New ci-v2 system is enabled for this PR.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-build-all // for ci-v2/build-all (multi-arch builds)
  • /ci-rerun-buildenv // for ci-v2/build-env (build milvus-env builder images)
  • /ci-rerun-ut-integration // for ci-v2/ut-integration, will rerun ci-v2/build
  • /ci-rerun-ut-go // for ci-v2/ut-go, will rerun ci-v2/build
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp, will rerun ci-v2/build
  • /ci-rerun-e2e-default // for ci-v2/e2e-default
  • /ci-rerun-e2e-amd // for ci-v2/e2e-amd (e2e pool dispatcher)
  • /ci-rerun-build-ut-cov // for ci-v2/build-ut-cov (build + unit tests in one pipeline)
  • /ci-rerun-gosdk // for ci-v2/go-sdk (Go SDK E2E tests, ARM)

If you have any questions or requests, please contact @zhikunyao.

@sre-ci-robot
Copy link
Copy Markdown
Contributor

✅ CI Loop Results 5b6391b

Stage Result Duration Tests
✅ Build SUCCESS 8.0min -

Total: 18min | Pipeline | Artifacts

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 21.59091% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.95%. Comparing base (353ce56) to head (daf5198).
⚠️ Report is 53 commits behind head on master.

Files with missing lines Patch % Lines
internal/rootcoord/meta_table.go 0.00% 39 Missing ⚠️
...amingcoord/server/broadcaster/broadcast_manager.go 31.57% 12 Missing and 1 partial ⚠️
internal/rootcoord/create_collection_task.go 46.15% 5 Missing and 2 partials ⚠️
internal/rootcoord/root_coord.go 50.00% 3 Missing and 1 partial ⚠️
...ernal/rootcoord/ddl_callbacks_create_collection.go 25.00% 3 Missing ⚠️
...ingcoord/server/broadcaster/broadcast/singleton.go 40.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #48893      +/-   ##
==========================================
- Coverage   77.97%   77.95%   -0.03%     
==========================================
  Files        2168     2168              
  Lines      356851   356934      +83     
==========================================
- Hits       278271   278239      -32     
- Misses      70010    70108      +98     
- Partials     8570     8587      +17     
Components Coverage Δ
Client 79.25% <ø> (ø)
Core 84.45% <ø> (ø)
Go 76.23% <21.59%> (-0.04%) ⬇️
Files with missing lines Coverage Δ
...ernal/rootcoord/ddl_callbacks_create_collection.go 85.62% <25.00%> (-1.49%) ⬇️
...ingcoord/server/broadcaster/broadcast/singleton.go 62.50% <40.00%> (-3.22%) ⬇️
internal/rootcoord/root_coord.go 76.78% <50.00%> (-0.09%) ⬇️
internal/rootcoord/create_collection_task.go 79.46% <46.15%> (-0.85%) ⬇️
...amingcoord/server/broadcaster/broadcast_manager.go 74.36% <31.57%> (-4.32%) ⬇️
internal/rootcoord/meta_table.go 73.55% <0.00%> (-1.33%) ⬇️

... and 30 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sre-ci-robot
Copy link
Copy Markdown
Contributor

❌ CI Loop Results 5b6391b

Stage Result Duration Tests

Total: 3min | Pipeline | Artifacts

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 9, 2026
@aoiasd
Copy link
Copy Markdown
Contributor Author

aoiasd commented Apr 13, 2026

/ci-rerun-e2e-default

@aoiasd
Copy link
Copy Markdown
Contributor Author

aoiasd commented Apr 13, 2026

/ci-rerun-ut-go

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 15, 2026

@aoiasd Please associate the related issue to the body of your Pull Request. (eg. "issue: #")

@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 15, 2026
@aoiasd aoiasd force-pushed the file_resource_panic branch from 42cae53 to 3f876cc Compare April 15, 2026 12:43
@sre-ci-robot sre-ci-robot added low-code-coverage add test-label from zhikun, diff coverage > 80% and removed low-code-coverage add test-label from zhikun, diff coverage > 80% labels Apr 15, 2026
Increment fileResourceRefCnt during validateSchema (when file resource
IDs are resolved), rather than in the async ack callback's AddCollection.
This closes the TOCTOU race window where RemoveFileResource could delete
a resource between validation and AddCollection, causing streaming node
to panic when creating the tokenizer.

On failure before Broadcast, refCnt is decremented immediately. On
restart, refCnt for pending broadcast tasks is recovered from etcd
before rootcoord becomes Healthy.

issue: milvus-io#48612

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
@aoiasd aoiasd force-pushed the file_resource_panic branch from 3f876cc to d80700d Compare April 16, 2026 02:37
…fCnt

1. Move RecoverFileResourceRefCnt from restore() to Init(), before
   RegisterDDLCallbacks, so ack callbacks won't race with recovery.

2. Remove releaseFileResources on Broadcast error: the task is already
   in the scheduler and will retry until success.

3. Add underflow guard (>0 check) for refCnt decrement in both
   DecFileResourceRefCnt and DropCollection paths.

4. Add warning log in RecoverFileResourceRefCnt when pending task
   references missing file resources.

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
@aoiasd aoiasd force-pushed the file_resource_panic branch from 59e5418 to 950bd82 Compare April 16, 2026 07:53
@sre-ci-robot sre-ci-robot added the low-code-coverage add test-label from zhikun, diff coverage > 80% label Apr 16, 2026
aoiasd added a commit to aoiasd/milvus that referenced this pull request Apr 16, 2026
…nt panic (milvus-io#48893)

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
@mergify mergify Bot added the ci-passed label Apr 16, 2026
@sre-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aoiasd, zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aoiasd added a commit to aoiasd/milvus that referenced this pull request Apr 22, 2026
…nt panic (milvus-io#48893)

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
@sparknack
Copy link
Copy Markdown
Contributor

/lgtm

@sre-ci-robot sre-ci-robot merged commit 3a4e37a into milvus-io:master Apr 22, 2026
17 of 21 checks passed
aoiasd added a commit to aoiasd/milvus that referenced this pull request Apr 23, 2026
…nt panic (milvus-io#48893)

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
sre-ci-robot pushed a commit that referenced this pull request Apr 23, 2026
…nt panic (#48894)

## Summary
- Increment `fileResourceRefCnt` during `validateSchema` instead of in
the async ack callback's `AddCollection`, closing the TOCTOU race where
`RemoveFileResource` could delete a resource between validation and
`AddCollection`
- On failure before `Broadcast`, refCnt is decremented immediately; on
restart, refCnt for pending broadcast tasks is recovered from etcd
before rootcoord becomes Healthy
- Remove refCnt++ from `addCollectionMeta` since it's now done at
validation time (reload path unchanged)

## Test plan
- [ ] Existing file resource E2E tests pass (the race that caused
#48612)
- [ ] CreateCollection with file resource → verify refCnt incremented
- [ ] RemoveFileResource blocked during in-flight CreateCollection
- [ ] Restart during CreateCollection → verify refCnt recovered from
pending broadcast tasks

issue: #48612
pr: #48893

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved ci-passed dco-passed DCO check passed. kind/bug Issues or changes related a bug lgtm low-code-coverage add test-label from zhikun, diff coverage > 80% size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants