fix: bind file resource refCnt to collection lifecycle to prevent panic #48893
sre-ci-robot merged 3 commits into milvus-io:master
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #48893      +/-   ##
==========================================
- Coverage   77.97%   77.95%   -0.03%
==========================================
  Files        2168     2168
  Lines      356851   356934      +83
==========================================
- Hits       278271   278239      -32
- Misses      70010    70108      +98
- Partials     8570     8587      +17
Increment fileResourceRefCnt during validateSchema (when file resource IDs are resolved) rather than in AddCollection, which runs in the async ack callback. This closes the TOCTOU race window in which RemoveFileResource could delete a resource between validation and AddCollection, causing the streaming node to panic when creating the tokenizer.

On failure before Broadcast, refCnt is decremented immediately. On restart, refCnt for pending broadcast tasks is recovered from etcd before rootcoord becomes Healthy.

issue: milvus-io#48612
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
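The ordering described in this commit message can be illustrated with a minimal Go sketch. It is not the Milvus implementation: `resourceRegistry`, `Acquire`, `Release`, and `Remove` are hypothetical names chosen for illustration, and a plain mutex-protected map stands in for the real metadata store. The point it tries to show is that resolving the IDs and incrementing the count happen in one critical section, so a concurrent removal either fails or runs before validation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// resourceRegistry is a hypothetical stand-in for the file resource metadata;
// it is not the Milvus type.
type resourceRegistry struct {
	mu     sync.Mutex
	refCnt map[int64]int // resource ID -> number of holders
}

var (
	errNotFound = errors.New("file resource not found")
	errInUse    = errors.New("file resource still referenced")
)

func newResourceRegistry(ids ...int64) *resourceRegistry {
	r := &resourceRegistry{refCnt: make(map[int64]int)}
	for _, id := range ids {
		r.refCnt[id] = 0
	}
	return r
}

// Acquire resolves every ID and increments its count under a single lock,
// so there is no window between "validated" and "referenced".
func (r *resourceRegistry) Acquire(ids []int64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, id := range ids {
		if _, ok := r.refCnt[id]; !ok {
			return fmt.Errorf("resource %d: %w", id, errNotFound)
		}
	}
	for _, id := range ids {
		r.refCnt[id]++
	}
	return nil
}

// Release undoes Acquire, for example when a later step fails before the
// request is handed off. Counts never drop below zero.
func (r *resourceRegistry) Release(ids []int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, id := range ids {
		if r.refCnt[id] > 0 {
			r.refCnt[id]--
		}
	}
}

// Remove refuses to delete a resource that is still referenced.
func (r *resourceRegistry) Remove(id int64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	cnt, ok := r.refCnt[id]
	if !ok {
		return errNotFound
	}
	if cnt > 0 {
		return errInUse
	}
	delete(r.refCnt, id)
	return nil
}

func main() {
	reg := newResourceRegistry(1)
	_ = reg.Acquire([]int64{1})  // validation takes the reference
	fmt.Println(reg.Remove(1))   // rejected while the reference is held
	reg.Release([]int64{1})      // e.g. the create request failed early
	fmt.Println(reg.Remove(1))   // succeeds once nothing holds the resource
}
```

In this sketch the removal attempted while a create request is in flight returns an error instead of deleting the resource, which is the behaviour the commit message aims for.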
…fCnt

1. Move RecoverFileResourceRefCnt from restore() to Init(), before RegisterDDLCallbacks, so ack callbacks won't race with recovery.
2. Remove releaseFileResources on Broadcast error: the task is already in the scheduler and will retry until success.
3. Add underflow guard (>0 check) for refCnt decrement in both DecFileResourceRefCnt and DropCollection paths.
4. Add warning log in RecoverFileResourceRefCnt when pending task references missing file resources.

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
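Items 3 and 4 of this commit message can be sketched in Go as follows. This is an illustration under stated assumptions, not the code in the PR: `pendingTask`, `recoverRefCnt`, and `decRefCnt` are hypothetical names, and a plain map stands in for the in-memory reference counts that would already hold the counts of existing collections before recovery runs.

```go
package main

import (
	"fmt"
	"log"
)

// pendingTask is a hypothetical view of a broadcast task that was persisted
// but not yet acknowledged when the process stopped.
type pendingTask struct {
	Collection  string
	ResourceIDs []int64
}

// recoverRefCnt re-applies the references held by pending tasks. It returns
// the IDs that a task references but that are no longer known, after logging
// a warning for each one.
func recoverRefCnt(refCnt map[int64]int, tasks []pendingTask) []int64 {
	var missing []int64
	for _, t := range tasks {
		for _, id := range t.ResourceIDs {
			if _, ok := refCnt[id]; !ok {
				log.Printf("warn: pending task %q references missing file resource %d", t.Collection, id)
				missing = append(missing, id)
				continue
			}
			refCnt[id]++
		}
	}
	return missing
}

// decRefCnt decrements only when the count is positive, so a duplicate
// release cannot push the count below zero.
func decRefCnt(refCnt map[int64]int, id int64) {
	if refCnt[id] > 0 {
		refCnt[id]--
	}
}

func main() {
	// Resource 2 is already held by one existing collection.
	refCnt := map[int64]int{1: 0, 2: 1}
	tasks := []pendingTask{
		{Collection: "a", ResourceIDs: []int64{1, 2}},
		{Collection: "b", ResourceIDs: []int64{1, 3}},
	}
	missing := recoverRefCnt(refCnt, tasks)
	fmt.Println(refCnt[1], refCnt[2], missing)

	// Three decrements against a count of two stop at zero.
	for i := 0; i < 3; i++ {
		decRefCnt(refCnt, 1)
	}
	fmt.Println(refCnt[1])
}
```

Item 1 is about when such a recovery step runs: if it completes before any callback that could decrement a count is registered, a decrement cannot observe a count that has not yet been restored.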
…nt panic (milvus-io#48893) Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aoiasd, zhengbuqian
/lgtm
…nt panic (#48894)

## Summary
- Increment `fileResourceRefCnt` during `validateSchema` instead of in the async ack callback's `AddCollection`, closing the TOCTOU race where `RemoveFileResource` could delete a resource between validation and `AddCollection`
- On failure before `Broadcast`, refCnt is decremented immediately; on restart, refCnt for pending broadcast tasks is recovered from etcd before rootcoord becomes Healthy
- Remove refCnt++ from `addCollectionMeta` since it's now done at validation time (reload path unchanged)

## Test plan
- [ ] Existing file resource E2E tests pass (the race that caused #48612)
- [ ] CreateCollection with file resource → verify refCnt incremented
- [ ] RemoveFileResource blocked during in-flight CreateCollection
- [ ] Restart during CreateCollection → verify refCnt recovered from pending broadcast tasks

issue: #48612
pr: #48893

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
relate: #48612
Summary
- Increment `fileResourceRefCnt` during `validateSchema` instead of in the async ack callback's `AddCollection`, closing the TOCTOU race where `RemoveFileResource` could delete a resource between validation and `AddCollection`
- On failure before `Broadcast`, refCnt is decremented immediately; on restart, refCnt for pending broadcast tasks is recovered from etcd before rootcoord becomes Healthy
- Remove refCnt++ from `addCollectionMeta` since it's now done at validation time (reload path unchanged)