fix: prevent unsuspend worker death and workqueue leak on cache miss #56
Conversation
Code Review
This pull request effectively addresses a critical stability issue in the `unSuspendResourceBindingTaskWorker`. The fix to prevent worker death and workqueue leaks on a cache miss is correct, and the logic of using `continue` and ensuring `Done()` is called is sound. The addition of comprehensive unit tests covering these specific failure scenarios is excellent and greatly improves confidence in the fix.
While reviewing the worker logic, I identified a couple of pre-existing potential issues in the `unSuspendResourceBindingTaskWorker` function that are outside the scope of this PR's changes but are worth considering for a follow-up to further improve stability:
- Unbounded Goroutine Spawning: The call `go dc.unSuspendResourceBinding(...)` launches a new goroutine for every successfully processed item. Since there are multiple workers pulling from the queue, this can lead to an unbounded number of concurrent goroutines making API patch requests, which could overwhelm the Kubernetes API server. It might be safer to make this call synchronously, relying on `workerNum` to control concurrency.
- Potential Nil Pointer Dereference: The `unSuspendResourceBinding` function receives `rbi.ResourceBinding` but doesn't check if it's `nil` before dereferencing it (e.g., `rb.Namespace`). If it's possible for a `ResourceBindingInfo` in the cache to have a `nil` `ResourceBinding` field, this could cause a panic.
These are not regressions from this PR but opportunities for future hardening. Overall, this is a great contribution that fixes a significant bug.
Signed-off-by: goyalpalak18 <goyalpalak1806@gmail.com>
Force-pushed from c9e10c2 to 316a621
Hi @hwdef @lowang-bh, could you please let me know whether you think this fix's reasoning makes sense, or if you would prefer I approach this in another way?
Summary
While reviewing the dispatcher controller logic, I uncovered a critical stability issue in the `unSuspendResourceBindingTaskWorker`.

Currently, if a `ResourceBinding` is missing from the cache (e.g., due to a concurrent delete or a transient API failure), the worker loop incorrectly `break`s instead of `continue`ing. This kills the worker goroutine and permanently leaks the workqueue item, causing jobs to get stuck in `Pending` forever.

The Problem
There were two compounding failures in `pkg/dispatcher/cache/resource_binding_processor.go`:

1. When `rbi, ok := ...` returns `false` (cache miss), the code hits a `break`. This exits the `for` loop entirely, terminating the worker thread. If this happens enough times, the controller eventually runs out of workers.
2. The `Done()` method was only being called in the success path. When the `break` happened, the item remained in the workqueue's internal "processing" set, so even if the event was re-queued later, the workqueue would ignore it because it thinks the key is still being processed.
I have modified `unSuspendResourceBindingTaskWorker` to:

- Call `Done()`: ensure the item is properly released from the workqueue on a cache miss.
- Use `continue`: skip the current item and proceed to the next one, keeping the worker alive.
- Add a `nil` check for `dc.resourceBindingInfos[key.Namespace]` to prevent potential nil pointer dereferences if the namespace map is uninitialized.
I added a new test file, `pkg/dispatcher/cache/resource_binding_processor_test.go`, with the following cases:

- `TestWorkerContinuesAfterCacheMiss`: verifies the worker processes subsequent items after a cache miss (proving the loop didn't break).
- `TestWorkerReleasesWorkqueueItemOnCacheMiss`: verifies `Done()` is called so the key can be re-processed later.
- `TestWorkerHandlesNilNamespaceMap`: verifies the controller handles a nil namespace map without panicking.