Skip to content

fix: prevent unsuspend worker death and workqueue leak on cache miss#56

Open
goyalpalak18 wants to merge 1 commit intovolcano-sh:mainfrom
goyalpalak18:fix/unsuspend-worker-break-and-workqueue-leak
Open

fix: prevent unsuspend worker death and workqueue leak on cache miss#56
goyalpalak18 wants to merge 1 commit intovolcano-sh:mainfrom
goyalpalak18:fix/unsuspend-worker-break-and-workqueue-leak

Conversation

@goyalpalak18
Copy link
Copy Markdown

Summary

While reviewing the dispatcher controller logic, I uncovered a critical stability issue in the unSuspendResourceBindingTaskWorker.

Currently, if a ResourceBinding is missing from the cache (e.g., due to a concurrent delete or a transient API failure), the worker loop incorrectly breaks instead of continueing. This kills the worker goroutine and permanently leaks the workqueue item, causing jobs to get stuck in Pending forever.

The Problem

There were two compounding failures in pkg/dispatcher/cache/resource_binding_processor.go:

  1. Worker Death: When rbi, ok := ... returns false (cache miss), the code hits a break. This exits the for loop entirely, terminating the worker thread. If this happens enough times, the controller eventually runs out of workers.
  2. Workqueue Leak: The Done() method was only being called in the success path. When the break happened, the item remained in the workqueue's internal "processing" set. This means even if the event was re-queued later, the workqueue would ignore it because it thinks it's still being processed.

The Fix

I have modified unSuspendResourceBindingTaskWorker to:

  1. Call Done(): Ensure the item is properly released from the workqueue on a cache miss.
  2. Use continue: Skip the current item and proceed to the next one, keeping the worker alive.
  3. Add Safety: Added a defensive nil check for dc.resourceBindingInfos[key.Namespace] to prevent potential nil pointer dereferences if the namespace map is uninitialized.

Tests

I added a new test file pkg/dispatcher/cache/resource_binding_processor_test.go with the following cases:

  • TestWorkerContinuesAfterCacheMiss: Verifies the worker processes subsequent items after a cache miss (proving the loop didn't break).
  • TestWorkerReleasesWorkqueueItemOnCacheMiss: Verifies Done() is called so the key can be re-processed later.
  • TestWorkerHandlesNilNamespaceMap: Verifies the controller handles a nil namespace map without panicking.

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

Welcome @goyalpalak18! It looks like this is your first PR to volcano-sh/volcano-global 🎉

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @goyalpalak18, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical stability issue within the dispatcher controller's unSuspendResourceBindingTaskWorker. Previously, a missing ResourceBinding in the cache would cause the worker to prematurely exit and leak the workqueue item, leading to stuck jobs. The changes ensure that workers gracefully handle cache misses by continuing to process other items and properly releasing the affected workqueue item, significantly improving the controller's reliability and preventing resource exhaustion.

Highlights

  • Worker Stability: Corrected unSuspendResourceBindingTaskWorker logic to prevent worker goroutine termination on cache misses, ensuring continuous operation.
  • Workqueue Item Release: Implemented proper workqueue item release (Done()) on cache misses to avoid item leaks, preventing jobs from getting stuck in a pending state.
  • Nil Namespace Map Handling: Added a defensive nil check for namespace maps to enhance robustness and prevent potential panics when a namespace entry is uninitialized.
  • New Unit Tests: Introduced new unit tests to validate the worker's behavior, specifically covering scenarios like continuing after a cache miss, proper workqueue item release, and graceful handling of nil namespace maps.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/dispatcher/cache/resource_binding_processor.go
    • Modified unSuspendResourceBindingTaskWorker to replace break with continue on cache miss, allowing the worker to process subsequent items.
    • Added dc.unSuspendRBTaskQueue.Done(obj) call within the cache miss handling block to ensure the workqueue item is properly released.
    • Introduced a check for nil namespace map (nsMap := dc.resourceBindingInfos[key.Namespace]) before attempting to access key.Name to prevent nil pointer dereferences.
  • pkg/dispatcher/cache/resource_binding_processor_test.go
    • Added a new test file for resource_binding_processor.
    • Implemented TestWorkerContinuesAfterCacheMiss to verify that the worker processes subsequent items after encountering a cache miss.
    • Implemented TestWorkerReleasesWorkqueueItemOnCacheMiss to confirm that Done() is correctly invoked on a cache miss, allowing items to be re-queued.
    • Implemented TestWorkerHandlesNilNamespaceMap to ensure the worker gracefully handles cases where the namespace map for a namespace is nil.
Activity
  • No specific activity (comments, reviews, approvals) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical stability issue in the unSuspendResourceBindingTaskWorker. The fix to prevent worker death and workqueue leaks on a cache miss is correct, and the logic of using continue and ensuring Done() is called is sound. The addition of comprehensive unit tests to cover these specific failure scenarios is excellent and greatly improves confidence in the fix.

While reviewing the worker logic, I identified a couple of pre-existing potential issues in the unSuspendResourceBindingTaskWorker function that are outside the scope of this PR's changes but are worth considering for a follow-up to further improve stability:

  1. Unbounded Goroutine Spawning: The call go dc.unSuspendResourceBinding(...) launches a new goroutine for every successfully processed item. Since there are multiple workers pulling from the queue, this can lead to an unbounded number of concurrent goroutines making API patch requests, which could overwhelm the Kubernetes API server. It might be safer to make this call synchronously, relying on the workerNum to control concurrency.
  2. Potential Nil Pointer Dereference: The unSuspendResourceBinding function receives rbi.ResourceBinding but doesn't check if it's nil before dereferencing it (e.g., rb.Namespace). If it's possible for a ResourceBindingInfo in the cache to have a nil ResourceBinding field, this could cause a panic.

These are not regressions from this PR but opportunities for future hardening. Overall, this is a great contribution that fixes a significant bug.

Comment thread pkg/dispatcher/cache/resource_binding_processor_test.go Outdated
Comment thread pkg/dispatcher/cache/resource_binding_processor_test.go Outdated
Comment thread pkg/dispatcher/cache/resource_binding_processor_test.go
Comment thread pkg/dispatcher/cache/resource_binding_processor_test.go
Signed-off-by: goyalpalak18 <goyalpalak1806@gmail.com>
@goyalpalak18 goyalpalak18 force-pushed the fix/unsuspend-worker-break-and-workqueue-leak branch from c9e10c2 to 316a621 Compare February 7, 2026 18:48
@goyalpalak18
Copy link
Copy Markdown
Author

Hi @hwdef @lowang-bh, could you please let me know whether you think this fix's reasoning makes sense or if you would prefer I approach this in another way?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants