fix: prevent unsuspend worker death and workqueue leak on cache miss #56
Conversation
Code Review
This pull request effectively addresses a critical stability issue in the `unSuspendResourceBindingTaskWorker`. The fix to prevent worker death and workqueue leaks on a cache miss is correct, and the logic of using `continue` and ensuring `Done()` is called is sound. The addition of comprehensive unit tests covering these specific failure scenarios is excellent and greatly improves confidence in the fix.
While reviewing the worker logic, I identified a couple of pre-existing potential issues in the `unSuspendResourceBindingTaskWorker` function that are outside the scope of this PR's changes but are worth considering for a follow-up to further improve stability:
- Unbounded Goroutine Spawning: The call `go dc.unSuspendResourceBinding(...)` launches a new goroutine for every successfully processed item. Since there are multiple workers pulling from the queue, this can lead to an unbounded number of concurrent goroutines making API patch requests, which could overwhelm the Kubernetes API server. It might be safer to make this call synchronously, relying on `workerNum` to control concurrency.
- Potential Nil Pointer Dereference: The `unSuspendResourceBinding` function receives `rbi.ResourceBinding` but doesn't check if it's `nil` before dereferencing it (e.g., `rb.Namespace`). If it's possible for a `ResourceBindingInfo` in the cache to have a `nil` `ResourceBinding` field, this could cause a panic.
These are not regressions from this PR but opportunities for future hardening. Overall, this is a great contribution that fixes a significant bug.
Signed-off-by: goyalpalak18 <goyalpalak1806@gmail.com>
Force-pushed from c9e10c2 to 316a621
Hi @hwdef @lowang-bh, could you please let me know whether you think this fix's reasoning makes sense, or if you would prefer I approach this in another way?
Summary
While reviewing the dispatcher controller logic, I uncovered a critical stability issue in the `unSuspendResourceBindingTaskWorker`.

Currently, if a `ResourceBinding` is missing from the cache (e.g., due to a concurrent delete or a transient API failure), the worker loop incorrectly `break`s instead of `continue`ing. This kills the worker goroutine and permanently leaks the workqueue item, causing jobs to get stuck in `Pending` forever.

The Problem
There were two compounding failures in `pkg/dispatcher/cache/resource_binding_processor.go`:

1. When `rbi, ok := ...` returns `false` (cache miss), the code hits a `break`. This exits the `for` loop entirely, terminating the worker thread. If this happens enough times, the controller eventually runs out of workers.
2. The `Done()` method was only being called in the success path. When the `break` happened, the item remained in the workqueue's internal "processing" set, so even if the event was re-queued later, the workqueue would ignore it because it thinks the key is still being processed.
I have modified `unSuspendResourceBindingTaskWorker` to:

- Call `Done()`: ensure the item is properly released from the workqueue on a cache miss.
- Use `continue`: skip the current item and proceed to the next one, keeping the worker alive.
- Add a `nil` check for `dc.resourceBindingInfos[key.Namespace]` to prevent potential nil pointer dereferences if the namespace map is uninitialized.
I added a new test file, `pkg/dispatcher/cache/resource_binding_processor_test.go`, with the following cases:

- `TestWorkerContinuesAfterCacheMiss`: verifies the worker processes subsequent items after a cache miss (proving the loop didn't break).
- `TestWorkerReleasesWorkqueueItemOnCacheMiss`: verifies `Done()` is called so the key can be re-processed later.
- `TestWorkerHandlesNilNamespaceMap`: verifies the controller handles a nil namespace map without panicking.