Skip to content

feat(inkless): add controller guards for classic-to-diskless switch [POD-2464]#606

Open
EelisK wants to merge 1 commit into
mainfrom
EelisK/POD-2464
Open

feat(inkless): add controller guards for classic-to-diskless switch [POD-2464]#606
EelisK wants to merge 1 commit into
mainfrom
EelisK/POD-2464

Conversation

@EelisK
Copy link
Copy Markdown
Member

@EelisK EelisK commented May 22, 2026

Enforce the following guards when altering the topic config to use diskless.enable=true:

  • Partitions are online
  • No reassignment in progress
  • Enough ISRs
  • No unclean leader election (active recovery nor enablement)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds controller-side validation to block classic → diskless topic switches (diskless.enable=true) unless the topic’s partitions are in a safe/healthy state, and also prevents unclean leader election behavior/config changes from interfering while a switch is pending.

Changes:

  • Added pre-validation in the controller for classic → diskless switch requests (offline partitions, reassignment in progress, under-replication, ELR/last-known ELR, recovering from unclean election, and unclean election enablement).
  • Made unclean leader election triggering and reassignment-revert logic ineligible when a partition has a pending classic → diskless switch.
  • Added unit + cluster integration tests covering the new guardrails and error cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java Implements switch precondition validation and blocks unclean election actions/config while switch is pending.
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java Calls the new pre-validation before applying incremental config changes; merges pre-validation errors into the response.
metadata/src/test/java/org/apache/kafka/controller/ReplicationControlManagerTest.java Adds unit tests for all new validation/error scenarios and pending-switch behavior.
core/src/test/java/kafka/server/InklessTopicTypeSwitcherClusterTest.java Adds integration tests validating rejections at the Admin API level.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@EelisK EelisK force-pushed the EelisK/POD-2464 branch from c29701b to 202b830 Compare May 22, 2026 07:45
@EelisK EelisK marked this pull request as ready for review May 22, 2026 08:43
@EelisK EelisK requested a review from giuseppelillo May 22, 2026 08:43
Enforce the following guards when altering the topic config to use
diskless.enable=true:
  - Partitions are online
  - No reassignment in progress
  - Enough ISRs
  - No unclean leader election (active recovery nor enablement)

[POD-2464]
@EelisK EelisK force-pushed the EelisK/POD-2464 branch from 202b830 to 6df88ea Compare May 22, 2026 11:06
Copy link
Copy Markdown
Contributor

@jeqo jeqo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of shape comments.

* Check if unclean leader election would be enabled for a topic after its topic-level
* override is removed (i.e., resolving from node/cluster/static config only).
*/
boolean uncleanLeaderElectionEnabledExcludingTopicOverride(String topicName) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topicName is unused here. Should it be removed?
If it should, then maybe rename:

Suggested change
boolean uncleanLeaderElectionEnabledExcludingTopicOverride(String topicName) {
boolean uncleanLeaderElectionEnabledAtClusterLevel() {

Comment on lines +386 to +392
// Attempt to switch to diskless (should fail with INVALID_CONFIG)
final ExecutionException ex = assertThrows(ExecutionException.class, () ->
alterTopicConfig(admin, topic, Map.of(TopicConfig.DISKLESS_ENABLE_CONFIG, "true")));
assertTrue(ex.getCause().getMessage().contains("INVALID_CONFIG") ||
ex.getCause().getMessage().contains("under-replicated") ||
ex.getCause().getMessage().contains("offline"),
"Expected INVALID_CONFIG for unhealthy topic, got: " + ex.getCause().getMessage());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a more deterministic assertion? e.g. using type checking and compose the expected error message instead of this assertion?

We can use a parametrized test if different failure scenarios are expected (or separate tests?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants