feat(inkless): add controller guards for classic-to-diskless switch [POD-2464]#606
feat(inkless): add controller guards for classic-to-diskless switch [POD-2464]#606EelisK wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds controller-side validation to block classic → diskless topic switches (diskless.enable=true) unless the topic’s partitions are in a safe/healthy state, and also prevents unclean leader election behavior/config changes from interfering while a switch is pending.
Changes:
- Added pre-validation in the controller for classic → diskless switch requests (offline partitions, reassignment in progress, under-replication, ELR/last-known ELR, recovering from unclean election, and unclean election enablement).
- Made unclean leader election triggering and reassignment-revert logic ineligible when a partition has a pending classic → diskless switch.
- Added unit + cluster integration tests covering the new guardrails and error cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| metadata/src/main/java/org/apache/kafka/controller/ReplicationControlManager.java | Implements switch precondition validation and blocks unclean election actions/config while switch is pending. |
| metadata/src/main/java/org/apache/kafka/controller/QuorumController.java | Calls the new pre-validation before applying incremental config changes; merges pre-validation errors into the response. |
| metadata/src/test/java/org/apache/kafka/controller/ReplicationControlManagerTest.java | Adds unit tests for all new validation/error scenarios and pending-switch behavior. |
| core/src/test/java/kafka/server/InklessTopicTypeSwitcherClusterTest.java | Adds integration tests validating rejections at the Admin API level. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Enforce the following guards when altering the topic config to use diskless.enable=true: - Partitions are online - No reassignment in progress - Enough ISRs - No unclean leader election (active recovery nor enablement) [POD-2464]
jeqo
left a comment
There was a problem hiding this comment.
A couple of shape comments.
| * Check if unclean leader election would be enabled for a topic after its topic-level | ||
| * override is removed (i.e., resolving from node/cluster/static config only). | ||
| */ | ||
| boolean uncleanLeaderElectionEnabledExcludingTopicOverride(String topicName) { |
There was a problem hiding this comment.
topicName is unused here. Should it be removed?
If it should, then maybe rename:
| boolean uncleanLeaderElectionEnabledExcludingTopicOverride(String topicName) { | |
| boolean uncleanLeaderElectionEnabledAtClusterLevel() { |
| // Attempt to switch to diskless (should fail with INVALID_CONFIG) | ||
| final ExecutionException ex = assertThrows(ExecutionException.class, () -> | ||
| alterTopicConfig(admin, topic, Map.of(TopicConfig.DISKLESS_ENABLE_CONFIG, "true"))); | ||
| assertTrue(ex.getCause().getMessage().contains("INVALID_CONFIG") || | ||
| ex.getCause().getMessage().contains("under-replicated") || | ||
| ex.getCause().getMessage().contains("offline"), | ||
| "Expected INVALID_CONFIG for unhealthy topic, got: " + ex.getCause().getMessage()); |
There was a problem hiding this comment.
Should this be a more deterministic assertion? e.g. using type checking and compose the expected error message instead of this assertion?
We can use a parametrized test if different failure scenarios are expected (or separate tests?)
Enforce the following guards when altering the topic config to use diskless.enable=true: