fix: cap onRequestError retries at 1 in DefaultRetryPolicy#929
Draft
nikagra wants to merge 1 commit into
Draft
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
DefaultRetryPolicy.onRequestError() now retries on the next host only once (nbRetry == 0), then rethrows. Previously, it returned tryNextHost unconditionally, allowing retries across all nodes in the query plan. In large clusters, a single timed-out request could cascade through every node. With a 6-second client timeout and 12 nodes, this means a single thread could be blocked for up to 72 seconds, leading to thread pool exhaustion (RejectedExecutionException). The same fix is applied to DowngradingConsistencyRetryPolicy. This makes onRequestError consistent with onUnavailable, onReadTimeout, and onWriteTimeout, all of which cap retries at 1. Motivated-by: CUSTOMER-331
8388d89 to
3158c9e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DefaultRetryPolicy.onRequestError()now retries on the next host only once (nbRetry == 0), then rethrows. Previously, it returnedtryNextHostunconditionally, allowing retries across all nodes in the query plan.The same fix is applied to
DowngradingConsistencyRetryPolicy.Motivation
Motivated by CUSTOMER-331.
In large clusters (e.g., 12 nodes), a single timed-out request could cascade through every node in the query plan. With a 6-second client timeout and 12 nodes, this means a single thread could be blocked for up to 72 seconds, leading to thread pool exhaustion (
RejectedExecutionException).Historical context
Before JAVA-819 (2015), the driver hardcoded retry-on-next-host for client timeouts without consulting the retry policy at all — it was simply
retry(false, null). WhenonRequestError()was introduced, its purpose was to let users opt out of retries for non-idempotent statements (viaIdempotenceAwareRetryPolicy). To preserve backward compatibility,DefaultRetryPolicy.onRequestError()replicated the old hardcoded behavior: alwaystryNextHost(cl), ignoringnbRetry.The original Javadoc explicitly acknowledged this was problematic:
What changed
onReadTimeoutonWriteTimeoutonUnavailableonRequestErrorRisks
This is a behavioral change. Users relying on the implicit "try all hosts on request error" behavior will now see
OperationTimedOutException(or otherDriverException) surface to the application after 1 retry instead of exhausting the entire query plan.Mitigation: Users who need more retries can implement a custom
RetryPolicywith a higher cap inonRequestError().Tests
Updated integration tests for both
DefaultRetryPolicyandDowngradingConsistencyRetryPolicy:onUnavailabletests.