Skip to content

fix: cap onRequestError retries at 1 in DefaultRetryPolicy#929

Draft
nikagra wants to merge 1 commit into
scylladb:scylla-3.xfrom
nikagra:mhradovich/3.x-cap-onRequestError-retries
Draft

fix: cap onRequestError retries at 1 in DefaultRetryPolicy#929
nikagra wants to merge 1 commit into
scylladb:scylla-3.xfrom
nikagra:mhradovich/3.x-cap-onRequestError-retries

Conversation

@nikagra

@nikagra nikagra commented Jun 18, 2026

Copy link
Copy Markdown

Summary

DefaultRetryPolicy.onRequestError() now retries on the next host only once (nbRetry == 0), then rethrows. Previously, it returned tryNextHost unconditionally, allowing retries across all nodes in the query plan.

The same fix is applied to DowngradingConsistencyRetryPolicy.

Motivation

Motivated by CUSTOMER-331.

In large clusters (e.g., 12 nodes), a single timed-out request could cascade through every node in the query plan. With a 6-second client timeout and 12 nodes, this means a single thread could be blocked for up to 72 seconds, leading to thread pool exhaustion (RejectedExecutionException).

Historical context

Before JAVA-819 (2015), the driver hardcoded retry-on-next-host for client timeouts without consulting the retry policy at all — it was simply retry(false, null). When onRequestError() was introduced, its purpose was to let users opt out of retries for non-idempotent statements (via IdempotenceAwareRetryPolicy). To preserve backward compatibility, DefaultRetryPolicy.onRequestError() replicated the old hardcoded behavior: always tryNextHost(cl), ignoring nbRetry.

The original Javadoc explicitly acknowledged this was problematic:

"For historical reasons, this implementation triggers a retry on the next host in the query plan with the same consistency level, regardless of the statement's idempotence. Note that this breaks the general rule stated in RetryPolicy#onRequestError: 'a retry should only be attempted if the request is known to be idempotent'."

What changed

Method Before After
onReadTimeout 1 retry max 1 retry max (unchanged)
onWriteTimeout 1 retry max 1 retry max (unchanged)
onUnavailable 1 retry max 1 retry max (unchanged)
onRequestError unbounded (try all hosts) 1 retry max

Risks

This is a behavioral change. Users relying on the implicit "try all hosts on request error" behavior will now see OperationTimedOutException (or other DriverException) surface to the application after 1 retry instead of exhausting the entire query plan.

Mitigation: Users who need more retries can implement a custom RetryPolicy with a higher cap in onRequestError().

Tests

Updated integration tests for both DefaultRetryPolicy and DowngradingConsistencyRetryPolicy:

  • Split previous "try all hosts" tests into "try next host on first error" (success on retry) and "rethrow on second error" (cap enforced) — matching the existing pattern used for onUnavailable tests.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cdbdb081-2adc-4cf0-9ac9-1b7a70595b71

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

DefaultRetryPolicy.onRequestError() now retries on the next host only
once (nbRetry == 0), then rethrows. Previously, it returned tryNextHost
unconditionally, allowing retries across all nodes in the query plan.

In large clusters, a single timed-out request could cascade through
every node. With a 6-second client timeout and 12 nodes, this means a
single thread could be blocked for up to 72 seconds, leading to thread
pool exhaustion (RejectedExecutionException).

The same fix is applied to DowngradingConsistencyRetryPolicy.

This makes onRequestError consistent with onUnavailable, onReadTimeout,
and onWriteTimeout, all of which cap retries at 1.

Motivated-by: CUSTOMER-331
@nikagra nikagra force-pushed the mhradovich/3.x-cap-onRequestError-retries branch from 8388d89 to 3158c9e Compare June 18, 2026 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants