fix: cap onRequestError retries at 1 in DefaultRetryPolicy by nikagra · Pull Request #929 · scylladb/java-driver

nikagra · 2026-06-18T11:58:53Z

Summary

DefaultRetryPolicy.onRequestError() now retries on the next host only once (nbRetry == 0), then rethrows. Previously, it returned tryNextHost unconditionally, allowing retries across all nodes in the query plan.

The same fix is applied to DowngradingConsistencyRetryPolicy.

Motivation

Motivated by CUSTOMER-331.

In large clusters (e.g., 12 nodes), a single timed-out request could cascade through every node in the query plan. With a 6-second client timeout and 12 nodes, this means a single thread could be blocked for up to 72 seconds, leading to thread pool exhaustion (RejectedExecutionException).

Historical context

Before JAVA-819 (2015), the driver hardcoded retry-on-next-host for client timeouts without consulting the retry policy at all — it was simply retry(false, null). When onRequestError() was introduced, its purpose was to let users opt out of retries for non-idempotent statements (via IdempotenceAwareRetryPolicy). To preserve backward compatibility, DefaultRetryPolicy.onRequestError() replicated the old hardcoded behavior: always tryNextHost(cl), ignoring nbRetry.

The original Javadoc explicitly acknowledged this was problematic:

"For historical reasons, this implementation triggers a retry on the next host in the query plan with the same consistency level, regardless of the statement's idempotence. Note that this breaks the general rule stated in RetryPolicy#onRequestError: 'a retry should only be attempted if the request is known to be idempotent'."

What changed

Method	Before	After
`onReadTimeout`	1 retry max	1 retry max (unchanged)
`onWriteTimeout`	1 retry max	1 retry max (unchanged)
`onUnavailable`	1 retry max	1 retry max (unchanged)
`onRequestError`	unbounded (try all hosts)	1 retry max

Risks

This is a behavioral change. Users relying on the implicit "try all hosts on request error" behavior will now see OperationTimedOutException (or other DriverException) surface to the application after 1 retry instead of exhausting the entire query plan.

Mitigation: Users who need more retries can implement a custom RetryPolicy with a higher cap in onRequestError().

Tests

Updated integration tests for both DefaultRetryPolicy and DowngradingConsistencyRetryPolicy:

Split previous "try all hosts" tests into "try next host on first error" (success on retry) and "rethrow on second error" (cap enforced) — matching the existing pattern used for onUnavailable tests.

coderabbitai · 2026-06-18T11:59:02Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: cdbdb081-2adc-4cf0-9ac9-1b7a70595b71

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

DefaultRetryPolicy.onRequestError() now retries on the next host only once (nbRetry == 0), then rethrows. Previously, it returned tryNextHost unconditionally, allowing retries across all nodes in the query plan. In large clusters, a single timed-out request could cascade through every node. With a 6-second client timeout and 12 nodes, this means a single thread could be blocked for up to 72 seconds, leading to thread pool exhaustion (RejectedExecutionException). The same fix is applied to DowngradingConsistencyRetryPolicy. This makes onRequestError consistent with onUnavailable, onReadTimeout, and onWriteTimeout, all of which cap retries at 1. Motivated-by: CUSTOMER-331

nikagra requested review from dkropachev and roydahan June 18, 2026 12:06

nikagra assigned AdamStawarz Jun 18, 2026

nikagra force-pushed the mhradovich/3.x-cap-onRequestError-retries branch from 8388d89 to 3158c9e Compare June 18, 2026 12:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: cap onRequestError retries at 1 in DefaultRetryPolicy#929

fix: cap onRequestError retries at 1 in DefaultRetryPolicy#929
nikagra wants to merge 1 commit into
scylladb:scylla-3.xfrom
nikagra:mhradovich/3.x-cap-onRequestError-retries

nikagra commented Jun 18, 2026 •

edited by atlassian Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nikagra commented Jun 18, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Historical context

What changed

Risks

Tests

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nikagra commented Jun 18, 2026 •

edited by atlassian Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading