Skip to content

Fix NPE in ChannelPool metrics methods when pool is uninitialized (CUSTOMER-413)#928

Open
nikagra wants to merge 1 commit into
scylla-4.xfrom
fix/customer-413-channel-pool-npe-metrics
Open

Fix NPE in ChannelPool metrics methods when pool is uninitialized (CUSTOMER-413)#928
nikagra wants to merge 1 commit into
scylla-4.xfrom
fix/customer-413-channel-pool-npe-metrics

Conversation

@nikagra

@nikagra nikagra commented Jun 17, 2026

Copy link
Copy Markdown

Summary

Fixes a NullPointerException reported by Accertify (CUSTOMER-413 / Zendesk #67222) when Dropwizard Metrics reporters scrape per-node gauge values while the driver is reconnecting to a node.

Affects: com.scylladb:java-driver-core (confirmed in 4.18.0.1, same code present in scylla-4.x).


Root Cause

ChannelPool has a ChannelSet[] channels field that is not initialized at construction — it starts as null:

// ChannelPool.java
@VisibleForTesting ChannelSet[] channels;  // null until initialize() runs

channels is only assigned inside SingleThreaded.initialize(), which is called only when the first channel connection succeeds:

private void initialize(DriverChannel c) {
    ...
    channels = new ChannelSet[shardsCount];   // assigned here
    ...
    initialized = true;                        // set AFTER channels
}

When the initial connection fails, makeInitialConnection() completes connectFuture immediately — with channels still null — and starts a reconnect in the background:

// on the failure branch of makeInitialConnection():
reconnection.start();
connectFuture.complete(ChannelPool.this);  // pool enters PoolManager.pools with channels == null

PoolManager.onPoolsInit() unconditionally puts every completed pool into the pools map (PoolManager.java:252). The Javadoc comment there even notes: "pool init always succeeds". This is intentional design — but it means a pool with channels == null can be in the map.

Why the existing null-check doesn't help

AbstractMetricUpdater guards against a missing pool entry, but not against an uninitialized pool:

// AbstractMetricUpdater.java
protected int inFlightRequests(Node node) {
    ChannelPool pool = context.getPoolManager().getPools().get(node);
    return (pool == null) ? 0 : pool.getInFlight();  // pool is non-null, but channels is null
}

The NPE

The three gauge methods — and size() — call Arrays.stream(channels) with no null guard:

public int getInFlight() {
    return Arrays.stream(channels).mapToInt(ChannelSet::getInFlight).sum();
    //                   ^^^^^^^^ NullPointerException when channels == null
}

Arrays.stream(null) throws NullPointerException. This is triggered by any Dropwizard reporter (JMX, Graphite, Prometheus bridge, etc.) that calls Gauge.getValue() during the window between a failed initial connection and a successful reconnect.

Full call chain

Metrics reporter  →  Gauge.getValue()
  DropwizardMetricUpdater.java:112   registry.gauge(..., () -> supplier::get)
  DropwizardNodeMetricUpdater.java:50  () -> inFlightRequests(node)
  AbstractMetricUpdater.java:141     pool = getPools().get(node)  [non-null, reconnecting]
  AbstractMetricUpdater.java:142     pool.getInFlight()
  ChannelPool.java:226               Arrays.stream(channels)  →  NPE  (channels == null)

Same path applies to getAvailableIds() and getOrphanedIds().


Fix

Mirror the guard already used by next(), which correctly handles this case via the initialized flag:

public DriverChannel next(...) {
    if (!singleThreaded.initialized) {  // ← existing safe pattern
        return null;
    }
    ...
}

Apply the same guard to all four affected methods:

public int size() {
    if (!singleThreaded.initialized) return 0;
    return Arrays.stream(channels).mapToInt(ChannelSet::size).sum();
}
// same for getAvailableIds(), getInFlight(), getOrphanedIds()

Why initialized and not channels != null

initialized is declared volatile (meaning the JMM guarantees that a thread reading initialized == true will also observe all writes that preceded the volatile write, including channels = new ChannelSet[...]). The channels field itself is not volatile, so checking it directly would be a weaker guarantee. Using initialized is both JMM-correct and consistent with the existing pattern in next().

Why size() is included

size() has the identical unguarded Arrays.stream(channels) pattern. While it is not currently called from any metrics path, it is a public method and would throw the same NPE if called before initialization.


Changed files

  • core/src/main/java/com/datastax/oss/driver/internal/core/pool/ChannelPool.java
    • size() — add initialized guard
    • getAvailableIds() — add initialized guard
    • getInFlight() — add initialized guard
    • getOrphanedIds() — add initialized guard

@nikagra nikagra requested review from dkropachev and mykaul June 17, 2026 14:14
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5d2c0290-fe02-42a1-b623-1d3fbb584e92

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

When the initial TCP connection to a node fails, ChannelPool.connectFuture
completes (so the pool is added to PoolManager.pools) while `channels` is
still null — initialize() is only called on the success path.

The four methods size(), getAvailableIds(), getInFlight(), and
getOrphanedIds() called Arrays.stream(channels) without guarding against
this, throwing NullPointerException whenever a Dropwizard Metrics reporter
(JMX, Graphite, etc.) scraped the gauge values during the reconnection
window.

Fix: mirror the pattern already used by next(), which checks
singleThreaded.initialized before touching channels. Using the volatile
initialized flag is correct by the JMM: the volatile write at line 365
(initialized = true) happens-after all channels writes, so any thread
that reads initialized == true is guaranteed to see channels as
fully populated.

Fixes: CUSTOMER-413
@nikagra nikagra force-pushed the fix/customer-413-channel-pool-npe-metrics branch from d6dc64c to 250cb9d Compare June 17, 2026 15:56
@nikagra nikagra marked this pull request as ready for review June 17, 2026 15:58
@nikagra nikagra requested a review from Copilot June 17, 2026 15:59

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race where ChannelPool metric accessor methods could throw NullPointerException when a pool instance is published before its channels array is initialized (e.g., after an initial connection failure while reconnecting). This aligns the metric accessors with the existing safety pattern already used by next().

Changes:

  • Guard size(), getAvailableIds(), getInFlight(), and getOrphanedIds() with singleThreaded.initialized to safely return 0 when uninitialized.
  • Add a regression test ensuring those accessors return 0 (and do not throw) when channels is still null after a failed initial connection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
core/src/main/java/com/datastax/oss/driver/internal/core/pool/ChannelPool.java Adds initialized guards to prevent NPEs when channels is still null.
core/src/test/java/com/datastax/oss/driver/internal/core/pool/ChannelPoolInitTest.java Adds a regression test covering metric accessor behavior for an uninitialized pool.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants