Skip to content

fix(p2p): replace unsynchronized dialing counter with atomic int32#423

Open
amathxbt wants to merge 2 commits into
canopy-network:mainfrom
amathxbt:fix/p2p-dialing-counter-race
Open

fix(p2p): replace unsynchronized dialing counter with atomic int32#423
amathxbt wants to merge 2 commits into
canopy-network:mainfrom
amathxbt:fix/p2p-dialing-counter-race

Conversation

@amathxbt

Copy link
Copy Markdown

Bug

In p2p/p2p.go, DialForOutboundPeers() tracks in-flight dial attempts with a plain int:

dialing := 0
// ...
go func() {
    dialing++  // ← written from goroutine, read from main loop — data race
    p.DialWithBackoff(peerAddress, true)
}()

The counter is incremented inside spawned goroutines while the main loop reads it concurrently — with no mutex or atomic protection. This is a data race confirmed by go test -race.

Result: the node can significantly exceed MaxOutbound peers, opening unnecessary connections, wasting bandwidth, and destabilising the network layer.

Fix

Replace dialing int with var dialing int32 and use atomic.AddInt32 / atomic.LoadInt32 at every read and write site.

Impact

Critical — data race in the P2P layer causes unbounded outbound connections.

@andrewnguyen22 andrewnguyen22 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The finding is real dialing is accessed from multiple goroutines in p2p.go.

This is directionally correct - but the PR is incomplete:

  1. It still reads p.PeerSet.outbound directly without the PeerSet lock at p2p/p2p.go:221, while outbound is written under peer-set mutation paths. That leaves a data race.

  2. The initial configured DialPeers goroutines increment dialing but never decrement it after DialWithBackoff returns. The PR preserves that bug, just with atomics. That can make the later outbound loop permanently think dials are still in flight.

…ound under RLock

Two data-race issues in DialForOutboundPeers():

1. The initial DialPeers goroutines incremented the atomic dialing counter
   but never decremented it after DialWithBackoff returned. The outbound
   capacity loop at the bottom of the function would permanently see a
   non-zero in-flight count for each configured dial peer, blocking new
   outbound connections forever once the initial dials completed.

   Fix: add atomic.AddInt32(&dialing, -1) at the end of each goroutine,
   mirroring the defer decrement already used in the outbound loop.

2. p.PeerSet.outbound was read at line 222 without holding the PeerSet
   mux, while Add/Remove paths write outbound under that lock on other
   goroutines. This is a data race.

   Fix: wrap the read in p.PeerSet.mux.RLock()/RUnlock().
@amathxbt

Copy link
Copy Markdown
Author

Thanks for the thorough review @andrewnguyen22 — both points are valid. Updated the branch with:

  1. Missing dialing decrement — added atomic.AddInt32(&dialing, -1) at the end of each initial DialPeers goroutine, mirroring the defer decrement already used in the outbound capacity loop. Without this the counter stays permanently elevated after the configured peers connect, blocking all future outbound dials.

  2. p.PeerSet.outbound read without lock — wrapped the read at line 222 in p.PeerSet.mux.RLock() / RUnlock(), consistent with how outbound is guarded elsewhere in the peer-set paths.

@amathxbt amathxbt requested a review from andrewnguyen22 June 28, 2026 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants