fix(p2p): replace unsynchronized dialing counter with atomic int32#423
fix(p2p): replace unsynchronized dialing counter with atomic int32#423amathxbt wants to merge 2 commits into
Conversation
andrewnguyen22
left a comment
There was a problem hiding this comment.
The finding is real dialing is accessed from multiple goroutines in p2p.go.
This is directionally correct - but the PR is incomplete:
-
It still reads p.PeerSet.outbound directly without the PeerSet lock at
p2p/p2p.go:221, while outbound is written under peer-set mutation paths. That leaves a data race. -
The initial configured DialPeers goroutines increment dialing but never decrement it after DialWithBackoff returns. The PR preserves that bug, just with atomics. That can make the later outbound loop permanently think dials are still in flight.
…ound under RLock Two data-race issues in DialForOutboundPeers(): 1. The initial DialPeers goroutines incremented the atomic dialing counter but never decremented it after DialWithBackoff returned. The outbound capacity loop at the bottom of the function would permanently see a non-zero in-flight count for each configured dial peer, blocking new outbound connections forever once the initial dials completed. Fix: add atomic.AddInt32(&dialing, -1) at the end of each goroutine, mirroring the defer decrement already used in the outbound loop. 2. p.PeerSet.outbound was read at line 222 without holding the PeerSet mux, while Add/Remove paths write outbound under that lock on other goroutines. This is a data race. Fix: wrap the read in p.PeerSet.mux.RLock()/RUnlock().
|
Thanks for the thorough review @andrewnguyen22 — both points are valid. Updated the branch with:
|
Bug
In
p2p/p2p.go,DialForOutboundPeers()tracks in-flight dial attempts with a plainint:The counter is incremented inside spawned goroutines while the main loop reads it concurrently — with no mutex or atomic protection. This is a data race confirmed by
go test -race.Result: the node can significantly exceed
MaxOutboundpeers, opening unnecessary connections, wasting bandwidth, and destabilising the network layer.Fix
Replace
dialing intwithvar dialing int32and useatomic.AddInt32/atomic.LoadInt32at every read and write site.Impact
Critical — data race in the P2P layer causes unbounded outbound connections.