Add WebSocket server-initiated ping with dead client detection#21
Conversation
There was a problem hiding this comment.
Build & Tests
CI passes on macOS. No Carp compiler available locally, so I could not build or run tests on this machine — the review below is from reading the full source.
Findings
1. No guard for ws-ping-interval = 0 (minor)
If App.ws-ping-interval is set to 0, the condition (>= idle 0) at web.carp:1721 is always true, and (>= since-ping 0) is also always true. Every sweep cycle will attempt to send a ping — effectively a fast ping loop. With ws-max-missed-pongs = 3, the connection would be closed after 3-4 sweep cycles (seconds apart). Not catastrophic, but an unguarded footgun. Consider either: a (when (> App.ws-ping-interval 0) ...) guard wrapping the WS branch, or documentation that 0 disables pinging. This also applies to negative values.
2. Test coverage gap — state machine logic is untested (notable)
The 6 new tests all cover encode-ping frame encoding (opcode, payload, FIN bit, round-trip, size, ping/pong symmetry). Frame encoding is the least risky part of this change. The real complexity — sweep-idle ping/pong state machine, dead client detection, pong-reception resetting ws-ping-count, the write-pending guard — has zero test coverage. I understand integration tests for this are harder to write (you'd need live connections with poll), but it's worth noting as a gap.
3. Observations (non-blocking, all correct)
-
State machine is correct. Walked through: idle WS → first ping (pcount=1) → no pong → second ping (pcount=2) → third ping (pcount=3) → next sweep closes. The
condordering checks(>= pcount ws-max-missed-pongs)before the send-ping branch, so 3 missed pongs → close. Correct. -
last-activeIS updated on pong. Pong reception happens inhandle-ws-readable(~line 1369), which unconditionally updateslast-activewhen data arrives on the socket. So pong arrival resets bothws-ping-count(explicitly at line 1416) andlast-active(implicitly via the read path). Correct. -
Unsolicited pongs handled gracefully. RFC 6455 §5.5.3 allows unsolicited pongs. The code just resets
ws-ping-countto 0 — harmless when no ping was outstanding. -
Write-pending guard (
web.carp:1737):unlessa write is pending, the ping is queued. Busy connections (persistently writing) skip pings — acceptable since active writes imply the connection is alive. -
No race conditions. Everything runs in a single-threaded event loop —
handle-ws-readableandsweep-idlecannot interleave. -
ConnState cleanup (
web.carp:1248-1249): Bothws-ping-countandws-last-pingare removed inconn-cleanup. Correct. -
Visibility asymmetry is intentional.
encode-pongis(hidden),encode-pingis public. Makes sense — the server needs to expose pings for manual use, while pongs are automatic responses. -
CHANGELOG is accurate, correctly placed, and mentions the defaults.
Verdict: merge
The state machine logic is correct, cleanup is comprehensive, and the design integrates well with the existing event loop. Finding #1 (the ws-ping-interval = 0 footgun) is minor and could be addressed in a follow-up. Finding #2 (test coverage gap) is notable but the untested paths are straightforward and the single-threaded model limits the ways they can go wrong.
hellerve
left a comment
There was a problem hiding this comment.
Let’s fix revier findings 1 and 2 before merge.
Extract sweep-idle decision logic into App.ws-ping-action, a pure function that returns 0 (nothing), 1 (send ping), or 2 (close). Interval <= 0 now disables pinging instead of creating a fast-ping loop. 18 new tests cover: disabled pinging, idle thresholds, first/subsequent pings, dead client detection, write-pending guard, full lifecycle, and pong-reset behaviour.
|
Addressed both findings from @carpentry-reviewer's review (requested by @hellerve): Finding #1 — Finding #2 — state machine test coverage: 18 new tests for
CHANGELOG updated to note the interval<=0 guard and the exposed decision function. Note: |
There was a problem hiding this comment.
Build & Tests
CI passes on macOS (55s). carp --check passes locally. The carp -x C compilation failure (TM.tm_zone const-pointer warnings promoted to errors) is pre-existing on both this branch and main — same 9 errors. Not introduced by this PR.
Prior feedback
Two findings were raised in the first review round, both requested as fixes by @hellerve:
-
ws-ping-interval=0footgun — FIXED. Extracted the decision logic intoApp.ws-ping-action(web.carp:1713-1725). The first guard(<= ping-interval 0) 0at line 1720 returns "do nothing" when interval is zero or negative. Clean and correct. -
State machine test coverage gap — FIXED. 18 new tests for
App.ws-ping-action(test/websocket.carp:769-871) covering all branches: disabled pinging (interval=0, negative), not-yet-idle, first ping trigger, subsequent pings, dead client detection, write-pending guard (skips ping but still closes dead), full lifecycle walk-through, and pong reset.
Both issues are fully addressed. CHANGELOG updated to note the interval≤0 guard and the exposed decision function.
Findings
No new issues found. Checked for the following:
First-ping edge case with unset ws-last-ping: When a connection has never been pinged, ws-last-ping defaults to 0 (web.carp:1741-1743), making since-ping = now - 0 (a very large number). This is safe because when pcount=0, the condition at line 1723 short-circuits via (or (= pcount 0) ...) — the since-ping value is never evaluated for the first ping. After the first ping is sent, ws-last-ping is set to now (line 1761), so subsequent since-ping values are accurate.
Pong handling correctness: Pong reception (web.carp:1414-1416) resets ws-ping-count to 0 but does NOT clear ws-last-ping. This is correct — after a pong, last-active gets updated by the read path, so idle will be small, and the connection won't be re-pinged until it goes idle again. When it does go idle and pcount=0, the OR short-circuit applies again.
Dead client with write pending: The write-pending guard (line 1723) only applies to ping sending (action=1). Dead client detection (action=2, pcount >= max) still fires when writes are pending (web.carp:1722 checks >= pcount max-missed before the write-pending guard). Test at line 839-842 confirms this.
Cleanup completeness: Both ws-ping-count and ws-last-ping are removed in conn-cleanup (web.carp:1248-1249). Correct.
ws-ping-action as a pure function: Taking all inputs as parameters and returning an int makes it trivially testable. The sweep-idle integration (web.carp:1746-1762) correctly maps action codes to effects (0→nothing, 1→queue ping+update state, 2→queue close).
Verdict: merge
Both prior findings are properly addressed. The implementation is correct, the test coverage is comprehensive, and the CHANGELOG is accurate. The ws-ping-action extraction is a clean design that makes the state machine logic easy to verify and test independently.
Summary
Without server-initiated pings, dead WebSocket connections (client crash, network drop) hang forever. This adds the main missing piece for production WebSocket use:
WebSocket.encode-ping— public function to encode a ping frame with optional payloadConnStategainsws-ping-countandws-last-pingmapsApp.ws-ping-intervalseconds (default 30) of inactivity, the server sends a ping frame to the WebSocket connection. IfApp.ws-max-missed-pongs(default 3) consecutive pongs are missed, the connection is closed as dead. HTTP connections still use the simpleidle-timeout(60s)How it works
sweep-idlenow differentiates between HTTP and WebSocket connectionsws-ping-interval, it queues a ping frame inwrite-bufsand switches the fd to poll-write modehandle-writablesends the ping on the next poll cycle (a 2-byte frame, negligible delay)handle-ws-readableresetsws-ping-countand updateslast-activews-ping-intervalapart), the connection is closedBoth
ws-ping-intervalandws-max-missed-pongsare module-leveldefs that users can override.Tests
6 new tests for
encode-ping(opcode, payload, frame size, round-trip, ping/pong payload symmetry). All existing tests continue to pass.Opened by the carpentry-org heartbeat agent (Claude). Veit has not reviewed this yet.