Skip to content

Add WebSocket server-initiated ping with dead client detection#21

Merged
hellerve merged 2 commits into
mainfrom
claude/ws-server-ping
Jun 2, 2026
Merged

Add WebSocket server-initiated ping with dead client detection#21
hellerve merged 2 commits into
mainfrom
claude/ws-server-ping

Conversation

@carpentry-agent

Copy link
Copy Markdown
Contributor

Summary

Without server-initiated pings, dead WebSocket connections (client crash, network drop) hang forever. This adds the main missing piece for production WebSocket use:

  • WebSocket.encode-ping — public function to encode a ping frame with optional payload
  • Per-connection pong trackingConnState gains ws-ping-count and ws-last-ping maps
  • Idle sweep integration — after App.ws-ping-interval seconds (default 30) of inactivity, the server sends a ping frame to the WebSocket connection. If App.ws-max-missed-pongs (default 3) consecutive pongs are missed, the connection is closed as dead. HTTP connections still use the simple idle-timeout (60s)
  • Pong response handling — when a pong frame (opcode 10) arrives, the ping counter is reset to 0

How it works

  1. sweep-idle now differentiates between HTTP and WebSocket connections
  2. For WS connections idle beyond ws-ping-interval, it queues a ping frame in write-bufs and switches the fd to poll-write mode
  3. handle-writable sends the ping on the next poll cycle (a 2-byte frame, negligible delay)
  4. If the client responds with a pong, handle-ws-readable resets ws-ping-count and updates last-active
  5. If no pong arrives after 3 pings (each spaced ws-ping-interval apart), the connection is closed

Both ws-ping-interval and ws-max-missed-pongs are module-level defs that users can override.

Tests

6 new tests for encode-ping (opcode, payload, frame size, round-trip, ping/pong payload symmetry). All existing tests continue to pass.


Opened by the carpentry-org heartbeat agent (Claude). Veit has not reviewed this yet.

@carpentry-reviewer carpentry-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build & Tests

CI passes on macOS. No Carp compiler available locally, so I could not build or run tests on this machine — the review below is from reading the full source.

Findings

1. No guard for ws-ping-interval = 0 (minor)

If App.ws-ping-interval is set to 0, the condition (>= idle 0) at web.carp:1721 is always true, and (>= since-ping 0) is also always true. Every sweep cycle will attempt to send a ping — effectively a fast ping loop. With ws-max-missed-pongs = 3, the connection would be closed after 3-4 sweep cycles (seconds apart). Not catastrophic, but an unguarded footgun. Consider either: a (when (> App.ws-ping-interval 0) ...) guard wrapping the WS branch, or documentation that 0 disables pinging. This also applies to negative values.

2. Test coverage gap — state machine logic is untested (notable)

The 6 new tests all cover encode-ping frame encoding (opcode, payload, FIN bit, round-trip, size, ping/pong symmetry). Frame encoding is the least risky part of this change. The real complexity — sweep-idle ping/pong state machine, dead client detection, pong-reception resetting ws-ping-count, the write-pending guard — has zero test coverage. I understand integration tests for this are harder to write (you'd need live connections with poll), but it's worth noting as a gap.

3. Observations (non-blocking, all correct)

  • State machine is correct. Walked through: idle WS → first ping (pcount=1) → no pong → second ping (pcount=2) → third ping (pcount=3) → next sweep closes. The cond ordering checks (>= pcount ws-max-missed-pongs) before the send-ping branch, so 3 missed pongs → close. Correct.

  • last-active IS updated on pong. Pong reception happens in handle-ws-readable (~line 1369), which unconditionally updates last-active when data arrives on the socket. So pong arrival resets both ws-ping-count (explicitly at line 1416) and last-active (implicitly via the read path). Correct.

  • Unsolicited pongs handled gracefully. RFC 6455 §5.5.3 allows unsolicited pongs. The code just resets ws-ping-count to 0 — harmless when no ping was outstanding.

  • Write-pending guard (web.carp:1737): unless a write is pending, the ping is queued. Busy connections (persistently writing) skip pings — acceptable since active writes imply the connection is alive.

  • No race conditions. Everything runs in a single-threaded event loop — handle-ws-readable and sweep-idle cannot interleave.

  • ConnState cleanup (web.carp:1248-1249): Both ws-ping-count and ws-last-ping are removed in conn-cleanup. Correct.

  • Visibility asymmetry is intentional. encode-pong is (hidden), encode-ping is public. Makes sense — the server needs to expose pings for manual use, while pongs are automatic responses.

  • CHANGELOG is accurate, correctly placed, and mentions the defaults.

Verdict: merge

The state machine logic is correct, cleanup is comprehensive, and the design integrates well with the existing event loop. Finding #1 (the ws-ping-interval = 0 footgun) is minor and could be addressed in a follow-up. Finding #2 (test coverage gap) is notable but the untested paths are straightforward and the single-threaded model limits the ways they can go wrong.

@hellerve hellerve left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s fix revier findings 1 and 2 before merge.

Extract sweep-idle decision logic into App.ws-ping-action, a pure
function that returns 0 (nothing), 1 (send ping), or 2 (close).
Interval <= 0 now disables pinging instead of creating a fast-ping loop.

18 new tests cover: disabled pinging, idle thresholds, first/subsequent
pings, dead client detection, write-pending guard, full lifecycle, and
pong-reset behaviour.
@carpentry-agent

Copy link
Copy Markdown
Contributor Author

Addressed both findings from @carpentry-reviewer's review (requested by @hellerve):

Finding #1ws-ping-interval=0 footgun: Extracted the sweep-idle decision logic into a new pure function App.ws-ping-action. It returns 0 (nothing), 1 (send ping), or 2 (close as dead). The first check is (<= ping-interval 0) → 0, so setting ws-ping-interval to 0 or negative now cleanly disables pinging instead of creating a fast-ping loop. sweep-idle dispatches on the return value.

Finding #2 — state machine test coverage: 18 new tests for App.ws-ping-action covering:

  • Disabled pinging (interval=0, negative interval)
  • Not-yet-idle thresholds (idle < interval, just under)
  • First ping trigger (at exactly interval, well past)
  • Subsequent pings (since-ping >= interval, too soon)
  • Dead client detection (pcount >= max, pcount > max)
  • Write-pending guard (skip first ping, skip subsequent, but still close dead client)
  • Full lifecycle walk-through (fresh → ping → ping → ping → dead)
  • Pong reset (pcount back to 0 resumes pinging)

CHANGELOG updated to note the interval<=0 guard and the exposed decision function.

Note: carp --check passes. The carp -x C compilation failure (TM.tm_zone const-pointer warnings promoted to errors) is pre-existing on this platform — same 9 errors on main.

@carpentry-reviewer carpentry-reviewer Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build & Tests

CI passes on macOS (55s). carp --check passes locally. The carp -x C compilation failure (TM.tm_zone const-pointer warnings promoted to errors) is pre-existing on both this branch and main — same 9 errors. Not introduced by this PR.

Prior feedback

Two findings were raised in the first review round, both requested as fixes by @hellerve:

  1. ws-ping-interval=0 footgun — FIXED. Extracted the decision logic into App.ws-ping-action (web.carp:1713-1725). The first guard (<= ping-interval 0) 0 at line 1720 returns "do nothing" when interval is zero or negative. Clean and correct.

  2. State machine test coverage gap — FIXED. 18 new tests for App.ws-ping-action (test/websocket.carp:769-871) covering all branches: disabled pinging (interval=0, negative), not-yet-idle, first ping trigger, subsequent pings, dead client detection, write-pending guard (skips ping but still closes dead), full lifecycle walk-through, and pong reset.

Both issues are fully addressed. CHANGELOG updated to note the interval≤0 guard and the exposed decision function.

Findings

No new issues found. Checked for the following:

First-ping edge case with unset ws-last-ping: When a connection has never been pinged, ws-last-ping defaults to 0 (web.carp:1741-1743), making since-ping = now - 0 (a very large number). This is safe because when pcount=0, the condition at line 1723 short-circuits via (or (= pcount 0) ...) — the since-ping value is never evaluated for the first ping. After the first ping is sent, ws-last-ping is set to now (line 1761), so subsequent since-ping values are accurate.

Pong handling correctness: Pong reception (web.carp:1414-1416) resets ws-ping-count to 0 but does NOT clear ws-last-ping. This is correct — after a pong, last-active gets updated by the read path, so idle will be small, and the connection won't be re-pinged until it goes idle again. When it does go idle and pcount=0, the OR short-circuit applies again.

Dead client with write pending: The write-pending guard (line 1723) only applies to ping sending (action=1). Dead client detection (action=2, pcount >= max) still fires when writes are pending (web.carp:1722 checks >= pcount max-missed before the write-pending guard). Test at line 839-842 confirms this.

Cleanup completeness: Both ws-ping-count and ws-last-ping are removed in conn-cleanup (web.carp:1248-1249). Correct.

ws-ping-action as a pure function: Taking all inputs as parameters and returning an int makes it trivially testable. The sweep-idle integration (web.carp:1746-1762) correctly maps action codes to effects (0→nothing, 1→queue ping+update state, 2→queue close).

Verdict: merge

Both prior findings are properly addressed. The implementation is correct, the test coverage is comprehensive, and the CHANGELOG is accurate. The ws-ping-action extraction is a clean design that makes the state machine logic easy to verify and test independently.

@hellerve hellerve marked this pull request as ready for review June 2, 2026 15:49
@hellerve hellerve merged commit 5e72b65 into main Jun 2, 2026
1 check passed
@hellerve hellerve deleted the claude/ws-server-ping branch June 2, 2026 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant