Allow CLI control of watchdogs and detect/handle stuck watchdog threads #206

@guillaume-byte

Description

Problem: Watchdog threads can hang or become unresponsive. Users currently have no reliable CLI commands to inspect or terminate a stuck watchdog.

Goals:

  • Allow users to manipulate watchdogs via the CLI: list watchdogs, query status, force-stop/kill a watchdog or its worker thread(s), restart watchdogs.
  • Detect stuck watchdogs by monitoring "alive" notifications, thread liveness, memory and CPU usage, and other heuristics.
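
The detection goal above could be sketched as a pure function that turns one snapshot into a list of warnings. This is only an illustration: `WatchdogSample`, `stuck_reasons`, the field names, and the default thresholds are all hypothetical, and how memory would be attributed to a single watchdog is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WatchdogSample:
    """Hypothetical snapshot of one watchdog (names are illustrative)."""
    last_alive: float        # monotonic time of the last "alive" notification
    thread_alive: bool       # e.g. the result of threading.Thread.is_alive()
    rss_bytes: int           # memory currently attributed to the watchdog
    baseline_rss_bytes: int  # memory recorded when the watchdog started


def stuck_reasons(
    sample: WatchdogSample,
    now: float,
    alive_timeout_s: float = 30.0,  # assumed default, should come from settings
    max_rss_growth: float = 2.0,    # assumed default, should come from settings
) -> List[str]:
    """Return human-readable reasons the watchdog looks stuck (empty if healthy)."""
    reasons: List[str] = []
    if not sample.thread_alive:
        reasons.append("thread is not alive")
    silence = now - sample.last_alive
    if silence > alive_timeout_s:
        reasons.append(f"no alive notification for {silence:.1f}s")
    if (
        sample.baseline_rss_bytes > 0
        and sample.rss_bytes > max_rss_growth * sample.baseline_rss_bytes
    ):
        reasons.append("memory grew beyond the configured factor")
    return reasons
```

Keeping the heuristics free of side effects would make them easy to unit test with synthetic samples, independent of real threads.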

Acceptance criteria:

  • CLI commands added: weightslab watchdog list, weightslab watchdog status <id>, weightslab watchdog kill <id|thread>, weightslab watchdog restart <id>.
  • Watchdog status includes last-alive timestamp, thread id(s), memory and CPU snapshot, and stack trace sample if stuck.
  • Implement heuristics for "stuck" detection (e.g., no alive event within a configurable timeout, excessive memory growth, thread not being scheduled) and surface the result as a warning in status output.
  • Unit and integration tests cover the detection logic and verify that CLI actions (kill/restart) do not crash the agent.
  • Telemetry/logging for watchdog actions and stuck-detection events.
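
As a rough illustration of the command layout listed above, here is a parser sketch using the standard-library `argparse`. Whether weightslab's CLI is built on `argparse` is an assumption, and `build_parser` and the argument names `id` and `target` are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the proposed `weightslab watchdog ...` commands."""
    parser = argparse.ArgumentParser(prog="weightslab")
    commands = parser.add_subparsers(dest="command", required=True)

    watchdog = commands.add_parser("watchdog", help="inspect and control watchdogs")
    actions = watchdog.add_subparsers(dest="action", required=True)

    actions.add_parser("list", help="list registered watchdogs")

    status = actions.add_parser("status", help="show status of one watchdog")
    status.add_argument("id")

    kill = actions.add_parser("kill", help="stop a watchdog or worker thread")
    # Accepts either a watchdog id or a thread identifier (hypothetical naming).
    kill.add_argument("target", metavar="ID_OR_THREAD")

    restart = actions.add_parser("restart", help="restart a watchdog")
    restart.add_argument("id")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Each action would then dispatch to the in-process control API rather than touching threads directly.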

Implementation notes / suggestions:

  • Add a watchdog manager component that exposes a control API (in-process) which the CLI can call, and which records last-alive timestamps per watchdog.
  • For thread termination, prefer cooperative shutdown where possible; provide a hard-kill fallback after a configurable grace period.
  • Consider capturing a thread stack trace snapshot when a watchdog is suspected to be stuck, to aid debugging.
  • Make stuck detection thresholds configurable via settings.
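
The notes above might translate into something like the following minimal sketch. `ManagedWatchdog` and `WatchdogManager` are hypothetical names, not existing weightslab classes. The sketch covers last-alive recording, cooperative shutdown with a grace period, and a stack snapshot; the hard-kill fallback is deliberately not shown.

```python
import sys
import threading
import time
import traceback
from typing import Callable, Dict, List


class ManagedWatchdog:
    """Hypothetical wrapper that runs `work` in a loop and records heartbeats."""

    def __init__(self, name: str, work: Callable[[], None], interval_s: float = 1.0):
        self.name = name
        self._work = work
        self._interval_s = interval_s
        self._stop = threading.Event()
        self.last_alive = time.monotonic()
        self.thread = threading.Thread(target=self._run, name=name, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self._work()
            self.last_alive = time.monotonic()  # "alive" notification
            self._stop.wait(self._interval_s)   # sleep, but wake early on stop

    def stop(self, grace_s: float) -> bool:
        """Request cooperative shutdown; True if the thread exited in time."""
        self._stop.set()
        self.thread.join(grace_s)
        return not self.thread.is_alive()

    def stack_snapshot(self) -> List[str]:
        """Format the thread's current stack, or return [] if it is not running."""
        frame = sys._current_frames().get(self.thread.ident)
        return traceback.format_stack(frame) if frame is not None else []


class WatchdogManager:
    """Hypothetical in-process control API that a CLI could call."""

    def __init__(self) -> None:
        self._watchdogs: Dict[str, ManagedWatchdog] = {}

    def register(self, wd: ManagedWatchdog) -> None:
        self._watchdogs[wd.name] = wd
        wd.thread.start()

    def status(self, name: str, alive_timeout_s: float = 30.0) -> dict:
        wd = self._watchdogs[name]
        silence = time.monotonic() - wd.last_alive
        stuck = wd.thread.is_alive() and silence > alive_timeout_s
        return {
            "name": name,
            "thread_id": wd.thread.ident,
            "thread_alive": wd.thread.is_alive(),
            "seconds_since_alive": silence,
            "stuck": stuck,
            "stack": wd.stack_snapshot() if stuck else [],
        }

    def kill(self, name: str, grace_s: float = 5.0) -> bool:
        return self._watchdogs[name].stop(grace_s)
```

If `kill` returns `False`, the thread ignored the cooperative request within the grace period, which is the point where a hard-kill fallback would take over.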

References: user request to allow CLI manipulation of watchdogs and to detect whether a watchdog is stuck via alive notifications or memory usage.

Metadata

Labels

bug, enhancement
