Problem: The watchdog subsystem can become unresponsive or stuck inside threads. Users currently have no reliable CLI control to inspect or terminate stuck watchdogs.
Goals:
- Allow users to manipulate watchdogs via the CLI: list watchdogs, query status, force-stop/kill a watchdog or its worker thread(s), restart watchdogs.
- Detect stuck watchdogs by monitoring "alive" notifications, thread liveness, memory and CPU usage, and other heuristics.
Acceptance criteria:
- CLI commands added:
  - weightslab watchdog list
  - weightslab watchdog status <id>
  - weightslab watchdog kill <id|thread>
  - weightslab watchdog restart <id>
- Watchdog status includes last-alive timestamp, thread id(s), memory and CPU snapshot, and stack trace sample if stuck.
- Implement heuristics for "stuck" detection (e.g., no alive event within a configurable timeout, excessive memory growth, thread not being scheduled) and surface each hit as a warning in the status output.
- Unit and integration tests cover the detection logic and the CLI actions (kill/restart), and verify that these actions do not crash the agent.
- Telemetry/logging for watchdog actions and stuck-detection events.
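A minimal sketch of how the four commands above could be parsed, assuming the CLI is built on Python's standard argparse module (the issue does not say which CLI framework weightslab uses). The function name build_parser and the dest names command, action, and target are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `weightslab watchdog ...` argument parser."""
    parser = argparse.ArgumentParser(prog="weightslab")
    top = parser.add_subparsers(dest="command", required=True)

    watchdog = top.add_parser("watchdog", help="inspect and control watchdogs")
    actions = watchdog.add_subparsers(dest="action", required=True)

    # weightslab watchdog list
    actions.add_parser("list", help="list registered watchdogs")

    # weightslab watchdog status <id>
    status = actions.add_parser("status", help="show the status of one watchdog")
    status.add_argument("id")

    # weightslab watchdog kill <id|thread>
    kill = actions.add_parser("kill", help="stop a watchdog or a worker thread")
    kill.add_argument("target", metavar="id|thread")

    # weightslab watchdog restart <id>
    restart = actions.add_parser("restart", help="restart a watchdog")
    restart.add_argument("id")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["watchdog", "status", "wd-1"])
print(args.action, args.id)
```

Each parsed action would then be dispatched to the in-process control API; that dispatch is left out because the issue does not define the API yet.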
Implementation notes / suggestions:
- Add a watchdog manager component that exposes a control API (in-process) which the CLI can call, and which records last-alive timestamps per watchdog.
- For thread termination, prefer cooperative shutdown where possible; provide a hard-kill fallback after a configurable grace period.
- Consider capturing a stack trace snapshot of the worker thread when a watchdog is suspected to be stuck, to aid debugging.
- Make stuck detection thresholds configurable via settings.
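The notes above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed final design: the names WatchdogManager, WatchdogRecord, notify_alive, stuck_warnings, stack_snapshot, and stop, as well as the default thresholds, are all hypothetical. Workers are assumed to be Python threads that poll a threading.Event for cooperative shutdown and report their own memory usage with each alive notification.

```python
import sys
import threading
import time
import traceback
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class WatchdogRecord:
    """Hypothetical per-watchdog bookkeeping; field names are illustrative."""
    thread: threading.Thread
    stop_event: threading.Event
    last_alive: float
    memory_samples: List[int] = field(default_factory=list)  # bytes, oldest first


class WatchdogManager:
    def __init__(self, alive_timeout_s: float = 30.0,
                 max_memory_growth_bytes: int = 256 * 1024 * 1024,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # Thresholds are constructor arguments so they can come from settings.
        self.alive_timeout_s = alive_timeout_s
        self.max_memory_growth_bytes = max_memory_growth_bytes
        self._clock = clock
        self._lock = threading.Lock()
        self._records: Dict[str, WatchdogRecord] = {}

    def register(self, wid: str, thread: threading.Thread,
                 stop_event: threading.Event) -> None:
        with self._lock:
            self._records[wid] = WatchdogRecord(thread, stop_event, self._clock())

    def notify_alive(self, wid: str, memory_bytes: Optional[int] = None) -> None:
        """Record an alive notification and, optionally, a memory sample."""
        with self._lock:
            rec = self._records[wid]
            rec.last_alive = self._clock()
            if memory_bytes is not None:
                rec.memory_samples.append(memory_bytes)

    def stuck_warnings(self, wid: str) -> List[str]:
        """Return one warning per heuristic that currently fires."""
        with self._lock:
            rec = self._records[wid]
            silent_for = self._clock() - rec.last_alive
            samples = list(rec.memory_samples)
        warnings = []
        if not rec.thread.is_alive():
            warnings.append("worker thread is not alive")
        if silent_for > self.alive_timeout_s:
            warnings.append(f"no alive notification for {silent_for:.1f}s")
        if len(samples) >= 2 and samples[-1] - samples[0] > self.max_memory_growth_bytes:
            warnings.append("memory growth exceeds threshold")
        return warnings

    def stack_snapshot(self, wid: str) -> Optional[str]:
        """Format the worker's current stack, or None if it is not running."""
        rec = self._records[wid]
        frame = sys._current_frames().get(rec.thread.ident)
        return "".join(traceback.format_stack(frame)) if frame is not None else None

    def stop(self, wid: str, grace_s: float = 5.0) -> bool:
        """Request cooperative shutdown; return True if the thread exited in time."""
        rec = self._records[wid]
        rec.stop_event.set()
        rec.thread.join(timeout=grace_s)
        return not rec.thread.is_alive()


# Usage: a worker that simply waits for its stop event.
stop_flag = threading.Event()
worker = threading.Thread(target=stop_flag.wait, daemon=True)
worker.start()
manager = WatchdogManager(alive_timeout_s=30.0)
manager.register("wd-1", worker, stop_flag)
print(manager.stuck_warnings("wd-1"))
print(manager.stop("wd-1"))
```

When stop returns False, the grace period has expired and the hard-kill fallback would apply. CPython offers no supported way to forcibly terminate a thread, so that fallback would probably require running the worker in a separate process; the sketch deliberately stops short of it.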
References: user request to allow CLI manipulation of watchdogs and to detect a stuck watchdog via alive notifications or memory usage.