## Summary
We should have an automated tool that periodically spins up a fresh light node on each long-running network (mocha, arabica, mainnet) to verify that new nodes can successfully start, sync, and sample. This would catch startup regressions like the recent tail height overshoot (#4840) before users report them.
## Motivation
In v0.29.1, light nodes on mocha failed to start because the syncer tail height estimation overshot the pruning window by ~3.8 hours. This was only discovered through manual testing. An automated canary would have caught this days earlier.
## Proposed Behavior
- Run on a schedule (e.g., daily or every few hours)
- For each target network, start a fresh light node (clean datastore) and verify:
  - Successful connection to bootstrappers
  - Head header obtained
  - Tail header within the pruning window
  - Initial sync completes (e.g., first 100 headers)
  - DAS sampling begins
- Report results to telemetry (OTLP metrics / Grafana dashboard)
- Alert on failure (e.g., PagerDuty, Slack, or Grafana alerting)
## Implementation Ideas
- Could be a CI cron job, a standalone service, or a cel-shed subcommand
- Could reuse the existing tastora Docker infrastructure
- Metrics to export: startup latency, time-to-first-sample, bootstrapper reachability, sync speed
## Related

- #4840