Skip to content

feat: automated fresh node startup health check #4842

@walldiss

Description

@walldiss

Summary

We should have an automated tool that periodically spins up a fresh light node on each long-running network (mocha, arabica, mainnet) to verify that new nodes can successfully start, sync, and sample. This would catch startup regressions like the recent tail height overshoot (#4840) before users report them.

Motivation

In v0.29.1, light nodes on mocha failed to start because the syncer tail height estimation overshot the pruning window by ~3.8 hours. This was only discovered through manual testing. An automated canary would have caught this days earlier.

Proposed Behavior

  • Run on a schedule (e.g., daily or every few hours)
  • For each target network, start a fresh light node (clean datastore) and verify:
    • Successful connection to bootstrappers
    • Head header obtained
    • Tail header within the pruning window
    • Initial sync completes (e.g., first 100 headers)
    • DAS sampling begins
  • Report results to telemetry (OTLP metrics / Grafana dashboard)
  • Alert on failure (e.g., PagerDuty, Slack, or Grafana alerting)

Implementation Ideas

  • Could be a CI cron job, a standalone service, or a cel-shed subcommand
  • Could reuse the existing tastora Docker infrastructure
  • Metrics to export: startup latency, time-to-first-sample, bootstrapper reachability, sync speed

Related

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions