fix: treat TLS errors as failover-worthy and remove unsafe start-block fallback#102

Open
vietddude wants to merge 2 commits into
mainfrom
fix/tls-failover-startblock

@vietddude vietddude commented Jun 18, 2026

Collaborator

Problem

Production logs for bsc_mainnet-1 revealed two distinct bugs:

WRN Unknown provider error type ... error="...remote error: tls: internal error"
DBG Provider failed but not switching ... state=healthy consecutive_errors=0
...
WRN Cannot get latest block from chain or KV, using config.StartBlock startBlock=74588255
  1. TLS errors didn't trigger failover. The message "remote error: tls: internal error" matched no pattern in analyzeError(), so it was classified as generic_error (markUnhealthy=false). The provider stayed healthy and the indexer kept hammering the broken endpoint.
  2. Stale-block rewind risk. When both the chain RPC and KV failed, determineStartingBlock() fell straight back to config.StartBlock. Root cause: GetLatestBlock collapsed "key absent (first run)" and "store down" into the same error, so a transient outage at boot could silently rewind the indexer to an old block and re-index everything.

Changes

1. TLS errors are now failover-worthy (internal/rpc/failover.go)

  • Added a tls_error pattern (tls: internal error, tls handshake, handshake failure, remote error: tls) with markUnhealthy: true, 2m cooldown → the provider is cooled down and rotated away.

2. Distinguish "not found" from "store error" (pkg/store/blockstore/store.go)

  • GetLatestBlock returns (0, nil) for a missing key (kvstore.ErrKeyNotFound — Consul/Badger), and only propagates genuine store errors.

3. Drop the config.StartBlock fallback, anchor on chain head (internal/worker/regular.go)

  • Rewrote determineStartingBlock: if KV has a prior block → resume (+ queue the gap for catchup); cold start / store down → waitForChainHead() waits until the chain responds (honoring ctx).
  • getLatestBlockWithRetry does bounded retries on the resume-from-KV path.
  • No more rewinding to config.StartBlock (config field kept, now unused).

4. Cleanup / refactor (internal/worker/regular.go)

  • Extracted queueCatchupRanges, shared by determineStartingBlock and skipAheadIfLagging (removed duplication).
  • Split processRegularBlocks (~90 → ~45 lines) into processBatch + commitProgress.
  • Removed dead code checkContinuity (only referenced by a test).

Tests

  • internal/rpc/failover_test.go: TLS error blacklists the provider (tls_error).
  • internal/worker/regular_test.go: determineStartingBlock cases — store down + chain up, wait-for-head (chain fails transiently then recovers), cold start, ctx-cancel → 0, resume from KV.
  • pkg/store/blockstore/store_test.go: GetLatestBlock returns (0, nil) on missing key, propagates real store errors.

go build ./..., go vet, and all tests in the affected packages pass.

@vietddude vietddude requested a review from anhthii June 18, 2026 10:32
@vietddude vietddude force-pushed the fix/tls-failover-startblock branch from dbcc9a4 to 3101c43 Compare June 18, 2026 10:33