Webhooks is the Formance service responsible for delivering platform events (ledger, payments, …) to user-configured HTTP endpoints, with HMAC signing and automatic retries.
This document covers three things:
- Architecture deep-dive — how the service actually works today
- Known problems & limitations — a detailed inventory of the issues we are facing
- Improvement plan — a phased, prioritized roadmap to fix them
The binary has two run modes (a third retries mode mentioned in older docs no longer exists):
| Command | Role |
|---|---|
serve |
REST API for managing webhook configs (CRUD, activate/deactivate, secret rotation, test delivery). Can optionally embed the worker with --worker. |
worker |
Background service: consumes events from Kafka/NATS via Watermill, delivers webhooks, and runs the retry loop (Retrier). Exposes only a /_healthcheck endpoint. |
Both are wired with uber/fx dependency injection, share a PostgreSQL database (via bun), and are configured through cobra/pflag flags (see cmd/flag/flags.go).
┌─────────────────┐ ┌───────────────────────────┐ ┌──────────────────┐
│ Kafka / NATS │─────▶│ Worker │─────▶│ User Endpoints │
│ (event source) │ │ ┌───────────┐ ┌─────────┐ │ │ (webhook targets)│
└─────────────────┘ │ │ Consumer │ │ Retrier │ │ └──────────────────┘
│ │ (inline │ │ (DB │ │
│ │ send) │ │ poll) │ │
│ └─────┬─────┘ └────┬────┘ │
└───────┼────────────┼──────┘
▼ ▼
┌─────────────────┐ ┌───────────────────────────┐
│ API Clients │─────▶│ PostgreSQL │
│ (fctl, SDKs) │◀─────│ configs + attempts │
└─────────────────┘ └───────────────────────────┘
▲
│
┌────────┴─────────┐
│ Server │ REST API + OAuth2 auth + audit middleware
└──────────────────┘
Two tables only (pkg/config.go, pkg/attempt.go):
configs — a webhook subscription:
id(UUID, PK),endpoint(URL),event_types(text array),secret(base64, 24 random bytes),active(bool),name,created_at,updated_at
attempts — one row per delivery attempt (not per delivery):
id(UUID, PK),webhook_id(groups all attempts of one event×config pair)config(JSONB snapshot of the full config at send time, including the secret)payload(full event JSON as text),status_code,retry_attempt,status,next_retry_after
Statuses: success, to retry, retrying (claimed by a worker), failed (retry window exhausted).
Key design consequence: the current state of a delivery is smeared across multiple rows. Each retry inserts a new attempt row, and the previous retrying rows are bulk-updated to the new status (UpdateAttemptsStatus). There is no deliveries aggregate.
pkg/worker/module.go — processMessages:
- Watermill delivers a message; payload is unmarshalled into
publish.EventMessage. - Event type is normalized to
<app>.<type>lower-case. FindManyConfigs({event_types, active: true})finds matching configs.- For each config, sequentially and synchronously,
MakeAttemptPOSTs the payload to the endpoint (30s HTTP timeout, pkg/otlp/module.go). - The resulting attempt row (
successorto retry) is inserted. - The message is acked when the handler returns
nil; if persisting a retryable attempt fails, the handler returns an error → nack → broker redelivery. context.WithoutCanceldetaches the handler from shutdown cancellation so in-flight sends complete.
pkg/worker/worker.go — the Retrier loop, every --retry-period (default 3s):
- Stale recovery (≤ once/min): attempts stuck in
retrying> 5 min (crashed worker) are reset toto retry. - Atomic claim (pkg/storage/postgres/postgres.go
FindWebhookIDsToRetry): a CTE selects up to--retry-batch-size(default 50) distinctwebhook_ids whose oldest due attempt is ready, joinsconfigsonattempts.config->>'id'to keep only active configs, locks rows withFOR UPDATE SKIP LOCKED(multi-worker safe), and flips them toretrying. - Parallel processing via a bounded
pondpool: for each webhook, re-read theretryingattempts, verify retry caps before calling the endpoint, re-send withMakeAttemptatretry_attempt + 1, then atomically insert the new attempt and terminalize the claimed rows. If the new attempt is retryable, only the newly inserted row stays into retry. - Sleep, repeat.
Backoff is exponential without jitter (pkg/backoff/exponential.go): min-backoff-delay (1m) doubling up to max-backoff-delay (1h), aborting to failed once cumulative elapsed time exceeds --abort-after (default 10h) or --max-attempts is reached (default 15 attempts). With nominal backoff, 15 attempts are exhausted after about 9h03; abort-after is the elapsed-time backstop for Retry-After, downtime, and backlog delays.
pkg/security/security.go — Svix-style signing:
- Signed string:
{webhook_id}.{unix_timestamp}.{body}, HMAC-SHA256 with the config secret, sent asformance-webhook-signature: v1,<base64>. - Headers:
formance-webhook-id,-timestamp,-test,-idempotency-key(propagated from the event). - API auth: OAuth2 via
go-libs/authmiddleware; audit middleware publishes API calls to anaudit-eventstopic.
- E2E ginkgo suite (test/e2e) running against a real Postgres + NATS through the generated Speakeasy SDK (pkg/client, vendored with a
replacedirective). - Earthly for lint/tests, Nix flake for the dev shell, GoReleaser for releases.
Ordered by severity. References point at the code that produced the incidents; the Phase 1 fixes in this PR mitigate the retry-storm items called out below. The items marked (prod) have caused real production incidents — see §2.0.
This is the recurring, money-burning failure mode behind every webhooks incident we have had. The chain:
A dead or misconfigured customer endpoint (wrong credentials, decommissioned URL, expired tunnel) → every error is retried as if transient → attempts pile up in
to retry→ the retry loop re-sends + theattemptstable explodes → PostgreSQL/RDS load, cross-AZ replication traffic, and AWS cost blow up, drowning observability in noise.
Root cause 1 — no retryable/non-retryable distinction.
Mitigated in this PR by terminal 4xx classification.
Previously, MakeAttempt treated any non-2xx response as to retry (pkg/attempt.go:106-118). A 403 (bad credentials), 404 (gone), or 405 (wrong method) is permanent but was retried for the full --abort-after window anyway. There was no client-error short-circuit, no Retry-After honoring for 429.
Root cause 2 — no circuit breaker, no max-attempt cap; abort-after=30d is the only bound.
Partially mitigated in this PR by --max-attempts (default 15) and --abort-after=10h; the per-endpoint circuit breaker is still future work.
With the previous min-backoff=1m doubling to a max-backoff=1h cap, the delay reached 1h after ~7 retries (~2h cumulative), then fired once per hour for 30 days ≈ 718 more → ~724 attempts per dead endpoint. Production incidents confirmed that guardrails did not bound dead endpoints beyond the 30-day timer.
Root cause 3 — orphaned attempts loop and are never reclaimed or purged.
Mitigated in this PR by retention cleanup for deleted and inactive configs.
Deleting/deactivating a config used to leave to retry rows that the active-config join skipped but nothing ever transitioned to failed (§2.1). Combined with zero retention (§2.3), the table only grew.
Sanitized production evidence: retry storms were dominated by permanent endpoint failures, orphaned attempts stayed indefinitely in to retry, large attempts tables made the retry-claim query expensive, and manual cleanup materially reduced cross-AZ database replication traffic. Detailed incident evidence lives in internal runbooks.
Immediate priorities (detailed in the plan): (a) treat permanent 4xx as terminal, (b) add a per-endpoint circuit breaker + max-attempt cap, (c) implement retention/compaction of attempts, (d) split observability by service.name/route/external dependency so outbound delivery latency is not misdiagnosed as internal service latency.
P0 — Permanent errors retried as transient. (prod)
See §2.0 root cause 1. The fix is small and high-impact: classify 4xx (except 408/429) as failed immediately in MakeAttempt.
P0 — No circuit breaker / max-attempt cap. (prod) See §2.0 root cause 2. ~724 retries per dead endpoint is by design today.
P1 — Head-of-line blocking on the consumer hot path.
processMessages sends webhooks inline, sequentially, inside the broker handler (pkg/worker/module.go:126). One slow endpoint (up to the 30s timeout) stalls the whole topic/partition; with N matching configs a single message can hold the handler for N×30s. Consumer lag builds up under load and every customer is penalized by one bad endpoint. This is the single biggest scalability flaw.
P1 — Broker redelivery causes duplicate sends to healthy endpoints.
If InsertOneAttempt fails for one config's retryable attempt, the handler nacks the whole message (pkg/worker/module.go:143-151). On redelivery, the event is re-sent to all matching configs, including those that already received it successfully. There is no per-(event, config) deduplication (no unique constraint, msg.UUID is not used as an idempotency key in storage).
P1 — Crash window produces duplicate retries.
In processWebhookRetry, the new attempt is inserted before the claimed rows are flipped out of retrying (pkg/worker/worker.go:150-160), and the two writes are not in a transaction. A crash in between leaves old rows in retrying; stale recovery resets them to to retry 5 minutes later, and the webhook is sent again even though a fresh to retry attempt row already exists — duplicate deliveries and divergent attempt counters.
P2 — Deleted configs strand attempts forever.
The claim query filters on configs.active = true via a join. Deleting a config (DELETE /configs/{id}) leaves its to retry attempts orphaned: never claimable, never transitioned to failed, polluting the partial indexes indefinitely. Deactivating then reactivating a config also silently resumes old retries, which may be surprising.
P2 — Graceful shutdown is unreliable.
Retrier.Stop has a default: branch that gives up immediately if the loop is mid-batch ("no communication", pkg/worker/worker.go:75). The loop itself uses bare time.Sleep rather than a context-aware timer, and run() launches it with context.Background() (pkg/worker/module.go:50), so fx shutdown cannot cancel it. In-flight batches can be killed mid-write on pod termination, feeding the stale-recovery duplicate path above.
P3 — Lost audit rows on insert failure after success.
If the HTTP delivery succeeded but InsertOneAttempt fails, we continue and ack: delivery happened but no trace of it exists in the DB.
P1 — No SSRF protection.
endpoint is any URL that url.Parse accepts (pkg/config.go:50). The worker will happily POST signed payloads to http://169.254.169.254/, internal services, or localhost. No scheme allow-list (plain http accepted), no private/link-local IP deny-list, no DNS-rebinding protection, and the default http.Client follows redirects (a public endpoint can 302 to an internal one).
P1 — Secrets are over-exposed.
- Returned in clear text by every config API response (
ConfigembedsConfigUserwithjson:"secret"), includingGET /configsand the/testresponse (which embeds the full config snapshot in the attempt). - Snapshotted into every
attempts.configJSONB row — rotating a secret does not remove the old one from potentially millions of historical rows. - Stored in plain text in Postgres (no envelope encryption).
P2 — Unbounded response body read.
io.ReadAll(resp.Body) in MakeAttempt (pkg/attempt.go:91) reads whatever the remote endpoint returns, with no io.LimitReader. A malicious or buggy endpoint can OOM the worker. The body is only used for a debug log.
P0 — Unbounded attempts table → cross-AZ cost. (prod)
No retention, no archival, no partitioning. Every delivery stores the full payload + full config snapshot; success rows are kept forever. Large attempts tables drive cross-AZ RDS replication traffic, inflate backups, and degrade the retry-claim CTE (NOT EXISTS anti-join over a large to retry backlog). Retention is not optional, it is the proven highest-ROI fix.
P2 — Retry throughput ceiling.
Polling claims ≤ 50 webhook IDs per 3s tick per worker (~16/s). The claim query joins on the JSONB expression attempts.config->>'id' (no expression index) and re-reads attempts twice per webhook (FindAttemptsToRetryByWebhookID, then UpdateAttemptsStatus does another SELECT before its UPDATE). During a large outage of a popular endpoint, the backlog drains slowly and the herd retries without jitter, synchronizing load spikes onto the recovering endpoint.
P2 — GET /configs has no pagination.
FindManyConfigs selects all rows (pkg/storage/postgres/postgres.go:27); the handler wraps the full result in a fake Cursor (pkg/server/get.go). Same for the per-event-type lookup on the hot path — fine today, unbounded by design.
- No delivery visibility: there is no endpoint to list attempts/deliveries, inspect failures, or manually replay a webhook. Support has to query Postgres by hand.
- PUT
/configs/{id}on an unknown id returns 200:Store.UpdateOneConfignever checks existence, so theErrConfigNotFoundbranch in pkg/server/update.go:35 is dead code. It also doesn't bumpupdated_at. - Update replaces the secret silently:
PUT /configs/{id}regenerates a secret when none is provided (viaValidate()), breaking the consumer's signature verification without warning. - No secret rotation overlap:
/secret/changeis immediate; Svix/Stripe-style dual-secret grace periods are impossible, so rotation breaks verification until the consumer redeploys. - No wildcard or prefix event subscriptions (
ledger.*), no event-type catalog endpoint. - No automatic disabling of endpoints failing for days (Stripe/Svix do this) and no customer notification when a webhook is finally marked
failed.
- No metrics, no per-dependency split. (prod) Only traces + logs. No delivery success rate, retry-queue depth, end-to-end latency, or per-endpoint error rate → incidents are debugged with SQL (cf. the recent
skip locked/active configsfirefighting in #147, #150). Worse, outbound delivery latency is not attributed clearly enough and can be misdiagnosed as internal service latency. Dashboards must split byservice.name, route, and external dependency. - Health checks are no-ops: server
/_healthcheckreturns an empty 200 with no DB check; the worker healthcheck logshealth check OKat Info level on every probe (log spam). - Mixed dependency generations:
go-libs/v2everywhere withgo-libs/v5bolted on for audit only;logrus+ go-libs logging coexist;pkg/errorsis deprecated. - Package layout: the root
pkgpackage (webhooks) is a grab-bag of domain model + HTTP delivery + backoff contract; the generated Speakeasy SDK (~100 files) is vendored in-tree with areplace, inflating the module and the diff noise. - Handler style: deeply nested
if err == nil { … } else { … }blocks (pkg/server/test.go), repeated response-encoding boilerplate, nohttp.MaxBytesReaderon request bodies. - Docs drift: this README previously documented "3 starting modes" and stale defaults;
docs/is good but not linked from anywhere.
Phased so that each step ships independently and de-risks the next one. Within a phase, items are ordered by value/effort.
Goal: observe before changing anything.
- OTel metrics (S): counters/histograms for
webhooks_delivery_attempts_total{status,status_class},webhooks_delivery_duration_seconds,webhooks_retry_queue_depth(capped gauge forstatus='to retry'),webhooks_consumer_lag. Wire to the existing OTLP pipeline; build the Grafana dashboard + alerts (queue depth, failure rate). - Real health checks (S): DB ping on
/_healthcheck; drop the per-probe Info log in the worker handler. - Load test harness (M): a reproducible benchmark (k6 or Go) simulating one slow endpoint among N, to quantify the head-of-line problem and validate Phase 2.
These directly kill the §2.0 failure mode and the AWS cost bleed. Ship these first.
- Terminal 4xx classification (S): in
MakeAttempt, mark4xx(except408/429) asfailedimmediately instead ofto retry. HonorRetry-Afterfor429. Kills the bulk of every observed storm (403/404/405 were the top error codes). - Circuit breaker + max-attempt cap (M): per-config, after X consecutive failures stop claiming for a cooldown; hard-cap attempts (e.g. 15) independent of
--abort-after. Lower the default--abort-afterfrom 30d (720 retries) to something coherent with the attempt cap (default 10h). Auto-disable endpoints failing > N days and flag them. - Retention / compaction of
attempts(M): scheduled purge ofsuccess(30d) andfailed(90d) rows + reclaim orphanedto retryfor deleted/inactive configs →failed. This makes the proven manual cleanup path permanent and automatic. - Per-dependency observability (S): the Phase 0 metrics, split by
service.name/ route / outbound dependency so the next endpoint incident is caught from a dashboard, not later by infrastructure billing.
- SSRF guard (M): dedicated
http.Clientfor deliveries withCheckRedirectdisabled (or re-validated per hop), a dialer-level deny-list of private/link-local/loopback ranges (validate the resolved IP, not the hostname, to kill DNS rebinding), and anhttps-only policy with an env escape hatch for on-prem/dev. Validate at config creation and at send time. - Bound response reads (S):
io.LimitReader(resp.Body, 64<<10)inMakeAttempt; persist a truncated body excerpt on failure (useful for the future deliveries API). - Stop leaking secrets (M): redact
secretfrom list/get responses (keep it on create/rotate only — breaking change to coordinate with SDK/fctl); stop snapshotting the secret insideattempts.config(strip it before insert; the retry path re-reads it fromconfigs). - Add jitter to backoff (S): full jitter (
rand(0, delay)) to de-synchronize retry herds onto recovering endpoints. - Retry update optimization (M): make
UpdateAttemptsStatusa singleUPDATE … RETURNING(drop the pre-SELECT). - Fix graceful shutdown (M): replace
time.Sleepwithselect { case <-ctx.Done(); case <-ticker.C }, run the loop off the fx lifecycle context, remove thedefault:escape inStop, drain the pond pool with a deadline. - API correctness (S):
UpdateOneConfigchecks existence (404) and bumpsupdated_at; reject secret regeneration on PUT unless explicitly requested.
The structural fix for head-of-line blocking, duplicates, and the convoluted claim query. Target model:
events ──▶ consumer: INSERT deliveries (outbox) ──▶ ACK immediately
│
dispatcher pool (same claim pattern,
but on a one-row-per-delivery table)
│
HTTP send ──▶ append-only delivery_attempts
- Introduce a
deliveriestable (one row per event×config):id,config_id(FK),event_id,event_type,payload,status (pending|delivering|succeeded|failed|cancelled),attempt_count,next_attempt_at, timestamps. AddUNIQUE (event_id, config_id)→ idempotent broker redelivery for free. Keepdelivery_attemptsas an append-only log (id, delivery_id, status_code, error, duration, response excerpt). - Consumer becomes an outbox writer: on message receipt, insert
pendingdeliveries for all matching configs in one statement (ON CONFLICT DO NOTHING), ack. No HTTP on the hot path → consumer throughput decoupled from endpoint health; per-message work is one SELECT + one INSERT. - Dispatcher replaces the Retrier: same
FOR UPDATE SKIP LOCKEDclaim, but now a trivialWHERE status='pending' AND next_attempt_at <= now() ORDER BY next_attempt_at LIMIT $non an FK-indexed table — no JSONB join, noNOT EXISTSanti-join, no multi-row status smearing. First attempts and retries share one code path. - Per-endpoint fairness & circuit breaking: cap concurrent in-flight deliveries per config; after X consecutive failures open a circuit (skip claims for that config for a cooldown), and auto-disable + flag configs failing for > N days.
- Migration strategy: dual-write window — new code path behind a flag, backfill
to retryattempts intodeliveries, keep the legacy retrier reading the old table until drained, then drop the old claim machinery. The e2e suite plus the Phase 0 load test gate the cutover.
Phase 1 ships a retention job on the legacy attempts table; this phase makes it first-class on the new model.
- Retention/archival: purge (or archive to object storage)
succeededdeliveries after 30d andfailedafter 90d (configurable); attempts follow their delivery viaON DELETE CASCADE. - Payload cap (e.g. 1 MiB) validated at ingestion.
- Optional: monthly partitioning of
delivery_attemptsbycreated_atif volume warrants it (cheap drops instead ofDELETEchurn → less cross-AZ traffic).
- Deliveries API:
GET /deliveries?config_id=&status=&cursor=andGET /deliveries/{id}/attempts— gives support and customers visibility. - Manual replay:
POST /deliveries/{id}/retry(and bulk per config) — the #1 support request for any webhook system. - Real pagination on
GET /configs(bunpaginate is already a dependency). - Dual-secret rotation:
/secret/changekeeps the previous secret valid for a grace period; sign with the new, send both signatures space-separated (theVerifyhelper already iterates multiple signatures). - Wildcard event types (
ledger.*) +GET /event-typescatalog. - OpenAPI/SDK sync for all of the above.
- Migrate fully to
go-libs/v5; droplogrusandpkg/errors(usefmt.Errorf/errors.Join). - Restructure:
internal/{api,delivery,storage,signing}with a slim publicpkg/(model +securityverification helper, which customers may import); move the generated SDK out of the main module. - Flatten handler error handling (early returns), shared
respondJSON/respondErrorhelpers,http.MaxBytesReaderon request bodies. - Unit tests for the dispatcher state machine and backoff (table-driven), keeping ginkgo for e2e.
| Phase | Theme | Risk addressed | Effort |
|---|---|---|---|
| 0 | Metrics & health | Flying blind | ~1 sprint |
| 1 | Stop retry storms (terminal 4xx, circuit breaker, retention, per-dep metrics) | Retry storms, cross-AZ cost, undetected dead endpoints | ~1 sprint |
| 1b | Security/correctness quick wins | SSRF, secret leaks, dup retries, shutdown | 1–2 sprints |
| 2 | Outbox + deliveries model | Head-of-line blocking, duplicates, claim complexity | 2–3 sprints |
| 3 | Retention hardening | Unbounded growth (new model) | ~1 sprint |
| 4 | Deliveries API, replay, rotation | Supportability, product parity | 1–2 sprints |
| 5 | Code modernization | Maintenance drag | continuous |
If you do only one thing this quarter: Phase 1. Terminal-4xx + circuit breaker + automatic retention would have prevented the incident class described in §2.0 and the recurring DB I/O and cost spikes.
Run the linters:
earthly +lintRun the tests:
earthly -P +testsUsage:
webhooks [command]
Available Commands:
serve Run webhooks server (add --worker to embed the worker)
worker Run webhooks worker
version Get webhooks version
Key flags:
--listen string HTTP bind address (default ":8080")
--worker Enable worker inside the server process
--kafka-topics strings Topics to consume (default [default])
--retry-period duration Retry poll period (default 3s)
--retry-batch-size int Webhook IDs claimed per tick (default 50)
--min-backoff-delay duration Minimum backoff delay (default 1m)
--max-backoff-delay duration Maximum backoff delay (default 1h)
--abort-after duration Mark failed after retrying this long (default 10h)
--max-attempts int Hard cap on delivery attempts per webhook (default 15)
--retention-period duration Attempts-table cleanup interval (default 1h)
--retention-success-delay duration Retain success attempts before purging (default 720h)
--retention-failed-delay duration Retain failed attempts before purging (default 2160h)
--auto-migrate Run DB migrations on startup
--storage-postgres-conn-string string Postgres connection string
Further reading: docs/architecture.md, docs/retry-mechanism.md, docs/message-processing.md, docs/security.md.