Patroni + etcd + HAProxy = 847 config lines. pgbattery = 1 binary (~14 MB).
MongoDB-style failover for PostgreSQL. Idle connections migrate to the new leader in <100 ms with no client reconnect. A COMMIT caught mid-failover is resolved as committed-or-error; other in-flight statements get a retryable error.
Kyle Kingsbury, Jepsen author and distributed-systems correctness researcher:
Q: What's your favorite relational database?
A: Postgres. Fantastic. Love it. Wish it had a good replication story.
From HN discussion and PostgreSQL mailing list:
"I still don't get how folks can hype Postgres with every second post on HN, yet there is no simple batteries-included way to run a HA Postgres cluster with automatic failover like you can do with MongoDB."
-- heipei
"In the SQL world, people are used to accepting the absence of real HA (resilience to failure, where transactions continue without interruption) and instead rely on fast DR (stop the service, recover, check for data loss, start the service). Yet they still call it HA because there's nothing else."
-- Franck Pachot (Developer Advocate, Crunchy Data)
"There is no way to guarantee correctness with just two replicas. And many stories of lost transactions with Patroni/Stolon already confirms this thesis... I really dream PostgreSQL will be as reliable as MongoDB without need of external services."
-- Yura Sokolov, Postgres Professional (pgsql-hackers mailing list)
From Bruce Momjian (PostgreSQL Core Team founding member):
"On the server side, high availability means having the ability to quickly failover to standby hardware, hopefully with no data loss. Failover behavior on the client side is more nuanced... For clients using a connection pooler, things are even more complicated."
From Crunchy Data (major PostgreSQL contributor):
"When communication between [Patroni and etcd] breaks down, it creates instability in the environment resulting in failover, cluster restart, and even the loss of a primary database."
From PgCon 2012 Cluster Summit (official PostgreSQL conference):
"Currently we have to detect faults -- system down -- by polling. This takes much longer than the actual failover takes."
Batteries-included HA for PostgreSQL. Single binary. No external coordination service.
Boot the 3-node demo cluster, write to it, kill the leader, and watch the session stay alive, all in under five minutes:
git clone https://github.com/electricapp/pgbattery
cd pgbattery
cp .env.example .env # supplies PGBATTERY_MANAGEMENT_API_TOKEN
docker compose up -d # bring up 3-node cluster
psql postgres://postgres@localhost:5432/postgres # gateway always routes to the leader
# In another shell, drop the current leader; the psql session keeps working:
docker kill -s SIGKILL pgbattery-node1-1
Then inspect the cluster:
cargo run --release -- status --discover localhost:9081 # live dashboard
cargo run --release -- doctor --discover localhost:9081 # pre-deploy health gate
- Gateway listens on 5432 and always routes to the current leader.
- Internal PG port per container is 5434 (reach via docker compose exec).
- Management API on host 9081/9082/9083 (container 9091); Prometheus metrics on host 9091/9092/9093 (container 9090).
- Use TCP (-h localhost); Unix sockets aren't exposed in the demo.
- Set PGBATTERY_MANAGEMENT_API_TOKEN (any random string) before docker compose up; the Compose file fails fast if it is unset. cp .env.example .env is the easy path.
- Release binary (~14 MB) lives at target/release/pgbattery; copy to each node.
pgbattery is a single binary for PostgreSQL HA: leader election, fencing, commit verification, backups, metrics, and TLS, without etcd or a separate load balancer.
Idle connections migrate transparently to the new leader (<100 ms blip in the local 3-node cluster) with no client reconnect. A COMMIT caught mid-failover is resolved as committed-or-error by probing the new leader; other in-flight statements get a retryable error (SQLSTATE 08006). It also includes a CLI, REST API, Prometheus metrics, pg_basebackup/pg_dump automation, and a built-in upgrade workflow. Safety work includes a TLA+ election spec, chaos tooling, LSN-aware promotions, and lease-based fencing for stale primaries.
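On the client side, the retryable error can be handled with a small retry wrapper. The following is a minimal sketch, not pgbattery code: `DbError`, `run_with_retry`, and the backoff parameters are hypothetical names chosen for illustration, and a real driver exposes the SQLSTATE through its own exception type. It assumes the retried operation is safe to run again, for example a whole transaction that did not reach COMMIT.

```python
import time

RETRYABLE_SQLSTATES = {"08006"}  # connection_failure, the code named in the text


class DbError(Exception):
    """Hypothetical stand-in for a driver error that exposes a SQLSTATE."""

    def __init__(self, sqlstate, message=""):
        super().__init__(message or sqlstate)
        self.sqlstate = sqlstate


def run_with_retry(operation, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call operation() and retry only when it fails with a retryable SQLSTATE."""
    for attempt in range(attempts):
        try:
            return operation()
        except DbError as err:
            last_try = attempt == attempts - 1
            if err.sqlstate not in RETRYABLE_SQLSTATES or last_try:
                raise  # not retryable, or out of attempts
            sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Errors with any other SQLSTATE (a constraint violation, say) propagate immediately, so the wrapper never hides application bugs behind retries.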
Gateway (5432) → Governor (Raft) → Supervisor → PostgreSQL
- Gateway: parses PostgreSQL protocol, enforces lease, migrates idle connections.
- Governor: OpenRaft-based consensus with LSN-aware elections.
- Supervisor: manages PostgreSQL (initdb, promote/demote, backup/restore).
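To make "LSN-aware" concrete: PostgreSQL prints an LSN as two hexadecimal halves separated by a slash, so positions must be compared numerically, not as strings. The sketch below only illustrates that comparison. `best_candidate`, the replica dictionaries, and the tie-break rule are assumptions for illustration; how the Governor actually weighs LSNs inside its Raft elections is not described here.

```python
def parse_lsn(text):
    """Convert PostgreSQL's textual LSN ("hi/lo" in hex) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def best_candidate(replicas):
    """Pick the node with the highest LSN; ties go to the lowest node id (assumed rule)."""
    winner = max(replicas, key=lambda r: (parse_lsn(r["lsn"]), -r["node"]))
    return winner["node"]


replicas = [
    {"node": 1, "lsn": "0/3000060"},
    {"node": 2, "lsn": "0/3000148"},
    {"node": 3, "lsn": "0/3000148"},
]
print(best_candidate(replicas))
```

A plain string comparison would rank "9/0" above "10/0", which is why the halves are parsed before comparing.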
Commit probing: the Gateway captures txid_current() before COMMIT and, if the backend dies, asks the new leader SELECT txid_status(...). Clients get a definitive answer instead of "maybe committed."
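The decision step of commit probing can be sketched as a pure function over the value that txid_status returns. PostgreSQL documents the possible results as 'committed', 'aborted', 'in progress', or NULL when the transaction ID is too old to look up. The `Outcome` enum and `resolve_commit` are hypothetical names, and treating the last two cases as "unknown" is an assumption; the Gateway's real handling of them is not described here.

```python
from enum import Enum

# Query issued against the new leader, with the txid captured before COMMIT.
PROBE_SQL = "SELECT txid_status($1)"


class Outcome(Enum):
    COMMITTED = "committed"      # report COMMIT success to the client
    ROLLED_BACK = "rolled_back"  # report a definite failure
    UNKNOWN = "unknown"          # assumption: surface as an error


def resolve_commit(status):
    """Map a txid_status() result (str or None) to a client-facing outcome."""
    if status == "committed":
        return Outcome.COMMITTED
    if status == "aborted":
        return Outcome.ROLLED_BACK
    # 'in progress' or NULL: no definitive answer from this probe alone.
    return Outcome.UNKNOWN
```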
See ARCHITECTURE.md for the deep dive.
| Feature | pgbattery | Patroni | CloudNativePG | AWS RDS Multi-AZ |
|---|---|---|---|---|
| Client errors on failover | None for idle connections (migration); retryable error for in-flight statements | Reconnect required | Reconnect required | Reconnect required |
| Failover time | <100 ms idle / ~5 s writes | 15-30s | 20-40s | 60-120s (DNS) |
| Connection migration | Yes | No | No | No |
| In-flight COMMIT recovery | Yes (probe+verify) | No | No | No |
| Dependencies | None | etcd/Consul/ZK | Kubernetes | AWS infrastructure |
| Deployment | Single binary | Multiple components | K8s operator | Managed service |
| LSN-aware elections | Yes | Partial | No | N/A (managed) |
| Cost | Self-hosted | Self-hosted | Self-hosted | $$$ + vendor lock-in |
Reproduce with ./demo/bench.py, a uv script that drives pgbench and a
heartbeat probe against the 3-node docker compose cluster. Raw numbers
land in demo/bench-results.json.
Environment. macOS / Docker Desktop, 3 containers on one host (not a tuned production deployment). Treat the numbers as a sanity floor, not a ceiling.
| Workload | Result |
|---|---|
| pgbench TPC-B-like, 4 clients × 30 s | 1,085 TPS |
| Average write latency through the gateway | 3.7 ms |
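The two figures in this table can be cross-checked against each other. In a closed-loop benchmark where each client issues its next transaction as soon as the previous one returns, throughput is roughly the client count divided by the average latency. The snippet below applies that relation; it assumes the 3.7 ms figure is the per-transaction latency and that clients have no think time.

```python
def closed_loop_tps(clients, avg_latency_s):
    """Expected throughput when each client runs transactions back to back."""
    return clients / avg_latency_s


estimate = closed_loop_tps(clients=4, avg_latency_s=0.0037)
print(round(estimate))  # 1081
```

The estimate of about 1,081 TPS is within half a percent of the measured 1,085 TPS, so the latency and throughput rows are consistent.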
| Observation | Value |
|---|---|
| Idle / read-only connection blip | 68 ms (max gap) |
| Active writing connection unavailability | ~5 s |
| Cluster reconvergence (Raft re-election) | 2.7 s |
| Leader before → after | node3 → node1 |
Idle and read-only connections migrate transparently: the heartbeat loop
in bench.py saw a 68 ms worst-case gap. Connections in the middle of a
write transaction see a brief ReadOnlySqlTransaction window (visible in
the demo above) until the new leader is fully promoted; the gateway then
routes new statements to it without the client reconnecting.
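A "max gap" figure of this kind is the largest interval between consecutive successful heartbeats. The helper below is a minimal sketch of that calculation under the assumption that the probe records a timestamp (in seconds) for every heartbeat that succeeds; `max_gap_ms` is a hypothetical name and bench.py may compute it differently.

```python
def max_gap_ms(timestamps):
    """Largest interval between consecutive successful heartbeats, in ms."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return max(gaps, default=0.0) * 1000.0


# Heartbeats at a steady 0.5 s cadence with one longer pause in the middle.
samples = [0.0, 0.5, 1.0, 1.75, 2.25]
print(max_gap_ms(samples))  # 750.0
```

Note that the result includes the probe's own interval, so a gap only signals a blip when it clearly exceeds the normal cadence.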
| Container | CPU | Memory |
|---|---|---|
| node1 | 4.1 % | 145 MiB |
| node2 | 3.2 % | 116 MiB |
| node3 | 5.2 % | 159 MiB |
Prometheus metrics live inside each container on :9090/metrics (host ports 9091/9092/9093 in the demo). Every emitted metric carries a # HELP line; run curl -s localhost:9091/metrics | grep '^# HELP pgbattery_' for the full enumerated list.
# Raft / cluster
pgbattery_raft_state # 0 follower, 1 candidate, 2 leader
pgbattery_raft_term
pgbattery_raft_commit_index
pgbattery_leader_elections
# Replication (per-replica, labelled by node)
pgbattery_sync_replicas
pgbattery_sync_quorum # 1 if leader has a sync quorum
pgbattery_replica_lag_bytes{node="2"}
pgbattery_replica_lag_seconds{node="2"}
pgbattery_replica_health{node="2"} # 1.0 healthy, 0.5 lagging, 0.0 unhealthy
pgbattery_replica_is_sync{node="2"} # 2.0 sync, 1.0 potential, 0.0 async
# Connections
pgbattery_connections_active
pgbattery_connections_migrated
pgbattery_connections_severed
# Safety / fencing (CONTRACTS L1–L3)
pgbattery_emergency_fence
pgbattery_queries_rejected_lease_expired
pgbattery_local_lsn_bytes
pgbattery_lsn_future_skew_total
- Default flush interval is 250 ms to limit CPU overhead; lower it for chaos runs if you need millisecond-level insight.
- Structured debug logs (enable with RUST_LOG=pgbattery=debug) include leader elections, fencing decisions, and LSN deltas.
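For scripting against these metrics without a Prometheus server, the text exposition format is simple enough to parse by hand. The sketch below is a simplified parser, not a complete one: it assumes samples carry no trailing timestamp, and the `SAMPLE` text (including its HELP wording and values) is made up for illustration. `parse_metrics` and `RAFT_STATES` are hypothetical names.

```python
RAFT_STATES = {0: "follower", 1: "candidate", 2: "leader"}

# Illustrative scrape output; real values come from the /metrics endpoint.
SAMPLE = """\
# HELP pgbattery_raft_state Illustrative help text
# TYPE pgbattery_raft_state gauge
pgbattery_raft_state 2
pgbattery_replica_lag_bytes{node="2"} 4096
"""


def parse_metrics(text):
    """Parse exposition lines into {metric name with labels: float value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


metrics = parse_metrics(SAMPLE)
print(RAFT_STATES[int(metrics["pgbattery_raft_state"])])  # leader
```

For anything beyond quick checks, a maintained Prometheus client library is the safer choice, since it handles timestamps and escaping.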
Test scripts run under uv, a Python project/runtime manager:
brew install uv # or: pipx install uv
./testing/ci_runner.py --list # discover suites
./testing/ci_runner.py --suite ha-controlplane-pr # ~3 min smoke
./testing/ci_runner.py --suite ha-sequential # full sequential suite
- Matrix lives in testing/ci_matrix.yaml (25 step types, ~20 cases across 4 suites).
- testing/jepsen_lite.py: stdlib-only Jepsen-style register linearisability check.
- testing/repro_two_sync*.sh reproduce the OPEN metric-staleness anomaly tracked in BUGS.md.
- CI workflows in .github/workflows/: ha-ci.yml (push/PR/nightly) and jepsen-lite.yml (weekly).
Alpha release. Correctness work is ongoing; see CONTRACTS.md for the formal contracts.
- Leader failure: idle connections see a <100 ms blip and keep their connection; a COMMIT in flight is resolved as committed-or-error, other in-flight statements get a retryable error
- Network partition: multi-layer fencing prevents split-brain
- Connection migration: idle transactions survive failover
- Data integrity: synchronous replication, LSN-aware elections, and timeline verification
- Node crash recovery: operator must confirm disk integrity before rejoin (prevents data corruption)
This is intentional: fully automatic recovery in this scenario risks silent data loss.
- ARCHITECTURE.md: system design and component details
- STATE_MACHINE.md: canonical state-machine truth sources (Raft, lease, replication, gateway routing)
- CONTRACTS.md: correctness contracts (W1–W3, L1–L3, S1, R1–R2)
- DEPLOYMENT.md: bootstrap, join, TLS, Prometheus alerts
- RUNBOOK.md: incident response checklists
- MEMBERSHIP.md: voter/learner topology operations
