diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000000..b5afd522414f --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,219 @@ +# AGENTS.md - AI Agent Guide for etcd + +Quick reference for AI agents working on the OpenShift etcd fork. + +## Overview + +**etcd**: Distributed key-value store using Raft consensus, ~10K writes/sec, MVCC storage. Single source of truth for Kubernetes/OpenShift cluster state. + +**OpenShift Fork**: Branch `openshift-X.Y` (not `main`). Commit prefixes: `UPSTREAM: :`, `DOWNSTREAM:`. See [REBASE.openshift.md](./REBASE.openshift.md). + +## Architecture + +### Core Components +- **Raft Consensus** (`server/etcdserver/raft.go`, `server.go`) - All state changes via async proposals: `s.w.Register(id)` → `s.r.Propose(ctx, data)` → wait +- **MVCC Storage** (`server/storage/mvcc/`, `backend/`) - Revision (global), Version (per-key), Compaction, Defrag +- **Watches** (`server/storage/mvcc/watchable_store.go`) - Real-time key change notifications +- **Leases** (`server/lease/lessor.go`) - Time-bound key ownership +- **gRPC API** (`api/etcdserverpb/rpc.proto`) - Edit `.proto` → `make genproto` → Never edit `*.pb.go` + +### Key Directories +``` +server/etcdserver/ # Core: server.go, raft.go, apply*.go +server/storage/mvcc/ # MVCC, compaction +server/storage/backend/ # BoltDB, defrag +server/storage/wal/ # Write-Ahead Log +server/etcdserver/api/v3compactor/ # Auto-compaction +client/v3/ # Go client +etcdctl/ # CLI +etcdutl/ # Utilities (defrag, snapshot) +``` + +## Operations + +### Compaction & Defrag +**Compaction**: Removes old revisions, marks space free (doesn't reclaim disk) +```bash +etcd --auto-compaction-mode=periodic --auto-compaction-retention=5m +``` + +**Defragmentation**: Rewrites DB to reclaim space, blocks writes, needs ~2x DB memory +```bash +etcdctl defrag # Online: 30s-5min +etcdutl defrag --data-dir=/path # Offline: faster, requires stop +``` +**Trigger**: When `(db_total_size - db_size_in_use) / db_total_size > 30%` + 
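The trigger rule above is plain arithmetic and can be sketched in Go. This is a minimal illustration, not code from the etcd tree: `fragmentationRatio` and `shouldDefrag` are hypothetical helper names, and the two inputs are assumed to be the values reported by `etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes`.

```go
package main

import "fmt"

// fragmentationRatio returns the fraction of the database file that is
// allocated on disk but not in use: (total - inUse) / total.
// Hypothetical helper for illustration; it returns 0 for nonsensical input.
func fragmentationRatio(totalBytes, inUseBytes int64) float64 {
	if totalBytes <= 0 || inUseBytes < 0 || inUseBytes > totalBytes {
		return 0
	}
	return float64(totalBytes-inUseBytes) / float64(totalBytes)
}

// shouldDefrag applies the 30% rule of thumb quoted in the guide.
func shouldDefrag(totalBytes, inUseBytes int64) bool {
	return fragmentationRatio(totalBytes, inUseBytes) > 0.30
}

func main() {
	// Example values only: a 1000-byte file with 600 bytes in use.
	total, inUse := int64(1000), int64(600)
	fmt.Printf("ratio=%.2f defrag=%v\n",
		fragmentationRatio(total, inUse), shouldDefrag(total, inUse))
}
```

Guarding against a zero or inconsistent total keeps the helper safe to call before the first metrics scrape has returned real values.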
+**Files**: `server/storage/mvcc/kvstore_compaction.go`, `server/storage/backend/backend.go` + +### Backup & Restore + +**Snapshot Save** (online, 32KB chunks, SHA256): +```bash +etcdctl snapshot save backup.db +etcdctl snapshot status backup.db --write-out=table +``` +**Files**: `etcdctl/ctlv3/command/snapshot_command.go`, `client/v3/snapshot/v3_snapshot.go`, `server/etcdserver/api/v3rpc/maintenance.go` + +**Snapshot Restore** (offline, requires stop): +```bash +etcdutl snapshot restore backup.db --data-dir=/var/lib/etcd-restore \ + --name member1 --initial-cluster member1=http://host1:2380,... +``` +**Process**: Verify SHA256 → Copy DB → Trim membership → Create WAL/snapshot → Update index +**Files**: `etcdutl/snapshot/v3_snapshot.go`, `server/etcdserver/bootstrap.go` + +**WAL Replay** (automatic on startup, CRC32 validation, auto-repairs torn writes): +**Files**: `server/storage/wal/wal.go` (`ReadAll()`), `server/storage/wal/repair.go` + +**Disk Layout**: +``` +/var/lib/etcd/member/ +├── snap/{term}-{index}.snap, db, {index}.snap.db +└── wal/{seq}-{index}.wal +``` + +### TLS & Certificates + +**Setup** (Client/Peer/Metrics TLS): +```bash +etcd --cert-file=/path/server.crt --key-file=/path/server.key --client-cert-auth \ + --peer-cert-file=/path/peer.crt --peer-key-file=/path/peer.key +``` + +**Features**: Client cert auth, CN/SAN validation, CRL support, dynamic reload (no restart), auto-TLS (dev only) + +**Files**: +- Config: `server/embed/config.go`, `server/embed/etcd.go` +- Loading: `client/pkg/tlsutil/tlsutil.go`, `client/pkg/transport/listener.go` +- Client: `client/v3/config.go`, `server/etcdserver/api/v3rpc/grpc.go` +- Peer: `server/etcdserver/api/rafthttp/transport.go` +- Validation: `client/pkg/transport/listener_tls.go` (CRL, SAN) + +**Enhancement Areas**: Proactive cert reload (inotify, SIGHUP), TLS metrics, OCSP stapling + +### I/O Performance + +**Critical Paths**: +- **WAL**: fsync when `raft.MustSync()` true (target: P99 < 10ms) +- **Backend**: 
Batched commits every 100ms/10K txns (target: P99 < 25ms) + +**Tuning**: +```bash +etcd --wal-dir=/mnt/nvme/etcd-wal --data-dir=/mnt/ssd/etcd-data \ + --backend-batch-interval=100ms --backend-batch-limit=10000 \ + --snapshot-count=10000 +``` + +**Requirements**: SSD (NVMe preferred), dedicated disk, benchmark with `fio --rw=write --ioengine=sync --fdatasync=1 --size=22m --bs=2300` + +## Development + +### Workflows +- **API Feature**: Edit `.proto` → `make genproto` → Implement in `server/etcdserver/api/v3rpc/` → Client in `client/v3/` → Tests +- **Bug Fix**: Failing test → Minimal fix → `make test-unit PKG=./server/...` → `go test -race -count=100` +- **Performance**: Baseline → Profile (`-cpuprofile`) → Optimize → Document metrics + +### Testing +```bash +make test-unit # Fast, isolated +make test-integration # Real server + clients +make test-e2e # Real processes +go test -race -count=100 ./... # Race detection +make verify # Linters +``` +**Checklist**: Unit + integration (API changes) + E2E (features) + `-race` passes + +## Critical Rules + +### ALWAYS +1. Backwards compatibility - Never remove/rename API fields +2. Use Raft for state - All persistent changes via Raft +3. Handle errors - Check all returns, use zap logging +4. Add tests - All changes require tests +5. Profile first - Measure before optimizing + +### NEVER +1. Modify Raft directly - Use `s.r.Propose()` +2. Block Raft apply loop - Keep it fast +3. Edit generated code - Edit `.proto`, run `make genproto` +4. Break API - Deprecate, don't remove +5. Commit without tests +6. Use `fmt.Println` - Use `lg.Info()` (zap) +7. Assume leadership - Always propose via Raft +8. Disable tests - Fix or file issue +9. Unbounded allocations - Max 1.5MB request +10. Panic in library - Return errors + +### Ask First +- Raft changes: Consult maintainers, read [Raft paper](https://raft.github.io/raft.pdf) +- Dependencies: License, security, maintenance +- Breaking changes: Can it be compatible? 
+- Performance: Have benchmarks +- Storage format: Migration plan + +## Common Mistakes + +1. **Raft Flow**: Async (Propose → Replicate → Commit → Apply) +2. **Revision vs Version**: Revision=global, Version=per-key +3. **Context**: Check `<-ctx.Done()` in loops +4. **Defrag**: Needs ~2x DB memory, blocks writes +5. **Compaction**: Handle `ErrCompacted` on old revisions +6. **Transactions**: Use `Txn()`, not Get+Put +7. **Resources**: `defer cli.Close()` +8. **Consistency**: Linearizable (slow) vs Serializable (fast, stale) + +## Key Metrics + +**Critical Alerts**: +```promql +histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01 # Disk slow (P99 fsync > 10ms) +etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8 # Near quota +(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) + / etcd_mvcc_db_total_size_in_bytes > 0.3 # High fragmentation +etcd_server_proposals_pending > 100 # Raft slow +increase(etcd_server_leader_changes_seen_total[5m]) > 3 # Unstable leader +``` + +**Other Key Metrics**: +``` +etcd_disk_backend_commit_duration_seconds # Backend commit latency +etcd_server_proposals_committed_total # Raft proposals committed +etcd_debugging_snap_save_total_duration_seconds # Snapshot save time +``` + +**Access**: `curl http://localhost:2379/metrics` or `etcdctl endpoint status --write-out=table` + +## Configuration Defaults + +| Setting | Default | Constant, File | +|---------|---------|----------------| +| Snapshot count | 10,000 | `DefaultSnapshotCount`, `server/etcdserver/server.go` | +| Backend batch interval | 100ms | `defaultBatchInterval`, `server/storage/backend/backend.go` | +| Backend batch limit | 10,000 | `defaultBatchLimit`, `server/storage/backend/backend.go` | +| Database quota | 2GB (OpenShift: 8GB) | `DefaultQuotaBytes`, `server/storage/quota.go` | +| Max request size | 1.5MB | `DefaultMaxRequestBytes`, `server/embed/config.go` | + +## OpenShift + +**Commit Prefixes**: `UPSTREAM: <drop>:` (temporary), `UPSTREAM: <carry>:` (downstream-only), 
`DOWNSTREAM:` (OpenShift-specific) + +**CI**: `/payload 4.17 nightly informing`, `/payload 4.17 nightly blocking`, `launch openshift/etcd#PR` + +**Rebase**: `openshift-hack/rebase.sh --etcd-tag=v3.5.15 --openshift-release=openshift-4.17 --jira-id=12345` + +## Resources + +- [etcd.io/docs](https://etcd.io/docs) - Official docs +- [etcd Metrics](https://etcd.io/docs/v3.5/metrics/) - Full metrics +- [Raft Paper](https://raft.github.io/raft.pdf) - Consensus algorithm +- [REBASE.openshift.md](./REBASE.openshift.md) - Rebase procedures +- [OpenShift etcd Practices](https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html/etcd/etcd-practices) + +**Tools**: `etcdctl` (CLI), `etcdutl` (defrag/snapshot), `benchmark` (perf testing) + +--- + +**Version**: 4.0 (Final) +**Last Updated**: 2026-06-26 +**Verified**: Configs/metrics verified against codebase and [official docs](https://etcd.io/docs/v3.5/metrics/) diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 000000000000..d944dcdd9a7b --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,1428 @@ +# etcd - Architecture Documentation + +This document provides a comprehensive overview of etcd's architecture, design decisions, and operational model. + +## Table of Contents + +- [Overview](#overview) +- [System Architecture](#system-architecture) +- [Core Components](#core-components) +- [Raft Consensus](#raft-consensus) +- [Storage Architecture](#storage-architecture) +- [Client API](#client-api) +- [Watch Mechanism](#watch-mechanism) +- [Lease System](#lease-system) +- [Authentication and Authorization](#authentication-and-authorization) +- [Cluster Management](#cluster-management) +- [Performance Characteristics](#performance-characteristics) +- [Failure Modes and Recovery](#failure-modes-and-recovery) +- [Design Decisions](#design-decisions) +- [Deployment Topology](#deployment-topology) + +## Overview + +### What is etcd? 
+ +etcd is a distributed, reliable key-value store for the most critical data of distributed systems. It is the foundation for storing all cluster state in Kubernetes and OpenShift. + +**Core Characteristics**: +- **Consistency**: Strong consistency via Raft consensus +- **Reliability**: Survives network partitions and node failures +- **Performance**: Handles ~10,000 writes/sec in production +- **Simplicity**: Clean gRPC API with straightforward semantics +- **Security**: TLS encryption and RBAC authorization + +**Primary Use Case**: Kubernetes/OpenShift cluster state storage +- Every pod, service, configmap, deployment is stored in etcd +- Leader election and coordination primitives +- Configuration management +- Service discovery + +### Key Features + +1. **Distributed Consensus**: Raft algorithm ensures all nodes agree on state +2. **Multi-Version Storage**: MVCC enables historical queries and watch +3. **Watch API**: Real-time notifications when keys change +4. **Lease System**: Time-bound key ownership with automatic expiration +5. **Transaction Support**: Atomic multi-key operations with conditions +6. **Linearizable Reads**: Strongest consistency guarantee +7. **Member Management**: Dynamic cluster membership changes +8. 
**Snapshot & Restore**: Point-in-time backup and recovery + +## System Architecture + +### High-Level Architecture + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ etcd Cluster │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ +│ │ (Leader) │◄────►│ (Follower) │◄────►│ (Follower) │ │ +│ │ │ │ │ │ │ │ +│ │ ┌────────┐ │ │ ┌────────┐ │ │ ┌────────┐ │ │ +│ │ │ gRPC │ │ │ │ gRPC │ │ │ │ gRPC │ │ │ +│ │ │ Server │ │ │ │ Server │ │ │ │ Server │ │ │ +│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │ +│ │ │ │ │ │ │ │ │ │ │ +│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ +│ │ │ Raft │ │ │ │ Raft │ │ │ │ Raft │ │ │ +│ │ │ Node │ │ │ │ Node │ │ │ │ Node │ │ │ +│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │ +│ │ │ │ │ │ │ │ │ │ │ +│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ +│ │ │ MVCC │ │ │ │ MVCC │ │ │ │ MVCC │ │ │ +│ │ │ Store │ │ │ │ Store │ │ │ │ Store │ │ │ +│ │ └────┬───┘ │ │ └────┬───┘ │ │ └────┬───┘ │ │ +│ │ │ │ │ │ │ │ │ │ │ +│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ +│ │ │ BoltDB │ │ │ │ BoltDB │ │ │ │ BoltDB │ │ │ +│ │ │Backend │ │ │ │Backend │ │ │ │Backend │ │ │ +│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │ +│ │ │ │ │ │ │ │ │ │ │ +│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ +│ │ │ WAL │ │ │ │ WAL │ │ │ │ WAL │ │ │ +│ │ │ Log │ │ │ │ Log │ │ │ │ Log │ │ │ +│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │ +│ │ │ │ │ │ │ │ │ │ │ +│ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ ┌────▼───┐ │ │ +│ │ │ Snap │ │ │ │ Snap │ │ │ │ Snap │ │ │ +│ │ │ Store │ │ │ │ Store │ │ │ │ Store │ │ │ +│ │ └────────┘ │ │ └────────┘ │ │ └────────┘ │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +│ ▲ ▲ ▲ │ +└─────────┼─────────────────────┼─────────────────────┼───────────────┘ + │ │ │ + │ │ │ + ┌─────┴─────────────────────┴─────────────────────┴─────┐ + │ Client Applications │ + │ (Kubernetes API Server, etcdctl, custom clients) │ + 
└─────────────────────────────────────────────────────────┘ +``` + +### Data Flow + +**Write Operation (Linearizable)**: +``` +1. Client → gRPC API (any node) +2. Node → Forward to Leader (if not leader) +3. Leader → Propose to Raft +4. Raft → Replicate to majority +5. Raft → Commit entry +6. Leader → Apply to MVCC store +7. MVCC → Write to BoltDB backend +8. Backend → Persist to disk +9. Leader → Return response to client +``` + +**Read Operation (Linearizable)**: +``` +1. Client → gRPC API (any node) +2. Node → Check leadership (quorum read) +3. Node → Read from local MVCC store +4. MVCC → Query BoltDB backend +5. Node → Return response to client +``` + +**Read Operation (Serializable)**: +``` +1. Client → gRPC API (any node) +2. Node → Read from local MVCC store (no quorum check) +3. MVCC → Query BoltDB backend +4. Node → Return response to client +``` + +## Core Components + +### 1. EtcdServer + +**Location**: `server/etcdserver/server.go` + +**Responsibilities**: +- Coordinate all server operations +- Manage Raft node lifecycle +- Process client requests +- Apply committed Raft entries +- Manage cluster membership +- Handle snapshots and WAL + +**Key Structures**: +```go +type EtcdServer struct { + // Raft consensus + r raftNode + raftStorage *raft.MemoryStorage + + // Storage + kv mvcc.ConsistentWatchableKV + be backend.Backend + + // Cluster state + cluster api.Cluster + id types.ID + + // Configuration + Cfg config.ServerConfig + + // Lease management + lessor lease.Lessor + + // Apply layer + applyV3 apply.ApplyV3 +} +``` + +**Event Loop**: +The server runs a main event loop that: +1. Receives committed Raft entries +2. Applies entries to state machine (MVCC store) +3. Sends responses to waiting clients +4. Processes snapshots +5. Handles leadership changes + +### 2. 
Raft Node + +**Location**: `server/etcdserver/raft.go` + +**Responsibilities**: +- Implement Raft consensus protocol +- Manage leader election +- Replicate log entries +- Handle network communication between nodes +- Manage Raft configuration changes + +**Raft States**: +- **Leader**: Accepts writes, replicates to followers +- **Follower**: Replicates from leader, redirects writes +- **Candidate**: Transitional state during election +- **Learner**: Non-voting member (used for adding nodes) + +**Communication**: +- Uses `rafthttp` package for peer-to-peer communication +- Maintains persistent connections between nodes +- Handles message serialization and network failures + +### 3. MVCC Storage + +**Location**: `server/storage/mvcc/` + +**Architecture**: +``` +┌─────────────────────────────────────────┐ +│ ConsistentWatchableKV │ +│ (Combines consistency + watch) │ +└─────────────────┬───────────────────────┘ + │ +┌─────────────────▼───────────────────────┐ +│ WatchableKV │ +│ (Adds watch functionality) │ +└─────────────────┬───────────────────────┘ + │ +┌─────────────────▼───────────────────────┐ +│ KV Store │ +│ (Core MVCC operations) │ +│ - Put, Get, Delete, Txn │ +│ - Revision management │ +└─────────────────┬───────────────────────┘ + │ +┌─────────────────▼───────────────────────┐ +│ BoltDB Backend │ +│ (Persistent storage) │ +└─────────────────────────────────────────┘ +``` + +**Key Concepts**: + +**Revision**: Global monotonically increasing counter +- Increments on every write transaction +- Used for point-in-time queries +- Forms the basis for MVCC + +**Key Structure**: +``` +Key: /registry/pods/default/my-pod + CreateRevision: 100 + ModRevision: 105 + Version: 3 + Value: +``` + +**Index Structure**: +``` +BoltDB Buckets: + key → keyIndex (revision history) + keyIndex → + + meta → consistentIndex (last applied Raft index) + meta → scheduledCompactRevision + + rev_{revision} → key-value data +``` + +### 4. 
Backend Storage (BoltDB) + +**Location**: `server/storage/backend/` + +**BoltDB Characteristics**: +- Embedded key-value database +- B+tree data structure +- ACID transactions +- MVCC support +- Memory-mapped files for performance +- Single-writer, multiple-readers + +**Buckets**: +- `key`: Stores key index with revision history +- `meta`: Stores metadata (consistent index, compaction, etc.) +- `lease`: Stores lease information +- `auth`: Stores authentication data +- `members`: Stores cluster membership +- `cluster`: Stores cluster configuration + +**Backend Operations**: +```go +// Batch write (transaction) +tx := be.BatchTx() +tx.Lock() +defer tx.Unlock() +tx.UnsafePut(buckets.Key, key, value) +``` + +**Optimization**: +- Read transactions don't block writes +- Batch commits for better performance +- Periodic defragmentation to reclaim space + +### 5. Write-Ahead Log (WAL) + +**Location**: `server/storage/wal/` + +**Purpose**: Ensure durability of Raft log entries before they're applied. + +**Characteristics**: +- Append-only log structure +- Fsync after every write for durability +- Segmented files for easier management +- Used for crash recovery + +**WAL Record Types**: +```go +type Record struct { + Type RecordType // Entry, State, Snapshot, CRC + Data []byte + Crc uint32 +} +``` + +**Recovery Process**: +1. Read WAL from last snapshot +2. Replay entries to rebuild Raft state +3. Apply committed entries to state machine +4. Resume normal operation + +### 6. Snapshot Store + +**Location**: `server/etcdserver/api/snap/` + +**Purpose**: Periodic snapshots of entire state for faster recovery. + +**Snapshot Process**: +``` +1. Trigger snapshot (after N entries, typically 10,000) +2. Serialize current MVCC state +3. Write snapshot file +4. Update WAL with snapshot metadata +5. 
Truncate old WAL entries +``` + +**Benefits**: +- Faster recovery (don't replay entire WAL) +- Smaller WAL size +- Efficient cluster bootstrapping + +**Snapshot Format**: +``` +Snapshot File: + - Metadata (index, term, cluster config) + - BoltDB database dump + - CRC checksum +``` + +## Raft Consensus + +### Raft Overview + +etcd uses the Raft consensus algorithm to maintain a consistent, replicated log across all nodes. + +**Raft Properties**: +- **Leader-based**: One leader coordinates all writes +- **Strong consistency**: Linearizable reads and writes +- **Fault tolerance**: Survives f failures in 2f+1 cluster +- **Understandable**: Simpler than Paxos, easier to implement + +### Leader Election + +**Process**: +1. Follower times out waiting for heartbeat +2. Becomes candidate, increments term +3. Votes for itself, requests votes from others +4. Wins if receives majority votes +5. Becomes leader, sends heartbeats + +**Election Timeout**: Randomized to avoid split votes +- Typical: 1000-5000ms +- Prevents multiple candidates simultaneously + +**Safety**: Only candidates with up-to-date logs can win +- Candidate's log must contain all committed entries +- Ensures committed entries are never lost + +### Log Replication + +**Write Flow**: +``` +1. Client sends write to leader +2. Leader appends entry to local log +3. Leader sends AppendEntries RPC to followers +4. Followers append entry, respond with success +5. Leader commits entry after majority acknowledges +6. Leader applies entry to state machine +7. Leader notifies followers of commit +8. Followers apply entry to state machine +``` + +**Log Structure**: +``` +Index: 1 2 3 4 5 6 +Term: 1 1 2 2 3 3 +Entry: [A] [B] [C] [D] [E] [F] + ↑ ↑ + Committed Uncommitted +``` + +**Commit Rules**: +- Entry is committed when majority has it +- All entries before committed entry are also committed +- Committed entries are durable and will never be lost + +### Log Compaction + +**Problem**: Log grows unbounded over time. 
+ +**Solution**: Snapshot + truncate log. + +**Process**: +1. Create snapshot of current state +2. Store snapshot index and term +3. Truncate log up to snapshot index +4. New nodes receive snapshot instead of full log + +**Triggered by**: Raft entry count (default: 10,000 entries) + +### Network Partitions + +**Scenario**: Network partition splits cluster into two groups. + +**Majority Partition** (has quorum): +- Elects new leader +- Continues accepting writes +- Operates normally + +**Minority Partition** (no quorum): +- Cannot elect leader +- Rejects writes +- Accepts serializable reads (may be stale) + +**Recovery**: When partition heals +- Minority rejoins cluster +- Syncs with current leader +- Conflicting uncommitted entries are discarded + +### Membership Changes + +**Safe Reconfiguration**: Raft's joint consensus prevents split-brain during membership changes. + +**Process**: +1. Propose configuration change (add/remove member) +2. Enter joint consensus (both old and new configs) +3. Commit joint consensus +4. Transition to new configuration +5. 
Commit new configuration + +**Learner Members**: Non-voting members used for safe addition +- Receive log replication +- Don't participate in voting +- Promoted to voting member when caught up + +## Storage Architecture + +### MVCC Implementation + +**Multi-Version Concurrency Control** enables: +- Snapshot isolation for transactions +- Historical queries +- Watch from any revision that has not been compacted +- Non-blocking reads + +**Revision Semantics**: + +**Main Revision**: Global counter for all changes +``` +Transaction 1: Put key=A → Revision 10 +Transaction 2: Put key=B, Put key=C → Revision 11 (both get same revision) +``` + +**Mod Revision**: Revision at which the key was last modified +``` +Put key=A value=1 → ModRevision=10 +Put key=A value=2 → ModRevision=15 +Put key=B value=x → ModRevision=16 (a separate write gets the next revision) +``` + +**Version**: Number of modifications since the key was created (restarts at 1 when a deleted key is re-created) +``` +Put key=A value=1 → Version=1 +Put key=A value=2 → Version=2 +Put key=A value=3 → Version=3 +``` + +**Key Index Structure**: +```go +type keyIndex struct { + key []byte + modified revision // last modified revision + generations []generation +} + +type generation struct { + ver int64 // version counter + created revision // create revision + revs []revision // all modifications +} +``` + +**Example**: +``` +Put foo=a → Rev 10 +Put foo=b → Rev 15 +Delete foo → Rev 20 +Put foo=c → Rev 25 + +keyIndex for "foo": + modified: (25,0) + generations: + [0]: created: (10,0), ver: 3, revs: [(10,0), (15,0), (20,0)] # closed by the tombstone at Rev 20 + [1]: created: (25,0), ver: 1, revs: [(25,0)] +``` + +### Compaction + +**Purpose**: Free space for reuse by removing old revisions. + +**Types**: + +**Periodic Compaction** (default mode; auto-compaction stays disabled until a retention is set): +- Automatically compacts based on time +- Keeps revisions for configured duration (e.g., 5 minutes) +- `--auto-compaction-mode=periodic --auto-compaction-retention=5m` + +**Revision Compaction**: +- Keeps last N revisions +- `--auto-compaction-mode=revision --auto-compaction-retention=1000` + +**Process**: +1. Mark revisions < target as deleted +2. 
Async goroutine removes deleted revisions +3. BoltDB frees the corresponding pages in its B+tree +4. Freed pages are reusable immediately, but the file on disk does not shrink + +**Effect on Operations**: +- Queries at a compacted revision return `ErrCompacted` +- Watches started from a compacted revision fail +- Revisions older than the compaction point can no longer be read + +### Defragmentation + +**Problem**: Even after compaction, the BoltDB file stays fragmented and keeps its unused space on disk. + +**Solution**: Defragmentation creates a new database file containing only live data. + +**Process**: +1. Create new BoltDB file +2. Copy all live data to new file +3. Atomically replace old file +4. Old file space reclaimed + +**Trigger**: +```bash +etcdctl defrag # Online defrag (blocks writes) +etcdutl defrag --data-dir=/path # Offline defrag +``` + +**Trade-offs**: +- Online: Convenient but blocks writes, doubles disk usage temporarily +- Offline: Requires downtime but more efficient + +### Transaction Model + +**Transaction Structure**: +``` +If <comparisons> +Then <operations> +Else <operations> +``` + +**Example**: +```go +txn := Txn(). + If(Compare(Value("key"), "=", "old")). + Then(OpPut("key", "new"), OpPut("status", "updated")). 
+ Else(OpGet("key")) +``` + +**Semantics**: +- Evaluated atomically +- All comparisons in If() evaluated first +- Execute Then() if all comparisons succeed +- Execute Else() otherwise +- Return results of executed operations + +**Compare Operations**: +- `Value`: Compare key value +- `Version`: Compare key version +- `CreateRevision`: Compare create revision +- `ModRevision`: Compare mod revision +- `Lease`: Compare lease ID + +**Use Cases**: +- Compare-and-swap (CAS) +- Distributed locks +- Conditional updates +- Atomic multi-key operations + +## Client API + +### gRPC Services + +**KV Service** (`rpc.proto`): +```protobuf +service KV { + rpc Range(RangeRequest) returns (RangeResponse); // Get + rpc Put(PutRequest) returns (PutResponse); // Put + rpc DeleteRange(DeleteRangeRequest) returns (DeleteRangeResponse); // Delete + rpc Txn(TxnRequest) returns (TxnResponse); // Transaction + rpc Compact(CompactionRequest) returns (CompactionResponse); // Compact +} +``` + +**Watch Service**: +```protobuf +service Watch { + rpc Watch(stream WatchRequest) returns (stream WatchResponse); +} +``` + +**Lease Service**: +```protobuf +service Lease { + rpc LeaseGrant(LeaseGrantRequest) returns (LeaseGrantResponse); + rpc LeaseRevoke(LeaseRevokeRequest) returns (LeaseRevokeResponse); + rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse); + rpc LeaseTimeToLive(LeaseTimeToLiveRequest) returns (LeaseTimeToLiveResponse); + rpc LeaseLeases(LeaseLeasesRequest) returns (LeaseLeasesResponse); +} +``` + +**Cluster Service**: +```protobuf +service Cluster { + rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse); + rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse); + rpc MemberUpdate(MemberUpdateRequest) returns (MemberUpdateResponse); + rpc MemberList(MemberListRequest) returns (MemberListResponse); + rpc MemberPromote(MemberPromoteRequest) returns (MemberPromoteResponse); +} +``` + +**Maintenance Service**: +```protobuf 
+service Maintenance { + rpc Alarm(AlarmRequest) returns (AlarmResponse); + rpc Status(StatusRequest) returns (StatusResponse); + rpc Defragment(DefragmentRequest) returns (DefragmentResponse); + rpc Hash(HashRequest) returns (HashResponse); + rpc HashKV(HashKVRequest) returns (HashKVResponse); + rpc Snapshot(SnapshotRequest) returns (stream SnapshotResponse); + rpc MoveLeader(MoveLeaderRequest) returns (MoveLeaderResponse); + rpc Downgrade(DowngradeRequest) returns (DowngradeResponse); +} +``` + +### Client Library (client/v3) + +**Basic Operations**: +```go +// Create client +cli, err := clientv3.New(clientv3.Config{ + Endpoints: []string{"localhost:2379"}, + DialTimeout: 5 * time.Second, +}) +if err != nil { + log.Fatal(err) +} +defer cli.Close() + +// Put +ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) +_, err = cli.Put(ctx, "key", "value") +cancel() + +// Get (this context is reused by the calls below, so cancel it on return) +ctx, cancel = context.WithTimeout(context.Background(), 5*time.Second) +defer cancel() +resp, err := cli.Get(ctx, "key") + +// Get with prefix +resp, err = cli.Get(ctx, "prefix", clientv3.WithPrefix()) + +// Delete +_, err = cli.Delete(ctx, "key") + +// Transaction +txn := cli.Txn(ctx). + If(clientv3.Compare(clientv3.Value("key"), "=", "old")). + Then(clientv3.OpPut("key", "new")). 
+ Else(clientv3.OpGet("key")) +resp, err := txn.Commit() +``` + +**Watch**: +```go +watchChan := cli.Watch(context.Background(), "key") +for watchResp := range watchChan { + for _, event := range watchResp.Events { + fmt.Printf("Type: %s, Key: %s, Value: %s\n", + event.Type, event.Kv.Key, event.Kv.Value) + } +} +``` + +**Lease**: +```go +// Grant lease +lease, err := cli.Grant(ctx, 10) // 10 seconds + +// Put with lease +_, err = cli.Put(ctx, "key", "value", clientv3.WithLease(lease.ID)) + +// Keep alive +ch, err := cli.KeepAlive(context.Background(), lease.ID) +for ka := range ch { + // Lease renewed +} + +// Revoke lease (deletes associated keys) +_, err = cli.Revoke(ctx, lease.ID) +``` + +## Watch Mechanism + +### Architecture + +``` +┌────────────────────────────────────────────────┐ +│ Watch Clients │ +└─────────────────┬──────────────────────────────┘ + │ +┌─────────────────▼──────────────────────────────┐ +│ WatchableStore │ +│ ┌──────────────────────────────────────────┐ │ +│ │ Watcher Registry │ │ +│ │ - watchers map[string]*watcherGroup │ │ +│ │ - victims (slow watchers) │ │ +│ └──────────────────────────────────────────┘ │ +└─────────────────┬──────────────────────────────┘ + │ +┌─────────────────▼──────────────────────────────┐ +│ Event Generator │ +│ - Notifies watchers on Put/Delete │ +│ - Batches events for efficiency │ +└─────────────────┬──────────────────────────────┘ + │ +┌─────────────────▼──────────────────────────────┐ +│ MVCC Store │ +│ - Generates events during apply │ +└────────────────────────────────────────────────┘ +``` + +### Watch Types + +**Key Watch**: Watch single key +```go +watchChan := cli.Watch(ctx, "foo") +``` + +**Prefix Watch**: Watch all keys with prefix +```go +watchChan := cli.Watch(ctx, "foo", clientv3.WithPrefix()) +``` + +**Range Watch**: Watch key range +```go +watchChan := cli.Watch(ctx, "foo", clientv3.WithRange("foz")) +``` + +**Historical Watch**: Watch from past revision +```go +watchChan := cli.Watch(ctx, "foo", 
clientv3.WithRev(100)) +``` + +### Event Types + +```go +type Event struct { + Type EventType // PUT or DELETE + Kv *KeyValue // Current key-value + PrevKv *KeyValue // Previous key-value (if WithPrevKV) +} +``` + +### Watch Guarantees + +1. **Ordered**: Events delivered in revision order +2. **Reliable**: No events are lost or duplicated +3. **Resumable**: Can resume from any revision that has not been compacted +4. **Atomic**: Events from one revision (for example, all puts in a transaction) are delivered together and never split across responses + +### Slow Consumer Handling + +**Problem**: Slow consumer can't keep up with event rate. + +**Solution**: Event buffering with overflow detection. + +**Behavior**: +- Events are sent on a buffered channel (`chanBufLen`, 128 by default) +- If the buffer is full, the watcher is marked as a "victim" and its pending events are held back +- A background loop retries delivery of the held-back batch to victim watchers +- Client must drain the channel promptly or risk watch cancellation + +## Lease System + +### Architecture + +``` +┌────────────────────────────────────────────┐ +│ Lessor │ +│ ┌──────────────────────────────────────┐ │ +│ │ Lease Map │ │ +│ │ leaseID → Lease{TTL, keys} │ │ +│ └──────────────────────────────────────┘ │ +│ ┌──────────────────────────────────────┐ │ +│ │ Expiry Queue │ │ +│ │ heap of leases by expiry time │ │ +│ └──────────────────────────────────────┘ │ +└───────────────┬────────────────────────────┘ + │ + │ Expired leases + ▼ +┌───────────────────────────────────────────┐ +│ Lease Revoker │ +│ - Proposes lease revocation via Raft │ +│ - Deletes associated keys │ +└───────────────────────────────────────────┘ +``` + +### Lease Lifecycle + +**1. Grant Lease**: +```go +lease, err := cli.Grant(ctx, 30) // 30 seconds TTL +``` +- Assigns unique lease ID +- Sets initial TTL +- Raft-replicated for consistency + +**2. Attach Keys to Lease**: +```go +cli.Put(ctx, "key", "value", clientv3.WithLease(lease.ID)) +``` +- Key ownership tied to lease +- Multiple keys can share one lease +- Key deleted when lease expires + +**3. 
Keep Alive (Renew)**: +```go +ch, err := cli.KeepAlive(ctx, lease.ID) +for ka := range ch { + // Lease renewed +} +``` +- Client sends periodic heartbeats +- Resets lease expiry time +- Continues until context canceled + +**4. Lease Expiration**: +- Lessor detects expired lease +- Proposes revocation via Raft +- Keys associated with lease deleted +- Lease removed from map + +**5. Explicit Revocation**: +```go +cli.Revoke(ctx, lease.ID) +``` +- Immediately revokes lease +- Deletes all associated keys +- Raft-replicated + +### Use Cases + +**Distributed Locks**: +```go +// Acquire lock +lease, _ := cli.Grant(ctx, 30) +txn := cli.Txn(ctx). + If(clientv3.Compare(clientv3.CreateRevision("lock"), "=", 0)). + Then(clientv3.OpPut("lock", "holder", clientv3.WithLease(lease.ID))) +resp, _ := txn.Commit() + +if resp.Succeeded { + // Lock acquired + defer cli.Revoke(context.Background(), lease.ID) + + // Keep alive in background + ch, _ := cli.KeepAlive(context.Background(), lease.ID) + go func() { + for range ch {} + }() + + // Critical section +} +``` + +**Session Management**: +- Client creates lease at start +- Attaches session data to lease +- Keeps lease alive periodically +- Session auto-deleted if client crashes + +**Service Discovery**: +- Service registers endpoint with lease +- Keeps lease alive while running +- Endpoint removed on service crash + +## Authentication and Authorization + +### Authentication + +**Supported Methods**: +- **Simple Password**: Username/password authentication +- **TLS Client Certificates**: Mutual TLS authentication + +**User Management**: +```bash +etcdctl user add myuser # Add user +etcdctl user grant-role myuser admin # Grant role +etcdctl auth enable # Enable auth +``` + +**Client Authentication**: +```go +cli, err := clientv3.New(clientv3.Config{ + Endpoints: []string{"localhost:2379"}, + Username: "myuser", + Password: "mypassword", +}) +``` + +### Authorization (RBAC) + +**Role-Based Access Control**: + +**Roles**: Named collection 
of permissions
+```bash
+etcdctl role add myrole
+etcdctl role grant-permission myrole read /foo
+etcdctl role grant-permission myrole readwrite /bar
+```
+
+**Users**: Assigned one or more roles
+```bash
+etcdctl user add alice
+etcdctl user grant-role alice myrole
+```
+
+**Permissions**:
+- `read`: Get, watch
+- `write`: Put, delete
+- `readwrite`: Both read and write
+
+**Key Ranges**: Permissions apply to key ranges
+```bash
+# Permission on single key
+etcdctl role grant-permission myrole read /exact-key
+
+# Permission on key prefix
+etcdctl role grant-permission myrole read /prefix/ --prefix=true
+
+# Permission on key range
+etcdctl role grant-permission myrole read /start /end
+```
+
+**Root User**: Special user with all permissions
+- Must exist before `auth enable` succeeds (create it with `etcdctl user add root`)
+- Cannot be deleted while authentication is enabled
+- Used for administrative tasks
+
+## Cluster Management
+
+### Cluster Bootstrapping
+
+**Static Bootstrap**: All members known at start
+```bash
+# Member 1
+etcd --name=member1 \
+  --initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
+  --initial-cluster-state=new
+
+# Member 2
+etcd --name=member2 \
+  --initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
+  --initial-cluster-state=new
+
+# Member 3
+etcd --name=member3 \
+  --initial-cluster=member1=http://host1:2380,member2=http://host2:2380,member3=http://host3:2380 \
+  --initial-cluster-state=new
+```
+
+**Discovery Bootstrap**: Members discover each other via discovery service
+```bash
+# Generate discovery URL (quoted so the shell does not interpret `?`)
+curl 'https://discovery.etcd.io/new?size=3'
+
+# Start members with discovery URL
+etcd --name=member1 --discovery=https://discovery.etcd.io/xxxxx
+```
+
+### Adding Members
+
+**1. Add Learner** (recommended):
+```bash
+etcdctl member add newmember --learner=true --peer-urls=http://newhost:2380
+```
+
+**2. 
Start New Member**:
+```bash
+etcd --name=newmember \
+  --initial-cluster-state=existing \
+  --initial-cluster=member1=http://host1:2380,...,newmember=http://newhost:2380
+```
+
+**3. Promote Learner to Voting Member**:
+```bash
+etcdctl member promote <MEMBER_ID>
+```
+
+**Why Learners?**
+- Prevents quorum loss during catch-up
+- New member doesn't vote until fully synchronized
+- Learners do not count toward quorum (etcd caps the number of learners per cluster, one by default)
+
+### Removing Members
+
+```bash
+# List members
+etcdctl member list
+
+# Remove member
+etcdctl member remove <MEMBER_ID>
+
+# Stop member process
+systemctl stop etcd
+```
+
+**Quorum Considerations**:
+- 3-member cluster: Can remove 1 member safely (quorum: 2)
+- 5-member cluster: Can remove 2 members safely (quorum: 3)
+- Never remove majority of members simultaneously
+
+### Disaster Recovery
+
+**Scenario**: Lost quorum (majority of members failed).
+
+**Recovery Steps**:
+
+**1. Stop all members**:
+```bash
+systemctl stop etcd
+```
+
+**2. Restore from snapshot on one member**:
+```bash
+etcdutl snapshot restore snapshot.db \
+  --name=member1 \
+  --initial-cluster=member1=http://host1:2380 \
+  --initial-advertise-peer-urls=http://host1:2380
+```
+
+**3. Start restored member**:
+```bash
+etcd --force-new-cluster
+```
+
+**4. Add new members** (follow normal add process).
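The quorum figures quoted under Quorum Considerations follow from majority arithmetic: a cluster of `n` voting members needs `n/2 + 1` of them (integer division) to agree. The sketch below is only an illustration of that arithmetic. `quorum` and `faultTolerance` are hypothetical helper names, not etcd APIs, and the sketch assumes learners are excluded from the member count.

```go
package main

import "fmt"

// quorum returns how many voting members must agree in a cluster
// of n voting members (a strict majority). Hypothetical helper.
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many voting members can fail while
// the survivors still form a quorum. Hypothetical helper.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{1, 3, 4, 5, 7} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

Under these assumptions a 4-member cluster tolerates the same single failure as a 3-member one, which is one reason odd cluster sizes are usually preferred.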
+
+## Performance Characteristics
+
+### Throughput
+
+**Typical Performance** (on SSD):
+- Sequential writes: ~10,000 ops/sec
+- Random writes: ~5,000-8,000 ops/sec
+- Reads (local): ~100,000+ ops/sec
+- Linearizable reads (quorum): ~10,000 ops/sec
+
+**Factors**:
+- Disk I/O (WAL fsync is bottleneck)
+- Network latency (Raft replication)
+- Key/value size
+- Number of watchers
+- CPU and memory
+
+### Latency
+
+**Write Latency** (p99):
+- Local SSD: 10-50ms
+- Network SSD: 50-100ms
+- HDD: 100-500ms
+
+**Read Latency**:
+- Serializable (local): <1ms
+- Linearizable (quorum): 10-50ms
+
+**Components**:
+- Network RTT: 1-10ms
+- Raft replication: 5-20ms
+- WAL fsync: 5-20ms (SSD), 50-200ms (HDD)
+- BoltDB write: 1-5ms
+
+### Scalability Limits
+
+**Cluster Size**:
+- Recommended: 3 or 5 members
+- Maximum: 7 members (diminishing returns)
+- Larger clusters: Higher latency, lower throughput
+
+**Database Size**:
+- Recommended: <8GB
+- Warning at: 8GB
+- Alarm at: the backend quota (`--quota-backend-bytes`; 2GB by default in upstream etcd)
+- Maximum tested: 100GB+
+
+**Watchers**:
+- Typical: <10,000 watchers
+- Tested: 100,000+ watchers
+- Impact: Memory usage, event fanout latency
+
+**Keys**:
+- Millions of keys supported
+- Watch performance degrades with many keys per prefix
+- Compaction critical for large keyspaces
+
+### Optimization Techniques
+
+**1. Use SSDs**: Dramatic improvement in write latency.
+
+**2. Dedicated Disk**: Don't share disk with other I/O-intensive apps.
+
+**3. Tune OS**:
+```bash
+# Increase file descriptors
+ulimit -n 65536
+
+# Disable swap
+swapoff -a
+
+# I/O scheduler (`noop` on legacy kernels; the blk-mq equivalent is `none`)
+echo noop > /sys/block/sda/queue/scheduler
+```
+
+**4. etcd Configuration**:
+```bash
+# Snapshot less frequently (reduce I/O)
+--snapshot-count=50000
+
+# Larger request size limit
+--max-request-bytes=10485760
+
+# Auto-compaction
+--auto-compaction-mode=periodic
+--auto-compaction-retention=5m
+```
+
+**5. 
Client Best Practices**:
+- Use serializable reads when slightly stale data is acceptable
+- Batch operations in transactions
+- Use prefix watches instead of many individual watches
+- Close watchers when done
+- Reuse client connections
+
+## Failure Modes and Recovery
+
+### Single Node Failure
+
+**3-Member Cluster**:
+- Quorum: 2 nodes
+- Healthy nodes: 2
+- Status: **Operational**
+- Behavior: Cluster continues, leader election if leader failed
+
+**5-Member Cluster**:
+- Quorum: 3 nodes
+- Healthy nodes: 4
+- Status: **Operational**
+- Behavior: Cluster continues normally
+
+**Recovery**: Replace failed node with new member.
+
+### Quorum Loss
+
+**3-Member Cluster** (2 failures):
+- Quorum: 2 nodes
+- Healthy nodes: 1
+- Status: **Unavailable**
+- Behavior: Reads may work (serializable), writes fail
+
+**5-Member Cluster** (3 failures):
+- Quorum: 3 nodes
+- Healthy nodes: 2
+- Status: **Unavailable**
+
+**Recovery**: Restore from snapshot or repair members.
+
+### Network Partition
+
+**Scenario**: 3-member cluster splits into [2] and [1].
+
+**Majority Partition [2]**:
+- Has quorum
+- Elects leader
+- Accepts writes
+- Operational
+
+**Minority Partition [1]**:
+- No quorum
+- Cannot elect leader
+- Rejects writes
+- Serves stale serializable reads
+
+**Recovery**: When partition heals
+- Minority rejoins
+- Syncs with leader
+- Resumes normal operation
+
+### Disk Failure
+
+**Symptoms**:
+- Slow I/O
+- WAL write errors
+- Backend commit timeouts
+- Member drops out of cluster
+
+**Recovery**:
+1. Stop member
+2. Replace disk
+3. Restore from snapshot, or remove and re-add the member
+
+### Database Corruption
+
+**Detection**:
+- Hash mismatch errors
+- Backend corruption errors
+- Cluster consistency check failures
+
+**Recovery**:
+1. Identify corrupt member
+2. Stop corrupt member
+3. Restore from snapshot
+4. 
Restart member
+
+**Prevention**:
+- Use ECC memory
+- Validate backups regularly
+- Monitor cluster health
+
+### Split-Brain Prevention
+
+**Raft Guarantees**:
+- Only one leader per term
+- Leader requires majority votes
+- Two partitions cannot both have quorum
+
+**Example**: 3-node cluster splits [2] vs [1]
+- Partition [2]: Can elect leader, form quorum
+- Partition [1]: Cannot elect leader, no quorum
+- **No split-brain possible**
+
+## Design Decisions
+
+### Why Raft Instead of Paxos?
+
+**Reasons**:
+- **Understandability**: Raft is easier to understand and implement
+- **Strong leader**: Simplifies log management
+- **Modularity**: Separate leader election, log replication, safety
+- **Proof of correctness**: Formally verified safety properties
+
+**Trade-offs**:
+- Paxos may have slightly better performance in some scenarios
+- Raft's strong leader can be a bottleneck
+
+### Why BoltDB?
+
+**Reasons**:
+- **Embedded**: No separate database process
+- **ACID**: Strong consistency guarantees
+- **MVCC**: Perfect fit for etcd's needs
+- **Memory-mapped**: Efficient reads
+- **Simple**: Easy to understand and debug
+
+**Trade-offs**:
+- Single-writer (all writes through one goroutine)
+- File size growth requires defragmentation
+- Not optimized for very large datasets (>100GB)
+
+### Why gRPC?
+
+**Reasons**:
+- **Performance**: Binary protocol, HTTP/2 multiplexing
+- **Type Safety**: Protocol buffers with generated code
+- **Streaming**: Bi-directional streaming for watch
+- **Cross-language**: Clients in any language
+- **Built-in**: Authentication, load balancing, timeouts
+
+**Trade-offs**:
+- More complex than REST
+- Requires HTTP/2
+- Less human-readable than JSON
+
+### Why MVCC?
+
+**Reasons**:
+- **Watch**: Enables efficient watch from any revision
+- **Transactions**: Snapshot isolation for txns
+- **History**: Point-in-time queries
+- **Kubernetes**: Matches Kubernetes resourceVersion semantics
+
+**Trade-offs**:
+- Storage overhead (multiple versions)
+- Requires compaction to reclaim space
+- More complex than simple key-value
+
+## Deployment Topology
+
+### Development (1 Node)
+
+```
+┌──────────────┐
+│    etcd-1    │
+│   (Single)   │
+└──────────────┘
+```
+
+**Use**: Local development, testing
+**Fault Tolerance**: None
+**Performance**: Full read/write speed
+
+### Production (3 Nodes)
+
+```
+┌──────────────┐      ┌──────────────┐      ┌──────────────┐
+│    etcd-1    │◄────►│    etcd-2    │◄────►│    etcd-3    │
+│   (Leader)   │      │  (Follower)  │      │  (Follower)  │
+└──────────────┘      └──────────────┘      └──────────────┘
+```
+
+**Use**: Small production clusters
+**Fault Tolerance**: 1 node failure
+**Quorum**: 2 nodes
+
+### High Availability (5 Nodes)
+
+```
+┌──────────────┐      ┌──────────────┐      ┌──────────────┐
+│    etcd-1    │◄────►│    etcd-2    │◄────►│    etcd-3    │
+│   (Leader)   │      │  (Follower)  │      │  (Follower)  │
+└──────────────┘      └──────────────┘      └──────────────┘
+       ▲                                           ▲
+       │                                           │
+       ▼                                           ▼
+┌──────────────┐                            ┌──────────────┐
+│    etcd-4    │◄──────────────────────────►│    etcd-5    │
+│  (Follower)  │                            │  (Follower)  │
+└──────────────┘                            └──────────────┘
+```
+
+**Use**: Large production clusters
+**Fault Tolerance**: 2 node failures
+**Quorum**: 3 nodes
+
+### Multi-Region (5 Nodes)
+
+```
+Region 1             Region 2             Region 3
+┌──────────┐         ┌──────────┐         ┌──────────┐
+│  etcd-1  │◄───────►│  etcd-2  │◄───────►│  etcd-3  │
+│(Follower)│         │ (Leader) │         │(Follower)│
+└──────────┘         └──────────┘         └──────────┘
+                          ▲
+                          │
+                          ▼
+Region 1             ┌──────────┐         Region 3
+┌──────────┐         │  etcd-4  │         ┌──────────┐
+│  etcd-5  │◄───────►│(Follower)│◄───────►│  (etc)   │
+│(Follower)│         └──────────┘         │          │
+└──────────┘         Region 2             └──────────┘
+```
+
+**Use**: Global availability
+**Considerations**:
+- Higher latency (cross-region)
+- Place majority in low-latency region
+- 
Consider network costs
+
+### Kubernetes/OpenShift
+
+```
+┌────────────────────────────────────────────┐
+│        Kubernetes/OpenShift Cluster        │
+│                                            │
+│ ┌────────────────────────────────────────┐ │
+│ │          Control Plane Nodes           │ │
+│ │                                        │ │
+│ │  ┌──────────┐  ┌──────────┐   ┌───────┐│ │
+│ │  │  etcd-1  │  │  etcd-2  │   │ etcd-3││ │
+│ │  │(Static   │  │(Static   │   │(Static││ │
+│ │  │  Pod)    │  │  Pod)    │   │ Pod)  ││ │
+│ │  └──────────┘  └──────────┘   └───────┘│ │
+│ │       ▲              ▲              ▲  │ │
+│ └───────┼──────────────┼──────────────┼──┘ │
+│         │              │              │    │
+│ ┌───────▼──────────────▼──────────────▼──┐ │
+│ │        kube-apiserver instances        │ │
+│ │   (read/write cluster state to etcd)   │ │
+│ └────────────────────────────────────────┘ │
+└────────────────────────────────────────────┘
+```
+
+**Characteristics**:
+- etcd runs as static pods
+- Co-located with kube-apiserver
+- Dedicated data directory (hostPath)
+- Separate network for peer communication
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2026-06-25
+**Maintained By**: OpenShift etcd Team
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 000000000000..47dc3e3d863c
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 08d1807e9879..3d11110cfab6 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -3,6 +3,10 @@
 etcd is Apache 2.0 licensed and accepts contributions via GitHub pull requests.
 This document outlines the basics of contributing to etcd.
 
+**Note**: This is the **OpenShift fork** of etcd. For OpenShift-specific procedures, see [REBASE.openshift.md](./REBASE.openshift.md).
+
+## Contributor Workflow
+
 This is a rough outline of what a contributor's workflow looks like:
 * [Find something to work on](#Find-something-to-work-on)
 * [Check for flaky tests](#Check-for-flaky-tests)
@@ -20,11 +24,19 @@ If you have any questions, please reach out using one of the methods listed in [
 
 Before making a change please look through the resources below to learn more about etcd and tools used for development.
 
+**Essential Reading** (especially for new contributors):
+* **[CLAUDE.md](./CLAUDE.md)** / **[AGENTS.md](./AGENTS.md)** - AI agent entry point and comprehensive development guide with code organization, workflows, and best practices
+* **[ARCHITECTURE.md](./ARCHITECTURE.md)** - Detailed architecture documentation covering Raft, MVCC, storage, and more
+* **[REBASE.openshift.md](./REBASE.openshift.md)** - OpenShift-specific rebase procedures
+
+**External Resources**:
 * Please learn about [Git](https://github.com/git-guides) version control system used in etcd.
 * Read the [etcd learning resources](https://etcd.io/docs/v3.5/learning/)
 * Read the [etcd community membership](/Documentation/contributor-guide/community-membership.md)
 * Watch [etcd deep dive](https://www.youtube.com/watch?v=D2pm6ufIt98&t=927s)
 * Watch [etcd code walkthrough](https://www.youtube.com/watch?v=H3XaSF6wF7w)
+* Read the [Raft consensus algorithm paper](https://raft.github.io/raft.pdf)
+* Upstream repository: https://github.com/etcd-io/etcd
 
 ## Find something to work on