Add snapshot backup support for PostgreSQL #2101
Implement snapshot backups that leverage filesystem-level or cloud disk snapshots (e.g., AWS EBS, Azure Managed Disks, GCP Persistent Disks, ZFS, LVM) for creating PostgreSQL backups while maintaining proper database consistency and point-in-time recovery capabilities.

Snapshot backups store only metadata in WAL-G storage while the actual database files remain in externally managed snapshots. This approach provides near-instantaneous backups regardless of database size and significantly reduces storage costs in WAL-G's object storage.

Key components:

1. snapshot-push command: Coordinates backup creation by calling pg_start_backup(), executing a user-defined snapshot command, calling pg_stop_backup(), and uploading metadata. The snapshot command receives environment variables (WALG_SNAPSHOT_NAME, WALG_PG_DATA, WALG_SNAPSHOT_START_LSN, WALG_SNAPSHOT_START_WAL_FILE) for proper snapshot tagging and identification.
2. snapshot-fetch command: Prepares restored snapshots for PostgreSQL recovery by creating backup_label and tablespace_map files from stored metadata. Supports automatic recovery configuration for both PostgreSQL 12+ (recovery.signal) and earlier versions (recovery.conf), with optional point-in-time recovery target specification.
3. Automatic WAL protection: Critical safety feature that prevents deletion of WAL segments required by snapshot backups. During any delete operation, WAL-G identifies all snapshot backups and protects their required WAL range (start LSN to finish LSN) from deletion, ensuring snapshot backups remain recoverable even with aggressive retention policies.
4. Exact backup_label preservation: Stores the exact content returned by pg_stop_backup() for backup_label and tablespace_map files rather than reconstructing them. This ensures compatibility across all PostgreSQL versions and handles any future format changes automatically, as PostgreSQL-generated files are guaranteed to be readable by PostgreSQL.

Implementation stores snapshot metadata in BackupSentinelDto with new BackupLabel and TablespaceMap fields. Snapshot backups are identified by FilesMetadataDisabled=true, CompressedSize=0, and presence of BackupLabel content. Delete operations check IsSnapshotBackup() and protect required WAL segments through GetPermanentBackupsAndWals() modifications.

Configuration uses WALG_SNAPSHOT_COMMAND (required) for snapshot creation and WALG_SNAPSHOT_DELETE_COMMAND (optional) for cleanup. Commands execute in shell context with environment variables for maximum flexibility across different infrastructure providers.

Testing includes comprehensive snapshot_test.sh with 10 test cases covering backup creation, restoration, PITR, deletion, retention policies, and WAL protection verification. Tests use cp -al (hardlinks) to emulate filesystem snapshots without requiring cloud infrastructure.

Documentation in docs/PostgreSQL_Snapshot.md provides complete usage examples for major cloud providers (AWS, Azure, GCP) and on-premises solutions (ZFS, LVM), along with technical implementation details, security considerations, and best practices.

Snapshot backups integrate seamlessly with existing WAL-G features including backup-list, delete commands, permanent backup flag, encryption, compression (for WAL files), and multiple storage backends.

Author: Cursor, Sonnet 4.5, some whacking by me
Discussion: wal-g#1781
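As an illustration of the snapshot command contract described above (a shell command that receives the WALG_SNAPSHOT_* environment variables), here is a minimal Go sketch. It is not the code from this PR: the `snapshotEnv` struct and `runSnapshotCommand` helper are hypothetical names, and the `echo` command stands in for a real snapshot tool such as a cloud CLI or `zfs snapshot`.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// snapshotEnv groups the values that the description says are exported to
// the snapshot command. The struct itself is a hypothetical helper.
type snapshotEnv struct {
	Name         string
	PgData       string
	StartLSN     string
	StartWALFile string
}

// runSnapshotCommand runs the user-defined command through "sh -c", adding
// the WALG_SNAPSHOT_* variables to the current environment, and returns the
// combined stdout/stderr of the command.
func runSnapshotCommand(command string, env snapshotEnv) (string, error) {
	cmd := exec.Command("sh", "-c", command)
	cmd.Env = append(os.Environ(),
		"WALG_SNAPSHOT_NAME="+env.Name,
		"WALG_PG_DATA="+env.PgData,
		"WALG_SNAPSHOT_START_LSN="+env.StartLSN,
		"WALG_SNAPSHOT_START_WAL_FILE="+env.StartWALFile,
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return string(out), fmt.Errorf("snapshot command failed: %w", err)
	}
	return string(out), nil
}

func main() {
	// Placeholder command; a real setup would call a snapshot tool here.
	out, err := runSnapshotCommand(
		`echo "snapshot $WALG_SNAPSHOT_NAME of $WALG_PG_DATA"`,
		snapshotEnv{Name: "snap_example", PgData: "/var/lib/postgresql/data"},
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Running the command through a shell keeps the configuration a single string, at the cost of making quoting and trust in that string the operator's responsibility.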
    }

    if inRecovery {
        return errors.New("Cannot perform snapshot backup on a standby server")
Possible issue: when pg_start_backup() runs successfully but returns inRecovery = true, the function immediately returns an error without calling pg_stop_backup().
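One way to address this kind of leak is to register the stop call as soon as the start call succeeds, so every early return closes the backup. The sketch below is only an illustration under assumptions: `backupSession`, `start`, and `stop` are hypothetical stand-ins for the PostgreSQL connection and the pg_start_backup()/pg_stop_backup() calls, and the standby check is kept solely to show the early-return path. A real implementation also needs the values returned by pg_stop_backup() on the success path, which this simplified version ignores.

```go
package main

import (
	"errors"
	"fmt"
)

// backupSession is a hypothetical stand-in for the database connection.
type backupSession struct {
	inRecovery bool
	stopped    bool
}

// start mimics pg_start_backup() and reports whether the server is in recovery.
func (s *backupSession) start() (bool, error) { return s.inRecovery, nil }

// stop mimics pg_stop_backup() and records that it was called.
func (s *backupSession) stop() error {
	s.stopped = true
	return nil
}

// snapshotBackup guarantees that stop runs on every path after a successful start.
func snapshotBackup(s *backupSession, takeSnapshot func() error) (err error) {
	inRecovery, err := s.start()
	if err != nil {
		return fmt.Errorf("start backup: %w", err)
	}
	// The backup is open from here on, so close it on every return path.
	defer func() {
		if stopErr := s.stop(); stopErr != nil {
			err = errors.Join(err, fmt.Errorf("stop backup: %w", stopErr))
		}
	}()

	if inRecovery {
		return errors.New("cannot perform snapshot backup on a standby server")
	}
	return takeSnapshot()
}

func main() {
	s := &backupSession{inRecovery: true}
	err := snapshotBackup(s, func() error { return nil })
	fmt.Println(err)
	fmt.Println("stop called:", s.stopped)
}
```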
Thanks for reviewing! Well, this code was written by Cursor and it is... well... I see no reason to disallow snapshot backups on standby.
Hi!
I thought for a while about why standby is forbidden, but in the end I accepted the “religion of the code.” It is probably for the best: there’s no real load on the primary, so using a replica doesn’t add much value.
If you run pg_stop_backup() on a replica, it finishes immediately without waiting for the WAL segments (the range mentioned in backup_label) to be archived, especially since archive_mode is usually disabled on replicas.
So you get a “successful” backup, but at restore time some required WAL segments might be missing.
On the primary, pg_stop_backup() always waits until all WALs are archived, ensuring consistency.
In my view, if backups on replicas are ever allowed, the docs must clearly state that users are fully responsible for controlling WAL archiving.
We have supported backups on standby since 2017, so there's no going back to forbidding them anywhere :)
      errorGroup, _ := errgroup.WithContext(ctx)
      errorGroup.Go(func() error {
    -     err := json2.MarshalWrite(writer, data)
    +     err := json.NewEncoder(writer).Encode(data)
Is there any reason to roll back #2056?
If it breaks compilation, upgrade to Go 1.25 and add GOEXPERIMENT=jsonv2 to your env (this part is not obvious).
Yup, this is a bogus change; I'll keep json2.
    sbh.QueryRunner, err = NewPgQueryRunner(conn)
    if err != nil {
        return errors.Wrap(err, "failed to build query runner")
NIT: it seems that github.com/pkg/errors is now a public archive, so we can use fmt.Errorf() instead.
I think at some point we should make this consistent across the codebase...
    var lsnString string
    var inRecovery bool
    err = sbh.QueryRunner.Connection.QueryRow(context.TODO(), startBackupQuery, backupLabel).Scan(
        &walFileName, &lsnString, &inRecovery)
Why not use sbh.QueryRunner.StartBackup()?