Add snapshot backup support for PostgreSQL #2101

Draft
x4m wants to merge 3 commits into wal-g:master from x4m:snapshot

x4m (Collaborator) commented Nov 1, 2025

Implement snapshot backups that leverage filesystem-level or cloud disk snapshots (e.g., AWS EBS, Azure Managed Disks, GCP Persistent Disks, ZFS, LVM) for creating PostgreSQL backups while maintaining proper database consistency and point-in-time recovery capabilities.

Snapshot backups store only metadata in WAL-G storage while the actual database files remain in externally managed snapshots. This approach provides near-instantaneous backups regardless of database size and significantly reduces storage costs in WAL-G's object storage.

Key components:

  1. snapshot-push command: Coordinates backup creation by calling pg_start_backup(), executing a user-defined snapshot command, calling pg_stop_backup(), and uploading metadata. The snapshot command receives environment variables (WALG_SNAPSHOT_NAME, WALG_PG_DATA, WALG_SNAPSHOT_START_LSN, WALG_SNAPSHOT_START_WAL_FILE) for proper snapshot tagging and identification.

  2. snapshot-fetch command: Prepares restored snapshots for PostgreSQL recovery by creating backup_label and tablespace_map files from stored metadata. Supports automatic recovery configuration for both PostgreSQL 12+ (recovery.signal) and earlier versions (recovery.conf), with optional point-in-time recovery target specification.

  3. Automatic WAL protection: Critical safety feature that prevents deletion of WAL segments required by snapshot backups. During any delete operation, WAL-G identifies all snapshot backups and protects their required WAL range (start LSN to finish LSN) from deletion, ensuring snapshot backups remain recoverable even with aggressive retention policies.

  4. Exact backup_label preservation: Stores the exact content returned by pg_stop_backup() for backup_label and tablespace_map files rather than reconstructing them. This ensures compatibility across all PostgreSQL versions and handles any future format changes automatically, as PostgreSQL-generated files are guaranteed to be readable by PostgreSQL.

Implementation stores snapshot metadata in BackupSentinelDto with new BackupLabel and TablespaceMap fields. Snapshot backups are identified by FilesMetadataDisabled=true, CompressedSize=0, and presence of BackupLabel content. Delete operations check IsSnapshotBackup() and protect required WAL segments through GetPermanentBackupsAndWals() modifications.
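
To make the identification rule and the protected WAL range concrete, here is a small sketch. It assumes a simplified `sentinel` struct rather than the real BackupSentinelDto (the LSN field names and the string type for `BackupLabel` are guesses), and it assumes the default 16 MiB WAL segment size. The helper names are hypothetical.

```go
package main

import "fmt"

// sentinel is a simplified, hypothetical stand-in for BackupSentinelDto,
// holding only the fields mentioned in the description.
type sentinel struct {
	FilesMetadataDisabled bool
	CompressedSize        int64
	BackupLabel           string
	StartLSN              uint64 // assumed field name
	FinishLSN             uint64 // assumed field name
}

// isSnapshotBackup applies the three conditions from the description.
func isSnapshotBackup(s sentinel) bool {
	return s.FilesMetadataDisabled && s.CompressedSize == 0 && s.BackupLabel != ""
}

// walSegmentSize is PostgreSQL's default segment size (assumption: the
// cluster was not initialized with a different --wal-segsize).
const walSegmentSize = 16 * 1024 * 1024

// protectedSegments returns the first and last WAL segment numbers covering
// the start..finish LSN range. Including the segment that contains the
// finish LSN is a conservative choice.
func protectedSegments(s sentinel) (first, last uint64) {
	return s.StartLSN / walSegmentSize, s.FinishLSN / walSegmentSize
}

// walFileName formats a segment number the way PostgreSQL names WAL files:
// timeline, then the high and low parts of the segment number, each as
// eight hexadecimal digits.
func walFileName(timeline uint32, segNo uint64) string {
	segmentsPerXLogID := uint64(0x100000000) / walSegmentSize
	return fmt.Sprintf("%08X%08X%08X", timeline, segNo/segmentsPerXLogID, segNo%segmentsPerXLogID)
}

func main() {
	// Illustrative values only.
	s := sentinel{
		FilesMetadataDisabled: true,
		CompressedSize:        0,
		BackupLabel:           "placeholder label",
		StartLSN:              0x3000028,
		FinishLSN:             0x5000100,
	}
	if !isSnapshotBackup(s) {
		return
	}
	first, last := protectedSegments(s)
	for seg := first; seg <= last; seg++ {
		fmt.Println("protect", walFileName(1, seg))
	}
}
```

A delete operation could then skip any WAL file whose name falls in the printed set, which is the effect the description attributes to the GetPermanentBackupsAndWals() changes.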

Configuration uses WALG_SNAPSHOT_COMMAND (required) for snapshot creation and WALG_SNAPSHOT_DELETE_COMMAND (optional) for cleanup. Commands execute in shell context with environment variables for maximum flexibility across different infrastructure providers.
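
As a rough picture of "commands execute in shell context with environment variables", the sketch below runs a command through `sh -c` with the four variables named in this description. The helper names are hypothetical and the values in `main` are illustrative; the PR's actual invocation may differ.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// snapshotEnv (hypothetical helper) builds the environment entries that the
// description says are passed to the snapshot command.
func snapshotEnv(name, pgData, startLSN, startWALFile string) []string {
	return []string{
		"WALG_SNAPSHOT_NAME=" + name,
		"WALG_PG_DATA=" + pgData,
		"WALG_SNAPSHOT_START_LSN=" + startLSN,
		"WALG_SNAPSHOT_START_WAL_FILE=" + startWALFile,
	}
}

// runSnapshotCommand runs command via "sh -c" with extraEnv appended to the
// current process environment and returns the combined stdout and stderr.
func runSnapshotCommand(command string, extraEnv []string) (string, error) {
	cmd := exec.Command("sh", "-c", command)
	cmd.Env = append(os.Environ(), extraEnv...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return string(out), fmt.Errorf("snapshot command failed: %w", err)
	}
	return string(out), nil
}

func main() {
	// Illustrative values; a real run would take them from the backup start.
	env := snapshotEnv("snapshot_example", "/var/lib/postgresql/data",
		"0/3000028", "000000010000000000000003")

	// echo stands in for a real snapshot tool (cloud CLI, zfs, lvcreate, ...).
	out, err := runSnapshotCommand(`echo "tagging $WALG_SNAPSHOT_NAME at $WALG_SNAPSHOT_START_LSN"`, env)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Going through the shell lets the configured command use pipes and variable expansion, at the cost of requiring the operator to trust whatever is placed in WALG_SNAPSHOT_COMMAND.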

Testing includes comprehensive snapshot_test.sh with 10 test cases covering backup creation, restoration, PITR, deletion, retention policies, and WAL protection verification. Tests use cp -al (hardlinks) to emulate filesystem snapshots without requiring cloud infrastructure.

Documentation in docs/PostgreSQL_Snapshot.md provides complete usage examples for major cloud providers (AWS, Azure, GCP) and on-premises solutions (ZFS, LVM), along with technical implementation details, security considerations, and best practices.

Snapshot backups integrate seamlessly with existing WAL-G features including backup-list, delete commands, permanent backup flag, encryption, compression (for WAL files), and multiple storage backends.

Author: Cursor, Sonnet 4.5, some whacking by me
Discussion: #1781

@x4m x4m requested a review from a team as a code owner November 1, 2025 17:48
}

if inRecovery {
return errors.New("Cannot perform snapshot backup on a standby server")

Contributor

Possible issue: when pg_start_backup() runs successfully but returns inRecovery = true, the function immediately returns an error without calling pg_stop_backup().

Collaborator Author

thanks for reviewing! Well, this code was written by Cursor and it is...well... I see no reason to disallow snapshot backup on standby.

Contributor

> thanks for reviewing! Well, this code was written by Cursor and it is...well... I see no reason to disallow snapshot backup on standby.

Hi!
I thought about it for a while — why standby is forbidden — but in the end, I accepted the “religion of the code.” Probably for the best: there’s no real load on the primary, and using a replica doesn’t add much value.

If you run pg_stop_backup() on a replica, it finishes immediately without waiting for the WAL segments (the range mentioned in backup_label) to be archived — especially since archive_mode is usually disabled on replicas.
So you get a “successful” backup, but at restore time, some required WAL segments might be missing.

On the primary, pg_stop_backup() always waits until all WALs are archived, ensuring consistency.

From my view — if backups on replicas are ever allowed, docs must clearly state that users are fully responsible for controlling WAL archiving.

Collaborator Author

We have supported backups on standby since 2017, so there's no going back to forbidding them anywhere :)

@x4m x4m marked this pull request as draft November 2, 2025 17:27
Comment thread internal/uploader.go
  errorGroup, _ := errgroup.WithContext(ctx)
  errorGroup.Go(func() error {
- 	err := json2.MarshalWrite(writer, data)
+ 	err := json.NewEncoder(writer).Encode(data)

Contributor

Is there any reason to roll back #2056?
If it breaks compilation, upgrade to Go 1.25 and add GOEXPERIMENT=jsonv2 to your environment (this part is not obvious).

Collaborator Author

Yup, this is a bogus change; I'll keep json2.


sbh.QueryRunner, err = NewPgQueryRunner(conn)
if err != nil {
return errors.Wrap(err, "failed to build query runner")

Contributor

NIT: it seems that github.com/pkg/errors is a public archive, so we can use fmt.Errorf() instead.

Collaborator Author

I think at some point we should make this consistent across the codebase...

var lsnString string
var inRecovery bool
err = sbh.QueryRunner.Connection.QueryRow(context.TODO(), startBackupQuery, backupLabel).Scan(
&walFileName, &lsnString, &inRecovery)

Contributor

Why not use sbh.QueryRunner.StartBackup()?

Collaborator Author

Good idea!

Labels

postgres PostgreSQL issue

3 participants