
catchup receive: fall back to full backup when empty dir given #2221

Open
x4m wants to merge 7 commits into wal-g:master from x4m:empty_rcv

@x4m
Collaborator

@x4m x4m commented Mar 25, 2026

Describe what this PR fixes

catchup-receive used to fatal immediately when the target directory had no pg_control, making it impossible to bootstrap a new standby from scratch without first running pg_basebackup.

This PR adds a three-way check on the receiver side (using utility.IsDirectoryEmpty, consistent with backup-fetch):

- pg_control exists: normal incremental catchup, behaviour unchanged
- directory is empty: receiver sends a zero sentinel to the sender; sender detects it, skips identity/timeline/checkpoint validation, and transfers all files as full copies, equivalent to an initial base backup over the catchup protocol
- directory has files but no pg_control: fatal with a clear message, which avoids silently overwriting an unrelated data directory

Please provide steps to reproduce (if it's a bug)

mkdir /var/lib/postgresql/new_standby
wal-g catchup-receive /var/lib/postgresql/new_standby 1337 &

# before this PR: FATAL: open .../global/pg_control: no such file or directory

wal-g catchup-send /var/lib/postgresql/data localhost:1337

Please add config and wal-g stdout/stderr logs for debug purpose

Before (fatal on empty directory)

INFO: Receiving /var/lib/postgresql/new_standby on port 1337
FATAL: open /var/lib/postgresql/new_standby/global/pg_control: no such file or directory

After (full copy succeeds)

INFO: Receiving /var/lib/postgresql/new_standby on port 1337
INFO: Data directory is empty, requesting full copy from sender

... sender transfers all files ...

INFO: Receive done

x4m added 3 commits March 25, 2026 18:07
If the receiver's data directory has no pg_control and is empty,
catchup-receive now sends a zero-value PgControlData sentinel and an
empty file list instead of fataling.  catchup-send detects the zero
SystemIdentifier, skips the identity/timeline/checkpoint checks, and
sends all files as full copies — effectively performing a pg_basebackup-
style initial population of the standby directory.

If the directory has files but no pg_control, catchup-receive fatals
with a clear message, consistent with the check in backup-fetch
(NonEmptyDBDataDirectoryError).  The directory state classification
uses utility.IsDirectoryEmpty, matching the approach in
checkDBDirectoryForUnwrap.

Made-with: Cursor
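
A minimal sketch of the sender-side sentinel detection described in this commit message. The `PgControlData` struct below is a reduced, hypothetical stand-in with only two fields, and `validateReceiver` is an illustrative name; the actual checks in catchup-send (including the checkpoint comparison) are not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// PgControlData is a reduced stand-in for this sketch only.
type PgControlData struct {
	SystemIdentifier uint64
	CurrentTimeline  uint32
}

// isFullCopyRequest treats a zero SystemIdentifier as the
// "receiver directory was empty" sentinel.
func isFullCopyRequest(receiver PgControlData) bool {
	return receiver.SystemIdentifier == 0
}

// validateReceiver returns fullCopy=true when the sentinel is seen and
// skips the identity and timeline checks in that case.
func validateReceiver(sender, receiver PgControlData) (fullCopy bool, err error) {
	if isFullCopyRequest(receiver) {
		return true, nil
	}
	if sender.SystemIdentifier != receiver.SystemIdentifier {
		return false, errors.New("system identifier mismatch")
	}
	if sender.CurrentTimeline != receiver.CurrentTimeline {
		return false, errors.New("timeline mismatch")
	}
	return false, nil
}

func main() {
	sender := PgControlData{SystemIdentifier: 42, CurrentTimeline: 1}

	// Empty receiver: zero-value sentinel, so the checks are skipped.
	full, err := validateReceiver(sender, PgControlData{})
	fmt.Println(full, err)

	// Unrelated cluster: the identity check still applies.
	_, err = validateReceiver(sender, PgControlData{SystemIdentifier: 7, CurrentTimeline: 1})
	fmt.Println(err)
}
```

Reusing the existing control-data message for the sentinel means the wire format does not need a new command for the empty-directory case.
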
Test classifyDataDirectory with an empty directory, a directory
containing pg_control, and a directory with files but no pg_control.
Also test that sendControlAndFileList encodes a zero-value PgControlData
and an empty BackupFileList when the target directory is empty.

Made-with: Cursor
Add a new scenario that creates a completely empty PGDATA directory,
runs catchup-receive on it, then runs catchup-send from the primary.
After the transfer, the standby is started and its pg_dump output is
compared to the primary dump to verify a successful full copy.

Made-with: Cursor
@x4m x4m requested a review from a team as a code owner March 25, 2026 13:14
x4m added 4 commits March 25, 2026 20:42
wait_while_pg_not_ready.sh loops until transaction_read_only is off,
which never happens on a hot standby.  pg_ctl -w already waits for the
server to be ready to accept connections.

Made-with: Cursor
…ucket

Protect the receiver's data directory from accidental PostgreSQL startup
while a catchup is in progress:

- At the start of HandleCatchupReceive, pg_control is renamed to
  pg_control.catchup (or an empty sentinel is created for a fresh empty
  directory).  Without pg_control PostgreSQL refuses to start.

- The sender's pg_control arrives as an IsBinContents command.  Instead
  of writing it directly, the receiver stores it as pg_control.new so
  the directory remains unbootable even after all data files are received.

- Only when IsDone is received are the final two atomic steps performed:
    1. pg_control.new  ->  pg_control   (database becomes startable)
    2. pg_control.catchup is removed    (sentinel cleanup)

- classifyDataDirectory gains a new pgDataStateInterrupted state: if
  pg_control.catchup exists but pg_control does not, the receiver fatals
  with a clear message asking the operator to remove the directory and
  retry from scratch.  Partial retry is not safe because already-written
  paged files may receive incorrect incremental applies.

- receiveFileList now skips pg_control.catchup by name so the sender
  never treats the sentinel as a file to delete.

Also add the missing minio bucket mariadb_pitr_mariabackup_full to
docker-compose.yml; the bucket was omitted when the test was introduced
in commit 7751b10, causing the test to fail with NoSuchBucket.

Unit tests: TestClassifyDataDirectory_Interrupted,
            TestClassifyDataDirectory_NormalBeatsInterrupted.

Made-with: Cursor
Contributor

@vadv vadv left a comment


Missing MkdirAll in IsFull branch

targetFileName = utility.SanitizePath(PgControlNewPath)
}
tracelog.InfoLogger.Printf("Writing file %v", targetFileName)
err := os.MkdirAll(path.Dir(path.Join(directory, targetFileName)), 0700)

The IsFull branch below (line 343) needs the same MkdirAll — on an empty receiver the parent directories won't exist yet.
