File encryption modules poc #1

Open: AndersAstrand wants to merge 8 commits into master from file-encryption-modules-poc

AndersAstrand (Owner) commented May 12, 2026

This is a POC for introducing file encryption modules that can be used to encrypt files at rest. The modules are usable from both the backend and frontend tools (pg_checksums is patched as an example).

All the code here is generated using Claude and Codex, so it should be considered throw-away. But I think the concept can work.

If an encryption module wants an SQL-level configuration interface, that would have to be a separate module that modifies configuration the encryption module is able to read, stored somewhere readable from frontend tools.

The POC encrypts pages and temp files.

The encryption module is chosen at initdb time and all pages (including the catalogs) are encrypted. Each relation gets an extra fork called _key, and each page has some reserved space for any authentication data the encryption module needs. The module itself decides the format of both.

The POC includes two encryption modules and patches one frontend tool.

The encryption modules just use a raw secret key as configuration for now, but a real-world encryption module would need to get the key from elsewhere, or even multiple keys if we want different principal keys for different things or a deeper hierarchy of keys.

AndersAstrand and others added 8 commits May 12, 2026 16:42
Introduce an infrastructure for loadable modules that encrypt files at
rest.  Configuring an encryption library is an all-or-nothing choice:
when the library named in pg_control loads, every flavor of on-disk
I/O this series teaches the server to encrypt -- BufFile temp files,
logical-decoding spill files, and relation pages -- is routed through
the module.  There is no partial mode.

A module returns a FileEncryptionCallbacks struct from
_PG_file_encryption_module_init.  The struct describes two encryption
flows that share the module's per-process state:

  * Record-stream encryption: encrypt_cb / decrypt_cb on a
        (path, file_offset, plaintext) tuple.  The module produces
        ciphertext of length data_len + overhead_size with the overhead
        at the tail.  Used by callers that work on variable-length
        records with no per-relation identity (BufFile, reorderbuffer
        spill, both wired up in later commits).  Per-call binding
        context: (path, file_offset) -- modules mix it into key/IV/MAC
        derivation however they see fit (AEAD AAD, HMAC input, XTS-
        style tweak, ...) so substituting one record for another at
        decrypt time fails.

  * Per-relation page encryption: a five-callback group
        (generate_object_key_cb, object_open_cb, object_close_cb,
        encrypt_page_cb, decrypt_page_cb) plus page_overhead_size.
        One DEK per relation, generated at relation-create time and
        shared by all forks and segment files of the relation; pages
        are encrypted BLCKSZ-in / BLCKSZ-out with the per-page overhead
        at the tail.  page_overhead_size may be zero -- modes like
        AES-XTS, or a stream cipher with a deterministic IV derived
        from the binding context, don't need a per-page trailer at
        all.  Per-page binding context: (fork, blocknum); the
        relation's RelFileLocator is bound into the wrapped DEK at
        object-open time.

All seven callbacks are required: the validator rejects any module
that ships fewer.  Optional startup_cb / shutdown_cb hooks let the
module attach per-process state to FileEncryptionModuleState; the
state is allocated fresh in each process (including forked children),
so a module never observes a private_data pointer left behind by an
ancestor.
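
To make the "seven required callbacks" rule concrete, here is a minimal sketch of what the validator might check. The struct name, the generic callback type, and the function name are hypothetical stand-ins; the real prototypes live in file_encryption_module.h and are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder signature: the real callbacks each have their own prototype. */
typedef void (*fe_generic_cb) (void);

/* Hypothetical mirror of FileEncryptionCallbacks, for illustration only. */
typedef struct FileEncryptionCallbacksSketch
{
	/* record-stream flow */
	fe_generic_cb encrypt_cb;
	fe_generic_cb decrypt_cb;
	size_t		overhead_size;

	/* per-relation page flow */
	fe_generic_cb generate_object_key_cb;
	fe_generic_cb object_open_cb;
	fe_generic_cb object_close_cb;
	fe_generic_cb encrypt_page_cb;
	fe_generic_cb decrypt_page_cb;
	size_t		page_overhead_size; /* may legitimately be zero */

	/* optional per-process hooks */
	fe_generic_cb startup_cb;
	fe_generic_cb shutdown_cb;
} FileEncryptionCallbacksSketch;

/* Accept a module only if all seven required callbacks are present. */
bool
file_encryption_callbacks_complete(const FileEncryptionCallbacksSketch *cb)
{
	return cb != NULL &&
		cb->encrypt_cb != NULL &&
		cb->decrypt_cb != NULL &&
		cb->generate_object_key_cb != NULL &&
		cb->object_open_cb != NULL &&
		cb->object_close_cb != NULL &&
		cb->encrypt_page_cb != NULL &&
		cb->decrypt_page_cb != NULL;
}
```

Note that startup_cb and shutdown_cb are deliberately left out of the check, and that a page_overhead_size of zero is not treated as an error.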

The ABI lives in src/include/common/file_encryption_module.h so the
same .so is loadable from both the backend and any libpgcommon-based
frontend tool.  src/common/file_encryption_load.c gives frontend tools
a dlopen-and-init helper.  The module's entry point is

    bool _PG_file_encryption_module_init(const char *config,
                                          const FileEncryptionCallbacks **out,
                                          char **errmsg);

receiving an opaque, module-defined configuration string.

Encryption identity lives in pg_control; the module configuration is
supplied separately through a GUC:

  * file_encryption_library is a NAMEDATALEN field stamped at initdb
        time and immutable afterwards.  Both backend startup and
        frontend tools read the same field.
  * page_reserved_size is stamped from the module's declared
        page_overhead_size at initdb time (which may be zero).
  * file_encryption_config is a PGC_POSTMASTER, GUC_SUPERUSER_ONLY,
        GUC_NO_SHOW_ALL string GUC that holds the module-defined
        configuration blob (typically secret key material).

Initdb learns --file-encryption-library=NAME and
--file-encryption-config=STRING; it forwards the library name to
bootstrap via -L and the config blob to every backend it spawns via
extra_options.  Bootstrap parses -L, loads the module via
process_file_encryption_library() BEFORE BootStrapXLOG, and
BootStrapXLOG queries FileEncryptionPageReservedSize() to populate
pg_control's page_reserved_size.  pg_controldata prints the new
fields.  pg_upgrade refuses upgrades across encryption boundaries.

Module load-time validation requires the module's page_overhead_size
to match the cluster's page_reserved_size exactly: a mismatch means
the module was changed or replaced after initdb, which would silently
corrupt every page on the next write.

KEY_FORKNUM is added to ForkNumber/forkNames/MAX_FORKNUM as the fifth
fork; relations under file encryption will hold their wrapped DEK
there (the on-disk machinery for that fork lands with the page-
encryption commit).

SMgrRelationData gains a `void *encryption_object_state` field
(zeroed on smgropen() and released via the module's object_close_cb
from smgrdestroy()) where the page-encryption flow caches per-relation
state.

PG_CONTROL_VERSION is bumped because ControlFileData grows new fields.

A stub XOR-based test module lands under src/test/modules/
test_file_encryption.  It implements all seven callbacks with
page_overhead_size = 0, so it doubles as the in-tree demonstration
that zero-overhead page encryption is a supported configuration.  Its
record-stream consumers (BufFile, reorderbuffer spill) all arrive in
subsequent commits, so this commit only ships the framework and the
module-loading machinery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reference file_encryption_library module that ties both API flows
together with real cryptography.

Configuration: a single hex-encoded 256-bit key, passed verbatim via
the file_encryption_config GUC and forwarded by the loader to the
module's _PG_file_encryption_module_init.  The key is used as a
key-encryption key (KEK), not directly to encrypt caller bytes.  The
module parses the string itself (no per-module GUC namespace).
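
As an illustration of the "module parses the string itself" step, a hex-encoded 256-bit key could be decoded roughly as follows. The function names and the strict 64-character requirement are assumptions for this sketch, not the module's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEK_LEN 32				/* 256-bit key-encryption key */

/* Map one hex digit to its value, or -1 if it is not a hex digit. */
static int
hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode exactly 64 hex characters into 32 key bytes; reject anything else. */
bool
parse_hex_kek(const char *config, uint8_t out[KEK_LEN])
{
	if (config == NULL || strlen(config) != 2 * KEK_LEN)
		return false;

	for (size_t i = 0; i < KEK_LEN; i++)
	{
		int			hi = hex_nibble(config[2 * i]);
		int			lo = hex_nibble(config[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return false;
		out[i] = (uint8_t) ((hi << 4) | lo);
	}
	return true;
}
```

A real module would also want to scrub the decoded key from memory when it is no longer needed; that is omitted here.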

Record-stream encryption (BufFile, reorderbuffer spill):

  * Per-call: generate a fresh 256-bit data-encryption key, encrypt
    the plaintext with AES-256-GCM, wrap the DEK with AES-256-GCM
    under the KEK, store the wrapped DEK alongside the ciphertext.
  * AAD = basename(path) || file_offset(be64).  Using only the
    basename keeps CREATE DATABASE FILE_COPY and ALTER DATABASE SET
    TABLESPACE working (both clone files into directories whose
    leading path components change).
  * On-disk per call (96 bytes overhead):
        [ ciphertext N bytes ]
        [ data IV 12B ][ data tag 16B ]
        [ wrap IV 12B ][ wrapped data key 32B ]
        [ wrap tag 16B ][ format 4B ][ pad 4B ]
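
The AAD construction and the 96-byte overhead above can be sketched in a few lines. Field constants and the function name are hypothetical; only the sizes and the basename-plus-big-endian-offset rule come from the description.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Trailer field sizes as listed in the on-disk layout. */
enum
{
	REC_DATA_IV_LEN = 12,
	REC_DATA_TAG_LEN = 16,
	REC_WRAP_IV_LEN = 12,
	REC_WRAPPED_KEY_LEN = 32,
	REC_WRAP_TAG_LEN = 16,
	REC_FORMAT_LEN = 4,
	REC_PAD_LEN = 4
};

#define REC_OVERHEAD \
	(REC_DATA_IV_LEN + REC_DATA_TAG_LEN + REC_WRAP_IV_LEN + \
	 REC_WRAPPED_KEY_LEN + REC_WRAP_TAG_LEN + REC_FORMAT_LEN + REC_PAD_LEN)

_Static_assert(REC_OVERHEAD == 96, "record trailer must total 96 bytes");

/*
 * Build AAD = basename(path) || file_offset(be64).
 * Returns the AAD length, or 0 if it does not fit in out_cap.
 */
size_t
build_record_aad(const char *path, uint64_t file_offset,
				 uint8_t *out, size_t out_cap)
{
	const char *slash = strrchr(path, '/');
	const char *base = slash ? slash + 1 : path;
	size_t		base_len = strlen(base);

	if (base_len + 8 > out_cap)
		return 0;

	memcpy(out, base, base_len);
	for (int i = 0; i < 8; i++)
		out[base_len + i] = (uint8_t) (file_offset >> (56 - 8 * i));
	return base_len + 8;
}
```

Because only the basename is used, the same file copied into a different directory yields the same AAD, which is the property the FILE_COPY and SET TABLESPACE paths rely on.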

Per-relation page encryption:

  * generate_object_key_cb: mints a fresh DEK at relation-create time,
    wraps it under the KEK with AAD = relNumber(be32), returns the
    wrapped bytes that go into KEY_FORKNUM block 0.
  * object_open_cb: unwraps the DEK into a per-relation state struct
    the core caches on SMgrRelation.encryption_object_state.
  * encrypt_page_cb / decrypt_page_cb: AES-256-GCM with a fresh
    per-page IV.  AAD = fork(be32) || blocknum(be32).  The fork field
    distinguishes MAIN block N from INIT block N so neither can be
    substituted for the other on disk; reinit re-encrypts INIT bytes
    under MAIN's AAD on unlogged-relation reset.
  * On-disk per page (32 bytes trailer):
        [ data IV 12B ][ data tag 16B ][ format 4B ]

The declared page_overhead_size of 32 becomes the cluster's
page_reserved_size when the module is loaded at initdb time via
--file-encryption-library=basic_file_encryption.
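
A small sketch of the per-page binding context and trailer placement described above. The constant and function names are hypothetical, and BLCKSZ is assumed to be the default 8192.

```c
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default block size */

enum
{
	PAGE_IV_LEN = 12,
	PAGE_TAG_LEN = 16,
	PAGE_FORMAT_LEN = 4,
	PAGE_TRAILER_LEN = PAGE_IV_LEN + PAGE_TAG_LEN + PAGE_FORMAT_LEN
};

_Static_assert(PAGE_TRAILER_LEN == 32, "page trailer must total 32 bytes");

/* Store a 32-bit value in big-endian byte order. */
static void
put_be32(uint8_t *dst, uint32_t v)
{
	dst[0] = (uint8_t) (v >> 24);
	dst[1] = (uint8_t) (v >> 16);
	dst[2] = (uint8_t) (v >> 8);
	dst[3] = (uint8_t) v;
}

/* AAD = fork(be32) || blocknum(be32) */
void
build_page_aad(uint32_t fork, uint32_t blocknum, uint8_t out[8])
{
	put_be32(out, fork);
	put_be32(out + 4, blocknum);
}

/* Byte offset within the page where the module's trailer begins. */
size_t
page_trailer_offset(void)
{
	return BLCKSZ - PAGE_TRAILER_LEN;
}
```

With this AAD, MAIN block 5 and INIT block 5 authenticate under different inputs, so swapping their ciphertext on disk would fail tag verification.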

A TAP smoke test (001_synthetic.pl) initdb's a cluster with the
module configured, starts it, and round-trips a value through a
freshly created heap.  initdb's bootstrap and post-bootstrap steps
exercise the full per-process module-state lifecycle (every catalog
write goes through encrypt_page_cb under a backend that has run the
module's startup_cb), so just standing the cluster up is meaningful
coverage; the SELECT then proves a page-encryption round-trip end to
end.  The BufFile, reorderbuffer-spill, relation-page, and
unlogged-relation tests arrive with the subsequent commits that wire
each of those flows up.

Gated on --with-openssl in both the autoconf and meson build systems,
alongside pgcrypto.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carve the cluster's page_reserved_size off the tail of every page: the
usable area shrinks from BLCKSZ to BLCKSZ - reserved, and pd_upper /
pd_special bounds are validated against that smaller window.

The trailer is owned by the smgr / encryption layer.  In the plaintext
view it is always zero (PageInit zeroes it; no AM ever writes there
because pd_special never extends that far), so pd_checksum continues
to cover the full BLCKSZ without modification.  When the smgr layer
encrypts a page on the way to disk (subsequent commit), it consumes
the trailer for the module's per-page metadata; on the way back, it
hands the buffer manager a plaintext page with a zero trailer so
pd_checksum verifies.

Page-header changes:

  * PageInit shrinks usable area by GetPageReservedSize(), zeroes the
    full page (including the trailer), and places pd_upper / pd_special
    inside the usable area.
  * PageGetUsableSize(page) returns BLCKSZ - reserved; PageGetSpecialSize
    and PageValidateSpecialPointer use it.
  * Six "corrupted page pointers" defensive checks in bufpage.c, plus
    one in bufmask.c's mask_unused_space, tighten the pd_special upper
    bound from BLCKSZ to BLCKSZ - reserved.
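
The tightened bound can be pictured as follows. This is a simplified sketch of the kind of check involved, not the bufpage.c code; the function name is hypothetical, BLCKSZ is assumed to be 8192, and the 24-byte page header size is the usual upstream value.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default block size */
#define SIZE_OF_PAGE_HEADER 24	/* offsetof(PageHeaderData, pd_linp) upstream */

/*
 * Validate page pointers against the usable window.  With reserved = 0
 * this degenerates to the familiar "pd_special <= BLCKSZ" check.
 */
bool
page_pointers_sane(uint16_t pd_lower, uint16_t pd_upper,
				   uint16_t pd_special, uint16_t reserved)
{
	uint32_t	usable = (uint32_t) BLCKSZ - reserved;

	return pd_lower >= SIZE_OF_PAGE_HEADER &&
		pd_lower <= pd_upper &&
		pd_upper <= pd_special &&
		pd_special <= usable;
}
```

A page whose pd_special equals BLCKSZ is valid in an unencrypted cluster but is flagged as corrupted once any trailer is reserved.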

Header-exposed AM size macros stay BLCKSZ-based (so they remain
compile-time constants for static contexts and extensions) and each
gets a runtime-aware sibling that subtracts GetPageReservedSize():

  * MaxHeapTupleSize          -> MaxHeapTupleSizeForCluster()
  * BTMaxItemSize             -> BTMaxItemSizeForCluster()
  * BTMaxItemSizeNoHeapTid    -> BTMaxItemSizeNoHeapTidForCluster()
  * GiSTPageSize              -> GiSTPageSizeForCluster()
  * GinMaxItemSize            -> GinMaxItemSizeForCluster()
  * GinDataPageMaxDataSize    -> GinDataPageMaxDataSizeForCluster()
  * GinListPageSize           -> GinListPageSizeForCluster()
  * SPGIST_PAGE_CAPACITY      -> SpGistPageCapacityForCluster()
  * REVMAP_CONTENT_SIZE       -> RevmapContentSizeForCluster()
  * REVMAP_PAGE_MAXITEMS      -> RevmapPageMaxItemsForCluster()
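
As a worked example of one pairing, the upstream MaxHeapTupleSize is BLCKSZ minus the MAXALIGN'd size of the page header plus one line pointer. A runtime sibling could plausibly subtract the reserved trailer as well; the exact formula used by MaxHeapTupleSizeForCluster() is an assumption here, as are the helper names.

```c
#include <stddef.h>

#define BLCKSZ 8192				/* assumed default block size */
#define SIZE_OF_PAGE_HEADER 24	/* upstream SizeOfPageHeaderData */
#define SIZE_OF_ITEM_ID 4		/* upstream sizeof(ItemIdData) */

/* Round up to the next multiple of 8, as MAXALIGN does on common platforms. */
#define MAXALIGN_SKETCH(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Compile-time flavor: stays usable as an array dimension. */
#define MAX_HEAP_TUPLE_SIZE_SKETCH \
	(BLCKSZ - MAXALIGN_SKETCH(SIZE_OF_PAGE_HEADER + SIZE_OF_ITEM_ID))

/* Runtime flavor: shrinks by the cluster's reserved trailer. */
size_t
max_heap_tuple_size_for_cluster(size_t page_reserved_size)
{
	return MAX_HEAP_TUPLE_SIZE_SKETCH - page_reserved_size;
}
```

Under these assumptions an unencrypted cluster keeps the upstream limit of 8160 bytes, while trailers of 32 and 56 bytes lower it to 8128 and 8104 respectively.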

HashMaxItemSize(page) is already runtime-bound on its page argument;
it just switches from PageGetPageSize to PageGetUsableSize, which is
not a compile-time-to-runtime shift.

File-local helpers (BrinMaxItemSize in brin_pageops.c, GIN_PAGE_FREESIZE
in ginfast.c) are mutated in place: they're not exported, so there is
no extension compatibility argument for keeping them compile-time.

Updated runtime call sites use the cluster-aware variants:

  * heap insert (hio.c) and rewrite (rewriteheap.c) "row is too big"
    rejects use the cluster value; the "nearly empty" threshold does
    too
  * btree insert/sort/dedup/utils fit checks and amcheck verification
  * gin entry-page, data-page, posting-tree, and pending-list checks
  * gist sortbuild's pageFreeSpace estimate and gistfitpage
  * spgist insert/split and freespace decisions
  * brin revmap layout (HEAPBLK_TO_REVMAP_BLK / _INDEX) and
    pageinspect's brin_revmap_data
  * planner's heap-tuple density estimate (plancat.c)
  * planner's brin revmap-pages estimate (selfuncs.c)
  * lazy vacuum's freespace recording (vacuumlazy.c)
  * heapam_handler.c HEAP_USABLE_BYTES_PER_PAGE
  * btree split-location and dedup space accounting in nbtsplitloc.c
    and nbtdedup.c switch from PageGetPageSize to PageGetUsableSize
  * bufmask.c mask_page_content stops at the trailer

Two header-exposed call sites intentionally keep the BLCKSZ-derived
upper bound:

  * nbtree.c estnbtreeshared sizing -- wants the worst-case allocation
  * nbtxlog.c REDO state->maxpostingsize -- conservatively larger than
    primary's cap, must accommodate any posting list the primary built

Compile-time constants used as struct or stack-array dimensions
(MaxHeapTuplesPerPage, MaxTIDsPerBTreePage, etc.) are intentionally
left BLCKSZ-based; they only over-allocate by a handful of slots in
encrypted clusters, which is harmless.

FSM and VM forks are NOT covered: NodesPerPage and MAPSIZE are
deliberately left BLCKSZ-based, and md.c (next commit) will exempt
those forks from encryption.  They carry only metadata, not user
data, and the bypass keeps the FSM tree-walking and VM bitmap
arithmetic free of runtime parameterization.

With reserved = 0 (default when no encryption is configured) the
on-disk format is byte-identical to upstream and the regression suite
passes unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, md.c routes every
MAIN_FORKNUM / INIT_FORKNUM block through the configured module on
read, write, and extend.  The module produces ciphertext exactly
BLCKSZ bytes long; the cluster's page_reserved_size trailer (which
matches the module's declared page_overhead_size, possibly zero) is
its workspace for per-page metadata.  The data-encryption key comes
from per-relation state cached on SMgrRelation (populated lazily on
first I/O via FileEncryptionOpenObject, which smgr-reads the
relation's KEY fork and asks the module to unwrap the DEK).

KEY-fork lifecycle:

  * RelationCreateStorage() creates KEY_FORKNUM alongside MAIN_FORKNUM
        when file encryption is configured, calls
        FileEncryptionGenerateObjectKey() to mint the wrapped DEK, and
        writes it synchronously (smgrextend + smgrimmedsync).  A
        dedicated XLOG_SMGR_KEY_FORK_CREATE WAL record carries the
        BLCKSZ-sized page-formatted KEY block inline and its redo
        writes the block straight to disk on the standby (bypassing
        the buffer pool), so a subsequent encrypted-fork FPI that
        evicts an INIT/MAIN buffer can always read the wrapped DEK
        back via smgrread.  WAL is emitted for any non-TEMP relation
        (not just permanent ones): unlogged relations also need their
        KEY fork to survive crashes for reinit and standby promotion.
  * smgrDoPendingSyncs() (the wal_level=minimal end-of-xact path)
        skips KEY_FORKNUM: it has already been WAL-logged at create
        time, and falls through to smgrdosyncall() in the fsync path.
  * Relation-copy paths (CreateAndCopyRelationData,
        heapam_relation_copy_data, index_copy_data, tablecmds.c
        SET TABLESPACE) skip KEY_FORKNUM: RelationCreateStorage
        already minted a fresh destination DEK, so copying the source
        KEY fork bytes would clobber it.  The MAIN/INIT fork copy goes
        through the buffer manager, which transparently decrypts under
        the source DEK on read and re-encrypts under the destination
        DEK on write.
  * Unlogged-relation reinit (reinit.c) preserves KEY_FORKNUM
        alongside INIT_FORKNUM and re-encrypts INIT-fork pages into
        the MAIN fork during reset.  The INIT-fork ciphertext was
        produced with a binding context that included
        fork=INIT_FORKNUM; a raw byte copy into MAIN would not
        decrypt later (the read path supplies fork=MAIN_FORKNUM).
        reencrypt_init_segment() decrypts each block of the INIT
        segment file using the relation's DEK and re-encrypts under
        the MAIN binding context, producing a MAIN segment file that
        the running cluster reads normally.  Plaintext never leaves
        backend memory; the DEK is unchanged.  Unencrypted clusters
        keep using the simple copy_file path.
  * Base backup (basebackup.c) includes KEY_FORKNUM in unlogged
        relations' backed-up forks, alongside INIT_FORKNUM.  A
        restored cluster needs the wrapped DEK to read pages written
        after restart.
  * The data-checksum worker (datachecksum_state.c) skips KEY_FORKNUM:
        the wrapped DEK is not a Page-formatted block by the standard
        catalog rules, and PageSetChecksum has nothing useful to do
        with it.

md.c wiring:

  * mdwritev encrypts each block from the buffer pool's plaintext page
        via FileEncryptionEncryptPage and points the iovec at a per-
        backend workspace.  The buffer pool's plaintext is never
        mutated.
  * mdextend uses the same workspace for the single-block case.
  * mdreadv decrypts each block in place after the read.  All-zero
        on-disk pages (left behind by mdzeroextend) are passed through
        unchanged so PageIsNew() recognises fresh pages.
  * mdstartreadv (asynchronous) pre-opens the encryption object in the
        issuing backend before submitting the AIO; decrypts each block
        in md_readv_complete.  The completion callback runs in
        PGAIO_HS_COMPLETED_IO/SHARED state and reads the iovec via
        pgaio_io_get_iovec(); pgaio_io_get_iovec's state assertion is
        relaxed accordingly to permit completion-time access.  The
        callback looks up SMgrRelation via smgropen() to reach the
        cached encryption state.
  * smgr_aio_reopen() pre-opens the per-relation encryption state in
        io workers, alongside opening the file descriptor.  io workers
        process completions in a critical section that can't tolerate
        the palloc / smgrread that FileEncryptionOpenObject would
        otherwise do on first sight of a relation; pre-opening before
        the IO dispatch makes the completion-time path a cache-hot
        lookup.
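
The all-zero pass-through on the read path is easy to get wrong, so here is a minimal sketch of the idea. The callback type, function names, and the toy XOR "cipher" are hypothetical stand-ins for decrypt_page_cb and the md.c code; BLCKSZ is assumed to be 8192.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default block size */

/* Stand-in for the module's page decryption callback. */
typedef void (*page_decrypt_fn) (uint8_t *page, uint32_t blocknum);

/* True if every byte of the block is zero (a never-written page). */
bool
page_is_all_zero(const uint8_t *page)
{
	for (size_t i = 0; i < BLCKSZ; i++)
		if (page[i] != 0)
			return false;
	return true;
}

/*
 * Decrypt a freshly read block in place, unless it is all zeroes.
 * Returns true if the block was decrypted, false if passed through.
 */
bool
decrypt_block_after_read(uint8_t *page, uint32_t blocknum,
						 page_decrypt_fn decrypt)
{
	if (page_is_all_zero(page))
		return false;			/* keep it zero so it still looks new */
	decrypt(page, blocknum);
	return true;
}

/* Toy cipher for demonstration only: XOR with a blocknum-derived byte. */
void
toy_xor_decrypt(uint8_t *page, uint32_t blocknum)
{
	uint8_t		pad = (uint8_t) (0xA5 ^ blocknum);

	for (size_t i = 0; i < BLCKSZ; i++)
		page[i] ^= pad;
}
```

Without the early return, a zero-extended page would be "decrypted" into garbage (or fail authentication under an AEAD module) instead of being recognised as new.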

The workspace is a single MAX_IO_COMBINE_LIMIT * BLCKSZ buffer per
backend, allocated once from TopMemoryContext.  At io_combine_limit
defaults that's ~256 KB per process.  mdinit() and
process_file_encryption_library() each call md_init_enc_workspace()
so the allocation happens up front, before any AIO completion callback
(which runs inside a critical section and can't allocate) fires.

md_fork_is_encrypted() is exported in md.h so smgr.c can consult it
from smgr_aio_reopen without duplicating the policy.

FSM and VM forks are exempt from encryption.  md_fork_is_encrypted()
returns false for them, so their pages travel to disk unchanged.
fp_nodes (FSM tree) and the VM bitmap continue to use the entire
BLCKSZ page.  The cost is a small metadata leak -- free-space
estimates and per-block all-visible bits -- that we accept in
exchange for keeping NodesPerPage / FSM_TREE_DEPTH compile-time
constants and avoiding the runtime parameterisation that would
otherwise cascade through fsmpage.c and visibilitymap.c.
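
The fork policy amounts to a very small predicate. The sketch below uses a hypothetical enum that mirrors upstream's fork ordering with KEY_FORKNUM appended as the fifth fork; the numeric value of the new fork and the function name are assumptions.

```c
#include <stdbool.h>

/* Hypothetical mirror of ForkNumber with the new KEY fork appended. */
typedef enum ForkNumberSketch
{
	SK_MAIN_FORKNUM = 0,
	SK_FSM_FORKNUM,
	SK_VISIBILITYMAP_FORKNUM,
	SK_INIT_FORKNUM,
	SK_KEY_FORKNUM
} ForkNumberSketch;

/*
 * Only MAIN and INIT blocks go through the module.  FSM and VM are
 * exempt by design, and the KEY fork holds the wrapped DEK itself.
 */
bool
fork_is_encrypted_sketch(ForkNumberSketch fork)
{
	return fork == SK_MAIN_FORKNUM || fork == SK_INIT_FORKNUM;
}
```

Keeping this in one exported function means the backend read/write paths, the AIO reopen path, and frontend tools can all agree on which forks to decrypt.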

Module contract: decrypt_page_cb must produce plaintext whose
trailing page_overhead_size bytes are zero when called for a relation
page (a no-op when page_overhead_size = 0).  The buffer manager's
PageIsVerifiedExtended computes pd_checksum over the full BLCKSZ; the
writer's plaintext trailer is zero (PageInit zeroes it; no AM ever
writes there because pd_special doesn't extend past the trailer
boundary), and the decrypt output must match for the checksum to
verify.

TAP tests exercising the full page-encryption path land here:

  * contrib/basic_file_encryption/t/004_relation_pages.pl verifies
    heap+btree round-trip across restart, that the on-disk heap fork
    doesn't contain plaintext, that FSM/VM forks remain plaintext
    (per md.c bypass), and that one-byte tampering of an encrypted
    page surfaces a tag-mismatch error.
  * contrib/basic_file_encryption/t/005_unlogged_and_backup.pl
    exercises unlogged-relation reset across crash recovery (KEY fork
    preserved by reinit; DEK still valid on the wiped-then-repopulated
    MAIN fork) and the base-backup path (KEY fork included for
    unlogged relations so a restored cluster can encrypt new writes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, each physical component
file of a BufFile is encrypted block-at-a-time.  The on-disk layout
becomes a sequence of fixed-size physical blocks:

  [ uint32 plaintext_len ][ uint32 ciphertext_len ][ ciphertext... ]

with each block occupying a BLCKSZ + BUFFILE_ENC_OVERHEAD slot.
BUFFILE_ENC_OVERHEAD (256 bytes) bounds the room available for the
8-byte length header plus the module's per-call overhead (IVs, auth
tags, wrapped data keys, ...); a module declaring more than
BUFFILE_ENC_OVERHEAD - 8 bytes of overhead per BLCKSZ plaintext cannot
be used for BufFile.

Fixing the physical slot size lets logical block N map to a known
physical offset, so seek and truncate work without an offset map.
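
The arithmetic behind the fixed slot size can be sketched as below. BUFFILE_ENC_OVERHEAD and the 8-byte header come from the description; the function names are hypothetical and BLCKSZ is assumed to be 8192.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192				/* assumed default block size */
#define BUFFILE_ENC_OVERHEAD 256
#define BUFFILE_ENC_HEADER 8	/* uint32 plaintext_len + uint32 ciphertext_len */
#define BUFFILE_ENC_SLOT (BLCKSZ + BUFFILE_ENC_OVERHEAD)

/* Physical file offset at which logical block N's slot begins. */
uint64_t
buffile_enc_slot_offset(uint64_t logical_block)
{
	return logical_block * (uint64_t) BUFFILE_ENC_SLOT;
}

/* Slot offset for the block containing a given logical byte offset. */
uint64_t
buffile_enc_slot_for_logical_offset(uint64_t logical_offset)
{
	return buffile_enc_slot_offset(logical_offset / BLCKSZ);
}

/* A module is usable for BufFile only if its trailer fits in the slot. */
bool
buffile_enc_module_fits(size_t module_overhead)
{
	return module_overhead <= BUFFILE_ENC_OVERHEAD - BUFFILE_ENC_HEADER;
}
```

Both reference modules fit comfortably: 96 and 136 bytes of per-call overhead against a limit of 248.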

When no module is configured, the encrypted code paths are gated on
FileEncryptionEnabled() and the on-disk format is byte-identical to
upstream.

Block-aligned writes are required by current callers (logtape,
gistbuildbuffers, ...), so the encrypted path Asserts that and rejects
mid-block writes.  Flushing a partial block other than at the file's
high-water mark would also break the layout: BufFileLogicalSize and
the read path credit every non-trailing block at full BLCKSZ, so a
partial mid-file block would silently misreport the logical size and
turn reads into early EOFs.  No current caller does this, but
BufFileDumpBuffer Asserts the high-water-mark invariant so a future
caller can't regress without noticing.

At EOF, BufFileLoadEncryptedBlock advances curOffset to the caller's
exact logical position rather than leaving it at the start of the
block it declined to load.  Without this, a follow-up read after an
EOF probe would reload the block and re-expose its tuples.

Four TAP tests cover the integration: a sort-spill round-trip and
tampered-decrypt corruption check under src/test/modules/
test_file_encryption (using the test module), plus apply-streaming
end-to-end coverage that drives the subscriber-side BufFile through
stream-segment reopen and savepoint truncation.  contrib/
basic_file_encryption/t/002_buffile.pl provides the same coverage
under the AES-256-GCM reference module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, ReorderBuffer spill files
are encrypted record-by-record.  Each spill file is a sequence of
variable-size records of the form:

  [ uint32 plaintext_size ][ uint32 ciphertext_size ][ ciphertext... ]

Variable-size records are fine here because the spill stream is
strictly sequential: restoring just walks records front to back, with
no random access.  The serialize and restore paths thread the file
path and file offset through every encrypt/decrypt call, so modules
that bind those into their per-call key/IV/MAC derivation have what
they need.
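
A sequential walk over such a record stream might look like the sketch below. The function names are hypothetical, the header is assumed to be stored in native byte order, and the consistency rule (ciphertext size equals plaintext size plus the module's overhead) is an assumption about what a size-mismatch check would test.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SPILL_HEADER_LEN 8		/* uint32 plaintext_size + uint32 ciphertext_size */

/*
 * Write a record header at buf + off and return the offset just past
 * the record.  The caller copies ciphertext_size bytes after the header.
 */
size_t
write_spill_header(uint8_t *buf, size_t off,
				   uint32_t plaintext_size, uint32_t ciphertext_size)
{
	memcpy(buf + off, &plaintext_size, 4);
	memcpy(buf + off + 4, &ciphertext_size, 4);
	return off + SPILL_HEADER_LEN + ciphertext_size;
}

/*
 * Walk records front to back.  Returns the record count, or -1 if a
 * record is truncated or its sizes disagree with the module overhead.
 */
int
count_spill_records(const uint8_t *buf, size_t len, size_t overhead)
{
	size_t		off = 0;
	int			nrecords = 0;

	while (off < len)
	{
		uint32_t	pt;
		uint32_t	ct;

		if (len - off < SPILL_HEADER_LEN)
			return -1;			/* truncated header */
		memcpy(&pt, buf + off, 4);
		memcpy(&ct, buf + off + 4, 4);

		if ((size_t) ct != (size_t) pt + overhead)
			return -1;			/* size mismatch */
		if (len - off - SPILL_HEADER_LEN < ct)
			return -1;			/* truncated body */

		off += SPILL_HEADER_LEN + ct;
		nrecords++;
	}
	return nrecords;
}
```

Since each record's file offset is simply the running `off`, the restore path always has the (path, file_offset) pair available to hand to the decrypt callback.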

When no module is configured, the encrypted code paths are gated on
FileEncryptionEnabled() and the on-disk format is byte-identical to
upstream.

Three new TAP tests cover this path: under src/test/modules/
test_file_encryption, an end-to-end round-trip via logical decoding
with forced spilling, and a corruption test that uses the test
module's tamper_mode GUC to confirm the size-mismatch check in the
restore loop fires.  contrib/basic_file_encryption/t/003_logical_
decoding.pl provides the same end-to-end coverage under the
AES-256-GCM reference module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second reference file_encryption_library module, built on OpenSSL's
SM4-CTR cipher and SM3 digest (both in the OpenSSL default provider on
modern builds; no extra provider load required).  Demonstrates that
the file-encryption API isn't coupled to one cipher choice: a different
algorithm pairing with different overhead sizes and a different MAC
scheme drops into the same callback shape and works without any
changes to the core or to other consumers.

Configuration: passed via file_encryption_config as a newline-separated
list of "key=value" entries that the module parses itself (no per-
module GUC namespace).  Recognised keys:

  * key       (required, 64 hex characters = 256-bit KEK)
  * provider  (optional, OpenSSL provider to load; default none)
  * cipher    (optional, default "SM4-CTR")
  * digest    (optional, default "SM3")

The KEK never encrypts user bytes directly; it only wraps data-
encryption keys (DEKs).
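
Parsing a newline-separated "key=value" blob needs very little code. The following is a sketch with hypothetical function names, not the module's parser; it does no whitespace trimming or duplicate-key detection.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Find "name=value" on its own line in config and copy value into out.
 * Returns false if the name is absent or the value does not fit.
 */
bool
config_lookup(const char *config, const char *name,
			  char *out, size_t out_cap)
{
	size_t		name_len = strlen(name);
	const char *line = config;

	while (line != NULL && *line != '\0')
	{
		const char *eol = strchr(line, '\n');
		size_t		line_len = eol ? (size_t) (eol - line) : strlen(line);

		if (line_len > name_len &&
			strncmp(line, name, name_len) == 0 &&
			line[name_len] == '=')
		{
			size_t		val_len = line_len - name_len - 1;

			if (val_len >= out_cap)
				return false;
			memcpy(out, line + name_len + 1, val_len);
			out[val_len] = '\0';
			return true;
		}
		line = eol ? eol + 1 : NULL;
	}
	return false;
}

/* Like config_lookup, but fall back to a default for optional keys. */
bool
config_lookup_or_default(const char *config, const char *name,
						 const char *dflt, char *out, size_t out_cap)
{
	if (config_lookup(config, name, out, out_cap))
		return true;
	if (strlen(dflt) >= out_cap)
		return false;
	strcpy(out, dflt);
	return true;
}
```

For example, looking up "cipher" in a blob that only sets "key" would yield the documented default "SM4-CTR" when that string is passed as the fallback.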

Record-stream encryption (BufFile, reorderbuffer spill):

  * Per-call: generate a fresh DEK and MAC key, encrypt the plaintext
    with the configured cipher (CTR-style streaming) under the DEK,
    HMAC the ciphertext under the MAC key.  Both keys are wrapped
    under the KEK -- cipher under wrap_iv, HMAC for integrity -- and
    stored in the trailer.
  * AAD = basename(path) || file_offset(be64), as in basic_file_encryption.
  * On-disk per call (136 bytes overhead):
        [ ciphertext N bytes ]
        [ data IV 16B ][ data tag 32B ]
        [ wrap IV 16B ][ wrapped (DEK || MAC) 32B ][ wrap tag 32B ]
        [ format 4B ][ pad 4B ]

Per-relation page encryption:

  * generate_object_key_cb: mints fresh (DEK, MAC key) at relation-
    create time, wraps both under the KEK with AAD = relNumber(be32),
    returns the wrapped bytes that go into KEY_FORKNUM block 0.
  * object_open_cb: unwraps the keys into a per-relation state struct
    the core caches on SMgrRelation.encryption_object_state.
  * encrypt_page_cb / decrypt_page_cb: encrypt body under DEK with a
    fresh per-page IV; HMAC binds AAD = fork(be32) || blocknum(be32).
  * Object-key wrap layout (84 bytes):
        [ wrap IV 16B ][ wrapped (DEK || MAC) 32B ]
        [ wrap tag 32B ][ format 4B ]
  * Per-page trailer (56 bytes):
        [ data IV 16B ][ HMAC tag 32B ][ format 4B ][ pad 4B ]

A 56-byte page_overhead_size (vs basic_file_encryption's 32) is what
proves the trailer-size mechanism is module-driven rather than
hardcoded: a cluster initialized with sm4_file_encryption gets
page_reserved_size = 56 in pg_control and routes all AM size limits,
PageInit arithmetic, and md.c trailers through that value.

Two TAP tests: BufFile record-stream round-trip and page-encryption
round-trip against real heap+btree relations.

Gated on --with-openssl in both the autoconf and meson build systems.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Teach pg_checksums to operate on clusters that were initialized with
--file-encryption-library.  At startup, after ReadControlFile, if
ControlFile->file_encryption_library is set, pg_checksums dlopens the
named module via src/common/file_encryption_load.c and runs its
startup_cb; per-relation DEKs are unwrapped lazily on first sight of
each relation by reading the relation's KEY fork (located via the
filesystem layout, format defined in common/file_encryption_keyblock.h)
and handing the wrapped bytes to the module's object_open_cb.

Each block read from a MAIN/INIT fork is routed through the module's
decrypt_page_cb before pd_checksum verification.  All-zero blocks (left
behind by mdzeroextend on the backend) are passed through unchanged so
PageIsNew() still recognises fresh pages.  FSM, VM, and KEY forks are
read as plaintext, matching the backend's md_fork_is_encrypted() bypass.

For --enable on an encrypted cluster: decrypt each block, set
pd_checksum on the plaintext, encrypt_page_cb to re-encrypt with the
same per-relation DEK, then write the ciphertext back.  KEY-fork blocks
are skipped on --enable -- their pd_checksum was set at relation-create
time by FileEncryptionGenerateObjectKey and is already correct, mirroring
the backend's data-checksum worker (datachecksum_state.c).

Module configuration:

  * --file-encryption-config=STRING (new CLI flag), or
  * PGFILEENCRYPTIONCONFIG environment variable (fallback).

pg_checksums is built with link_whole on libpgcommon + libpgport and
export_dynamic so the dlopen'd module can resolve palloc, pstrdup,
pg_strong_random, etc. against pg_checksums itself.

TAP test in src/bin/pg_checksums/t/003_encrypted.pl exercises:

  * --check against a freshly-initdb'd encrypted cluster (zero
    expected bad checksums)
  * the missing-config error path (with a useful hint)
  * the PGFILEENCRYPTIONCONFIG env var fallback
  * --enable on an encrypted cluster initdb'd with --no-data-checksums
    (decrypt -> set checksum -> re-encrypt -> write back), followed by
    a server restart that round-trips a payload through the rewritten
    pages, and a final --check that verifies the rewritten checksums.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>