File encryption modules poc #1
Open
AndersAstrand wants to merge 8 commits into
Introduce an infrastructure for loadable modules that encrypt files at
rest. Configuring an encryption library is an all-or-nothing choice:
when the library named in pg_control loads, every flavor of on-disk
I/O this series teaches the server to encrypt -- BufFile temp files,
logical-decoding spill files, and relation pages -- is routed through
the module. There is no partial mode.
A module returns a FileEncryptionCallbacks struct from
_PG_file_encryption_module_init. The struct describes two encryption
flows that share the module's per-process state:
* Record-stream encryption: encrypt_cb / decrypt_cb on a
(path, file_offset, plaintext) tuple. The module produces
ciphertext of length data_len + overhead_size with the overhead
at the tail. Used by callers that work on variable-length
records with no per-relation identity (BufFile, reorderbuffer
spill, both wired up in later commits). Per-call binding
context: (path, file_offset) -- modules mix it into key/IV/MAC
derivation however they see fit (AEAD AAD, HMAC input, XTS-
style tweak, ...) so substituting one record for another at
decrypt time fails.
* Per-relation page encryption: a five-callback group
(generate_object_key_cb, object_open_cb, object_close_cb,
encrypt_page_cb, decrypt_page_cb) plus page_overhead_size.
One DEK per relation, generated at relation-create time and
shared by all forks and segment files of the relation; pages
are encrypted BLCKSZ-in / BLCKSZ-out with the per-page overhead
at the tail. page_overhead_size may be zero -- modes like
AES-XTS, or a stream cipher with a deterministic IV derived
from the binding context, don't need a per-page trailer at
all. Per-page binding context: (fork, blocknum); the
relation's RelFileLocator is bound into the wrapped DEK at
object-open time.
All seven callbacks are required: the validator rejects any module
that ships fewer. Optional startup_cb / shutdown_cb hooks let the
module attach per-process state to FileEncryptionModuleState; the
state is allocated fresh in each process (including forked children),
so a module never observes a private_data pointer left behind by an
ancestor.
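A minimal sketch of the "all seven callbacks are required" rule, assuming a stand-in struct. The field names come from the text above, but `SketchCallbacks`, `sketch_cb`, `sketch_validate_callbacks` and the error strings are hypothetical, and the real callback signatures are not shown in this message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical placeholder type: the real callback signatures differ. */
typedef void (*sketch_cb) (void);

/* Stand-in for FileEncryptionCallbacks; only the field names are from the text. */
typedef struct SketchCallbacks
{
    /* record-stream flow */
    sketch_cb   encrypt_cb;
    sketch_cb   decrypt_cb;
    /* per-relation page flow */
    sketch_cb   generate_object_key_cb;
    sketch_cb   object_open_cb;
    sketch_cb   object_close_cb;
    sketch_cb   encrypt_page_cb;
    sketch_cb   decrypt_page_cb;
    /* optional per-process hooks, allowed to be NULL */
    sketch_cb   startup_cb;
    sketch_cb   shutdown_cb;
} SketchCallbacks;

/* Reject a module that ships fewer than the seven required callbacks. */
static bool
sketch_validate_callbacks(const SketchCallbacks *cb, const char **errmsg)
{
    assert(errmsg != NULL);

    if (cb == NULL)
    {
        *errmsg = "module returned no callbacks";
        return false;
    }
    if (cb->encrypt_cb == NULL || cb->decrypt_cb == NULL)
    {
        *errmsg = "module lacks record-stream callbacks";
        return false;
    }
    if (cb->generate_object_key_cb == NULL ||
        cb->object_open_cb == NULL ||
        cb->object_close_cb == NULL ||
        cb->encrypt_page_cb == NULL ||
        cb->decrypt_page_cb == NULL)
    {
        *errmsg = "module lacks page-encryption callbacks";
        return false;
    }
    *errmsg = NULL;
    return true;
}

/* Dummy callback used only to populate the struct in examples. */
static void
sketch_noop(void)
{
}
```

A struct with all seven flow callbacks set and both hooks NULL passes; clearing any one of the seven makes the check fail with a message.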
The ABI lives in src/include/common/file_encryption_module.h so the
same .so is loadable from both the backend and any libpgcommon-based
frontend tool. src/common/file_encryption_load.c gives frontend tools
a dlopen-and-init helper. The module's entry point is
bool _PG_file_encryption_module_init(const char *config,
const FileEncryptionCallbacks **out,
char **errmsg);
receiving an opaque, module-defined configuration string.
Encryption identity lives in pg_control:
* file_encryption_library is a NAMEDATALEN field stamped at initdb
time and immutable afterwards. Both backend startup and
frontend tools read the same field.
* page_reserved_size is stamped from the module's declared
page_overhead_size at initdb time (which may be zero).
* file_encryption_config is a PGC_POSTMASTER, GUC_SUPERUSER_ONLY,
GUC_NO_SHOW_ALL string GUC that holds the module-defined
configuration blob (typically secret key material).
Initdb learns --file-encryption-library=NAME and
--file-encryption-config=STRING; it forwards the library name to
bootstrap via -L and the config blob to every backend it spawns via
extra_options. Bootstrap parses -L, loads the module via
process_file_encryption_library() BEFORE BootStrapXLOG, and
BootStrapXLOG queries FileEncryptionPageReservedSize() to populate
pg_control's page_reserved_size. pg_controldata prints the new
fields. pg_upgrade refuses upgrades across encryption boundaries.
Module load-time validation requires the module's page_overhead_size
to match the cluster's page_reserved_size exactly: a mismatch means
the module was changed or replaced after initdb, which would silently
corrupt every page on the next write.
KEY_FORKNUM is added to ForkNumber/forkNames/MAX_FORKNUM as the fifth
fork; relations under file encryption will hold their wrapped DEK
there (the on-disk machinery for that fork lands with the page-
encryption commit).
SMgrRelationData gains a `void *encryption_object_state` field
(zeroed on smgropen() and released via the module's object_close_cb
from smgrdestroy()) where the page-encryption flow caches per-relation
state.
PG_CONTROL_VERSION is bumped because ControlFileData grows new fields.
A stub XOR-based test module lands under src/test/modules/
test_file_encryption. It implements all seven callbacks with
page_overhead_size = 0, so it doubles as the in-tree demonstration
that zero-overhead page encryption is a supported configuration. Its
record-stream consumers (BufFile, reorderbuffer spill) all arrive in
subsequent commits, so this commit only ships the framework and the
module-loading machinery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reference file_encryption_library module that ties both API flows
together with real cryptography.
Configuration: a single hex-encoded 256-bit key, passed verbatim via
the file_encryption_config GUC and forwarded by the loader to the
module's _PG_file_encryption_module_init. The key is used as a
key-encryption key (KEK), not directly to encrypt caller bytes. The
module parses the string itself (no per-module GUC namespace).
Record-stream encryption (BufFile, reorderbuffer spill):
* Per-call: generate a fresh 256-bit data-encryption key, encrypt
the plaintext with AES-256-GCM, wrap the DEK with AES-256-GCM
under the KEK, store the wrapped DEK alongside the ciphertext.
* AAD = basename(path) || file_offset(be64). Using only the
basename keeps CREATE DATABASE FILE_COPY and ALTER DATABASE SET
TABLESPACE working (both clone files into directories whose
leading path components change).
* On-disk per call (96 bytes overhead):
[ ciphertext N bytes ]
[ data IV 12B ][ data tag 16B ]
[ wrap IV 12B ][ wrapped data key 32B ]
[ wrap tag 16B ][ format 4B ][ pad 4B ]
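The record-stream AAD described above, basename(path) || file_offset(be64), could be assembled roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, only '/' is treated as a path separator, and the module's actual code may differ:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the final path component (assumes '/' separators only). */
static const char *
sketch_basename(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash ? slash + 1 : path;
}

/*
 * Write basename(path) || file_offset(be64) into buf.
 * Returns the AAD length, or 0 if buf is too small.
 */
static size_t
sketch_record_aad(const char *path, uint64_t file_offset,
                  unsigned char *buf, size_t buflen)
{
    const char *base = sketch_basename(path);
    size_t      n = strlen(base);
    int         i;

    assert(buf != NULL);
    if (n + 8 > buflen)
        return 0;

    memcpy(buf, base, n);
    /* big-endian: most significant byte first */
    for (i = 0; i < 8; i++)
        buf[n + i] = (unsigned char) (file_offset >> (56 - 8 * i));
    return n + 8;
}
```

Two paths that differ only in their leading directories produce the same AAD, which is the property that lets a file be cloned into another directory and still decrypt.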
Per-relation page encryption:
* generate_object_key_cb: mints a fresh DEK at relation-create time,
wraps it under the KEK with AAD = relNumber(be32), returns the
wrapped bytes that go into KEY_FORKNUM block 0.
* object_open_cb: unwraps the DEK into a per-relation state struct
the core caches on SMgrRelation.encryption_object_state.
* encrypt_page_cb / decrypt_page_cb: AES-256-GCM with a fresh
per-page IV. AAD = fork(be32) || blocknum(be32). The fork field
distinguishes MAIN block N from INIT block N so neither can be
substituted for the other on disk; reinit re-encrypts INIT bytes
under MAIN's AAD on unlogged-relation reset.
* On-disk per page (32 bytes trailer):
[ data IV 12B ][ data tag 16B ][ format 4B ]
The declared page_overhead_size of 32 becomes the cluster's
page_reserved_size when the module is loaded at initdb time via
--file-encryption-library=basic_file_encryption.
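As an illustration of the per-page binding context and trailer arithmetic above, the sketch below builds AAD = fork(be32) || blocknum(be32) and sums the trailer fields. The names are hypothetical, and the fork numbers used in the usage note are the existing MAIN_FORKNUM = 0 and INIT_FORKNUM = 3:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Trailer fields as listed above: IV, GCM tag, format word. */
enum
{
    SKETCH_IV_LEN = 12,
    SKETCH_TAG_LEN = 16,
    SKETCH_FORMAT_LEN = 4,
    SKETCH_TRAILER_LEN = SKETCH_IV_LEN + SKETCH_TAG_LEN + SKETCH_FORMAT_LEN,
    SKETCH_PAGE_AAD_LEN = 8
};

/* Store a 32-bit value most significant byte first. */
static void
sketch_put_be32(unsigned char *dst, uint32_t v)
{
    dst[0] = (unsigned char) (v >> 24);
    dst[1] = (unsigned char) (v >> 16);
    dst[2] = (unsigned char) (v >> 8);
    dst[3] = (unsigned char) v;
}

/* AAD = fork(be32) || blocknum(be32) */
static void
sketch_page_aad(uint32_t fork, uint32_t blocknum,
                unsigned char out[SKETCH_PAGE_AAD_LEN])
{
    assert(out != NULL);
    sketch_put_be32(out, fork);
    sketch_put_be32(out + 4, blocknum);
}
```

Because the fork number is part of the AAD, block N of the INIT fork and block N of the MAIN fork authenticate under different inputs, so one cannot stand in for the other.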
A TAP smoke test (001_synthetic.pl) initdb's a cluster with the
module configured, starts it, and round-trips a value through a
freshly created heap. initdb's bootstrap and post-bootstrap steps
exercise the full per-process module-state lifecycle (every catalog
write goes through encrypt_page_cb under a backend that has run the
module's startup_cb), so just standing the cluster up is meaningful
coverage; the SELECT then proves a page-encryption round-trip end to
end. The BufFile, reorderbuffer-spill, relation-page, and
unlogged-relation tests arrive with the subsequent commits that wire
each of those flows up.
Gated on --with-openssl in both the autoconf and meson build systems,
alongside pgcrypto.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Carve the cluster's page_reserved_size off the tail of every page: the
usable area shrinks from BLCKSZ to BLCKSZ - reserved, and pd_upper /
pd_special bounds are validated against that smaller window.
The trailer is owned by the smgr / encryption layer. In the plaintext
view it is always zero (PageInit zeroes it; no AM ever writes there
because pd_special never extends that far), so pd_checksum continues
to cover the full BLCKSZ without modification. When the smgr layer
encrypts a page on the way to disk (subsequent commit), it consumes
the trailer for the module's per-page metadata; on the way back, it
hands the buffer manager a plaintext page with a zero trailer so
pd_checksum verifies.
Page-header changes:
* PageInit shrinks usable area by GetPageReservedSize(), zeroes the
full page (including the trailer), and places pd_upper / pd_special
inside the usable area.
* PageGetUsableSize(page) returns BLCKSZ - reserved; PageGetSpecialSize
and PageValidateSpecialPointer use it.
* Six "corrupted page pointers" defensive checks in bufpage.c, plus
one in bufmask.c's mask_unused_space, tighten the pd_special upper
bound from BLCKSZ to BLCKSZ - reserved.
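The tightened bound can be pictured with a small stand-alone sketch. It assumes BLCKSZ = 8192 and uses a hypothetical three-field header, and the header size is passed in as a parameter; the checks in bufpage.c are only paraphrased here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192      /* assumed default block size */

/* Hypothetical subset of the page header: just the three pointers. */
typedef struct SketchPageHeader
{
    uint16_t    pd_lower;
    uint16_t    pd_upper;
    uint16_t    pd_special;
} SketchPageHeader;

/* Usable area: everything before the reserved trailer. */
static uint32_t
sketch_usable_size(uint32_t reserved)
{
    assert(reserved < SKETCH_BLCKSZ);
    return SKETCH_BLCKSZ - reserved;
}

/* Pointer sanity check with pd_special capped at BLCKSZ - reserved. */
static bool
sketch_pointers_ok(const SketchPageHeader *p, uint32_t reserved,
                   uint32_t header_size)
{
    uint32_t    usable = sketch_usable_size(reserved);

    return p->pd_lower >= header_size &&
        p->pd_lower <= p->pd_upper &&
        p->pd_upper <= p->pd_special &&
        p->pd_special <= usable;
}
```

With reserved = 0 a page whose pd_special equals BLCKSZ is accepted as before; with reserved = 32 the same page is rejected and pd_special may reach at most 8160.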
Header-exposed AM size macros stay BLCKSZ-based (so they remain
compile-time constants for static contexts and extensions) and each
gets a runtime-aware sibling that subtracts GetPageReservedSize():
* MaxHeapTupleSize -> MaxHeapTupleSizeForCluster()
* BTMaxItemSize -> BTMaxItemSizeForCluster()
* BTMaxItemSizeNoHeapTid -> BTMaxItemSizeNoHeapTidForCluster()
* GiSTPageSize -> GiSTPageSizeForCluster()
* GinMaxItemSize -> GinMaxItemSizeForCluster()
* GinDataPageMaxDataSize -> GinDataPageMaxDataSizeForCluster()
* GinListPageSize -> GinListPageSizeForCluster()
* SPGIST_PAGE_CAPACITY -> SpGistPageCapacityForCluster()
* REVMAP_CONTENT_SIZE -> RevmapContentSizeForCluster()
* REVMAP_PAGE_MAXITEMS -> RevmapPageMaxItemsForCluster()
HashMaxItemSize(page) is already runtime-bound on its page argument;
it just switches from PageGetPageSize to PageGetUsableSize, which is
not a compile-time-to-runtime shift.
File-local helpers (BrinMaxItemSize in brin_pageops.c, GIN_PAGE_FREESIZE
in ginfast.c) are mutated in place: they're not exported, so there is
no extension compatibility argument for keeping them compile-time.
Updated runtime call sites use the cluster-aware variants:
* heap insert (hio.c) and rewrite (rewriteheap.c) "row is too big"
rejects use the cluster value; the "nearly empty" threshold does
too
* btree insert/sort/dedup/utils fit checks and amcheck verification
* gin entry-page, data-page, posting-tree, and pending-list checks
* gist sortbuild's pageFreeSpace estimate and gistfitpage
* spgist insert/split and freespace decisions
* brin revmap layout (HEAPBLK_TO_REVMAP_BLK / _INDEX) and
pageinspect's brin_revmap_data
* planner's heap-tuple density estimate (plancat.c)
* planner's brin revmap-pages estimate (selfuncs.c)
* lazy vacuum's freespace recording (vacuumlazy.c)
* heapam_handler.c HEAP_USABLE_BYTES_PER_PAGE
* btree split-location and dedup space accounting in nbtsplitloc.c
and nbtdedup.c switch from PageGetPageSize to PageGetUsableSize
* bufmask.c mask_page_content stops at the trailer
Two header-exposed call sites intentionally keep the BLCKSZ-derived
upper bound:
* nbtree.c estnbtreeshared sizing -- wants the worst-case allocation
* nbtxlog.c REDO state->maxpostingsize -- conservatively larger than
primary's cap, must accommodate any posting list the primary built
Compile-time constants used as struct or stack-array dimensions
(MaxHeapTuplesPerPage, MaxTIDsPerBTreePage, etc.) are intentionally
left BLCKSZ-based; they only over-allocate by a handful of slots in
encrypted clusters, which is harmless.
FSM and VM forks are NOT covered: NodesPerPage and MAPSIZE are
deliberately left BLCKSZ-based, and md.c (next commit) will exempt
those forks from encryption. They carry only metadata, not user
data, and the bypass keeps the FSM tree-walking and VM bitmap
arithmetic free of runtime parameterization.
With reserved = 0 (default when no encryption is configured) the
on-disk format is byte-identical to upstream and the regression suite
passes unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, md.c routes every
MAIN_FORKNUM / INIT_FORKNUM block through the configured module on
read, write, and extend. The module produces ciphertext exactly
BLCKSZ bytes long; the cluster's page_reserved_size trailer (which
matches the module's declared page_overhead_size, possibly zero) is
its workspace for per-page metadata. The data-encryption key comes
from per-relation state cached on SMgrRelation (populated lazily on
first I/O via FileEncryptionOpenObject, which smgr-reads the
relation's KEY fork and asks the module to unwrap the DEK).
KEY-fork lifecycle:
* RelationCreateStorage() creates KEY_FORKNUM alongside MAIN_FORKNUM
when file encryption is configured, calls
FileEncryptionGenerateObjectKey() to mint the wrapped DEK, and
writes it synchronously (smgrextend + smgrimmedsync). A
dedicated XLOG_SMGR_KEY_FORK_CREATE WAL record carries the
BLCKSZ-sized page-formatted KEY block inline and its redo
writes the block straight to disk on the standby (bypassing
the buffer pool), so a subsequent encrypted-fork FPI that
evicts an INIT/MAIN buffer can always read the wrapped DEK
back via smgrread. WAL is emitted for any non-TEMP relation
(not just permanent ones): unlogged relations also need their
KEY fork to survive crashes for reinit and standby promotion.
* smgrDoPendingSyncs() (the wal_level=minimal end-of-xact path)
skips KEY_FORKNUM when WAL-logging relation contents, because the
fork was already WAL-logged at create time; in the fsync path the
fork still falls through to smgrdosyncall().
* Relation-copy paths (CreateAndCopyRelationData,
heapam_relation_copy_data, index_copy_data, tablecmds.c
SET TABLESPACE) skip KEY_FORKNUM: RelationCreateStorage
already minted a fresh destination DEK, so copying the source
KEY fork bytes would clobber it. The MAIN/INIT fork copy goes
through the buffer manager, which transparently decrypts under
the source DEK on read and re-encrypts under the destination
DEK on write.
* Unlogged-relation reinit (reinit.c) preserves KEY_FORKNUM
alongside INIT_FORKNUM and re-encrypts INIT-fork pages into
the MAIN fork during reset. The INIT-fork ciphertext was
produced with a binding context that included
fork=INIT_FORKNUM; a raw byte copy into MAIN would not
decrypt later (the read path supplies fork=MAIN_FORKNUM).
reencrypt_init_segment() decrypts each block of the INIT
segment file using the relation's DEK and re-encrypts under
the MAIN binding context, producing a MAIN segment file that
the running cluster reads normally. Plaintext never leaves
backend memory; the DEK is unchanged. Unencrypted clusters
keep using the simple copy_file path.
* Base backup (basebackup.c) includes KEY_FORKNUM in unlogged
relations' backed-up forks, alongside INIT_FORKNUM. A
restored cluster needs the wrapped DEK to read pages written
after restart.
* The data-checksum worker (datachecksum_state.c) skips KEY_FORKNUM:
the wrapped DEK is not a Page-formatted block by the standard
catalog rules, and PageSetChecksum has nothing useful to do
with it.
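The unlogged-relation reinit step above can be illustrated with a deliberately toy cipher. The XOR keystream below is NOT real cryptography and is not the in-tree module; it only mimics a cipher whose output depends on the (fork, blocknum) binding context, and every name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum
{
    SKETCH_MAIN_FORKNUM = 0,
    SKETCH_INIT_FORKNUM = 3
};

/*
 * Toy, self-inverse "cipher": XOR with a pad derived from the key and the
 * binding context.  Calling it twice with the same arguments restores buf.
 */
static void
sketch_toy_crypt(unsigned char *buf, size_t len, unsigned char key,
                 uint32_t fork, uint32_t blocknum)
{
    unsigned char pad = (unsigned char) (key ^ (fork * 31u) ^ (blocknum * 17u));
    size_t      i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] ^= (unsigned char) (pad + i);
}

/* Decrypt under the INIT binding, then encrypt under the MAIN binding. */
static void
sketch_reencrypt_init_block(unsigned char *buf, size_t len,
                            unsigned char key, uint32_t blocknum)
{
    sketch_toy_crypt(buf, len, key, SKETCH_INIT_FORKNUM, blocknum);
    sketch_toy_crypt(buf, len, key, SKETCH_MAIN_FORKNUM, blocknum);
}
```

A raw byte copy of the INIT ciphertext does not decrypt under the MAIN binding, while the re-encrypted block does; the key itself never changes, matching the "DEK is unchanged" note.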
md.c wiring:
* mdwritev encrypts each block from the buffer pool's plaintext page
via FileEncryptionEncryptPage and points the iovec at a per-
backend workspace. The buffer pool's plaintext is never
mutated.
* mdextend uses the same workspace for the single-block case.
* mdreadv decrypts each block in place after the read. All-zero
on-disk pages (left behind by mdzeroextend) are passed through
unchanged so PageIsNew() recognises fresh pages.
* mdstartreadv (asynchronous) pre-opens the encryption object in the
issuing backend before submitting the AIO; decrypts each block
in md_readv_complete. The completion callback runs in
PGAIO_HS_COMPLETED_IO/SHARED state and reads the iovec via
pgaio_io_get_iovec(); pgaio_io_get_iovec's state assertion is
relaxed accordingly to permit completion-time access. The
callback looks up SMgrRelation via smgropen() to reach the
cached encryption state.
* smgr_aio_reopen() pre-opens the per-relation encryption state in
io workers, alongside opening the file descriptor. io workers
process completions in a critical section that can't tolerate
the palloc / smgrread that FileEncryptionOpenObject would
otherwise do on first sight of a relation; pre-opening before
the IO dispatch makes the completion-time path a cache-hot
lookup.
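The all-zero passthrough on the read path could look roughly like this. It is a sketch only: the function names and the callback shape are hypothetical, and the XOR "decrypt" is a toy stand-in used to make the behaviour observable:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical decrypt hook: transforms the page in place. */
typedef bool (*sketch_decrypt_fn) (unsigned char *page, size_t len, void *arg);

/* True if every byte of the page is zero (a zero-extended block). */
static bool
sketch_page_is_all_zero(const unsigned char *page, size_t len)
{
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}

/* Decrypt a freshly read block in place, leaving all-zero blocks alone. */
static bool
sketch_read_block(unsigned char *page, size_t len,
                  sketch_decrypt_fn decrypt, void *arg)
{
    assert(page != NULL && decrypt != NULL);
    if (sketch_page_is_all_zero(page, len))
        return true;            /* new page: nothing was ever encrypted */
    return decrypt(page, len, arg);
}

/* Toy stand-in for a module's decrypt: XOR with 0x5A and count the call. */
static bool
sketch_xor_decrypt(unsigned char *page, size_t len, void *arg)
{
    int        *calls = arg;
    size_t      i;

    for (i = 0; i < len; i++)
        page[i] ^= 0x5A;
    (*calls)++;
    return true;
}
```

A zeroed block comes back untouched and the decrypt hook is never invoked, so a later new-page test still sees zeros; any other block is handed to the hook.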
The workspace is a single MAX_IO_COMBINE_LIMIT * BLCKSZ buffer per
backend, lazily allocated from TopMemoryContext. At io_combine_limit
defaults that's ~256 KB per process. mdinit() and
process_file_encryption_library() each call md_init_enc_workspace()
to allocate the workspace eagerly, before any AIO completion callback
(which runs inside a critical section and can't allocate) fires.
md_fork_is_encrypted() is exported in md.h so smgr.c can consult it
from smgr_aio_reopen without duplicating the policy.
FSM and VM forks are exempt from encryption. md_fork_is_encrypted()
returns false for them, so their pages travel to disk unchanged.
fp_nodes (FSM tree) and the VM bitmap continue to use the entire
BLCKSZ page. The cost is a small metadata leak -- free-space
estimates and per-block all-visible bits -- that we accept in
exchange for keeping NodesPerPage / FSM_TREE_DEPTH compile-time
constants and avoiding the runtime parameterisation that would
otherwise cascade through fsmpage.c and visibilitymap.c.
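The fork policy described here amounts to a small predicate. The sketch below is an assumption about its shape, not the md.c code: the enum mirrors the existing fork order with KEY as the fifth fork, and the `enabled` parameter stands in for whatever the real function consults:

```c
#include <assert.h>
#include <stdbool.h>

/* Fork numbers in the existing order, with KEY added as the fifth fork. */
typedef enum SketchForkNumber
{
    SKETCH_MAIN_FORKNUM = 0,
    SKETCH_FSM_FORKNUM,
    SKETCH_VISIBILITYMAP_FORKNUM,
    SKETCH_INIT_FORKNUM,
    SKETCH_KEY_FORKNUM
} SketchForkNumber;

/* Only MAIN and INIT pages pass through the module. */
static bool
sketch_fork_is_encrypted(bool enabled, SketchForkNumber fork)
{
    if (!enabled)
        return false;

    switch (fork)
    {
        case SKETCH_MAIN_FORKNUM:
        case SKETCH_INIT_FORKNUM:
            return true;
        case SKETCH_FSM_FORKNUM:
        case SKETCH_VISIBILITYMAP_FORKNUM:
        case SKETCH_KEY_FORKNUM:
            return false;
    }
    assert(!"unknown fork number");
    return false;
}
```

Keeping the decision in one predicate is what lets other callers share the policy instead of repeating the fork list.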
Module contract: decrypt_page_cb must produce plaintext whose
trailing page_overhead_size bytes are zero when called for a relation
page (a no-op when page_overhead_size = 0). The buffer manager's
PageIsVerifiedExtended computes pd_checksum over the full BLCKSZ; the
writer's plaintext trailer is zero (PageInit zeroes it; no AM ever
writes there because pd_special doesn't extend past the trailer
boundary), and the decrypt output must match for the checksum to
verify.
TAP tests exercising the full page-encryption path land here:
* contrib/basic_file_encryption/t/004_relation_pages.pl verifies
heap+btree round-trip across restart, that the on-disk heap fork
doesn't contain plaintext, that FSM/VM forks remain plaintext
(per md.c bypass), and that one-byte tampering of an encrypted
page surfaces a tag-mismatch error.
* contrib/basic_file_encryption/t/005_unlogged_and_backup.pl
exercises unlogged-relation reset across crash recovery (KEY fork
preserved by reinit; DEK still valid on the wiped-then-repopulated
MAIN fork) and the base-backup path (KEY fork included for
unlogged relations so a restored cluster can encrypt new writes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, each physical component
file of a BufFile is encrypted block-at-a-time. The on-disk layout
becomes a sequence of fixed-size physical blocks:
[ uint32 plaintext_len ][ uint32 ciphertext_len ][ ciphertext... ]
with each block occupying a BLCKSZ + BUFFILE_ENC_OVERHEAD slot.
BUFFILE_ENC_OVERHEAD (256 bytes) bounds the room available for the
8-byte length header plus the module's per-call overhead (IVs, auth
tags, wrapped data keys, ...); a module declaring more than
BUFFILE_ENC_OVERHEAD - 8 bytes of overhead per BLCKSZ plaintext cannot
be used for BufFile. Fixing the physical slot size lets logical block
N map to a known physical offset, so seek and truncate work without an
offset map.
When no module is configured, the encrypted code paths are gated on
FileEncryptionEnabled() and the on-disk format is byte-identical to
upstream.
Block-aligned writes are required by current callers (logtape,
gistbuildbuffers, ...), so the encrypted path Asserts that and rejects
mid-block writes. Flushing a partial block other than at the file's
high-water mark would also break the layout: BufFileLogicalSize and
the read path credit every non-trailing block at full BLCKSZ, so a
partial mid-file block would silently misreport the logical size and
turn reads into early EOFs. No current caller does this, but
BufFileDumpBuffer Asserts the high-water-mark invariant so a future
caller can't regress without noticing.
At EOF, BufFileLoadEncryptedBlock advances curOffset to the caller's
exact logical position rather than leaving it at the start of the
block it declined to load. Without this, a follow-up read after an
EOF probe would reload the block and re-expose its tuples.
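The fixed-slot arithmetic can be sketched as below, assuming BLCKSZ = 8192 and offsets measured within one component file. The macro and function names are hypothetical; only the 256-byte overhead bound and the 8-byte header come from the description:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLCKSZ               8192    /* assumed default block size */
#define SKETCH_BUFFILE_ENC_OVERHEAD 256     /* from the description */
#define SKETCH_HEADER_LEN           8       /* two uint32 length fields */
#define SKETCH_SLOT_SIZE (SKETCH_BLCKSZ + SKETCH_BUFFILE_ENC_OVERHEAD)

/* Logical block N starts at a fixed physical offset: no offset map needed. */
static uint64_t
sketch_physical_offset(uint64_t logical_block)
{
    return logical_block * SKETCH_SLOT_SIZE;
}

/* A module is usable only if its overhead fits beside the length header. */
static bool
sketch_module_fits(uint32_t module_overhead)
{
    return module_overhead <= SKETCH_BUFFILE_ENC_OVERHEAD - SKETCH_HEADER_LEN;
}

/*
 * Logical size when every non-trailing block counts as a full BLCKSZ and
 * only the last block may be partial.
 */
static uint64_t
sketch_logical_size(uint64_t nblocks, uint32_t last_plaintext_len)
{
    assert(last_plaintext_len <= SKETCH_BLCKSZ);
    if (nblocks == 0)
        return 0;
    return (nblocks - 1) * SKETCH_BLCKSZ + last_plaintext_len;
}
```

Under these assumptions both reference modules' record-stream overheads (96 and 136 bytes) fit within the 248 bytes left after the header, and the size formula shows why a partial block anywhere but the end would be miscounted.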
Four TAP tests cover the integration: a sort-spill round-trip and
tampered-decrypt corruption check under
src/test/modules/test_file_encryption (using the test module), plus
apply-streaming end-to-end coverage that drives the subscriber-side
BufFile through stream-segment reopen and savepoint truncation.
contrib/basic_file_encryption/t/002_buffile.pl provides the same
coverage under the AES-256-GCM reference module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When file_encryption_library is configured, ReorderBuffer spill files
are encrypted record-by-record. Each spill file is a sequence of
variable-size records of the form:
[ uint32 plaintext_size ][ uint32 ciphertext_size ][ ciphertext... ]
Variable-size records are fine here because the spill stream is
strictly sequential: restoring just walks records front to back, with
no random access.
The serialize and restore paths thread the file path and file offset
through every encrypt/decrypt call, so modules that bind those into
their per-call key/IV/MAC derivation have what they need.
When no module is configured, the encrypted code paths are gated on
FileEncryptionEnabled() and the on-disk format is byte-identical to
upstream.
Three new TAP tests cover this path: under
src/test/modules/test_file_encryption, an end-to-end round-trip via
logical decoding with forced spilling, and a corruption test that uses
the test module's tamper_mode GUC to confirm the size-mismatch check
in the restore loop fires.
contrib/basic_file_encryption/t/003_logical_decoding.pl provides the
same end-to-end coverage under the AES-256-GCM reference module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second reference file_encryption_library module, built on OpenSSL's
SM4-CTR cipher and SM3 digest (both in the OpenSSL default provider on
modern builds; no extra provider load required). Demonstrates that
the file-encryption API isn't coupled to one cipher choice: a different
algorithm pairing with different overhead sizes and a different MAC
scheme drops into the same callback shape and works without any
changes to the core or to other consumers.
Configuration: passed via file_encryption_config as a newline-separated
list of "key=value" entries that the module parses itself (no per-
module GUC namespace). Recognised keys:
* key (required, 64 hex characters = 256-bit KEK)
* provider (optional, OpenSSL provider to load; default none)
* cipher (optional, default "SM4-CTR")
* digest (optional, default "SM3")
The KEK never encrypts user bytes directly; it only wraps data-
encryption keys (DEKs).
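A newline-separated "key=value" blob of this kind could be parsed along the following lines. This is a sketch, not the module's parser: the function names are hypothetical, and details such as whitespace handling or duplicate keys are assumptions left out here:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up "key" in a newline-separated list of key=value entries and copy
 * the value into out.  Returns false if the key is absent or out is too
 * small.  The first matching entry wins.
 */
static bool
sketch_config_get(const char *config, const char *key,
                  char *out, size_t outlen)
{
    size_t      keylen = strlen(key);
    const char *line = config;

    assert(out != NULL);
    while (*line != '\0')
    {
        const char *eol = strchr(line, '\n');
        size_t      linelen = eol ? (size_t) (eol - line) : strlen(line);

        if (linelen > keylen &&
            strncmp(line, key, keylen) == 0 &&
            line[keylen] == '=')
        {
            size_t      vlen = linelen - keylen - 1;

            if (vlen + 1 > outlen)
                return false;
            memcpy(out, line + keylen + 1, vlen);
            out[vlen] = '\0';
            return true;
        }
        line += linelen;
        if (*line == '\n')
            line++;
    }
    return false;
}

/* A 256-bit KEK must be exactly 64 hexadecimal characters. */
static bool
sketch_is_hex_key(const char *s)
{
    size_t      i;

    if (strlen(s) != 64)
        return false;
    for (i = 0; i < 64; i++)
    {
        if (!isxdigit((unsigned char) s[i]))
            return false;
    }
    return true;
}
```

A caller would fetch `key` first and reject the config if it is missing or malformed, then fall back to the documented defaults for any optional entry that `sketch_config_get` does not find.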
Record-stream encryption (BufFile, reorderbuffer spill):
* Per-call: generate a fresh DEK and MAC key, encrypt the plaintext
with the configured cipher (CTR-style streaming) under the DEK,
HMAC the ciphertext under the MAC key. Both keys are wrapped
under the KEK -- cipher under wrap_iv, HMAC for integrity -- and
stored in the trailer.
* AAD = basename(path) || file_offset(be64), as in basic_file_encryption.
* On-disk per call (136 bytes overhead):
[ ciphertext N bytes ]
[ data IV 16B ][ data tag 32B ]
[ wrap IV 16B ][ wrapped (DEK || MAC) 32B ][ wrap tag 32B ]
[ format 4B ][ pad 4B ]
Per-relation page encryption:
* generate_object_key_cb: mints fresh (DEK, MAC key) at relation-
create time, wraps both under the KEK with AAD = relNumber(be32),
returns the wrapped bytes that go into KEY_FORKNUM block 0.
* object_open_cb: unwraps the keys into a per-relation state struct
the core caches on SMgrRelation.encryption_object_state.
* encrypt_page_cb / decrypt_page_cb: encrypt body under DEK with a
fresh per-page IV; HMAC binds AAD = fork(be32) || blocknum(be32).
* Object-key wrap layout (84 bytes):
[ wrap IV 16B ][ wrapped (DEK || MAC) 32B ]
[ wrap tag 32B ][ format 4B ]
* Per-page trailer (56 bytes):
[ data IV 16B ][ HMAC tag 32B ][ format 4B ][ pad 4B ]
A 56-byte page_overhead_size (vs basic_file_encryption's 32) is what
proves the trailer-size mechanism is module-driven rather than
hardcoded: a cluster initialized with sm4_file_encryption gets
page_reserved_size = 56 in pg_control and routes all AM size limits,
PageInit arithmetic, and md.c trailers through that value.
Two TAP tests: BufFile record-stream round-trip and page-encryption
round-trip against real heap+btree relations.
Gated on --with-openssl in both the autoconf and meson build systems.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Teach pg_checksums to operate on clusters that were initialized with
--file-encryption-library. At startup, after ReadControlFile, if
ControlFile->file_encryption_library is set, pg_checksums dlopens the
named module via src/common/file_encryption_load.c and runs its
startup_cb; per-relation DEKs are unwrapped lazily on first sight of
each relation by reading the relation's KEY fork (located via the
filesystem layout, format defined in common/file_encryption_keyblock.h)
and handing the wrapped bytes to the module's object_open_cb.
Each block read from a MAIN/INIT fork is routed through the module's
decrypt_page_cb before pd_checksum verification. All-zero blocks (left
behind by mdzeroextend on the backend) are passed through unchanged so
PageIsNew() still recognises fresh pages. FSM, VM, and KEY forks are
read as plaintext, matching the backend's md_fork_is_encrypted() bypass.
For --enable on an encrypted cluster: decrypt each block, set
pd_checksum on the plaintext, call encrypt_page_cb to re-encrypt under
the same per-relation DEK, then write the ciphertext back. KEY-fork blocks
are skipped on --enable -- their pd_checksum was set at relation-create
time by FileEncryptionGenerateObjectKey and is already correct, mirroring
the backend's data-checksum worker (datachecksum_state.c).
Module configuration:
* --file-encryption-config=STRING (new CLI flag), or
* PGFILEENCRYPTIONCONFIG environment variable (fallback).
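The precedence between the two sources could be resolved as in this sketch. The function name is hypothetical and the treatment of a NULL result as "missing config" is an assumption; only the flag-then-environment order comes from the list above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Prefer the value given on the command line; otherwise fall back to the
 * PGFILEENCRYPTIONCONFIG environment variable.  A NULL result means no
 * configuration was supplied and the caller should report an error.
 */
static const char *
sketch_resolve_config(const char *cli_value)
{
    if (cli_value != NULL)
        return cli_value;
    return getenv("PGFILEENCRYPTIONCONFIG");
}
```

Passing the parsed --file-encryption-config argument (or NULL when the flag was absent) yields the string to hand to the module's init function.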
pg_checksums is built with link_whole on libpgcommon + libpgport and
export_dynamic so the dlopen'd module can resolve palloc, pstrdup,
pg_strong_random, etc. against pg_checksums itself.
TAP test in src/bin/pg_checksums/t/003_encrypted.pl exercises:
* --check against a freshly-initdb'd encrypted cluster (zero
expected bad checksums)
* the missing-config error path (with a useful hint)
* the PGFILEENCRYPTIONCONFIG env var fallback
* --enable on an encrypted cluster initdb'd with --no-data-checksums
(decrypt -> set checksum -> re-encrypt -> write back), followed by
a server restart that round-trips a payload through the rewritten
pages, and a final --check that verifies the rewritten checksums.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This is a POC for introducing file encryption modules that can be used to encrypt files at rest. The modules are usable both from the backend and from frontend tools (pg_checksums is patched as an example).
All the code here is generated using Claude and Codex, so it should be considered throw-away. But I think the concept can work.
If an encryption module wants an SQL level configuration interface, that would have to be a separate module that modifies configuration the encryption module is able to read. It would have to store it somewhere readable from frontend tools.
The poc encrypts pages and temp files.
The encryption module is chosen at initdb time and all pages (including catalog) are encrypted. Each relation gets an extra fork called _key and each page has some reserved space for any authentication data the encryption module needs. The module itself decides the format of both of these.
The poc includes two encryption modules, and patches one frontend tool.
The encryption modules just use a raw secret key as configuration for now, but a real-world encryption module would need to get the key from elsewhere, or even multiple keys if we want different principal keys for different things or a deeper hierarchy of keys.