local chDB checkpoints for Maple’s embedded chDB store by robbiemu · Pull Request #129 · MapleTechLabs/maple

robbiemu · 2026-06-26T19:09:59Z

Summary

This is a draft implementation of local chDB checkpoints for Maple’s embedded chDB store.

The goal is to give local-mode Maple a recoverable checkpoint path for dirty shutdowns without trying to copy a live chDB data directory. Instead, this uses chDB/ClickHouse-native BACKUP DATABASE ... TO Disk(...) and RESTORE DATABASE ... FROM Disk(...), then validates the backup in a sacrificial chDB instance before promoting it as restorable.

This PR includes:

maple start --chdb-config-file <path>
maple checkpoint
maple restore --yes
maple start --on-dirty-store=<wipe|fail|restore-checkpoint>
checkpoint manifest validation for chDB version and schema fingerprint
unit coverage for checkpoint config generation, checkpoint promotion, and manifest compatibility

Architecture

Maple still runs chDB embedded/in-process. This PR does not introduce a separate database server.

The checkpoint flow is:

maple start may be given a ClickHouse config file that enables backups:

<clickhouse>
  <backups>
    <allowed_disk>default</allowed_disk>
    <allowed_path>backups</allowed_path>
  </backups>
</clickhouse>

maple checkpoint asks the running local server to execute:

BACKUP DATABASE default TO Disk('default', 'backups/building/backup')

The CLI opens a separate scratch chDB instance and restores that backup using a src disk pointed at the live data dir.

If restore validation succeeds, Maple writes a manifest and promotes:

backups/current  -> backups/previous
backups/building -> backups/current

maple restore --yes restores only from backups/current, into a sibling restore dir first. The dirty/original store is quarantined only after the restored store has validated.
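
As a rough illustration of the first two steps of the flow above, the backup allow-list and the BACKUP statement can both be produced by small pure functions. This is a sketch only: `renderBackupConfig`, `backupStatement`, and the escaping helpers are hypothetical names, not the functions this PR adds.

```typescript
// Hypothetical helpers for illustration; the PR's real function names are not shown in this thread.
const escapeXml = (value: string): string =>
	value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")

// Renders the <backups> allow-list from the Architecture section for a given disk and path.
const renderBackupConfig = (disk: string, path: string): string =>
	[
		"<clickhouse>",
		"  <backups>",
		`    <allowed_disk>${escapeXml(disk)}</allowed_disk>`,
		`    <allowed_path>${escapeXml(path)}</allowed_path>`,
		"  </backups>",
		"</clickhouse>",
		"",
	].join("\n")

// Quotes a value as a single-quoted SQL string literal (backslash and quote escaped).
const sqlString = (value: string): string =>
	`'${value.replace(/\\/g, "\\\\").replace(/'/g, "\\'")}'`

// Builds the BACKUP statement; the database name is emitted unquoted, so it is validated first.
const backupStatement = (database: string, disk: string, path: string): string => {
	if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(database)) {
		throw new Error(`refusing unquoted database name: ${database}`)
	}
	return `BACKUP DATABASE ${database} TO Disk(${sqlString(disk)}, ${sqlString(path)})`
}
```

Calling `backupStatement("default", "default", "backups/building/backup")` yields the statement quoted in the flow above.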

Behavioral Cases

Normal checkpoint

If Maple is running with backup support enabled, maple checkpoint creates backups/current, validates it in scratch chDB, writes manifest.json, and keeps the prior checkpoint as backups/previous.

Missing backup config

If the server was not started with backup support, maple checkpoint now reports an actionable error instead of surfacing raw chDB Code: 318.

Manual restore

maple restore --yes:

requires backups/current/backup
reads backups/current/manifest.json
refuses mismatched chdbVersion
refuses mismatched schemaFingerprint
restores into data.restore-building
validates the restored store
quarantines the old data dir
moves the restored dir into place
rewrites Maple’s local store marker

Dirty store on startup

Current default behavior is preserved:

maple start --on-dirty-store=wipe

Additional options:

maple start --on-dirty-store=fail
maple start --on-dirty-store=restore-checkpoint

The restore mode is opt-in. It attempts to restore from the last promoted checkpoint instead of wiping the dirty store.
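
A minimal sketch of how the three policy values could be parsed while keeping `wipe` as the default. The function names are hypothetical, and the action descriptions paraphrase this PR text rather than the CLI's real messages.

```typescript
type DirtyStorePolicy = "wipe" | "fail" | "restore-checkpoint"

const DIRTY_STORE_POLICIES: readonly DirtyStorePolicy[] = ["wipe", "fail", "restore-checkpoint"]

// Parses the raw flag value; an absent flag preserves the current default (wipe).
const parseDirtyStorePolicy = (raw?: string): DirtyStorePolicy => {
	if (raw === undefined) return "wipe"
	const policy = DIRTY_STORE_POLICIES.find((candidate) => candidate === raw)
	if (policy === undefined) {
		throw new Error(`--on-dirty-store must be one of ${DIRTY_STORE_POLICIES.join("|")}, got "${raw}"`)
	}
	return policy
}

// Illustrative descriptions only; the wording is an assumption, not Maple's output.
const describeDirtyStoreAction = (policy: DirtyStorePolicy): string => {
	switch (policy) {
		case "wipe":
			return "wipe the dirty store"
		case "fail":
			return "refuse to start"
		case "restore-checkpoint":
			return "restore from the last promoted checkpoint"
	}
}
```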

Incompatible store

Compatibility checks run before dirty-store recovery. If the existing store was written by an incompatible chDB build, Maple refuses before attempting recovery.

Bad or old checkpoint

A checkpoint with an incompatible chDB version or schema fingerprint is rejected before restore.
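
The compatibility gate described here can be sketched as a pure comparison of the two manifest fields named in this PR, `chdbVersion` and `schemaFingerprint`. The interface and function names are hypothetical, and the real manifest may carry more fields.

```typescript
// Only the two fields named in the PR description; any others are unknown here.
interface CheckpointManifest {
	chdbVersion: string
	schemaFingerprint: string
}

interface CompatResult {
	ok: boolean
	reasons: string[]
}

// Compares a checkpoint manifest against the running build and collects every mismatch.
const checkManifestCompat = (manifest: CheckpointManifest, runtime: CheckpointManifest): CompatResult => {
	const reasons: string[] = []
	if (manifest.chdbVersion !== runtime.chdbVersion) {
		reasons.push(`chdbVersion ${manifest.chdbVersion} does not match ${runtime.chdbVersion}`)
	}
	if (manifest.schemaFingerprint !== runtime.schemaFingerprint) {
		reasons.push("schemaFingerprint does not match the current schema")
	}
	return { ok: reasons.length === 0, reasons }
}
```

Returning every reason instead of failing on the first lets the CLI tell the operator whether a checkpoint is stale because of a chDB upgrade, a schema change, or both.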

Validation

Local validation performed:

bunx oxfmt on touched files
bunx oxlint on touched files
bun --filter=@maple/cli typecheck
bun test apps/cli/test
native chDB smoke test:
- maple start
- maple checkpoint
- maple restore --yes
- restart from restored store

Key Adoption Questions

Should this be split?

This draft currently includes checkpoint creation and dirty-store restore behavior. We may want two PRs:

backup config + maple checkpoint
maple restore + dirty-store recovery

The second part changes startup recovery semantics, even though the new behavior is opt-in.

Should users manage the chDB config?

This PR exposes --chdb-config-file. That is useful and explicit, but probably not the final UX. Longer term, Maple may want to generate/manage the standard backup config automatically.

Disk and quota implications

A checkpoint is another copy of the local OLAP store, not a tiny journal. Keeping current and previous means operators should expect additional disk usage, potentially near 2x depending on compression and data shape.

This matters especially for containerized/local deployments with fixed volume quotas.
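
For quota planning, a back-of-the-envelope estimate may help. The sketch below is an assumption-heavy model, not a measurement: `backupRatio` (backup size relative to the live store) and `inFlightCheckpoints` (a `building/` copy that exists alongside the retained ones while a checkpoint is being created) are hypothetical parameters an operator would have to measure for their own data.

```typescript
interface FootprintInput {
	liveBytes: number // size of the live chDB data dir
	backupRatio: number // assumed backup size / live size; depends on compression and data shape
	retainedCheckpoints: number // e.g. 2 for current + previous
	inFlightCheckpoints: number // e.g. 1 while building/ exists during `maple checkpoint`
}

// Peak bytes = live store + every checkpoint copy that can exist at the same time.
const estimatePeakBytes = (input: FootprintInput): number =>
	input.liveBytes +
	input.liveBytes * input.backupRatio * (input.retainedCheckpoints + input.inFlightCheckpoints)
```

The result is only as good as the measured ratio; it also ignores any space the scratch validation instance uses.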

Restore-time expectations

Operators need to size their local store around acceptable recovery time.

If the checkpoint is 4 GB, 8 GB, 64 GB, etc., how long can startup recovery take? The answer should influence recommended volume size, checkpoint frequency, and whether automatic restore on startup is acceptable.

RAM pressure

This approach does not add a separate OLTP twin database, but checkpoint validation/restoration opens another chDB instance during the operation. We should confirm memory behavior at realistic store sizes, especially on 16 GB machines.

If we later choose an OTLP/OLTP mirror architecture, that would be a separate design with much larger steady-state disk and memory implications.

Promotion atomicity

The current building/current/previous promotion is pragmatic but not perfectly transactional across every crash point. A future hardening pass could use versioned checkpoint dirs plus an atomic pointer, or add startup reconciliation for stranded building/.
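
The "atomic pointer" idea mentioned above could look roughly like the following: write the new pointer to a temp file, flush it, then rename it over the live pointer in a single call. This is a sketch under assumptions; the file name `current.json`, its `{ checkpointId }` shape, and the function names are hypothetical and not part of this PR.

```typescript
import { closeSync, fsyncSync, mkdtempSync, openSync, readFileSync, renameSync, writeFileSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

// Hypothetical pointer file name and shape; not taken from the PR's code.
const pointerPath = (root: string): string => join(root, "current.json")

const writePointerAtomically = (root: string, checkpointId: string): void => {
	const tmp = `${pointerPath(root)}.tmp-${process.pid}`
	const fd = openSync(tmp, "w")
	try {
		writeFileSync(fd, `${JSON.stringify({ checkpointId })}\n`)
		fsyncSync(fd) // flush the temp file before it becomes visible under the live name
	} finally {
		closeSync(fd)
	}
	// Readers see either the old pointer or the new one, never a partial file.
	renameSync(tmp, pointerPath(root))
}

const readPointer = (root: string): string => {
	const parsed: { checkpointId: string } = JSON.parse(readFileSync(pointerPath(root), "utf8"))
	return parsed.checkpointId
}

// Usage: promote two checkpoints in a throwaway directory.
const pointerDemoRoot = mkdtempSync(join(tmpdir(), "maple-pointer-"))
writePointerAtomically(pointerDemoRoot, "ckpt-1")
writePointerAtomically(pointerDemoRoot, "ckpt-2")
const livePointer = readPointer(pointerDemoRoot)
```

A crash before the rename leaves the old pointer in place plus a stray temp file that a later pass can delete. Durability across power loss would additionally need an fsync of the containing directory, which this sketch omits.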

Should restore ever be automatic?

--on-dirty-store=restore-checkpoint is convenient, but it is still an implicit recovery action during startup. Maintainers may prefer manual maple restore --yes only, or may want restore mode gated behind an explicit operator flag.

Fixes #113

pullfrog

Important

Solid draft — the backup/restore mechanics, manifest compat gating, and quarantine-before-delete ordering are well thought out. One real bug: --on-dirty-store is silently dropped when starting detached, so maple start --background --on-dirty-store=restore-checkpoint runs with the default wipe instead. Worth fixing before merge.

Reviewed changes — initial review of the new local-mode checkpoint/restore feature (ClickHouse-native BACKUP/RESTORE through embedded chDB), covering crash-safety of the destructive filesystem operations and the load-bearing ClickHouse contract assumptions.

Checkpoint creation — maple checkpoint runs BACKUP DATABASE default against the running server, validates it by restoring into a sacrificial scratch chDB, writes manifest.json, then rotates building→current→previous via promoteBuilding.
Manual restore — maple restore --yes restores backups/current into a sibling data.restore-building, quarantines the live dir, swaps the restored dir into place, and re-stamps the store marker.
Dirty-store policy — maple start --on-dirty-store=<wipe|fail|restore-checkpoint> (default wipe); the new restore-checkpoint mode recovers from the last promoted checkpoint instead of wiping.
Config plumbing — --chdb-config-file flows through serve.ts→chdb.ts as --config-file=; Chdb.open gains configFile? and bootstrapSchema?.
Compat gating — checkpoints are rejected on chdbVersion or schemaFingerprint mismatch; new checkpoints.test.ts covers config XML, promotion rotation, and manifest compat (but not the destructive restore path or the backup-error detection).

⚠️ Restore and promotion have non-atomic windows that can strand the live store

The restore swap and the checkpoint promotion are each a sequence of independent rename/rm calls with no startup reconciliation, so a process kill (or host crash) at the wrong moment leaves on-disk state that no later maple start repairs automatically.

The two most consequential windows during restoreCheckpoint:

A kill between rename(dataDir → quarantine) and rename(restoreDir → dataDir) leaves dataDir absent entirely. The next maple start sees no store, bootstraps an empty one, and the real data survives only in the data.quarantine-<ts> sibling — recoverable, but only by hand.
A kill after the swap but before the marker re-stamp leaves the store described by the previous run's marker and open-sentinel siblings (they live beside dataDir, so the renames never touch them), reintroducing exactly the marker/store-skew the dirty-store path exists to handle.

The PR body already names this ("promotion … not perfectly transactional across every crash point"), so this is a flag for the design conversation rather than a request to make it bulletproof now.

Technical details

# Restore / promotion non-atomicity

## Affected sites
- `apps/cli/src/server/checkpoints.ts:310-317` — `restoreCheckpoint`: `rename(dataDir → quarantinePath)` then `rename(restoreDir → dataDir)` then `markStoreClosed` + marker write are four separate syscalls; a crash between renames leaves `dataDir` missing, and a crash before the marker write leaves the swapped-in store stamped by the *previous* marker/open-sentinel siblings (which live in `dirname(dataDir)` and are never moved by the renames).
- `apps/cli/src/server/checkpoints.ts:223-231` — `promoteBuilding`: `rm(previous)` → `rename(current → previous)` → `rename(building → current)`; a crash between the first two destroys the only existing rollback checkpoint before `current` is replaced.

## Required outcome
- A crash at any single point of restore/promote must leave the store either fully on the old state or fully on the new one, OR a startup reconciliation step must detect and repair stranded `building/` / `.restore-building` / partially-swapped states.

## Suggested approach (optional)
- Versioned checkpoint dirs + a single atomic pointer rename (write the new pointer to a temp file, then `rename` it over the live pointer) gives single-syscall promotion. For restore, re-stamping the marker *before* the final `rename(restoreDir → dataDir)` (i.e. write the marker into `restoreDir`'s sibling-equivalent, or stamp inside the dir) narrows the skew window. A `maple start` reconciliation pass that removes stray `building/`/`.restore-building` and detects a missing `dataDir` with a present quarantine sibling would close the "store vanished" case.

## Open questions for the human
- Is hardening this in-scope for this PR, or deferred to the follow-up the PR body already floats? The answer likely depends on whether `--on-dirty-store=restore-checkpoint` ships enabled-by-default or stays opt-in.
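
One way to picture the startup reconciliation step suggested above is a read-only inspection of the siblings of `dataDir`. The sketch below only detects stranded state and repairs nothing. The `.restore-building` and `.quarantine-<ts>` suffixes come from the diff in this review; `inspectStoreSiblings` and `needsManualRecovery` are hypothetical names.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync } from "node:fs"
import { tmpdir } from "node:os"
import { basename, dirname, join } from "node:path"

interface StrandedState {
	dataDirMissing: boolean
	strayRestoreBuilding: boolean
	quarantines: string[] // full paths, oldest first (ISO-derived names sort chronologically)
}

// Looks beside dataDir for leftovers of an interrupted restore.
const inspectStoreSiblings = (dataDir: string): StrandedState => {
	const parent = dirname(dataDir)
	const name = basename(dataDir)
	const siblings = existsSync(parent) ? readdirSync(parent) : []
	return {
		dataDirMissing: !existsSync(dataDir),
		strayRestoreBuilding: siblings.includes(`${name}.restore-building`),
		quarantines: siblings
			.filter((entry) => entry.startsWith(`${name}.quarantine-`))
			.sort()
			.map((entry) => join(parent, entry)),
	}
}

// The "store vanished" case: no dataDir, but a quarantined copy still exists.
const needsManualRecovery = (state: StrandedState): boolean =>
	state.dataDirMissing && state.quarantines.length > 0

// Usage: simulate a kill between the two renames in a throwaway directory.
const reconcileDemoRoot = mkdtempSync(join(tmpdir(), "maple-reconcile-"))
mkdirSync(join(reconcileDemoRoot, "data.quarantine-2026-06-26T19-17-43-000Z"))
mkdirSync(join(reconcileDemoRoot, "data.restore-building"))
const strandedState = inspectStoreSiblings(join(reconcileDemoRoot, "data"))
```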


pullfrog · 2026-06-26T19:17:43Z

 			String(port),
 			"--data-dir",
 			dataDir,
+			...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : []),


⚠️ The detached child command forwards --chdb-config-file and --offline but omits --on-dirty-store. So maple start --background --on-dirty-store=restore-checkpoint (or =fail) silently runs the spawned foreground process with the default wipe — the opposite of what the operator asked for, and on a dirty store that means the data is discarded instead of recovered.

Technical details

# `--on-dirty-store` dropped on detached re-exec

## Affected sites
- `apps/cli/src/commands/server.ts` `childArgs` (the `...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : [])` block): `a.onDirtyStore` is never appended, so the re-exec'd foreground process falls back to `Flag.withDefault("wipe")`.

## Required outcome
- A detached start preserves the operator's `--on-dirty-store` choice in the spawned process.

## Suggested approach (optional)
- Thread `onDirtyStore` into `startDetached` (alongside `chdbConfigFile`) and append `"--on-dirty-store", onDirtyStore` to `childArgs`. A test asserting the constructed `childArgs` contains the flag would lock this down.
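
A sketch of the suggested fix, with the argument list built by a pure function so a test can assert on it. The `DetachedStartOptions` shape, the `buildChildArgs` name, and the `--port` flag are assumptions (the diff hunk starts at `String(port)`, so the preceding flag is not visible); the other flags are the ones named in this review.

```typescript
type DirtyStorePolicy = "wipe" | "fail" | "restore-checkpoint"

// Hypothetical option bag for the detached start path.
interface DetachedStartOptions {
	port: number
	dataDir: string
	chdbConfigFile?: string
	offline?: boolean
	onDirtyStore: DirtyStorePolicy
}

// Builds the argv for the re-exec'd foreground process.
const buildChildArgs = (options: DetachedStartOptions): string[] => [
	"--port", // assumed flag name; not visible in the diff hunk
	String(options.port),
	"--data-dir",
	options.dataDir,
	...(options.chdbConfigFile ? ["--chdb-config-file", options.chdbConfigFile] : []),
	...(options.offline ? ["--offline"] : []),
	// Always forwarded, so the child never silently falls back to the default.
	"--on-dirty-store",
	options.onDirtyStore,
]
```

Forwarding the flag unconditionally, even when it equals the default, keeps the child's behavior independent of whatever default the parser uses in a later release.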

pullfrog · 2026-06-26T19:17:43Z

+				const message = error instanceof Error ? error.message : String(error)
+				if (
+					message.includes("backups.allowed_disk") ||
+					message.includes("INVALID_CONFIG_PARAMETER")


ℹ️ This branch keys the friendly "start with --chdb-config-file" hint off substring matches against chDB's raw error text, which isn't an authoritatively documented contract — INVALID_CONFIG_PARAMETER in particular is broad enough that an unrelated config error would produce a misleading hint, and if chDB surfaces a different string (e.g. an UNKNOWN_DISK-style message) for the missing-backup-config case the hint never fires. The fallback rethrow keeps this safe (worst case is a rawer message), so this is just a robustness note. A test exercising the actual missing-config error would confirm the match holds for the chDB build you bundle.
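
One way to make this match testable is to pull it into a named predicate. The sketch below also narrows the broad `INVALID_CONFIG_PARAMETER` case by requiring the message to mention backups as well. That narrowing is an assumption about chDB's error text, which this thread does not show, so it would need checking against the bundled build.

```typescript
const errorMessage = (error: unknown): string =>
	error instanceof Error ? error.message : String(error)

// Hypothetical predicate; the substrings are the ones used in the diff above.
const isMissingBackupConfigError = (error: unknown): boolean => {
	const message = errorMessage(error)
	if (message.includes("backups.allowed_disk")) return true
	// Assumed narrowing: only treat the generic code as a backup-config problem
	// when the message also mentions backups.
	return message.includes("INVALID_CONFIG_PARAMETER") && /backup/i.test(message)
}
```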

pullfrog · 2026-06-26T19:17:43Z

+export const previousDir = (dataDir: string): string => join(checkpointRoot(dataDir), "previous")
+const restoreBuildingDir = (dataDir: string): string => `${dataDir}.restore-building`
+const quarantineDir = (dataDir: string): string =>
+	`${dataDir}.quarantine-${new Date().toISOString().replace(/[:.]/g, "-")}`


ℹ️ quarantineDir derives its suffix purely from an ISO timestamp at millisecond resolution, so two restores within the same millisecond (or a leftover quarantine dir from an identical timestamp) collide on the same path — the second rename(dataDir → quarantinePath) would fail or merge into an existing dir. Unlikely in practice; appending a short random/pid suffix would make it collision-proof.
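
A minimal sketch of the suggested suffix, keeping the timestamp format from the diff and appending random bytes from `node:crypto`. The `now` parameter and the 4-byte suffix length are illustrative choices, not part of the PR.

```typescript
import { randomBytes } from "node:crypto"

// Same timestamp format as the diff, plus 8 hex chars so same-millisecond calls differ.
const quarantineDir = (dataDir: string, now: Date = new Date()): string => {
	const stamp = now.toISOString().replace(/[:.]/g, "-")
	const suffix = randomBytes(4).toString("hex")
	return `${dataDir}.quarantine-${stamp}-${suffix}`
}
```

The timestamp stays first, so quarantine directories still sort chronologically by name.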

First-hand proof that the immutable-checkpoint-source trajectory is viable, with fault injection at every phase. No blocker found, so fallback trajectories B and A were not evaluated.

Trajectory C replaces the mutable building/current/previous checkpoint design with immutable, addressable-by-id snapshots + an atomic current.json pointer + pin files for anti-GC. The archive exporter never opens the live dataDir: it restores a checkpoint by ID into external scratch, exports from there, removes the scratch.

Proven (22/22 + 11/11 + concurrency):
- 7 capabilities: create/validate/promote/restore-external/overwrite-safe/pin-prevents-GC/export-from-scratch/scratch-removal-safe.
- Concurrency: C1(markerA)->pin->C2(markerB)->restore-C1(A not B)->restore-C2(both)->release-C1->GC. Snapshots are genuinely independent.
- Fault matrix: crash at backup/validate/manifest/pointer/gc/restore; current.json never advances to an incomplete snapshot; live store never mutated; GC idempotent and cleans debris.
- Pinning > exclusive lock: pin's failure mode is safe over-retention; stale lock's is unsafe deletion or stuck-lock.

Separation of concerns: the live-swap marker-rewrite issue is PR MapleTechLabs#129 (checkpoint recovery), not archive-branch. The archive path uses restoreSnapshotToScratch (external scratch only), never a live swap.

Adds reproducible prototype + fault-injection scripts in experiments/trajectory/. Does not begin the production implementation; recommends the checkpoint module move to immutable snapshots + current.json.
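
The pin-prevents-GC property described above can be illustrated with a pure selection function: a snapshot is collectable only if it is neither the one `current.json` points at nor pinned. This is a sketch, not the prototype in experiments/trajectory/; `SnapshotState` and `collectableSnapshots` are hypothetical names.

```typescript
interface SnapshotState {
	snapshotIds: string[] // every snapshot directory found on disk
	currentId: string | undefined // the id the pointer names, if any
	pinnedIds: ReadonlySet<string> // ids with at least one pin file
}

// Returns the ids GC may delete. A stale pin only causes over-retention, never deletion.
const collectableSnapshots = (state: SnapshotState): string[] =>
	state.snapshotIds.filter((id) => id !== state.currentId && !state.pinnedIds.has(id))
```

With snapshots `c1`, `c2`, `c3`, pointer at `c3`, and a pin on `c1`, only `c2` is collectable; releasing the pin makes `c1` collectable on the next GC pass.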

robbiemu · 2026-06-30T16:29:08Z

Still looking at it, btw. The changes requested happen to go well with a second change I want, at least for a fork if not for this project, so I am actually trying. What pullfrog wants is non-trivial in the details, but I think we've got it.

I have to say, though, that fully satisfying the direction desired here probably veers away from the design principles I anticipate when I read "Maple" as the name of the project.

remote service administrator added 4 commits June 25, 2026 19:02

feat(cli): pass chDB config file to local server

31d633f

feat(cli): add local chDB checkpoint command

8be7e3e

feat(cli): restore dirty local store from checkpoint

19c0530

fix(cli): harden chDB checkpoint recovery

3fa344f

pullfrog Bot reviewed Jun 26, 2026


Merge branch 'main' into codex/chdb-checkpoints

bbe6be3

robbiemu wants to merge 5 commits into MapleTechLabs:main from robbiemu:codex/chdb-checkpoints

2 participants
