local chDB checkpoints for Maple’s embedded chDB store #129
Important
Solid draft — the backup/restore mechanics, manifest compat gating, and quarantine-before-delete ordering are well thought out. One real bug: --on-dirty-store is silently dropped when starting detached, so maple start --background --on-dirty-store=restore-checkpoint runs with the default wipe instead. Worth fixing before merge.
Reviewed changes — initial review of the new local-mode checkpoint/restore feature (ClickHouse-native BACKUP/RESTORE through embedded chDB), covering crash-safety of the destructive filesystem operations and the load-bearing ClickHouse contract assumptions.
- Checkpoint creation: `maple checkpoint` runs `BACKUP DATABASE default` against the running server, validates it by restoring into a sacrificial scratch chDB, writes `manifest.json`, then rotates `building` → `current` → `previous` via `promoteBuilding`.
- Manual restore: `maple restore --yes` restores `backups/current` into a sibling `data.restore-building`, quarantines the live dir, swaps the restored dir into place, and re-stamps the store marker.
- Dirty-store policy: `maple start --on-dirty-store=<wipe|fail|restore-checkpoint>` (default `wipe`); the new `restore-checkpoint` mode recovers from the last promoted checkpoint instead of wiping.
- Config plumbing: `--chdb-config-file` flows through `serve.ts` → `chdb.ts` as `--config-file=`; `Chdb.open` gains `configFile?` and `bootstrapSchema?`.
- Compat gating: checkpoints are rejected on `chdbVersion` or `schemaFingerprint` mismatch; new `checkpoints.test.ts` covers config XML, promotion rotation, and manifest compat (but not the destructive restore path or the backup-error detection).
⚠️ Restore and promotion have non-atomic windows that can strand the live store
The restore swap and the checkpoint promotion are each a sequence of independent rename/rm calls with no startup reconciliation, so a process kill (or host crash) at the wrong moment leaves on-disk state that no later maple start repairs automatically.
The two most consequential windows during restoreCheckpoint:
- A kill between `rename(dataDir → quarantine)` and `rename(restoreDir → dataDir)` leaves `dataDir` absent entirely. The next `maple start` sees no store, bootstraps an empty one, and the real data survives only in the `data.quarantine-<ts>` sibling: recoverable, but only by hand.
- A kill after the swap but before the marker re-stamp leaves the store described by the previous run's marker and open-sentinel siblings (they live beside `dataDir`, so the renames never touch them), reintroducing exactly the marker/store-skew the dirty-store path exists to handle.
The PR body already names this ("promotion … not perfectly transactional across every crash point"), so this is a flag for the design conversation rather than a request to make it bulletproof now.
Technical details
# Restore / promotion non-atomicity
## Affected sites
- `apps/cli/src/server/checkpoints.ts:310-317` — `restoreCheckpoint`: `rename(dataDir → quarantinePath)` then `rename(restoreDir → dataDir)` then `markStoreClosed` + marker write are four separate syscalls; a crash between renames leaves `dataDir` missing, and a crash before the marker write leaves the swapped-in store stamped by the *previous* marker/open-sentinel siblings (which live in `dirname(dataDir)` and are never moved by the renames).
- `apps/cli/src/server/checkpoints.ts:223-231` — `promoteBuilding`: `rm(previous)` → `rename(current → previous)` → `rename(building → current)`; a crash between the first two destroys the only existing rollback checkpoint before `current` is replaced.
## Required outcome
- A crash at any single point of restore/promote must leave the store either fully on the old state or fully on the new one, OR a startup reconciliation step must detect and repair stranded `building/` / `.restore-building` / partially-swapped states.
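To make the reconciliation option concrete, here is a minimal sketch of a startup pass that classifies stranded restore state. It assumes the sibling naming used in this PR (`<dataDir>.restore-building`, `<dataDir>.quarantine-<ts>`); `reconcileOnStart` and the `Reconciliation` type are hypothetical names, not code from the PR.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, rmSync } from "node:fs"
import { tmpdir } from "node:os"
import { basename, dirname, join } from "node:path"

// Hypothetical result type: what a startup pass found next to dataDir.
type Reconciliation =
  | { kind: "clean" }
  | { kind: "store-missing"; quarantines: string[] } // swap was interrupted
  | { kind: "stale-restore-building"; removed: string }

// Inspect the siblings of dataDir and classify stranded restore state.
export function reconcileOnStart(dataDir: string): Reconciliation {
  const parent = dirname(dataDir)
  const prefix = `${basename(dataDir)}.quarantine-`
  const quarantines = readdirSync(parent)
    .filter((name) => name.startsWith(prefix))
    .sort() // ISO-derived suffixes sort chronologically
    .map((name) => join(parent, name))

  if (!existsSync(dataDir) && quarantines.length > 0) {
    // Do not bootstrap an empty store over this; surface it to the operator.
    return { kind: "store-missing", quarantines }
  }
  const restoreDir = `${dataDir}.restore-building`
  if (existsSync(restoreDir)) {
    // dataDir is still in place, so a half-built restore dir is only debris.
    rmSync(restoreDir, { recursive: true, force: true })
    return { kind: "stale-restore-building", removed: restoreDir }
  }
  return { kind: "clean" }
}

// Usage: simulate a kill between the two renames (quarantine exists, dataDir does not).
const root = mkdtempSync(join(tmpdir(), "maple-reconcile-"))
mkdirSync(join(root, "data.quarantine-2024-01-01T00-00-00-000Z"))
const result = reconcileOnStart(join(root, "data"))
console.log(result.kind) // "store-missing"
```

Whether the pass should repair the missing-store case itself or only refuse to bootstrap is a design choice this sketch leaves open; it only reports.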
## Suggested approach (optional)
- Versioned checkpoint dirs + a single atomic pointer rename (write the new pointer to a temp file, then `rename` it over the live pointer) gives single-syscall promotion. For restore, re-stamping the marker *before* the final `rename(restoreDir → dataDir)` (i.e. write the marker into `restoreDir`'s sibling-equivalent, or stamp inside the dir) narrows the skew window. A `maple start` reconciliation pass that removes stray `building/`/`.restore-building` and detects a missing `dataDir` with a present quarantine sibling would close the "store vanished" case.
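A minimal sketch of the atomic-pointer idea, assuming a JSON pointer file named `current.json` in the checkpoint root. The `CheckpointPointer` shape, `publishPointer`, and `readPointer` are hypothetical; the PR does not define a pointer format.

```typescript
import { closeSync, fsyncSync, mkdtempSync, openSync, readFileSync, renameSync, writeSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

// Hypothetical pointer shape: the id of the checkpoint dir that is live.
interface CheckpointPointer {
  readonly id: string
}

// Publish a checkpoint by replacing the pointer file with a single rename.
export function publishPointer(root: string, pointer: CheckpointPointer): void {
  const livePath = join(root, "current.json")
  const tmpPath = `${livePath}.tmp-${process.pid}`
  const fd = openSync(tmpPath, "w")
  try {
    writeSync(fd, JSON.stringify(pointer))
    fsyncSync(fd) // flush the contents before the rename makes them visible
  } finally {
    closeSync(fd)
  }
  // A rename within one directory replaces the target atomically on POSIX filesystems.
  renameSync(tmpPath, livePath)
}

export function readPointer(root: string): CheckpointPointer {
  return JSON.parse(readFileSync(join(root, "current.json"), "utf8")) as CheckpointPointer
}

// Usage: readers see either the old id or the new one, never a partial file.
const pointerRoot = mkdtempSync(join(tmpdir(), "maple-pointer-"))
publishPointer(pointerRoot, { id: "ckpt-0001" })
publishPointer(pointerRoot, { id: "ckpt-0002" })
console.log(readPointer(pointerRoot).id) // "ckpt-0002"
```

With this layout, promotion never deletes or renames a checkpoint directory, so a crash leaves at worst an unreferenced directory or a stray temp file for a later cleanup pass.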
## Open questions for the human
- Is hardening this in-scope for this PR, or deferred to the follow-up the PR body already floats? The answer likely depends on whether `--on-dirty-store=restore-checkpoint` ships enabled-by-default or stays opt-in.
    String(port),
    "--data-dir",
    dataDir,
    ...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : []),
The detached re-exec's `childArgs` forwards `--chdb-config-file` and `--offline` but omits `--on-dirty-store`. So `maple start --background --on-dirty-store=restore-checkpoint` (or `=fail`) silently runs the spawned foreground process with the default `wipe`: the opposite of what the operator asked for, and on a dirty store that means the data is discarded instead of recovered.
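As an illustration of the fix, the sketch below builds the child argv with the policy always forwarded. The `DetachedStartOptions` shape, the `buildChildArgs` name, the `start` subcommand position, and the `--port` flag are assumptions made for the example; only `--data-dir`, `--chdb-config-file`, and `--on-dirty-store` are named in this thread.

```typescript
// Hypothetical option shape; field names mirror the flags discussed in this review.
type DirtyStorePolicy = "wipe" | "fail" | "restore-checkpoint"

interface DetachedStartOptions {
  readonly port: number
  readonly dataDir: string
  readonly chdbConfigFile?: string
  readonly onDirtyStore: DirtyStorePolicy
}

// Build the argv for the re-exec'd foreground process.
export function buildChildArgs(o: DetachedStartOptions): string[] {
  return [
    "start",
    "--port", // assumed flag name
    String(o.port),
    "--data-dir",
    o.dataDir,
    ...(o.chdbConfigFile ? ["--chdb-config-file", o.chdbConfigFile] : []),
    // Always forward the policy so the child never falls back to its own default.
    "--on-dirty-store",
    o.onDirtyStore,
  ]
}

// Usage: the operator's choice survives the detached re-exec.
const childArgs = buildChildArgs({
  port: 4000,
  dataDir: "/tmp/maple/data",
  onDirtyStore: "restore-checkpoint",
})
console.log(childArgs.join(" "))
```

Forwarding the flag unconditionally, rather than only when it differs from `wipe`, keeps the child's behavior independent of whatever default it would otherwise apply.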
Technical details
# `--on-dirty-store` dropped on detached re-exec
## Affected sites
- `apps/cli/src/commands/server.ts` `childArgs` (the `...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : [])` block) — `a.onDirtyStore` is never appended, so the re-exec'd foreground process falls back to `Flag.withDefault("wipe")`.
## Required outcome
- A detached start preserves the operator's `--on-dirty-store` choice in the spawned process.
## Suggested approach (optional)
- Thread `onDirtyStore` into `startDetached` (alongside `chdbConfigFile`) and append `"--on-dirty-store", onDirtyStore` to `childArgs`. A test asserting the constructed `childArgs` contains the flag would lock this down.

    const message = error instanceof Error ? error.message : String(error)
    if (
      message.includes("backups.allowed_disk") ||
      message.includes("INVALID_CONFIG_PARAMETER")
ℹ️ This branch keys the friendly "start with --chdb-config-file" hint off substring matches against chDB's raw error text, which isn't an authoritatively documented contract — INVALID_CONFIG_PARAMETER in particular is broad enough that an unrelated config error would produce a misleading hint, and if chDB surfaces a different string (e.g. an UNKNOWN_DISK-style message) for the missing-backup-config case the hint never fires. The fallback rethrow keeps this safe (worst case is a rawer message), so this is just a robustness note. A test exercising the actual missing-config error would confirm the match holds for the chDB build you bundle.
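For context on what the `--chdb-config-file` hint points operators toward, here is a minimal sketch that renders a backup-enabling config in the shape ClickHouse documents for `BACKUP ... TO Disk(...)`: a local disk under `storage_configuration` plus a `backups` section with `allowed_disk` and `allowed_path`. `renderBackupConfig`, the disk name, and the path are hypothetical, and the config XML this PR actually ships is not shown in the thread.

```typescript
// Escape the characters that would break XML text content.
const escapeXml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")

// Render a config that registers a local disk and allows BACKUP/RESTORE to use it.
export function renderBackupConfig(diskName: string, backupPath: string): string {
  // The disk name becomes an XML element name, so restrict it to safe characters.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(diskName)) {
    throw new Error(`invalid disk name: ${diskName}`)
  }
  const path = escapeXml(backupPath.endsWith("/") ? backupPath : `${backupPath}/`)
  return [
    "<clickhouse>",
    "  <storage_configuration>",
    "    <disks>",
    `      <${diskName}>`,
    "        <type>local</type>",
    `        <path>${path}</path>`,
    `      </${diskName}>`,
    "    </disks>",
    "  </storage_configuration>",
    "  <backups>",
    `    <allowed_disk>${diskName}</allowed_disk>`,
    `    <allowed_path>${path}</allowed_path>`,
    "  </backups>",
    "</clickhouse>",
  ].join("\n")
}

// Usage: write the result to a file and pass it via --chdb-config-file.
const backupConfigXml = renderBackupConfig("backups", "/var/lib/maple/backups")
console.log(backupConfigXml)
```

A test that starts chDB with and without such a file would also pin down the exact error text the substring match depends on.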
    export const previousDir = (dataDir: string): string => join(checkpointRoot(dataDir), "previous")
    const restoreBuildingDir = (dataDir: string): string => `${dataDir}.restore-building`
    const quarantineDir = (dataDir: string): string =>
      `${dataDir}.quarantine-${new Date().toISOString().replace(/[:.]/g, "-")}`
ℹ️ quarantineDir derives its suffix purely from an ISO timestamp at millisecond resolution, so two restores within the same millisecond (or a leftover quarantine dir from an identical timestamp) collide on the same path — the second rename(dataDir → quarantinePath) would fail or merge into an existing dir. Unlikely in practice; appending a short random/pid suffix would make it collision-proof.
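A minimal sketch of the suggested suffix, keeping the existing timestamp format and appending the pid plus a few random bytes. The optional `now` parameter is added here only to make the example deterministic to exercise; it is not part of the PR.

```typescript
import { randomBytes } from "node:crypto"

// Same timestamp prefix as the PR, plus pid and a random nonce to avoid collisions.
export const quarantineDir = (dataDir: string, now: Date = new Date()): string => {
  const stamp = now.toISOString().replace(/[:.]/g, "-")
  const nonce = randomBytes(4).toString("hex") // 8 hex characters
  return `${dataDir}.quarantine-${stamp}-${process.pid}-${nonce}`
}

// Usage: two calls in the same millisecond still yield distinct paths.
const sameInstant = new Date("2024-01-01T00:00:00.000Z")
const firstPath = quarantineDir("/tmp/maple/data", sameInstant)
const secondPath = quarantineDir("/tmp/maple/data", sameInstant)
console.log(firstPath !== secondPath)
```

Because the timestamp stays at the front of the suffix, quarantine directories still sort chronologically by name.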
First-hand proof that the immutable-checkpoint-source trajectory is viable, with fault injection at every phase. No blocker found, so fallback trajectories B and A were not evaluated.

Trajectory C replaces the mutable building/current/previous checkpoint design with immutable, addressable-by-id snapshots + an atomic current.json pointer + pin files for anti-GC. The archive exporter never opens the live dataDir: it restores a checkpoint by ID into external scratch, exports from there, removes the scratch.

Proven (22/22 + 11/11 + concurrency):
- 7 capabilities: create/validate/promote/restore-external/overwrite-safe/pin-prevents-GC/export-from-scratch/scratch-removal-safe.
- Concurrency: C1(markerA)->pin->C2(markerB)->restore-C1(A not B)->restore-C2(both)->release-C1->GC. Snapshots are genuinely independent.
- Fault matrix: crash at backup/validate/manifest/pointer/gc/restore; current.json never advances to an incomplete snapshot; live store never mutated; GC idempotent and cleans debris.
- Pinning > exclusive lock: pin's failure mode is safe over-retention; stale lock's is unsafe deletion or stuck-lock.

Separation of concerns: the live-swap marker-rewrite issue is PR MapleTechLabs#129 (checkpoint recovery), not archive-branch. The archive path uses restoreSnapshotToScratch (external scratch only), never a live swap.

Adds reproducible prototype + fault-injection scripts in experiments/trajectory/. Does not begin the production implementation; recommends the checkpoint module move to immutable snapshots + current.json.
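The pin rule described in this commit message can be sketched as a pure function. This is an illustration only: `collectableSnapshots` and its inputs are hypothetical and do not come from the prototype in `experiments/trajectory/`.

```typescript
// Hypothetical inputs: snapshot ids found on disk, the id current.json points at,
// and the ids that currently have a pin file.
export function collectableSnapshots(
  snapshotIds: readonly string[],
  currentId: string | null,
  pinnedIds: ReadonlySet<string>,
): string[] {
  // A snapshot may be garbage-collected only if it is neither current nor pinned.
  return snapshotIds.filter((id) => id !== currentId && !pinnedIds.has(id))
}

// Usage: C1 is pinned by an in-flight export, C3 is current, so only C2 is collectable.
const collectable = collectableSnapshots(["C1", "C2", "C3"], "C3", new Set(["C1"]))
console.log(collectable) // [ "C2" ]
```

A pin left behind by a crashed exporter only keeps a snapshot on disk longer than needed, which is the safe over-retention failure mode the commit message contrasts with a stale lock.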
still looking at it btw. The changes requested happen to go well with a second change I want, at least for a fork if not for this project, so I am actually trying. What pullfrog wants is non-trivial in the details, but I think we've got it. I have to say though, that fully satisfying the direction desired here probably veers away from the design principles I anticipate when I read "Maple" as the name of the project.
Summary
This is a draft implementation of local chDB checkpoints for Maple’s embedded chDB store.
The goal is to give local-mode Maple a recoverable checkpoint path for dirty shutdowns without trying to copy a live chDB data directory. Instead, this uses chDB/ClickHouse-native `BACKUP DATABASE ... TO Disk(...)` and `RESTORE DATABASE ... FROM Disk(...)`, then validates the backup in a sacrificial chDB instance before promoting it as restorable.
This PR includes:
- `maple start --chdb-config-file <path>`
- `maple checkpoint`
- `maple restore --yes`
- `maple start --on-dirty-store=<wipe|fail|restore-checkpoint>`
Architecture
Maple still runs chDB embedded/in-process. This PR does not introduce a separate database server.
The checkpoint flow is:
1. `maple start` may be given a ClickHouse config file that enables backups.
2. `maple checkpoint` asks the running local server to execute a `BACKUP DATABASE ... TO Disk(...)` statement.
3. The CLI opens a separate scratch chDB instance and restores that backup using a `src` disk pointed at the live data dir.
4. If restore validation succeeds, Maple writes a manifest and promotes the checkpoint through `building` → `current` → `previous`.
5. `maple restore --yes` restores only from `backups/current`, into a sibling restore dir first. The dirty/original store is quarantined only after the restored store has validated.
Behavioral Cases
Normal checkpoint
If Maple is running with backup support enabled, `maple checkpoint` creates `backups/current`, validates it in scratch chDB, writes `manifest.json`, and keeps the prior checkpoint as `backups/previous`.
Missing backup config
If the server was not started with backup support, `maple checkpoint` now reports an actionable error instead of surfacing raw chDB `Code: 318`.
Manual restore
`maple restore --yes`:
- `backups/current/backup`
- `backups/current/manifest.json`
- `chdbVersion`
- `schemaFingerprint`
- `data.restore-building`
Dirty store on startup
Current default behavior is preserved: `--on-dirty-store=wipe`.
Additional options: `--on-dirty-store=fail` and `--on-dirty-store=restore-checkpoint`.
The restore mode is opt-in. It attempts to restore from the last promoted checkpoint instead of wiping the dirty store.
Incompatible store
Compatibility checks run before dirty-store recovery. If the existing store was written by an incompatible chDB build, Maple refuses before attempting recovery.
Bad or old checkpoint
A checkpoint with an incompatible chDB version or schema fingerprint is rejected before restore.
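A minimal sketch of that gate, assuming the manifest carries the two fields named in this PR (`chdbVersion`, `schemaFingerprint`) and that compatibility means exact equality with the running build. The `checkManifestCompat` name, the result type, and the sample values are hypothetical.

```typescript
// Fields named in the PR; the real manifest may carry more.
interface CheckpointManifest {
  readonly chdbVersion: string
  readonly schemaFingerprint: string
}

type CompatResult =
  | { readonly ok: true }
  | { readonly ok: false; readonly reason: string }

// Reject a checkpoint before any restore work if either field differs.
// Exact equality is an assumption; the PR may apply a looser rule.
export function checkManifestCompat(
  manifest: CheckpointManifest,
  runtime: CheckpointManifest,
): CompatResult {
  if (manifest.chdbVersion !== runtime.chdbVersion) {
    return {
      ok: false,
      reason: `chdbVersion mismatch: checkpoint ${manifest.chdbVersion}, runtime ${runtime.chdbVersion}`,
    }
  }
  if (manifest.schemaFingerprint !== runtime.schemaFingerprint) {
    return { ok: false, reason: "schemaFingerprint mismatch" }
  }
  return { ok: true }
}

// Usage with illustrative values: an older checkpoint is refused.
const runtimeInfo: CheckpointManifest = { chdbVersion: "1.0.0", schemaFingerprint: "abc123" }
const verdict = checkManifestCompat({ chdbVersion: "0.9.0", schemaFingerprint: "abc123" }, runtimeInfo)
console.log(verdict.ok) // false
```

Running this check before touching `data.restore-building` means a rejected checkpoint leaves the filesystem untouched.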
Validation
Local validation performed:
- `bunx oxfmt` on touched files
- `bunx oxlint` on touched files
- `bun --filter=@maple/cli typecheck`
- `bun test apps/cli/test`
- `maple start`
- `maple checkpoint`
- `maple restore --yes`
Key Adoption Questions
Should this be split?
This draft currently includes checkpoint creation and dirty-store restore behavior. We may want two PRs:
- `maple checkpoint`
- `maple restore` + dirty-store recovery
The second part changes startup recovery semantics, even though the new behavior is opt-in.
Should users manage the chDB config?
This PR exposes `--chdb-config-file`. That is useful and explicit, but probably not the final UX. Longer term, Maple may want to generate/manage the standard backup config automatically.
Disk and quota implications
A checkpoint is another copy of the local OLAP store, not a tiny journal. Keeping `current` and `previous` means operators should expect additional disk usage, potentially near 2x depending on compression and data shape.
This matters especially for containerized/local deployments with fixed volume quotas.
Restore-time expectations
Operators need to size their local store around acceptable recovery time.
If the checkpoint is 4 GB, 8 GB, 64 GB, etc., how long can startup recovery take? The answer should influence recommended volume size, checkpoint frequency, and whether automatic restore on startup is acceptable.
RAM pressure
This approach does not add a separate OLTP twin database, but checkpoint validation/restoration opens another chDB instance during the operation. We should confirm memory behavior at realistic store sizes, especially on 16 GB machines.
If we later choose an OTLP/OLTP mirror architecture, that would be a separate design with much larger steady-state disk and memory implications.
Promotion atomicity
The current `building`/`current`/`previous` promotion is pragmatic but not perfectly transactional across every crash point. A future hardening pass could use versioned checkpoint dirs plus an atomic pointer, or add startup reconciliation for stranded `building/`.
Should restore ever be automatic?
`--on-dirty-store=restore-checkpoint` is convenient, but it is still an implicit recovery action during startup. Maintainers may prefer manual `maple restore --yes` only, or may want restore mode gated behind an explicit operator flag.
Fixes #113