
local chDB checkpoints for Maple’s embedded chDB store#129

Open
robbiemu wants to merge 5 commits into MapleTechLabs:main from robbiemu:codex/chdb-checkpoints

robbiemu commented Jun 26, 2026


Summary

This is a draft implementation of local chDB checkpoints for Maple’s embedded chDB store.

The goal is to give local-mode Maple a recoverable checkpoint path for dirty shutdowns without trying to copy a live chDB data directory. Instead, this uses chDB/ClickHouse-native BACKUP DATABASE ... TO Disk(...) and RESTORE DATABASE ... FROM Disk(...), then validates the backup in a sacrificial chDB instance before promoting it as restorable.

This PR includes:

  • maple start --chdb-config-file <path>
  • maple checkpoint
  • maple restore --yes
  • maple start --on-dirty-store=<wipe|fail|restore-checkpoint>
  • checkpoint manifest validation for chDB version and schema fingerprint
  • unit coverage for checkpoint config generation, checkpoint promotion, and manifest compatibility

Architecture

Maple still runs chDB embedded/in-process. This PR does not introduce a separate database server.

The checkpoint flow is:

  1. maple start may be given a ClickHouse config file that enables backups:

    <clickhouse>
      <backups>
        <allowed_disk>default</allowed_disk>
        <allowed_path>backups</allowed_path>
      </backups>
    </clickhouse>
  2. maple checkpoint asks the running local server to execute:

    BACKUP DATABASE default TO Disk('default', 'backups/building/backup')
  3. The CLI opens a separate scratch chDB instance and restores that backup using a src disk pointed at the live data dir.

  4. If restore validation succeeds, Maple writes a manifest and promotes:

    backups/building -> backups/current
    backups/current  -> backups/previous
    
  5. maple restore --yes restores only from backups/current, into a sibling restore dir first. The dirty/original store is quarantined only after the restored store has validated.
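The rotation in step 4 can be sketched with plain directory renames. This is a minimal illustration, not the PR's `promoteBuilding`: the `promote` helper, the staging function, and the temp-directory layout are all hypothetical, and the sketch inherits the non-atomic windows of any multi-rename sequence.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readFileSync, renameSync, rmSync, writeFileSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

// Hypothetical helper (not the PR's promoteBuilding): rotates
// building -> current -> previous under a checkpoint root.
export function promote(root: string): void {
  const building = join(root, "building")
  const current = join(root, "current")
  const previous = join(root, "previous")
  if (!existsSync(building)) throw new Error("nothing to promote: building/ is missing")
  // Drop the oldest checkpoint first so the next rename has a free target.
  rmSync(previous, { recursive: true, force: true })
  if (existsSync(current)) renameSync(current, previous)
  renameSync(building, current)
}

// Usage: stage and promote twice in a throwaway directory.
const root = mkdtempSync(join(tmpdir(), "ckpt-"))
const stage = (label: string): void => {
  mkdirSync(join(root, "building"), { recursive: true })
  writeFileSync(join(root, "building", "manifest.json"), JSON.stringify({ label }))
}
const labelOf = (dir: string): string =>
  JSON.parse(readFileSync(join(root, dir, "manifest.json"), "utf8")).label

stage("first")
promote(root)
stage("second")
promote(root)
console.log(labelOf("current"), labelOf("previous")) // second first
```

After two promotions the newest checkpoint sits in `current` and the one before it in `previous`; a third promotion would discard the oldest.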

Behavioral Cases

Normal checkpoint

If Maple is running with backup support enabled, maple checkpoint creates backups/current, validates it in scratch chDB, writes manifest.json, and keeps the prior checkpoint as backups/previous.

Missing backup config

If the server was not started with backup support, maple checkpoint now reports an actionable error instead of surfacing raw chDB Code: 318.

Manual restore

maple restore --yes:

  • requires backups/current/backup
  • reads backups/current/manifest.json
  • refuses mismatched chdbVersion
  • refuses mismatched schemaFingerprint
  • restores into data.restore-building
  • validates the restored store
  • quarantines the old data dir
  • moves the restored dir into place
  • rewrites Maple’s local store marker

Dirty store on startup

Current default behavior is preserved:

maple start --on-dirty-store=wipe

Additional options:

maple start --on-dirty-store=fail
maple start --on-dirty-store=restore-checkpoint

The restore mode is opt-in. It attempts to restore from the last promoted checkpoint instead of wiping the dirty store.
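The three policies map naturally onto a small union type. The sketch below is plain TypeScript rather than the CLI framework the PR uses, and `parsePolicy` and `planRecovery` are hypothetical names. What happens when `restore-checkpoint` is requested and no promoted checkpoint exists is an assumption here (refuse to start), not something the PR description states.

```typescript
// The three values come from the PR description; helper names are hypothetical.
const POLICIES = ["wipe", "fail", "restore-checkpoint"] as const
type DirtyStorePolicy = (typeof POLICIES)[number]

export function parsePolicy(raw: string | undefined): DirtyStorePolicy {
  if (raw === undefined) return "wipe" // preserves the current default
  const match = POLICIES.find((p) => p === raw)
  if (match === undefined) {
    throw new Error(`--on-dirty-store must be one of ${POLICIES.join("|")}, got "${raw}"`)
  }
  return match
}

// Describes what startup would do for a dirty store; no recovery is performed here.
export function planRecovery(policy: DirtyStorePolicy, hasCheckpoint: boolean): string {
  switch (policy) {
    case "wipe":
      return "wipe the dirty store and start empty"
    case "fail":
      return "refuse to start"
    case "restore-checkpoint":
      // Assumption: without a promoted checkpoint, refusing is safer than wiping.
      return hasCheckpoint ? "restore from backups/current" : "refuse to start: no promoted checkpoint"
  }
}

console.log(planRecovery(parsePolicy("restore-checkpoint"), true))
```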

Incompatible store

Compatibility checks run before dirty-store recovery. If the existing store was written by an incompatible chDB build, Maple refuses before attempting recovery.

Bad or old checkpoint

A checkpoint with an incompatible chDB version or schema fingerprint is rejected before restore.
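A compatibility gate of this kind reduces to comparing two manifest fields against the running build. The field names `chdbVersion` and `schemaFingerprint` come from this description; the `Manifest` shape and the `checkCompat` helper are assumptions for illustration, and the values are placeholders rather than real version strings.

```typescript
// Only the two field names are taken from the PR; everything else is hypothetical.
interface Manifest {
  chdbVersion: string
  schemaFingerprint: string
}

// Returns a list of human-readable mismatches; an empty list means "restorable".
export function checkCompat(manifest: Manifest, runtime: Manifest): string[] {
  const problems: string[] = []
  if (manifest.chdbVersion !== runtime.chdbVersion) {
    problems.push(`chdbVersion mismatch: checkpoint ${manifest.chdbVersion}, runtime ${runtime.chdbVersion}`)
  }
  if (manifest.schemaFingerprint !== runtime.schemaFingerprint) {
    problems.push("schemaFingerprint mismatch: checkpoint was taken against a different schema")
  }
  return problems
}

// Placeholder values, not real chDB versions or fingerprints.
const runtime: Manifest = { chdbVersion: "version-A", schemaFingerprint: "fp-1" }
const stale: Manifest = { chdbVersion: "version-B", schemaFingerprint: "fp-1" }
console.log(checkCompat(stale, runtime))
```

Returning every mismatch instead of throwing on the first lets the CLI report both problems in one message.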

Validation

Local validation performed:

  • bunx oxfmt on touched files
  • bunx oxlint on touched files
  • bun --filter=@maple/cli typecheck
  • bun test apps/cli/test
  • native chDB smoke test:
    • maple start
    • maple checkpoint
    • maple restore --yes
    • restart from restored store

Key Adoption Questions

Should this be split?

This draft currently includes checkpoint creation and dirty-store restore behavior. We may want two PRs:

  1. backup config + maple checkpoint
  2. maple restore + dirty-store recovery

The second part changes startup recovery semantics, even though the new behavior is opt-in.

Should users manage the chDB config?

This PR exposes --chdb-config-file. That is useful and explicit, but probably not the final UX. Longer term, Maple may want to generate/manage the standard backup config automatically.
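If Maple generated the config itself, the generator could be as small as the sketch below. It reproduces the XML shown in the Architecture section; the `backupConfigXml` name, its defaults, and the escaping helper are assumptions, not code from this PR.

```typescript
// Escapes the three characters that would break element content.
const escapeXml = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;")

// Hypothetical generator for the backup-enabling config shown in the PR body.
export function backupConfigXml(allowedDisk = "default", allowedPath = "backups"): string {
  return [
    "<clickhouse>",
    "  <backups>",
    `    <allowed_disk>${escapeXml(allowedDisk)}</allowed_disk>`,
    `    <allowed_path>${escapeXml(allowedPath)}</allowed_path>`,
    "  </backups>",
    "</clickhouse>",
    "",
  ].join("\n")
}

const xml = backupConfigXml()
console.log(xml)
```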

Disk and quota implications

A checkpoint is another copy of the local OLAP store, not a tiny journal. Keeping current and previous means operators should expect additional disk usage, potentially near 2x depending on compression and data shape.

This matters especially for containerized/local deployments with fixed volume quotas.

Restore-time expectations

Operators need to size their local store around acceptable recovery time.

If the checkpoint is 4 GB, 8 GB, 64 GB, etc., how long can startup recovery take? The answer should influence recommended volume size, checkpoint frequency, and whether automatic restore on startup is acceptable.

RAM pressure

This approach does not add a separate OLTP twin database, but checkpoint validation/restoration opens another chDB instance during the operation. We should confirm memory behavior at realistic store sizes, especially on 16 GB machines.

If we later choose an OTLP/OLTP mirror architecture, that would be a separate design with much larger steady-state disk and memory implications.

Promotion atomicity

The current building/current/previous promotion is pragmatic but not perfectly transactional across every crash point. A future hardening pass could use versioned checkpoint dirs plus an atomic pointer, or add startup reconciliation for stranded building/.
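One way to picture the "atomic pointer" idea: checkpoint directories are never renamed, and a single small file names the active one. The sketch below writes the new pointer to a temp file, flushes it, then renames it over the live pointer. The `current.json` file name, the `{ id }` payload, and both helpers are hypothetical. The atomicity relies on POSIX `rename` semantics for two paths on the same filesystem.

```typescript
import { closeSync, fsyncSync, mkdtempSync, openSync, readFileSync, readdirSync, renameSync, writeSync } from "node:fs"
import { tmpdir } from "node:os"
import { join } from "node:path"

// Hypothetical layout: <root>/current.json names the active checkpoint id.
export function publishPointer(root: string, checkpointId: string): void {
  const tmp = join(root, `current.json.tmp-${process.pid}`)
  const fd = openSync(tmp, "w")
  try {
    writeSync(fd, JSON.stringify({ id: checkpointId }))
    fsyncSync(fd) // flush the contents before the rename makes them visible
  } finally {
    closeSync(fd)
  }
  // A reader sees either the old pointer or the new one, never a partial file.
  renameSync(tmp, join(root, "current.json"))
}

export function readPointer(root: string): string {
  return JSON.parse(readFileSync(join(root, "current.json"), "utf8")).id
}

// Usage with made-up checkpoint ids.
const root = mkdtempSync(join(tmpdir(), "pointer-"))
publishPointer(root, "ckpt-0001")
publishPointer(root, "ckpt-0002")
console.log(readPointer(root), readdirSync(root))
```

A crash before the rename leaves only a stray temp file, which a later cleanup pass can delete; the live pointer is untouched.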

Should restore ever be automatic?

--on-dirty-store=restore-checkpoint is convenient, but it is still an implicit recovery action during startup. Maintainers may prefer manual maple restore --yes only, or may want restore mode gated behind an explicit operator flag.

Fixes #113

@pullfrog pullfrog Bot left a comment

Contributor


Important

Solid draft — the backup/restore mechanics, manifest compat gating, and quarantine-before-delete ordering are well thought out. One real bug: --on-dirty-store is silently dropped when starting detached, so maple start --background --on-dirty-store=restore-checkpoint runs with the default wipe instead. Worth fixing before merge.

Reviewed changes — initial review of the new local-mode checkpoint/restore feature (ClickHouse-native BACKUP/RESTORE through embedded chDB), covering crash-safety of the destructive filesystem operations and the load-bearing ClickHouse contract assumptions.

  • Checkpoint creation: maple checkpoint runs BACKUP DATABASE default against the running server, validates it by restoring into a sacrificial scratch chDB, writes manifest.json, then rotates building → current → previous via promoteBuilding.
  • Manual restore: maple restore --yes restores backups/current into a sibling data.restore-building, quarantines the live dir, swaps the restored dir into place, and re-stamps the store marker.
  • Dirty-store policy: maple start --on-dirty-store=<wipe|fail|restore-checkpoint> (default wipe); the new restore-checkpoint mode recovers from the last promoted checkpoint instead of wiping.
  • Config plumbing: --chdb-config-file flows through serve.ts → chdb.ts as --config-file=; Chdb.open gains configFile? and bootstrapSchema?.
  • Compat gating: checkpoints are rejected on chdbVersion or schemaFingerprint mismatch; new checkpoints.test.ts covers config XML, promotion rotation, and manifest compat (but not the destructive restore path or the backup-error detection).

⚠️ Restore and promotion have non-atomic windows that can strand the live store

The restore swap and the checkpoint promotion are each a sequence of independent rename/rm calls with no startup reconciliation, so a process kill (or host crash) at the wrong moment leaves on-disk state that no later maple start repairs automatically.

The two most consequential windows during restoreCheckpoint:

  • A kill between rename(dataDir → quarantine) and rename(restoreDir → dataDir) leaves dataDir absent entirely. The next maple start sees no store, bootstraps an empty one, and the real data survives only in the data.quarantine-<ts> sibling — recoverable, but only by hand.
  • A kill after the swap but before the marker re-stamp leaves the store described by the previous run's marker and open-sentinel siblings (they live beside dataDir, so the renames never touch them), reintroducing exactly the marker/store-skew the dirty-store path exists to handle.

The PR body already names this ("promotion … not perfectly transactional across every crash point"), so this is a flag for the design conversation rather than a request to make it bulletproof now.

Technical details
# Restore / promotion non-atomicity

## Affected sites
- `apps/cli/src/server/checkpoints.ts:310-317` `restoreCheckpoint`: `rename(dataDir → quarantinePath)` then `rename(restoreDir → dataDir)` then `markStoreClosed` + marker write are four separate syscalls; a crash between renames leaves `dataDir` missing, and a crash before the marker write leaves the swapped-in store stamped by the *previous* marker/open-sentinel siblings (which live in `dirname(dataDir)` and are never moved by the renames).
- `apps/cli/src/server/checkpoints.ts:223-231` `promoteBuilding`: `rm(previous)` → `rename(current → previous)` → `rename(building → current)`; a crash between the first two destroys the only existing rollback checkpoint before `current` is replaced.

## Required outcome
- A crash at any single point of restore/promote must leave the store either fully on the old state or fully on the new one, OR a startup reconciliation step must detect and repair stranded `building/` / `.restore-building` / partially-swapped states.

## Suggested approach (optional)
- Versioned checkpoint dirs + a single atomic pointer rename (write the new pointer to a temp file, then `rename` it over the live pointer) gives single-syscall promotion. For restore, re-stamping the marker *before* the final `rename(restoreDir → dataDir)` (i.e. write the marker into `restoreDir`'s sibling-equivalent, or stamp inside the dir) narrows the skew window. A `maple start` reconciliation pass that removes stray `building/`/`.restore-building` and detects a missing `dataDir` with a present quarantine sibling would close the "store vanished" case.

## Open questions for the human
- Is hardening this in-scope for this PR, or deferred to the follow-up the PR body already floats? The answer likely depends on whether `--on-dirty-store=restore-checkpoint` ships enabled-by-default or stays opt-in.
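The reconciliation idea can be sketched as a read-only inspection that `maple start` might run before opening the store. The sibling naming (`.restore-building`, `.quarantine-<ts>`) follows the PR; the `inspectStore` helper, the `StoreState` type, and the example timestamp are hypothetical, and the sketch only classifies the on-disk state without repairing it.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, renameSync } from "node:fs"
import { tmpdir } from "node:os"
import { basename, dirname, join } from "node:path"

type StoreState =
  | { kind: "ok" }
  | { kind: "stray-restore-building"; path: string }
  | { kind: "missing-with-quarantine"; quarantine: string }
  | { kind: "missing" }

// Hypothetical startup check: classifies stranded restore states.
export function inspectStore(dataDir: string): StoreState {
  const parent = dirname(dataDir)
  const prefix = `${basename(dataDir)}.quarantine-`
  if (!existsSync(dataDir)) {
    // Fixed-width ISO-style suffixes sort chronologically, so the last entry is newest.
    const quarantined = readdirSync(parent).filter((name) => name.startsWith(prefix)).sort()
    const newest = quarantined[quarantined.length - 1]
    return newest === undefined
      ? { kind: "missing" }
      : { kind: "missing-with-quarantine", quarantine: join(parent, newest) }
  }
  const restoreDir = `${dataDir}.restore-building`
  return existsSync(restoreDir) ? { kind: "stray-restore-building", path: restoreDir } : { kind: "ok" }
}

// Usage: simulate a kill between the two renames of a restore.
const base = mkdtempSync(join(tmpdir(), "store-"))
const dataDir = join(base, "data")
mkdirSync(dataDir)
mkdirSync(`${dataDir}.restore-building`)
const before = inspectStore(dataDir)
renameSync(dataDir, `${dataDir}.quarantine-2026-06-27T00-00-00-000Z`) // example timestamp
const after = inspectStore(dataDir)
console.log(before.kind, after.kind)
```

Whether startup should then repair the state automatically or only report it is the same policy question the review raises.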


String(port),
"--data-dir",
dataDir,
...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : []),


⚠️ The detached child command forwards --chdb-config-file and --offline but omits --on-dirty-store. So maple start --background --on-dirty-store=restore-checkpoint (or =fail) silently runs the spawned foreground process with the default wipe — the opposite of what the operator asked for, and on a dirty store that means the data is discarded instead of recovered.

Technical details
# `--on-dirty-store` dropped on detached re-exec

## Affected sites
- `apps/cli/src/commands/server.ts` `childArgs` (the `...(chdbConfigFile ? ["--chdb-config-file", chdbConfigFile] : [])` block) — `a.onDirtyStore` is never appended, so the re-exec'd foreground process falls back to `Flag.withDefault("wipe")`.

## Required outcome
- A detached start preserves the operator's `--on-dirty-store` choice in the spawned process.

## Suggested approach (optional)
- Thread `onDirtyStore` into `startDetached` (alongside `chdbConfigFile`) and append `"--on-dirty-store", onDirtyStore` to `childArgs`. A test asserting the constructed `childArgs` contains the flag would lock this down.
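A sketch of the suggested fix, with the argument construction pulled into a pure function so it can be unit-tested. `buildChildArgs` and `DetachedOptions` are hypothetical stand-ins for the real code in `server.ts`; the `--port` flag name and the example values are assumptions, while the other flags mirror the quoted diff.

```typescript
interface DetachedOptions {
  port: number
  dataDir: string
  chdbConfigFile?: string
  onDirtyStore: "wipe" | "fail" | "restore-checkpoint"
}

// Hypothetical stand-in for the childArgs construction in server.ts.
export function buildChildArgs(o: DetachedOptions): string[] {
  return [
    "--port", // assumed flag name; the quoted diff only shows String(port)
    String(o.port),
    "--data-dir",
    o.dataDir,
    ...(o.chdbConfigFile ? ["--chdb-config-file", o.chdbConfigFile] : []),
    // Always forwarded, so the child never falls back to the default silently.
    "--on-dirty-store",
    o.onDirtyStore,
  ]
}

// Arbitrary example values.
const args = buildChildArgs({ port: 4000, dataDir: "/tmp/maple-data", onDirtyStore: "restore-checkpoint" })
console.log(args.join(" "))
```

Forwarding the flag unconditionally, even when it equals the default, keeps the child's behavior independent of any future change to that default.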

const message = error instanceof Error ? error.message : String(error)
if (
message.includes("backups.allowed_disk") ||
message.includes("INVALID_CONFIG_PARAMETER")


ℹ️ This branch keys the friendly "start with --chdb-config-file" hint off substring matches against chDB's raw error text, which isn't an authoritatively documented contract — INVALID_CONFIG_PARAMETER in particular is broad enough that an unrelated config error would produce a misleading hint, and if chDB surfaces a different string (e.g. an UNKNOWN_DISK-style message) for the missing-backup-config case the hint never fires. The fallback rethrow keeps this safe (worst case is a rawer message), so this is just a robustness note. A test exercising the actual missing-config error would confirm the match holds for the chDB build you bundle.
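One way to reduce false positives is to key the hint only on the more specific substring and pass everything else through unchanged. This is a sketch under a stated assumption: the `backups.allowed_disk` substring is taken from the quoted diff, and whether the bundled chDB build actually emits it for the missing-config case is exactly what the suggested test would need to confirm. `explainBackupError` is a hypothetical name.

```typescript
// Hypothetical helper: maps a raw backup failure to a friendlier error when,
// and only when, the message names the backups.allowed_disk setting.
export function explainBackupError(error: unknown): Error {
  const message = error instanceof Error ? error.message : String(error)
  if (message.includes("backups.allowed_disk")) {
    return new Error(
      "Backups are not enabled for this server. Restart with " +
        "`maple start --chdb-config-file <path>`.\nOriginal error: " + message,
    )
  }
  // Unrelated failures keep their original identity and text.
  return error instanceof Error ? error : new Error(message)
}

// Made-up messages for illustration; not real chDB output.
const unrelated = new Error("INVALID_CONFIG_PARAMETER: something else entirely")
console.log(explainBackupError(unrelated) === unrelated)
console.log(explainBackupError(new Error("backups.allowed_disk is not set")).message)
```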

export const previousDir = (dataDir: string): string => join(checkpointRoot(dataDir), "previous")
const restoreBuildingDir = (dataDir: string): string => `${dataDir}.restore-building`
const quarantineDir = (dataDir: string): string =>
`${dataDir}.quarantine-${new Date().toISOString().replace(/[:.]/g, "-")}`


ℹ️ quarantineDir derives its suffix purely from an ISO timestamp at millisecond resolution, so two restores within the same millisecond (or a leftover quarantine dir from an identical timestamp) collide on the same path. On POSIX the second rename(dataDir → quarantinePath) would then fail if the existing directory is non-empty, or replace it if it is empty. Unlikely in practice; appending a short random/pid suffix would make a collision practically impossible.
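A minimal sketch of the random-suffix variant, assuming Node's `crypto.randomBytes` is acceptable here. The injectable `now` parameter and the example path are additions for testability, not part of the PR.

```typescript
import { randomBytes } from "node:crypto"

// Timestamp keeps the name sortable and readable; the nonce separates
// restores that land on the same millisecond.
export const quarantineDir = (dataDir: string, now: Date = new Date()): string => {
  const stamp = now.toISOString().replace(/[:.]/g, "-")
  const nonce = randomBytes(8).toString("hex") // 64 random bits
  return `${dataDir}.quarantine-${stamp}-${nonce}`
}

// Two calls at the same instant still yield distinct paths (hypothetical data dir).
const instant = new Date(0)
const a = quarantineDir("/var/lib/maple/data", instant)
const b = quarantineDir("/var/lib/maple/data", instant)
console.log(a !== b)
```

Any startup code that lists quarantine siblings by prefix keeps working, since the suffix only extends the existing name.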

robbiemu added a commit to robbiemu/sakura-no-ki that referenced this pull request Jun 27, 2026
First-hand proof that the immutable-checkpoint-source trajectory is viable,
with fault injection at every phase. No blocker found, so fallback
trajectories B and A were not evaluated.

Trajectory C replaces the mutable building/current/previous checkpoint
design with immutable, addressable-by-id snapshots + an atomic current.json
pointer + pin files for anti-GC. The archive exporter never opens the live
dataDir: it restores a checkpoint by ID into external scratch, exports
from there, removes the scratch.

Proven (22/22 + 11/11 + concurrency):
- 7 capabilities: create/validate/promote/restore-external/overwrite-safe/
  pin-prevents-GC/export-from-scratch/scratch-removal-safe.
- Concurrency: C1(markerA)->pin->C2(markerB)->restore-C1(A not B)->
  restore-C2(both)->release-C1->GC. Snapshots are genuinely independent.
- Fault matrix: crash at backup/validate/manifest/pointer/gc/restore;
  current.json never advances to an incomplete snapshot; live store never
  mutated; GC idempotent and cleans debris.
- Pinning > exclusive lock: pin's failure mode is safe over-retention;
  stale lock's is unsafe deletion or stuck-lock.

Separation of concerns: the live-swap marker-rewrite issue is PR MapleTechLabs#129
(checkpoint recovery), not archive-branch. The archive path uses
restoreSnapshotToScratch (external scratch only), never a live swap.

Adds reproducible prototype + fault-injection scripts in
experiments/trajectory/. Does not begin the production implementation;
recommends the checkpoint module move to immutable snapshots + current.json.
@robbiemu

Author

still looking at it btw. The changes requested happen to go well with a second change I want, at least for a fork if not for this project, so I am actually trying. What pullfrog wants is non-trivial in the details, but I think we've got it.

I have to say though, that fully satisfying the direction desired here probably veers away from the design principles I anticipate when I read "Maple" as the name of the project.


Development

Successfully merging this pull request may close these issues.

Maple Local Mode Destructive Recovery After Unclean Shutdown

2 participants