docs(specs): add Staged Insert Specification #177

Draft
dimitri-yatsenko wants to merge 4 commits into main from docs/spec-staged-insert

Conversation

@dimitri-yatsenko
Member

Summary

New normative spec at src/reference/specs/staged-insert.md defining the staged-insert contract for all object-store codecs (not just <object@>). The implementation in datajoint-python is deferred to a follow-up PR; this spec is what that PR will reference.

What changed

| File | Change |
| --- | --- |
| src/reference/specs/staged-insert.md (new) | The spec (~300 lines, normative) |
| mkdocs.yaml | Nav entry under Reference → Specifications → Data Operations |
| src/reference/specs/data-manipulation.md | §2.9 (existing brief mention) now cross-links to the new spec for normative details |

Spec contents

  • Lifecycle: setup → drafting → finalization → unwinding (4 phases, with diagram)
  • Codec staged-write protocol: three methods on the Codec base class — staged_handle, finalize_staged, cleanup_staged. Default raises DataJointError so non-participation is explicit.
  • Two concrete lifecycles:
    • Schema-addressed (handle at canonical path immediately; finalize computes metadata) — used by <object@>, <npy@>, future schema-addressed codecs.
    • Hash-addressed (handle at _staging/ path; finalize streams + hashes, moves to canonical _hash/ path with dedup) — used by <blob@>, <attach@>, <hash@>.
  • Path-construction shapes (normative): schema-addressed canonical, hash-addressed staging, hash-addressed canonical.
  • Per-codec metadata contracts: each codec's finalize_staged returns a dict structurally equal to what its encode() would produce for the same content — testable invariant.
  • Atomicity model: at-most-once with cleanup; explicit boundaries for block exceptions, duplicate-PK on insert, concurrent dedup, BaseException.
  • Concurrency: per-PK serialization requirement, hash-dedup semantics, transaction interaction.
  • Codec compatibility matrix: 4 built-in codecs in (<object@>, <npy@>, <blob@>, <attach@>); in-table and reference codecs explicitly out.
  • Worked examples for each supported codec.
  • Future work scope notes (<filepath@> staging, multi-row variants, resumable inserts).
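
As a rough illustration of the codec staged-write protocol listed above, the sketch below shows a base class whose three methods raise by default and a toy schema-addressed codec that opts in. Only the method names (staged_handle, finalize_staged, cleanup_staged) and the DataJointError default come from this summary. The signatures, the ToySchemaCodec class, the path layout, and the dict-backed store are assumptions made for the example, not the spec's definitions.

```python
class DataJointError(Exception):
    """Stand-in for DataJoint's error type so the sketch runs on its own."""


class Codec:
    """Hypothetical base class. Method names follow the spec summary;
    the signatures are assumptions."""

    name = "base"

    def _not_participating(self):
        return DataJointError(f"<{self.name}> does not support staged insert")

    def staged_handle(self, key, field, ext=""):
        raise self._not_participating()

    def finalize_staged(self, handle):
        raise self._not_participating()

    def cleanup_staged(self, handle):
        raise self._not_participating()


class ToySchemaCodec(Codec):
    """Toy schema-addressed codec backed by a dict instead of an object store."""

    name = "toy@"

    def __init__(self):
        self.objects = {}  # path -> bytes

    def staged_handle(self, key, field, ext=""):
        # Schema-addressed: the handle is the canonical path from the start.
        # This path layout is illustrative only.
        pk = "/".join(f"{k}={key[k]}" for k in sorted(key))
        return f"toy_schema/toy_table/{pk}/{field}{ext}"

    def finalize_staged(self, handle):
        # Finalization only computes metadata; nothing is moved.
        return {"path": handle, "size": len(self.objects[handle])}

    def cleanup_staged(self, handle):
        # Unwinding removes whatever was written during drafting.
        self.objects.pop(handle, None)


codec = ToySchemaCodec()
handle = codec.staged_handle({"subject_id": 1}, "frames", ".bin")
codec.objects[handle] = b"\x00" * 16          # drafting: caller writes content
metadata = codec.finalize_staged(handle)      # finalization: metadata only
```

Because the base class raises rather than silently doing nothing, a codec that does not override these methods fails loudly at the gate, which is the "non-participation is explicit" property described above.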

Open spec decisions surfaced for your review

These are spec-level decisions I made with my best judgment; flag any you'd push back on:

  1. <attach@> filename API: spec'd as staged.set_filename(field, 'name.ext') parallel to staged.rec[k] = v. Alternative was a convention like staged.rec[f'{field}_filename']. The explicit helper avoids name-collision risk with real attributes.
  2. Staging-path location: spec'd as _staging/{schema}/{table}/{field}_{token}{ext}. Orphans are traceable to their schema/table for GC. Alternative was a flat _staging/{token}.
  3. Hash algorithm: spec'd as SHA-256 to match today's <blob@>/<hash@> behavior. Worth committing to in the spec so plugin authors don't accidentally pick something else.
  4. KeyboardInterrupt handling: spec says "not caught; orphans reclaimed by GC." Did not add a recommendation to mask SIGINT — felt like over-prescription.
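
To make decision 2 concrete, here is a minimal sketch of the staging-path shape _staging/{schema}/{table}/{field}_{token}{ext} and of how a garbage collector could trace an orphan back to its schema and table. The function names and the token format (random hex from the secrets module) are assumptions for illustration; the spec summary only fixes the path template.

```python
import secrets


def sketch_staging_path(schema, table, field, ext="", token=None):
    """Build a staging path following the template in decision 2.

    The token format is an assumption; any collision-resistant string would do.
    """
    token = token or secrets.token_hex(8)
    return f"_staging/{schema}/{table}/{field}_{token}{ext}"


def staging_origin(path):
    """Recover (schema, table) from a staging path, as a GC pass might."""
    prefix, schema, table, _leaf = path.split("/", 3)
    if prefix != "_staging":
        raise ValueError(f"not a staging path: {path!r}")
    return schema, table


example = sketch_staging_path("lab", "session", "frames", ".zarr", token="abc123")
origin = staging_origin(example)
```

With the flat alternative (_staging/{token}), staging_origin would have nothing to parse, which is the traceability trade-off the decision describes.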

What this PR does not do

  • No code changes in datajoint-python. The implementation PR will land after this spec is merged.
  • No changes to the user-facing how-to (src/how-to/staged-insert.md — that's PR #175, still open). Once the implementation lands, a follow-up PR will update the how-to to reference this spec for normative details.

Test plan

  • mkdocs serve and confirm the new spec page renders under Reference → Specifications → Data Operations
  • All cross-links resolve (codec-api.md, data-manipulation.md, type-system.md, object-store-configuration.md, the how-to, garbage-collection)
  • Compatibility matrix reads correctly as a single table
  • Atomicity table doesn't contradict observed <object@> behavior today (run a quick <object@> smoke test against current master)

@dimitri-yatsenko
Member Author

Planned edits before merge

While discussing the staged-insert spec, we surfaced that <zarr@> (from dj-zarr-codecs) is a real codec — a sibling of <object@> under SchemaCodec, not built on top of it. Today's pattern of "use <object@> to host a Zarr store" is a workaround that works because <object@> is the generic schema-addressed-directory codec; once staged_insert1 accepts any SchemaCodec (per this spec), <zarr@> becomes the type-correct choice. The spec should reflect that.

Two edits are planned. They will be applied as a final commit on this branch immediately before merge, after the matching implementation PR in datajoint-python lands — so the spec ships in lockstep with the as-shipped code, not as an aspirational document.

Edit 1 — Add a "Plugin codecs" sub-section to the Codec compatibility matrix

After the built-in compatibility matrix (currently the last row is "Other custom codec"), add:

Plugin codecs (examples)

Third-party packages can register additional SchemaCodec or HashAddressedCodec subclasses that inherit staged-insert support automatically:

| Codec | Plugin package | Lifecycle | Notes |
| --- | --- | --- | --- |
| <zarr@> | dj-zarr-codecs | Schema-addressed | Canonical example of a third-party SchemaCodec; stores shape, dtype, and provenance in the column metadata |

(Other plugins — dj-photon-codecs, dj-figpack-codecs — can be added here as they adopt the protocol.)

Edit 2 — Replace the <object@> Zarr example with a <zarr@> example

Currently the Examples section leads with <object@> writing Zarr, which is the workaround. After implementation, the canonical example becomes:

<zarr@> — Zarr array (via dj-zarr-codecs plugin)

import zarr

with ImagingSession.staged_insert1 as staged:
    staged.rec['subject_id'] = 1
    staged.rec['session_id'] = 1
    z = zarr.open(staged.store('frames', '.zarr'), mode='w',
                  shape=(1000, 512, 512), chunks=(1, 512, 512), dtype='uint16')
    for i in range(1000):
        z[i] = acquire_frame()
    staged.rec['n_frames'] = 1000

<object@> — Generic multi-file directory

For directory layouts without a format-aware codec (custom binary formats, mixed files, ad-hoc collections):

import json

with Dataset.staged_insert1 as staged:
    staged.rec['dataset_id'] = 1
    fs = staged.store('artifact')        # fsspec.FSMap
    fs['data.bin']      = signal.tobytes()
    fs['metadata.json'] = json.dumps({'session': '2026-05-21'}).encode()

Use <object@> only as a fallback. Prefer a format-aware codec (<zarr@>, <npy@>, or a custom SchemaCodec subclass) when one exists — you get richer column metadata (shape, dtype, etc.) and a typed fetch result.

Merge sequencing

This PR is reviewable now as a design document. The two edits above are deltas against this PR's tip that I'll commit just before merge. Concretely:

  1. Reviewers land any spec-shape feedback on the current diff.
  2. Implementation PR in datajoint-python lands (introduces the generalized gate so <npy@>, <blob@>, <attach@>, <zarr@>, etc. work with staged_insert1).
  3. I push Edits 1 and 2 here as one final commit referencing the merged implementation PR.
  4. This PR merges.

If you'd prefer them applied now (and accept that the spec runs ahead of the code for a few weeks), say so — happy to flip the order.

@dimitri-yatsenko
Member Author

Revised planned edits — superseding my previous comment

@dimitri-yatsenko corrected my framing: <zarr@> is built around insert1(numpy_array), not staged insert. The codec's encode() synchronously serializes the array to Zarr format. Staged insert isn't part of the <zarr@> idiom — and importantly, can't be, because <zarr@> requires a fully-formed numpy or zarr array as the encode input. There's no way to stream chunks through <zarr@>'s normal encode path.

That means my earlier "Edit 2: replace <object@> Zarr example with <zarr@>" was misframed. The <object@>-hosts-Zarr example in the spec is not a workaround — it's the correct pattern for the only case staged insert exists to serve: arrays too large to materialize. <zarr@> is for the materializable case and uses ordinary insert1.

The real divide is array size, not codec preference. Revised plan:

Revised Edit 1 — Add a "When to use staged insert" callout near the top of the spec

Add this paragraph after the Overview, before §Scope:

Staged insert is for content too large to materialize in process memory. For arrays that fit in memory, ordinary insert1 with a typed codec is both simpler and more idiomatic — pass a numpy array to <zarr@> or <npy@> and the codec handles serialization. Reach for staged_insert1 only when you can't hold the full value in memory (multi-GB Zarr stores being streamed from an instrument, HDF5 files written incrementally, blobs piped from a producer).

Revised Edit 2 — Add <zarr@> to the compatibility matrix with the right framing

Under the existing "Plugin codecs" sub-section idea, the row becomes:

| Codec | Plugin package | Staged insert? | Notes |
| --- | --- | --- | --- |
| <zarr@> | dj-zarr-codecs | Not typical | Use ordinary insert1 with a numpy or zarr array; the codec serializes to Zarr internally. Staged insert is rarely the right tool here, because <zarr@>'s encode requires a materialized array. For streaming Zarr writes that don't fit in memory, use <object@> with staged.store(field, '.zarr') and open Zarr directly. |

This is more honest than putting <zarr@> in the "supported" column — it technically inherits the protocol from SchemaCodec, but it's not the right tool for the staged use case.

Dropped: the previous "replace the <object@> Zarr example" edit

The current <object@> Zarr example in the spec stays. It correctly shows the streaming pattern that <zarr@> can't serve. I'll just add a small note pointing readers to <zarr@> + insert1 for the in-memory case:

<object@> — Streaming Zarr / HDF5 / multi-file directories

Use <object@> when the data is built up incrementally and doesn't fit in memory. For Zarr arrays that do fit in memory, use <zarr@> with ordinary insert1 instead — it's simpler and yields a typed fetch result.

# [existing example unchanged]

Merge sequencing unchanged

These edits still land as one final commit just before merge, after the implementation PR in datajoint-python ships the generalized gate.

@dimitri-yatsenko
Member Author

Test & validate dj-zarr-codecs against the spec

To turn the spec from a normative document into a verifiable contract, I'll add a Conformance section listing tests every plugin codec must pass, and then implement them in dj-zarr-codecs once the datajoint-python implementation PR lands. Listing both halves below.

Part 1 — Add a Conformance section to this spec (planned third edit, ships with the others before merge)

A new section after §Codec compatibility matrix:

Conformance tests

Every codec that participates in the staged-write protocol MUST pass the following tests. They're stated here as a contract; reference implementations live in datajoint-python's integration suite for the built-in codecs, and tests/conformance.py (TBD) provides reusable fixtures third-party packages can import.

Required for any participating codec

| Test | Asserts |
| --- | --- |
| test_codec_admitted_by_staged_insert_gate | A table whose field uses this codec accepts with table.staged_insert1 without raising. |
| test_staged_write_lands_at_canonical_path | After a clean exit, the written content exists at the path the codec returned (schema-addressed canonical, or hash-addressed canonical after the rename). |
| test_staged_insert_metadata_matches_encode | The metadata dict assigned to staged.rec[field] on finalization is structurally equal to what the same codec's encode() would produce for equivalent content. |
| test_staged_insert_fetch_roundtrip | After staged insert, fetching the field returns a value indistinguishable from what an ordinary insert1 of the same content would have produced. |
| test_staged_cleanup_on_exception | Raising inside the with block leaves no row inserted and no canonical artifact (and for hash-addressed codecs, no staging artifact). |
| test_staged_primary_key_required | Calling staged.open() or staged.store() before all primary key attributes are set on staged.rec raises DataJointError. |
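
To show what two of these assertions (cleanup on exception, primary key required) look like in practice, here is a toy model of the staged-insert context manager. ToyTable, ToyStaged, the dict-backed store, and the path layout are all hypothetical stand-ins; this is not datajoint-python's implementation. The sketch catches Exception only, so a BaseException such as KeyboardInterrupt passes through without cleanup, consistent with the "not caught; orphans reclaimed by GC" decision in the PR description.

```python
from contextlib import contextmanager


class DataJointError(Exception):
    """Stand-in for DataJoint's error type."""


class ToyStaged:
    def __init__(self, table):
        self.table = table
        self.rec = {}
        self.handles = {}  # field -> path

    def store(self, field, ext=""):
        # The handle depends on the primary key, so it must be complete first.
        missing = [k for k in self.table.primary_key if k not in self.rec]
        if missing:
            raise DataJointError(f"primary key attributes not set: {missing}")
        pk = "/".join(f"{k}={self.rec[k]}" for k in self.table.primary_key)
        path = f"{pk}/{field}{ext}"
        self.handles[field] = path
        return path


class ToyTable:
    primary_key = ("dataset_id",)

    def __init__(self):
        self.rows = []
        self.objects = {}  # path -> bytes, standing in for the object store

    @property
    def staged_insert1(self):
        return self._staged_insert1()

    @contextmanager
    def _staged_insert1(self):
        staged = ToyStaged(self)
        try:
            yield staged
        except Exception:
            # Unwinding: drop anything written during drafting, insert no row.
            for path in staged.handles.values():
                self.objects.pop(path, None)
            raise
        # Finalization: record metadata for each staged field, then insert.
        for field, path in staged.handles.items():
            size = len(self.objects.get(path, b""))
            staged.rec[field] = {"path": path, "size": size}
        self.rows.append(dict(staged.rec))
```

A conformance test against a real codec would make the same kind of assertions (no row, no artifact after a raised exception), with the table and store supplied by fixtures.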

Additional for hash-addressed codecs

| Test | Asserts |
| --- | --- |
| test_staged_dedup_hit | Two staged inserts of the same content to different primary keys produce one canonical hash-addressed object; both rows reference it. |
| test_staged_concurrent_canonical_collision | A staging-to-canonical rename whose destination is concurrently created falls through to the dedup branch without error. |
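
The dedup and collision behavior in the two tests above can be sketched on a local filesystem. This is an assumption-laden illustration: it uses a hard link as the "fails if the destination exists" primitive, whereas a real object store would need its own conditional-write mechanism, and the function names are hypothetical. The content token uses MD5 + base32, which the review discussion later in this thread identifies as today's hash_registry behavior.

```python
import base64
import hashlib
import os
import tempfile


def content_token(path, chunk_size=1 << 20):
    """Stream the file and return an MD5 + base32 token (26 lowercase chars)."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b32encode(md5.digest()).decode("ascii").rstrip("=").lower()


def finalize_hash_staged(root, schema, staging_path, store="local"):
    """Move a staged file to its hash-addressed location, deduplicating.

    Returns (metadata, deduplicated). Local-filesystem stand-in only.
    """
    token = content_token(staging_path)
    relative = f"_hash/{schema}/{token}"
    canonical = os.path.join(root, "_hash", schema, token)
    os.makedirs(os.path.dirname(canonical), exist_ok=True)
    try:
        # os.link raises FileExistsError if another writer got there first,
        # which covers both the plain dedup hit and the concurrent collision.
        os.link(staging_path, canonical)
        deduplicated = False
    except FileExistsError:
        deduplicated = True
    os.unlink(staging_path)  # the staging name is removed in both branches
    metadata = {
        "hash": token,
        "path": relative,
        "store": store,
        "size": os.path.getsize(canonical),
    }
    return metadata, deduplicated


root = tempfile.mkdtemp()
results = []
for name in ("a.tmp", "b.tmp"):
    staged_file = os.path.join(root, name)
    with open(staged_file, "wb") as f:
        f.write(b"same content")
    results.append(finalize_hash_staged(root, "lab", staged_file))
```

The second call takes the dedup branch and returns the same metadata as the first, so both rows would reference one canonical object.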

Part 2 — Implement the conformance tests in dj-zarr-codecs

Once the datajoint-python implementation PR merges, open a PR against dj-zarr-codecs that:

  1. Bumps the datajoint-python pin in pyproject.toml (pixi.toml) from rev = "f4b02583251c" to the merged implementation commit.

  2. Adds tests/test_staged_insert.py implementing the six SchemaCodec conformance tests above against <zarr@>:

    class TestZarrStagedConformance:
        def test_codec_admitted_by_staged_insert_gate(self, schema): ...
        def test_staged_write_lands_at_canonical_path(self, schema): ...
        def test_staged_insert_metadata_matches_encode(self, schema): ...
        def test_staged_insert_fetch_roundtrip(self, schema): ...
        def test_staged_cleanup_on_exception(self, schema): ...
        def test_staged_primary_key_required(self, schema): ...

    Each test exercises a small <zarr@> table, comparing staged-insert behavior against ordinary insert1 for the same array.

  3. Adds <zarr@>-specific tests beyond the generic conformance contract:

    | Test | Asserts |
    | --- | --- |
    | test_staged_zarr_shape_dtype_recorded | Staged-inserted <zarr@> metadata column contains shape, dtype, store, and provenance matching what <zarr@>'s encode() would have produced. |
    | test_staged_zarr_chunked_write_roundtrip | Open Zarr via zarr.open(staged.store(field, '.zarr')), write in chunks larger than memory budget, fetch, assert chunk-by-chunk equality. Demonstrates the streaming case <zarr@> was previously not designed for. |
    | test_zarr_insert1_still_works | Regression guard: the existing test_numpy_array_roundtrip / test_zarray_roundtrip tests still pass after the gate generalization. (The <zarr@> insert1 path is the idiomatic one for in-memory arrays; it must not regress.) |

  4. Confirms the codec compatibility matrix claim by running the conformance suite against both schema-addressed (<zarr@>) and hash-addressed (a small <blob@>-style sanity test) codecs to prove the design isn't <zarr@>-specific.

Sequencing

| Step | Repo | Status |
| --- | --- | --- |
| 1. Spec review | datajoint-docs #177 | this PR |
| 2. Implementation | datajoint-python | not yet opened; references the spec as design |
| 3. Spec edits (Zarr framing, conformance section) commit | datajoint-docs #177 | held until step 2 merges |
| 4. dj-zarr-codecs conformance tests | dj-zarr-codecs | held until step 3 merges |

5. Merge sequence: step 2 → 3 → 4

Steps 2 and 4 land sequentially because step 4 needs the implementation to pass. Step 3 (the spec) lands between them because the spec describes what step 2 shipped and what step 4 validates.

@dimitri-yatsenko
Member Author

Deferred: <photon@> (from dj-photon-codecs)

Examined dj-photon-codecs/codecs.py. <photon@> is a SchemaCodec subclass like <zarr@> and <npy@>, so the generalized gate admits it. The only wrinkle is that <photon@> transforms data inside encode() (Anscombe stabilization + Blosc/Zstd + Zarr attrs), and the staged path bypasses encode() — so on the staged path, the caller is responsible for applying the transform per chunk before writing.

Decision: not adding a <photon@> subsection or matrix row to this spec yet. Sequencing:

  1. This PR's <zarr@> framing lands (already in aa0f66d).
  2. datajoint-python implementation PR lands the generalized gate.
  3. dj-zarr-codecs conformance tests pass (the validation step from #177-comment-4509522796).
  4. Then revisit <photon@> with empirical evidence about how transforming codecs fit the protocol — whether the "caller pre-transforms" pattern is enough (likely), whether the codec needs to expose its transform as a public helper, or whether more protocol surface is warranted.

The expectation, based on the protocol mechanics, is that <photon@> works the same way as <zarr@> once <zarr@> works — same FSMap-driven Zarr write, different caller-side preprocessing. We'll confirm rather than assume.

@MilagrosMarin
Collaborator

Read this carefully against the dj-python source — and caught up on aa0f66d and the four follow-up comments above. The sequencing plan (spec → impl PR → dj-zarr-codecs conformance suite → revisit <photon@>) is clean, and the conformance contract you've drafted (test_staged_insert_metadata_matches_encode, test_staged_dedup_hit, etc.) will catch most of the structural-equality concerns I'd otherwise flag. A few suggestions where the spec text — independent of the impl/conformance steps — could use sharpening:

On the hash algorithm (Open Decision #3). The decision says SHA-256 is spec'd "to match today's <blob@>/<hash@> behavior", but hash_registry.compute_hash (hash_registry.py:51-67) uses MD5 + base32 producing a 26-char lowercase token:

md5_digest = hashlib.md5(data).digest()
return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()

Two ways to reconcile:

  • Re-state #3 as a migration to SHA-256 (which invalidates existing hash paths, so worth calling out as breaking)
  • Or align the spec to MD5+base32+26-char (and adjust the {hash[:2]}/{hash[2:4]} path template, which only makes intuitive sense for a longer hex string)
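
To make the length difference concrete, the sketch below wraps the two quoted lines in a function and compares the result with a SHA-256 hex digest. The function name md5_base32_token is a label chosen for this example, not the name used in hash_registry.

```python
import base64
import hashlib


def md5_base32_token(data: bytes) -> str:
    """Same two steps as the quoted snippet: MD5 digest, then unpadded base32."""
    md5_digest = hashlib.md5(data).digest()  # 16 bytes = 128 bits
    return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()


token = md5_base32_token(b"example content")
sha256_hex = hashlib.sha256(b"example content").hexdigest()

# 128 bits at 5 bits per base32 character rounds up to 26 characters;
# a SHA-256 hex digest is 256 / 4 = 64 characters.
print(len(token), len(sha256_hex))  # prints: 26 64
```

The alphabets differ too: the base32 token draws on a-z and 2-7, so a two-character prefix has up to 1024 values versus 256 for hex, which is one reason a {hash[:2]}/{hash[2:4]} template does not carry over unchanged.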

On the hash-addressed canonical path. Spec line 162 gives _hash/{hash[:2]}/{hash[2:4]}/{hash}, but hash_registry.build_hash_path today produces _hash/{schema_name}/{fold_path}/{content_hash} with configurable subfolding from the store spec. {schema} is load-bearing today for isolation; subfolding is per-store-tunable. Both seem worth preserving in the normative shape, or explicitly dropping with rationale.
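
A minimal sketch of the shape described here, _hash/{schema_name}/{fold_path}/{content_hash}, follows. It is not a copy of hash_registry.build_hash_path: the function name is hypothetical, and representing subfolding as a tuple of prefix widths is an assumption about how the store spec expresses it.

```python
def sketch_hash_path(schema_name, content_hash, subfolding=()):
    """Build a hash-addressed path with optional subfolding.

    `subfolding` is assumed to be a tuple of widths, e.g. (2, 2) turns
    'abcdef...' into the fold segments 'ab' and 'cd'.
    """
    folds = []
    start = 0
    for width in subfolding:
        folds.append(content_hash[start:start + width])
        start += width
    return "/".join(["_hash", schema_name, *folds, content_hash])


flat = sketch_hash_path("lab", "abcdefghijklmnopqrstuvwxyz")
folded = sketch_hash_path("lab", "abcdefghijklmnopqrstuvwxyz", subfolding=(2, 2))
```

Here flat is _hash/lab/abcdefghijklmnopqrstuvwxyz and folded is _hash/lab/ab/cd/abcdefghijklmnopqrstuvwxyz; the schema segment is present in both, which is the isolation property the comment asks to preserve.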

On the <object@> normative metadata. Spec line 174 has {path, size, hash: None, ext, is_dir, timestamp, item_count?, mime_type?}. Today:

  • ObjectCodec.encode returns {path, store, size, ext, is_dir, item_count, timestamp} (builtin_codecs/object.py:166-174) — has store, no hash, no mime_type
  • staged_insert._compute_metadata returns {path, size, hash: None, ext, is_dir, timestamp, item_count} or with mime_type for files

So three shapes today (encode, staged, spec). The conformance test will catch the impl-side divergence, but the spec itself should pick whichever is normative and signal that the others converge to it.
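
The divergence between the three shapes can be checked mechanically. The key sets below are copied from the lists above (the staged set is the directory variant, without mime_type); the key_diff helper is a hypothetical name for this illustration and compares key names only, not values.

```python
# Key sets taken from the comment above; "?" entries in the spec draft
# are treated as optional.
ENCODE_KEYS = {"path", "store", "size", "ext", "is_dir", "item_count", "timestamp"}
STAGED_KEYS = {"path", "size", "hash", "ext", "is_dir", "timestamp", "item_count"}
SPEC_REQUIRED = {"path", "size", "hash", "ext", "is_dir", "timestamp"}
SPEC_OPTIONAL = {"item_count", "mime_type"}


def key_diff(actual, expected):
    """Return (missing, unexpected) keys of `actual` relative to `expected`."""
    return sorted(expected - actual), sorted(actual - expected)


staged_missing, staged_unexpected = key_diff(STAGED_KEYS, ENCODE_KEYS)
spec_missing, spec_unexpected = key_diff(SPEC_REQUIRED | SPEC_OPTIONAL, ENCODE_KEYS)

print("staged vs encode:", staged_missing, staged_unexpected)
print("spec   vs encode:", spec_missing, spec_unexpected)
```

Relative to encode, both the staged shape and the spec draft lack store and add hash; the spec draft additionally allows mime_type. A conformance test built on this comparison would fail until the three converge.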

On <blob@> / <attach@> "structurally equal to encode()". Today BlobCodec.encode and AttachCodec.encode return bytes; the metadata dict ({hash, path, store, size}) comes from the chained <hash@> codec, not from those codecs directly. The spec implies their encode() returns dicts directly — which is what the impl PR will do, but a sentence like "the implementation refactors BlobCodec.encode/AttachCodec.encode to return metadata dicts directly" would tell plugin authors what's authoritative.

Small related: HashCodec's own metadata shape is documented three different ways in source (class docstring → {hash, store, size}; encode docstring → {hash, path, schema, store, size}; this spec → {hash, path, store, size}). Worth picking one as part of the impl PR.

On forward-looking pieces. HashAddressedCodec, _build_staging_path, _enrich_staged_metadata, and the three protocol methods on Codec don't exist in source yet (grep -rn "class HashAddressedCodec" src/datajoint/ is empty). Spec describes them present-tense — totally fine for a normative spec, but a small banner near the top ("Implementation status: spec only; reference impl lands in datajoint-python#TBD") would tell readers what's authoritative-today vs. authoritative-after-impl.

On staging vs canonical path consistency. Spec gives staging as _staging/{schema}/{table}/{field}_{token}{ext} (with schema/table), and canonical as _hash/{hash[:2]}/{hash[2:4]}/{hash} (no schema). If {schema} stays in canonical (suggestion 2 above), they align naturally; if it goes, the staging path is more granular than necessary.

On the cross-link to ../../how-to/staged-insert.md. That target lives on #175 (approved but unmerged). The link will resolve once #175 lands; sequencing #175 → #177 (or batching them) avoids a temporary broken link if linkcheck runs in between.

On the aa0f66d <zarr@> framing. Like it. The "two paths — insert1 for in-memory, staged_insert1 for content that can't materialize" framing is clear, and demoting <object@> to "generic multi-file fallback" is the right call. The _enrich_staged_metadata line ("reads the just-written Zarr array's metadata via zarr.open(store, mode='r')") is concrete enough that a dj-zarr-codecs plugin author can implement it from the spec alone.

On the conformance contract (comment #3). The six required + two hash-addressed tests are well-scoped. One addition worth considering: a test_staged_handle_rejects_non_participating_codecs that asserts the default staged_handle on Codec raises DataJointError with the codec name in the message — that's the gate the spec relies on to make non-participation explicit.

On <photon@> deferral (comment #4). Reasonable. The "caller pre-transforms per chunk" pattern is the right starting hypothesis, and validating against dj-zarr-codecs first before generalizing the protocol to transforming codecs is the right risk-mitigation order.

None of this is a showstopper: the spec's structure and the sequencing plan are both right. Mostly nudges around making "matches today" claims actually match today (or signaling that they're aspirational), and a couple of small forward-looking framing improvements.

@dimitri-yatsenko
Member Author

Thank you @MilagrosMarin — every claim you pulled from source was correct. Pushed ec3a0dd with corrections.

Addressed in this commit

| # | Your point | Resolution |
| --- | --- | --- |
| 1 | Hash algorithm: SHA-256 ≠ today's MD5+base32 | Spec corrected to MD5 + base32 → 26-char lowercase, matching hash_registry.compute_hash (hash_registry.py:51-67). Not a migration. |
| 2 | Hash canonical path shape | Spec corrected to _hash/{schema}/{content_hash} (flat) or _hash/{schema}/{fold_*}/{content_hash} (subfolded), matching hash_registry.build_hash_path. {schema} is now load-bearing in the normative shape; {fold_*} derived from subfolding in the store spec. |
| 3 | <object@> metadata had three shapes (encode / staged / spec) | Pinned the normative shape to ObjectCodec.encode's actual output {path, store, size, ext, is_dir, item_count, timestamp} (builtin_codecs/object.py:166-174). Removed hash: None and mime_type? from the spec. Noted that the impl PR refactors StagedInsert._compute_metadata to converge on this. |
| 4 | <blob@> / <attach@> encode() returns bytes, not dict | Added a paragraph clarifying that today's encode returns bytes, the dict shape comes from the chained <hash@> codec, and the impl PR refactors BlobCodec.encode/AttachCodec.encode to return the dict shape directly. |
| 5 | HashCodec shape documented three ways in source | Noted in the same paragraph as #4 that consolidation to {hash, path, store, size} happens as part of the impl PR. |
| 6 | Forward-looking pieces (HashAddressedCodec, _build_staging_path, _enrich_staged_metadata, the three protocol methods) | Added an "Implementation status" admonition at the top of the spec, explicitly calling out which pieces are forward-looking vs as-shipped, with source line numbers as anchors. |
| 7 | Staging path has {schema}/{table}, canonical hash path didn't | Resolved naturally by fix #2: {schema} is now in the canonical hash path too, so the two paths align on the schema dimension. (Staging keeps the {table} segment for GC traceability; flagged in the new path table.) |

Tracked for the final pre-merge commit on this branch

These ship alongside the Zarr framing (already in aa0f66d) when the implementation PR lands:

| # | Your point | Plan |
| --- | --- | --- |
| 8 | Cross-link to ../../how-to/staged-insert.md (target on #175, unmerged) | Sequence #175 → #177. If #175 doesn't land first, I'll downgrade the link to a non-clickable reference in the final commit. |
| 10 | Add test_staged_handle_rejects_non_participating_codecs to the conformance suite | Added to my notes for the conformance section (planned final pre-merge commit). Tests that the default staged_handle on Codec raises DataJointError with the codec name in the message, which closes the explicit-non-participation loop. |

No action

  • #9 (aa0f66d Zarr framing): appreciated, no change needed.
  • #11 (<photon@> deferral): confirmed reasoning; deferred until dj-zarr-codecs conformance lands.

Re-review whenever you have time. If you'd rather see the conformance section and the rejection-test now (rather than at final pre-merge), say the word and I'll fold them into this PR.

@MilagrosMarin
Collaborator

Thanks @dimitri-yatsenko — verified ec3a0dd against master line-by-line, all seven fixes land cleanly:

✅ Hash algo / canonical-path / {schema} segment / subfolding-as-store-config — all now point to hash_registry.compute_hash and hash_registry.build_hash_path with the right shape and the right rationale.
✅ <object@> shape pinned to ObjectCodec.encode's actual output; the convergence note explicitly enumerates the three places that diverge today and which is normative.
✅ <blob@> / <attach@> clarification correctly explains today's bytes return + chained <hash@>, and surfaces the HashCodec triple-documentation issue as part of the same impl-PR refactor.
✅ The "Implementation status" admonition is exactly the right framing — three source anchors at the top is enough for a reader to verify the as-shipped state without spelunking.

On your question — defer the conformance section + rejection test to the final pre-merge commit. The spec is reviewable as a design doc now; the conformance section becomes meaningful only once the impl PR is concrete enough that the test names anchor to real assertions. Folding it in now risks drift between conformance and what ships.

The PR reads well in its current state.

@dimitri-yatsenko dimitri-yatsenko requested review from kushalbakshi and removed request for kushalbakshi May 21, 2026 16:12
Defines the staged-insert contract as a normative spec so the
implementation has a single source of truth and third-party codec
authors have a documented protocol to implement.

Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged
  / cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical
  path, finalize computes metadata) and hash-addressed (handle at
  _staging path, finalize hashes content and renames to canonical
  _hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each
  codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction,
  BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in,
  in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants,
  and resumable inserts

Implementation is deferred to a follow-up PR in datajoint-python; this
spec is the design that PR will reference.

Nav: add under Reference → Specifications → Data Operations alongside
data-manipulation.md and autopopulate.md.

Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in
the staged-insert spec:

- New "Concrete protocol behavior" subsection describing both usage
  paths: ordinary insert1 (canonical for in-memory arrays) and
  staged_insert1 (for arrays too large to materialize, via direct
  FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the
  generic <object@> example to a multi-file/directory fallback.

…ert spec

Corrections grounded in datajoint-python master:

- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
  lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
  corrected to `_hash/{schema}/{content_hash}` (flat) or
  `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
  hash_registry.build_hash_path. The {schema} segment is load-bearing for
  isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
  output `{path, store, size, ext, is_dir, item_count, timestamp}`
  (builtin_codecs/object.py:166-174). Noted the two-place convergence work
  the impl PR will do (StagedInsert._compute_metadata refactor; earlier
  draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
  AttachCodec.encode return raw bytes, and the dict shape comes from the
  chained <hash@> codec — the impl PR refactors them to return dicts
  directly. Also noted that HashCodec's three-way documented inconsistency
  will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
  are forward-looking vs as-shipped, with source line numbers as anchors.

Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
  per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)

Every example in §Examples now includes the @Schema class declaration
with definition string, matching the house style in codec-api.md.
Readers can copy a complete, self-contained snippet rather than
mentally fill in the table schema. int32 used throughout per the
core-types-in-docs convention.

Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>,
<blob@>, <attach@>.
@dimitri-yatsenko dimitri-yatsenko force-pushed the docs/spec-staged-insert branch from 21b781a to fb2b228 Compare May 21, 2026 16:56
@dimitri-yatsenko dimitri-yatsenko marked this pull request as draft May 21, 2026 18:47