Skip to content

docs: dedicated how-to page for staged insert#175

Merged
MilagrosMarin merged 2 commits into
mainfrom
docs/staged-insert-howto
May 21, 2026
Merged

docs: dedicated how-to page for staged insert#175
MilagrosMarin merged 2 commits into
mainfrom
docs/staged-insert-howto

Conversation

@dimitri-yatsenko
Copy link
Copy Markdown
Member

Summary

staged_insert1 is the right tool for inserting multi-GB Zarr/HDF5/streaming objects with atomic database+storage semantics, but it had no nav entry and no cross-link from insert-data.md — it was a subsection inside use-object-storage.md that readers only found by scrolling. Promote it to a dedicated How-To with full coverage.

Changes

  • New page: src/how-to/staged-insert.md — overview, "How it Works" (atomicity contract on both clean exit and exception), full API (rec, store, open, fs), patterns for Zarr/HDF5/instrument streaming, limitations, troubleshooting, See Also.
  • Nav: added under How-To → Object Storage, sandwiched between Use Object Storage and Use NPY Codec.
  • use-object-storage.md: replaced the "Write Directly to Object Storage" subsection with a brief pointer + link, so readers in the object-storage flow still get the handoff but the canonical content lives on the new page.
  • insert-data.md: inline note after "Insert with Blobs" and an entry at the top of See Also, so readers landing on the standard insert page discover the right tool for large objects.

Why a dedicated page (not a more prominent subsection)

Per offline discussion: the topic is valuable enough to warrant its own discoverability surface. Keeping it as a subsection — even renamed and repositioned — wouldn't appear in the left sidebar TOC, which is the highest-traffic discovery path.

Test plan

  • mkdocs serve and verify Staged Insert appears in the left sidebar under How-To → Object Storage
  • Open /how-to/staged-insert/ — confirm page renders end-to-end, all code blocks syntax-highlight, See Also links resolve
  • Open /how-to/use-object-storage/ — confirm the pointer near the bottom links to the new page
  • Open /how-to/insert-data/ — confirm both the inline "For multi-GB arrays..." and the See Also entry link to the new page
  • mkdocs-linkcheck (or equivalent) clean

`staged_insert1` was buried in `use-object-storage.md` (subsection
"Write Directly to Object Storage") with no nav entry, so readers
scanning the How-To sidebar couldn't find it and `insert-data.md`
didn't link to it. Promote it to its own how-to:

- New `src/how-to/staged-insert.md` covering overview, atomicity
  guarantees, full API (`rec`, `store`, `open`, `fs`), Zarr/HDF5/
  streaming patterns, limitations, and troubleshooting.
- Add nav entry under Object Storage in `mkdocs.yaml`.
- Replace the section in `use-object-storage.md` with a short pointer
  and link to the new page.
- Cross-link from `insert-data.md` (inline after "Insert with Blobs"
  and in See Also).
Copy link
Copy Markdown
Collaborator

@MilagrosMarin MilagrosMarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dimitri-yatsenko! Verified the API surface and atomicity contract against src/datajoint/staged_insert.py and tests/integration/test_object.py::TestStagedInsert:

✅ API matches source — staged.rec, staged.store(field, ext='')fsspec.FSMap, staged.open(field, ext='', mode='wb') → IO, staged.fsAbstractFileSystem.
✅ Atomicity is accurately described end-to-end. Worth highlighting: the doc correctly says database-insert failure during exit also cleans up — confirmed because _finalize() is called inside the outer try (staged_insert.py:304-310), so a duplicate-PK error gets caught and _cleanup() runs.
✅ Cross-links from insert-data.md and use-object-storage.md solve the discoverability gap. Nav placement under "How-To → Object Storage" is right.
✅ The pattern of not explicitly assigning staged.rec['field'] = z (which is shown in the dj-python source docstring) is actually more correct than the source docstring — _finalize overwrites it with computed metadata anyway. Good call.

Two accuracy issues worth fixing before merge:

1. <blob@> support claim is incorrect. Line 15 says:

It is only available for object-typed fields (<...@> syntax) and codecs that support direct storage handles — primarily <object@> (Zarr / HDF5 / multi-file) and <blob@> written via a file handle.

But staged_insert.py:100-101 explicitly:

if not (attr.codec and attr.codec.name == "object"):
    raise DataJointError(f"Attribute '{field}' is not an <object> type")

Only <object@> passes. The tests only exercise <object@local>. A <blob@> field would raise today. Either drop the <blob@> clause or note it as a planned extension.

2. "Content hash for hash-addressed codecs" describes nothing the implementation does. Line 64:

Computes object metadata (size, manifest, content hash for hash-addressed codecs) from the staged objects.

_compute_metadata (:165-249) sets "hash": None in both the directory and single-file branches. Since the only supported codec is <object@> (path-addressed), no hash is ever computed. Suggest just "Computes object metadata (size, manifest) from the staged objects."

Minor (optional):

  • <object@store> named-store form isn't mentioned — Quick Start uses <object@> (default store), but tests use <object@local>. One-line note in API Reference would round it out.
  • The context manager catches Exception, not BaseException, so KeyboardInterrupt mid-write leaves staged objects behind. Worth a bullet in Limitations so readers know not to rely on cleanup for Ctrl+C scenarios.

Otherwise this is a clean, well-structured how-to — much better than the buried subsection. Happy to approve once the two accuracy items are fixed.

From @MilagrosMarin's review on #175:

- Drop the inaccurate '<blob@> written via a file handle' claim.
  staged_insert.py:100-101 explicitly rejects anything except codec
  name == 'object', so only <object@> is supported. Note the actual
  error behavior instead.
- Drop 'content hash for hash-addressed codecs' from the metadata
  list. _compute_metadata always sets hash: None for both directory
  and single-file branches; no hash is ever computed.
- Mention the named-store form '<object@name>' alongside '<object@>'
  in the Table.staged_insert1 API reference.
- Add a Limitations bullet noting that cleanup catches Exception not
  BaseException, so KeyboardInterrupt mid-write can leave staged
  objects behind; point to the garbage-collection how-to.
@dimitri-yatsenko
Copy link
Copy Markdown
Member Author

Thanks @MilagrosMarin — pushed e0f4de0 addressing all four points.

Required fixes:

  1. <blob@> claim — dropped. The line now says only <object@> is supported and that attempting staged.store() / staged.open() on any other field type raises DataJointError. Confirmed against staged_insert.py:100-101.
  2. Content hash — dropped from the metadata list. The "How It Works" step now reads "Computes object metadata (size, manifest)" — matches _compute_metadata's hash: None in both branches.

Optional (both done):
3. Named-store form — added a sentence in the Table.staged_insert1 API reference: "<object@> uses stores.default, and <object@name> uses the named store."
4. BaseException / Ctrl+C cleanup gap — added a Limitations bullet calling out that KeyboardInterrupt bypasses the cleanup path, with a pointer to Clean Up Storage for reclaiming orphaned staged objects.

Ready for another look.

@dimitri-yatsenko
Copy link
Copy Markdown
Member Author

@MilagrosMarin — thanks again for the careful read here. Your accuracy points led directly to a broader design question that's worth flagging: once we asked why only <object@> is supported by staged insert, it turned out the gate is narrower than necessary. The same lifecycle generalizes cleanly to <npy@> (schema-addressed) and — with a staging-then-rename step — to <blob@> and <attach@> (hash-addressed). The "content hash for hash-addressed codecs" framing that we removed from this how-to is in fact real — for hash-addressed codecs the hash is the path — it just doesn't apply to <object@> today.

I've opened a parallel spec PR that pins this down end-to-end: #177 — Staged Insert Specification. It defines the lifecycle, the codec-side protocol (staged_handle / finalize_staged / cleanup_staged), path/metadata contracts for both addressing schemes, the atomicity model (including your BaseException point), the concurrency model, and the codec compatibility matrix. The implementation PR in datajoint-python will reference it and land after the spec is reviewed.

For #175: it's correct as-is for the current <object@>-only implementation. Once the implementation PR lands, a small follow-up will update this how-to to reflect the broader codec support and link to the spec for normative details. Happy for you to approve this PR standalone in the meantime so the discoverability/structure improvements ship.

Copy link
Copy Markdown
Collaborator

@MilagrosMarin MilagrosMarin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dimitri-yatsenko — all four points verified in e0f4de0. <blob@> claim dropped, "content hash" phrase dropped, named-store form added, BaseException/Ctrl+C caveat in Limitations with pointer to garbage-collection.md (target exists). Spec-PR scoping (#177) is sensible — will review separately.

Approving.

@MilagrosMarin MilagrosMarin merged commit 6f867bf into main May 21, 2026
3 checks passed
dimitri-yatsenko added a commit that referenced this pull request May 21, 2026
…ert spec

Corrections grounded in datajoint-python master:

- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
  lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
  corrected to `_hash/{schema}/{content_hash}` (flat) or
  `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
  hash_registry.build_hash_path. The {schema} segment is load-bearing for
  isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
  output `{path, store, size, ext, is_dir, item_count, timestamp}`
  (builtin_codecs/object.py:166-174). Noted the two-place convergence work
  the impl PR will do (StagedInsert._compute_metadata refactor; earlier
  draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
  AttachCodec.encode return raw bytes, and the dict shape comes from the
  chained <hash@> codec — the impl PR refactors them to return dicts
  directly. Also noted that HashCodec's three-way documented inconsistency
  will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
  are forward-looking vs as-shipped, with source line numbers as anchors.

Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
  per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
@dimitri-yatsenko dimitri-yatsenko deleted the docs/staged-insert-howto branch May 21, 2026 16:28
dimitri-yatsenko added a commit that referenced this pull request May 21, 2026
…ert spec

Corrections grounded in datajoint-python master:

- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
  lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
  corrected to `_hash/{schema}/{content_hash}` (flat) or
  `_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
  hash_registry.build_hash_path. The {schema} segment is load-bearing for
  isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
  output `{path, store, size, ext, is_dir, item_count, timestamp}`
  (builtin_codecs/object.py:166-174). Noted the two-place convergence work
  the impl PR will do (StagedInsert._compute_metadata refactor; earlier
  draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
  AttachCodec.encode return raw bytes, and the dict shape comes from the
  chained <hash@> codec — the impl PR refactors them to return dicts
  directly. Also noted that HashCodec's three-way documented inconsistency
  will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
  are forward-looking vs as-shipped, with source line numbers as anchors.

Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
  per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants