Skip to content

Unify built-in storage protocols through StorageAdapter (Alt B from #1432) #1440

@dimitri-yatsenko

Description

@dimitri-yatsenko

Background

PR #1432 introduces a StorageAdapter ABC + entry-point plugin system for third-party storage protocols. The four built-in protocols (file, s3, gcs, azure) remain hardcoded in storage.py, producing a deliberate two-tier pattern: extension authors go through the ABC; built-ins keep their if self.protocol == "..." branches.

This issue tracks the follow-up work to collapse that two-tier pattern by routing all protocols through StorageAdapter, so built-ins and third-party adapters become first-class citizens of the same code path.

The user-facing API (dj.config.stores) does not change.

Scope

storage.py on master has ~22 protocol-conditional branches: 12 in setup methods (already addressable by the ABC drafted in #1432), and 10 in the I/O methods that are not yet addressable through the current four-method interface. This issue is about the I/O 10.

Method Line Branch nature Design question
put_file 470 safe_copy (file) vs fsspec.put_file (clouds) Atomicity contract
put_buffer 518 safe_write (file) vs fsspec.pipe_file (clouds) Atomicity contract
get_file 497 pathlib (file) vs fsspec.get_file (clouds)
get_buffer 549 open() (file) vs fsspec.cat_file (clouds)
remove 604 Path.unlink (file) vs fsspec.rm (clouds)
open 651 pre-write mkdir (file only)
put_folder 697 shutil.copytree (file) vs fsspec.put recursive (clouds) Recursive-op semantics
remove_folder 737 shutil.rmtree (file) vs fsspec.rm recursive (clouds) Recursive-op semantics
copy_from_url 796 mkdir + copy (file) vs fsspec.pipe (clouds)
copy_folder_from_url 847 mkdir (file) vs no-op (clouds)

Gating design questions

These must be resolved before any I/O branch is moved into an adapter. They are not blockers for #1432 itself.

  1. Atomic-write semantics. safe_write and safe_copy (utils.py:131,153) write to a .saving/.copying temp file and os.rename to the final path — a documented atomicity guarantee. fsspec.LocalFileSystem.put_file/pipe_file write directly. A LocalFileAdapter must preserve this contract; either by reimplementing the temp-file-then-rename pattern in the adapter, or by deciding (with explicit user-facing notice) that DataJoint no longer guarantees atomicity for the local backend.

  2. Recursive-op semantics. shutil.copytree and shutil.rmtree have specific behaviors around symlinks, partial failures, and metadata that fsspec.put recursive / fsspec.rm do not match 1:1. Document the divergences and decide whether to preserve shutil semantics for LocalFileAdapter or accept the fsspec semantics.

  3. Interface shape. Open question whether each I/O operation gets its own adapter method, or whether the adapter exposes a smaller surface (e.g. a local_atomic_writer hook that only file-protocol adapters override, with everything else delegating to the fsspec defaults). The 10 branches above are not all equally weighty — open (pre-write mkdir) and copy_folder_from_url (mkdir vs no-op) are nearly trivial; put_file/put_buffer/put_folder are the load-bearing ones.

Proposed phasing

Each phase is a separate, reviewable PR.

  • Phase 0 — test scaffolding. Add per-protocol unit tests for the four built-ins covering _create_filesystem, _full_path, get_url, and each I/O method. This is the prerequisite that makes the rest safe — currently only test_get_store_spec_file_protocol exercises StorageBackend per-protocol logic; everything else relies on integration tests, which don't typically stress atomicity or recursive-op edge cases. Block all subsequent phases on Phase 0.
  • Phase 1 — setup-layer migration. Route all four built-ins through StorageAdapter.create_filesystem, validate_spec, full_path, get_url. Eliminates ~12 setup branches. No I/O changes; design questions above are not yet engaged.
  • Phase 2 — design decision on atomicity + recursive ops. Write up the design proposal (a short doc or a discussion in this issue) addressing the gating questions. Get sign-off before coding.
  • Phase 3 — I/O migration. One protocol at a time (files3gcsazure), each in its own PR with the Phase-0 test suite as the regression backstop.

Non-goals

  • Changing dj.config.stores user-facing API.
  • Deprecating or removing fsspec; the adapters wrap it.
  • Adding new built-in protocols beyond the existing four.

Acceptance criteria

  • storage.py contains zero if self.protocol == "..." branches.
  • All four built-in adapters (FileAdapter, S3Adapter, GCSAdapter, AzureAdapter) are registered at import time and discoverable via get_storage_adapter(protocol).
  • Existing integration test suite passes unchanged.
  • New per-protocol unit tests exist (delivered in Phase 0) and cover atomicity + recursive-op semantics for the local backend.
  • Atomicity contract on the local backend is either preserved or, if changed, documented in release notes.

Tracking

This is the architectural follow-up to #1432. No target release is committed until the Phase 2 design discussion concludes — sequencing is against other work, not a fixed date.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions