feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171) by 27Bslash6 · Pull Request #187 · cachekit-io/cachekit-py

27Bslash6 · 2026-06-17T13:31:57Z

What

Builds the real zero-copy memory-mapped read path for the File backend: large plaintext Arrow DataFrames are read via mmap instead of os.read-ing the whole file onto the heap.

Closes #171 (the epic). Built on #162 (unwrap→memoryview, already merged) and shaped by a multi-agent design review.

Why

The "memory-mapped reads" advertised for uncompressed Arrow never existed — FileBackend.get() os.reads the entire file then slices it (full heap copy). Setting compression=none only inflated the payload with no read-RSS benefit. This delivers the win the docs promised.

Measured (perf test): a 150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via os.read.

How

FileBackend.get_buffer(key) (POSIX) maps the file and returns a borrowed _MmapHandle — a zero-copy memoryview of the payload past the 14-byte header.
get_cached_value(_async) takes this fast path when the value is plaintext Arrow returning pandas, deserializes straight over the mmap (pa.py_buffer is zero-copy), and closes the handle in a finally so the mapping never escapes the call frame.

Scope & safety (deliberately narrow)

Concern	Decision
Platform	POSIX only. Windows pins mapped files against rename/unlink (breaks atomic `set()`); it keeps the `os.read` path.
Eligibility	Plaintext + pandas only. Encrypted values can't mmap (AES-GCM decrypt owns its buffer); a `pyarrow.Table` return aliases the mapping → use-after-free on close.
Security	fd opened `O_NOFOLLOW`, header validated before mapping. Never a path-based mmap (would follow an attacker-swapped symlink).
L1 (blocker C)	mmap confined to the deserialize frame, never stored. Belt-and-suspenders: `L1Cache.put` now rejects non-`bytes` loudly, so a future regression can't silently pin a mapped inode for the entry's TTL.
Integrity	eager xxHash3 kept — it faults into reclaimable page cache, not heap, so the steady-state RSS win survives.
Size ceiling	fixed 512 MB read-side cap (independent of `max_value_mb`); larger files fall back to `os.read`.
Rename-under-mmap	`set()` stays rename-replace-only; a regression test asserts an open mmap sees the old snapshot while a concurrent `set()` lands.

Compression stays a write-side concern (Arrow IPC self-describes on read), so no serializer-wiring change was needed — the over-engineered "per-backend serializer" step from the original epic was cut.

Tests

get_buffer unit tests (roundtrip, mmap-backed view, missing/oversized/expired→None, idempotent close-releases-view, rename-under-mmap invariant).
mmap read-path wiring (eligibility predicate, get_buffer delegation, the get_cached_value branch confines + closes the handle, real Arrow DataFrame round-trips through the actual mmap).
L1Cache.put non-bytes rejection.
RSS perf test (slow, POSIX): mmap RSS < 0.5× full-read RSS.

Gates: 1846 passed (unit + critical + architecture), basedpyright 0 errors, ruff clean, doctests pass. No protocol/Rust/TS/SaaS change.

Summary by CodeRabbit

New Features
- Added zero-copy buffer reads for cached values where supported, reducing transient memory use for eligible large entries.
Improvements
- The file backend now provides POSIX mmap-based reads for uncompressed, plaintext Arrow data paths, with safe buffer-handle lifecycle management and automatic fallback to standard reads when unavailable.
- Updated related Arrow documentation and configuration text.
Enforcement
- L1Cache now rejects non-bytes payloads with a TypeError.
Tests
- Added unit and performance coverage for the mmap fast path and buffer-handle behaviour; CI now runs additional memory-invariant performance checks on Python 3.12.

…backend (#171) The "memory-mapped reads" advertised for uncompressed Arrow never existed: the File backend os.read the whole file onto the heap. This builds the real path. FileBackend.get_buffer() memory-maps a value (POSIX) and returns a borrowed zero-copy view of the payload. get_cached_value(_async) takes this fast path for plaintext Arrow that returns pandas, deserializing straight over the mmap (pa.py_buffer is zero-copy) and closing the handle in a finally. Measured: a 150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via os.read. Scope and safety (from the design review): - POSIX only. Windows pins mapped files against rename/unlink, which would break the atomic write-then-rename set() path; Windows keeps the os.read path. - Plaintext + pandas only. Encrypted values can never mmap (AES-GCM decrypt owns its buffer); a pyarrow.Table return aliases the mapping (use-after-free on close). - Security: the fd is opened O_NOFOLLOW and the header validated before mapping; we never use a path-based mmap (it would follow an attacker-swapped symlink). - The mmap is confined to the deserialize call frame and never reaches L1; a belt-and-suspenders guard in L1Cache.put rejects non-bytes so any future regression fails loud instead of pinning a mapped inode for the entry's TTL. - Read-side size ceiling (512 MB) independent of max_value_mb; larger files fall back to os.read. The eager xxHash3 integrity check is kept (it faults into reclaimable page cache, not heap, so the steady-state RSS win survives). No protocol/Rust/TS/SaaS change. Compression stays a write-side concern (Arrow IPC self-describes on read), so no serializer-wiring change was needed.

coderabbitai · 2026-06-17T13:32:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 84694f19-7343-4a6a-a707-53dc1df6c6a5

📥 Commits

Reviewing files that changed from the base of the PR and between 0ed9be9 and 7619dc9.

📒 Files selected for processing (1)

.github/workflows/ci.yml

Walkthrough

Implements a POSIX-only zero-copy mmap read fast-path for the File backend. Introduces BufferHandle and BufferReadableBackend protocols, adds FileBackend.get_buffer() with mmap lifecycle management via _MmapHandle, wires eligibility checks and fast-path branches into sync cache retrieval, enforces bytes-only inputs for L1Cache.put(), and adds unit, integration, and RSS performance tests.

Changes

Zero-copy mmap read path for FileBackend

Layer / File(s)	Summary
BufferHandle and BufferReadableBackend protocols `src/cachekit/backends/base.py`	Defines `BufferHandle` (zero-copy `memoryview` + idempotent `close()`) and `BufferReadableBackend` (`get_buffer()` returning `Optional[BufferHandle]`) as structural protocols establishing the zero-copy contract.
FileBackend mmap implementation `src/cachekit/backends/file/backend.py`	Adds `MMAP_MAX_BYTES` safety ceiling, `_MmapHandle` owning the mmap and payload `memoryview`, and `FileBackend.get_buffer()` with `O_NOFOLLOW` open, 14-byte header/TTL validation, shared file lock, and cleanup on all failure paths.
Cache handler fast-path wiring `src/cachekit/cache_handler.py`	Adds `supports_buffer_read()` type guard, `CacheSerializationHandler.supports_mmap_read()` eligibility check (plaintext Arrow→pandas only), widens `deserialize_data` to accept `memoryview`, extends `CacheHandlerStrategy` protocol, implements `StandardCacheHandler.get_buffer()` with backpressure/timeout, and inserts mmap fast-path branches with `finally`-close into sync `get_cached_value()`; async path documents no mmap support.
L1Cache bytes-only enforcement `src/cachekit/l1_cache.py`, `tests/unit/test_l1_memory_bounds.py`	Adds an `isinstance(value, bytes)` guard to `L1Cache.put()` raising `TypeError` for `memoryview`/`bytearray` inputs, with updated docstring and unit tests asserting rejection and non-storage of invalid payloads.
Unit and integration tests for mmap path `tests/unit/backends/test_file_backend.py`, `tests/unit/test_mmap_read_path.py`	Adds `TestMmapBuffer` (roundtrip, mmap-backed aliasing, missing/expired/oversized fallback, idempotent close, atomic-rename concurrency invariant) and `test_mmap_read_path.py` covering eligibility, delegation, sync branching, and end-to-end Arrow + FileBackend integration.
Performance test, documentation, and configuration `tests/performance/test_file_backend_perf.py`, `src/cachekit/config/settings.py`, `src/cachekit/serializers/arrow_serializer.py`, `.github/workflows/ci.yml`, `.secrets.baseline`	Adds POSIX-only RSS benchmark asserting mmap RSS increase is less than 0.5× full-read; updates `arrow_compression` description and `ArrowSerializer` docstrings to reference the mmap path; adds CI step for memory-invariant performance tests on Python 3.12; refreshes the secrets baseline timestamp and line number.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CacheOperationHandler
  participant CacheSerializationHandler
  participant StandardCacheHandler
  participant FileBackend

  Client->>CacheOperationHandler: get_cached_value(key)
  CacheOperationHandler->>CacheSerializationHandler: supports_mmap_read()

  alt plaintext Arrow→pandas eligible
    CacheOperationHandler->>StandardCacheHandler: get_buffer(key)
    StandardCacheHandler->>FileBackend: get_buffer(key)
    FileBackend->>FileBackend: open O_NOFOLLOW, validate header+TTL
    FileBackend->>FileBackend: mmap file read-only
    FileBackend-->>StandardCacheHandler: _MmapHandle
    StandardCacheHandler-->>CacheOperationHandler: BufferHandle
    CacheOperationHandler->>CacheSerializationHandler: deserialize_data(handle.view)
    CacheOperationHandler->>CacheOperationHandler: handle.close() [finally]
    CacheOperationHandler-->>Client: deserialized DataFrame
  else mmap not eligible or get_buffer() → None
    CacheOperationHandler->>StandardCacheHandler: get(key)
    StandardCacheHandler->>FileBackend: get(key)
    FileBackend-->>StandardCacheHandler: bytes
    StandardCacheHandler-->>CacheOperationHandler: bytes
    CacheOperationHandler->>CacheSerializationHandler: deserialize_data(bytes)
    CacheOperationHandler-->>Client: deserialized value
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

cachekit-io/cachekit-py#152: Both PRs touch the Arrow read/deserialisation path around memoryview-backed data (main PR's mmap/buffer fast path passes handle.view into deserialize_data, and retrieved PR updates ArrowSerializer.deserialize to correctly sniff/validate Arrow IPC from a memoryview), so the changes are code-level compatible.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly summarises the main change: implementing a zero-copy mmap read path for large plaintext Arrow data on the File backend, directly addressing issue `#171`.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering what was built, why it was needed, how it works, scope/safety decisions, testing approach, and test results. It follows the spirit of the template's key sections.
Linked Issues check	✅ Passed	The PR implements the primary objective (`#2` from `#171`): `FileBackend.get_buffer()` provides mmap-backed zero-copy reads for POSIX plaintext Arrow returning pandas. Measured RSS improvement (~0.1 MB vs ~157 MB for 150 MB values) confirms the core win. Sub-tasks `#3`–#6 remain open, as noted in the PR.
Out of Scope Changes check	✅ Passed	All changes align with the stated objectives: protocol definitions, mmap implementation, serialisation handling, L1Cache type-checking, documentation updates, and comprehensive testing. The `.secrets.baseline` update is housekeeping unrelated to scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/file-mmap-read-path

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-17T13:37:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/test_mmap_read_path.py`:
- Around line 41-56: The test_arrow_return_format_not_eligible method mutates
the process-wide cached ArrowSerializer instance's return_format property, which
could cause race conditions or flakiness if pytest runs tests in parallel or if
other tests depend on the serializer's default state. While the finally block in
the method correctly restores the original value, you should add isolation
safeguards: either mark the test with a pytest fixture or decorator to prevent
parallel execution with other tests that use the ArrowSerializer, or add
additional teardown cleanup after the finally block to ensure the serializer
cache is in a consistent state for other tests. This prevents the mutation
window between the serializer.return_format assignment and restoration from
causing test flakiness.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebbbbd1c-3973-48d3-8fc5-dc537df2f375

📥 Commits

Reviewing files that changed from the base of the PR and between 0901732 and 1f17532.

📒 Files selected for processing (11)

.secrets.baseline
src/cachekit/backends/base.py
src/cachekit/backends/file/backend.py
src/cachekit/cache_handler.py
src/cachekit/config/settings.py
src/cachekit/l1_cache.py
src/cachekit/serializers/arrow_serializer.py
tests/performance/test_file_backend_perf.py
tests/unit/backends/test_file_backend.py
tests/unit/test_l1_memory_bounds.py
tests/unit/test_mmap_read_path.py

- Remove the speculative mmap branch from get_cached_value_async: it has no caller (the async decorator path inlines get_async), so it was dead code and the only thing dragging diff coverage below the 80% patch gate. YAGNI. - get_buffer branch tests: corrupt magic, empty payload, truncated file, unexpected OSError -> BackendError, and the rename-under-mmap invariant. - StandardCacheHandler.get_buffer error-path tests (BackendError + generic). - pragma: no cover on the genuinely untestable-on-Linux defensive branches (POSIX guard, rare open errors, idempotent-close excepts, mid-map cleanup). - Test the arrow return_format gate with a throwaway serializer instead of mutating the process-wide cached instance (addresses CodeRabbit nit).

…very run Run the deterministic `performance and slow` tests in CI: the #171 mmap zero-copy read (RSS stays flat) and the #152 large-object allocation caps. They were excluded by the blanket `-m "not slow"` filter despite being reproducible resource invariants, not wall-clock benchmarks. ~5s, run once on 3.12 (version-independent). Flaky timing/throughput benchmarks stay `performance` WITHOUT `slow` and remain out of CI.

The autouse redis-isolation fixture (tests/conftest.py) spawns a local redis-server binary unless REDIS_URL is set; the CI runner has no such binary (it uses dockerized Redis), so the step ERRORed at fixture setup. Mirror the unit/critical steps and point it at the external Redis.

coderabbitai Bot requested changes Jun 17, 2026

View reviewed changes

Comment thread tests/unit/test_mmap_read_path.py

coderabbitai Bot previously approved these changes Jun 17, 2026

View reviewed changes

27Bslash6 dismissed coderabbitai[bot]’s stale review via 0ed9be9 June 17, 2026 23:12

coderabbitai Bot previously approved these changes Jun 17, 2026

View reviewed changes

27Bslash6 dismissed coderabbitai[bot]’s stale review via 7619dc9 June 17, 2026 23:18

coderabbitai Bot approved these changes Jun 17, 2026

View reviewed changes

27Bslash6 merged commit 1105454 into main Jun 18, 2026
32 checks passed

27Bslash6 deleted the feat/file-mmap-read-path branch June 18, 2026 02:51

cachekit-release-bot Bot mentioned this pull request Jun 17, 2026

chore(main): release 0.10.0 #186

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171)#187

feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171)#187
27Bslash6 merged 4 commits into
mainfrom
feat/file-mmap-read-path

27Bslash6 commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

27Bslash6 commented Jun 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How

Scope & safety (deliberately narrow)

Tests

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

27Bslash6 commented Jun 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

codecov Bot commented Jun 17, 2026 •

edited

Loading