Skip to content

feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171)#187

Merged
27Bslash6 merged 4 commits into
mainfrom
feat/file-mmap-read-path
Jun 18, 2026
Merged

feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171)#187
27Bslash6 merged 4 commits into
mainfrom
feat/file-mmap-read-path

Conversation

@27Bslash6

@27Bslash6 27Bslash6 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What

Builds the real zero-copy memory-mapped read path for the File backend: large plaintext Arrow DataFrames are read via mmap instead of os.read-ing the whole file onto the heap.

Closes #171 (the epic). Built on #162 (unwrap→memoryview, already merged) and shaped by a multi-agent design review.

Why

The "memory-mapped reads" advertised for uncompressed Arrow never existedFileBackend.get() os.reads the entire file then slices it (full heap copy). Setting compression=none only inflated the payload with no read-RSS benefit. This delivers the win the docs promised.

Measured (perf test): a 150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via os.read.

How

  • FileBackend.get_buffer(key) (POSIX) maps the file and returns a borrowed _MmapHandle — a zero-copy memoryview of the payload past the 14-byte header.
  • get_cached_value(_async) takes this fast path when the value is plaintext Arrow returning pandas, deserializes straight over the mmap (pa.py_buffer is zero-copy), and closes the handle in a finally so the mapping never escapes the call frame.

Scope & safety (deliberately narrow)

Concern Decision
Platform POSIX only. Windows pins mapped files against rename/unlink (breaks atomic set()); it keeps the os.read path.
Eligibility Plaintext + pandas only. Encrypted values can't mmap (AES-GCM decrypt owns its buffer); a pyarrow.Table return aliases the mapping → use-after-free on close.
Security fd opened O_NOFOLLOW, header validated before mapping. Never a path-based mmap (would follow an attacker-swapped symlink).
L1 (blocker C) mmap confined to the deserialize frame, never stored. Belt-and-suspenders: L1Cache.put now rejects non-bytes loudly, so a future regression can't silently pin a mapped inode for the entry's TTL.
Integrity eager xxHash3 kept — it faults into reclaimable page cache, not heap, so the steady-state RSS win survives.
Size ceiling fixed 512 MB read-side cap (independent of max_value_mb); larger files fall back to os.read.
Rename-under-mmap set() stays rename-replace-only; a regression test asserts an open mmap sees the old snapshot while a concurrent set() lands.

Compression stays a write-side concern (Arrow IPC self-describes on read), so no serializer-wiring change was needed — the over-engineered "per-backend serializer" step from the original epic was cut.

Tests

  • get_buffer unit tests (roundtrip, mmap-backed view, missing/oversized/expired→None, idempotent close-releases-view, rename-under-mmap invariant).
  • mmap read-path wiring (eligibility predicate, get_buffer delegation, the get_cached_value branch confines + closes the handle, real Arrow DataFrame round-trips through the actual mmap).
  • L1Cache.put non-bytes rejection.
  • RSS perf test (slow, POSIX): mmap RSS < 0.5× full-read RSS.

Gates: 1846 passed (unit + critical + architecture), basedpyright 0 errors, ruff clean, doctests pass. No protocol/Rust/TS/SaaS change.

Summary by CodeRabbit

  • New Features
    • Added zero-copy buffer reads for cached values where supported, reducing transient memory use for eligible large entries.
  • Improvements
    • The file backend now provides POSIX mmap-based reads for uncompressed, plaintext Arrow data paths, with safe buffer-handle lifecycle management and automatic fallback to standard reads when unavailable.
    • Updated related Arrow documentation and configuration text.
  • Enforcement
    • L1Cache now rejects non-bytes payloads with a TypeError.
  • Tests
    • Added unit and performance coverage for the mmap fast path and buffer-handle behaviour; CI now runs additional memory-invariant performance checks on Python 3.12.

…backend (#171)

The "memory-mapped reads" advertised for uncompressed Arrow never existed: the
File backend os.read the whole file onto the heap. This builds the real path.

FileBackend.get_buffer() memory-maps a value (POSIX) and returns a borrowed
zero-copy view of the payload. get_cached_value(_async) takes this fast path
for plaintext Arrow that returns pandas, deserializing straight over the mmap
(pa.py_buffer is zero-copy) and closing the handle in a finally. Measured: a
150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via os.read.

Scope and safety (from the design review):
- POSIX only. Windows pins mapped files against rename/unlink, which would break
  the atomic write-then-rename set() path; Windows keeps the os.read path.
- Plaintext + pandas only. Encrypted values can never mmap (AES-GCM decrypt owns
  its buffer); a pyarrow.Table return aliases the mapping (use-after-free on close).
- Security: the fd is opened O_NOFOLLOW and the header validated before mapping;
  we never use a path-based mmap (it would follow an attacker-swapped symlink).
- The mmap is confined to the deserialize call frame and never reaches L1; a
  belt-and-suspenders guard in L1Cache.put rejects non-bytes so any future
  regression fails loud instead of pinning a mapped inode for the entry's TTL.
- Read-side size ceiling (512 MB) independent of max_value_mb; larger files fall
  back to os.read. The eager xxHash3 integrity check is kept (it faults into
  reclaimable page cache, not heap, so the steady-state RSS win survives).

No protocol/Rust/TS/SaaS change. Compression stays a write-side concern (Arrow
IPC self-describes on read), so no serializer-wiring change was needed.
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 84694f19-7343-4a6a-a707-53dc1df6c6a5

📥 Commits

Reviewing files that changed from the base of the PR and between 0ed9be9 and 7619dc9.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml

Walkthrough

Implements a POSIX-only zero-copy mmap read fast-path for the File backend. Introduces BufferHandle and BufferReadableBackend protocols, adds FileBackend.get_buffer() with mmap lifecycle management via _MmapHandle, wires eligibility checks and fast-path branches into sync cache retrieval, enforces bytes-only inputs for L1Cache.put(), and adds unit, integration, and RSS performance tests.

Changes

Zero-copy mmap read path for FileBackend

Layer / File(s) Summary
BufferHandle and BufferReadableBackend protocols
src/cachekit/backends/base.py
Defines BufferHandle (zero-copy memoryview + idempotent close()) and BufferReadableBackend (get_buffer() returning Optional[BufferHandle]) as structural protocols establishing the zero-copy contract.
FileBackend mmap implementation
src/cachekit/backends/file/backend.py
Adds MMAP_MAX_BYTES safety ceiling, _MmapHandle owning the mmap and payload memoryview, and FileBackend.get_buffer() with O_NOFOLLOW open, 14-byte header/TTL validation, shared file lock, and cleanup on all failure paths.
Cache handler fast-path wiring
src/cachekit/cache_handler.py
Adds supports_buffer_read() type guard, CacheSerializationHandler.supports_mmap_read() eligibility check (plaintext Arrow→pandas only), widens deserialize_data to accept memoryview, extends CacheHandlerStrategy protocol, implements StandardCacheHandler.get_buffer() with backpressure/timeout, and inserts mmap fast-path branches with finally-close into sync get_cached_value(); async path documents no mmap support.
L1Cache bytes-only enforcement
src/cachekit/l1_cache.py, tests/unit/test_l1_memory_bounds.py
Adds an isinstance(value, bytes) guard to L1Cache.put() raising TypeError for memoryview/bytearray inputs, with updated docstring and unit tests asserting rejection and non-storage of invalid payloads.
Unit and integration tests for mmap path
tests/unit/backends/test_file_backend.py, tests/unit/test_mmap_read_path.py
Adds TestMmapBuffer (roundtrip, mmap-backed aliasing, missing/expired/oversized fallback, idempotent close, atomic-rename concurrency invariant) and test_mmap_read_path.py covering eligibility, delegation, sync branching, and end-to-end Arrow + FileBackend integration.
Performance test, documentation, and configuration
tests/performance/test_file_backend_perf.py, src/cachekit/config/settings.py, src/cachekit/serializers/arrow_serializer.py, .github/workflows/ci.yml, .secrets.baseline
Adds POSIX-only RSS benchmark asserting mmap RSS increase is less than 0.5× full-read; updates arrow_compression description and ArrowSerializer docstrings to reference the mmap path; adds CI step for memory-invariant performance tests on Python 3.12; refreshes the secrets baseline timestamp and line number.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CacheOperationHandler
  participant CacheSerializationHandler
  participant StandardCacheHandler
  participant FileBackend

  Client->>CacheOperationHandler: get_cached_value(key)
  CacheOperationHandler->>CacheSerializationHandler: supports_mmap_read()

  alt plaintext Arrow→pandas eligible
    CacheOperationHandler->>StandardCacheHandler: get_buffer(key)
    StandardCacheHandler->>FileBackend: get_buffer(key)
    FileBackend->>FileBackend: open O_NOFOLLOW, validate header+TTL
    FileBackend->>FileBackend: mmap file read-only
    FileBackend-->>StandardCacheHandler: _MmapHandle
    StandardCacheHandler-->>CacheOperationHandler: BufferHandle
    CacheOperationHandler->>CacheSerializationHandler: deserialize_data(handle.view)
    CacheOperationHandler->>CacheOperationHandler: handle.close() [finally]
    CacheOperationHandler-->>Client: deserialized DataFrame
  else mmap not eligible or get_buffer() → None
    CacheOperationHandler->>StandardCacheHandler: get(key)
    StandardCacheHandler->>FileBackend: get(key)
    FileBackend-->>StandardCacheHandler: bytes
    StandardCacheHandler-->>CacheOperationHandler: bytes
    CacheOperationHandler->>CacheSerializationHandler: deserialize_data(bytes)
    CacheOperationHandler-->>Client: deserialized value
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cachekit-io/cachekit-py#152: Both PRs touch the Arrow read/deserialisation path around memoryview-backed data (main PR's mmap/buffer fast path passes handle.view into deserialize_data, and retrieved PR updates ArrowSerializer.deserialize to correctly sniff/validate Arrow IPC from a memoryview), so the changes are code-level compatible.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarises the main change: implementing a zero-copy mmap read path for large plaintext Arrow data on the File backend, directly addressing issue #171.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering what was built, why it was needed, how it works, scope/safety decisions, testing approach, and test results. It follows the spirit of the template's key sections.
Linked Issues check ✅ Passed The PR implements the primary objective (#2 from #171): FileBackend.get_buffer() provides mmap-backed zero-copy reads for POSIX plaintext Arrow returning pandas. Measured RSS improvement (~0.1 MB vs ~157 MB for 150 MB values) confirms the core win. Sub-tasks #3#6 remain open, as noted in the PR.
Out of Scope Changes check ✅ Passed All changes align with the stated objectives: protocol definitions, mmap implementation, serialisation handling, L1Cache type-checking, documentation updates, and comprehensive testing. The .secrets.baseline update is housekeeping unrelated to scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/file-mmap-read-path

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/test_mmap_read_path.py`:
- Around line 41-56: The test_arrow_return_format_not_eligible method mutates
the process-wide cached ArrowSerializer instance's return_format property, which
could cause race conditions or flakiness if pytest runs tests in parallel or if
other tests depend on the serializer's default state. While the finally block in
the method correctly restores the original value, you should add isolation
safeguards: either mark the test with a pytest fixture or decorator to prevent
parallel execution with other tests that use the ArrowSerializer, or add
additional teardown cleanup after the finally block to ensure the serializer
cache is in a consistent state for other tests. This prevents the mutation
window between the serializer.return_format assignment and restoration from
causing test flakiness.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebbbbd1c-3973-48d3-8fc5-dc537df2f375

📥 Commits

Reviewing files that changed from the base of the PR and between 0901732 and 1f17532.

📒 Files selected for processing (11)
  • .secrets.baseline
  • src/cachekit/backends/base.py
  • src/cachekit/backends/file/backend.py
  • src/cachekit/cache_handler.py
  • src/cachekit/config/settings.py
  • src/cachekit/l1_cache.py
  • src/cachekit/serializers/arrow_serializer.py
  • tests/performance/test_file_backend_perf.py
  • tests/unit/backends/test_file_backend.py
  • tests/unit/test_l1_memory_bounds.py
  • tests/unit/test_mmap_read_path.py

Comment thread tests/unit/test_mmap_read_path.py
- Remove the speculative mmap branch from get_cached_value_async: it has no
  caller (the async decorator path inlines get_async), so it was dead code and
  the only thing dragging diff coverage below the 80% patch gate. YAGNI.
- get_buffer branch tests: corrupt magic, empty payload, truncated file,
  unexpected OSError -> BackendError, and the rename-under-mmap invariant.
- StandardCacheHandler.get_buffer error-path tests (BackendError + generic).
- pragma: no cover on the genuinely untestable-on-Linux defensive branches
  (POSIX guard, rare open errors, idempotent-close excepts, mid-map cleanup).
- Test the arrow return_format gate with a throwaway serializer instead of
  mutating the process-wide cached instance (addresses CodeRabbit nit).
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 17, 2026
…very run

Run the deterministic `performance and slow` tests in CI: the #171 mmap
zero-copy read (RSS stays flat) and the #152 large-object allocation caps.
They were excluded by the blanket `-m "not slow"` filter despite being
reproducible resource invariants, not wall-clock benchmarks. ~5s, run once
on 3.12 (version-independent). Flaky timing/throughput benchmarks stay
`performance` WITHOUT `slow` and remain out of CI.
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 17, 2026
The autouse redis-isolation fixture (tests/conftest.py) spawns a local
redis-server binary unless REDIS_URL is set; the CI runner has no such binary
(it uses dockerized Redis), so the step ERRORed at fixture setup. Mirror the
unit/critical steps and point it at the external Redis.
@27Bslash6 27Bslash6 merged commit 1105454 into main Jun 18, 2026
32 checks passed
@27Bslash6 27Bslash6 deleted the feat/file-mmap-read-path branch June 18, 2026 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EPIC: Zero-copy mmap read path for the File backend (large-object read RSS)

1 participant