feat: zero-copy mmap read path for large plaintext Arrow on the File backend (#171)#187
Conversation
…backend (#171) The "memory-mapped reads" advertised for uncompressed Arrow never existed: the File backend os.read the whole file onto the heap. This builds the real path. FileBackend.get_buffer() memory-maps a value (POSIX) and returns a borrowed zero-copy view of the payload. get_cached_value(_async) takes this fast path for plaintext Arrow that returns pandas, deserializing straight over the mmap (pa.py_buffer is zero-copy) and closing the handle in a finally. Measured: a 150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via os.read. Scope and safety (from the design review): - POSIX only. Windows pins mapped files against rename/unlink, which would break the atomic write-then-rename set() path; Windows keeps the os.read path. - Plaintext + pandas only. Encrypted values can never mmap (AES-GCM decrypt owns its buffer); a pyarrow.Table return aliases the mapping (use-after-free on close). - Security: the fd is opened O_NOFOLLOW and the header validated before mapping; we never use a path-based mmap (it would follow an attacker-swapped symlink). - The mmap is confined to the deserialize call frame and never reaches L1; a belt-and-suspenders guard in L1Cache.put rejects non-bytes so any future regression fails loud instead of pinning a mapped inode for the entry's TTL. - Read-side size ceiling (512 MB) independent of max_value_mb; larger files fall back to os.read. The eager xxHash3 integrity check is kept (it faults into reclaimable page cache, not heap, so the steady-state RSS win survives). No protocol/Rust/TS/SaaS change. Compression stays a write-side concern (Arrow IPC self-describes on read), so no serializer-wiring change was needed.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughImplements a POSIX-only zero-copy mmap read fast-path for the File backend. Introduces ChangesZero-copy mmap read path for FileBackend
Sequence Diagram(s)sequenceDiagram
participant Client
participant CacheOperationHandler
participant CacheSerializationHandler
participant StandardCacheHandler
participant FileBackend
Client->>CacheOperationHandler: get_cached_value(key)
CacheOperationHandler->>CacheSerializationHandler: supports_mmap_read()
alt plaintext Arrow→pandas eligible
CacheOperationHandler->>StandardCacheHandler: get_buffer(key)
StandardCacheHandler->>FileBackend: get_buffer(key)
FileBackend->>FileBackend: open O_NOFOLLOW, validate header+TTL
FileBackend->>FileBackend: mmap file read-only
FileBackend-->>StandardCacheHandler: _MmapHandle
StandardCacheHandler-->>CacheOperationHandler: BufferHandle
CacheOperationHandler->>CacheSerializationHandler: deserialize_data(handle.view)
CacheOperationHandler->>CacheOperationHandler: handle.close() [finally]
CacheOperationHandler-->>Client: deserialized DataFrame
else mmap not eligible or get_buffer() → None
CacheOperationHandler->>StandardCacheHandler: get(key)
StandardCacheHandler->>FileBackend: get(key)
FileBackend-->>StandardCacheHandler: bytes
StandardCacheHandler-->>CacheOperationHandler: bytes
CacheOperationHandler->>CacheSerializationHandler: deserialize_data(bytes)
CacheOperationHandler-->>Client: deserialized value
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/unit/test_mmap_read_path.py`:
- Around line 41-56: The test_arrow_return_format_not_eligible method mutates
the process-wide cached ArrowSerializer instance's return_format property, which
could cause race conditions or flakiness if pytest runs tests in parallel or if
other tests depend on the serializer's default state. While the finally block in
the method correctly restores the original value, you should add isolation
safeguards: either mark the test with a pytest fixture or decorator to prevent
parallel execution with other tests that use the ArrowSerializer, or add
additional teardown cleanup after the finally block to ensure the serializer
cache is in a consistent state for other tests. This prevents the mutation
window between the serializer.return_format assignment and restoration from
causing test flakiness.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ebbbbd1c-3973-48d3-8fc5-dc537df2f375
📒 Files selected for processing (11)
.secrets.baselinesrc/cachekit/backends/base.pysrc/cachekit/backends/file/backend.pysrc/cachekit/cache_handler.pysrc/cachekit/config/settings.pysrc/cachekit/l1_cache.pysrc/cachekit/serializers/arrow_serializer.pytests/performance/test_file_backend_perf.pytests/unit/backends/test_file_backend.pytests/unit/test_l1_memory_bounds.pytests/unit/test_mmap_read_path.py
- Remove the speculative mmap branch from get_cached_value_async: it has no caller (the async decorator path inlines get_async), so it was dead code and the only thing dragging diff coverage below the 80% patch gate. YAGNI. - get_buffer branch tests: corrupt magic, empty payload, truncated file, unexpected OSError -> BackendError, and the rename-under-mmap invariant. - StandardCacheHandler.get_buffer error-path tests (BackendError + generic). - pragma: no cover on the genuinely untestable-on-Linux defensive branches (POSIX guard, rare open errors, idempotent-close excepts, mid-map cleanup). - Test the arrow return_format gate with a throwaway serializer instead of mutating the process-wide cached instance (addresses CodeRabbit nit).
…very run Run the deterministic `performance and slow` tests in CI: the #171 mmap zero-copy read (RSS stays flat) and the #152 large-object allocation caps. They were excluded by the blanket `-m "not slow"` filter despite being reproducible resource invariants, not wall-clock benchmarks. ~5s, run once on 3.12 (version-independent). Flaky timing/throughput benchmarks stay `performance` WITHOUT `slow` and remain out of CI.
The autouse redis-isolation fixture (tests/conftest.py) spawns a local redis-server binary unless REDIS_URL is set; the CI runner has no such binary (it uses dockerized Redis), so the step ERRORed at fixture setup. Mirror the unit/critical steps and point it at the external Redis.
What
Builds the real zero-copy memory-mapped read path for the File backend: large plaintext Arrow DataFrames are read via
mmapinstead ofos.read-ing the whole file onto the heap.Closes #171 (the epic). Built on #162 (unwrap→memoryview, already merged) and shaped by a multi-agent design review.
Why
The "memory-mapped reads" advertised for uncompressed Arrow never existed —
FileBackend.get()os.reads the entire file then slices it (full heap copy). Settingcompression=noneonly inflated the payload with no read-RSS benefit. This delivers the win the docs promised.Measured (perf test): a 150 MB value reads at ~0.1 MB RSS via mmap vs ~157 MB via
os.read.How
FileBackend.get_buffer(key)(POSIX) maps the file and returns a borrowed_MmapHandle— a zero-copymemoryviewof the payload past the 14-byte header.get_cached_value(_async)takes this fast path when the value is plaintext Arrow returning pandas, deserializes straight over the mmap (pa.py_bufferis zero-copy), and closes the handle in afinallyso the mapping never escapes the call frame.Scope & safety (deliberately narrow)
set()); it keeps theos.readpath.pyarrow.Tablereturn aliases the mapping → use-after-free on close.O_NOFOLLOW, header validated before mapping. Never a path-based mmap (would follow an attacker-swapped symlink).L1Cache.putnow rejects non-bytesloudly, so a future regression can't silently pin a mapped inode for the entry's TTL.max_value_mb); larger files fall back toos.read.set()stays rename-replace-only; a regression test asserts an open mmap sees the old snapshot while a concurrentset()lands.Compression stays a write-side concern (Arrow IPC self-describes on read), so no serializer-wiring change was needed — the over-engineered "per-backend serializer" step from the original epic was cut.
Tests
get_bufferunit tests (roundtrip, mmap-backed view, missing/oversized/expired→None, idempotent close-releases-view, rename-under-mmap invariant).get_bufferdelegation, theget_cached_valuebranch confines + closes the handle, real Arrow DataFrame round-trips through the actual mmap).L1Cache.putnon-bytes rejection.Gates: 1846 passed (unit + critical + architecture), basedpyright 0 errors, ruff clean, doctests pass. No protocol/Rust/TS/SaaS change.
Summary by CodeRabbit
bytespayloads with aTypeError.