Skip to content

feat(odin): Implement BoundedReader for safe current-day file streaming#543

Open
vikramlc-cognite wants to merge 5 commits into
masterfrom
EDGE-607-implement-bounded-reader
Open

feat(odin): Implement BoundedReader for safe current-day file streaming#543
vikramlc-cognite wants to merge 5 commits into
masterfrom
EDGE-607-implement-bounded-reader

Conversation

@vikramlc-cognite

Copy link
Copy Markdown
Contributor

Summary

Adds BoundedReader, a lightweight BinaryIO wrapper that gates reads at a byte count captured at snapshot time, enabling safe streaming of the active log file (file.log) to CDF Files without risking an HTTP Content-Length mismatch as the logger appends new lines.


Type of change

  • New feature (non-breaking change that adds functionality)

What changed

  • cognite/extractorutils/unstable/core/_bounded_reader.py (new): BoundedReader class with:
    • init(f: BinaryIO, max_bytes: int) - stores the snapshot size and a mutable remaining counter
    • len() -> int - returns the fixed snapshot size; requests.utils.super_len() reads len before os.fstat(), so this bypasses the live inode size and declares the correct Content-Length
    • read(size: int = -1) -> bytes - caps each read at remaining; decrements by len(data) (not size) to stay accurate if the underlying file returns fewer bytes than requested; returns b"" once exhausted
    • Context manager (enter / exit) - delegates close() to the underlying file handle
  • tests/test_unstable/test_bounded_reader.py (new): 9 tests covering all behaviours above

Why it changed

The active log file (file.log) is written by TimedRotatingFileHandler while the action runs. Passing a plain open(log_path, "rb") to IOFileUploadQueue is unsafe:

  1. super_len(file) falls through to os.fstat(file.fileno()).st_size - the live inode size at that instant
  2. IOByteStream then reads until read() returns b"", which for a growing file means it reads past the declared length
  3. httpx raises LocalProtocolError: "Too much data for declared Content-Length" - hard upload failure

BoundedReader eliminates this by declaring the snapshot size via len and hard-stopping reads at that boundary. No buffering, no copies - the only overhead is one integer decrement per read() call.

Related issue

EDGE-183 / EDGE-607

Related docs

BULK_LOG_UPLOAD_ACTION_PLAN - "Current Day File Handling" section


What to focus on during review

  • len vs _remaining: len intentionally returns _size (the fixed snapshot), not _remaining. If it returned _remaining, a super_len() call after a partial read would declare a wrong (lower) Content-Length. The test_len_returns_snapshot_size_and_is_stable_after_reads test pins this invariant.
  • _remaining -= len(data) not _remaining -= to_read: If the underlying file returns fewer bytes than requested (file shorter than snapshot - midnight rotation race), the counter must track actual bytes read, not requested bytes. Decrementing by len(data) keeps it accurate.
  • No inheritance from BinaryIO / RawIOBase: Intentional. ChunkedStream documents the mypy incompatibility between RawIOBase and BinaryIO.
    BoundedReader only needs to satisfy the upload path's interface: len, read(), and the context manager. Adding inheritance would pull in abstract methods we don't need.
  • Midnight rotation race (see test): If TimedRotatingFileHandler rotates between the os.path.getsize() snapshot and open(), the new file.log is empty. BoundedReader.read() returns b"" immediately while len still reports the snapshot - the upload layer sees "too little data" and marks the file failed. This is the documented behaviour; the caller retries.

Test evidence

$ python -m pytest tests/test_unstable/test_bounded_reader.py -v

tests/test_unstable/test_bounded_reader.py::test_len_returns_snapshot_size_and_is_stable_after_reads PASSED
tests/test_unstable/test_bounded_reader.py::test_read_respects_snapshot_bound[read_all] PASSED
tests/test_unstable/test_bounded_reader.py::test_read_respects_snapshot_bound[read_partial] PASSED
tests/test_unstable/test_bounded_reader.py::test_read_respects_snapshot_bound[read_exceeds_snapshot] PASSED
tests/test_unstable/test_bounded_reader.py::test_read_returns_empty_after_exhaustion PASSED
tests/test_unstable/test_bounded_reader.py::test_zero_snapshot_returns_empty_immediately PASSED
tests/test_unstable/test_bounded_reader.py::test_context_manager_closes_underlying_file PASSED
tests/test_unstable/test_bounded_reader.py::test_super_len_reads_len_not_fstat PASSED
tests/test_unstable/test_bounded_reader.py::test_midnight_rotation_race_file_shorter_than_snapshot PASSED

9 passed in 0.03s


Risks and unknowns

  • "Too little data" midnight race: Documented in design doc. BoundedReader itself is correct - the upload failure is expected and the caller handles it by retrying. No mitigation is added here; explicit detection (switch to uploading the freshly rotated file.log.YYYY-MM-DD after rotation) is deferred to Task 4 if needed.
  • super_len lookup order: Relies on requests.utils.super_len() checking len before os.fstat(). This is stable behaviour in the requests library; test_super_len_reads_len_not_fstat will catch a regression if it ever changes.

Rollout and rollback

N/A - internal implementation class, not yet wired into any action. No config changes, no migrations.


Checklist

  • Self-reviewed the diff
  • Tests added (9 tests, all passing)
  • Docs N/A - internal private class; design rationale is in BULK_LOG_UPLOAD_ACTION_PLAN.md
  • No secrets, credentials, or PII committed
  • Breaking changes: none

@vikramlc-cognite vikramlc-cognite self-assigned this Jun 22, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces BoundedReader, a wrapper for binary file handles designed to limit reads to a snapshot byte count, preventing issues when uploading active log files that are still being appended to. The review feedback highlights that BoundedReader needs to implement tell(), closed, and close() to fully satisfy the file-like interface required by multipart uploads (e.g., when wrapped in ChunkedStream for files larger than 5 GiB) and recommends adding corresponding unit tests.

Comment thread cognite/extractorutils/unstable/core/_bounded_reader.py Outdated
Comment thread tests/test_unstable/test_bounded_reader.py
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.17%. Comparing base (3bb7bb5) to head (ea58e9b).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #543      +/-   ##
==========================================
+ Coverage   83.08%   83.17%   +0.09%     
==========================================
  Files          45       46       +1     
  Lines        4522     4548      +26     
==========================================
+ Hits         3757     3783      +26     
  Misses        765      765              
Files with missing lines Coverage Δ
...te/extractorutils/unstable/core/_bounded_reader.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vikramlc-cognite

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces BoundedReader, a utility class designed to wrap a binary file handle and limit reads to a specified snapshot size, preventing upload errors when log files are actively being appended to. The feedback suggests handling cases where the underlying stream's read method returns None to avoid a potential TypeError when calling len(data).

Comment thread cognite/extractorutils/unstable/core/_bounded_reader.py Outdated
@vikramlc-cognite vikramlc-cognite marked this pull request as ready for review June 24, 2026 09:58
@vikramlc-cognite vikramlc-cognite requested a review from a team as a code owner June 24, 2026 09:58

@Yaseen-A-Khan Yaseen-A-Khan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good, just a nit comment.

Comment thread cognite/extractorutils/unstable/core/_bounded_reader.py Outdated

@Yaseen-A-Khan Yaseen-A-Khan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vikramlc-cognite vikramlc-cognite added the waiting-for-risk-review Waiting for a member of the risk review team to take an action label Jun 29, 2026
@nithinb nithinb self-assigned this Jun 30, 2026
@nithinb nithinb added the risk-review-ongoing Risk review is in progress label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

risk-review-ongoing Risk review is in progress waiting-for-risk-review Waiting for a member of the risk review team to take an action

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants