Skip to content

feat: add commit timeout to CommitBuilder#6773

Draft
wjones127 wants to merge 8 commits into
lance-format:mainfrom
wjones127:feat/commit-builder-timeout
Draft

feat: add commit timeout to CommitBuilder#6773
wjones127 wants to merge 8 commits into
lance-format:mainfrom
wjones127:feat/commit-builder-timeout

Conversation

@wjones127
Copy link
Copy Markdown
Contributor

Previously, CommitBuilder::execute / execute_batch could block
indefinitely if the underlying object store hung (we observed a 3-hour
wait in production). This adds a configurable timeout that defaults to
5 minutes:

  • Rust: CommitBuilder::with_timeout(Option<Duration>). None disables
    the timeout; Some(Duration::ZERO) is rejected with InvalidInput.
  • Python: LanceDataset.commit(..., commit_timeout=timedelta(minutes=5)),
    pass None to disable.
  • Java: new CommitBuilder(ds).commitTimeout(Duration.ofMinutes(5)),
    pass null to disable.

Fixes #6747

@github-actions github-actions Bot added enhancement New feature or request python java labels May 13, 2026
wjones127 added 7 commits May 14, 2026 13:16
Long-running commits (e.g. a stuck remote object store) could previously
block indefinitely. CommitBuilder now applies a 5 minute timeout to
execute() / execute_batch() by default. Use `with_timeout(None)` to
disable, or `with_timeout(Some(duration))` to override. Exposed in the
Python (`commit_timeout: timedelta`) and Java (`commitTimeout(Duration)`)
bindings.

Fixes lance-format#6747
- Java `commitTimeoutMillis` now defaults to 5 minutes directly (primitive
  `long`), so JNI sees only two states: `< 0` disables, `>= 0` is millis.
  Removes the prior three-state boxed-`Long` encoding and the dependency
  on Rust's `DEFAULT_COMMIT_TIMEOUT` from the JNI layer.
- Add Rust tests pinning the 5-minute default and exercising the timeout
  through `execute_batch`.
- Add Python and Java tests that verify the timeout actually fires (Python
  via a sleeping `commit_lock`, Java via a 1-nanosecond timeout).
- Document that the timeout bounds the whole execute()/execute_batch()
  call including retries, and that an elapsed timeout returns Error::IO.
- Comment why timeout failures are mapped to Error::IO.
- Pin the Python commit_timeout default to a named constant that mirrors
  Rust's DEFAULT_COMMIT_TIMEOUT.
- Tighten the Python tests: assert the exact exception class for a zero
  duration and add a negative-duration case.
A commit timeout was previously reported as Error::IO, which is
misleading — a timeout is not an I/O failure. Add a dedicated
Error::Timeout variant and use it for the commit timeout.

- Rust: new Error::Timeout { message, location } in lance-core.
- Python: infer_error maps it to TimeoutError; the commit entry points
  now route through infer_error instead of hardcoding PyIOError, so
  invalid input also surfaces as ValueError instead of OSError.
- Java: new unchecked org.lance.LanceTimeoutException; the JNI error
  conversion maps Error::Timeout to it.

The retry-loop timeout in retry.rs still maps to TooMuchWriteContention;
migrating it is left as a deliberate follow-up.
The test previously committed through an in-memory store with
with_timeout(None) and asserted success — indistinguishable from a
commit using the default timeout. Run it through a throttled store so
the commit takes real wall-clock time, proving None lets a commit
complete that a small timeout would have killed.
Wrapping execute_inner in tokio::time::Timeout deepened the commit
future's type enough that downstream async fns awaiting CommitBuilder
::execute (e.g. in lance-namespace-impls) overflowed rustc's layout
-query depth limit. Box the inner future so the recursion stops at the
box pointer.
…tests

Switching the commit entry points to infer_error changed the exception
type for every commit error, breaking tests that expect OSError for
commit conflicts and unsupported ops. Add a scoped io_or_timeout_error
helper that keeps the historical PyIOError default and only routes
Error::Timeout to TimeoutError.

The Python timeout-fires test used a commit_lock that slept 2s, which
blocked the tokio executor thread so the timer could never fire. Drop
the lock and use a 1us timeout on a normal commit instead — the commit
yields at its first object-store write, after which the elapsed timer
fires. Fold it into test_commit_timeout.

Also apply spotless formatting to the Java sources.
@wjones127 wjones127 force-pushed the feat/commit-builder-timeout branch from d67eff2 to cd431f6 Compare May 14, 2026 20:18
The JNI boundary passed the timeout as milliseconds, so
Duration.ofNanos(1).toMillis() truncated to 0 — a valid positive
duration silently became a zero timeout, which Rust then rejected as
InvalidInput. Pass nanoseconds instead, matching Rust's Duration and
Python's timedelta precision, and fixing the testTimeoutFires case.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add global timeout to CommitBuilder

1 participant