feat: add commit timeout to CommitBuilder#6773
Draft
wjones127 wants to merge 8 commits into
Draft
Conversation
Long-running commits (e.g. a stuck remote object store) could previously block indefinitely. CommitBuilder now applies a 5 minute timeout to execute() / execute_batch() by default. Use `with_timeout(None)` to disable, or `with_timeout(Some(duration))` to override. Exposed in the Python (`commit_timeout: timedelta`) and Java (`commitTimeout(Duration)`) bindings. Fixes lance-format#6747
- Java `commitTimeoutMillis` now defaults to 5 minutes directly (primitive `long`), so JNI sees only two states: `< 0` disables, `>= 0` is millis. Removes the prior three-state boxed-`Long` encoding and the dependency on Rust's `DEFAULT_COMMIT_TIMEOUT` from the JNI layer. - Add Rust tests pinning the 5-minute default and exercising the timeout through `execute_batch`. - Add Python and Java tests that verify the timeout actually fires (Python via a sleeping `commit_lock`, Java via a 1-nanosecond timeout).
- Document that the timeout bounds the whole execute()/execute_batch() call including retries, and that an elapsed timeout returns Error::IO. - Comment why timeout failures are mapped to Error::IO. - Pin the Python commit_timeout default to a named constant that mirrors Rust's DEFAULT_COMMIT_TIMEOUT. - Tighten the Python tests: assert the exact exception class for a zero duration and add a negative-duration case.
A commit timeout was previously reported as Error::IO, which is
misleading — a timeout is not an I/O failure. Add a dedicated
Error::Timeout variant and use it for the commit timeout.
- Rust: new Error::Timeout { message, location } in lance-core.
- Python: infer_error maps it to TimeoutError; the commit entry points
now route through infer_error instead of hardcoding PyIOError, so
invalid input also surfaces as ValueError instead of OSError.
- Java: new unchecked org.lance.LanceTimeoutException; the JNI error
conversion maps Error::Timeout to it.
The retry-loop timeout in retry.rs still maps to TooMuchWriteContention;
migrating it is left as a deliberate follow-up.
The test previously committed through an in-memory store with with_timeout(None) and asserted success — indistinguishable from a commit using the default timeout. Run it through a throttled store so the commit takes real wall-clock time, proving None lets a commit complete that a small timeout would have killed.
Wrapping execute_inner in tokio::time::Timeout deepened the commit future's type enough that downstream async fns awaiting CommitBuilder ::execute (e.g. in lance-namespace-impls) overflowed rustc's layout -query depth limit. Box the inner future so the recursion stops at the box pointer.
…tests Switching the commit entry points to infer_error changed the exception type for every commit error, breaking tests that expect OSError for commit conflicts and unsupported ops. Add a scoped io_or_timeout_error helper that keeps the historical PyIOError default and only routes Error::Timeout to TimeoutError. The Python timeout-fires test used a commit_lock that slept 2s, which blocked the tokio executor thread so the timer could never fire. Drop the lock and use a 1us timeout on a normal commit instead — the commit yields at its first object-store write, after which the elapsed timer fires. Fold it into test_commit_timeout. Also apply spotless formatting to the Java sources.
d67eff2 to
cd431f6
Compare
The JNI boundary passed the timeout as milliseconds, so Duration.ofNanos(1).toMillis() truncated to 0 — a valid positive duration silently became a zero timeout, which Rust then rejected as InvalidInput. Pass nanoseconds instead, matching Rust's Duration and Python's timedelta precision, and fixing the testTimeoutFires case.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Previously,
CommitBuilder::execute/execute_batchcould blockindefinitely if the underlying object store hung (we observed a 3-hour
wait in production). This adds a configurable timeout that defaults to
5 minutes:
CommitBuilder::with_timeout(Option<Duration>).Nonedisablesthe timeout;
Some(Duration::ZERO)is rejected withInvalidInput.LanceDataset.commit(..., commit_timeout=timedelta(minutes=5)),pass
Noneto disable.new CommitBuilder(ds).commitTimeout(Duration.ofMinutes(5)),pass
nullto disable.Fixes #6747