feat: Support dynamic batchsize based on query size#696

Closed
ad-claw000 wants to merge 1 commit into develop from fix/issue-328

Conversation

@ad-claw000

Contributor

Summary

Allows ParallelQuery and ParallelLoader to dynamically adjust the batch size based on the estimated size in bytes of the query plus its blobs. This prevents rejection of oversized payloads and lets users limit memory consumption per batch.

Verification

Tests that use ParallelQuery and ParallelLoader complete correctly, and the dynamic batching logs show when batch limits are hit. The feature was also tested against large blobs, where the query size was estimated correctly.

Fixes #328.

This adds a `max_bytes_per_batch` parameter to ParallelQuery and
ParallelLoader, allowing the worker to automatically adjust the batch
size (the number of items in a batch) to ensure the total size of the
batch's encoded items stays under the limit, avoiding oversized payload
rejections.
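
A minimal sketch of the batching idea described above, not the PR's actual implementation. It assumes each item is a `(commands, blobs)` tuple and counts blob bytes only; `split_by_bytes` and the `Item` alias are hypothetical names used for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Item = Tuple[list, List[bytes]]  # assumed item shape: (commands, blobs)


def split_by_bytes(items: Iterable[Item], max_bytes: int) -> Iterator[List[Item]]:
    """Yield batches whose summed blob sizes stay at or under max_bytes.

    Hypothetical helper: an item larger than max_bytes is still yielded,
    alone, in its own batch.
    """
    batch: List[Item] = []
    batch_bytes = 0
    for item in items:
        item_bytes = sum(len(blob) for blob in item[1])
        # Flush the current batch before this item would push it over the limit.
        if batch and batch_bytes + item_bytes > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(item)
        batch_bytes += item_bytes
    if batch:
        yield batch  # remainder that never reached the limit


# Five items of 40 blob bytes each, with a 100-byte limit per batch.
items = [([{"AddImage": {}}], [b"x" * 40]) for _ in range(5)]
batch_sizes = [len(b) for b in split_by_bytes(items, max_bytes=100)]
print(batch_sizes)
```

With these inputs two items fit per batch, so the last batch holds the single remaining item.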
Copilot AI review requested due to automatic review settings May 19, 2026 22:57

Copilot AI left a comment

Contributor

Pull request overview

This PR adds an optional byte-based dynamic batching mode to ParallelQuery / ParallelLoader so batches can be sized based on an estimated payload size (intended to avoid oversized encoded-query rejections and reduce per-batch memory use), addressing #328.

Changes:

  • Add max_bytes_per_batch parameter to ParallelQuery.query() and ParallelLoader.ingest().
  • Implement dynamic batch formation in ParallelQuery.worker() when max_bytes_per_batch is set.
  • Document the new loader parameter and forward it from ParallelLoader.ingest() into ParallelQuery.query().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| aperturedb/ParallelQuery.py | Adds max_bytes_per_batch option and dynamic byte-based batch splitting in the worker loop. |
| aperturedb/ParallelLoader.py | Exposes max_bytes_per_batch on ingest() and forwards it into query(). |
Comments suppressed due to low confidence (5)

aperturedb/ParallelQuery.py:320

  • The function docstring is no longer the first statement in query() because self.max_bytes_per_batch = ... comes before the triple-quoted string. This prevents Python from treating it as the method docstring (e.g., help()/Sphinx won’t pick it up). Move the assignment below the docstring (or set the attribute later) so the docstring remains effective.
    def query(self, generator, batchsize: int = 1, numthreads: int = 4, stats: bool = False, max_bytes_per_batch: int = None) -> None:
        self.max_bytes_per_batch = max_bytes_per_batch
        """

aperturedb/ParallelQuery.py:242

  • The byte estimation for dynamic batching only sums blob sizes; it does not include the serialized command JSON size (and related encoding overhead). Since the goal is to keep the encoded query + blobs under a limit, this can still produce oversized payloads for large/complex command bodies. Consider adding an estimate for item[0] (e.g., JSON-encoded byte length) to item_bytes.
                for blob in item[1]:
                    if isinstance(blob, bytes):
                        item_bytes += len(blob)
                    else:
                        item_bytes += len(str(blob))
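
One possible way to fold the command size into the estimate, as suggested above. This is a sketch under assumptions: `estimate_item_bytes` is a hypothetical helper, the commands in `item[0]` are assumed to be JSON-serializable, and the JSON length is only an approximation of what the client actually sends on the wire.

```python
import json


def estimate_item_bytes(item) -> int:
    """Estimate the payload size of one (commands, blobs) item in bytes."""
    commands, blobs = item[0], item[1]
    # Approximate the command body by its UTF-8 encoded JSON length.
    total = len(json.dumps(commands).encode("utf-8"))
    for blob in blobs:
        if isinstance(blob, (bytes, bytearray)):
            total += len(blob)
        else:
            # Encode first so multi-byte characters count as bytes, not characters.
            total += len(str(blob).encode("utf-8"))
    return total


commands = [{"FindImage": {"blobs": True}}]
estimate = estimate_item_bytes((commands, [b"abcd", "é"]))
print(estimate)
```

Note that `len(str(blob))` counts characters, so encoding before measuring gives a closer byte count for non-ASCII strings.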

aperturedb/ParallelQuery.py:245

  • Edge case: if a single item’s estimated size exceeds max_bytes_per_batch, the current logic will still enqueue it into an empty batch and send a payload above the limit (likely still rejected). Handle this explicitly (e.g., send the item alone with a warning, or raise a clear error advising to increase the limit / reduce blob size).
                if len(current_batch) > 0 and (current_bytes + item_bytes > max_bytes):
                    try:
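
A hedged sketch of how the oversized-item case could be made explicit. `check_item_fits` and its `strict` flag are hypothetical and not part of the PR; the sketch either warns so the caller can send the item alone, or raises a clear error.

```python
import logging

logger = logging.getLogger(__name__)


def check_item_fits(item_bytes: int, max_bytes: int, strict: bool = False) -> bool:
    """Return True if the item fits under the limit.

    If it does not fit, raise ValueError in strict mode; otherwise log a
    warning and return False so the caller can send the item on its own.
    """
    if item_bytes <= max_bytes:
        return True
    message = (
        f"Item of {item_bytes} bytes exceeds max_bytes_per_batch={max_bytes}; "
        "increase the limit or reduce the blob size."
    )
    if strict:
        raise ValueError(message)
    logger.warning("%s Sending it alone in its own batch.", message)
    return False


fits = check_item_fits(10, 100)
oversized = check_item_fits(200, 100)
```

Whether to warn or raise is a design choice: warning keeps ingestion running, while raising surfaces the problem before the server rejects the payload.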

aperturedb/ParallelQuery.py:319

  • max_bytes_per_batch is accepted by ParallelQuery.query(), but in the use_dask path it is not forwarded into the per-partition loader.query(...) calls in DaskManager.run(), so dynamic byte-based batching is effectively disabled when generator.use_dask is true. Consider plumbing the parameter through the Dask execution path as well (or documenting that it’s not supported under Dask).
        self.max_bytes_per_batch = max_bytes_per_batch

aperturedb/ParallelQuery.py:225

  • New behavior (max_bytes_per_batch dynamic batching) is not covered by tests. Since test/test_Parallel.py and test/test_ResponseHandler.py already exercise ParallelQuery.query(), please add a unit test that sets a small max_bytes_per_batch and asserts the work is split into multiple do_batch calls (and that remainder handling works).
        if max_bytes is not None and max_bytes > 0:
            logger.info(
                f"Worker {thid} executing dynamically sized batches (max {max_bytes} bytes), {self.stats=}")

Comment on lines +223 to +227
        if max_bytes is not None and max_bytes > 0:
            logger.info(
                f"Worker {thid} executing dynamically sized batches (max {max_bytes} bytes), {self.stats=}")
            current_batch = []
            current_bytes = 0

Comment on lines 173 to +176
            batchsize (int, optional): The size of batch to be used. Defaults to 1.
            numthreads (int, optional): Number of workers to create. Defaults to 4.
            stats (bool, optional): If stats need to be presented, realtime. Defaults to False.
            max_bytes_per_batch (int, optional): Automatic batch sizing to keep encoded queries under this limit.

@luisremis

Contributor

duplicated with #669

@luisremis luisremis closed this May 19, 2026
@luisremis luisremis deleted the fix/issue-328 branch May 19, 2026 23:06
Development

Successfully merging this pull request may close these issues.

ParallelQuery batchsize cannot adjust for query size

3 participants