feat: Support dynamic batchsize based on query size by ad-claw000 · Pull Request #696 · aperture-data/aperturedb-python

ad-claw000 · 2026-05-19T22:57:30Z

Summary

Allows ParallelQuery and ParallelLoader to dynamically adjust batch size based on the estimated bytes of the query + blobs to prevent large payload rejection and to allow user to limit memory consumption per batch.

Verification

Tests using ParallelQuery and ParallelLoader complete correctly and dynamic batching logs show when batch limits are hit. The feature is tested against large blobs estimating query bounds correctly.

Fixes #328.

This adds a `max_bytes_per_batch` parameter to ParallelQuery and ParallelLoader, allowing the worker to automatically adjust the batch size (the number of items in a batch) to ensure the total size of the batch's encoded items stays under the limit, avoiding oversized payload rejections. Fixes #328

Copilot

Pull request overview

This PR adds an optional byte-based dynamic batching mode to ParallelQuery / ParallelLoader so batches can be sized based on an estimated payload size (intended to avoid oversized encoded-query rejections and reduce per-batch memory use), addressing #328.

Changes:

Add max_bytes_per_batch parameter to ParallelQuery.query() and ParallelLoader.ingest().
Implement dynamic batch formation in ParallelQuery.worker() when max_bytes_per_batch is set.
Document the new loader parameter and forward it from ParallelLoader.ingest() into ParallelQuery.query().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
aperturedb/ParallelQuery.py	Adds `max_bytes_per_batch` option and dynamic byte-based batch splitting in the worker loop.
aperturedb/ParallelLoader.py	Exposes `max_bytes_per_batch` on `ingest()` and forwards it into `query()`.

Comments suppressed due to low confidence (5)

aperturedb/ParallelQuery.py:320

The function docstring is no longer the first statement in query() because self.max_bytes_per_batch = ... comes before the triple-quoted string. This prevents Python from treating it as the method docstring (e.g., help()/Sphinx won’t pick it up). Move the assignment below the docstring (or set the attribute later) so the docstring remains effective.

    def query(self, generator, batchsize: int = 1, numthreads: int = 4, stats: bool = False, max_bytes_per_batch: int = None) -> None:
        self.max_bytes_per_batch = max_bytes_per_batch
        """

aperturedb/ParallelQuery.py:242

The byte estimation for dynamic batching only sums blob sizes; it does not include the serialized command JSON size (and related encoding overhead). Since the goal is to keep the encoded query + blobs under a limit, this can still produce oversized payloads for large/complex command bodies. Consider adding an estimate for item[0] (e.g., JSON-encoded byte length) to item_bytes.

                for blob in item[1]:
                    if isinstance(blob, bytes):
                        item_bytes += len(blob)
                    else:
                        item_bytes += len(str(blob))

aperturedb/ParallelQuery.py:245

Edge case: if a single item’s estimated size exceeds max_bytes_per_batch, the current logic will still enqueue it into an empty batch and send a payload above the limit (likely still rejected). Handle this explicitly (e.g., send the item alone with a warning, or raise a clear error advising to increase the limit / reduce blob size).

                if len(current_batch) > 0 and (current_bytes + item_bytes > max_bytes):
                    try:

aperturedb/ParallelQuery.py:319

max_bytes_per_batch is accepted by ParallelQuery.query(), but in the use_dask path it is not forwarded into the per-partition loader.query(...) calls in DaskManager.run(), so dynamic byte-based batching is effectively disabled when generator.use_dask is true. Consider plumbing the parameter through the Dask execution path as well (or documenting that it’s not supported under Dask).

        self.max_bytes_per_batch = max_bytes_per_batch

aperturedb/ParallelQuery.py:225

New behavior (max_bytes_per_batch dynamic batching) is not covered by tests. Since test/test_Parallel.py and test/test_ResponseHandler.py already exercise ParallelQuery.query(), please add a unit test that sets a small max_bytes_per_batch and asserts the work is split into multiple do_batch calls (and that remainder handling works).

        if max_bytes is not None and max_bytes > 0:
            logger.info(
                f"Worker {thid} executing dynamically sized batches (max {max_bytes} bytes), {self.stats=}")

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        if max_bytes is not None and max_bytes > 0:
+            logger.info(
+                f"Worker {thid} executing dynamically sized batches (max {max_bytes} bytes), {self.stats=}")
+            current_batch = []
+            current_bytes = 0


            batchsize (int, optional): The size of batch to be used. Defaults to 1.
            numthreads (int, optional): Number of workers to create. Defaults to 4.
            stats (bool, optional): If stats need to be presented, realtime. Defaults to False.
+            max_bytes_per_batch (int, optional): Automatic batch sizing to keep encoded queries under this limit.


luisremis · 2026-05-19T23:06:51Z

duplicated with #669

Copilot AI review requested due to automatic review settings May 19, 2026 22:57

Copilot started reviewing on behalf of ad-claw000 May 19, 2026 22:58 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

luisremis closed this May 19, 2026

luisremis deleted the fix/issue-328 branch May 19, 2026 23:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Support dynamic batchsize based on query size#696

feat: Support dynamic batchsize based on query size#696
ad-claw000 wants to merge 1 commit into
developfrom
fix/issue-328

ad-claw000 commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

luisremis commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ad-claw000 commented May 19, 2026

Summary

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

luisremis commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants