feat: add recursive chunking strategy and batched vector insertions by abhay-2108 · Pull Request #2505 · arc53/DocsGPT

abhay-2108 · 2026-05-27T02:43:06Z

Refactor: Optimize Ingestion Pipeline via Batched Vector Insertions & Recursive Chunking

Description

This pull request addresses two limitations in the core RAG ingestion pipeline, slow ingestion and poor chunk boundaries at retrieval time, with two changes:

Ingestion Throughput Optimization: Converts high-latency sequential single-chunk vector store insertions into high-performance, batched payloads.
Retrieval Context Optimization: Replaces the legacy character-slice word-cutting layer with a semantic, boundary-aware recursive token splitter.

Both changes are designed to be fully backward compatible, use explicit Python type hints and Google-style docstrings, and pass the extended test suites listed below.

Problem & Solution Context

1. Batched Vector Store Ingestion

The Problem: Chunks were previously written to the target vector database one at a time (1 chunk per synchronous network round-trip). For large documents (500+ chunks), this triggered hundreds of independent API/DB requests, so network latency dominated ingestion time.
The Solution: Refactored the core loop in application/parser/embedding_pipeline.py to insert documents in batches, with a configurable batch size (settings.EMBEDDING_BATCH_SIZE, defaulting to 100 chunks). The database checkpoint mechanism (_record_progress) and the Server-Sent Events (SSE) toast stream (publish_user_event) now advance on batch boundaries rather than on single-item increments.
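For reviewers, the batching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in embedding_pipeline.py: ingest_in_batches, iter_batches, add_batch, and on_progress are hypothetical names standing in for the store insert call and the progress hooks, and the default of 100 simply mirrors the stated default of settings.EMBEDDING_BATCH_SIZE.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def ingest_in_batches(
    chunks: Sequence[str],
    add_batch: Callable[[List[str]], None],     # hypothetical stand-in for the store insert
    on_progress: Callable[[int, int], None],    # hypothetical stand-in for progress hooks
    batch_size: int = 100,                      # assumed to mirror EMBEDDING_BATCH_SIZE
) -> int:
    """Insert chunks batch by batch and return the number of insert calls made."""
    total = len(chunks)
    done = 0
    calls = 0
    for batch in iter_batches(chunks, batch_size):
        add_batch(batch)            # one round-trip per batch instead of one per chunk
        calls += 1
        done += len(batch)
        on_progress(done, total)    # progress advances on batch boundaries
    return calls
```

With 250 chunks and a batch size of 100, this makes 3 insert calls (100, 100, and 50 chunks) instead of 250, and reports progress three times.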

2. Semantic-Aware Recursive Chunking

The Problem: The legacy "classic_chunk" strategy split strings using rigid token-count slicing (body_tokens[current_position:end_position]). This frequently sliced individual words and phrases in half across chunk boundaries, causing severe semantic fragmentation and lowering downstream LLM generation quality.
The Solution: Introduced a pluggable "recursive_chunk" strategy in application/parser/chunking.py, built on RecursiveCharacterTextSplitter.from_tiktoken_encoder with a default "cl100k_base" encoder. It prefers clean semantic breaks (paragraphs \n\n, sentences ., ?, !, and word boundaries at spaces) and applies a configurable chunk_overlap (default: 200 tokens) to preserve contextual flow between chunks.
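To make the difference concrete, here is a simplified, dependency-free sketch of the recursive idea next to fixed-size slicing. It is not the RecursiveCharacterTextSplitter implementation: it measures length in characters rather than tiktoken tokens, omits chunk_overlap, drops the separator at each chunk boundary, and the names recursive_split, fixed_slice, and SEPARATORS are made up for illustration.

```python
from typing import List, Sequence

# Assumed separator order for this sketch: paragraphs, then sentences, then words.
SEPARATORS = ("\n\n", ". ", " ")


def fixed_slice(text: str, max_len: int) -> List[str]:
    """Cut every max_len characters, regardless of word boundaries."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def recursive_split(
    text: str, max_len: int, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split into pieces of at most max_len characters, trying coarse separators first."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to hard slicing.
        return fixed_slice(text, max_len)
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)       # the separator at this boundary is dropped
        if len(part) <= max_len:
            current = part
        else:
            current = ""
            chunks.extend(recursive_split(part, max_len, finer))  # retry with finer separators
    if current:
        chunks.append(current)
    return chunks
```

For the text "Alpha beta gamma. Delta epsilon zeta.\n\nEta theta iota kappa lambda." and max_len=20, fixed_slice starts with "Alpha beta gamma. De", cutting "Delta" in half, while recursive_split returns "Alpha beta gamma", "Delta epsilon zeta.", "Eta theta iota kappa", and "lambda.".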

Head-to-Head Architectural Benchmarks

Ingestion Speed Metrics (Local Evaluation Suite)

Boundary Chunking Quality Evaluation

Classic Strategy Layer: Generated ~9 fractured words across split chunk boundaries.
Recursive Strategy Layer: Generated 0 fractured words (100% alignment along native syntax boundaries).
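One way a fractured-word count like this could be reproduced is sketched below. This is an assumed metric, not necessarily the one used for the numbers above: count_fractured_words is a hypothetical helper that treats a cut as fractured when the characters on both sides of it are alphanumeric.

```python
from typing import Iterable


def count_fractured_words(text: str, cut_offsets: Iterable[int]) -> int:
    """Count cut positions that fall strictly inside a word.

    A cut at offset i separates text[:i] from text[i:]. It is counted as
    fractured when both neighbouring characters are alphanumeric.
    """
    fractured = 0
    for offset in cut_offsets:
        if 0 < offset < len(text):
            if text[offset - 1].isalnum() and text[offset].isalnum():
                fractured += 1
    return fractured
```

For example, cutting "Alpha beta gamma. Delta epsilon zeta." every 10 characters (offsets 10, 20, 30) gives a count of 2, because the cuts at 20 and 30 land inside "Delta" and "epsilon".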

Required Visual Proof (Screenshots / Screen Recording)

Testing, Quality Control & Verification

Code Coverage & Test Assertions

Extended the backend testing modules to explicitly assert the integrity of the batched data pathways and dispatch routes:

tests/parser/test_chunking.py (28/28 passing): Validates recursive_chunk setup parameters, dispatcher integration, chunk boundary handling, and token metadata retention.
tests/parser/file/test_embedding_pipeline.py (13/13 passing): Reworked the mocks to handle multi-item lists of tuples and to assert bulk handling within add_texts_to_store_with_retry.

- Implement 'recursive_chunk' strategy in chunking.py using tiktoken to prevent broken words at chunk boundaries
- Refactor embedding_pipeline.py to support batched vector store insertions, reducing API overhead
- Add unit tests in test_chunking.py to validate the new splitter and configuration
- Update test_embedding_pipeline.py to assert correct batched operations

vercel · 2026-05-27T02:43:11Z

@abhay-2108 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

abhay-2108 · 2026-06-02T07:58:54Z

Hey @dartpain / @ManishMadan2882! Just bumping this to see if you've had a chance to look it over. I know the GitHub Actions workflows are currently waiting on maintainer approval to trigger the test suite.

Let me know if you want me to adjust anything with the batching logic or the recursive splitter parameters. Ready to make any changes needed to get this merged!

github-actions bot added the application and tests labels on May 27, 2026

abhay-2108 wants to merge 1 commit into arc53:main from abhay-2108:chore/codebase-optimization-scan
