perf: parallelize extract_graph and summarize_text in cognify pipeline #2488

@soichisumi

Description

Problem

The cognify pipeline runs extract_graph_from_data and summarize_text sequentially for each batch of chunks. Both tasks are LLM-bound and independent of each other, so they could run concurrently.

Current flow (sequential):

extract_graph_from_data(chunks)  # LLM calls
    ↓
summarize_text(chunks)           # LLM calls

Proposed flow (parallel):

asyncio.gather(
    extract_graph_from_data(chunks),   # LLM calls
    summarize_text(chunks),            # LLM calls
)
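
The proposed flow can be sketched as a small runnable example. The coroutines below are stand-ins that simulate cognee's `extract_graph_from_data` and `summarize_text` (the real signatures and return types may differ); the point is only that two independent LLM-bound coroutines over the same chunks can be awaited concurrently with `asyncio.gather`:

```python
import asyncio

# Hypothetical stand-ins for the real pipeline tasks; each simulates
# an LLM-bound operation that reads the chunks without mutating them.
async def extract_graph_from_data(chunks):
    await asyncio.sleep(0.01)  # simulate LLM call latency
    return {"graph": [f"entities({c})" for c in chunks]}

async def summarize_text(chunks):
    await asyncio.sleep(0.01)  # simulate LLM call latency
    return {"summaries": [f"summary({c})" for c in chunks]}

async def run_parallel(chunks):
    # Both tasks take the same input and produce independent outputs,
    # so their total wall time is max(t1, t2) instead of t1 + t2.
    graph, summaries = await asyncio.gather(
        extract_graph_from_data(chunks),
        summarize_text(chunks),
    )
    return {**graph, **summaries}

result = asyncio.run(run_parallel(["chunk-1", "chunk-2"]))
```

`asyncio.gather` preserves argument order in its result list, so the unpacking above is deterministic even though the tasks finish in any order.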

Impact

On a dataset with multiple chunks, this roughly halves the LLM-bound processing time for the cognify pipeline. In our testing with FalkorDB and OpenRouter, a 6KB document went from ~45s to ~25s for the graph+summary phase.

Implementation Notes

  • Both functions accept the same data_chunks input and produce independent outputs
  • The results need to be merged before they are passed to add_data_points
  • Care is needed with shared state (e.g., chunk objects should not be mutated by both tasks simultaneously)
  • This could be exposed as a configuration option (parallel vs. sequential) for safety
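
One way the configuration option could look, as a minimal sketch (the `parallel` flag and the `run_graph_and_summary` helper are hypothetical, not existing cognee APIs):

```python
import asyncio

# Hypothetical helper gating parallel vs. sequential execution behind a flag.
async def run_graph_and_summary(chunks, extract, summarize, parallel=True):
    if parallel:
        graph, summaries = await asyncio.gather(extract(chunks), summarize(chunks))
    else:
        graph = await extract(chunks)
        summaries = await summarize(chunks)
    return graph, summaries

# Trivial stand-in coroutines for demonstration only.
async def fake_extract(chunks):
    return {"nodes": len(chunks)}

async def fake_summarize(chunks):
    return {"summaries": len(chunks)}

seq = asyncio.run(run_graph_and_summary(["a", "b"], fake_extract, fake_summarize, parallel=False))
par = asyncio.run(run_graph_and_summary(["a", "b"], fake_extract, fake_summarize, parallel=True))
```

Since both tasks produce independent outputs, the two modes should yield identical results, which also makes the flag easy to cover in a regression test.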
