[Bug] index_data_points: shallow copy of metadata dict causes only first index_field to be embedded #2529

@xzhouww

Description

When a custom DataPoint defines multiple index_fields (e.g., ["problem", "conclusion", "follow_up"]), only one field is embedded correctly. A collection is created for every field, but all of them contain the embeddings and text of the same field.

Root Cause

In cognee/tasks/storage/index_data_points.py, lines 36-48:

for field_name in data_point.metadata["index_fields"]:  # iterates over original list
    # ...
    indexed_data_point = data_point.model_copy()  # shallow copy
    indexed_data_point.metadata["index_fields"] = [field_name]  # mutates the ORIGINAL dict!

model_copy() (Pydantic v2) performs a shallow copy of dict fields. Since indexed_data_point.metadata and data_point.metadata point to the same dict object, the assignment indexed_data_point.metadata["index_fields"] = [field_name] mutates the original data_point.metadata["index_fields"] from ["problem", "conclusion", "follow_up"] to ["problem"].

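Note that the for loop itself is not truncated: the for statement evaluates data_point.metadata["index_fields"] once and keeps iterating over that original list object, while the assignment only rebinds the dict entry to a new list. The damage comes from sharing. Every copy created in the loop holds the same metadata dict, so once the loop finishes they all report the same single-element index_fields. This is consistent with the reproduction in this report, where both collections print the same text.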
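
The sharing can be checked without cognee or Pydantic. The sketch below is a stand-in, not the library code: FakePoint is a hypothetical plain class, and copy.copy is assumed to mirror the default shallow behaviour of model_copy() (new object, same metadata dict). It suggests that the loop visits every field, while all copies end up reporting whichever field was assigned last.

```python
import copy


class FakePoint:
    """Hypothetical stand-in for a DataPoint; not part of cognee."""

    def __init__(self, metadata):
        self.metadata = metadata


original = FakePoint({"index_fields": ["problem", "conclusion", "follow_up"]})

visited = []
copies = []
# The iterable is evaluated once, so the loop walks the original list object
# even after the dict entry is rebound inside the body.
for field_name in original.metadata["index_fields"]:
    visited.append(field_name)
    clone = copy.copy(original)  # shallow: clone.metadata is original.metadata
    clone.metadata["index_fields"] = [field_name]  # rebinds the shared entry
    copies.append(clone)

print(visited)
print([c.metadata["index_fields"] for c in copies])
print(original.metadata["index_fields"])
```

If the stand-in is faithful, this would explain why each collection is created but every copy handed to the embedder names the same field.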

Reproduction

from cognee.infrastructure.engine import DataPoint
from cognee.tasks.storage import add_data_points
import cognee

class MyCase(DataPoint):
    problem: str = ""
    conclusion: str = ""
    metadata: dict = {"index_fields": ["problem", "conclusion"]}

async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)

    from cognee.modules.engine.operations.setup import setup
    await setup()

    cases = [
        MyCase(problem="PROBLEM_AAA", conclusion="CONCLUSION_BBB"),
    ]
    await add_data_points(cases)

    # Check: both collections have the same content (should be different)
    from cognee.infrastructure.databases.vector import get_vector_engine
    engine = get_vector_engine()
    conn = await engine.get_connection()
    for coll in ["MyCase_problem", "MyCase_conclusion"]:
        table = await conn.open_table(coll)
        data = await table.to_arrow()
        text = data.column("payload")[0].as_py().get("text", "")
        print(f"{coll}: {text}")
        # Both print "CONCLUSION_BBB" — should be "PROBLEM_AAA" and "CONCLUSION_BBB" respectively

import asyncio
asyncio.run(main())

Minimal verification of the shallow copy issue

from cognee.infrastructure.engine import DataPoint

class TestDP(DataPoint):
    a: str = ""
    b: str = ""
    metadata: dict = {"index_fields": ["a", "b"]}

dp = TestDP(a="AAA", b="BBB")
clone = dp.model_copy()
clone.metadata["index_fields"] = ["b"]

print(dp.metadata["index_fields"])
# Output: ['b'] (the original object was mutated)
# Expected: ['a', 'b']

Suggested Fix

Two-line change in cognee/tasks/storage/index_data_points.py:

-        for field_name in data_point.metadata["index_fields"]:
+        for field_name in list(data_point.metadata["index_fields"]):
             # ...
             indexed_data_point = data_point.model_copy()
+            indexed_data_point.metadata = dict(data_point.metadata)
             indexed_data_point.metadata["index_fields"] = [field_name]

  1. list(...) iterates over a snapshot of the field names, so the loop does not depend on the dict entry that is reassigned inside it.
  2. dict(...) creates a shallow copy of the metadata dict, so narrowing index_fields on the copy no longer mutates the original DataPoint's metadata.

Using model_copy(deep=True) would also work but is unnecessarily expensive since only the metadata dict needs isolation.
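
A minimal sketch of the suggested fix, using a hypothetical stand-in class and copy.copy in place of DataPoint and model_copy(). The helper name split_by_index_field is invented for illustration. Giving each copy its own metadata dict before narrowing index_fields should leave the original untouched and keep one field per copy.

```python
import copy


class FakePoint:
    """Hypothetical stand-in for a DataPoint; not part of cognee."""

    def __init__(self, metadata):
        self.metadata = metadata


def split_by_index_field(point):
    """Return one shallow copy per index field, each with its own metadata dict."""
    copies = []
    for field_name in list(point.metadata["index_fields"]):  # iterate a snapshot
        clone = copy.copy(point)
        clone.metadata = dict(point.metadata)  # new dict, no longer shared
        clone.metadata["index_fields"] = [field_name]
        copies.append(clone)
    return copies


point = FakePoint({"index_fields": ["problem", "conclusion"]})
fixed_copies = split_by_index_field(point)
print([c.metadata["index_fields"] for c in fixed_copies])
print(point.metadata["index_fields"])
```

With Pydantic itself, model_copy(update={"metadata": {...}}) may serve as an equivalent one-step form, though that variant has not been tested against cognee here.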

Environment

  • cognee version: 0.5.5
  • Python: 3.12
  • Pydantic: v2
  • Vector DB: LanceDB

Impact

Any user defining a custom DataPoint with multiple index_fields will silently get incorrect embeddings: all vector collections will contain the same embeddings, taken from a single field. This makes multi-field semantic search ineffective.
