Title
index_data_points: shallow copy of metadata dict causes only one index_field to be embedded
Description
When a custom DataPoint defines multiple index_fields (e.g., ["problem", "conclusion", "follow_up"]), only one field is effectively embedded. A collection is created for every field, but all of them contain the embeddings and text of the same field (the last one listed, in the reproduction below).
Root Cause
In cognee/tasks/storage/index_data_points.py, lines 36-48:
for field_name in data_point.metadata["index_fields"]:  # iterates over the original list
    # ...
    indexed_data_point = data_point.model_copy()  # shallow copy
    indexed_data_point.metadata["index_fields"] = [field_name]  # mutates the ORIGINAL dict!
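model_copy() (Pydantic v2) makes a shallow copy, so dict fields are shared rather than duplicated: indexed_data_point.metadata and data_point.metadata are the same dict object. The assignment indexed_data_point.metadata["index_fields"] = [field_name] therefore rewrites the original data_point.metadata["index_fields"], and that of every copy made so far, from ["problem", "conclusion", "follow_up"] to a one-element list.
The for loop itself still visits every field, because the assignment rebinds the dict key to a new list instead of mutating the list being iterated; that is why all collections are created. Every copy, however, ends up naming only the field written last, which is consistent with the reproduction below, where both collections contain the second field's text.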
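The aliasing can be checked in isolation, without cognee or Pydantic. The sketch below is only an approximation: FakePoint is a hypothetical stand-in for a DataPoint, and copy.copy is assumed to behave like model_copy() in that it shares the metadata dict between the original and the copy. In this stand-in the loop visits every field name, since the assignment rebinds the key instead of changing the list being iterated, while every copy ends up reporting the last field.

```python
import copy


class FakePoint:
    """Hypothetical stand-in for a DataPoint; not cognee's class."""

    def __init__(self, metadata):
        self.metadata = metadata


def collect_copies(point):
    """Mimic the indexing loop, with copy.copy in place of model_copy()."""
    visited, copies = [], []
    for field_name in point.metadata["index_fields"]:
        visited.append(field_name)
        clone = copy.copy(point)  # shallow: clone.metadata is the same dict
        clone.metadata["index_fields"] = [field_name]  # rebinds the key on the shared dict
        copies.append(clone)
    return visited, copies


point = FakePoint({"index_fields": ["problem", "conclusion", "follow_up"]})
visited, copies = collect_copies(point)

print(visited)  # all three names: the loop iterates the original list object
print([c.metadata["index_fields"] for c in copies])  # every copy reports ['follow_up']
print(point.metadata["index_fields"])  # the original now reports ['follow_up'] as well
```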
Reproduction
from cognee.infrastructure.engine import DataPoint
from cognee.tasks.storage import add_data_points
import cognee


class MyCase(DataPoint):
    problem: str = ""
    conclusion: str = ""
    metadata: dict = {"index_fields": ["problem", "conclusion"]}


async def main():
    await cognee.prune.prune_data()
    await cognee.prune.prune_system(metadata=True)
    from cognee.modules.engine.operations.setup import setup
    await setup()

    cases = [
        MyCase(problem="PROBLEM_AAA", conclusion="CONCLUSION_BBB"),
    ]
    await add_data_points(cases)

    # Check: both collections have the same content (should be different)
    from cognee.infrastructure.databases.vector import get_vector_engine
    engine = get_vector_engine()
    conn = await engine.get_connection()
    for coll in ["MyCase_problem", "MyCase_conclusion"]:
        table = await conn.open_table(coll)
        data = await table.to_arrow()
        text = data.column("payload")[0].as_py().get("text", "")
        print(f"{coll}: {text}")
    # Both print "CONCLUSION_BBB"; should be "PROBLEM_AAA" and "CONCLUSION_BBB" respectively


import asyncio
asyncio.run(main())
Minimal verification of the shallow copy issue
from cognee.infrastructure.engine import DataPoint


class TestDP(DataPoint):
    a: str = ""
    b: str = ""
    metadata: dict = {"index_fields": ["a", "b"]}


dp = TestDP(a="AAA", b="BBB")
clone = dp.model_copy()  # shallow copy: clone.metadata is dp.metadata
clone.metadata["index_fields"] = ["b"]
print(dp.metadata["index_fields"])
# Output: ['b'] (the original object was mutated!)
# Expected: ['a', 'b']
Suggested Fix
Two-line change in cognee/tasks/storage/index_data_points.py:
- for field_name in data_point.metadata["index_fields"]:
+ for field_name in list(data_point.metadata["index_fields"]):
      # ...
      indexed_data_point = data_point.model_copy()
+     indexed_data_point.metadata = dict(data_point.metadata)
      indexed_data_point.metadata["index_fields"] = [field_name]
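dict(...) creates a shallow copy of the metadata dict for each indexed copy, so the assignment no longer touches the original DataPoint's metadata or the copies made for other fields.
list(...) snapshots the field names before iteration. This is a defensive measure: the loop no longer depends on data_point.metadata["index_fields"] staying untouched while it runs.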
Using model_copy(deep=True) would also work but is unnecessarily expensive since only the metadata dict needs isolation.
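The effect of the two added lines can be sketched with the standard library alone. As before, this is an illustration under assumptions rather than cognee's code: FakePoint is a hypothetical stand-in for a DataPoint, and copy.copy stands in for model_copy().

```python
import copy


class FakePoint:
    """Hypothetical stand-in for a DataPoint; not cognee's class."""

    def __init__(self, metadata):
        self.metadata = metadata


def collect_isolated_copies(point):
    """Same loop as the suggested fix: each copy gets its own metadata dict."""
    copies = []
    for field_name in list(point.metadata["index_fields"]):  # snapshot of the names
        clone = copy.copy(point)  # still shallow
        clone.metadata = dict(point.metadata)  # new top-level dict for this clone only
        clone.metadata["index_fields"] = [field_name]  # no longer visible to the original
        copies.append(clone)
    return copies


original = FakePoint({"index_fields": ["problem", "conclusion"], "type": "MyCase"})
isolated = collect_isolated_copies(original)

print([c.metadata["index_fields"] for c in isolated])  # [['problem'], ['conclusion']]
print(original.metadata["index_fields"])  # ['problem', 'conclusion'], left unchanged
```

Note that dict(...) copies only the top level: other values inside metadata are still shared between copies, which is fine here because only the index_fields key is reassigned.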
Environment
- cognee version: 0.5.5
- Python: 3.12
- Pydantic: v2
- Vector DB: LanceDB
Impact
Any user defining a custom DataPoint with multiple index_fields will silently get incorrect embeddings: all vector collections will contain the same embeddings, taken from a single field. This makes multi-field semantic search ineffective.