Bug Description
retrieve_existing_edges() has three compounding issues that make edge deduplication non-functional during graph expansion:
1. Overwrite instead of accumulate
graph_node_edges is reassigned inside the chunk loop, so only the last chunk's edges are queried against the graph DB:
# retrieve_existing_edges.py L64-67
for index, data_chunk in enumerate(data_chunks):
    graph = chunk_graphs[index]
    # ...
    graph_node_edges = [  # ← reassigned each iteration, should be .extend()
        (edge.target_node_id, edge.source_node_id, edge.relationship_name)
        for edge in graph.edges
    ]
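A minimal sketch of the accumulating variant, assuming the list can be initialised once before the loop. The helper name `collect_graph_node_edges` and the `SimpleNamespace` stand-ins for chunk graphs and edges are hypothetical and only serve to make the example runnable; the tuple order follows the snippet above.

```python
from types import SimpleNamespace


def collect_graph_node_edges(chunk_graphs):
    """Gather (target, source, relationship) tuples from every chunk graph."""
    graph_node_edges = []  # created once, outside the loop
    for graph in chunk_graphs:
        # extend() keeps the edges collected from earlier chunks
        graph_node_edges.extend(
            (edge.target_node_id, edge.source_node_id, edge.relationship_name)
            for edge in graph.edges
        )
    return graph_node_edges


def make_edge(source, target, name):
    # Hypothetical stand-in for the real edge model
    return SimpleNamespace(
        source_node_id=source, target_node_id=target, relationship_name=name
    )


chunk_graphs = [
    SimpleNamespace(edges=[make_edge("a", "b", "mentioned_in")]),
    SimpleNamespace(edges=[make_edge("c", "d", "is_a")]),
]

all_edges = collect_graph_node_edges(chunk_graphs)
print(all_edges)  # edges from both chunks, not only the last one
```

With the reassignment shown in the report, the same input would leave only the second chunk's edge in the list.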
2. Key format mismatch
The producer builds keys by plain concatenation:
# retrieve_existing_edges.py L81
existing_edges_map[str(edge[0]) + str(edge[1]) + str(edge[2])] = True
# e.g. "abc123def456mentioned_in"
But the consumer uses underscore-separated keys:
# expand_with_nodes_and_edges.py L26-28
def _create_edge_key(source_id, target_id, relationship_name):
    return f"{source_id}_{target_id}_{relationship_name}"
# e.g. "abc123_def456_mentioned_in"
These never match, making dedup a no-op.
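The mismatch can be reproduced in isolation. The sketch below copies the two key formats quoted above into standalone functions; the names `producer_key` and `consumer_key` are hypothetical, and the IDs are the illustrative values from the comments.

```python
def producer_key(edge):
    # Same plain concatenation as the line quoted from retrieve_existing_edges.py
    return str(edge[0]) + str(edge[1]) + str(edge[2])


def consumer_key(source_id, target_id, relationship_name):
    # Same format as _create_edge_key in expand_with_nodes_and_edges.py
    return f"{source_id}_{target_id}_{relationship_name}"


edge = ("abc123", "def456", "mentioned_in")

# The producer records the edge as existing...
existing_edges_map = {producer_key(edge): True}

# ...but the consumer looks it up under a different key.
lookup = consumer_key(*edge)
found = lookup in existing_edges_map

print(producer_key(edge))  # abc123def456mentioned_in
print(lookup)              # abc123_def456_mentioned_in
print(found)               # False
```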
3. Missing ID normalization
The producer uses raw edge.source_node_id / edge.target_node_id, while the consumer normalizes them through generate_node_id() (lowercases, strips spaces, hashes to UUID5) and generate_edge_name(). Even with the separator fix, keys still can't match because the ID formats differ.
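One possible shape for a fix is a single key builder used by both sides, so the separator and the normalization cannot drift apart. The sketch below is illustrative only: `normalize_node_id` and `normalize_edge_name` are hypothetical stand-ins for `generate_node_id()` and `generate_edge_name()`, and the UUID namespace, the space handling, and the edge-name rules are assumptions rather than the project's actual behaviour.

```python
import uuid


def normalize_node_id(raw_id):
    # Stand-in for generate_node_id(): lowercase, remove spaces, hash to UUID5.
    # The namespace (NAMESPACE_OID) is an assumption.
    cleaned = str(raw_id).lower().replace(" ", "")
    return uuid.uuid5(uuid.NAMESPACE_OID, cleaned)


def normalize_edge_name(name):
    # Stand-in for generate_edge_name(); its real rules are not shown in this report.
    return str(name).lower().replace(" ", "_")


def shared_edge_key(source_id, target_id, relationship_name):
    # One key format for both the producer and the consumer
    source = normalize_node_id(source_id)
    target = normalize_node_id(target_id)
    return f"{source}_{target}_{normalize_edge_name(relationship_name)}"


# Raw and already-lowercased inputs produce the same key.
raw_key = shared_edge_key("Alice Smith", "Doc 1", "Mentioned In")
same_key = shared_edge_key("alice smith", "doc 1", "mentioned in")
print(raw_key == same_key)
```

In the real code the producer would presumably call the existing `generate_node_id()` and `generate_edge_name()` helpers directly rather than reimplementing them.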
Impact
- Multi-chunk documents lose dedup for all but the last chunk's content edges
- All content-edge dedup checks fail → duplicate edges accumulate in the graph database on every cognify() run
Expected Behavior
All chunks' graph edges should participate in the existence query, and the resulting map keys should use the same format and normalization as the consumer.
Environment
- Branch: dev (commit 452333a)
- Python 3.12