Edge deduplication in retrieve_existing_edges is non-functional (overwrite, key format, ID normalization) #2557

@YizukiAme

Bug Description

retrieve_existing_edges() has three compounding issues that make edge deduplication non-functional during graph expansion:

1. Overwrite instead of accumulate

graph_node_edges is reassigned inside the chunk loop, so only the last chunk's edges are queried against the graph DB:

# retrieve_existing_edges.py L64-67
for index, data_chunk in enumerate(data_chunks):
    graph = chunk_graphs[index]
    # ...
    graph_node_edges = [  # ← reassigned each iteration, should be .extend()
        (edge.target_node_id, edge.source_node_id, edge.relationship_name)
        for edge in graph.edges
    ]
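
A minimal sketch of the accumulate variant, for illustration only. `Edge`, `ChunkGraph`, and `collect_graph_node_edges` are hypothetical stand-ins rather than the project's actual classes; the tuple order is copied from the snippet above.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    # Hypothetical stand-in for the project's edge model.
    source_node_id: str
    target_node_id: str
    relationship_name: str


@dataclass
class ChunkGraph:
    # Hypothetical stand-in for one chunk's extracted graph.
    edges: list[Edge] = field(default_factory=list)


def collect_graph_node_edges(chunk_graphs):
    """Gather edge tuples from every chunk, not only the last one."""
    graph_node_edges = []  # initialized once, outside the loop
    for graph in chunk_graphs:
        # extend() appends this chunk's tuples to what earlier chunks added
        graph_node_edges.extend(
            (edge.target_node_id, edge.source_node_id, edge.relationship_name)
            for edge in graph.edges
        )
    return graph_node_edges
```

With two chunk graphs holding one edge each, the returned list contains both tuples, whereas the reassigning version would keep only the second.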

2. Key format mismatch

The producer builds keys by plain concatenation:

# retrieve_existing_edges.py L81
existing_edges_map[str(edge[0]) + str(edge[1]) + str(edge[2])] = True
# e.g. "abc123def456mentioned_in"

But the consumer uses underscore-separated keys:

# expand_with_nodes_and_edges.py L26-28
def _create_edge_key(source_id, target_id, relationship_name):
    return f"{source_id}_{target_id}_{relationship_name}"
# e.g. "abc123_def456_mentioned_in"

These never match, making dedup a no-op.
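
The mismatch can be reproduced in isolation. This is a sketch: `producer_key` and `consumer_key` are hypothetical names that mirror the two snippets quoted above, and the IDs are the placeholder values from the comments.

```python
def producer_key(edge):
    # Mirrors the plain concatenation quoted from retrieve_existing_edges.py.
    return str(edge[0]) + str(edge[1]) + str(edge[2])


def consumer_key(source_id, target_id, relationship_name):
    # Mirrors _create_edge_key from expand_with_nodes_and_edges.py.
    return f"{source_id}_{target_id}_{relationship_name}"


edge = ("abc123", "def456", "mentioned_in")

# The producer fills the map with concatenated keys...
existing_edges_map = {producer_key(edge): True}

# ...and the consumer looks up an underscore-separated key, so it misses.
lookup = consumer_key(*edge)
is_known = lookup in existing_edges_map  # False for the same edge
```

Building both sides from a single shared key helper would rule out this class of drift.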

3. Missing ID normalization

The producer uses raw edge.source_node_id / edge.target_node_id, while the consumer normalizes them through generate_node_id() (lowercases, strips spaces, hashes to UUID5) and generate_edge_name(). Even with the separator fix, keys still can't match because the ID formats differ.
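
A sketch of what normalizing on the producer side could look like. `normalize_node_id` and `normalize_edge_name` are hypothetical stand-ins for `generate_node_id()` and `generate_edge_name()`: the UUID namespace, the exact space handling, and the edge-name rule are assumptions, not the project's code. In a real fix the producer would call the project's own helpers. The edge is taken as a `(source, target, relationship)` tuple for illustration.

```python
from uuid import NAMESPACE_OID, uuid5


def normalize_node_id(raw_id: str) -> str:
    # Assumed behavior: lowercase, drop spaces, hash to a UUID5.
    return str(uuid5(NAMESPACE_OID, raw_id.lower().replace(" ", "")))


def normalize_edge_name(name: str) -> str:
    # Assumed behavior: the issue does not describe generate_edge_name().
    return name.lower().replace(" ", "_")


def create_edge_key(source_id, target_id, relationship_name):
    # Same format as the consumer's _create_edge_key.
    return f"{source_id}_{target_id}_{relationship_name}"


def producer_key(edge):
    # Normalize the raw values before building the key, so the producer
    # and the consumer derive identical strings for the same edge.
    raw_source, raw_target, raw_name = edge
    return create_edge_key(
        normalize_node_id(raw_source),
        normalize_node_id(raw_target),
        normalize_edge_name(raw_name),
    )
```

Under these assumptions, `("Alice Smith", "Doc 1", "Mentioned In")` and `("alice smith", "doc 1", "mentioned_in")` map to the same key.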

Impact

  • Multi-chunk documents lose dedup for all but the last chunk's content edges
  • All content-edge dedup checks fail → duplicate edges accumulate in the graph database on every cognify() run

Expected Behavior

All chunks' graph edges should participate in the existence query, and the resulting map keys should use the same format and normalization as the consumer.
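
The expected behavior could be pinned down with a small regression check along these lines. This is a sketch under stated assumptions: `edge_key`, `build_existing_edges_map`, and the `edge_exists` callable (standing in for the graph DB existence query) are hypothetical, and the IDs are assumed to be already normalized.

```python
def edge_key(source_id, target_id, relationship_name):
    # Single key helper shared by producer and consumer (hypothetical).
    return f"{source_id}_{target_id}_{relationship_name}"


def build_existing_edges_map(chunk_edges, edge_exists):
    """Return a map of keys for edges that already exist in the graph.

    chunk_edges: one list of (source, target, relationship) tuples per chunk.
    edge_exists: callable standing in for the graph DB existence query.
    """
    # Flatten so that every chunk's edges take part in the query.
    all_edges = [edge for edges in chunk_edges for edge in edges]
    return {edge_key(*edge): True for edge in all_edges if edge_exists(edge)}


chunk_edges = [
    [("n1", "n2", "mentions")],  # first chunk
    [("n3", "n4", "cites")],     # last chunk
]
stored = {("n1", "n2", "mentions")}  # pretend this edge is already in the DB

existing = build_existing_edges_map(chunk_edges, lambda edge: edge in stored)
```

The first chunk's edge is found through the consumer-style key, which is exactly the case that fails today.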

Environment

  • Branch: dev (commit 452333a)
  • Python 3.12
