GML-2135 Release 1.4.2 #45

Merged
chengbiao-jin merged 8 commits into main from release_1.4.2
Jun 23, 2026

chengbiao-jin (Collaborator) commented Jun 23, 2026

PR Type

Enhancement, Bug fix, Documentation


Description

  • Adds graph compatibility repair assistant

  • Reinstalls drifted GSQL queries safely

  • Hardens ingestion ID and chunk handling

  • Documents and bumps v1.4.2 release


Diagram Walkthrough

flowchart LR
  admin["KG Admin UI"]
  status["Migration status API"]
  repair["Migration apply API"]
  migrate["GSQL migration helpers"]
  tg["TigerGraph queries"]
  ingest["Ingestion pipeline"]
  ids["Normalized IDs and atomic chunks"]
  admin -- "checks" --> status
  status -- "hashes" --> migrate
  admin -- "repairs" --> repair
  repair -- "recreates" --> tg
  ingest -- "uses" --> ids

File Walkthrough

Relevant files

| Category | File | Summary | Changes |
| --- | --- | --- | --- |
| Enhancement (4 files) | ui.py | Add migration status and repair endpoints | +387/-12 |
| | util.py | Detect query drift and atomic upserts | +115/-9 |
| | migrate.py | Add GSQL query migration utilities | +224/-0 |
| | KGAdmin.tsx | Add compatibility check repair dialog | +343/-6 |
| Bug fix (8 files) | graph_rag.py | Harden chunk streaming and batched loading | +168/-98 |
| | supportai_ingest.py | Normalize IDs and log ingest failures | +34/-13 |
| | workers.py | Atomic chunk writes with normalized IDs | +28/-34 |
| | eventual_consistency_checker.py | Normalize entity relationship and chunk IDs | +9/-7 |
| | workers.py | Normalize supportai worker chunk links | +5/-4 |
| | util.py | Align ID normalization whitespace behavior | +1/-4 |
| | IngestGraph.tsx | Report per-file upload failures safely | +43/-34 |
| | StreamIds.gsql | Atomically claim unprocessed vertex IDs | +15/-6 |
| Error handling (2 files) | supportai.py | Log server folder processing failures | +1/-0 |
| | supportai.py | Return clearer ingestion endpoint errors | +14/-4 |
| Documentation (2 files) | README.md | Add v1.4.2 release announcement | +2/-0 |
| | CHANGELOG.md | Document v1.4.2 release changes | +11/-0 |
| Configuration changes (1 file) | VERSION | Bump version to 1.4.2 | +1/-1 |

chengbiao-jin and others added 7 commits June 23, 2026 13:32
- Scan an existing graph for installed queries that have drifted from
  the shipped version or are missing
- Repair them in place without rebuilding the knowledge graph
- Refuse repair while a rebuild is in progress and run it under the
  per-graph lock
- Reconcile chunks left unfinished by an interrupted run before
  processing new documents
- Write each chunk together with its content so cancellation can't
  leave chunks without content
- Normalize vertex IDs the same way across every ingest path so
  documents with spaces or mixed case in filenames stay consistent
- Re-create installed queries that have drifted from the shipped
  version on initialization
- Surface ingestion failures as clear errors instead of failing silently
- Bump version to 1.4.2
- Update CHANGELOG and README releases
- Add .gitignore
- GML-2131 Honor configured GS/RESTPP ports at login
- Omit gsPort/restppPort unless configured, matching the auth() path,
  so absent config falls back to pyTigerGraph defaults

tg-pr-agent (Bot) commented Jun 23, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Information disclosure:
graphrag/app/routers/supportai.py now returns raw exception text to clients in HTTP 500 details for ingest preparation and ingestion failures. Those exception strings can include filesystem paths, TigerGraph responses, configuration details, or other internal diagnostics. Consider returning a generic client-facing message while logging the detailed exception server-side.
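
One way to follow this suggestion is to log the full exception on the server and return only a generic message plus a correlation ID. The sketch below is illustrative only; `safe_error_detail` and the logger name are hypothetical and do not come from the PR.

```python
import logging
import uuid

logger = logging.getLogger("ingest_errors")  # hypothetical logger name


def safe_error_detail(exc: Exception, action: str) -> dict:
    """Log the full exception server-side and return a generic client payload."""
    # A short correlation ID lets operators find the matching log entry
    # without exposing the exception text to the client.
    error_id = uuid.uuid4().hex[:8]
    logger.error(
        "%s failed [error_id=%s]: %r", action, error_id, exc, exc_info=exc
    )
    return {
        "message": f"{action} failed. Quote the error ID when reporting this.",
        "error_id": error_id,
    }


if __name__ == "__main__":
    try:
        raise FileNotFoundError("/srv/private/uploads/report.pdf")
    except Exception as exc:
        # The path stays in the server log; the client sees only the payload.
        print(safe_error_detail(exc, "Ingestion"))
```

If the router raises FastAPI's `HTTPException`, the returned dict could be passed as its `detail` in place of `str(exc)`.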

⚡ Recommended focus areas for review

Drift Detection

The GSQL hash compares the local file body directly against SHOW QUERY output after only comment/whitespace normalization. If TigerGraph canonicalizes CREATE OR REPLACE to CREATE, or otherwise changes harmless query boilerplate, every query may be reported as drifted and repeatedly reinstalled. Consider normalizing expected TG canonical forms before hashing.

def _normalize_gsql(body: str) -> str:
    body = _BLOCK_COMMENT_RE.sub("", body)
    body = _LINE_COMMENT_RE.sub("", body)
    body = _WHITESPACE_RE.sub(" ", body).strip()
    return body


def _gsql_hash(body: str) -> str:
    return hashlib.sha256(_normalize_gsql(body).encode()).hexdigest()[:16]
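
A self-contained sketch of the suggested fix follows. It folds `CREATE OR REPLACE` to `CREATE` before hashing, so a purely cosmetic difference does not register as drift. The regex definitions are assumptions, since the excerpt does not show the PR's actual patterns, and whether TigerGraph rewrites the header this way is the reviewer's hypothesis rather than confirmed behavior.

```python
import hashlib
import re

# Assumed patterns; the PR's real regexes are not visible in the excerpt.
_BLOCK_COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
# Naive: this would also strip "//" inside a string literal such as a URL.
_LINE_COMMENT_RE = re.compile(r"//[^\n]*")
_WHITESPACE_RE = re.compile(r"\s+")
_CREATE_OR_REPLACE_RE = re.compile(r"\bCREATE\s+OR\s+REPLACE\b", re.IGNORECASE)


def normalize_gsql(body: str) -> str:
    """Strip comments, fold CREATE OR REPLACE to CREATE, collapse whitespace."""
    body = _BLOCK_COMMENT_RE.sub("", body)
    body = _LINE_COMMENT_RE.sub("", body)
    body = _CREATE_OR_REPLACE_RE.sub("CREATE", body)
    return _WHITESPACE_RE.sub(" ", body).strip()


def gsql_hash(body: str) -> str:
    """Short content hash of the normalized query text."""
    return hashlib.sha256(normalize_gsql(body).encode()).hexdigest()[:16]


if __name__ == "__main__":
    shipped = "CREATE OR REPLACE DISTRIBUTED QUERY q() { // note\n  PRINT 1; }"
    installed = "CREATE DISTRIBUTED QUERY q() {\n  PRINT 1;\n}"
    # The two texts differ only in boilerplate, so the hashes match.
    print(gsql_hash(shipped) == gsql_hash(installed))
```

Any other harmless rewrite applied to installed queries would need its own normalization rule; a real change to the query body still yields a different hash.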

ID Mismatch

v_id is now only lowercased instead of passed through util.process_id, while chunk IDs and previous chunk links are normalized with process_id. Documents containing spaces, slashes, or parentheses can therefore create edges from an unnormalized document ID to normalized chunk IDs, which may not match the ingested document vertex.

v_id = doc["v_id"].lower()

# Use get_chunker for all types (including images)
# For images, get_chunker returns SingleChunker which preserves markdown image references
chunker = ecc_util.get_chunker(chunker_type, graphname=conn.graphname)
# decode the text return from tigergraph as it was encoded when written into jsonl file for uploading
chunks = chunker.chunk(doc["attributes"]["text"].encode('raw_unicode_escape').decode('unicode_escape'))

# v_id / chunk_id derive from user document content.
logger.debug(f"Chunking {v_id} into {len(chunks)} chunk(s)")
for i, chunk in enumerate(chunks):
    chunk_id = util.process_id(f"{v_id}_chunk_{i}")
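
The concern can be illustrated with a small sketch in which the document ID and every chunk ID go through the same normalizer. Here `normalize_id` is a hypothetical stand-in for `util.process_id`, whose real rules are not shown in this PR and may differ.

```python
import re


def normalize_id(raw: str) -> str:
    """Hypothetical stand-in for util.process_id; the real rules may differ."""
    # Lowercase, then replace each run of characters outside [a-z0-9_-]
    # (spaces, slashes, parentheses, dots, ...) with a single underscore.
    return re.sub(r"[^a-z0-9_\-]+", "_", raw.strip().lower())


def chunk_ids(doc_id: str, n_chunks: int) -> tuple[str, list[str]]:
    """Derive the document vertex ID and its chunk IDs with one normalizer."""
    v_id = normalize_id(doc_id)
    return v_id, [normalize_id(f"{v_id}_chunk_{i}") for i in range(n_chunks)]


if __name__ == "__main__":
    raw = "Reports/Q1 (final).PDF"
    v_id, chunks = chunk_ids(raw, 2)
    # Lowercasing alone keeps the slash, space, and parentheses, so an edge
    # built from raw.lower() would not point at the normalized vertex ID.
    print(raw.lower() == v_id)
    print(v_id, chunks)
```

Because the normalizer is idempotent on its own output, applying it again to an already normalized ID is harmless, which makes it safe to call on every ingest path.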

Scalability Risk

stream_docs and stream_chunks now ignore ttl_batches and fetch all unprocessed IDs in a single StreamIds call. On large graphs or after a long outage, this can produce very large query responses, timeouts, or high memory usage compared with the previous bounded batching behavior.

logger.info("streaming docs (single-probe scan)")
probe = await stream_ids(conn, "Document", 0, 1)
n_docs = 0
if probe.get("error"):
    logger.warning("stream_docs: StreamIds probe failed; nothing to stream")
else:
    doc_ids_all = probe.get("ids") or []
    if not doc_ids_all:
        logger.info("stream_docs: no unprocessed Documents (epoch_processed == 0)")
    else:
        logger.info(
            f"stream_docs: {len(doc_ids_all)} unprocessed Document(s) to stream"
        )
        for d in doc_ids_all:

- Restore partitioned batch streaming for documents and chunks
- Check vertex existence through the TigerGraph client instead of
  hand-built requests
- Use lowercase document ids consistently across the batch ingest path
chengbiao-jin merged commit 8dfcadb into main Jun 23, 2026
1 check failed
chengbiao-jin deleted the release_1.4.2 branch June 23, 2026 22:38