Commit 9761fb4

feat(rag): add searchable metadata context to vector chunks for filename/path queries
Problem:
When codebases and documents are ingested into VectorRAG, content loses its source context. Agents cannot answer queries like "What's in ChatWidget.swift?" or "Show me PDFs about X" because filenames and paths are stored in metadata but not searchable via semantic search.

Root Cause:
Vector RAG stores filename/path in chunk metadata dictionary, but semantic search only searches the chunk CONTENT field. The context field exists but wasn't being used for source identification - it just had generic text like "Text from document.title" or "Code from document.title".

Solution:
Enhanced the context field (which IS searchable) for all chunk types with structured metadata that enables filename, path, and source-based queries:

Code Files:
- Context: "File: {filename} | Path: {relativePath} | Type: code"
- Example: "File: ChatWidget.swift | Path: Sources/UserInterface/Chat/ | Type: code"
- Enables: "Show me Swift files in UserInterface", "What's in ChatWidget.swift?"

Documents:
- Context: "Document: {filename} | Type: {format}"
- Example: "Document: report.pdf | Type: PDF"
- Enables: "Find PDF documents about X", "Show me Word docs"

Web Content:
- Context: "Web: {title} | Source: {domain}"
- Example: "Web: Swift Documentation | Source: swift.org"
- Enables: "What did we research from swift.org?", "Show web content about X"

Conversations:
- Context: "Conversation: {title} | Turn: {turnNumber}"
- Example: "Conversation: Mermaid Fixes | Turn: 5"
- Enables: "What did we discuss in conversation X?", "Find turn 5"

Changes:
- VectorRAGService.chunkCodeDocument(): Enhanced code chunk context
- VectorRAGService.chunkTextDocument(): Enhanced document/web chunk context
- VectorRAGService.chunkConversationDocument(): Enhanced conversation context

Testing:
✅ Build: PASS
✅ All chunk types create context with source metadata
✅ Metadata remains backward compatible (stored in metadata dict)

Impact:
Agents can now answer filename and path-based queries without violating vector DB semantic search patterns. The context field becomes a rich source of structured metadata while maintaining its searchable nature.

Training Use Case:
When exporting training data, filenames are now included in context, making it easier to correlate training examples with source files for better model understanding of codebase structure.
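As a rough illustration of the four context formats listed in the commit message, the sketch below builds each string from a small enum. `ChunkSource` and `contextString(for:)` are hypothetical names invented for this example; the actual commit builds the strings inline inside the chunking methods.

```swift
// Hypothetical stand-in for the four chunk kinds named in the commit message.
// These types are illustrative only and do not exist in the project.
enum ChunkSource {
    case code(filename: String, relativePath: String)
    case document(filename: String, format: String)
    case web(title: String, domain: String)
    case conversation(title: String, turnNumber: Int)
}

/// Builds the pipe-delimited context string for one chunk,
/// following the templates given in the commit message.
func contextString(for source: ChunkSource) -> String {
    switch source {
    case let .code(filename, relativePath):
        return "File: \(filename) | Path: \(relativePath) | Type: code"
    case let .document(filename, format):
        return "Document: \(filename) | Type: \(format)"
    case let .web(title, domain):
        return "Web: \(title) | Source: \(domain)"
    case let .conversation(title, turnNumber):
        return "Conversation: \(title) | Turn: \(turnNumber)"
    }
}

// Reproduces the "Documents" example from the commit message.
print(contextString(for: .document(filename: "report.pdf", format: "PDF")))
// prints "Document: report.pdf | Type: PDF"
```

Keeping every format in one place like this is only one possible design; it would make the templates easier to keep consistent, at the cost of an extra type.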
1 parent 8e122a1 commit 9761fb4

3 files changed

Lines changed: 98 additions & 13 deletions

Info.plist

Lines changed: 2 additions & 2 deletions
@@ -19,9 +19,9 @@
 <key>CFBundlePackageType</key>
 <string>APPL</string>
 <key>CFBundleShortVersionString</key>
-<string>20260113.1</string>
+<string>20260116.1</string>
 <key>CFBundleVersion</key>
-<string>20260113.1</string>
+<string>20260116.1</string>
 <key>LSApplicationCategoryType</key>
 <string>public.app-category.productivity</string>
 <key>LSMinimumSystemVersion</key>

Resources/whats-new.json

Lines changed: 55 additions & 6 deletions
@@ -1,20 +1,69 @@
 {
   "releases": [
     {
-      "version": "20260120.1",
-      "release_date": "January 20, 2026",
-      "introduction": "This release adds batch codebase import capabilities, making it easier to ingest entire repositories into SAM's memory system for enhanced code understanding and assistance.",
+      "version": "20260116.1",
+      "release_date": "January 16, 2026",
+      "introduction": "This release brings comprehensive Mermaid diagram rendering improvements, batch codebase import capabilities, web UI integration support, and critical bug fixes for editor stability and authentication.",
       "highlights": [
+        {
+          "id": "mermaid-improvements",
+          "icon": "chart.bar.doc.horizontal",
+          "title": "Comprehensive Mermaid Diagram Improvements",
+          "description": "Complete overhaul of Mermaid diagram rendering with intelligent edge routing, obstacle avoidance, and adaptive node spacing for flowcharts. Sequence diagrams now parse all arrow types correctly. Bar and XY charts support the 'series' keyword with space-separated data. Mind maps display full hierarchies with proper indentation normalization. All diagram types from Mermaid syntax now render reliably with professional layout and routing."
+        },
         {
           "id": "codebase-ingest",
           "icon": "folder.badge.gearshape",
           "title": "Batch Codebase Import",
           "description": "Import entire codebases into SAM's memory with a single operation. The new ingest_codebase feature recursively scans directories, filters by file patterns, and batch-imports code files into VectorRAG memory. Supports optional JSONL training export, excludes build artifacts and dependencies, and safely limits imports to 500 files by default. Perfect for giving SAM comprehensive understanding of your project structure."
+        },
+        {
+          "id": "web-ui-endpoints",
+          "icon": "globe",
+          "title": "Web UI Integration Support",
+          "description": "New REST API endpoints enable external web UI access to SAM's features. Share conversation topics and mini-prompts with web interfaces for seamless integration with browser-based tools and custom dashboards."
         }
       ],
-      "bugfixes": [],
-      "improvements": [],
-      "developer_notes": []
+      "bugfixes": [
+        {
+          "id": "editor-race-condition",
+          "icon": "exclamationmark.triangle.fill",
+          "title": "Fixed Editor Race Condition on Personality Changes",
+          "description": "Resolved critical race condition where changing the default personality could cause editor initialization failures. System now properly inherits personality settings without timing conflicts."
+        },
+        {
+          "id": "auth-storage-fix",
+          "icon": "key.fill",
+          "title": "Replaced Keychain with UserDefaults for API Tokens",
+          "description": "Moved API token storage from macOS Keychain to UserDefaults to resolve permission and access issues. Tokens are now reliably persisted and accessible across app launches."
+        },
+        {
+          "id": "sd-model-dynamic",
+          "icon": "arrow.triangle.2.circlepath",
+          "title": "Stable Diffusion Model List Updates Dynamically",
+          "description": "Fixed issue where Stable Diffusion model list wouldn't update after downloading new models. List now refreshes automatically when models are added or removed."
+        },
+        {
+          "id": "collab-persistence",
+          "icon": "pin.fill",
+          "title": "User Collaboration Tool Response Persistence",
+          "description": "User responses in collaboration checkpoints are now properly persisted and pinned to prevent loss during long-running sessions."
+        }
+      ],
+      "improvements": [
+        {
+          "id": "prompt-refactor",
+          "icon": "text.alignleft",
+          "title": "Cleaner Prompt Generation Architecture",
+          "description": "Refactored tool guidance generation into dedicated buildToolUsage() method for better maintainability and separation of concerns in system prompt construction."
+        }
+      ],
+      "developer_notes": [
+        {
+          "id": "commit-workflow",
+          "content": "Updated commit message workflow to write messages to files before committing, preventing UTF-8 encoding corruption in multi-line messages."
+        }
+      ]
     },
     {
       "version": "20260113.1",

Sources/ConversationEngine/VectorRAGService.swift

Lines changed: 41 additions & 5 deletions
@@ -810,9 +810,21 @@ public class DocumentChunker: @unchecked Sendable {
 /// Create chunk from accumulated sentences.
 let chunkContent = currentChunk.trimmingCharacters(in: .whitespacesAndNewlines)
 if chunkContent.count >= minChunkSize {
+    // Build enhanced searchable context based on document type
+    var context: String
+    if document.type == .web {
+        let sourceURL = (document.metadata["sourceURL"] as? String) ?? ""
+        let domain = URL(string: sourceURL)?.host ?? "unknown"
+        context = "Web: \(document.title) | Source: \(domain)"
+    } else {
+        let filename = (document.metadata["filename"] as? String) ?? document.title
+        let format = (document.metadata["format"] as? String) ?? document.type.rawValue
+        context = "Document: \(filename) | Type: \(format)"
+    }
+
     let chunk = DocumentChunk(
         content: chunkContent,
-        context: "Text from \(document.title)",
+        context: context,
         importance: calculateChunkImportance(chunkContent),
         metadata: ["source": document.title, "type": "text", "chunk_size": chunkContent.count]
     )
@@ -835,9 +847,21 @@ public class DocumentChunker: @unchecked Sendable {
 if !currentChunk.isEmpty {
     let chunkContent = currentChunk.trimmingCharacters(in: .whitespacesAndNewlines)
     if chunkContent.count >= minChunkSize {
+        // Build enhanced searchable context based on document type
+        var context: String
+        if document.type == .web {
+            let sourceURL = (document.metadata["sourceURL"] as? String) ?? ""
+            let domain = URL(string: sourceURL)?.host ?? "unknown"
+            context = "Web: \(document.title) | Source: \(domain)"
+        } else {
+            let filename = (document.metadata["filename"] as? String) ?? document.title
+            let format = (document.metadata["format"] as? String) ?? document.type.rawValue
+            context = "Document: \(filename) | Type: \(format)"
+        }
+
         let chunk = DocumentChunk(
             content: chunkContent,
-            context: "Text from \(document.title)",
+            context: context,
             importance: calculateChunkImportance(chunkContent),
             metadata: ["source": document.title, "type": "text", "chunk_size": chunkContent.count]
         )
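Review note: the web branch in the two hunks above falls back from the `sourceURL` metadata value to the URL host and finally to "unknown". The sketch below isolates that chain so its edge cases are easier to see. `webDomain(from:)` is a hypothetical helper, and the `[String: Any]` metadata type is an assumption inferred from the `as? String` casts in the diff.

```swift
import Foundation

/// Hypothetical helper mirroring the fallback chain in the hunks above:
/// metadata["sourceURL"] -> URL host -> "unknown".
/// The [String: Any] metadata type is an assumption, not confirmed by the diff.
func webDomain(from metadata: [String: Any]) -> String {
    let sourceURL = (metadata["sourceURL"] as? String) ?? ""
    return URL(string: sourceURL)?.host ?? "unknown"
}

// A fully qualified URL yields its host.
print(webDomain(from: ["sourceURL": "https://swift.org/documentation/"]))
// prints "swift.org"

// Missing metadata ends at the "unknown" fallback.
print(webDomain(from: [:]))
// prints "unknown"
```

A source URL stored without a scheme (for example `swift.org/documentation/`) would most likely also end up as "unknown", because `URL.host` is only populated when the string parses with an authority component. If such values can occur during ingestion, normalizing them beforehand may be worth considering.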
@@ -933,9 +957,15 @@ public class DocumentChunker: @unchecked Sendable {
 
 /// Simple end detection.
 if trimmedLine == "}" || (trimmedLine.isEmpty && currentFunction.count > 500) {
+    // Extract filename and path from metadata
+    let filename = (document.metadata["filename"] as? String) ?? document.title
+    let filePath = (document.metadata["filePath"] as? String) ?? ""
+    let pathComponents = filePath.components(separatedBy: "/")
+    let relativePath = pathComponents.count > 2 ? pathComponents.suffix(3).joined(separator: "/") : filePath
+
     let chunk = DocumentChunk(
         content: currentFunction,
-        context: "Code function from \(document.title)",
+        context: "File: \(filename) | Path: \(relativePath) | Type: code",
         importance: 0.8,
         metadata: ["source": document.title, "type": "code", "language": detectLanguage(document.content)]
     )
@@ -948,9 +978,15 @@ public class DocumentChunker: @unchecked Sendable {
 
 /// Handle remaining content.
 if !currentFunction.isEmpty {
+    // Extract filename and path from metadata
+    let filename = (document.metadata["filename"] as? String) ?? document.title
+    let filePath = (document.metadata["filePath"] as? String) ?? ""
+    let pathComponents = filePath.components(separatedBy: "/")
+    let relativePath = pathComponents.count > 2 ? pathComponents.suffix(3).joined(separator: "/") : filePath
+
     let chunk = DocumentChunk(
         content: currentFunction,
-        context: "Code from \(document.title)",
+        context: "File: \(filename) | Path: \(relativePath) | Type: code",
         importance: 0.7,
         metadata: ["source": document.title, "type": "code"]
     )
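Review note: the `relativePath` expression in the two code hunks above keeps the last three "/"-separated components of `filePath`. The sketch below extracts that expression into a hypothetical function, `relativePathTail(of:)`, to show what it produces. The sample paths are invented, and it assumes `filePath` holds the full path to the file, which the diff does not confirm.

```swift
import Foundation

/// Hypothetical extraction of the `relativePath` expression from the hunks above:
/// keep the last three path components, or the whole path when it has two or fewer.
func relativePathTail(of filePath: String) -> String {
    let pathComponents = filePath.components(separatedBy: "/")
    return pathComponents.count > 2
        ? pathComponents.suffix(3).joined(separator: "/")
        : filePath
}

// Invented absolute path; the tail includes the filename itself.
print(relativePathTail(of: "/Users/dev/SAM/Sources/UserInterface/Chat/ChatWidget.swift"))
// prints "UserInterface/Chat/ChatWidget.swift"

// Two or fewer components are returned unchanged.
print(relativePathTail(of: "Chat/ChatWidget.swift"))
// prints "Chat/ChatWidget.swift"
```

Under that assumption, the resulting context would read "Path: UserInterface/Chat/ChatWidget.swift" rather than the "Path: Sources/UserInterface/Chat/" shown in the commit message example, so the filename would appear twice in the context string. Whether that matters for retrieval is a judgment call.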
@@ -969,7 +1005,7 @@ public class DocumentChunker: @unchecked Sendable {
 
 return DocumentChunk(
     content: trimmed,
-    context: "Conversation turn \(index + 1) from \(document.title)",
+    context: "Conversation: \(document.title) | Turn: \(index + 1)",
     importance: 0.6,
     metadata: ["source": document.title, "type": "conversation", "turn": index]
 )
