This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
```bash
# Install all Python dependencies (run from repo root)
uv sync --all-packages

# Quality checks (run from repo root)
uv run ruff check .   # Linting
uvx ty check          # Type checking
uv run pytest -v      # Run all tests

# Run a single test
uv run pytest apps/api/tests/test_health.py::test_health_check -v

# Start API server (run from repo root)
uv run uvicorn apps.api.app.main:app --reload --port 8000

# Start frontend dev server (from apps/web)
cd apps/web && npm install && npm run dev

# Docker (full stack)
docker compose up --build

# Database migrations (run from packages/core)
cd packages/core && DATABASE_URL=postgresql+asyncpg://contextmine:contextmine@localhost:5432/contextmine uv run alembic upgrade head

# Run migrations in Docker
docker compose exec api sh -c "cd /app/packages/core && alembic upgrade head"
```

ContextMine is a documentation/code indexing system exposing context via MCP (Model Context Protocol).
- `apps/api`: FastAPI backend with the MCP server mounted at `/mcp`
- `apps/web`: React (Vite) admin console frontend
- `apps/worker`: Prefect worker for background sync jobs
- `packages/core`: Shared Python library (settings, DB models, services)
- `rust/spider_md`: Rust-based web crawler binary (HTML→Markdown)
Uses uv workspaces. The root pyproject.toml defines:
- Workspace members: `apps/api`, `apps/worker`, `packages/*`
- Dev dependencies: ruff, ty, pytest, httpx
- Shared source: the `contextmine-core` package
The FastAPI app (`apps/api/app/main.py`) mounts:
- REST routes under `/api/*` (health, auth, collections, sources, etc.)
- The MCP server at `/mcp`, using the Streamable HTTP transport
The MCP server exposes these tools: `get_markdown` (semantic search), `list_collections`, `list_documents`, `outline`, `find_symbol`, `definition`, `references`, `expand`, `deep_research`.
- Backend routes: `/api/*`
- MCP endpoint: `/mcp` (Streamable HTTP)
- Environment config: `.env.example` documents all env vars
- Incremental builds: after each step, `uv sync`, `ruff check`, `ty check`, and `pytest` should all pass
- Backend: Python 3.12, FastAPI, SQLAlchemy 2.x async, Alembic, Postgres+pgvector
- Frontend: React, Vite, TypeScript
- Orchestration: Prefect (scheduled syncs)
- Crawling: spider-rs (Rust binary for HTML→Markdown)
- Retrieval: Hybrid FTS + vector search with RRF ranking
- LLM providers: OpenAI, Anthropic, Gemini (embeddings: OpenAI/Gemini only)
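The hybrid retrieval mentioned above merges full-text and vector rankings with Reciprocal Rank Fusion. A minimal sketch of RRF follows; the function name and the `k=60` default are illustrative assumptions, not the project's actual API:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked result lists.

    Each document scores 1 / (k + rank) per list it appears in; summing
    across lists rewards documents ranked well by multiple retrievers.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

fts_hits = ["a", "b", "c"]     # hypothetical FTS ranking
vector_hits = ["b", "c", "a"]  # hypothetical vector-search ranking
fused = rrf_merge([fts_hits, vector_hits])
```

A document ranked first by one retriever and last by the other can still beat a document ranked mid-list by both, which is the point of the fusion.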
This section tracks the implementation of the Knowledge Graph / Derived Knowledge subsystem (see the Knowledge Graph docs).
| Step | Description | Status | Notes |
|---|---|---|---|
| 1 | Knowledge Graph storage layer | DONE | Models + migration 013 |
| 2 | Graph builder from indexing output | DONE | Builder + tests (skipped for SQLite) |
| 3 | ERM extraction + Mermaid ERD | DONE | AST parser + 10 tests |
| 4 | System Surface Catalog | DONE | OpenAPI, GraphQL, Proto, Jobs |
| 5 | Business Rule candidate mining | DONE | Tree-sitter AST + 15 tests |
| 6 | LLM labeling (RuleCandidate→BusinessRule) | DONE | Pydantic schemas + 12 tests |
| 7 | GraphRAG retrieval | DONE | Bundle + neighborhood + path + 12 tests |
| 8 | MCP tools for Claude Code | DONE | 7 tools + 14 tests |
| 9 | arc42 Architecture Twin | DONE | Generator + drift report + 17 tests |
| 10 | Final hardening + e2e tests | PENDING | |
### Step 1: Knowledge Graph storage layer

Schema decisions:
- Generic node/edge tables with a `kind` enum instead of one-table-per-concept
- Natural key constraint `(collection_id, kind, natural_key)` for idempotent upserts
- Evidence stored separately, with links to nodes/edges/artifacts via join tables
- Column named `meta` (not `metadata`, which is reserved by SQLAlchemy)
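The natural-key constraint makes upserts idempotent: re-inserting the same `(collection_id, kind, natural_key)` updates in place instead of duplicating. A sketch of the pattern, using SQLite's `ON CONFLICT` clause for illustration (production uses Postgres; table and column names here are simplified assumptions):

```python
import sqlite3

def upsert_node(conn, collection_id, kind, natural_key, title):
    # ON CONFLICT on the natural key turns a repeat insert into an update
    conn.execute(
        """
        INSERT INTO knowledge_node (collection_id, kind, natural_key, title)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (collection_id, kind, natural_key)
        DO UPDATE SET title = excluded.title
        """,
        (collection_id, kind, natural_key, title),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE knowledge_node (
        id INTEGER PRIMARY KEY,
        collection_id INTEGER,
        kind TEXT,
        natural_key TEXT,
        title TEXT,
        UNIQUE (collection_id, kind, natural_key)
    )
    """
)
upsert_node(conn, 1, "FILE", "src/main.py", "main")
upsert_node(conn, 1, "FILE", "src/main.py", "entrypoint")  # updates, no duplicate row
rows = conn.execute("SELECT title FROM knowledge_node").fetchall()
```

Running the same sync twice therefore leaves the graph unchanged rather than doubling it.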
Files created/modified:
- `packages/core/contextmine_core/models.py` - Added enums + 7 new tables
- `packages/core/alembic/versions/013_add_knowledge_graph.py` - Migration
- `packages/core/contextmine_core/knowledge/__init__.py` - Module init
- `packages/core/contextmine_core/knowledge/schemas.py` - Pydantic schemas
### Step 2: Graph builder from indexing output

What was built:
- `build_knowledge_graph_for_source()` - Creates FILE/SYMBOL nodes and edges
- `cleanup_orphan_nodes()` - Removes nodes for deleted documents
- Edges: FILE_DEFINES_SYMBOL, SYMBOL_CONTAINS_SYMBOL, plus symbol edges from the SymbolEdge table
- Evidence creation linking nodes to source locations
Integration point:
- Call `build_knowledge_graph_for_source(session, source_id)` after symbol indexing in the sync pipeline
Files created:
- `packages/core/contextmine_core/knowledge/builder.py` - Builder functions
- `packages/core/tests/test_knowledge_graph.py` - Tests (require PostgreSQL)
### Step 3: ERM extraction + Mermaid ERD

What was built:
- `alembic.py` - AST-based parser for Alembic migration files (no regex)
  - Extracts `op.create_table()`, `op.add_column()`, and `op.create_foreign_key()` calls
  - Parses column definitions including types, nullability, primary keys, foreign keys
- `erm.py` - ERM schema builder and Mermaid ERD generator
  - `ERMExtractor` - Consolidates the schema from multiple migration files
  - `generate_mermaid_erd()` - Creates Mermaid ER diagram syntax
  - `build_erm_graph()` - Creates DB_TABLE/DB_COLUMN nodes and edges
  - `save_erd_artifact()` - Stores the ERD as a KnowledgeArtifact
Integration point:
- Call `ERMExtractor.extract_from_directory(alembic_dir)` to parse migrations
- Call `build_erm_graph(session, collection_id, schema)` to populate the knowledge graph
- Call `save_erd_artifact(session, collection_id, schema)` to store the Mermaid ERD
Files created:
- `packages/core/contextmine_core/analyzer/__init__.py` - Module init
- `packages/core/contextmine_core/analyzer/extractors/__init__.py` - Extractors init
- `packages/core/contextmine_core/analyzer/extractors/alembic.py` - Alembic parser
- `packages/core/contextmine_core/analyzer/extractors/erm.py` - ERM builder
- `packages/core/tests/test_erm_extractor.py` - 10 tests (all passing)
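The shape of the Mermaid output can be sketched as follows. This is a simplified stand-in for `generate_mermaid_erd()`: the input format (a dict of table name to `(column, type, is_pk)` tuples) is an assumption for illustration, not the real `ERMExtractor` schema types:

```python
def generate_mermaid_erd(tables):
    """Render a minimal Mermaid ER diagram from {table: [(col, type, is_pk)]}."""
    lines = ["erDiagram"]
    for table, columns in tables.items():
        lines.append(f"    {table} {{")
        for name, col_type, is_pk in columns:
            pk = " PK" if is_pk else ""  # Mermaid marks primary keys with PK
            lines.append(f"        {col_type} {name}{pk}")
        lines.append("    }")
    return "\n".join(lines)

erd = generate_mermaid_erd(
    {"users": [("id", "int", True), ("email", "text", False)]}
)
```

The resulting text block is what gets stored as a KnowledgeArtifact and rendered by any Mermaid-capable viewer.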
### Step 4: System Surface Catalog

What was built:
- `openapi.py` - OpenAPI 3.x specification parser (YAML/JSON)
  - Extracts endpoints, operations, request/response schemas
- `graphql.py` - GraphQL schema parser
  - Extracts types, operations (Query/Mutation/Subscription), fields
- `protobuf.py` - Protobuf (.proto) parser
  - Extracts messages, services, RPCs, enums
- `jobs.py` - Job definition parser for:
  - GitHub Actions workflows (handles the YAML 1.1 `on` → True issue)
  - Kubernetes CronJobs
  - Prefect deployments
- `surface.py` - Unified surface catalog builder
  - `SurfaceCatalogExtractor` - Auto-detects and processes spec files
  - `build_surface_graph()` - Creates knowledge graph nodes/edges
Node kinds added:
- GRAPHQL_TYPE, SERVICE_RPC (added to models.py)
Edge kinds added:
- RPC_USES_MESSAGE (added to models.py)
Integration point:
- Call `SurfaceCatalogExtractor().add_file(path, content)` for each spec file
- Call `build_surface_graph(session, collection_id, catalog)` to populate the knowledge graph
Files created:
- `packages/core/contextmine_core/analyzer/extractors/openapi.py`
- `packages/core/contextmine_core/analyzer/extractors/graphql.py`
- `packages/core/contextmine_core/analyzer/extractors/protobuf.py`
- `packages/core/contextmine_core/analyzer/extractors/jobs.py`
- `packages/core/contextmine_core/analyzer/extractors/surface.py`
- `packages/core/tests/test_surface_extractors.py` - 14 tests (all passing)
### Step 5: Business Rule candidate mining

What was built:
- `rules.py` - Rule candidate extractor using the Tree-sitter AST
  - Detects conditional branches leading to failure actions
  - Python: `if condition: raise Exception` patterns
  - Python: `assert` statements
  - TypeScript/JavaScript: `if (condition) throw Error` patterns
  - Captures predicate text, failure text, container function, evidence
  - Heuristic confidence scoring based on validation keywords
Failure kinds detected:
- RAISE_EXCEPTION (Python raise)
- THROW_ERROR (JS/TS throw)
- RETURN_ERROR (return null/None/error)
- ASSERT_FAIL (assert statements)
Integration point:
- Call `extract_rule_candidates(file_path, content)` to get candidates
- Call `build_rule_candidates_graph(session, collection_id, extractions)` to populate the knowledge graph
- Natural key `rule:{file_path}:{start_line}:{content_hash}` for idempotent upserts
Files created:
- `packages/core/contextmine_core/analyzer/extractors/rules.py` - Rule extractor
- `packages/core/tests/test_rule_extractor.py` - 15 tests (all passing)
### Step 6: LLM labeling (RuleCandidate → BusinessRule)

What was built:
- `labeling.py` - LLM-based rule candidate labeling service
  - `BusinessRuleOutput` - Pydantic schema for structured LLM output
  - `label_rule_candidates()` - Main labeling function
  - Content hash for idempotency (skips unchanged candidates)
  - Creates BUSINESS_RULE nodes with edges to source candidates
  - Links evidence and citations
Key features:
- Temperature 0 for deterministic output
- Strict JSON schema validation via Pydantic
- LLM only labels, never discovers new rules
- Idempotent: content hash prevents relabeling unchanged candidates
- Categories: validation, authorization, invariant, constraint, other
- Severity levels: error, warning, info
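The idempotency mechanism boils down to hashing the candidate's content and skipping the (expensive) LLM call when the hash is unchanged. A sketch, where the hashed fields are illustrative assumptions rather than the actual schema:

```python
import hashlib

def candidate_content_hash(file_path, predicate, failure):
    """Stable content hash over a rule candidate's identifying fields."""
    payload = "\n".join([file_path, predicate, failure]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Same content -> same hash -> labeling is skipped on re-sync
h1 = candidate_content_hash("billing.py", "amount > balance", "raise ValueError(...)")
h2 = candidate_content_hash("billing.py", "amount > balance", "raise ValueError(...)")
# Any edit to the guard changes the hash and triggers relabeling
h3 = candidate_content_hash("billing.py", "amount >= balance", "raise ValueError(...)")
```

Storing the hash alongside the BUSINESS_RULE node lets the labeler compare before calling the LLM at all.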
Integration point:
- Call `label_rule_candidates(session, collection_id, provider)` after rule candidate mining
- Pass a configured LLMProvider (from `contextmine_core.research.llm`)
Files created:
- `packages/core/contextmine_core/analyzer/labeling.py` - Labeling service
- `packages/core/tests/test_rule_labeling.py` - 12 tests (all passing)
### Step 7: GraphRAG retrieval

What was built:
- `graphrag.py` - Graph-augmented retrieval service following the Microsoft GraphRAG approach
  - `graph_rag_context()` - Context retrieval combining:
    - Hybrid search to find relevant documents/chunks
    - Mapping search hits to Knowledge Graph nodes
    - Expanding the graph neighborhood (configurable depth)
    - Gathering evidence citations
    - Building a ContextPack with communities + entities + edges
  - `graph_rag_query()` - Full map-reduce answering using an LLM
  - `graph_neighborhood()` - Local exploration from a single node
  - `trace_path()` - BFS shortest path between two nodes
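The path tracing above is plain breadth-first search over the edge table. A minimal in-memory sketch of the idea (the real `trace_path()` is DB-backed; the adjacency-dict input and `max_hops` default are assumptions for illustration):

```python
from collections import deque

def trace_path(adjacency, start, goal, max_hops=6):
    """BFS shortest path over a node-id adjacency map, or None if not found."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_hops:  # hop limit keeps exploration bounded
            continue
        for neighbor in adjacency.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
path = trace_path(graph, "a", "d")
```

Because BFS explores by distance, the first path that reaches the goal is guaranteed to be a shortest one, which is what dependency-analysis answers need.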
Key features:
- Maps search results to FILE nodes via document_id or URI
- BFS-based neighborhood expansion with depth limit
- Community-aware retrieval (global + local context)
- Evidence gathering from KnowledgeEvidence table
- Markdown rendering with node categorization (FILE, SYMBOL, DB_TABLE, etc.)
- Returns ContextPack with communities, entities, edges, paths, citations
Output formats:
- `ContextPack.to_markdown()` - Human-readable summary with citations
- `ContextPack.to_dict()` - JSON-serializable structure
- Evidence citations in `file_path:start_line-end_line` format
Integration point:
- Call `graph_rag_context(session, query, collection_id, user_id)` for context retrieval
- Call `graph_rag_query(session, query, collection_id, user_id, provider)` for answered queries
- Call `graph_neighborhood(session, node_id)` for local exploration
- Call `trace_path(session, from_node_id, to_node_id)` for dependency analysis
Files created:
- `packages/core/contextmine_core/graphrag.py` - GraphRAG service
- `packages/core/tests/test_graphrag.py` - 12 tests (all passing)
### Step 8: MCP tools for Claude Code

What was built: 7 new MCP tools exposed via the FastMCP server for Claude Code / Cursor:
- `list_business_rules(collection_id?, query?)` - List extracted business rules
- `get_business_rule(rule_id)` - Full rule details with evidence
- `get_erd(collection_id?, format?)` - ERD as Mermaid or JSON
- `list_system_surfaces(collection_id?, kind?, limit?)` - API endpoints, jobs, schemas
- `graph_neighborhood(node_id, depth?, edge_kinds?, limit?)` - Local graph exploration
- `trace_path(from_node_id, to_node_id, max_hops?)` - Shortest path between nodes
- `graph_rag(query, collection_id?, max_depth?, max_results?)` - Graph-augmented retrieval
Key features:
- All tools return Markdown for assistant consumption
- Structured data available via parameters (`format="json"`)
- Input validation with helpful error messages
- Access control respects collection visibility
- Depth/hop limits capped for safety
- Updated MCP server instructions to guide tool selection
Integration:
- Tools are auto-registered via the `@mcp.tool()` decorator
- Access the MCP server at the `/mcp` endpoint
- Tools call into graphrag.py, models.py, and search.py
Files modified:
- `apps/api/app/mcp_server.py` - Added 7 new tools + updated instructions

Files created:
- `apps/api/tests/test_mcp_knowledge.py` - 14 tests (all passing)
### Step 9: arc42 Architecture Twin

What was built:
- `arc42.py` - Architecture documentation generator
  - `generate_arc42()` - Generates a full arc42 document from extracted facts
  - `save_arc42_artifact()` - Stores the document as a KnowledgeArtifact
  - `compute_drift_report()` - Compares stored vs current state
arc42 Sections generated:
- Context - System boundary, external interfaces (API counts)
- Building Blocks - Components, database schema, symbol counts
- Runtime View - Entry points, execution flows
- Deployment View - Jobs, workflows from manifests
- Crosscutting Concepts - Validation patterns, security hints
- Risks & Technical Debt - Unreviewed candidates, TODOs
- Glossary - Domain terms from database schema
Key features:
- Every statement is evidence-backed or explicitly marked "inferred"
- Drift detection compares the stored artifact with the current graph state
- Supports caching via a `regenerate` parameter
- Section-specific retrieval via a `section` parameter
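At its core, drift detection is a set comparison between the natural keys recorded in the stored artifact and those in the current graph. A minimal sketch of the idea (the real `compute_drift_report()` works on richer node data; the function shape here is an assumption):

```python
def compute_drift(stored_keys, current_keys):
    """Diff two collections of natural keys into added/removed/unchanged."""
    stored, current = set(stored_keys), set(current_keys)
    return {
        "added": sorted(current - stored),      # in the graph, not yet documented
        "removed": sorted(stored - current),    # documented, gone from the graph
        "unchanged": sorted(stored & current),
    }

report = compute_drift(
    stored_keys=["file:src/a.py", "table:users"],
    current_keys=["file:src/a.py", "table:orders"],
)
```

Non-empty `added` or `removed` buckets are exactly the signal that the stored arc42 document has drifted and should be regenerated.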
MCP tools added:
- `get_arc42(collection_id?, section?, regenerate?)` - Get the architecture doc
- `arc42_drift_report(collection_id?)` - Show what changed
Files created:
- `packages/core/contextmine_core/analyzer/arc42.py` - Generator
- `packages/core/tests/test_arc42.py` - 10 tests

Files modified:
- `apps/api/app/mcp_server.py` - Added 2 arc42 MCP tools
- `apps/api/tests/test_mcp_knowledge.py` - Added 7 arc42 schema tests
- `packages/core/contextmine_core/models.py` - Fixed MERMAID_ERD enum name