ir

An information-retrieval substrate for agentic systems — one uniform "find the relevant things in this corpus" contract that scales from an ad-hoc search over an ephemeral list to a maintained capability-discovery engine.

Give an agent one search tool, not fifty tool schemas. ir retrieves candidates, commits to a small high-precision subset (the distractor problem is the central selection risk — fewer, better candidates beat more), and discloses each committed item's payload only when asked.

import ir

# Define a corpus, build the index (incremental), then discover:
source = ir.CorpusSource.from_skills()       # or from_packages(), from_md_reports(),
                                              # from_claude_sessions(), from_files(...)
corpus = ir.build(source)                     # embed + persist under XDG dirs
result = ir.discover(corpus, "how do I deploy the app to the server")

for item in result.results:
    print(item.score, item.name)              # the committed few (or result.abstained)
print(result.to_dict())                       # JSON-serializable (qh / HTTP ready)

Install

pip install ir

ir is light by default — numpy + dol for storage, plus ef / vd for embedding and lexical/hybrid retrieval. Python ≥ 3.10.

Notes for the default (semantic) path:

The default embedder is all-MiniLM-L6-v2 (384-dim, via sentence-transformers), downloaded on first build (needs network) and cached under ~/.cache/ir. For a fast, offline, dependency-light run — tests, CI, quick experiments — pass embedder="light" (a numpy-only hashing embedder): ir.build(source, embedder="light").
ir sets USE_TF=0 on import so transformers does not pull in TensorFlow (which crashes on some numpy ABIs); import ir before anything that imports transformers.
Case generation (ir.eval_gen) and the optional LLM selector need an LLM via oa; install the extra with pip install "ir[llm]". Scoring and evaluation themselves stay offline.
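
To make the "light" option concrete, here is a minimal sketch of a hashing embedder in plain Python. It illustrates the general hashing-trick idea only; it is not ir's implementation, and hash_embed, the tokenizer regex, and dim=64 are assumptions chosen for the example.

```python
import hashlib
import math
import re


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Map text to a fixed-size unit vector with the hashing trick (illustrative)."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % dim
        # A signed update keeps colliding tokens from always adding up.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity because inputs are unit-norm."""
    return sum(x * y for x, y in zip(a, b))
```

No model download and no learned weights are involved, which is why an embedder of this kind suits tests and CI; the price is that it captures token overlap, not meaning.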

The pipeline

ir is a six-stage pipeline, each stage a small, swappable seam:

Stage	Entry point	What it does
source	`CorpusSource`	what is in the corpus + what counts as stale
index	`ir.build`	decompose artifacts into embeddable surfaces, embed, persist (incremental, idempotent)
retrieve	`ir.search`	hard metadata filter + `dense` / `lexical` / `hybrid` ranking
expand	`ir.expand`	stitch a hit's stored neighborhood (sentence window / whole artifact) into the passage downstream reads — opt-in
select	`ir.select`	commit to a distractor-robust subset, or abstain
disclose	`ir.disclose`	load the heavy payload (SKILL.md body, package pointer, file text) for committed items — append-only

ir.discover chains retrieve → select → disclose into the single agent-callable (and qh-exposable) tool. Pass a list of corpus names for single-shot federated discovery across several corpora:

ir.discover(["skills", "packages"], "deploy the app")    # fan-out → fuse → select
ir.discover(["skills", "packages"], q, min_score="auto") # gate each source on its own floor

Each source is searched and gated on its own calibrated abstention floor before any merging; the survivors then rank-fuse (weighted RRF via ir.fuse_hits) — raw scores never cross a source boundary, because scores from different corpora / embedders / modes live on incommensurable scales. Every hit carries its corpus name as hit.source, so same-id artifacts from different corpora stay distinct, attributable results. The caller names the sources; ir never chooses the set (source planning belongs to the agent layer — see raglab).
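
The gate-then-fuse order described above can be sketched in a few lines. This is an illustration under stated assumptions, not ir.fuse_hits: the federate function, its argument shapes, the choice to gate a whole source on its top score, and k=60 are all hypothetical.

```python
from collections import defaultdict


def federate(per_source_hits, floors, weights=None, k=60):
    """Gate each source on its own floor, then fuse survivors by weighted RRF.

    per_source_hits: {source: [(artifact_id, score), ...]} sorted best-first.
    floors:          {source: min_score}; a source without a floor is never gated.
    """
    weights = weights or {}
    fused = defaultdict(float)
    for source, hits in per_source_hits.items():
        floor = floors.get(source)
        # Assumption: a source abstains as a whole when its top score is below its floor.
        if floor is not None and (not hits or hits[0][1] < floor):
            continue
        for rank, (artifact_id, _score) in enumerate(hits, start=1):
            # Only the rank crosses the source boundary; the raw score is dropped.
            fused[(source, artifact_id)] += weights.get(source, 1.0) / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


hits = {
    "skills": [("deploy", 0.82), ("build", 0.40)],    # cosine-like scale
    "packages": [("deploy", 11.3), ("flask", 9.1)],   # BM25-like scale
    "notes": [("misc", 0.05)],                        # below its floor
}
fused = federate(hits, floors={"skills": 0.3, "packages": 5.0, "notes": 0.2})
```

Keying the fused table on (source, artifact_id) is what keeps the two "deploy" artifacts distinct, and the "notes" source contributes nothing because it failed its own gate before merging.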

Retrieve

hits = ir.search(corpus, "deploy app", mode="hybrid")   # dense | lexical | hybrid (RRF)

Dense is exact brute-force cosine; lexical is Okapi BM25; hybrid fuses both by Reciprocal Rank Fusion (the strongest default for short, identifier-heavy capability text). Lexical/hybrid reuse vd; dense needs only numpy.
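
As a rough illustration of two of the pieces named above, the sketch below ranks by exact brute-force cosine and fuses two rankings with Reciprocal Rank Fusion. It is plain Python for readability and does not reflect how ir or vd implement either step; the function names and the constant k=60 are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dense_rank(query_vec, doc_vecs):
    """Exact brute-force ranking: score every document, sort best-first."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense_ranking = ["deploy-app", "release-notes", "ssh-setup"]
lexical_ranking = ["ssh-setup", "deploy-app", "docker-build"]
fused = rrf([dense_ranking, lexical_ranking])
```

A document that appears near the top of both rankings outranks one that tops only a single ranking, which is the behavior that helps on short, identifier-heavy text where either ranker alone can miss.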

Hybrid has a second fusion, fusion="blend": a magnitude-preserving score blend instead of rank-based RRF. RRF discards score magnitude, and magnitude is exactly the signal abstention calibration needs, so blend separates in-scope from out-of-scope queries far better (and even beats dense). The tradeoff is lower lexical recall on terse corpora, so RRF stays the default. Use blend when abstention matters (see ir_08).
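
A small numeric sketch may help show why rank fusion hides the abstention signal. The blend formula here is an assumption made for illustration (a weighted sum of cosine and a squashed BM25 score); ir's actual blend may differ, and alpha, c, and the sample scores are made up.

```python
def rrf_top_score(n_rankings=2, k=60):
    """Best possible fused RRF score: rank 1 in every ranking."""
    return n_rankings / (k + 1)


def blend(dense_cos, bm25, alpha=0.5, c=5.0):
    """Magnitude-preserving blend of a cosine score and a squashed BM25 score.

    bm25 / (bm25 + c) maps [0, inf) into [0, 1); alpha and c are illustrative.
    """
    return alpha * dense_cos + (1 - alpha) * (bm25 / (bm25 + c))


# In-scope query: the top document matches strongly under both rankers.
in_scope = blend(dense_cos=0.82, bm25=14.0)
# Out-of-scope query: something still ranks first, but only weakly.
out_of_scope = blend(dense_cos=0.21, bm25=1.5)

# Under RRF both top documents are "rank 1 in both lists", so they tie.
rrf_in_scope = rrf_top_score()
rrf_out_of_scope = rrf_top_score()
```

With RRF the two queries produce identical top scores, so no floor can tell them apart; with the blend, the in-scope top score sits well above the out-of-scope one.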

Expand

passage = ir.expand(hit, corpus)                             # ±1 chunk window (default)
passage = ir.expand(hit, corpus, policy=ir.parent_policy())  # the whole artifact

The matched chunk is rarely the unit you want to read, and the whole document is usually too much. ir.expand stitches a hit's stored sibling segments into a mid-granularity Passage (retrieve → expand → rerank), with chunker overlap deduped and the hit's identity and score untouched. Policies are injectable (NeighborhoodPolicy); sentence_window_policy(k) and parent_policy() ship. It also works through the disclosure seam:

ir.disclose(sel, expand=ir.sentence_window_policy(2), corpus=corpus)
ir.discover("skills", q, expand=ir.sentence_window_policy())  # passages on committed items

Each Disclosure then carries the stitched text as .passage (additive; summary stays the matched surface, body stays the pointer's payload).
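
The stitching idea behind a sentence-window policy can be sketched as follows. This is not ir.expand; sentence_window, merge_overlap, and the min_overlap guard are hypothetical names, and real chunk storage would carry offsets instead of relying on string matching.

```python
def merge_overlap(left: str, right: str, min_overlap: int = 8) -> str:
    """Append right to left, dropping the longest suffix/prefix overlap.

    Overlaps shorter than min_overlap are ignored to avoid accidental matches.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + " " + right


def sentence_window(chunks, hit_index, k=1):
    """Stitch chunks[hit_index - k : hit_index + k + 1] into one passage."""
    lo = max(0, hit_index - k)
    hi = min(len(chunks), hit_index + k + 1)
    passage = chunks[lo]
    for chunk in chunks[lo + 1:hi]:
        passage = merge_overlap(passage, chunk)
    return passage


chunks = [
    "Install the CLI. Then log in.",
    "Then log in. Run deploy.",
    "Run deploy. Check the logs.",
    "Check the logs. Done.",
]
passage = sentence_window(chunks, hit_index=1, k=1)
```

The hit itself is only an index here; nothing about its identity or score changes, which mirrors the additive contract described above.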

Select

sel = ir.select(hits)                      # conservative default: stay within rel of top, cap at max_k
sel = ir.select(hits, min_score=0.4)       # opt in to abstention ("nothing applies")
sel = ir.select(hits, strategy="score_gap")  # elbow cut, or "top_k" / "rel_threshold" / a callable

The abstention floor is mode-specific (dense cosine, BM25, and RRF live on different scales), so rather than guess min_score, calibrate it from a case file and let discover load it:

ev.calibrate_min_score(corpus, cases, mode="dense", persist=True)  # learn + store the floor
ir.discover(corpus, query, mode="dense", min_score="auto")         # abstain by the calibrated floor

Calibration separates in-scope from out-of-scope query top-scores and picks the floor that best splits them — see ir_07; it works best on dense / lexical (hybrid's RRF scores barely separate). min_score defaults to None (never abstain), so abstention stays fully opt-in.
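
For intuition, floor calibration can be approximated as a one-dimensional threshold search. The sketch below picks the threshold with the best balanced accuracy; that criterion, the function name, and the sample scores are assumptions, and ir_07 may describe a different objective.

```python
def calibrate_floor(in_scope_tops, out_scope_tops):
    """Pick the threshold that best splits in-scope from out-of-scope top scores.

    Candidates are midpoints between adjacent distinct observed scores.
    Returns (floor, balanced_accuracy).
    """
    values = sorted(set(in_scope_tops) | set(out_scope_tops))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best = None
    for t in candidates:
        kept = sum(s >= t for s in in_scope_tops) / len(in_scope_tops)        # in-scope kept
        rejected = sum(s < t for s in out_scope_tops) / len(out_scope_tops)   # out-of-scope rejected
        balanced = (kept + rejected) / 2
        if best is None or balanced > best[1]:
            best = (t, balanced)
    return best


floor, balanced = calibrate_floor(
    in_scope_tops=[0.71, 0.64, 0.58, 0.80],
    out_scope_tops=[0.22, 0.35, 0.41, 0.30],
)
```

When the two groups of top scores overlap heavily, as the text notes for RRF, no threshold reaches a useful balanced accuracy, and the returned value makes that visible.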

The conservative defaults (max_k=3, rel=0.9) are tuned, not guessed — see ir_06; re-tune for your own corpus with ev.sweep_selector / ir sweep-select.

Selection is relative (ratios to the top score), so one selector works across dense / hybrid / lexical whose absolute scales differ by orders of magnitude. The result carries auditable signals and a reason — no opaque "confidence" float. An optional LLM selector (make_llm_selector, lazy on oa, injectable for tests) falls back to the heuristic on any failure.
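
A minimal sketch of relative selection, assuming positive scores sorted best-first. It is not ir.select; the function name and return shape are hypothetical, with rel and max_k set to the defaults quoted above.

```python
def select_relative(hits, rel=0.9, max_k=3, min_score=None):
    """hits: [(name, score), ...] sorted best-first. Returns (kept, reason)."""
    if not hits:
        return [], "abstain: no candidates"
    top = hits[0][1]
    if min_score is not None and top < min_score:
        return [], f"abstain: top score {top} is below floor {min_score}"
    # Keep only hits whose score is within a ratio of the top score, then cap.
    kept = [h for h in hits if h[1] >= rel * top][:max_k]
    return kept, f"kept {len(kept)} within {rel} of top, cap {max_k}"


cosine_hits = [("deploy", 0.82), ("release", 0.78), ("docs", 0.50)]
bm25_hits = [("deploy", 14.0), ("release", 13.1), ("docs", 6.0)]

kept_cosine, reason_cosine = select_relative(cosine_hits)
kept_bm25, reason_bm25 = select_relative(bm25_hits)
```

The same rel value keeps the same two items on both scales, because only the ratio to the top score matters, and the reason string stands in for the auditable signal.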

Disclose

payloads = ir.disclose(sel, level="body")  # "metadata" (no I/O) | "body" | "bundled"

Disclosure is a pure read that follows the pointer already stored on each hit (skill_path / path); it never mutates the ranked hits and tolerates a stale pointer. Keeping the agent's context append-only (to protect the prompt cache) is then the caller's discipline — ir hands back additive payloads.
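
The "pure read, tolerant of stale pointers" contract can be illustrated like this. The hit shape (dicts with name and path), the payload fields, and the injectable read_text parameter are assumptions for the sketch, not ir's types.

```python
from pathlib import Path


def _read(path):
    return Path(path).read_text(encoding="utf-8")


def disclose_bodies(hits, read_text=_read):
    """Return new payload dicts for hits; the hits themselves are never modified.

    hits: list of dicts with "name" and "path" keys (hypothetical shape).
    read_text: injectable reader, so tests can avoid the filesystem.
    """
    payloads = []
    for hit in hits:
        try:
            body = read_text(hit["path"])
            payloads.append({"name": hit["name"], "body": body, "stale": False})
        except OSError:
            # Stale pointer: the file moved or was deleted since indexing.
            payloads.append({"name": hit["name"], "body": None, "stale": True})
    return payloads
```

Because the function only builds new payloads, a caller can append them to the agent's context without rewriting anything already there.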

Graph & traverse (opt-in)

Artifacts refer to each other — a package depends on packages, a skill has a parent. ir models those as a semantic link graph: a typed-edge links view on the store, populated at build time by an EdgeExtractor.

corpus = ir.build(source, edge_extractor=ir.default_edge_extractor)  # deps→REF, parent→PARENT
graph  = ir.CorpusGraph(corpus)
graph.neighbors("contaix", edge_type="REF")     # the package's dependencies

ir.traverse walks that structure at query time under a pluggable WalkPolicy (score frontier → select → expand → stop). Safety belongs to the operator, not the policy: a visited-set, depth cap, and node budget live in traverse itself, so even a cyclic graph combined with a never-stopping policy terminates. The shipped collapsed_tree_policy is pure-vector; it routes a query that matches an artifact's summary down to that artifact's best chunk:

hits = ir.traverse(query, corpus, policy=ir.collapsed_tree_policy())
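
The termination guarantee described above (visited-set, depth cap, node budget held by the walker, not the policy) can be sketched independently of ir. The bounded_walk function and its should_expand hook are hypothetical; ir's WalkPolicy interface is likely richer.

```python
from collections import deque


def bounded_walk(graph, seeds, should_expand, max_depth=2, max_nodes=50):
    """Breadth-first walk with walker-level safety limits.

    graph: {node: [neighbor, ...]}; should_expand(node, depth) is the policy hook.
    Returns [(node, depth), ...] in visit order.
    """
    visited = set()
    out = []
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(out) < max_nodes:       # node budget
        node, depth = queue.popleft()
        if node in visited:                     # visited-set breaks cycles
            continue
        visited.add(node)
        out.append((node, depth))
        if depth < max_depth and should_expand(node, depth):   # depth cap
            queue.extend((n, depth + 1) for n in graph.get(node, []))
    return out


cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}
never_stops = lambda node, depth: True
walked = bounded_walk(cyclic, ["a"], never_stops, max_depth=10)
```

Even with a policy that always asks to expand and a graph that loops back on itself, the walk visits each node once and returns.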

The summary a query routes on can be an artifact's own short field — or an LLM-authored synopsis. ir.with_synopsis wraps any indexing strategy to add one synopsis surface per artifact at build time (the document-summary-index pattern: build-time cost, ≈free at query time), and that synopsis becomes the collapsed-tree router:

strat  = ir.with_synopsis(ir.Chunked(), synthesize=my_summarizer)  # or default (lazy oa)
corpus = ir.build(ir.CorpusSource.from_mapping(docs, name="d", strategy=strat))
hits   = ir.traverse(q, corpus, policy=ir.collapsed_tree_policy())  # routes via the synopsis

synthesize is injectable (a test double or your own summarizer); omitted, it is built lazily on oa so import ir stays offline. Synopses are derived state with a stamped synthesizer identity, so a prompt/model change re-synthesizes only the affected artifacts on the next incremental build — no silent staleness.
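
One way to picture "derived state with a stamped synthesizer identity" is a cache keyed on both the content hash and the stamp. The sketch below is an assumption about the general pattern, not ir's storage format; stamp, refresh_synopses, and the store layout are hypothetical.

```python
import hashlib


def stamp(model: str, prompt: str) -> str:
    """Identity of the synthesizer: changes whenever the model or prompt changes."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()[:12]


def refresh_synopses(docs, store, synthesize, synth_stamp):
    """Re-synthesize only artifacts whose content or synthesizer stamp changed.

    docs:  {artifact_id: text}
    store: {artifact_id: {"content_hash", "stamp", "synopsis"}}, updated in place.
    Returns the ids that were re-synthesized on this pass.
    """
    redone = []
    for artifact_id, text in docs.items():
        content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = store.get(artifact_id)
        if cached and cached["content_hash"] == content_hash and cached["stamp"] == synth_stamp:
            continue  # still fresh: same content, same synthesizer
        store[artifact_id] = {
            "content_hash": content_hash,
            "stamp": synth_stamp,
            "synopsis": synthesize(text),
        }
        redone.append(artifact_id)
    return redone
```

Passing a trivial callable as synthesize (a test double) keeps the whole loop offline, which matches the injectable design described above.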

Flat top-k stays the default — traverse is opt-in, and a policy earns its keep only by beating flat+rerank on your eval set (a strong flat retriever wins simple lookup; graph methods cost far more). Results are ordinary SearchHits with additive metadata["walk_depth"] / ["seed"] provenance, so select / disclose compose unchanged. This is the semantic link graph (cyclic, query-time) — distinct from ef.artifact_graph (the acyclic build-time derivation DAG).

Evaluation

ir.eval scores discovery quality offline (reusing ef's retrieval metrics):

from ir import eval as ev

cases = ev.load_cases("skills_eval.jsonl")               # query + gold artifact_id(s)
ev.evaluate_discovery(corpus, cases, mode="hybrid")      # recall@k / NDCG@k / MRR / MAP + failure taxonomy
ev.evaluate_selection(corpus, cases, strategy="conservative")  # conditional commit rate + selection P/R/F1
ev.sweep_selector(corpus, cases)                         # tune max_k × rel; .best() / .frontier() / .table()
ev.distractor_robustness_curve(source.scope, probes)     # accuracy vs catalog size

evaluate_selection's headline is the conditional commit rate — the selection decision isolated from retrieval (did the selector keep the gold, given retrieval surfaced it?). sweep_selector scores a whole max_k × rel grid against the cases off one retrieval pass, so the selector defaults can be read off the data (.best()) rather than guessed. Generate cases by back-translation with ir.eval_gen (needs an LLM; scoring stays offline).
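
To make the metric definitions concrete, here is a small sketch of the conditional commit rate alongside recall@k and MRR. The case shape (dicts with gold, retrieved, committed) is an assumption for the example; ir.eval's own data structures and ef's metric functions are not shown.

```python
def conditional_commit_rate(cases):
    """Fraction of cases where the selector kept the gold, given retrieval surfaced it.

    cases: [{"gold": id, "retrieved": [ids], "committed": [ids]}, ...]
    Returns None when retrieval never surfaced a gold item.
    """
    eligible = [c for c in cases if c["gold"] in c["retrieved"]]
    if not eligible:
        return None
    kept = sum(c["gold"] in c["committed"] for c in eligible)
    return kept / len(eligible)


def recall_at_k(cases, k):
    """Fraction of cases whose gold item appears in the top k retrieved."""
    return sum(c["gold"] in c["retrieved"][:k] for c in cases) / len(cases)


def mrr(cases):
    """Mean reciprocal rank of the gold item (0 for cases where it is missing)."""
    total = 0.0
    for c in cases:
        if c["gold"] in c["retrieved"]:
            total += 1.0 / (c["retrieved"].index(c["gold"]) + 1)
    return total / len(cases)


cases = [
    {"gold": "a", "retrieved": ["a", "b", "c"], "committed": ["a"]},
    {"gold": "b", "retrieved": ["a", "b", "c"], "committed": ["a"]},       # selector dropped the gold
    {"gold": "z", "retrieved": ["a", "b", "c"], "committed": ["a"]},       # retrieval miss, not counted
    {"gold": "c", "retrieved": ["b", "c", "a"], "committed": ["b", "c"]},
]
```

The third case is a retrieval failure, so it lowers recall and MRR but is excluded from the conditional commit rate; that exclusion is what isolates the selector's decision from the retriever's.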

CLI

ir build skills                          # build/update a preset corpus
ir search skills "deploy the app"        # rank candidates (retrieval only)
ir discover skills "deploy the app"      # retrieve -> select
ir discover skills "deploy the app" --disclose       # + load bodies
ir discover skills "deploy the app" --min-score auto # + calibrated abstention
ir build sessions                        # index recent Claude Code sessions (turn pairs)
ir search sessions "numpy abi error" --mode lexical   # find past sessions
ir ls                                    # list corpora + record counts
ir info skills                           # config, stats, policy, calibrated floors
ir maintain --all                        # run due background work (idempotent; cron-friendly)
ir register notes files --root ~/notes --pattern '.*\.md$'  # register a custom corpus
ir rm notes                              # unregister (keeps built data)
ir eval-gen skills skills_eval.jsonl     # generate eval cases (needs oa/LLM)
ir eval skills skills_eval.jsonl         # score retrieval on a case file
ir eval-select skills skills_eval.jsonl  # score the selection stage
ir sweep-select skills skills_eval.jsonl # tune the selector (max_k × rel) on your corpus
ir calibrate-min-score skills skills_eval.jsonl --persist  # calibrate the abstention floor

Design

The design is grounded in a set of capability-discovery research reports and eval-run findings under misc/docs/ (ir_01–ir_08): the single-search-tool pattern, indexing & embedding strategy, evaluation, the ef + vd reuse analysis, the dense-vs-lexical-vs-hybrid eval, selector tuning, abstention-floor calibration, and magnitude-preserving fusion. ir is light by default (numpy / dol) and reuses the ecosystem (ef, vd, oa) only where it composes cleanly.
