RAG Probe

Smallest-scope RAG test harness + eval benchmark for WordPress. WP-CLI only — no widget, no admin UI, no sync, no multi-turn. Everything routes through one swappable seam: RAGProbe_Answer::answer(). The provider lives behind another seam: RAGProbe_Client.

Generation (answerer + judge) → WordPress 7.0 AI Client (wp_ai_client_prompt()). No key here: credentials are managed by WP's Connectors API.
Embeddings → pluggable backend (OpenAI or Voyage), switched by the ragprobe_embed_provider filter. Claude has no embeddings API, so embeddings always come from a separate provider — Claude-for-generation + Voyage-for- embeddings is the standard pairing.

Requirements

WordPress 7.0+ (or the wp-ai-client plugin on older versions) with a provider configured under Settings > Connectors (Anthropic / OpenAI / Google).
WP-CLI.

Install

Drop the rag-probe/ folder in wp-content/plugins/ and activate it (creates 4 tables).
Configure a generation provider under Settings > Connectors.
Set the embeddings key (used ONLY for embeddings; generation uses Connectors):
- OpenAI (default): wp ragprobe key sk-... --provider=openai
- Voyage: wp ragprobe key pa-... --provider=voyage, then switch the active embedder with one line in a mu-plugin or theme functions: add_filter('ragprobe_embed_provider', fn() => 'voyage');
- Models are filterable: ragprobe_openai_embed_model (default text-embedding-3-small), ragprobe_voyage_embed_model (default voyage-3.5).
- Switching embedder later means re-running wp ragprobe index (different model = different vectors; you can't mix them in one index).

Run sequence

wp ragprobe index --post_type=post,page        # chunk + embed your docs
wp ragprobe ask "how do I reset my password?"   # sanity check one answer
wp ragprobe import_gold sample-gold.csv          # load the gold set (your mined tickets)
wp ragprobe benchmark --label=baseline           # score the whole gold set
wp ragprobe compare                              # table of every run

The experiment matrix

Same gold set, same harness, one flag changes per run:

Question it answers	Command
Does RAG beat no retrieval?	`benchmark --mode=norag --label=control`
Does vector beat keyword?	`benchmark --retrieval=keyword --label=bm25`
Cheap vs frontier generation?	`benchmark --model=gpt-4o --label=frontier`
Does k matter?	`benchmark --k=10 --label=k10`

Then wp ragprobe compare puts them side by side. Replaces "I think" with a table.

What it measures

Retrieval: recall@k, MRR (deterministic; needs expected_doc_id in the gold CSV)
Answer: faithfulness /2, correctness /2 (LLM-as-judge — hand-label ~20 first to trust it)
Behaviour: refusal rate (refuse-by-default is built into the RAG prompt)
Ops: avg latency, run cost. Note: via the WP AI Client, token counts are char/4 ESTIMATES (the simple generate path doesn't surface provider usage), so cost is directional, not exact. Edit the placeholder price map in class-eval.php.

Model selection (cheap vs frontier)

Generation routes through the provider you configured in Connectors. --model= maps to ->using_model(); whether a plain string works depends on your provider plugin — omit it to use the configured default (wp-default). Add a matching key to the price map so the cost column populates for that model.

Gold CSV format

question, expected_answer, expected_doc_id (last column optional but needed for retrieval metrics). The expected_doc_id is the WP post ID of the doc that should answer it.

Scale assumption

Flat brute-force cosine in PHP. Fine to a few thousand chunks (your "few hundred docs" case). It is deliberately NOT an ANN index — running the external-vector-store variant through this same harness is how you prove you don't need one yet.

Scope — what's covered, what's deliberately out

This is a single-turn RAG technique selector: it covers the standard taxonomy as swappable axes, not "all of RAG" (no harness is — the field moves, and some items below are different paradigms, not toggles).

Covered axes: query transform (none/rewrite/HyDE/multi-query), chunking (fixed/structural/contextual), embedders (OpenAI/Voyage/Cohere/Jina/Google), retrieval (vector/keyword/hybrid/rerank) + MMR diversity + parent-context window, k, prompt variant, generation model, no-RAG control. Metrics: recall@k, precision@k, MRR, nDCG@k, faithfulness, correctness, refusal, latency, estimated cost.

Deliberately OUT (with reasons):

ANN / external vector DB — unnecessary at a few-hundred-doc scale (flat cosine is fine).
Metadata filtering, incremental/freshness indexing — production concerns, not technique evaluation.
Multi-turn / conversational retrieval — different harness; this is your SaaS milestone.
GraphRAG, agentic multi-hop retrieval — different paradigms / a tier up, wrong fit for deflection.
Fine-tuning — not retrieval; the alternative to it.

These are intentional boundaries, not gaps to "fix". Add multi-turn when you move from probe to product.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
includes		includes
.gitignore		.gitignore
README.md		README.md
RUNBOOK.md		RUNBOOK.md
dump-content.php		dump-content.php
import-prospect.php		import-prospect.php
rag-probe.php		rag-probe.php
sample-gold.csv		sample-gold.csv
seed-benchmark.php		seed-benchmark.php
seed-multihop-gold.php		seed-multihop-gold.php

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG Probe

Requirements

Install

Run sequence

The experiment matrix

What it measures

Model selection (cheap vs frontier)

Gold CSV format

Scale assumption

Scope — what's covered, what's deliberately out

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RAG Probe

Requirements

Install

Run sequence

The experiment matrix

What it measures

Model selection (cheap vs frontier)

Gold CSV format

Scale assumption

Scope — what's covered, what's deliberately out

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages