Smallest-scope RAG test harness + eval benchmark for WordPress. WP-CLI only — no
widget, no admin UI, no sync, no multi-turn. Everything routes through one
swappable seam: RAGProbe_Answer::answer(). The provider lives behind another
seam: RAGProbe_Client.
- Generation (answerer + judge) → WordPress 7.0 AI Client (`wp_ai_client_prompt()`). No key here: credentials are managed by WP's Connectors API.
- Embeddings → pluggable backend (OpenAI or Voyage), switched by the `ragprobe_embed_provider` filter. Claude has no embeddings API, so embeddings always come from a separate provider; Claude for generation plus Voyage for embeddings is the standard pairing.
- WordPress 7.0+ (or the `wp-ai-client` plugin on older versions) with a provider configured under Settings > Connectors (Anthropic / OpenAI / Google).
- WP-CLI.
- Drop the `rag-probe/` folder in `wp-content/plugins/` and activate it (creates 4 tables).
- Configure a generation provider under Settings > Connectors.
- Set the embeddings key (used ONLY for embeddings; generation uses Connectors):
  - OpenAI (default): `wp ragprobe key sk-... --provider=openai`
  - Voyage: `wp ragprobe key pa-... --provider=voyage`, then switch the active embedder with one line in a mu-plugin or theme functions: `add_filter('ragprobe_embed_provider', fn() => 'voyage');`
  - Models are filterable: `ragprobe_openai_embed_model` (default `text-embedding-3-small`), `ragprobe_voyage_embed_model` (default `voyage-3.5`).
  - Switching embedder later means re-running `wp ragprobe index` (different model = different vectors; you can't mix them in one index).
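
For reference, the embedder switch can live in a small mu-plugin file. This is a minimal sketch that uses only the filter names documented above; the file name is hypothetical, and the snippet only runs inside WordPress (it relies on `add_filter()` being loaded).

```php
<?php
// Hypothetical file: wp-content/mu-plugins/ragprobe-embedder.php

// Route embeddings through Voyage instead of the default OpenAI backend.
add_filter( 'ragprobe_embed_provider', fn() => 'voyage' );

// Optionally pin the Voyage model. 'voyage-3.5' is the documented default;
// replace it with the model you want to test.
add_filter( 'ragprobe_voyage_embed_model', fn() => 'voyage-3.5' );
```

After changing either filter, re-run `wp ragprobe index` so every stored vector comes from the same model.
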
wp ragprobe index --post_type=post,page # chunk + embed your docs
wp ragprobe ask "how do I reset my password?" # sanity check one answer
wp ragprobe import_gold sample-gold.csv # load the gold set (your mined tickets)
wp ragprobe benchmark --label=baseline # score the whole gold set
wp ragprobe compare # table of every run
Same gold set, same harness, one flag changes per run:
| Question it answers | Command |
|---|---|
| Does RAG beat no retrieval? | benchmark --mode=norag --label=control |
| Does vector beat keyword? | benchmark --retrieval=keyword --label=bm25 |
| Cheap vs frontier generation? | benchmark --model=gpt-4o --label=frontier |
| Does k matter? | benchmark --k=10 --label=k10 |
Then wp ragprobe compare puts them side by side. Replaces "I think" with a table.
- Retrieval: recall@k, MRR (deterministic; needs `expected_doc_id` in the gold CSV)
- Answer: faithfulness /2, correctness /2 (LLM-as-judge; hand-label ~20 first to trust it)
- Behaviour: refusal rate (refuse-by-default is built into the RAG prompt)
- Ops: avg latency, run cost. Note: via the WP AI Client, token counts are char/4 ESTIMATES (the simple generate path doesn't surface provider usage), so cost is directional, not exact. Edit the placeholder price map in class-eval.php.
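
To make the two retrieval metrics concrete, here is a minimal sketch of how recall@k and MRR can be computed when each gold question has exactly one `expected_doc_id`. The function names and the `$cases` array shape are assumptions for illustration, not the plugin's actual code.

```php
<?php
// Hypothetical helpers. Each case pairs the ranked post IDs a retriever
// returned with the single expected_doc_id from the gold set.

function sketch_recall_at_k(array $cases, int $k): float {
    if ($cases === []) {
        return 0.0;
    }
    $hits = 0;
    foreach ($cases as $case) {
        // One relevant doc per question, so each case scores 1 or 0.
        if (in_array($case['expected'], array_slice($case['ranked'], 0, $k), true)) {
            $hits++;
        }
    }
    return $hits / count($cases);
}

function sketch_mrr(array $cases): float {
    if ($cases === []) {
        return 0.0;
    }
    $sum = 0.0;
    foreach ($cases as $case) {
        // array_search returns a 0-based index, or false when the doc is absent.
        $pos = array_search($case['expected'], $case['ranked'], true);
        if ($pos !== false) {
            $sum += 1.0 / ($pos + 1);
        }
    }
    return $sum / count($cases);
}

$cases = [
    ['expected' => 42, 'ranked' => [42, 7, 13]], // hit at rank 1
    ['expected' => 7,  'ranked' => [13, 42, 7]], // hit at rank 3
    ['expected' => 99, 'ranked' => [13, 42, 7]], // miss
];

printf("recall@2 = %.3f, MRR = %.3f\n", sketch_recall_at_k($cases, 2), sketch_mrr($cases));
// prints: recall@2 = 0.333, MRR = 0.444
```

Both numbers depend only on the ranked IDs and the gold label, which is why they are deterministic and need no judge model.
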
Generation routes through the provider you configured in Connectors. --model=
maps to ->using_model(); whether a plain string works depends on your provider
plugin — omit it to use the configured default (wp-default). Add a matching key
to the price map so the cost column populates for that model.
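
The cost column can be pictured as follows. This is a sketch under stated assumptions: the price map shape (USD per 1M tokens, split into input and output), the rounding, and the helper names are all hypothetical, and the prices are placeholders rather than real rates. The actual map in class-eval.php may look different.

```php
<?php
// Hypothetical price map: USD per 1M tokens. Placeholder values only.
$price_map = [
    'wp-default' => ['in' => 1.00, 'out' => 2.00],
];

function sketch_estimate_tokens(string $text): int {
    // char/4 heuristic. strlen() counts bytes, so multibyte text is
    // over-counted; this is an estimate either way.
    return (int) ceil(strlen($text) / 4);
}

function sketch_estimate_cost(string $prompt, string $answer, string $model, array $price_map): ?float {
    if (!isset($price_map[$model])) {
        return null; // no matching key, so the cost stays empty
    }
    $in  = sketch_estimate_tokens($prompt);
    $out = sketch_estimate_tokens($answer);
    return ($in * $price_map[$model]['in'] + $out * $price_map[$model]['out']) / 1e6;
}

$prompt = str_repeat('a', 4000); // 4000 chars, about 1000 tokens
$answer = str_repeat('b', 400);  // 400 chars, about 100 tokens

$cost = sketch_estimate_cost($prompt, $answer, 'wp-default', $price_map);
printf('estimated cost: $%.4f' . PHP_EOL, $cost);
// prints: estimated cost: $0.0012
```

A model with no entry in the map returns `null` here, which mirrors why a new `--model=` value needs a matching key before its cost shows up.
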
Gold CSV columns: question, expected_answer, expected_doc_id (the last column is optional but needed for retrieval metrics).
The expected_doc_id is the WP post ID of the doc that should answer it.
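
As an illustration of that layout, the sketch below parses a tiny gold set and treats a blank `expected_doc_id` as "no retrieval label". The rows are made up, and the presence of a header row is an assumption; check sample-gold.csv for the exact shape the importer expects.

```php
<?php
// Hypothetical rows; assumes a header row and comma-separated values.
$csv = <<<CSV
question,expected_answer,expected_doc_id
"How do I reset my password?","Use the Lost your password? link on the login screen.",42
"Can I export my data?","Yes, from Tools > Export.",
CSV;

$lines  = explode("\n", trim($csv));
// Explicit delimiter, enclosure, and empty escape character.
$header = str_getcsv(array_shift($lines), ',', '"', '');

$gold = [];
foreach ($lines as $line) {
    // Pad short rows so a missing third column becomes an empty string.
    $row  = array_pad(str_getcsv($line, ',', '"', ''), count($header), '');
    $item = array_combine($header, $row);

    // Empty expected_doc_id: the row can still be judged for answer quality,
    // but it cannot contribute to recall@k or MRR.
    $item['expected_doc_id'] = $item['expected_doc_id'] === ''
        ? null
        : (int) $item['expected_doc_id'];

    $gold[] = $item;
}

print_r($gold);
```
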
Flat brute-force cosine in PHP. Fine up to a few thousand chunks (your "few hundred docs" case). It is deliberately NOT an ANN index: running the external-vector-store variant through this same harness is how you prove you don't need one yet.
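
A flat cosine search is small enough to show in full. The following is a minimal sketch of the technique, assuming chunk vectors are held in memory as plain PHP arrays keyed by chunk ID; the function names are hypothetical and this is not the plugin's own implementation.

```php
<?php
// Hypothetical helpers illustrating brute-force cosine retrieval.

function sketch_cosine(array $a, array $b): float {
    $dot = 0.0;
    $na  = 0.0;
    $nb  = 0.0;
    foreach ($a as $i => $x) {
        $y    = $b[$i];
        $dot += $x * $y;
        $na  += $x * $x;
        $nb  += $y * $y;
    }
    if ($na == 0.0 || $nb == 0.0) {
        return 0.0; // a zero vector has no direction
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

function sketch_top_k(array $query, array $chunks, int $k): array {
    $scores = [];
    // Score every chunk: O(chunks * dimensions), no index structure.
    foreach ($chunks as $id => $vector) {
        $scores[$id] = sketch_cosine($query, $vector);
    }
    arsort($scores); // highest similarity first, keys preserved
    return array_slice($scores, 0, $k, true);
}

$chunks = [
    101 => [1.0, 0.0],
    102 => [0.0, 1.0],
    103 => [0.7, 0.7],
];

$top = sketch_top_k([1.0, 0.1], $chunks, 2);
print_r(array_keys($top)); // chunk IDs 101 and 103, in that order
```

Every query touches every vector, so the cost grows linearly with the number of chunks. That is the trade-off the paragraph above accepts at this scale.
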
This is a single-turn RAG technique selector: it covers the standard taxonomy as swappable axes, not "all of RAG" (no harness is — the field moves, and some items below are different paradigms, not toggles).
Covered axes: query transform (none/rewrite/HyDE/multi-query), chunking (fixed/structural/contextual), embedders (OpenAI/Voyage/Cohere/Jina/Google), retrieval (vector/keyword/hybrid/rerank) + MMR diversity + parent-context window, k, prompt variant, generation model, no-RAG control. Metrics: recall@k, precision@k, MRR, nDCG@k, faithfulness, correctness, refusal, latency, estimated cost.
Deliberately OUT (with reasons):
- ANN / external vector DB — unnecessary at a few-hundred-doc scale (flat cosine is fine).
- Metadata filtering, incremental/freshness indexing — production concerns, not technique evaluation.
- Multi-turn / conversational retrieval — different harness; this is your SaaS milestone.
- GraphRAG, agentic multi-hop retrieval — different paradigms / a tier up, wrong fit for deflection.
- Fine-tuning — not retrieval; the alternative to it.
These are intentional boundaries, not gaps to "fix". Add multi-turn when you move from probe to product.