Ref34t/rag-probe

RAG Probe

Smallest-scope RAG test harness + eval benchmark for WordPress. WP-CLI only — no widget, no admin UI, no sync, no multi-turn. Everything routes through one swappable seam: RAGProbe_Answer::answer(). The provider lives behind another seam: RAGProbe_Client.

  • Generation (answerer + judge) → WordPress 7.0 AI Client (wp_ai_client_prompt()). No key here: credentials are managed by WP's Connectors API.
  • Embeddings → pluggable backend (OpenAI or Voyage), switched by the ragprobe_embed_provider filter. Claude has no embeddings API, so embeddings always come from a separate provider; Claude for generation plus Voyage for embeddings is the standard pairing.

Requirements

  • WordPress 7.0+ (or the wp-ai-client plugin on older versions) with a provider configured under Settings > Connectors (Anthropic / OpenAI / Google).
  • WP-CLI.

Install

  1. Drop the rag-probe/ folder in wp-content/plugins/ and activate it (creates 4 tables).
  2. Configure a generation provider under Settings > Connectors.
  3. Set the embeddings key (used ONLY for embeddings; generation uses Connectors):
    • OpenAI (default): wp ragprobe key sk-... --provider=openai
    • Voyage: wp ragprobe key pa-... --provider=voyage, then switch the active embedder with one line in a mu-plugin or theme functions: add_filter('ragprobe_embed_provider', fn() => 'voyage');
    • Models are filterable: ragprobe_openai_embed_model (default text-embedding-3-small), ragprobe_voyage_embed_model (default voyage-3.5).
    • Switching embedder later means re-running wp ragprobe index (different model = different vectors; you can't mix them in one index).

Run sequence

wp ragprobe index --post_type=post,page        # chunk + embed your docs
wp ragprobe ask "how do I reset my password?"   # sanity check one answer
wp ragprobe import_gold sample-gold.csv          # load the gold set (your mined tickets)
wp ragprobe benchmark --label=baseline           # score the whole gold set
wp ragprobe compare                              # table of every run

The experiment matrix

Same gold set, same harness, one flag changes per run:

Question it answers           | Command
Does RAG beat no retrieval?   | benchmark --mode=norag --label=control
Does vector beat keyword?     | benchmark --retrieval=keyword --label=bm25
Cheap vs frontier generation? | benchmark --model=gpt-4o --label=frontier
Does k matter?                | benchmark --k=10 --label=k10

Then wp ragprobe compare puts them side by side. Replaces "I think" with a table.

What it measures

  • Retrieval: recall@k, MRR (deterministic; needs expected_doc_id in the gold CSV)
  • Answer: faithfulness /2, correctness /2 (LLM-as-judge — hand-label ~20 first to trust it)
  • Behaviour: refusal rate (refuse-by-default is built into the RAG prompt)
  • Ops: avg latency, run cost. Note: via the WP AI Client, token counts are char/4 ESTIMATES (the simple generate path doesn't surface provider usage), so cost is directional, not exact. Edit the placeholder price map in class-eval.php.
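
For readers who want to sanity-check the retrieval numbers by hand, here is a minimal Python sketch of recall@k and MRR for a gold set with one expected_doc_id per question. It is an illustration of the standard definitions, not the plugin's PHP code; the function names and the (retrieved_ids, expected_id) row shape are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(retrieved_ids: Sequence[int], expected_id: int, k: int) -> float:
    """1.0 if the expected doc is among the top k results, else 0.0.

    With a single expected doc per question, recall@k is a hit rate.
    """
    return 1.0 if expected_id in retrieved_ids[:k] else 0.0


def reciprocal_rank(retrieved_ids: Sequence[int], expected_id: int) -> float:
    """1 / rank of the first hit (ranks start at 1), or 0.0 if never retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == expected_id:
            return 1.0 / rank
    return 0.0


def score_run(rows: List[Tuple[Sequence[int], int]], k: int) -> Dict[str, float]:
    """Average both metrics over (retrieved_ids, expected_id) pairs."""
    if not rows:
        return {"recall_at_k": 0.0, "mrr": 0.0}
    n = len(rows)
    return {
        "recall_at_k": sum(recall_at_k(r, e, k) for r, e in rows) / n,
        "mrr": sum(reciprocal_rank(r, e) for r, e in rows) / n,
    }


# Hypothetical post IDs, for illustration only.
example_rows = [([5, 3, 9], 3), ([1, 2, 4], 7), ([8, 6], 8)]
example_scores = score_run(example_rows, k=2)
```

Both metrics are deterministic given the ranked list, which is why they need no judge model.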

Model selection (cheap vs frontier)

Generation routes through the provider you configured in Connectors. --model= maps to ->using_model(); whether a plain string works depends on your provider plugin — omit it to use the configured default (wp-default). Add a matching key to the price map so the cost column populates for that model.
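
To make the "directional, not exact" cost concrete, the sketch below shows a char/4 token estimate combined with a per-model price map, in Python. The model keys, prices, and function names are hypothetical placeholders; the real map lives in class-eval.php and its structure may differ.

```python
import math
from typing import Dict, Optional, Tuple

# HYPOTHETICAL prices in USD per million tokens: (input, output).
# These are placeholders, not real provider pricing.
PRICE_PER_MTOK: Dict[str, Tuple[float, float]] = {
    "example-cheap": (0.50, 1.50),
    "example-frontier": (5.00, 15.00),
}


def estimate_tokens(text: str) -> int:
    """Rough token count: about one token per 4 characters."""
    return math.ceil(len(text) / 4)


def estimate_cost(model: str, prompt: str, completion: str) -> Optional[float]:
    """Estimated USD cost of one call, or None if the model has no price entry."""
    prices = PRICE_PER_MTOK.get(model)
    if prices is None:
        return None  # no matching key, so the cost column stays empty
    input_price, output_price = prices
    return (
        estimate_tokens(prompt) / 1_000_000 * input_price
        + estimate_tokens(completion) / 1_000_000 * output_price
    )
```

Because the token counts are estimates, treat the result as a way to rank runs by cost rather than as a billing figure.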

Gold CSV format

Columns, in order: question, expected_answer, expected_doc_id. The last column is optional but required for the retrieval metrics (recall@k, MRR). expected_doc_id is the WP post ID of the doc that should answer the question.
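
A quick way to validate a gold file before importing it is to parse it outside WordPress. The Python sketch below assumes the file has a header row with the three column names; that is an assumption for the example, so check it against sample-gold.csv. The sample rows and the GoldRow type are invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional, TextIO


@dataclass
class GoldRow:
    question: str
    expected_answer: str
    expected_doc_id: Optional[int]  # None when the optional column is empty or absent


def load_gold(handle: TextIO) -> List[GoldRow]:
    """Parse gold rows from an open CSV file (ASSUMES a header row)."""
    rows = []
    for rec in csv.DictReader(handle, skipinitialspace=True):
        raw_id = (rec.get("expected_doc_id") or "").strip()
        rows.append(
            GoldRow(
                question=rec["question"].strip(),
                expected_answer=rec["expected_answer"].strip(),
                expected_doc_id=int(raw_id) if raw_id else None,
            )
        )
    return rows


# Invented sample data; the second row omits expected_doc_id.
SAMPLE = """question,expected_answer,expected_doc_id
How do I reset my password?,Use the reset link on the login screen.,42
Do you offer phone support?,No. Support is by email only.
"""

gold = load_gold(io.StringIO(SAMPLE))
```

Rows with no expected_doc_id can still be scored for answer quality, but they contribute nothing to the retrieval metrics.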

Scale assumption

Flat brute-force cosine in PHP. Fine to a few thousand chunks (your "few hundred docs" case). It is deliberately NOT an ANN index — running the external-vector-store variant through this same harness is how you prove you don't need one yet.
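
The idea of a flat scan is small enough to show in full. This Python sketch scores every chunk against the query and keeps the best k; it illustrates the approach only, and the names and the (chunk_id, vector) layout are assumptions rather than the plugin's PHP implementation.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query (O(n) per query), best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors; real embeddings have hundreds of dimensions or more.
toy_chunks = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
toy_hits = top_k([1.0, 0.1], toy_chunks, k=2)
```

Every query touches every chunk, so the cost grows linearly with index size. That is the trade-off an ANN index exists to avoid, and it stays cheap at a few thousand chunks.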

Scope — what's covered, what's deliberately out

This is a single-turn RAG technique selector: it covers the standard taxonomy as swappable axes, not "all of RAG" (no harness is — the field moves, and some items below are different paradigms, not toggles).

Covered axes: query transform (none/rewrite/HyDE/multi-query), chunking (fixed/structural/contextual), embedders (OpenAI/Voyage/Cohere/Jina/Google), retrieval (vector/keyword/hybrid/rerank) + MMR diversity + parent-context window, k, prompt variant, generation model, no-RAG control. Metrics: recall@k, precision@k, MRR, nDCG@k, faithfulness, correctness, refusal, latency, estimated cost.
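
Of the axes above, MMR diversity is the least self-explanatory, so here is a minimal Python sketch of the textbook greedy form: each step picks the candidate that maximizes lam * relevance minus (1 - lam) * similarity to what is already selected. The lam parameter, its default, and all names are assumptions for illustration; the plugin's own MMR may be parameterized differently.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(
    query_vec: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
    lam: float = 0.7,  # hypothetical default: 1.0 = pure relevance, 0.0 = pure diversity
) -> List[str]:
    """Greedy maximal marginal relevance over candidate chunk vectors."""
    relevance = {cid: cosine(query_vec, vec) for cid, vec in vectors.items()}
    remaining = list(vectors)
    selected: List[str] = []
    while remaining and len(selected) < k:

        def mmr_score(cid: str) -> float:
            # Redundancy = similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(vectors[cid], vectors[s]) for s in selected), default=0.0
            )
            return lam * relevance[cid] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: "a" and "a2" are near-duplicates, "b" covers a different direction.
toy = {"a": [1.0, 0.0], "a2": [0.99, 0.1], "b": [0.6, 0.8]}
diverse = mmr([1.0, 0.2], toy, k=2, lam=0.5)
greedy = mmr([1.0, 0.2], toy, k=2, lam=1.0)
```

With lam=1.0 the two near-duplicates both make the cut; lowering lam swaps the second one for the chunk that adds new information.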

Deliberately OUT (with reasons):

  • ANN / external vector DB — unnecessary at a few-hundred-doc scale (flat cosine is fine).
  • Metadata filtering, incremental/freshness indexing — production concerns, not technique evaluation.
  • Multi-turn / conversational retrieval — different harness; this is your SaaS milestone.
  • GraphRAG, agentic multi-hop retrieval — different paradigms / a tier up, wrong fit for deflection.
  • Fine-tuning — not retrieval; the alternative to it.

These are intentional boundaries, not gaps to "fix". Add multi-turn when you move from probe to product.
