A lightweight research framework for systematically comparing agentic coding setups — models, agent architectures, and prompt strategies — using reproducible benchmarks.
When working with agentic coding, many practical questions arise: Does a setup with dedicated sub-agents for testing produce cleaner code than a single agent doing everything? Does Sonnet deliver better code quality than Opus on a refactoring task? How do different prompt styles affect the outcome? This framework was built to answer such questions with data instead of gut feeling.
The framework runs controlled experiments across four dimensions:
- Workflow variants — from simple one-shot generation ("vibe coding") to structured TDD with specialized agents per phase
- Model configurations — Opus 4.7, Sonnet 4.6, Haiku 4.5, with and without extended thinking
- Coding katas — standardized tasks, known (game-of-life, mars-rover) and novel (claim-office, claim-office-lite); see Katas
- Prompt styles — prose, example-mapping, user-story
Each experiment produces measurable metrics: token usage, code complexity, code smells, test coverage, TDD cycle discipline, and more.
Aggregation is research-question-driven: A research question (RQ) defines a selector query (factors × controls × outcomes) over experiments/runs/, and tooling generates runs.csv + summary.md on demand. Batch plans are pure data-collection helpers that fill missing replicates.
Earlier iterations of this lab were batch-driven: each batch produced its own results folder, and a finding was tied to "the batch it came from". That coupling made it hard to evolve hypotheses (every new analysis required a new batch) and hard to combine evidence across batches.
The current model decouples three concerns:
- A research question (RQ) declaratively describes what is being studied — its factors, controls, outcomes, and required replicates.
- Runs are produced once and live in a single flat pool (experiments/runs/), independent of any RQ or batch.
- Aggregations are derived on demand by selecting all runs matching an RQ's frontmatter, across all batches that ever produced matching cells.
         +-----------------------------+
         | research/{questions,        |
         |  workflow-dev}/<chapter>-*/ |
         | README.md (frontmatter)     |
         +---------+-------------------+
                   |
                   | factors × controls
                   v
+----------------+  +----------------+  +----------------+
| aggregate-by-  |  | batch-plan-    |  | findings.md    |
| query.py       |  | from-rq.py     |  | (curated)      |
+-------+--------+  +--------+-------+  +----------------+
        |                    |
        | matches            | missing cells
        v                    v
+-----------------+    +------------------------+
| experiments/    |    | experiments/           |
| runs/ (pool)    |<---| batch-plans/<rq>-fill  |
+-----------------+    +------------------------+
                       executed by docker/batch.sh
RQ directories live in four subtrees: research/questions-claude/ (Claude Code RQs), research/questions-opencode/ (OpenCode RQs), research/questions-cross/ (cross-harness RQs), and research/workflow-dev/ (workflow-evolution chain). Each directory carries a <chapter>-<slug> name (e.g. 2.6-lean-validation), where the chapter number is an ordering label, not an ID: like a document section heading, it can be renumbered freely with git mv. The stable identity is the frontmatter id: field; tooling resolves an RQ by that id across all subtrees, never by directory name.
Each README.md starts with YAML frontmatter that acts as a selector query over the run pool:
---
id: RQ-prompt-correctness
question: "Does workflow choice affect code quality, correctness, TDD discipline?"
factors:                         # what varies
  workflow_x_prompt:
    - {workflow: v1-oneshot, prompt: prose}
    - {workflow: v4-exact-subagents, prompt: example-mapping}
controls:                        # what is held constant
  kata_base: game-of-life
  model: opus-4-7-no-thinking
outcomes: [tests_passing, code_mass, smell_total, cc_longest_function, ...]
min_replicates: 3
status: active
---
The cartesian product of factors × controls produces cells: each cell is one (kata, workflow, model) combination that needs min_replicates runs.
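The cell expansion can be sketched in Python. This is a minimal illustration under stated assumptions, not the logic of aggregate-by-query.py: the helper name expand_cells and the dictionary shapes are hypothetical, and a paired factor such as workflow_x_prompt is treated as one factor whose levels set several keys at once.

```python
from itertools import product

def expand_cells(factors: dict, controls: dict) -> list[dict]:
    """Expand factors x controls into one dict per cell (hypothetical helper)."""
    names = list(factors)
    cells = []
    # Each factor contributes a list of levels; a paired factor contributes
    # dicts that set several keys (workflow and prompt) in one step.
    for combo in product(*(factors[name] for name in names)):
        cell = dict(controls)  # controls are identical in every cell
        for name, level in zip(names, combo):
            if isinstance(level, dict):
                cell.update(level)
            else:
                cell[name] = level
        cells.append(cell)
    return cells

factors = {
    "workflow_x_prompt": [
        {"workflow": "v1-oneshot", "prompt": "prose"},
        {"workflow": "v4-exact-subagents", "prompt": "example-mapping"},
    ]
}
controls = {"kata_base": "game-of-life", "model": "opus-4-7-no-thinking"}

cells = expand_cells(factors, controls)
for cell in cells:
    print(cell["kata_base"], cell["workflow"], cell["prompt"], cell["model"])
```

For the example frontmatter this yields two cells, so min_replicates: 3 asks for six runs in total.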
Terminology:
- Factor: a variable that is deliberately varied across runs. The effect of a factor is what the RQ measures.
- Control: a variable that is held constant so it cannot confound the factor effect. A finding under model: opus-4-7-no-thinking is a finding for that model; transfer to other models is an open question, not an established result.
- Outcome: a metric that is observed per run.
Mixing values into a control (e.g. running additional opus-4-6 replicates into an RQ controlled on opus-4-7) collapses the control into an uncontrolled factor and invalidates the comparison. To study a different model variant, open a new RQ — either with model as a factor (model-comparison RQ) or with the new value pinned as the control (separate finding, scoped to that model).
- aggregate-by-query.py reads the frontmatter, selects every matching run from experiments/runs/, and writes runs.csv (raw data) + summary.md (pivots per outcome) into the RQ directory. It can be re-run the moment new replicates land; no plan editing is required.
- batch-plan-from-rq.py reads the same frontmatter, counts existing matches per cell, and emits experiments/batch-plans/<rq>-fill.json containing exactly the missing (kata, workflow, model) triples needed to reach min_replicates. Idempotent: if everything is already covered, the plan is empty.
Together: declare the question once → fill the gaps → re-aggregate.
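The gap-filling step can be sketched as follows. This is a simplified illustration, not the code of batch-plan-from-rq.py: the helper name missing_replicates is hypothetical, and cells and runs are reduced to plain (kata, workflow, model) tuples.

```python
from collections import Counter

Cell = tuple[str, str, str]  # (kata, workflow, model)

def missing_replicates(cells: list[Cell], runs: list[Cell], min_replicates: int) -> list[Cell]:
    """Return one plan entry per missing replicate (hypothetical helper)."""
    have = Counter(runs)  # existing matches per cell
    plan = []
    for cell in cells:
        plan.extend([cell] * max(0, min_replicates - have[cell]))
    return plan

cells = [
    ("game-of-life-prose", "v1-oneshot", "opus-4-7-no-thinking"),
    ("game-of-life-example-mapping", "v4-exact-subagents", "opus-4-7-no-thinking"),
]
# Pool contents: three runs of the first cell, one run of the second.
runs = [cells[0], cells[0], cells[0], cells[1]]

plan = missing_replicates(cells, runs, min_replicates=3)
print(len(plan))  # 2
```

Running the function again after the planned runs have landed returns an empty list, which mirrors the idempotency property described above.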
---
id: RQ-<slug>                    # stable identity, e.g. RQ-prompt-correctness
question: "Full text of the research question"
factors:                         # what varies
  <factor-name>: [<value>, ...]
  # OR for paired factors:
  workflow_x_prompt:
    - {workflow: v1-oneshot, prompt: prose}
    - ...
controls:                        # what is held constant
  kata_base: game-of-life        # kata base name without prompt suffix
  workflow: v4-exact-subagents   # only if no workflow_x_prompt factor
  prompt: example-mapping        # only if no prompt factor / pairing
  model: <lab-variant-id>        # e.g. opus-4-7-no-thinking (see model alias table)
outcomes: [<metric>, ...]        # which metrics are measured
min_replicates: N                # per cell
status: active | partial | closed
---
Selector resolution: The selector query forms the effective kata ID as <kata_base>-<prompt>. prompt comes from controls.prompt, the workflow_x_prompt pairing, or factors.prompt.
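As a small sketch of that resolution rule: the helper name effective_kata_id is hypothetical, and it assumes the cell dict already carries the prompt whenever it comes from a workflow_x_prompt pairing or from factors.prompt, with controls.prompt as the remaining source.

```python
def effective_kata_id(controls: dict, cell: dict) -> str:
    """Build <kata_base>-<prompt> for one cell (hypothetical helper)."""
    # Prompt from the pairing or the prompt factor lives on the cell;
    # otherwise the pinned controls.prompt applies.
    prompt = cell.get("prompt", controls.get("prompt"))
    if prompt is None:
        raise ValueError("no prompt in controls, pairing, or factors")
    return f"{controls['kata_base']}-{prompt}"

print(effective_kata_id({"kata_base": "game-of-life", "prompt": "example-mapping"}, {}))
print(effective_kata_id({"kata_base": "game-of-life"}, {"workflow": "v1-oneshot", "prompt": "prose"}))
```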
OR-match on controls.model: A control normally pins one exact value. controls.model is the one exception that accepts an explicit OR-match:
controls:
  model:
    any:                         # OR-match across providers/routings
      - opus-4-7-portkey-no-thinking
      - opus-4-7-no-thinking
All listed values count toward the same cell during aggregation, and the first entry is the canonical value used for new fill-runs (batch-plan-from-rq.py) and cell labelling in summary.md. The real per-run model stays in runs.csv under the model column for debugging; the cell-grouping value is the new cell_model column.
Intended use: combine routing variants of the same underlying model (e.g. Portkey-routed and Direct-API runs of opus-4-7-no-thinking) when routing is assumed not to affect the outcome under study. Not for combining different models — use factors.model instead, otherwise you collapse a real factor into a hidden uncontrolled variable.
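The grouping behaviour of the OR-match can be illustrated with a short sketch. The function name cell_model_for is hypothetical, and the real aggregator may resolve the control differently; the sketch only shows how a per-run model maps to the canonical cell_model value.

```python
from typing import Optional, Union

def cell_model_for(run_model: str, model_control: Union[str, dict]) -> Optional[str]:
    """Map a run's model to its cell-grouping value, or None if it does not match."""
    if isinstance(model_control, dict) and "any" in model_control:
        accepted = model_control["any"]   # explicit OR-match
    else:
        accepted = [model_control]        # ordinary control: one exact value
    # The first entry is the canonical value used for labelling and fill-runs.
    return accepted[0] if run_model in accepted else None

control = {"any": ["opus-4-7-portkey-no-thinking", "opus-4-7-no-thinking"]}
print(cell_model_for("opus-4-7-no-thinking", control))  # opus-4-7-portkey-no-thinking
print(cell_model_for("opus-4-6", control))              # None
```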
outcomes in the frontmatter are CSV column names from runs.csv (see CSV_COLUMNS in experiments/aggregate-by-query.py). aggregate-by-query.py picks the pivot type automatically:
| Value type / naming | Pivot form |
|---|---|
| Boolean | rate_% (share of true) |
| Numeric | mean / min / max / std over the cell |
| Suffix <X>_correct_rate | pooled rate from <X>_correct and <X>_total: Σ/Σ × 100 |
Pooled rate: Used for success rates with numerator/denominator per run, e.g. predictions_correct_rate → Σpredictions_correct / Σpredictions_total. Preferred over the mean of per-run rates because runs with small denominators would otherwise be over-weighted.
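A worked example makes the difference concrete. The two runs below are invented for illustration, and the helper names are hypothetical; skipping runs with a zero denominator is an assumption, not documented behaviour.

```python
runs = [
    {"predictions_correct": 1, "predictions_total": 1},    # 100 % on one prediction
    {"predictions_correct": 10, "predictions_total": 20},  # 50 % on twenty predictions
]

def pooled_rate(runs, stem):
    """Sum numerators and denominators over the cell, then divide once."""
    correct = sum(r[f"{stem}_correct"] for r in runs)
    total = sum(r[f"{stem}_total"] for r in runs)
    return 100 * correct / total if total else None

def mean_of_rates(runs, stem):
    """Average the per-run percentages; every run gets equal weight."""
    rates = [100 * r[f"{stem}_correct"] / r[f"{stem}_total"]
             for r in runs if r[f"{stem}_total"]]
    return sum(rates) / len(rates) if rates else None

print(round(pooled_rate(runs, "predictions"), 1))    # 52.4
print(round(mean_of_rates(runs, "predictions"), 1))  # 75.0
```

The single-prediction run pulls the mean of rates up to 75 %, while the pooled rate stays at 11/21, close to the behaviour of the run that carries most of the evidence.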
These rules apply lab-wide and are respected by every RQ.
For methodological symmetry:
| Workflow | Permitted prompt styles | Rationale |
|---|---|---|
| v1-oneshot, v2-iterative | only prose | Concrete examples in the prompt nudge the agent toward starting with tests, which contaminates the non-TDD condition — the whole point of v1/v2 is to observe what happens when the agent is not steered into test-first. |
| v3-basic-tdd, v4-exact-subagents, v5-exact-single-context | prose, example-mapping, user-story | Examples serve as natural test cases — for TDD workflows this is the ideal task shape. |
Consequences for RQ design:
- Workflow as factor: Factor is named workflow_x_prompt and is a paired list of {workflow, prompt} tuples. Default pairing: v1/v2 → prose, v3/v4/v5 → example-mapping (the so-called "fair" comparison).
- Workflow as control: controls.workflow and controls.prompt are set together, respecting the constraint.
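The pairing rules from the table can be expressed as a small validation sketch. The names PERMITTED_PROMPTS and invalid_pairings are hypothetical, and rejecting unknown workflows is an assumption; the repository tooling may not enforce the constraint this way.

```python
ALL_STYLES = {"prose", "example-mapping", "user-story"}

# Permitted prompt styles per workflow, copied from the table above.
PERMITTED_PROMPTS = {
    "v1-oneshot": {"prose"},
    "v2-iterative": {"prose"},
    "v3-basic-tdd": ALL_STYLES,
    "v4-exact-subagents": ALL_STYLES,
    "v5-exact-single-context": ALL_STYLES,
}

def invalid_pairings(pairs: list[dict]) -> list[dict]:
    """Return the {workflow, prompt} pairs that break the symmetry rule."""
    return [
        pair for pair in pairs
        if pair["prompt"] not in PERMITTED_PROMPTS.get(pair["workflow"], set())
    ]

pairs = [
    {"workflow": "v1-oneshot", "prompt": "prose"},
    {"workflow": "v4-exact-subagents", "prompt": "example-mapping"},
    {"workflow": "v2-iterative", "prompt": "example-mapping"},  # not permitted
]
print(invalid_pairings(pairs))
```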
From the re-evaluation of an earlier 235-run study, three constraints are stable:
- Classic katas live in training data (string-calculator, pixel-art-scaler, etc.). Models solve them too trivially: smell_total = 0 consistently.
- Pixel-art-scaler is not usable as a novel-kata sanity check (no workflow or model differentiation).
- Code-quality signal is only visible on game-of-life and mars-rover. Statements about smell_total, cc_longest_function, etc. must be based on these katas; cross-kata averages over trivial katas dilute the signal.
Consequence for RQs: All current RQs use kata_base: game-of-life as the default. mars-rover stays available for cross-kata validation once enough replicates exist. Generalizability claims about arbitrary katas are 🚫 not testable with the current design.
New novel kata: claim-office was added as a fresh, non-classic kata with enough complexity to differentiate workflows and models. It is not in training data and ships with an external acceptance suite (see CLI katas), so correctness is measured from the outside via verification_pct. Once enough replicates land, it becomes the second carrier of the code-quality signal alongside game-of-life.
Reduced variant claim-office-lite: a derived kata with the same quote rules but claim stripped of cap-tracking and multi-claim chains (10 scenarios instead of 15, all 6 ambiguities preserved). Discriminability test 2026-05-21 (7 workflows × 2 replicates × example-mapping × opus-4-7-portkey-no-thinking) showed it differentiates workflows on code-quality metrics (cognitive_max 3–21, mccabe_max 4–15) and wallclock (3 min to 88 min) — usable as a faster code-quality kata alongside game-of-life. Not usable for correctness research: with example-mapping verification_pct saturates at 9–10/10 for all workflows; with prose it collapses to 2–4/10 without workflow separation. For correctness use the full claim-office.
"When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law
The moment a workflow or agent prompt explicitly names a metric (e.g. "reduce cognitive complexity", "keep functions short"), runs of that workflow no longer produce independent observations of that metric — they produce compliance signals. A finding like "v6.6 improves cognitive_max over v6.5" then partly measures how diligently the agent follows its own instructions, not how good the resulting code is.
This is not an absolute disqualifier — naming code_mass ideas (APP) in refactor prompts is already standard practice in the v5/v6 line, and code_mass is still measured. But the contamination must be tracked explicitly:
- Metrics named in the prompt of a workflow under test are compliance metrics for that workflow. Cross-workflow comparisons on those metrics are valid only if both workflows name them (or neither does).
- Hidden metrics (never named in any prompt) stay independent. mutation_score is the strongest example: it measures test-suite quality from the outside and is hard to game without also producing better tests.
- No numeric thresholds in workflow/agent prompts (e.g. cognitive_max < 15, LoC < 50). Katas and scenarios are too volatile for any single threshold to be meaningful, so use qualitative language only ("reduce", "extract when …"). Mirrored in CLAUDE.md under "Editing workflows".
- When introducing a new metric into an agent's vocabulary, open a new workflow variant (e.g. v6.6) rather than mutating an existing one, so prior findings on that metric remain interpretable.
When in doubt, list the named metrics in the workflow's README.md / header so future RQs know which outcomes are compliance signals for it.
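The comparison rule for compliance metrics reduces to a one-line check. The sketch below is illustrative only: the function name and the workflow names workflow-a and workflow-b are hypothetical, and the sets of named metrics would have to be maintained by hand from each workflow's README.

```python
def comparison_is_valid(metric: str, named_by_a: set[str], named_by_b: set[str]) -> bool:
    """A cross-workflow comparison on a metric is valid only if both
    workflows name the metric in their prompts, or neither does."""
    return (metric in named_by_a) == (metric in named_by_b)

# Hypothetical bookkeeping of metrics named in each workflow's prompts.
named_metrics = {
    "workflow-a": {"code_mass"},
    "workflow-b": {"code_mass", "cognitive_max"},
}

a, b = named_metrics["workflow-a"], named_metrics["workflow-b"]
print(comparison_is_valid("code_mass", a, b))       # True: both name it
print(comparison_is_valid("cognitive_max", a, b))   # False: only one names it
print(comparison_is_valid("mutation_score", a, b))  # True: hidden in both
```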
Each run has a hard wallclock budget (default 90 min, set via CLAUDE_TIMEOUT_SECONDS=5400 in run-batch.sh). When a (workflow, model, kata) cell systematically hits this limit, that's not a data error — it is itself the finding: the variant is impractical within the chosen cost frame.
Consequences for analysis and data collection:
- Timeout runs are not deleted. Their metrics.json is preserved with run_status.exit_reason = "timeout". tests_passing, verification_pct, code_mass, etc. are null.
- Exhausted retry budgets (exit_reason = "rate-limited" or "transient-api-error") are also folded into completed_within_budget = false: not because the wallclock ran out, but because a (workflow, model, kata) cell that repeatedly trips rate limits or transient API errors is practically unusable inside the lab's cost/availability envelope. Configurable via BATCH_RATELIMIT_RETRIES (default 5).
- Timeout runs count toward min_replicates. batch-plan-from-rq.py treats a timeout as a legitimate data point; no refill is generated for timeout cells.
- completed_within_budget (Boolean, derived from exit_reason) is available as an outcome and reports the share of "finished within budget" per cell. Sensible as an outcome in any RQ whose factors vary workflow or model.
- The n_ok column in the cell coverage table of summary.md only counts successful runs; a "3 timeouts, 0 OK" cell is flagged ⚠️ even when min_replicates is formally met.
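How these rules interact per cell can be sketched as follows. Several details are assumptions made for illustration: the helper names, the exit reason "completed" for a successful run, counting every run in the cell toward n, equating n_ok with runs that finished within budget, and the exact flag rule (warn when n_ok is below min_replicates).

```python
# Exit reasons that the text folds into completed_within_budget = false.
OVER_BUDGET = {"timeout", "rate-limited", "transient-api-error"}

def completed_within_budget(exit_reason: str) -> bool:
    return exit_reason not in OVER_BUDGET

def cell_coverage(exit_reasons: list[str], min_replicates: int) -> dict:
    """Summarise one cell from the exit reasons of its runs (hypothetical helper)."""
    n = len(exit_reasons)
    n_ok = sum(completed_within_budget(r) for r in exit_reasons)
    return {
        "n": n,
        "n_ok": n_ok,
        "within_budget_pct": 100 * n_ok / n if n else None,
        "replicates_met": n >= min_replicates,
        # Assumed flag rule: warn when fewer successful runs than required.
        "warn": n_ok < min_replicates,
    }

coverage = cell_coverage(["timeout", "timeout", "timeout"], min_replicates=3)
print(coverage)
```

For the "3 timeouts, 0 OK" case the cell formally meets min_replicates, yet n_ok is 0 and the warning flag is set.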
- ✅ stabil (stable): data robustly supports the finding (n≥3, clear signal)
- ⚠️ bedingt (conditional): only holds under a qualifying condition (named in the finding)
- ❌ widerlegt (refuted): data clearly contradicts the finding
- 🚫 offen (open): data basis missing; status open
| File | Owner | Lifespan | Purpose |
|---|---|---|---|
| research/<tree>/<chapter>-*/README.md | human-curated | persistent | The question, its frontmatter selector, hypotheses, design rationale |
| research/<tree>/<chapter>-*/findings.md | human-curated | persistent, growing | Numbered findings with status flags. Survives data refreshes |
| research/<tree>/<chapter>-*/runs.csv | generated by aggregate-by-query.py | regenerated on demand | Flat table of all runs matching the selector |
| research/<tree>/<chapter>-*/summary.md | generated by aggregate-by-query.py | regenerated on demand | Pivot tables per outcome × cell |
| experiments/runs/<id>/metrics.json | produced per run | immutable artefact | The atomic data point; every aggregation traces back here |
| experiments/batch-plans/<rq>-fill.json | generated by batch-plan-from-rq.py | regenerated, throwaway | Missing-cell list for the next batch run |
| research/_archive/ | human-curated snapshot | frozen | Past analyses preserved with mapping tables to current RQ findings |
.
├── .claude/
│ └── skills/ # Claude Code skills for repo workflows
│ ├── run-rq/ # /run-rq RQ-N — drive an RQ end-to-end
│ └── build-overview/ # /build-overview — generate cross-RQ snapshot
├── experiments/
│ ├── katas/ # Coding exercises (problem definitions)
│ ├── workflows/ # Workflow variants (v1–v5 + pi variants)
│ ├── runs/ # Recorded experiment results (flat pool)
│ ├── batch-plans/ # JSON batch specs (auto-generated per RQ)
│ ├── docker/ # Containerized batch execution
│ ├── record-run.sh # Run a single experiment interactively
│ ├── analyze-run.sh # Generate analysis-report.md + metrics.json for a run
│ ├── reanalyze-all-runs.sh # Backfill metrics across every existing run
│ ├── aggregate-by-query.py # RQ frontmatter → runs.csv + summary.md
│ ├── batch-plan-from-rq.py # RQ frontmatter → batch plan filling missing cells
│ ├── analyze_transcript.py # Parse CC/OC transcript.jsonl for TDD-cycle metrics
│ ├── parse_pi_transcript.py # Parse pi transcript-pi.jsonl (text-marker-based cycle counting)
│ ├── parse_opencode_transcript.py # Parse OpenCode session exports
│ └── generate-snapshot-skeleton.py # Cross-RQ snapshot skeleton (used by /build-overview)
├── research/
│ ├── RQ-prompt-correctness-workflow-effect/ # Per-RQ:
│ │ ├── README.md # frontmatter selector + question + hypotheses
│ │ ├── findings.md # curated, growing list of numbered findings
│ │ ├── runs.csv # generated: raw data of all matching runs
│ │ └── summary.md # generated: pivot tables per outcome × cell
│ ├── RQ-prompt-known-kata-prompt-style/
│ ├── ... # one directory per active RQ
│ ├── kata-design/ # kata construction guidelines
│ └── _archive/ # frozen snapshots of past analyses + experiment-overview snapshots
├── HUMAN-IN-THE-LOOP.md # Optional HITL checkpoint guide
├── WORKTREE-WORKFLOW.md # Persistent agent-worktree convention
└── todos_and_ideas.txt # Future research directions
| Variant | Approach | Description |
|---|---|---|
| v1-oneshot | No TDD | Direct implementation in one shot ("vibe coding" baseline) |
| v2-iterative | No TDD | Iterative prompting with plan/checklist |
| v3-basic-tdd | Minimal TDD | Just "use TDD" — no detailed rules |
| v4-exact-subagents | Structured TDD | Each TDD phase (red/green/refactor) runs in a separate, isolated agent |
| v5-exact-single-context | Structured TDD | All TDD phases run in one continuous context using inline skills |
Single agent reads requirements, writes code, then adds tests after the fact. Baseline that measures the value of TDD itself. Tests are added based on the Example Mapping for fair comparison.
v1-oneshot/.claude/
└── rules/
└── experiment-mode.md # Non-TDD approach + output format
Single agent builds an explicit checklist, implements step by step, then adds tests after. Measures whether structured iteration alone (without TDD) improves over one-shot.
v2-iterative/.claude/
└── rules/
└── experiment-mode.md # Iterative approach + output format
Single agent with minimal TDD rules — no phase-specific guidance, no agent spawning. Claude decides how to structure its TDD process. Lowest TDD overhead, maximum flexibility.
v3-basic-tdd/.claude/
└── rules/
└── experiment-mode.md # Minimal TDD guidance + output format
Each TDD phase runs as a specialized subagent with isolated context, invoked via the Task tool with subagent_type parameter.
Main Agent
├── Task(test-list) → Creates test list [isolated context]
├── Task(red) → Activates test [isolated context]
├── Task(green) → Minimal implementation [isolated context]
└── Task(refactor) → Improves code [isolated context]
Hypothesis: isolated contexts enforce discipline but may lose state between phases. Fresh context per phase avoids accumulated noise; comes with agent-spawning overhead.
v4-exact-subagents/.claude/
├── agents/ # Subagent definitions
│ ├── test-list.md
│ ├── red.md
│ ├── green.md
│ └── refactor.md
└── rules/
├── tdd.md # Main TDD rules (uses Task tool)
├── tdd_with_ts_and_vitest.md
└── tdd-experiment-mode.md # Autonomous mode for experiments
All TDD phases run in one continuous context using inline skills via the Skill tool.
Single Agent
├── Skill(/test-list) → Creates test list [same context]
├── Skill(/red) → Activates test [same context]
├── Skill(/green) → Minimal implementation [same context]
└── Skill(/refactor) → Improves code [same context]
Hypothesis: shared context maintains state but may lead to less discipline. No agent-spawning overhead; risk of context pollution / over-implementation.
v5-exact-single-context/.claude/
├── commands/ # Skill definitions (inline execution)
│ ├── test-list.md
│ ├── red.md
│ ├── green.md
│ └── refactor.md
└── rules/
├── tdd.md # Main TDD rules (uses Skill tool)
├── tdd_with_ts_and_vitest.md
└── tdd-experiment-mode.md # Autonomous mode for experiments
| Aspect | v1-oneshot | v2-iterative | v3-basic-tdd | v4-exact-subagents | v5-exact-single-context |
|---|---|---|---|---|---|
| TDD | ❌ No | ❌ No | ✅ Yes (minimal) | ✅ Yes (strict) | ✅ Yes (strict) |
| Mechanism | Direct code | Checklist | None | Task(subagent_type: "red") | Skill(skill: "red") |
| Context | Single | Single | Single | Isolated per phase | Shared across phases |
| Guidance | None | Plan/checklist | Minimal TDD | Specialized agents | Inline skills |
| Definitions | None | None | None | agents/*.md | commands/*.md |
| Overhead | None | None | None | Agent spawning | None |
The same workflow design can be deployed on different coding-agent harnesses (Claude Code, OpenCode, pi). The TDD-cycle measurement pipeline adapts per harness:
| Harness | Cycle-counting mechanism | Transcript parser |
|---|---|---|
| Claude Code | Skill tool calls with skill: "red" | analyze_transcript.py |
| OpenCode | Same Skill-based logic | parse_opencode_transcript.py |
| pi | Text markers (## Red headers) in assistant output | parse_pi_transcript.py |
Why pi needs text markers: pi skills are auto-loaded documents, not tool calls. The model reads each SKILL.md once and follows its instructions directly ("freehand"), producing ## Red headings per cycle instead of Skill({skill: "red"}) tool invocations. The refactor subagent still uses the subagent tool (counted the same way across all harnesses). See experiments/workflows/MARKERS.md for the full marker specification.
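Marker-based cycle counting can be sketched with a regular expression. This is only an approximation: the authoritative marker format is defined in experiments/workflows/MARKERS.md, and the pattern, the function name, and the sample text below are assumptions rather than the logic of parse_pi_transcript.py.

```python
import re

# Assumed marker shape: a level-2 heading that starts with "Red" at line start.
RED_MARKER = re.compile(r"^## Red\b", re.MULTILINE)

def count_red_markers(assistant_text: str) -> int:
    """Count '## Red' headings as a proxy for started TDD cycles."""
    return len(RED_MARKER.findall(assistant_text))

sample = """## Red
Activate the next test from the list.
## Green
Minimal implementation.
## Refactor
Extract a helper.
## Red
Activate the following test.
"""
print(count_red_markers(sample))  # 2
```

The word boundary keeps headings such as ## Refactor from being counted as a red phase.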
Workflow directories follow a naming convention: <variant-name> (Claude Code), <variant-name>-oc (OpenCode), <variant-name>-pi (pi). Each stores harness-specific configuration (.claude/ for CC, .opencode/ for OC, .pi/ for pi).
The runner pins the full Claude API model ID per config. The short aliases opus / sonnet are intentionally avoided because they currently resolve to legacy versions (e.g. opus → claude-opus-4-6, not Opus 4.7). Bump these entries when new model versions ship.
In RQ frontmatter, lab-variant IDs are pinned — not the Claude API IDs (claude-opus-4-7), not the short aliases (opus). A lab-variant ID uniquely combines model and thinking mode:
| Lab-variant ID | API model ID | Thinking | Mechanism |
|---|---|---|---|
| opus-4-7 | claude-opus-4-7 | Adaptive | Default behavior |
| opus-4-7-no-thinking | claude-opus-4-7 | Off | MAX_THINKING_TOKENS=0 |
| sonnet-4-6 | claude-sonnet-4-6 | Extended | Default behavior |
| sonnet-4-6-no-thinking | claude-sonnet-4-6 | Off | MAX_THINKING_TOKENS=0 |
| haiku-4-5 | claude-haiku-4-5-20251001 | Extended | Default behavior |
| haiku-4-5-no-thinking | claude-haiku-4-5-20251001 | Off | MAX_THINKING_TOKENS=0 |
The ID exactly matches the model field in metrics.json and the suffix in the run directory name. Source: MODEL_CONFIGS in experiments/record-run.sh and experiments/docker/run-batch.sh.
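For readers who want the mapping in code form, the table can be mirrored in Python. This is an illustrative sketch only: the authoritative definition is MODEL_CONFIGS in the shell scripts, and the names API_MODEL_IDS and resolve_lab_variant are hypothetical.

```python
# Python mirror of the lab-variant table; values copied from the table above.
API_MODEL_IDS = {
    "opus-4-7": "claude-opus-4-7",
    "sonnet-4-6": "claude-sonnet-4-6",
    "haiku-4-5": "claude-haiku-4-5-20251001",
}
NO_THINKING_SUFFIX = "-no-thinking"

def resolve_lab_variant(lab_variant: str) -> tuple[str, dict[str, str]]:
    """Return (API model ID, extra environment variables) for a lab-variant ID."""
    thinking_off = lab_variant.endswith(NO_THINKING_SUFFIX)
    base = lab_variant[: -len(NO_THINKING_SUFFIX)] if thinking_off else lab_variant
    if base not in API_MODEL_IDS:
        raise KeyError(f"unknown lab-variant ID: {lab_variant}")
    # Thinking is switched off via the environment variable named in the table.
    env = {"MAX_THINKING_TOKENS": "0"} if thinking_off else {}
    return API_MODEL_IDS[base], env

print(resolve_lab_variant("opus-4-7-no-thinking"))
print(resolve_lab_variant("haiku-4-5"))
```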
Katas live under experiments/katas/<basename>-<prompt-style>/prompt.md. The directory name typically ends with one of -prose, -user-story, or -example-mapping (the prompt style); the part before is the basename.
Currently maintained kata families, grouped by training-data exposure:
Known katas (classic exercises that the model has likely seen during training, but complex enough to still produce code-quality signal):
- game-of-life: primary code-quality kata, all three prompt styles
- mars-rover: secondary code-quality kata, all three prompt styles (mostly prose runs so far)
Novel katas (custom-built for this lab, not in training data; ship with external acceptance suite, see CLI katas):
- claim-office: full version with quote + claim (incl. cap-tracking and multi-claim chains), 15 verification scenarios. Primary correctness kata via verification_pct.
- claim-office-lite: reduced derivative with quote rules unchanged, claim without cap/multi-claim, 10 scenarios. All 6 ambiguities preserved. Use for code-quality research (~3× faster than full claim-office); not for correctness (saturates on example-mapping, collapses on prose). See Methodology constraints for details.
Older classic katas (string-calculator, pixel-art-scaler, diamond) were dropped because training-data contamination collapses the code-quality signal — see Methodology constraints.
For deeper guidance on building good katas (ambiguity construction, ruling strategies, test-suite distribution, anti-patterns), see research/kata-design/kata-construction.md.
For katas where correctness should be measured against a fixed acceptance suite that the implementer does not see (e.g. claim-office), add a sibling directory katas/<basename>-verification/ with:
- runner.json: {"command": "...", "timeout_seconds": 30}. For TypeScript CLIs: pnpm exec tsx src/cli.ts (assumes tsx is in the run's devDependencies; the standard pnpm template includes it).
- scenarios/NN-name.input.json: JSON document piped as stdin.
- scenarios/NN-name.expected.json: expected JSON on stdout (compared canonically via jq -S .).
- Optional scenarios/NN-name.story.md for narrative scenarios used in workshop reuse.
Conventions:
- The implementation's CLI entry point must be at src/cli.ts.
- The prompt must specify "CLI executable that reads JSON from stdin and writes JSON to stdout".
- The kata's prompt does not include the verification scenarios; they are private acceptance tests measured by analyze-run.sh after the run completes.
After each run, analyze-run.sh automatically detects the <basename>-verification/ directory, pipes each scenario input into the CLI (in the run directory), and compares canonical JSON output against the expected JSON. Per-scenario results land in verification.log; counts go into metrics.json as final_metrics.verification_total, verification_passed, and verification_pct (a fraction 0.0–1.0). For non-CLI katas without a verification directory, these fields are 0/null.
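The comparison and the resulting counts can be approximated in Python. This is a sketch, not the implementation in analyze-run.sh, which compares via jq -S: here key sorting is done with json.dumps(sort_keys=True), number formatting may differ from jq in edge cases, and the helper names are hypothetical.

```python
import json

def canonical(text: str) -> str:
    """Serialise JSON with sorted keys so key order does not affect equality."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

def scenario_passes(actual_stdout: str, expected_text: str) -> bool:
    try:
        return canonical(actual_stdout) == canonical(expected_text)
    except json.JSONDecodeError:
        return False  # non-JSON output counts as a failed scenario

def verification_metrics(results: list[tuple[str, str]]) -> dict:
    """results holds (actual stdout, expected JSON text) per scenario."""
    total = len(results)
    passed = sum(scenario_passes(actual, expected) for actual, expected in results)
    return {
        "verification_total": total,
        "verification_passed": passed,
        "verification_pct": passed / total if total else None,  # fraction 0.0-1.0
    }

results = [
    ('{"b": 1, "a": 2}', '{"a": 2, "b": 1}'),  # same document, different key order
    ('{"a": 3}', '{"a": 2}'),                  # wrong value
    ("not json", "{}"),                        # CLI printed something else
]
print(verification_metrics(results))
```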
- Claude Code CLI installed
- Anthropic API key (or compatible provider)
cd experiments
./record-run.sh
This interactively prompts you to select a kata, workflow, and model, then launches Claude Code to execute the task. Results are saved to experiments/runs/<timestamp>_<kata>_<workflow>_<model>/.
cd experiments/docker
cp .env.example .env # add your API key
docker compose build
Then either run an existing batch plan or generate one for a research question:
# Generate a fill plan for an RQ (only missing replicates)
./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/
# Execute the batch
./experiments/docker/batch.sh experiments/batch-plans/rq-3-fill.json

./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/
This reads the RQ frontmatter, selects all matching runs from experiments/runs/, and writes runs.csv + summary.md into the RQ directory. Findings are then curated by hand into findings.md with status flags (✅ stabil, ⚠️ bedingt, ❌ widerlegt, 🚫 offen).
# 1. Plan: generate the missing cells for an RQ
./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/
# 2. Execute: run the batch in Docker
./experiments/docker/batch.sh experiments/batch-plans/rq-3-fill.json
# 2b. Optional: compute Stryker mutation_score for green runs
# (idempotent; only does anything if outcomes: [..., mutation_score])
./experiments/compute-mutation-score.py research/questions-claude/2.1-model-effect-code-quality/
# 3. Aggregate: re-derive runs.csv + summary.md
./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/
# 4. Curate: update findings.md with new evidence and status flags
Two Claude Code skills automate the repetitive parts of the loop above. They live in .claude/skills/ and are invoked as slash commands inside Claude Code.
| Skill | Invocation | Purpose |
|---|---|---|
| run-rq | /run-rq RQ-N | Drives a single RQ end-to-end: validates the README, generates a fill batch-plan, starts the Docker batch in the background, monitors progress, runs aggregation, proposes findings updates. Pure orchestration: every step calls existing repo scripts. |
| build-overview | /build-overview | Generates a cross-RQ snapshot under research/_archive/experiment-overview-YYYY-MM-DD.md. Step 1 runs experiments/generate-snapshot-skeleton.py (data sections, finding lists, caveats). Step 2 fills synthesis sections (RQ paragraphs, cross-RQ synthesis, limitations) from the current findings.md. Reproducible because numbers come from the skeleton, not from model memory. |
The skills replace the manual end-to-end loop in routine usage:
manual                        ⇒   skill
------------------------          ----------------
batch-plan-from-rq.py
batch.sh                      ⇒   /run-rq RQ-N
watch-batch.sh
aggregate-by-query.py

read all findings.md
write overview.md             ⇒   /build-overview
findings.md is alive — findings grow, status tags get updated, individual findings can be revised or discarded. For publishable point-in-time reports (table-heavy, cross-RQ synthesis) there are snapshots under research/_archive/experiment-overview-YYYY-MM-DD.md.
- experiments/generate-snapshot-skeleton.py builds a skeleton with all data sections (data-base figures, coverage, raw finding lists per RQ, caveats section over all ⚠️/❌/🚫 findings).
- The /build-overview skill fills the synthesis sections (RQ paragraphs, cross-RQ synthesis, conclusions) from findings.md and writes to research/_archive/.
findings.md stays the single source of truth, and the snapshot is reproducible rather than written from model memory.
All scripts are designed to be run from the repo root unless noted otherwise. .py scripts use Python 3 with PyYAML + pandas; .sh scripts are bash.
| Script | Purpose |
|---|---|
| experiments/record-run.sh | Interactive: pick kata + workflow + model, launch Claude Code locally, record everything into experiments/runs/<id>/. Used for ad-hoc single runs outside Docker. |
| experiments/analyze-run.sh | Post-process one run directory: install pnpm deps, run vitest + ESLint+SonarJS, call analyze_transcript.py, emit analysis-report.md and metrics.json. Idempotent: safe to re-run after pipeline fixes. |
| experiments/reanalyze-all-runs.sh | Backfill metrics for every existing run after analyze-run.sh is extended (e.g. new fields like mccabe_* / cognitive_*). Iterates over every runs/<run>/, runs pnpm install against the shared runs/.pnpm-store for any run missing node_modules/, and re-invokes analyze-run.sh. Output to experiments/reanalyze.log (gitignored). |
| experiments/analyze_transcript.py | Parse transcript.jsonl (+ transcript-subagents/) for TDD-cycle metrics: phase inference, prediction accuracy, refactorings applied, token totals, context-window utilization. Writes transcript-metrics.json. Used for Claude Code runs. |
| experiments/parse_pi_transcript.py | Parse transcript-pi.jsonl for the same TDD-cycle metrics. Used for pi runs, where skills are auto-loaded documents (not tool calls) and cycle counting relies on text markers (## Red headers) in assistant output instead of Skill tool invocations. Writes transcript-metrics.json. |
| experiments/parse_opencode_transcript.py | Parse OpenCode session exports into transcript-metrics.json. Used for OpenCode runs. |
| Script | Purpose |
|---|---|
| experiments/aggregate-by-query.py | Reads an RQ frontmatter, selects matching runs from the pool, writes runs.csv + summary.md into the RQ directory. The canonical aggregator. |
| experiments/batch-plan-from-rq.py | Reads an RQ frontmatter, computes missing cells against min_replicates, writes experiments/batch-plans/<rq-id>-fill.json. Idempotent: empty plan if everything is covered. |
| experiments/compute-mutation-score.py | RQ-driven mutation testing. If the RQ lists mutation_score in outcomes, runs Stryker against every matching green run (tests_passing = true) and writes final_metrics.mutation_score back into metrics.json. Idempotent (skip when already set) and bounded by --timeout-seconds. No-op when the RQ does not request the outcome. Run between batch execution and aggregation. |
| experiments/generate-snapshot-skeleton.py | Reads all README.md + findings.md under research/questions-{claude,opencode,cross}/ and research/workflow-dev/, emits a Markdown skeleton to /tmp/snapshot-skeleton-YYYY-MM-DD.md with data sections (run counts, coverage per RQ, finding lists sorted by status, cross-RQ caveats) pre-filled and synthesis sections marked with <!-- TODO Claude -->. Consumed by the /build-overview skill. |
| Script | Purpose |
|---|---|
| docker/batch.sh | Convenience wrapper around docker compose --profile batch run --rm batch. Accepts a plan name or path: ./batch.sh rq-3-fill or ./batch.sh /abs/path/plan.json. Tees output to batch.<plan>.log. Supports --shards N for parallel runs (round-robin split, default 2, do not exceed 3 due to memory + API rate-limit pressure) and --detach for background execution. |
| docker/run-batch.sh | The inside-container entrypoint invoked by batch.sh. Reads the plan, executes each (kata, workflow, model) triple via Claude Code, calls analyze-run.sh, copies transcripts. Not normally invoked directly. |
| docker/list-plans.sh | Print every experiments/batch-plans/*.json with name, description, and run count. Useful before kicking off a batch. |
| docker/resume-plan.sh | Compute remaining work for a plan: subtract triples already present in experiments/runs/ from the original plan, write /tmp/<plan>-resume.json. Useful after a crashed/cancelled batch. |
| docker/watch-batch.sh | Status snapshot of running or recently finished batches. Auto-detects all docker-batch-run-* containers; falls back to the last entry in batch.log if none running. Reports per-cell progress, ETAs, and tail of run.log. Pass a plan name to scope to one batch. |
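
The round-robin split behind --shards N can be illustrated in a few lines. This is a sketch of the general technique only; batch.sh is a shell script and its actual splitting code is not shown here, so the function name `split_round_robin` is hypothetical.

```python
def split_round_robin(runs, shards=2):
    """Deal runs out to shards like cards: run i goes to shard i % shards."""
    if shards < 1:
        raise ValueError("shards must be >= 1")
    return [runs[i::shards] for i in range(shards)]


runs = ["run-a", "run-b", "run-c", "run-d", "run-e"]
for index, shard in enumerate(split_round_robin(runs, shards=2)):
    print(index, shard)
```

Dealing runs out in turn keeps shard sizes within one run of each other, and because neighbouring plan entries land in different shards, a plan sorted by model or kata does not concentrate one kind of run in a single shard.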
Fill a research question:
./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/
./experiments/docker/list-plans.sh # verify
./experiments/docker/batch.sh rq-3-fill # run
./experiments/docker/watch-batch.sh # monitor in another shell
./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/

Recover from an interrupted batch:
./experiments/docker/resume-plan.sh rq-3-fill # writes /tmp/rq-3-fill-resume.json
./experiments/docker/batch.sh /tmp/rq-3-fill-resume.json

Re-derive metrics for a single run after a pipeline fix:
./experiments/analyze-run.sh experiments/runs/2026-05-04_*_v4-exact-subagents_opus-4-7

Run experiments in isolated, reproducible Docker containers.
cd experiments/docker
# Copy and edit environment file
cp .env.example .env
# Set your user/group IDs (fixes volume permissions)
echo "USER_ID=$(id -u)" >> .env
echo "GROUP_ID=$(id -g)" >> .env
# Edit .env with your API credentials (direct or Portkey)
# Build
docker compose build

# List all available plans with description and run count
./list-plans.sh
# Run a plan (extension and full path optional)
./batch.sh smoke-test
./batch.sh /abs/path/to/plan.json

Plan files live in experiments/batch-plans/*.json and list explicit {kata, workflow, model} triples. They are validated fail-fast against available katas, workflows, and the hard-coded MODEL_CONFIGS list, so typos abort before any Claude call.
Plan file schema:
{
"name": "Optional plan name",
"description": "Optional description",
"runs": [
{ "kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonnet-4-6" }
]
}

model is the lab-variant ID from MODEL_CONFIGS in run-batch.sh (e.g. sonnet-4-6, opus-4-7-no-thinking, haiku-4-5), not the full API ID.
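
To make the fail-fast validation concrete, here is a minimal sketch in Python. The real check lives in run-batch.sh and is not reproduced here; `validate_plan`, the error-message format, and the three-entry `MODEL_CONFIGS` set (only the IDs named in this section) are assumptions for illustration.

```python
# Only the model IDs mentioned in this section; the real list may be longer.
MODEL_CONFIGS = {"sonnet-4-6", "opus-4-7-no-thinking", "haiku-4-5"}


def validate_plan(plan, katas, workflows, models=MODEL_CONFIGS):
    """Collect every unknown kata/workflow/model before any run is started."""
    runs = plan.get("runs")
    if not isinstance(runs, list):
        return ["plan must contain a 'runs' list"]
    errors = []
    for index, run in enumerate(runs):
        checks = (("kata", katas), ("workflow", workflows), ("model", models))
        for field, allowed in checks:
            value = run.get(field)
            if value not in allowed:
                errors.append(f"runs[{index}].{field}: unknown value {value!r}")
    return errors


plan = {
    "name": "typo demo",
    "runs": [
        {"kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonet-4-6"}
    ],
}
problems = validate_plan(plan, katas={"game-of-life-prose"}, workflows={"v3-basic-tdd"})
print(problems)  # the misspelled model ID is reported; nothing has been executed yet
```

Collecting all errors before aborting, rather than stopping at the first one, lets a plan author fix every typo in one pass.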
Each run is wrapped with a 90-minute timeout (override via CLAUDE_TIMEOUT_SECONDS=...). Stdout/stderr is captured to runs/<run>/run.log. Transient API errors (rate limits, 429, overloaded) are retried with backoff up to 5 times; if they persist, the run is recorded with exit_reason = "rate-limited" or "transient-api-error" and the batch continues. Each metrics.json gets a run_status block with exit_code, exit_reason, and rate_limited.
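
The retry behaviour described above can be sketched as a loop around a single attempt. This is an illustration under stated assumptions, not the logic in run-batch.sh: the exponential delay schedule, the reading of "up to 5 times" as five retries after the first attempt, and the rule that rate_limited is true exactly when the final exit_reason is "rate-limited" are all guesses, and `run_with_retries` is a hypothetical name.

```python
import time

# Exit reasons the text names as transient; any other reason ends the loop.
TRANSIENT_REASONS = {"rate-limited", "transient-api-error"}


def run_with_retries(attempt, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt() until it stops failing transiently or retries run out.

    attempt() returns (exit_code, exit_reason). The result mirrors the
    run_status block: exit_code, exit_reason, rate_limited.
    """
    exit_code, exit_reason = attempt()
    retries = 0
    while exit_reason in TRANSIENT_REASONS and retries < max_retries:
        sleep(base_delay * 2 ** retries)  # assumed schedule: 1s, 2s, 4s, ...
        retries += 1
        exit_code, exit_reason = attempt()
    return {
        "exit_code": exit_code,
        "exit_reason": exit_reason,
        "rate_limited": exit_reason == "rate-limited",  # assumed derivation
    }


# Simulated run: two transient failures, then success. Delays are recorded
# instead of slept so the example finishes instantly.
recorded_delays = []
outcomes = iter([(1, "rate-limited"), (1, "transient-api-error"), (0, "ok")])
status = run_with_retries(lambda: next(outcomes), sleep=recorded_delays.append)
print(status, recorded_delays)
```

If the transient error persists through every retry, the function returns the last transient exit_reason instead of raising, which matches the described behaviour of recording the run and letting the batch continue.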
- Base image: node:22-slim
- Tools: Node.js 22, pnpm, TypeScript, Vitest, Claude Code CLI, jq, git
- Resource limits: 2 CPUs, 4 GB memory (override in docker-compose.yml if needed)
- Security: non-root user (experimenter), read-only mounts for katas/workflows, API key mounted securely
| Host path | Container path | Mode |
|---|---|---|
| ../katas | /home/experimenter/experiments/katas | ro |
| ../workflows | /home/experimenter/experiments/workflows | ro |
| ../runs | /home/experimenter/experiments/runs | rw |
| ~/.anthropic/api_key | /home/experimenter/.anthropic/api_key | ro |
| ~/.claude | /home/experimenter/.claude | rw |
Direct Anthropic API
| Variable | Description | Default |
|---|---|---|
| ANTHROPIC_API_KEY | API key for Claude | (required) |
| ANTHROPIC_API_KEY_FILE | Path to file with API key | ~/.anthropic/api_key |
| CLAUDE_CONFIG_DIR | Claude config directory | ~/.claude |
| CLAUDE_TIMEOUT_SECONDS | Per-run wallclock budget | 5400 (90 min) |
Portkey gateway (or other proxy)
| Variable | Description | Example |
|---|---|---|
| ANTHROPIC_BASE_URL | Proxy URL | https://api.portkey.ai |
| ANTHROPIC_AUTH_TOKEN | Auth token (can be dummy for Portkey) | dummy |
| ANTHROPIC_CUSTOM_HEADERS | Custom headers for proxy | x-portkey-api-key: your-key |
Container user (volume permissions)
| Variable | Description | Default |
|---|---|---|
| USER_ID | Container user UID (run id -u) | 1000 |
| GROUP_ID | Container user GID (run id -g) | 1000 |
API key issues
docker compose run --rm experiment cat /home/experimenter/.anthropic/api_key

Permission errors
The container user must match your host user. Ensure USER_ID and GROUP_ID in .env match your system:
id -u # USER_ID
id -g # GROUP_ID
docker compose build --no-cache
# Or fix existing runs directory:
sudo chown -R $(id -u):$(id -g) ../runs

Each run is evaluated on metrics extracted from two sources: direct code analysis of generated files, and the AI-generated experiment-summary.md. All metrics live in metrics.json per run; the analysis pipeline (ESLint with sonarjs/cognitive-complexity, max-depth, etc.) runs inside the Docker batch container.
The Term (binding) column gives the canonical name to use in findings.md, summary.md, and snapshots. These terms are binding — alternatives like "Code-Volumen", "Code-Gesamtvolumen", or "LoC-Größe" for code_mass are forbidden because they are ambiguous or collide with established definitions from the software-craftsmanship literature. When in doubt, cite the metric ID in backticks.
| Metric | Term (binding) | Source | Description |
|---|---|---|---|
| duration_seconds | — | metrics.json | Wall-clock time for the complete task |
| tests_passing | Korrektheit (innen) | test runner | Whether the implementer's own Vitest tests pass: the "inside view" of correctness |
| verification_pct | Korrektheit (außen) | external acceptance suite | For CLI katas with a sibling <basename>-verification/ directory: fraction of acceptance scenarios passed (0.0–1.0). The "outside view" of correctness, measured against scenarios the implementer did not see during the run. null for katas without a verification suite. |
| Metric | Term (binding) | Source | Description |
|---|---|---|---|
| code_mass | Code-Mass (APP) | analyze-run.sh | Weighted sum of code constructs (constants, invocations, conditionals, loops, assignments; heavier weights for higher-complexity constructs) following the Absolute Priority Premise by Micah Martin. Aims to compare implementations objectively beyond raw LoC. Lower = simpler. See Code Cop blog; original talk: Micah Martin, Absolute Priority Premise (8LU, Vimeo). |
| mutation_score | Mutation-Score | Stryker + @stryker-mutator/vitest-runner | Fraction of mutants killed by the implementer's own Vitest tests (0.0–1.0, formula (Killed + Timeout) / (Killed + Survived + Timeout + NoCoverage)). Higher = the test suite genuinely exercises behavior, not just coverage. Computed only when an RQ lists mutation_score in outcomes and tests_passing = true; otherwise null. Driven by experiments/compute-mutation-score.py (separate from analyze-run.sh). |
| cc_loc | Produktiv-LoC | analyze-run.sh | Production LoC only, from the clean-code reporter (no tests) |
| test_lines | Test-LoC | analyze-run.sh | Vitest test code |
| smell_total | Smell-Summe | ESLint + eslint-plugin-sonarjs | Aggregated code-smell count. Sub-counters smell_complexity, smell_duplication, smell_magic_numbers, smell_code_quality group SonarJS rules (e.g. no-duplicate-string, no-collapsible-if) plus a few ESLint built-ins (max-depth, max-lines-per-function, max-params, no-magic-numbers, no-unreachable). |
| cc_longest_function | Spitzen-Komplexität | analyze-run.sh | Longest function in lines (complexity peak per run) |
| cc_avg_loc_per_function | — | analyze-run.sh | Mean function length in lines |
| cc_median_loc_per_function | — | analyze-run.sh | Median function length in lines (robust against single long outliers) |
| mccabe_max, mccabe_avg, mccabe_high_count | — | ESLint complexity rule | McCabe cyclomatic complexity per function: max, mean, and count of functions above the threshold. Concept: Cyclomatic complexity (Wikipedia); originally McCabe 1976. |
| cognitive_max, cognitive_avg, cognitive_high_count | — | ESLint sonarjs/cognitive-complexity | Cognitive complexity per function (SonarSource metric, weights nesting and control-flow breaks heavier than McCabe): max, mean, and count above threshold. Original definition: G. Ann Campbell, Cognitive Complexity whitepaper (PDF). |
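
As a worked example of the mutation_score formula in the table, the sketch below applies (Killed + Timeout) / (Killed + Survived + Timeout + NoCoverage) to made-up mutant counts. The function name and the choice to return None when there are no mutants at all are assumptions; compute-mutation-score.py may handle that edge case differently.

```python
def mutation_score(killed, survived, timeout, no_coverage):
    """(Killed + Timeout) / (Killed + Survived + Timeout + NoCoverage)."""
    detected = killed + timeout
    total = killed + survived + timeout + no_coverage
    # Assumption: with zero mutants there is nothing to score, so return None.
    return detected / total if total else None


# Made-up counts: 42 killed, 10 survived, 3 timed out, 5 without coverage.
score = mutation_score(killed=42, survived=10, timeout=3, no_coverage=5)
print(score)  # 45 detected out of 60 mutants -> 0.75
```

Note that mutants without coverage stay in the denominator, so untested code lowers the score even though no test ever ran against it.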
| Metric | Description |
|---|---|
| tokens_total | Total tokens consumed (input + output + cache) |
| context_utilization | Final context-window utilization percentage |
v4-exact-subagents keeps the main context low because each agent has fresh context. v5-exact-single-context accumulates tokens, so utilization is higher.
Extracted from the harness-specific transcript by the corresponding parser:
| Harness | Transcript file | Parser | Cycle-counting mechanism |
|---|---|---|---|
| Claude Code | transcript.jsonl | analyze_transcript.py | Skill tool calls (skill: "red") |
| OpenCode | transcript.jsonl | parse_opencode_transcript.py | Same Skill-based logic |
| pi | transcript-pi.jsonl | parse_pi_transcript.py | Text markers (## Red headers) in assistant output |
The phase markers that drive these counts are documented in experiments/workflows/MARKERS.md — removing or renaming a marker silently zeroes the corresponding metric.
Why pi uses text markers instead of tool calls: pi skills are auto-loaded documents, not tool calls. The model reads each SKILL.md once and then follows its instructions directly and free-hand ("freihand"). There is no Skill({skill: "red"}) tool invocation per cycle. Instead, the parser counts ## Red headings in the assistant output (one per cycle) and Red Phase Complete: blocks with prediction lines. The derive_cycle_count() function in analyze_transcript.py uses text markers as a tertiary fallback (after Skill and Task tool calls), so both parsers agree on the same priority chain.
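
The priority chain (Skill tool calls first, then Task tool calls, then text markers) can be sketched as follows. This is not the derive_cycle_count() from analyze_transcript.py: the simplified event dictionaries, the `subagent_type` field, and the function name `derive_cycle_count_sketch` are hypothetical stand-ins for the real transcript format.

```python
import re

# One "## Red" heading per cycle, matched at the start of a line.
RED_HEADING = re.compile(r"^## Red\b", re.MULTILINE)


def _count_tool_calls(events, tool_name, input_key):
    """Count tool_use events for `tool_name` whose input marks the red phase."""
    return sum(
        1
        for event in events
        if event.get("type") == "tool_use"
        and event.get("name") == tool_name
        and event.get("input", {}).get(input_key) == "red"
    )


def derive_cycle_count_sketch(events):
    """Skill calls win, then Task calls, then '## Red' headings in text."""
    skill_calls = _count_tool_calls(events, "Skill", "skill")
    if skill_calls:
        return skill_calls
    task_calls = _count_tool_calls(events, "Task", "subagent_type")
    if task_calls:
        return task_calls
    return sum(
        len(RED_HEADING.findall(event.get("text", "")))
        for event in events
        if event.get("type") == "text"
    )


# pi-style transcript: no tool calls, only headings in assistant text.
pi_events = [
    {"type": "text", "text": "## Red\nwriting a failing test\n## Green\nfixed"},
    {"type": "text", "text": "## Red\nnext failing test"},
]
print(derive_cycle_count_sketch(pi_events))  # 2
```

Falling through the chain only when the higher-priority source yields zero means a transcript that contains both tool calls and headings is counted once, by its tool calls.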
| Metric | Description |
|---|---|
| tdd_cycles | Number of red-green-refactor cycles (TDD workflows only); should match test count for proper discipline |
| prediction_accuracy | Correctness of red-phase failure predictions (v4/v5). Higher accuracy shows deeper understanding of code state |
| refactorings | Number of refactoring improvements applied. More refactorings indicate better discipline and cleaner final code |
| tests_immediately_passing | Tests passing immediately in red phase, indicating over-implementation. Lower is better |
| Field | Description |
|---|---|
| run_status.exit_code | Process exit code |
| run_status.exit_reason | ok, timeout, rate-limited, transient-api-error, etc. |
| completed_within_budget | Boolean derived from exit_reason; available as an outcome in any RQ |
Create experiments/katas/<kata-name>/prompt.md with:
- Feature description
- Example Mapping (rules + examples)
- Expected file paths
- Constraints
The directory name typically ends with one of -prose, -user-story, or -example-mapping (the prompt style); the part before is the basename. For CLI katas with an external acceptance suite, see CLI katas with external acceptance suite. For deeper kata-design guidance, see research/kata-design/kata-construction.md.
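
The naming convention can be expressed as a small helper that splits a kata directory name into basename and prompt style. This is a sketch for illustration; `split_kata_name` is a hypothetical name and the framework's own tooling may parse names differently.

```python
# The three prompt-style suffixes named in the text.
PROMPT_STYLES = ("example-mapping", "user-story", "prose")


def split_kata_name(name):
    """Return (basename, prompt_style); style is None if no known suffix matches."""
    for style in PROMPT_STYLES:
        suffix = "-" + style
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)], style
    return name, None


for directory in ("game-of-life-prose", "claim-office-lite-user-story", "claim-office"):
    print(directory, "->", split_kata_name(directory))
```

Since the text says names only "typically" end with a style suffix, the helper returns None for the style instead of raising when no suffix matches.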
Create experiments/workflows/<variant-name>/.claude/ with:
- rules/*.md: TDD rules for this variant
- agents/*.md: agent definitions (if using subagents)
- commands/*.md: skill definitions (if using inline skills)
For pi workflows, create experiments/workflows/<variant-name>-pi/.pi/ with:
- AGENTS.md: TDD rules and mandatory output-format markers (pi skills are documents, not tool calls)
- skills/<phase>/SKILL.md: Skill documents for test-list, red, green
- agents/<name>.md: Subagent definitions (refactor)
See experiments/workflows/v6.2-with-why-cleaned-pi/ for a working example. The pi measurement pipeline (parse_pi_transcript.py) counts text markers (## Red, ## Green) in assistant output rather than Skill tool calls — see MARKERS.md for the full marker specification.
| Document | Description |
|---|---|
| HUMAN-IN-THE-LOOP.md | How to re-enable human approval checkpoints between TDD phases |
| WORKTREE-WORKFLOW.md | Persistent agent-worktree convention for parallel work |
This project is provided as-is for research and educational purposes.