Agentic Coding Lab

A lightweight research framework for systematically comparing agentic coding setups — models, agent architectures, and prompt strategies — using reproducible benchmarks.

Motivation

When working with agentic coding, many practical questions arise: Does a setup with dedicated sub-agents for testing produce cleaner code than a single agent doing everything? Does Sonnet deliver better code quality than Opus on a refactoring task? How do different prompt styles affect the outcome? This framework was built to answer such questions with data instead of gut feeling.

What This Framework Does

The framework runs controlled experiments across four dimensions:

Workflow variants — from simple one-shot generation ("vibe coding") to structured TDD with specialized agents per phase
Model configurations — Opus 4.7, Sonnet 4.6, Haiku 4.5, with and without extended thinking
Coding katas — standardized tasks, known (game-of-life, mars-rover) and novel (claim-office, claim-office-lite); see Katas
Prompt styles — prose, example-mapping, user-story

Each experiment produces measurable metrics: token usage, code complexity, code smells, test coverage, TDD cycle discipline, and more.

Aggregation is research-question-driven: A research question (RQ) defines a selector query (factors × controls × outcomes) over experiments/runs/, and tooling generates runs.csv + summary.md on demand. Batch plans are pure data-collection helpers that fill missing replicates.

Core Concept: Research-Question-Driven Aggregation

Earlier iterations of this lab were batch-driven: each batch produced its own results folder, and a finding was tied to "the batch it came from". That coupling made it hard to evolve hypotheses (every new analysis required a new batch) and hard to combine evidence across batches.

The current model decouples three concerns:

A research question (RQ) declaratively describes what is being studied — its factors, controls, outcomes, and required replicates.
Runs are produced once and live in a single flat pool (experiments/runs/), independent of any RQ or batch.
Aggregations are derived on demand by selecting all runs matching an RQ's frontmatter — across all batches that ever produced matching cells.

                                +-------------------------------+
                                |  research/{questions,         |
                                |    workflow-dev}/<chapter>-*/ |
                                |  README.md (frontmatter)      |
                                +---------+---------------------+
                                          |
                                          | factors × controls
                                          v
            +----------------+    +----------------+    +----------------+
            | aggregate-by-  |    | batch-plan-    |    |   findings.md  |
            | query.py       |    | from-rq.py     |    |  (curated)     |
            +-------+--------+    +--------+-------+    +----------------+
                    |                      |
                    | matches              | missing cells
                    v                      v
            +-----------------+    +------------------------+
            | experiments/    |    | experiments/           |
            | runs/  (pool)   |<---| batch-plans/<rq>-fill  |
            +-----------------+    +------------------------+
                       executed by docker/batch.sh

What an RQ contains

RQ directories live in four subtrees: research/questions-claude/ (Claude Code RQs), research/questions-opencode/ (OpenCode RQs), research/questions-cross/ (cross-harness RQs), and research/workflow-dev/ (workflow-evolution chain). Each directory carries a <chapter>-slug name (e.g. 2.6-lean-validation), where the chapter number is an ordering label, not an id: like a document section heading, it can be renumbered freely with git mv. The stable identity is the frontmatter id:; tooling resolves an RQ by that id across all subtrees, never by directory name.

Each README.md starts with YAML frontmatter that acts as a selector query over the run pool:

---
id: RQ-prompt-correctness
question: "Does workflow choice affect code quality, correctness, TDD discipline?"
factors:                          # what varies
  workflow_x_prompt:
    - {workflow: v1-oneshot,         prompt: prose}
    - {workflow: v4-exact-subagents, prompt: example-mapping}
controls:                         # what is held constant
  kata_base: game-of-life
  model: opus-4-7-no-thinking
outcomes: [tests_passing, code_mass, smell_total, cc_longest_function, ...]
min_replicates: 3
status: active
---

The Cartesian product of factors × controls produces cells. Each cell is one (kata, workflow, model) combination that needs min_replicates runs.
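As a rough illustration of how cells could be derived from the frontmatter above, the following sketch expands factors × controls into one dict per cell. The function name expand_cells is hypothetical and not taken from the repo; the real logic lives in aggregate-by-query.py and batch-plan-from-rq.py and may differ.

```python
from itertools import product


def expand_cells(factors: dict, controls: dict) -> list[dict]:
    """Expand factors x controls into one dict per cell (illustrative only)."""
    axes = []
    for name, values in factors.items():
        # A paired factor (e.g. workflow_x_prompt) already holds dicts;
        # a plain factor holds scalars that are wrapped as {name: value}.
        axes.append([v if isinstance(v, dict) else {name: v} for v in values])
    cells = []
    for combo in product(*axes):
        cell = dict(controls)  # controls are identical in every cell
        for part in combo:
            cell.update(part)
        cells.append(cell)
    return cells


frontmatter = {
    "factors": {
        "workflow_x_prompt": [
            {"workflow": "v1-oneshot", "prompt": "prose"},
            {"workflow": "v4-exact-subagents", "prompt": "example-mapping"},
        ]
    },
    "controls": {"kata_base": "game-of-life", "model": "opus-4-7-no-thinking"},
}

cells = expand_cells(frontmatter["factors"], frontmatter["controls"])
for cell in cells:
    print(cell)
```

With the example frontmatter this yields two cells, one per workflow/prompt pair, each carrying the same kata_base and model.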

Terminology:

Factor — a variable that is deliberately varied across runs. The effect of a factor is what the RQ measures.
Control — a variable that is held constant so it cannot confound the factor effect. A finding under model: opus-4-7-no-thinking is a finding for that model; transfer to other models is an open question, not an established result.
Outcome — a metric that is observed per run.

Mixing values into a control (e.g. running additional opus-4-6 replicates into an RQ controlled on opus-4-7) collapses the control into an uncontrolled factor and invalidates the comparison. To study a different model variant, open a new RQ — either with model as a factor (model-comparison RQ) or with the new value pinned as the control (separate finding, scoped to that model).

Two tools, one frontmatter

aggregate-by-query.py reads the frontmatter, selects every matching run from experiments/runs/, and writes runs.csv (raw data) + summary.md (pivots per outcome) into the RQ directory. It can be re-run as soon as new replicates land; no plan editing is required.
batch-plan-from-rq.py reads the same frontmatter, counts existing matches per cell, and emits experiments/batch-plans/<rq>-fill.json containing exactly the missing (kata, workflow, model) triples needed to reach min_replicates. Idempotent: if everything is already covered, the plan is empty.

Together: declare the question once → fill the gaps → re-aggregate.
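The gap-filling step can be pictured as a simple count per cell. The sketch below is an assumption about the general idea, not the actual implementation of batch-plan-from-rq.py; the function name missing_runs and the tuple layout are hypothetical.

```python
from collections import Counter


def missing_runs(cells, existing_runs, min_replicates):
    """Return one (kata, workflow, model) triple per replicate still missing."""
    have = Counter(existing_runs)
    plan = []
    for cell in cells:
        shortfall = max(0, min_replicates - have[cell])
        plan.extend([cell] * shortfall)
    return plan


cells = [
    ("game-of-life-prose", "v1-oneshot", "opus-4-7-no-thinking"),
    ("game-of-life-example-mapping", "v4-exact-subagents", "opus-4-7-no-thinking"),
]
# Hypothetical pool: three runs for the first cell, one for the second.
pool = [cells[0], cells[0], cells[0], cells[1]]

plan = missing_runs(cells, pool, min_replicates=3)
print(plan)
```

Running the same function again after the planned runs have landed returns an empty list, which mirrors the idempotency described above.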

Frontmatter schema

---
id: RQ-<slug>                       # stable identity, e.g. RQ-prompt-correctness
question: "Full text of the research question"
factors:                          # what varies
  <factor-name>: [<value>, ...]
  # OR for paired factors:
  workflow_x_prompt:
    - {workflow: v1-oneshot, prompt: prose}
    - ...
controls:                         # what is held constant
  kata_base: game-of-life         # kata base name without prompt suffix
  workflow: v4-exact-subagents    # only if no workflow_x_prompt factor
  prompt: example-mapping         # only if no prompt factor / pairing
  model: <lab-variant-id>         # e.g. opus-4-7-no-thinking (see model alias table)
outcomes: [<metric>, ...]         # which metrics are measured
min_replicates: N                 # per cell
status: active | partial | closed
---

Selector resolution: The selector query forms the effective kata ID as <kata_base>-<prompt>. prompt comes from controls.prompt, the workflow_x_prompt pairing, or factors.prompt.
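A minimal sketch of this resolution step, assuming that a well-formed RQ supplies the prompt from exactly one of the three sources. The function name effective_kata_id and the lookup order (per-cell value first, then controls.prompt) are assumptions made for illustration.

```python
def effective_kata_id(controls: dict, cell_factors: dict) -> str:
    """Build <kata_base>-<prompt> for one cell (illustrative only)."""
    # cell_factors holds the values that vary for this cell, e.g. one entry of
    # workflow_x_prompt or one value of factors.prompt.
    prompt = cell_factors.get("prompt", controls.get("prompt"))
    if prompt is None:
        raise ValueError("no prompt in controls, pairing, or factors")
    return f"{controls['kata_base']}-{prompt}"


paired = effective_kata_id(
    {"kata_base": "game-of-life"},
    {"workflow": "v1-oneshot", "prompt": "prose"},
)
pinned = effective_kata_id(
    {"kata_base": "game-of-life", "prompt": "example-mapping"},
    {"model": "sonnet-4-6"},
)
print(paired, pinned)
```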

OR-match on controls.model: A control normally pins one exact value. controls.model is the one exception that accepts an explicit OR-match:

controls:
  model:
    any:                            # OR-match across providers/routings
      - opus-4-7-portkey-no-thinking
      - opus-4-7-no-thinking

All listed values count toward the same cell during aggregation, and the first entry is the canonical value used for new fill-runs (batch-plan-from-rq.py) and cell labelling in summary.md. The real per-run model stays in runs.csv under the model column for debugging; the cell-grouping value is the new cell_model column.

Intended use: combine routing variants of the same underlying model (e.g. Portkey-routed and Direct-API runs of opus-4-7-no-thinking) when routing is assumed not to affect the outcome under study. Not for combining different models — use factors.model instead, otherwise you collapse a real factor into a hidden uncontrolled variable.
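The OR-match can be sketched as follows. This is an illustration under the rules stated above (first entry is canonical, all listed values share one cell); the helper names resolve_model_control and cell_model_for are hypothetical and do not come from the repo.

```python
def resolve_model_control(model_control):
    """Return (accepted_values, canonical_value) for controls.model."""
    if isinstance(model_control, dict) and "any" in model_control:
        accepted = list(model_control["any"])
        return accepted, accepted[0]  # first entry is the canonical value
    return [model_control], model_control


def cell_model_for(run_model, model_control):
    """Value for the cell_model column, or None if the run does not match."""
    accepted, canonical = resolve_model_control(model_control)
    return canonical if run_model in accepted else None


control = {"any": ["opus-4-7-portkey-no-thinking", "opus-4-7-no-thinking"]}
print(cell_model_for("opus-4-7-no-thinking", control))
print(cell_model_for("sonnet-4-6", control))
```

Both routing variants map to the same cell_model, while the per-run model value stays untouched for debugging.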

Outcome conventions

outcomes in the frontmatter are CSV column names from runs.csv (see CSV_COLUMNS in experiments/aggregate-by-query.py). aggregate-by-query.py picks the pivot type automatically:

Value type / naming	Pivot form
Boolean	rate_% (share of `true`)
Numeric	mean / min / max / std over the cell
Suffix `<X>_correct_rate`	pooled rate from `<X>_correct` and `<X>_total`: Σ/Σ × 100

Pooled rate: Used for success rates with numerator/denominator per run, e.g. predictions_correct_rate → Σpredictions_correct / Σpredictions_total. Preferred over the mean of per-run rates because runs with small denominators would otherwise be over-weighted.
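A small worked example shows why the two aggregates differ. The numbers are made up for illustration, and the handling of an all-zero denominator (returning None) is an assumption, not documented behaviour of aggregate-by-query.py.

```python
# Hypothetical runs of one cell: one tiny run and one large run.
runs = [
    {"predictions_correct": 1, "predictions_total": 1},    # 100 % of 1 prediction
    {"predictions_correct": 10, "predictions_total": 20},  # 50 % of 20 predictions
]


def pooled_rate(runs, stem):
    """Sum numerators and denominators first, then divide."""
    correct = sum(r[f"{stem}_correct"] for r in runs)
    total = sum(r[f"{stem}_total"] for r in runs)
    return 100 * correct / total if total else None


def mean_of_rates(runs, stem):
    """Average the per-run percentages (shown only for comparison)."""
    rates = [
        100 * r[f"{stem}_correct"] / r[f"{stem}_total"]
        for r in runs
        if r[f"{stem}_total"]
    ]
    return sum(rates) / len(rates) if rates else None


pooled = pooled_rate(runs, "predictions")
naive = mean_of_rates(runs, "predictions")
print(round(pooled, 2), naive)
```

The pooled rate is 11/21, roughly 52.4 %, whereas the mean of per-run rates is 75 %: the single-prediction run pulls the naive average up far more than its evidence justifies.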

Methodology constraints

These rules apply lab-wide and are respected by every RQ.

Workflow → permitted prompt styles

For methodological symmetry:

Workflow	Permitted prompt styles	Rationale
v1-oneshot, v2-iterative	only prose	Concrete examples in the prompt nudge the agent toward starting with tests, which contaminates the non-TDD condition — the whole point of v1/v2 is to observe what happens when the agent is not steered into test-first.
v3-basic-tdd, v4-exact-subagents, v5-exact-single-context	prose, example-mapping, user-story	Examples serve as natural test cases — for TDD workflows this is the ideal task shape.

Consequences for RQ design:

Workflow as factor: Factor is named workflow_x_prompt and is a paired list of {workflow, prompt} tuples. Default pairing: v1/v2 → prose, v3/v4/v5 → example-mapping (the so-called "fair" comparison).
Workflow as control: controls.workflow and controls.prompt are set together, respecting the constraint.
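The constraint from the table above could be checked mechanically before an RQ is run. The sketch below is hypothetical; whether the lab tooling performs such a check is not stated here, and the name pairing_violations is invented for illustration.

```python
# Permitted prompt styles per workflow, transcribed from the table above.
TDD_STYLES = {"prose", "example-mapping", "user-story"}
PERMITTED_PROMPTS = {
    "v1-oneshot": {"prose"},
    "v2-iterative": {"prose"},
    "v3-basic-tdd": TDD_STYLES,
    "v4-exact-subagents": TDD_STYLES,
    "v5-exact-single-context": TDD_STYLES,
}


def pairing_violations(pairs):
    """Return the {workflow, prompt} pairs that break the constraint."""
    bad = []
    for pair in pairs:
        allowed = PERMITTED_PROMPTS.get(pair["workflow"])
        # Assumption: a workflow missing from the table is reported as a
        # violation rather than silently accepted.
        if allowed is None or pair["prompt"] not in allowed:
            bad.append(pair)
    return bad


fair = [
    {"workflow": "v1-oneshot", "prompt": "prose"},
    {"workflow": "v4-exact-subagents", "prompt": "example-mapping"},
]
contaminated = [{"workflow": "v2-iterative", "prompt": "example-mapping"}]
print(pairing_violations(fair), pairing_violations(contaminated))
```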

Code-quality signal limited to game-of-life and mars-rover

From the re-evaluation of an earlier 235-run study, three constraints are stable:

Classic katas live in training data (string-calculator, pixel-art-scaler, etc.). Models solve them too trivially — smell_total = 0 consistently.
Pixel-art-scaler is not usable as a novel-kata sanity check (no workflow or model differentiation).
Code-quality signal is only visible on game-of-life and mars-rover. Statements about smell_total, cc_longest_function, etc. must be based on these katas — cross-kata averages over trivial katas dilute the signal.

Consequence for RQs: All current RQs use kata_base: game-of-life as the default. mars-rover stays available for cross-kata validation once enough replicates exist. Generalizability claims about arbitrary katas are 🚫 not testable with the current design.

New novel kata: claim-office was added as a fresh, non-classic kata with enough complexity to differentiate workflows and models. It is not in training data and ships with an external acceptance suite (see CLI katas), so correctness is measured from the outside via verification_pct. Once enough replicates land, it becomes the second carrier of the code-quality signal alongside game-of-life.

Reduced variant claim-office-lite: a derived kata with the same quote rules but claim stripped of cap-tracking and multi-claim chains (10 scenarios instead of 15, all 6 ambiguities preserved). Discriminability test 2026-05-21 (7 workflows × 2 replicates × example-mapping × opus-4-7-portkey-no-thinking) showed it differentiates workflows on code-quality metrics (cognitive_max 3–21, mccabe_max 4–15) and wallclock (3 min to 88 min) — usable as a faster code-quality kata alongside game-of-life. Not usable for correctness research: with example-mapping verification_pct saturates at 9–10/10 for all workflows; with prose it collapses to 2–4/10 without workflow separation. For correctness use the full claim-office.

Goodhart's Law: metrics named in workflow prompts stop being independent outcomes

"When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law

The moment a workflow or agent prompt explicitly names a metric (e.g. "reduce cognitive complexity", "keep functions short"), runs of that workflow no longer produce independent observations of that metric — they produce compliance signals. A finding like "v6.6 improves cognitive_max over v6.5" then partly measures how diligently the agent follows its own instructions, not how good the resulting code is.

This is not an absolute disqualifier: naming code_mass ideas from the Absolute Priority Premise (APP) in refactor prompts is already standard practice in the v5/v6 line, and code_mass is still measured. But the contamination must be tracked explicitly:

Metrics named in the prompt of a workflow under test are compliance metrics for that workflow. Cross-workflow comparisons on those metrics are valid only if both workflows name them (or neither does).
Hidden metrics — never named in any prompt — stay independent. mutation_score is the strongest example: it measures test-suite quality from the outside and is hard to game without also producing better tests.
No numeric thresholds in workflow/agent prompts (e.g. cognitive_max < 15, LoC < 50). Katas and scenarios are too volatile for any single threshold to be meaningful — qualitative language only ("reduce", "extract when …"). Mirrored in CLAUDE.md under "Editing workflows".
When introducing a new metric into an agent's vocabulary, open a new workflow variant (e.g. v6.6) rather than mutating an existing one, so prior findings on that metric remain interpretable.
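The first rule in the list above reduces to a symmetric check: a metric is comparable across two workflows only if both prompts name it or neither does. The sketch below uses invented sets of named metrics; the function name and the example data are hypothetical.

```python
def independently_comparable(metric, named_in_a, named_in_b):
    """True if both workflows name the metric, or neither does."""
    return (metric in named_in_a) == (metric in named_in_b)


# Hypothetical lists of metrics named in two workflow prompts.
named_in_old = {"code_mass"}
named_in_new = {"code_mass", "cognitive_max"}

print(independently_comparable("code_mass", named_in_old, named_in_new))
print(independently_comparable("cognitive_max", named_in_old, named_in_new))
print(independently_comparable("mutation_score", named_in_old, named_in_new))
```

In this made-up case code_mass and mutation_score remain comparable, while cognitive_max is a compliance signal for only one side.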

When in doubt, list the named metrics in the workflow's README.md / header so future RQs know which outcomes are compliance signals for it.

Timeouts as a research finding

Each run has a hard wallclock budget (default 90 min, set via CLAUDE_TIMEOUT_SECONDS=5400 in run-batch.sh). When a (workflow, model, kata) cell systematically hits this limit, that's not a data error — it is itself the finding: the variant is impractical within the chosen cost frame.

Consequences for analysis and data collection:

Timeout runs are not deleted. Their metrics.json is preserved with run_status.exit_reason = "timeout". tests_passing, verification_pct, code_mass, etc. are null.
Exhausted retry budgets (exit_reason = "rate-limited" or "transient-api-error") are also folded into completed_within_budget = false — not because the wallclock ran out, but because a (workflow, model, kata) cell that repeatedly trips rate limits or transient API errors is practically unusable inside the lab's cost/availability envelope. Configurable via BATCH_RATELIMIT_RETRIES (default 5).
Timeout runs count toward min_replicates. batch-plan-from-rq.py treats a timeout as a legitimate data point, so no refill is generated for timeout cells.
completed_within_budget (Boolean, derived from exit_reason) is available as an outcome and reports the share of "finished within budget" per cell. Sensible as an outcome in any RQ whose factors vary workflow or model.
The n_ok column in the cell coverage table of summary.md counts only successful runs; a "3 timeouts, 0 OK" cell is flagged ⚠️ even when min_replicates is formally met.
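The derivation of completed_within_budget and the coverage flag might look roughly like this. Two assumptions are made and should be checked against aggregate-by-query.py: that "successful" for n_ok means "finished within budget", and that the warning fires whenever n_ok is below min_replicates. The function coverage_row is hypothetical.

```python
# exit_reason values that the text above folds into completed_within_budget = false.
OVER_BUDGET_REASONS = {"timeout", "rate-limited", "transient-api-error"}


def completed_within_budget(metrics: dict) -> bool:
    return metrics["run_status"]["exit_reason"] not in OVER_BUDGET_REASONS


def coverage_row(cell_runs, min_replicates):
    """Summarise one cell for a coverage table (illustrative only)."""
    n = len(cell_runs)
    n_ok = sum(completed_within_budget(m) for m in cell_runs)
    return {
        "n": n,
        "n_ok": n_ok,
        "completed_within_budget_pct": 100 * n_ok / n if n else None,
        # Assumption: warn whenever fewer than min_replicates runs finished.
        "warn": n_ok < min_replicates,
    }


timeouts = [{"run_status": {"exit_reason": "timeout"}} for _ in range(3)]
row = coverage_row(timeouts, min_replicates=3)
print(row)
```

The three-timeout cell has n = 3, so min_replicates is formally met, yet n_ok = 0 and the row is flagged.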

Findings status legend

✅ stabil (stable): data robustly supports the finding (n≥3, clear signal)
⚠️ bedingt (conditional): holds only under a qualifying condition (named in the finding)
❌ widerlegt (refuted): data clearly contradicts the finding
🚫 offen (open): data basis missing; status open

Key files and their meaning

File	Owner	Lifespan	Purpose
`research/<tree>/<chapter>-*/README.md`	human-curated	persistent	The question, its frontmatter selector, hypotheses, design rationale
`research/<tree>/<chapter>-*/findings.md`	human-curated	persistent, growing	Numbered findings with status flags. Survives data refreshes
`research/<tree>/<chapter>-*/runs.csv`	generated by `aggregate-by-query.py`	regenerated on demand	Flat table of all runs matching the selector
`research/<tree>/<chapter>-*/summary.md`	generated by `aggregate-by-query.py`	regenerated on demand	Pivot tables per outcome × cell
`experiments/runs/<id>/metrics.json`	produced per run	immutable artefact	The atomic data point — every aggregation traces back here
`experiments/batch-plans/<rq>-fill.json`	generated by `batch-plan-from-rq.py`	regenerated, throwaway	Missing-cell list for the next batch run
`research/_archive/`	human-curated snapshot	frozen	Past analyses preserved with mapping tables to current RQ findings

Repository Structure

.
├── .claude/
│   └── skills/                   # Claude Code skills for repo workflows
│       ├── run-rq/               #   /run-rq RQ-N — drive an RQ end-to-end
│       └── build-overview/       #   /build-overview — generate cross-RQ snapshot
├── experiments/
│   ├── katas/                            # Coding exercises (problem definitions)
│   ├── workflows/                        # Workflow variants (v1–v5 + pi variants)
│   ├── runs/                             # Recorded experiment results (flat pool)
│   ├── batch-plans/                      # JSON batch specs (auto-generated per RQ)
│   ├── docker/                           # Containerized batch execution
│   ├── record-run.sh                     # Run a single experiment interactively
│   ├── analyze-run.sh                    # Generate analysis-report.md + metrics.json for a run
│   ├── reanalyze-all-runs.sh             # Backfill metrics across every existing run
│   ├── aggregate-by-query.py             # RQ frontmatter → runs.csv + summary.md
│   ├── batch-plan-from-rq.py             # RQ frontmatter → batch plan filling missing cells
│   ├── analyze_transcript.py             # Parse CC/OC transcript.jsonl for TDD-cycle metrics
│   ├── parse_pi_transcript.py            # Parse pi transcript-pi.jsonl (text-marker-based cycle counting)
│   ├── parse_opencode_transcript.py      # Parse OpenCode session exports
│   └── generate-snapshot-skeleton.py     # Cross-RQ snapshot skeleton (used by /build-overview)
├── research/
│   ├── RQ-prompt-correctness-workflow-effect/     # Per-RQ:
│   │   ├── README.md             #   frontmatter selector + question + hypotheses
│   │   ├── findings.md           #   curated, growing list of numbered findings
│   │   ├── runs.csv              #   generated: raw data of all matching runs
│   │   └── summary.md            #   generated: pivot tables per outcome × cell
│   ├── RQ-prompt-known-kata-prompt-style/
│   ├── ...                       # one directory per active RQ
│   ├── kata-design/              # kata construction guidelines
│   └── _archive/                 # frozen snapshots of past analyses + experiment-overview snapshots
├── HUMAN-IN-THE-LOOP.md          # Optional HITL checkpoint guide
├── WORKTREE-WORKFLOW.md          # Persistent agent-worktree convention
└── todos_and_ideas.txt           # Future research directions

Workflow Variants

Variant	Approach	Description
v1-oneshot	No TDD	Direct implementation in one shot ("vibe coding" baseline)
v2-iterative	No TDD	Iterative prompting with plan/checklist
v3-basic-tdd	Minimal TDD	Just "use TDD" — no detailed rules
v4-exact-subagents	Structured TDD	Each TDD phase (red/green/refactor) runs in a separate, isolated agent
v5-exact-single-context	Structured TDD	All TDD phases run in one continuous context using inline skills

v1-oneshot (No TDD baseline)

Single agent reads requirements, writes code, then adds tests after the fact. Baseline that measures the value of TDD itself. Tests are added based on the Example Mapping for fair comparison.

v1-oneshot/.claude/
└── rules/
    └── experiment-mode.md     # Non-TDD approach + output format

v2-iterative (No TDD, iterative)

Single agent builds an explicit checklist, implements step by step, then adds tests after. Measures whether structured iteration alone (without TDD) improves over one-shot.

v2-iterative/.claude/
└── rules/
    └── experiment-mode.md     # Iterative approach + output format

v3-basic-tdd (TDD control)

Single agent with minimal TDD rules — no phase-specific guidance, no agent spawning. Claude decides how to structure its TDD process. Lowest TDD overhead, maximum flexibility.

v3-basic-tdd/.claude/
└── rules/
    └── experiment-mode.md     # Minimal TDD guidance + output format

v4-exact-subagents

Each TDD phase runs as a specialized subagent with isolated context, invoked via the Task tool with subagent_type parameter.

Main Agent
    ├── Task(test-list) → Creates test list      [isolated context]
    ├── Task(red)       → Activates test         [isolated context]
    ├── Task(green)     → Minimal implementation [isolated context]
    └── Task(refactor)  → Improves code          [isolated context]

Hypothesis: isolated contexts enforce discipline but may lose state between phases. Fresh context per phase avoids accumulated noise; comes with agent-spawning overhead.

v4-exact-subagents/.claude/
├── agents/                    # Subagent definitions
│   ├── test-list.md
│   ├── red.md
│   ├── green.md
│   └── refactor.md
└── rules/
    ├── tdd.md                 # Main TDD rules (uses Task tool)
    ├── tdd_with_ts_and_vitest.md
    └── tdd-experiment-mode.md # Autonomous mode for experiments

v5-exact-single-context

All TDD phases run in one continuous context using inline skills via the Skill tool.

Single Agent
    ├── Skill(/test-list) → Creates test list      [same context]
    ├── Skill(/red)       → Activates test         [same context]
    ├── Skill(/green)     → Minimal implementation [same context]
    └── Skill(/refactor)  → Improves code          [same context]

Hypothesis: shared context maintains state but may lead to less discipline. No agent-spawning overhead; risk of context pollution / over-implementation.

v5-exact-single-context/.claude/
├── commands/                  # Skill definitions (inline execution)
│   ├── test-list.md
│   ├── red.md
│   ├── green.md
│   └── refactor.md
└── rules/
    ├── tdd.md                 # Main TDD rules (uses Skill tool)
    ├── tdd_with_ts_and_vitest.md
    └── tdd-experiment-mode.md # Autonomous mode for experiments

Key differences

Aspect	v1-oneshot	v2-iterative	v3-basic-tdd	v4-exact-subagents	v5-exact-single-context
TDD	❌ No	❌ No	✅ Yes (minimal)	✅ Yes (strict)	✅ Yes (strict)
Mechanism	Direct code	Checklist	None	`Task(subagent_type: "red")`	`Skill(skill: "red")`
Context	Single	Single	Single	Isolated per phase	Shared across phases
Guidance	None	Plan/checklist	Minimal TDD	Specialized agents	Inline skills
Definitions	None	None	None	`agents/*.md`	`commands/*.md`
Overhead	None	None	None	Agent spawning	None

Multi-harness support

The same workflow design can be deployed on different coding-agent harnesses (Claude Code, OpenCode, pi). The TDD-cycle measurement pipeline adapts per harness:

Harness	Cycle-counting mechanism	Transcript parser
Claude Code	`Skill` tool calls with `skill: "red"`	`analyze_transcript.py`
OpenCode	Same Skill-based logic	`parse_opencode_transcript.py`
pi	Text markers (`## Red` headers) in assistant output	`parse_pi_transcript.py`

Why pi needs text markers: pi skills are auto-loaded documents, not tool calls. The model reads each SKILL.md once and follows its instructions directly ("freehand"), producing ## Red headings per cycle instead of Skill({skill: "red"}) tool invocations. The refactor subagent still uses the subagent tool (counted the same way across all harnesses). See experiments/workflows/MARKERS.md for the full marker specification.
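Marker-based counting can be approximated with a line-anchored regular expression. This sketch only covers the ## Red heading mentioned above and assumes it appears at the start of a line; the authoritative rules are in MARKERS.md and parse_pi_transcript.py, which may handle more cases.

```python
import re

# Matches "## Red" at the start of a line, but not "### Red ..." or "## Redo".
RED_MARKER = re.compile(r"^## Red\b", re.MULTILINE)


def count_red_markers(assistant_text: str) -> int:
    """Count TDD cycles by their '## Red' headings (illustrative only)."""
    return len(RED_MARKER.findall(assistant_text))


# Hypothetical assistant output containing two cycles and two near-misses.
sample = (
    "## Red\nwrite a failing test\n"
    "## Green\nmake it pass\n"
    "## Red\nnext failing test\n"
    "### Red herring\n"
    "## Redo\n"
)
red_count = count_red_markers(sample)
print(red_count)
```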

Workflow directories follow a naming convention: <variant-name> (Claude Code), <variant-name>-oc (OpenCode), <variant-name>-pi (pi). Each stores harness-specific configuration (.claude/ for CC, .opencode/ for OC, .pi/ for pi).
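The suffix convention can be turned into a small lookup. The sketch below assumes that any directory without an -oc or -pi suffix is a Claude Code workflow; the function name harness_for is hypothetical.

```python
def harness_for(workflow_dir: str) -> tuple[str, str]:
    """Map a workflow directory name to (harness, config directory)."""
    if workflow_dir.endswith("-oc"):
        return "OpenCode", ".opencode"
    if workflow_dir.endswith("-pi"):
        return "pi", ".pi"
    # Assumption: no suffix means the Claude Code variant.
    return "Claude Code", ".claude"


for name in ("v4-exact-subagents", "v4-exact-subagents-oc", "v5-exact-single-context-pi"):
    print(name, harness_for(name))
```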

Model Configurations

The runner pins the full Claude API model ID per config. The short aliases opus / sonnet are intentionally avoided because they currently resolve to legacy versions (e.g. opus → claude-opus-4-6, not Opus 4.7). Bump these entries when new model versions ship.

In RQ frontmatter, lab-variant IDs are pinned — not the Claude API IDs (claude-opus-4-7), not the short aliases (opus). A lab-variant ID uniquely combines model and thinking mode:

Lab-variant ID	API model ID	Thinking	Mechanism
`opus-4-7`	`claude-opus-4-7`	Adaptive	Default behavior
`opus-4-7-no-thinking`	`claude-opus-4-7`	Off	`MAX_THINKING_TOKENS=0`
`sonnet-4-6`	`claude-sonnet-4-6`	Extended	Default behavior
`sonnet-4-6-no-thinking`	`claude-sonnet-4-6`	Off	`MAX_THINKING_TOKENS=0`
`haiku-4-5`	`claude-haiku-4-5-20251001`	Extended	Default behavior
`haiku-4-5-no-thinking`	`claude-haiku-4-5-20251001`	Off	`MAX_THINKING_TOKENS=0`

The ID exactly matches the model field in metrics.json and the suffix in the run directory name. Source: MODEL_CONFIGS in experiments/record-run.sh and experiments/docker/run-batch.sh.
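For illustration, the table above can be mirrored as a Python lookup. The source of truth remains MODEL_CONFIGS in the shell scripts; the names MODEL_VARIANTS and launch_settings below are hypothetical, and representing the default thinking behaviour as an empty environment is an assumption.

```python
NO_THINKING_ENV = {"MAX_THINKING_TOKENS": "0"}

# lab-variant ID -> (API model ID, extra environment variables)
MODEL_VARIANTS = {
    "opus-4-7": ("claude-opus-4-7", {}),
    "opus-4-7-no-thinking": ("claude-opus-4-7", NO_THINKING_ENV),
    "sonnet-4-6": ("claude-sonnet-4-6", {}),
    "sonnet-4-6-no-thinking": ("claude-sonnet-4-6", NO_THINKING_ENV),
    "haiku-4-5": ("claude-haiku-4-5-20251001", {}),
    "haiku-4-5-no-thinking": ("claude-haiku-4-5-20251001", NO_THINKING_ENV),
}


def launch_settings(variant: str) -> tuple[str, dict]:
    """Resolve a lab-variant ID; short aliases such as 'opus' are rejected."""
    if variant not in MODEL_VARIANTS:
        raise KeyError(f"unknown lab-variant ID: {variant!r}")
    api_id, env = MODEL_VARIANTS[variant]
    return api_id, dict(env)  # copy so callers cannot mutate the table


print(launch_settings("haiku-4-5-no-thinking"))
```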

Katas

Katas live under experiments/katas/<basename>-<prompt-style>/prompt.md. The directory name typically ends with one of -prose, -user-story, or -example-mapping (the prompt style); the part before is the basename.

Currently maintained kata families, grouped by training-data exposure:

Known katas (classic exercises that the model has likely seen during training, but complex enough to still produce code-quality signal):

game-of-life — primary code-quality kata, all three prompt styles
mars-rover — secondary code-quality kata, all three prompt styles (mostly prose runs so far)

Novel katas (custom-built for this lab, not in training data; ship with external acceptance suite, see CLI katas):

claim-office — full version with quote + claim (incl. cap-tracking and multi-claim chains), 15 verification scenarios. Primary correctness kata via verification_pct.
claim-office-lite — reduced derivative: quote rules unchanged, claim without cap/multi-claim, 10 scenarios. All 6 ambiguities preserved. Use for code-quality research (~3× faster than full claim-office); not for correctness (saturates on example-mapping, collapses on prose). See Methodology constraints for details.

Older classic katas (string-calculator, pixel-art-scaler, diamond) were dropped because training-data contamination collapses the code-quality signal — see Methodology constraints.

For deeper guidance on building good katas (ambiguity construction, ruling strategies, test-suite distribution, anti-patterns), see research/kata-design/kata-construction.md.

CLI katas with external acceptance suite

For katas where correctness should be measured against a fixed acceptance suite that the implementer does not see (e.g. claim-office), add a sibling directory katas/<basename>-verification/ with:

runner.json — {"command": "...", "timeout_seconds": 30}. For TypeScript CLIs: pnpm exec tsx src/cli.ts (assumes tsx is in the run's devDependencies; the standard pnpm template includes it).
scenarios/NN-name.input.json — JSON document piped as stdin.
scenarios/NN-name.expected.json — expected JSON on stdout (compared canonically via jq -S .).
Optional scenarios/NN-name.story.md for narrative scenarios used in workshop reuse.

Conventions:

The implementation's CLI entry point must be at src/cli.ts.
The prompt must specify "CLI executable that reads JSON from stdin and writes JSON to stdout".
The kata's prompt does not include the verification scenarios — they are private acceptance tests measured by analyze-run.sh after the run completes.

After each run, analyze-run.sh automatically detects the <basename>-verification/ directory, pipes each scenario input into the CLI (in the run directory), and compares canonical JSON output against the expected JSON. Per-scenario results land in verification.log; counts go into metrics.json as final_metrics.verification_total, verification_passed, and verification_pct (a fraction 0.0–1.0). For non-CLI katas without a verification directory, these fields are 0/null.
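The canonical comparison can be approximated in Python by parsing both documents and re-serialising them with sorted keys, which is similar in spirit to jq -S. This is only a sketch: analyze-run.sh uses jq, number formatting may differ between the two tools, and treating non-JSON output as a failed scenario and an empty suite as pct = None are assumptions.

```python
import json


def canonical(text: str) -> str:
    """Re-serialise JSON with sorted object keys (array order is preserved)."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2)


def scenario_passes(actual_stdout: str, expected_text: str) -> bool:
    try:
        return canonical(actual_stdout) == canonical(expected_text)
    except json.JSONDecodeError:
        return False  # assumption: unparsable output counts as a failure


def verification_metrics(results: list[bool]) -> dict:
    total = len(results)
    passed = sum(results)
    return {
        "verification_total": total,
        "verification_passed": passed,
        "verification_pct": passed / total if total else None,  # fraction 0.0-1.0
    }


results = [
    scenario_passes('{"b": 1, "a": [1, 2]}', '{"a":[1,2],"b":1}'),  # key order ignored
    scenario_passes('{"a": [2, 1]}', '{"a": [1, 2]}'),              # array order matters
    scenario_passes("not json", "{}"),
    scenario_passes("{}", "{}"),
]
print(verification_metrics(results))
```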

Quick Start

Prerequisites

Claude Code CLI installed
Anthropic API key (or compatible provider)

Run a single experiment locally

cd experiments
./record-run.sh

This interactively prompts you to select a kata, workflow, and model, then launches Claude Code to execute the task. Results are saved to experiments/runs/<timestamp>_<kata>_<workflow>_<model>/.

Run a batch via Docker

cd experiments/docker
cp .env.example .env          # add your API key
docker compose build

Then, from the repo root, either run an existing batch plan or generate one for a research question:

# Generate a fill plan for an RQ (only missing replicates)
./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/

# Execute the batch (plan file: <rq-id>-fill.json, as emitted by the previous step)
./experiments/docker/batch.sh experiments/batch-plans/<rq-id>-fill.json

Aggregate results for a research question

./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/

This reads the RQ frontmatter, selects all matching runs from experiments/runs/, and writes runs.csv + summary.md into the RQ directory. Findings are then curated by hand into findings.md with status flags (✅ stabil, ⚠️ bedingt, ❌ widerlegt, 🚫 offen).

End-to-end loop

# 1. Plan: generate the missing cells for an RQ
./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/

# 2. Execute: run the batch in Docker (plan file: <rq-id>-fill.json from step 1)
./experiments/docker/batch.sh experiments/batch-plans/<rq-id>-fill.json

# 2b. Optional: compute Stryker mutation_score for green runs
#     (idempotent; only does anything if outcomes: [..., mutation_score])
./experiments/compute-mutation-score.py research/questions-claude/2.1-model-effect-code-quality/

# 3. Aggregate: re-derive runs.csv + summary.md
./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/

# 4. Curate: update findings.md with new evidence and status flags

Skills

Two Claude Code skills automate the repetitive parts of the loop above. They live in .claude/skills/ and are invoked as slash commands inside Claude Code.

Skill	Invocation	Purpose
`run-rq`	`/run-rq RQ-N`	Drives a single RQ end-to-end: validates the README, generates a fill batch-plan, starts the Docker batch in the background, monitors progress, runs aggregation, proposes findings updates. Pure orchestration — every step calls existing repo scripts.
`build-overview`	`/build-overview`	Generates a cross-RQ snapshot under `research/_archive/experiment-overview-YYYY-MM-DD.md`. Step 1 runs `experiments/generate-snapshot-skeleton.py` (data sections, finding lists, caveats). Step 2 fills synthesis sections (RQ paragraphs, cross-RQ synthesis, limitations) from the current `findings.md`. Reproducible because numbers come from the skeleton, not from model memory.

The skills replace the manual end-to-end loop in routine usage:

   manual                   ⇒    skill
   ------------------------       ----------------
   batch-plan-from-rq.py
   batch.sh                  ⇒    /run-rq RQ-N
   watch-batch.sh
   aggregate-by-query.py

   read all findings.md
   write overview.md         ⇒    /build-overview

Snapshot lifecycle

findings.md is a living document: findings grow, status tags get updated, and individual findings can be revised or discarded. For publishable point-in-time reports (table-heavy, cross-RQ synthesis) there are snapshots under research/_archive/experiment-overview-YYYY-MM-DD.md.

experiments/generate-snapshot-skeleton.py builds a skeleton with all data sections (data-base figures, coverage, raw finding lists per RQ, caveats section over all ⚠️/❌/🚫 findings).
The /build-overview skill fills the synthesis sections (RQ paragraphs, cross-RQ synthesis, conclusions) from findings.md and writes to research/_archive/.

findings.md stays the single source of truth, and the snapshot is reproducible rather than written from model memory.

Script Reference

All scripts are designed to be run from the repo root unless noted otherwise. .py scripts use Python 3 with PyYAML + pandas; .sh scripts are bash.

Run lifecycle (top-level `experiments/`)

Script	Purpose
`experiments/record-run.sh`	Interactive: pick kata + workflow + model, launch Claude Code locally, record everything into `experiments/runs/<id>/`. Used for ad-hoc single runs outside Docker.
`experiments/analyze-run.sh`	Post-process one run directory: install pnpm deps, run vitest + ESLint+SonarJS, call `analyze_transcript.py`, emit `analysis-report.md` and `metrics.json`. Idempotent — safe to re-run after pipeline fixes.
`experiments/reanalyze-all-runs.sh`	Backfill metrics for every existing run after `analyze-run.sh` is extended (e.g. new fields like `mccabe_` / `cognitive_`). Iterates over every `runs/<run>/`, runs `pnpm install` against the shared `runs/.pnpm-store` for any run missing `node_modules/`, and re-invokes `analyze-run.sh`. Output to `experiments/reanalyze.log` (gitignored).
`experiments/analyze_transcript.py`	Parse `transcript.jsonl` (+ `transcript-subagents/`) for TDD-cycle metrics: phase inference, prediction accuracy, refactorings applied, token totals, context-window utilization. Writes `transcript-metrics.json`. Used for Claude Code runs.
`experiments/parse_pi_transcript.py`	Parse `transcript-pi.jsonl` for the same TDD-cycle metrics. Used for pi runs, where skills are auto-loaded documents (not tool calls) and cycle counting relies on text markers (`## Red` headers) in assistant output instead of `Skill` tool invocations. Writes `transcript-metrics.json`.
`experiments/parse_opencode_transcript.py`	Parse OpenCode session exports into `transcript-metrics.json`. Used for OpenCode runs.

Aggregation

Script	Purpose
`experiments/aggregate-by-query.py`	Reads an RQ frontmatter, selects matching runs from the pool, writes `runs.csv` + `summary.md` into the RQ directory. The canonical aggregator.
`experiments/batch-plan-from-rq.py`	Reads an RQ frontmatter, computes missing cells against `min_replicates`, writes `experiments/batch-plans/<rq-id>-fill.json`. Idempotent — empty plan if everything is covered.
`experiments/compute-mutation-score.py`	RQ-driven mutation testing. If the RQ lists `mutation_score` in `outcomes`, runs Stryker against every matching green run (`tests_passing = true`) and writes `final_metrics.mutation_score` back into `metrics.json`. Idempotent (skip when already set) and bounded by `--timeout-seconds`. No-op when the RQ does not request the outcome. Run between batch execution and aggregation.
`experiments/generate-snapshot-skeleton.py`	Reads all `README.md` + `findings.md` under `research/questions-{claude,opencode,cross}/` and `research/workflow-dev/`, emits a Markdown skeleton to `/tmp/snapshot-skeleton-YYYY-MM-DD.md` with data sections (run counts, coverage per RQ, finding lists sorted by status, cross-RQ caveats) pre-filled and synthesis sections marked with `<!-- TODO Claude -->`. Consumed by the `/build-overview` skill.
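
As a rough illustration of what "computes missing cells against `min_replicates`" can mean, the following Python sketch counts existing runs per cell and lists the replicates still needed. It is not the code of `batch-plan-from-rq.py`: the function name `missing_runs`, the argument shapes, and the idea that a cell is a full `(kata, workflow, model)` triple are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def missing_runs(katas, workflows, models, existing_runs, min_replicates):
    """Return the (kata, workflow, model) triples still needed per cell.

    existing_runs is an iterable of triples already present in the run pool.
    An under-filled cell appears once per missing replicate.
    (Hypothetical helper, not part of the repository.)
    """
    have = Counter(existing_runs)
    plan = []
    for cell in product(katas, workflows, models):
        deficit = min_replicates - have[cell]
        plan.extend([cell] * max(deficit, 0))
    return plan


existing = [
    ("game-of-life-prose", "v3-basic-tdd", "sonnet-4-6"),
    ("game-of-life-prose", "v3-basic-tdd", "sonnet-4-6"),
    ("game-of-life-prose", "v3-basic-tdd", "haiku-4-5"),
]
fill = missing_runs(
    katas=["game-of-life-prose"],
    workflows=["v3-basic-tdd"],
    models=["sonnet-4-6", "haiku-4-5"],
    existing_runs=existing,
    min_replicates=2,
)
print(fill)  # one more haiku-4-5 replicate is needed; sonnet-4-6 is covered
```

When every cell already has enough replicates the list is empty, which corresponds to the "empty plan if everything is covered" behavior described in the table.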

Docker batch execution (`experiments/docker/`)

Script	Purpose
`docker/batch.sh`	Convenience wrapper around `docker compose --profile batch run --rm batch`. Accepts a plan name or path: `./batch.sh rq-3-fill` or `./batch.sh /abs/path/plan.json`. Tees output to `batch.<plan>.log`. Supports `--shards N` for parallel runs (round-robin split, default 2, do not exceed 3 due to memory + API rate-limit pressure) and `--detach` for background execution.
`docker/run-batch.sh`	The inside-container entrypoint invoked by `batch.sh`. Reads the plan, executes each `(kata, workflow, model)` triple via Claude Code, calls `analyze-run.sh`, copies transcripts. Not normally invoked directly.
`docker/list-plans.sh`	Print every `experiments/batch-plans/*.json` with name, description, and run count. Useful before kicking off a batch.
`docker/resume-plan.sh`	Compute remaining work for a plan: subtract triples already present in `experiments/runs/` from the original plan, write `/tmp/<plan>-resume.json`. Useful after a crashed/cancelled batch.
`docker/watch-batch.sh`	Status snapshot of running or recently finished batches. Auto-detects all `docker-batch-run-*` containers; falls back to the last entry in `batch.log` if none running. Reports per-cell progress, ETAs, and tail of `run.log`. Pass a plan name to scope to one batch.
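
The round-robin split behind `--shards N` can be pictured with a few lines of Python. This is only a sketch of the general technique; how `batch.sh` implements the split internally is not shown in this README, and `split_round_robin` is a hypothetical name.

```python
def split_round_robin(runs, shards):
    """Distribute runs over shards in round-robin order.

    Run i goes to shard i % shards, so shard sizes differ by at most one.
    (Hypothetical helper illustrating the idea, not the batch.sh code.)
    """
    if shards < 1:
        raise ValueError("shards must be >= 1")
    return [runs[i::shards] for i in range(shards)]


print(split_round_robin(["a", "b", "c", "d", "e"], 2))
# → [['a', 'c', 'e'], ['b', 'd']]
```

Round-robin keeps the shards balanced even when the plan is sorted by kata or model, which a simple contiguous split would not.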

Typical sequences

Fill a research question:

./experiments/batch-plan-from-rq.py research/questions-claude/2.1-model-effect-code-quality/
./experiments/docker/list-plans.sh                          # verify
./experiments/docker/batch.sh rq-3-fill                     # run
./experiments/docker/watch-batch.sh                         # monitor in another shell
./experiments/aggregate-by-query.py research/questions-claude/2.1-model-effect-code-quality/

Recover from an interrupted batch:

./experiments/docker/resume-plan.sh rq-3-fill               # writes /tmp/rq-3-fill-resume.json
./experiments/docker/batch.sh /tmp/rq-3-fill-resume.json
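
Conceptually, the resume step subtracts completed triples from the original plan. A minimal Python sketch of that subtraction follows; it is not the code of `resume-plan.sh`. The function name `remaining_runs` is hypothetical, and treating repeated triples as separate replicates (so one finished run cancels only one plan entry) is an assumption.

```python
from collections import Counter


def remaining_runs(plan_runs, completed):
    """Return plan entries that have no matching completed run yet.

    plan_runs: list of {"kata", "workflow", "model"} dicts from a plan file.
    completed: iterable of (kata, workflow, model) triples found in the run pool.
    Each completed triple cancels at most one plan entry (assumption).
    """
    done = Counter(completed)
    remaining = []
    for run in plan_runs:
        key = (run["kata"], run["workflow"], run["model"])
        if done[key] > 0:
            done[key] -= 1
        else:
            remaining.append(run)
    return remaining


plan = [
    {"kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonnet-4-6"},
    {"kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonnet-4-6"},
    {"kata": "mars-rover-prose", "workflow": "v3-basic-tdd", "model": "haiku-4-5"},
]
finished = [("game-of-life-prose", "v3-basic-tdd", "sonnet-4-6")]
print(remaining_runs(plan, finished))
```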

Re-derive metrics for a single run after a pipeline fix:

./experiments/analyze-run.sh experiments/runs/2026-05-04_*_v4-exact-subagents_opus-4-7

Docker Setup

Run experiments in isolated, reproducible Docker containers.

Setup

cd experiments/docker

# Copy and edit environment file
cp .env.example .env

# Set your user/group IDs (fixes volume permissions)
echo "USER_ID=$(id -u)" >> .env
echo "GROUP_ID=$(id -g)" >> .env

# Edit .env with your API credentials (direct or Portkey)

# Build
docker compose build

Running

# List all available plans with description and run count
./list-plans.sh

# Run a plan (extension and full path optional)
./batch.sh smoke-test
./batch.sh /abs/path/to/plan.json

Plan files live in experiments/batch-plans/*.json and list explicit {kata, workflow, model} triples. They are validated fail-fast against available katas, workflows, and the hard-coded MODEL_CONFIGS list — typos abort before any Claude call.

Plan file schema:

{
  "name": "Optional plan name",
  "description": "Optional description",
  "runs": [
    { "kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonnet-4-6" }
  ]
}

model is the lab-variant ID from MODEL_CONFIGS in run-batch.sh (e.g. sonnet-4-6, opus-4-7-no-thinking, haiku-4-5), not the full API ID.
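
To make the fail-fast idea concrete, here is a small Python sketch that checks every run in a plan before anything is executed. The actual validation lives in the batch scripts and may differ; `validate_plan` and the three allow-lists are hypothetical and only contain values mentioned in this README.

```python
# Illustrative allow-lists (assumption); the real ones come from the repo.
KNOWN_KATAS = {"game-of-life-prose"}
KNOWN_WORKFLOWS = {"v3-basic-tdd", "v4-exact-subagents"}
KNOWN_MODELS = {"sonnet-4-6", "opus-4-7-no-thinking", "haiku-4-5"}


def validate_plan(plan, katas, workflows, models):
    """Raise ValueError listing every unknown kata, workflow, or model."""
    errors = []
    for i, run in enumerate(plan.get("runs", [])):
        for field, allowed in (("kata", katas), ("workflow", workflows), ("model", models)):
            value = run.get(field)
            if value not in allowed:
                errors.append(f"runs[{i}].{field}: unknown value {value!r}")
    if errors:
        raise ValueError("invalid plan:\n" + "\n".join(errors))


plan = {
    "name": "typo example",
    "runs": [
        {"kata": "game-of-life-prose", "workflow": "v3-basic-tdd", "model": "sonet-4-6"}
    ],
}
try:
    validate_plan(plan, KNOWN_KATAS, KNOWN_WORKFLOWS, KNOWN_MODELS)
except ValueError as err:
    print(err)  # reports runs[0].model before any run would start
```

Collecting all errors before raising means a plan with several typos can be fixed in one pass.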

Per-run hardening

Each run is wrapped with a 90-minute timeout (override via CLAUDE_TIMEOUT_SECONDS=...). Stdout and stderr are captured to runs/<run>/run.log. Transient API errors (rate limits, HTTP 429, overloaded) are retried with backoff up to 5 times; if they persist, the run is recorded with exit_reason = "rate-limited" or "transient-api-error" and the batch continues. Each metrics.json gets a run_status block with exit_code, exit_reason, and rate_limited.
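
The retry behavior can be sketched in Python as follows. This is an illustration under assumptions, not the batch runner's code: the exponential schedule, the 1-second base delay, the `TransientApiError` class, and the reading of "up to 5 times" as five retries after the first attempt are all choices made for this example.

```python
import time


class TransientApiError(Exception):
    """Hypothetical stand-in for rate-limit, 429, or overloaded responses."""


def run_with_retries(task, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run task, retrying transient errors with exponential backoff.

    Returns a small status dict instead of raising, so a batch can continue
    after a run that never recovers.
    """
    for attempt in range(max_retries + 1):
        try:
            task()
            return {"exit_reason": "ok", "retries": attempt}
        except TransientApiError:
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)
    return {"exit_reason": "transient-api-error", "retries": max_retries}


# Usage: a task that fails twice, then succeeds; delays are recorded, not slept.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientApiError("overloaded")


delays = []
result = run_with_retries(flaky, sleep=delays.append)
print(result, delays)
# → {'exit_reason': 'ok', 'retries': 2} [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the sketch testable without waiting for real delays.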

Container details

Base image: node:22-slim
Tools: Node.js 22, pnpm, TypeScript, Vitest, Claude Code CLI, jq, git
Resource limits: 2 CPUs, 4 GB memory (override in docker-compose.yml if needed)
Security: non-root user (experimenter), read-only mounts for katas, workflows, and the API key file

Volume mounts

Host path	Container path	Mode
`../katas`	`/home/experimenter/experiments/katas`	ro
`../workflows`	`/home/experimenter/experiments/workflows`	ro
`../runs`	`/home/experimenter/experiments/runs`	rw
`~/.anthropic/api_key`	`/home/experimenter/.anthropic/api_key`	ro
`~/.claude`	`/home/experimenter/.claude`	rw

Environment variables

Direct Anthropic API

Variable	Description	Default
`ANTHROPIC_API_KEY`	API key for Claude	(required)
`ANTHROPIC_API_KEY_FILE`	Path to file with API key	`~/.anthropic/api_key`
`CLAUDE_CONFIG_DIR`	Claude config directory	`~/.claude`
`CLAUDE_TIMEOUT_SECONDS`	Per-run wallclock budget	`5400` (90 min)

Portkey gateway (or other proxy)

Variable	Description	Example
`ANTHROPIC_BASE_URL`	Proxy URL	`https://api.portkey.ai`
`ANTHROPIC_AUTH_TOKEN`	Auth token (can be dummy for Portkey)	`dummy`
`ANTHROPIC_CUSTOM_HEADERS`	Custom headers for proxy	`x-portkey-api-key: your-key`

Container user (volume permissions)

Variable	Description	Default
`USER_ID`	Container user UID (run `id -u`)	`1000`
`GROUP_ID`	Container user GID (run `id -g`)	`1000`

Troubleshooting

API key issues

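Check that the key file is mounted and readable inside the container (this prints the key to your terminal):

docker compose run --rm experiment cat /home/experimenter/.anthropic/api_key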

Permission errors

The container user must match your host user. Ensure USER_ID and GROUP_ID in .env match your system:

id -u  # USER_ID
id -g  # GROUP_ID
docker compose build --no-cache
# Or fix existing runs directory:
sudo chown -R $(id -u):$(id -g) ../runs

Metrics

Each run is evaluated on metrics extracted from two sources: direct code analysis of generated files, and the AI-generated experiment-summary.md. All metrics live in metrics.json per run; the analysis pipeline (ESLint with sonarjs/cognitive-complexity, max-depth, etc.) runs inside the Docker batch container.

The Term (binding) column gives the canonical name to use in findings.md, summary.md, and snapshots. These terms are binding — alternatives like "Code-Volumen", "Code-Gesamtvolumen", or "LoC-Größe" for code_mass are forbidden because they are ambiguous or collide with established definitions from the software-craftsmanship literature. When in doubt, cite the metric ID in backticks.

Run outcomes

Metric	Term (binding)	Source	Description
`duration_seconds`	—	metrics.json	Wall-clock time for the complete task
`tests_passing`	Korrektheit (innen)	test runner	Whether the implementer's own Vitest tests pass — the "inside view" of correctness
`verification_pct`	Korrektheit (außen)	external acceptance suite	For CLI katas with a sibling `<basename>-verification/` directory: fraction of acceptance scenarios passed (0.0–1.0). The "outside view" of correctness — measured against scenarios the implementer did not see during the run. `null` for katas without a verification suite.

Code-quality metrics

Metric	Term (binding)	Source	Description
`code_mass`	Code-Mass (APP)	`analyze-run.sh`	Weighted sum of code constructs (constants, invocations, conditionals, loops, assignments — heavier weights for higher-complexity constructs) following the Absolute Priority Premise by Micah Martin. Aims to compare implementations objectively beyond raw LoC. Lower = simpler. See Code Cop blog; original talk: Micah Martin — Absolute Priority Premise (8LU, Vimeo).
`mutation_score`	Mutation-Score	Stryker + `@stryker-mutator/vitest-runner`	Fraction of mutants killed by the implementer's own Vitest tests (0.0–1.0, formula `(Killed + Timeout) / (Killed + Survived + Timeout + NoCoverage)`). Higher = the test suite genuinely exercises behavior, not just coverage. Computed only when an RQ lists `mutation_score` in `outcomes` and `tests_passing = true`; otherwise `null`. Driven by `experiments/compute-mutation-score.py` (separate from `analyze-run.sh`).
`cc_loc`	Produktiv-LoC	`analyze-run.sh`	Production LoC only, from the clean-code reporter (no tests)
`test_lines`	Test-LoC	`analyze-run.sh`	Vitest test code
`smell_total`	Smell-Summe	ESLint + `eslint-plugin-sonarjs`	Aggregated code-smell count. Sub-counters `smell_complexity`, `smell_duplication`, `smell_magic_numbers`, `smell_code_quality` group SonarJS rules (e.g. `no-duplicate-string`, `no-collapsible-if`) plus a few ESLint built-ins (`max-depth`, `max-lines-per-function`, `max-params`, `no-magic-numbers`, `no-unreachable`).
`cc_longest_function`	Spitzen-Komplexität	`analyze-run.sh`	Longest function in lines (complexity peak per run)
`cc_avg_loc_per_function`	—	`analyze-run.sh`	Mean function length in lines
`cc_median_loc_per_function`	—	`analyze-run.sh`	Median function length in lines (robust against single long outliers)
`mccabe_max`, `mccabe_avg`, `mccabe_high_count`	—	ESLint `complexity` rule	McCabe cyclomatic complexity per function — max, mean, and count of functions above the threshold. Concept: Cyclomatic complexity (Wikipedia); originally McCabe 1976.
`cognitive_max`, `cognitive_avg`, `cognitive_high_count`	—	ESLint `sonarjs/cognitive-complexity`	Cognitive complexity per function (SonarSource metric, weights nesting and control-flow breaks heavier than McCabe) — max, mean, and count above threshold. Original definition: G. Ann Campbell — Cognitive Complexity whitepaper (PDF).
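
The `mutation_score` formula from the table can be written out as a short worked example. The function below is a sketch, not the code of `compute-mutation-score.py`; returning `None` when there are no mutants at all is an assumption made for this example.

```python
def mutation_score(killed, survived, timeout, no_coverage):
    """(Killed + Timeout) / (Killed + Survived + Timeout + NoCoverage)."""
    total = killed + survived + timeout + no_coverage
    if total == 0:
        return None  # assumption: no mutants means no score
    return (killed + timeout) / total


# 42 killed, 10 survived, 3 timed out, 5 not covered: (42 + 3) / 60
print(mutation_score(killed=42, survived=10, timeout=3, no_coverage=5))
# → 0.75
```

Timeouts count as detected mutants, while mutants in uncovered code count against the score.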

Token & context metrics

Metric	Description
`tokens_total`	Total tokens consumed (input + output + cache)
`context_utilization`	Final context-window utilization percentage

v4-exact-subagents keeps main-context utilization low because each subagent starts with a fresh context. v5-exact-single-context accumulates tokens in a single context, so its utilization is higher.

TDD discipline metrics

Extracted from the harness-specific transcript by the corresponding parser:

Harness	Transcript file	Parser	Cycle-counting mechanism
Claude Code	`transcript.jsonl`	`analyze_transcript.py`	`Skill` tool calls (`skill: "red"`)
OpenCode	`transcript.jsonl`	`parse_opencode_transcript.py`	Same Skill-based logic
pi	`transcript-pi.jsonl`	`parse_pi_transcript.py`	Text markers (`## Red` headers) in assistant output

The phase markers that drive these counts are documented in experiments/workflows/MARKERS.md — removing or renaming a marker silently zeroes the corresponding metric.

Why pi uses text markers instead of tool calls: pi skills are auto-loaded documents, not tool calls. The model reads each SKILL.md once and then follows its instructions freehand, so there is no Skill({skill: "red"}) tool invocation per cycle. Instead, the parser counts ## Red headings in the assistant output (one per cycle) and Red Phase Complete: blocks with prediction lines. The derive_cycle_count() function in analyze_transcript.py uses text markers as a tertiary fallback (after Skill and Task tool calls), so both parsers follow the same priority chain.
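
Counting cycles from text markers can be sketched with a regular expression. This is not the code of `parse_pi_transcript.py`; the exact matching rule (a line consisting only of `## Red`, optionally followed by spaces) is an assumption, and the authoritative marker list is MARKERS.md.

```python
import re

# Matches a line that is exactly "## Red" (trailing spaces or tabs allowed).
RED_HEADING = re.compile(r"^## Red[ \t]*$", re.MULTILINE)


def count_red_cycles(assistant_text):
    """Count `## Red` headings, one per TDD cycle (hypothetical helper)."""
    return len(RED_HEADING.findall(assistant_text))


sample = """## Red
Write a failing test for an empty grid.

## Green
Make it pass.

## Red
Write a failing test for a single live cell.
"""
print(count_red_cycles(sample))  # → 2
```

Because the count depends entirely on the heading text, a renamed heading such as `## Red phase` would not match this pattern, which illustrates why marker changes can silently zero a metric.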

Metric	Description
`tdd_cycles`	Number of red-green-refactor cycles (TDD workflows only); should match test count for proper discipline
`prediction_accuracy`	Correctness of red-phase failure predictions (v4/v5). Higher accuracy shows deeper understanding of code state
`refactorings`	Number of refactoring improvements applied. More refactorings indicate better discipline and cleaner final code
`tests_immediately_passing`	Tests passing immediately in red phase, indicating over-implementation. Lower is better

Run status

Field	Description
`run_status.exit_code`	Process exit code
`run_status.exit_reason`	`ok`, `timeout`, `rate-limited`, `transient-api-error`, etc.
`completed_within_budget`	Boolean derived from `exit_reason`; available as an outcome in any RQ

Adding New Experiments

New kata

Create experiments/katas/<kata-name>/prompt.md with:

Feature description
Example Mapping (rules + examples)
Expected file paths
Constraints

The directory name typically ends with one of -prose, -user-story, or -example-mapping (the prompt style); the part before is the basename. For CLI katas with an external acceptance suite, see CLI katas with external acceptance suite. For deeper kata-design guidance, see research/kata-design/kata-construction.md.
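
The naming convention can be expressed as a small helper that splits a kata directory name into basename and prompt style. The function `split_kata_name` is hypothetical and not part of the repository; returning `None` as the style for names without a known suffix is an assumption that follows from "typically" above.

```python
# Longer suffixes are listed first so the match is unambiguous.
PROMPT_STYLES = ("example-mapping", "user-story", "prose")


def split_kata_name(name):
    """Return (basename, prompt_style); style is None if no known suffix."""
    for style in PROMPT_STYLES:
        suffix = "-" + style
        if name.endswith(suffix):
            return name[: -len(suffix)], style
    return name, None


print(split_kata_name("game-of-life-prose"))          # → ('game-of-life', 'prose')
print(split_kata_name("mars-rover-example-mapping"))  # → ('mars-rover', 'example-mapping')
```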

New workflow variant

Create experiments/workflows/<variant-name>/.claude/ with:

rules/*.md — TDD rules for this variant
agents/*.md — agent definitions (if using subagents)
commands/*.md — skill definitions (if using inline skills)

For pi workflows, create experiments/workflows/<variant-name>-pi/.pi/ with:

AGENTS.md — TDD rules and mandatory output-format markers (pi skills are documents, not tool calls)
skills/<phase>/SKILL.md — Skill documents for test-list, red, green
agents/<name>.md — Subagent definitions (refactor)

See experiments/workflows/v6.2-with-why-cleaned-pi/ for a working example. The pi measurement pipeline (parse_pi_transcript.py) counts text markers (## Red, ## Green) in assistant output rather than Skill tool calls — see MARKERS.md for the full marker specification.

Further Documentation

Document	Description
HUMAN-IN-THE-LOOP.md	How to re-enable human approval checkpoints between TDD phases
WORKTREE-WORKFLOW.md	Persistent agent-worktree convention for parallel work

License

This project is provided as-is for research and educational purposes.
