velos · zac · Jun 12, 2026 · May 17, 2026
diff --git a/.agents/skills/codemode-evaluations/SKILL.md b/.agents/skills/codemode-evaluations/SKILL.md
@@ -0,0 +1,117 @@
+---
+name: codemode-evaluations
+description: Use when adding, updating, or reviewing CodeMode.swift evaluation scenarios, deterministic eval tests, or Wavelike-backed LLM eval coverage in this repository.
+metadata:
+  short-description: Add CodeMode eval scenarios
+---
+
+# CodeMode Evaluations
+
+Use this skill when the user asks to add evaluation coverage for CodeMode.swift tools, bridge behavior, catalog search, tool-call order, capability minimization, validation errors, permission behavior, or LLM agent performance.
+
+## Source of Truth
+
+Start with the local harness:
+
+- `Sources/CodeModeEvaluation/EvalModels.swift`: scenario, seed file, permission, expectation, result models.
+- `Sources/CodeModeEvaluation/EvalRunner.swift`: deterministic sandbox runner and transcript validator.
+- `Sources/CodeModeEvaluation/EvalScenarios.swift`: built-in scenarios and `CodeModeEvalScenarios.all`.
+- `Tests/CodeModeEvalTests/CodeModeEvalRunnerTests.swift`: regression tests for built-in scenarios and validator behavior.
+- `Tools/CodeModeEval/Sources/CodeModeEvalCLI`: CLI for deterministic runs, LLM runs, planning, summaries, comparisons, and Markdown reports.
+
+Apple Evaluations framework guidance maps to this repo as:
+
+- Evaluation = one `CodeModeEvalScenario` plus its expectations.
+- Dataset = categorized scenario set in `CodeModeEvalScenarios.all` and optional LLM suites.
+- Subject = `CodeModeAgentTools` for deterministic runs or the Wavelike-backed agent loop for LLM runs.
+- Evaluators = `CodeModeEvalExpectation` fields plus `CodeModeEvalRunner.validateTranscript`.
+- Aggregation = test pass rate, CLI summaries, scenario summaries, capability pass rate, retry and turn counts.
+
+Apple's Evaluations framework is useful as design guidance, but do not add `import Evaluations` to the package unless the repository is intentionally moving to a compatible Apple beta toolchain and platform. The current package is SwiftPM-based and already has a portable evaluation harness.
+
+## Eval Design Checklist
+
+Treat evaluations as the living specification for the feature under test:
+
+1. Define the behavior in measurable terms before changing prompts or implementation.
+2. Add golden, edge, adversarial, and known-failure cases where relevant.
+3. Prefer code-based checks when the result is computable: exact JSON output, error code, diagnostic fragment, tool order, allowed capabilities, forbidden capabilities, and required code fragments.
+4. Use LLM repeat runs only for model behavior that cannot be proven deterministically.
+5. Keep each scenario narrow enough that a failure points to a specific behavior.
+6. Include at least one negative assertion when there is a real risk, such as forbidden capabilities or an expected permission denial.
+
+## Adding a Deterministic Scenario
+
+1. Find nearby examples in `EvalScenarios.swift`.
+2. Add a `public static let` scenario with a stable dotted id.
+3. Add it to `CodeModeEvalScenarios.all` in the appropriate section.
+4. Set `task` as the model-facing instruction for LLM runs. Make it precise enough to evaluate.
+5. Use `searchCode` when the scenario expects catalog discovery before execution.
+6. Use `executeCode` for one-step workflows or `executeSteps` when order matters across multiple tool calls.
+7. Set `allowedCapabilities` to the minimum needed by the execute step.
+8. Use `seedFiles`, `permissions`, and `catalogPlatform` instead of ad hoc setup in test code.
+9. Fill `CodeModeEvalExpectation` with measurable checks:
+   - `toolOrder`
+   - `exactAllowedCapabilities`
+   - `forbiddenCapabilities`
+   - `requiredSearchResultFragments`
+   - `requiredExecuteCodeFragments` or `requiredExecuteCodeAlternativeFragments`
+   - `expectedOutput`
+   - `expectedErrorCode`
+   - `requiredErrorSuggestionFragments`
+   - diagnostic or log fragments when they are part of the contract
+
+Keep expected JSON stable. If order is not semantically meaningful, assert fragments instead of exact arrays.
+
+## Adding LLM Eval Coverage
+
+Use existing deterministic scenarios as the dataset for LLM runs. When adding a scenario that should be part of standard LLM suites:
+
+1. Inspect `LLMEvalSuite` in `Tools/CodeModeEval/Sources/CodeModeEvalCLI/LLM.swift`.
+2. Add the scenario id to `smoke`, `core`, `failures`, or `all` only when it matches that suite's purpose.
+3. Run `swift run codemode-eval plan --suite <suite>` from `Tools/CodeModeEval` before live model calls.
+4. For live calls, prefer a small repeat count first, then increase only after the scenario is stable.
+5. Compare reports with the CLI when evaluating a prompt or model change.
+
+Use these metrics as the main quality signals:
+
+- `passRate`: scenario success.
+- `exactCapabilityPassRate`: whether the agent used the exact minimal capability set.
+- `averageTurns`: whether the task requires too much back-and-forth.
+- `averageRetries`: whether transport or model stability regressed.
+- failure categories and captured tool calls for root cause.
+
+## Validation Commands
+
+From the repo root:
+
+```sh
+swift test --filter CodeModeEvalTests
+```
+
+From `Tools/CodeModeEval`:
+
+```sh
+swift run codemode-eval list
+swift run codemode-eval run <scenario-id> --show-code
+swift run codemode-eval plan --suite smoke
+```
+
+Use LLM runs only when credentials and budget are intentionally available:
+
+```sh
+swift run codemode-eval llm --suite smoke --repeat 1 --output /tmp/codemode-llm.json
+swift run codemode-eval summarize /tmp/codemode-llm.json --include-failures
+swift run codemode-eval report /tmp/codemode-llm.json --output /tmp/codemode-llm.md
+```
+
+## Review Heuristics
+
+Reject or revise evals that:
+
+- Only assert that a command runs without checking behavior.
+- Request broader capabilities than the task requires.
+- Depend on wall-clock time, network data, or host state when a seeded fixture can express the behavior.
+- Combine unrelated bridge behavior in one scenario.
+- Add LLM-only coverage for behavior that the deterministic runner can verify.
+- Use polished natural-language expectations where a code-based check would be cheaper and more reproducible.
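Note on the "Keep expected JSON stable" guidance in this file: the difference between an exact comparison and a fragment assertion can be sketched in a few lines of Swift. This is only an illustration. `missingFragments` is a hypothetical helper, not part of the CodeMode harness, and the JSON strings are made-up sample outputs; the real checks go through `CodeModeEvalExpectation`, whose API is not shown in this diff.

```swift
import Foundation

// Hypothetical helper for illustration only; not the harness API.
// Returns the required fragments that do not appear in the output.
func missingFragments(in output: String, required: [String]) -> [String] {
    required.filter { !output.contains($0) }
}

// Two made-up outputs that differ only in array order.
let first = #"{"files":["a.txt","b.txt"]}"#
let second = #"{"files":["b.txt","a.txt"]}"#
let required = [#""a.txt""#, #""b.txt""#]

// An exact string comparison is order-sensitive.
print(first == second)  // false

// A fragment check accepts both orderings.
print(missingFragments(in: first, required: required).isEmpty)   // true
print(missingFragments(in: second, required: required).isEmpty)  // true
```

An exact `expectedOutput` would flag the second ordering as a failure even though the behavior is the same, which is why fragments are the better fit when order carries no meaning.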
diff --git a/.agents/skills/codemode-synthetic-eval-datasets/SKILL.md b/.agents/skills/codemode-synthetic-eval-datasets/SKILL.md
@@ -0,0 +1,136 @@
+---
+name: codemode-synthetic-eval-datasets
+description: Use when designing, generating, validating, or importing synthetic evaluation datasets for CodeMode.swift eval scenarios.
+metadata:
+  short-description: Build CodeMode synthetic eval data
+---
+
+# CodeMode Synthetic Eval Datasets
+
+Use this skill when the user asks to generate, expand, validate, rebalance, or import synthetic evaluation data for CodeMode.swift.
+
+The output should usually become typed Swift scenarios in `Sources/CodeModeEvaluation/EvalScenarios.swift`, plus tests in `Tests/CodeModeEvalTests`, not an unreviewed blob of generated data.
+
+## Apple-Informed Principles
+
+Follow these dataset design rules:
+
+- Start with high-quality human-written seeds. Synthetic data amplifies seed quality and seed gaps.
+- Categorize samples by purpose: golden, edge, adversarial, known failures, permission failures, platform-gated catalog discovery, and capability-minimization checks.
+- Generate within focused categories instead of asking for one broad "diverse" set.
+- Include hard and adversarial seeds. When building a synthetic set, aim for difficult human-authored anchors to make up at least 20 percent of the seeds, and closer to 30 percent where practical.
+- Keep human-written samples at 20 to 30 percent or more of any expanded dataset so they can serve as calibration anchors.
+- Validate generated samples programmatically, then manually review a random slice before relying on them.
+- Prefer 50 to 200 strong samples per feature or category over thousands of noisy, duplicate, or ambiguous cases.
+
+## CodeMode Dataset Shape
+
+Represent synthetic output in CodeMode terms:
+
+- `id`: stable dotted id, grouped by domain, for example `fs.synthetic.long-path-read`.
+- `title`: short human-readable title.
+- `task`: model-facing instruction that can be evaluated.
+- `searchCode`: catalog lookup the agent should perform, if discovery is part of the behavior.
+- `executeCode` or `executeSteps`: expected tool-side behavior for deterministic replay.
+- `allowedCapabilities`: exact minimum capability set.
+- `seedFiles`: deterministic fixtures under `tmp:`, `documents:`, or `caches:`.
+- `permissions`: explicit permission status and request behavior.
+- `catalogPlatform`: platform override when testing platform-specific availability.
+- `expectation`: measurable checks on tool order, capabilities, output, errors, diagnostics, and code fragments.
+
+Do not import generated examples directly if expected outputs are ambiguous. Turn them into deterministic scenarios with explicit expected JSON or explicit failure expectations.
+
+## Generation Workflow
+
+1. Identify the feature and quality dimensions:
+   - correctness
+   - capability minimization
+   - tool-call order
+   - argument validation
+   - permission behavior
+   - platform pruning
+   - diagnostic quality
+   - recovery after invalid arguments
+2. Write 5 to 15 seed scenarios by hand.
+3. Label each seed with category, difficulty, feature area, required capabilities, and the expected failure mode if any.
+4. Generate more cases one category at a time:
+   - golden filesystem reads
+   - adversarial path policy escapes
+   - catalog alias lookups
+   - permission denied flows
+   - invalid argument repair cases
+   - platform-pruned iOS-only or macOS-only helpers
+5. Validate generated candidates:
+   - unique id
+   - unique or intentionally varied task
+   - exact minimal capability list
+   - no impossible host state
+   - all seeded paths stay inside allowed roots
+   - expected output is derivable from seeded data
+   - no leaking the expected answer in a way that invalidates the task
+6. Promote accepted samples into `CodeModeEvalScenario` declarations.
+7. Add them to `CodeModeEvalScenarios.all` or to a new grouped collection if the set is large.
+8. Run deterministic tests before any LLM run.
+
+## Candidate Review Checklist
+
+For each generated candidate, answer:
+
+- What exact behavior does this sample measure?
+- Which category does it belong to?
+- Is the expected result deterministic?
+- Could a model pass by using a broader capability than necessary?
+- Does `requiredExecuteCodeAlternativeFragments` allow legitimate API aliases without allowing unrelated implementations?
+- Is this a near-duplicate of an existing scenario?
+- Would failure produce a useful, localized signal?
+
+Discard or rewrite candidates that fail these checks.
+
+## Prompt Pattern for Synthetic Candidates
+
+When asking a model to draft candidates, keep the prompt constrained:
+
+```text
+Generate CodeMode evaluation scenario candidates for <feature area>.
+Category: <golden|edge|adversarial|known-failure|permission|platform>.
+Each candidate must include: id, title, task, seedFiles, allowedCapabilities,
+expectedOutput or expectedErrorCode, and rationale.
+Do not include cases that require live network, clock time, user contacts,
+real calendars, Photos library contents, or host filesystem state.
+Vary phrasing, input length, and difficulty. Keep expected outputs deterministic.
+```
+
+Then convert accepted candidates to Swift manually or with a small script, preserving the local formatting style.
+
+## Validation Commands
+
+From the repo root:
+
+```sh
+swift test --filter CodeModeEvalTests
+```
+
+From `Tools/CodeModeEval`:
+
+```sh
+swift run codemode-eval run <scenario-id> --show-code
+swift run codemode-eval plan --suite core
+```
+
+For LLM-backed validation after deterministic checks pass:
+
+```sh
+swift run codemode-eval llm <scenario-id> --repeat 3 --output /tmp/codemode-synthetic-smoke.json
+swift run codemode-eval report /tmp/codemode-synthetic-smoke.json --all-runs --include-code
+```
+
+## When to Use Apple's Evaluations Framework Directly
+
+Only add a direct `Evaluations` framework target when the user explicitly wants Apple's framework adoption and the local toolchain supports it. In that case:
+
+- Model each CodeMode scenario as a `ModelSample` with explicit expected values.
+- Put the CodeMode tool or agent under test in the `subject(from:)` implementation.
+- Use code-based `Evaluator` checks for deterministic behavior.
+- Use tool-call trajectory expectations for tool order and arguments.
+- Aggregate pass/fail metrics with means and scored metrics with medians or maxima, depending on what the metric represents.
+- Keep the existing CodeMode harness running until the new framework produces equivalent or better regression signal.
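Note on step 5 of the generation workflow ("Validate generated candidates"): two of the listed checks, unique ids and seeded paths staying inside allowed roots, are easy to automate before manual review. The sketch below is a minimal illustration under stated assumptions. `Candidate` and `validate` are hypothetical names, not the repository's `CodeModeEvalScenario` type or any harness API, and the `root:relative/path` format is assumed from the `tmp:`, `documents:`, and `caches:` roots named in the skill.

```swift
import Foundation

// Hypothetical candidate shape for illustration only; it is not the
// repository's CodeModeEvalScenario type.
struct Candidate {
    let id: String
    let seedPaths: [String]
}

// Roots named in the "CodeMode Dataset Shape" section of the skill.
let allowedRoots = ["tmp:", "documents:", "caches:"]

// Returns human-readable problems; an empty array means both checks passed.
func validate(_ candidates: [Candidate]) -> [String] {
    var problems: [String] = []
    var seenIDs = Set<String>()
    for candidate in candidates {
        // Check 1: ids must be unique across the batch.
        if !seenIDs.insert(candidate.id).inserted {
            problems.append("duplicate id: \(candidate.id)")
        }
        for path in candidate.seedPaths {
            // Check 2a: the path must start with an allowed root.
            guard let root = allowedRoots.first(where: { path.hasPrefix($0) }) else {
                problems.append("\(candidate.id): not under an allowed root: \(path)")
                continue
            }
            // Check 2b: no ".." component may appear after the root.
            let relative = path.dropFirst(root.count)
            if relative.split(separator: "/").contains("..") {
                problems.append("\(candidate.id): path escapes its root: \(path)")
            }
        }
    }
    return problems
}

// Made-up batch: one duplicate id, one escaping path, one host path.
let batch = [
    Candidate(id: "fs.synthetic.long-path-read", seedPaths: ["tmp:notes/a.txt"]),
    Candidate(id: "fs.synthetic.long-path-read", seedPaths: ["tmp:../secret.txt"]),
    Candidate(id: "fs.synthetic.host-read", seedPaths: ["/etc/hosts"]),
]
for problem in validate(batch) {
    print(problem)
}
```

The remaining checks in that step, such as minimal capability lists and answer leakage, depend on the scenario's meaning and still need human review.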
diff --git a/.github/workflows/codemode-evals.yml b/.github/workflows/codemode-evals.yml
@@ -36,23 +36,18 @@ jobs:
       - name: Test package
         run: swift test
 
-      - name: Build deterministic eval CLI
-        run: swift build --package-path Tools/CodeModeDeterministicEval
+      - name: Build eval CLI
+        run: swift build --package-path Tools/CodeModeEval
 
       - name: Run deterministic evals
-        run: swift run --package-path Tools/CodeModeDeterministicEval codemode-deterministic-eval run
+        run: swift run --package-path Tools/CodeModeEval codemode-eval run
 
-  live-llm:
-    name: Live Wavelike LLM evals
+  llm-plan:
+    name: LLM eval planning
     runs-on: macos-latest
     needs: deterministic
     if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
-    timeout-minutes: 90
     env:
-      WAVELIKE_MODEL_ID: ${{ secrets.WAVELIKE_MODEL_ID }}
-      WAVELIKE_APP_ID: ${{ secrets.WAVELIKE_APP_ID }}
-      WAVELIKE_API_KEY: ${{ secrets.WAVELIKE_API_KEY }}
-      WAVELIKE_ENV: ${{ secrets.WAVELIKE_ENV }}
       INPUT_REPEAT_COUNT: ${{ github.event.inputs.repeat_count }}
       INPUT_REQUEST_DELAY_MS: ${{ github.event.inputs.request_delay_ms }}
     steps:
@@ -62,104 +57,27 @@ jobs:
       - name: Toolchain
         run: swift --version
 
-      - name: Check Wavelike secrets
-        id: wavelike
-        run: |
-          if [ -n "$WAVELIKE_MODEL_ID" ] && [ -n "$WAVELIKE_APP_ID" ] && [ -n "$WAVELIKE_API_KEY" ]; then
-            echo "available=true" >> "$GITHUB_OUTPUT"
-          else
-            echo "available=false" >> "$GITHUB_OUTPUT"
-            echo "Wavelike secrets are not configured; skipping live LLM evals."
-          fi
-
       - name: Build eval CLI
-        if: steps.wavelike.outputs.available == 'true'
         run: swift build --package-path Tools/CodeModeEval
 
-      - name: Run live LLM eval suites
-        if: steps.wavelike.outputs.available == 'true'
+      - name: Preview LLM eval suites
         run: |
-          set +e
-
           repeat_count="${INPUT_REPEAT_COUNT:-5}"
           request_delay_ms="${INPUT_REQUEST_DELAY_MS:-1000}"
-          reports_dir="Tools/CodeModeEval/.build/reports"
-          mkdir -p "$reports_dir"
-
-          status=0
 
-          swift run --package-path Tools/CodeModeEval codemode-eval llm \
+          swift run --package-path Tools/CodeModeEval codemode-eval plan \
             --suite core \
             --repeat "$repeat_count" \
-            --max-output-tokens 1600 \
-            --request-delay-ms "$request_delay_ms" \
-            --model-retries 5 \
-            --retry-delay-ms 5000 \
-            --output "$reports_dir/core-r${repeat_count}.json" || status=$?
-
-          swift run --package-path Tools/CodeModeEval codemode-eval summarize \
-            "$reports_dir/core-r${repeat_count}.json" \
-            --output "$reports_dir/core-r${repeat_count}-summary.json" || status=$?
+            --request-delay-ms "$request_delay_ms"
 
-          if [ "$repeat_count" = "5" ]; then
-            swift run --package-path Tools/CodeModeEval codemode-eval report \
-              "$reports_dir/core-r${repeat_count}.json" \
-              --baseline Tools/CodeModeEval/Baselines/core-r5-summary.json \
-              --output "$reports_dir/core-r${repeat_count}.md" || status=$?
-          else
-            swift run --package-path Tools/CodeModeEval codemode-eval report \
-              "$reports_dir/core-r${repeat_count}.json" \
-              --output "$reports_dir/core-r${repeat_count}.md" || status=$?
-          fi
-
-          swift run --package-path Tools/CodeModeEval codemode-eval llm \
+          swift run --package-path Tools/CodeModeEval codemode-eval plan \
             --suite failures \
             --repeat "$repeat_count" \
-            --max-output-tokens 1200 \
-            --request-delay-ms "$request_delay_ms" \
-            --model-retries 5 \
-            --retry-delay-ms 5000 \
-            --output "$reports_dir/failures-r${repeat_count}.json" || status=$?
-
-          swift run --package-path Tools/CodeModeEval codemode-eval summarize \
-            "$reports_dir/failures-r${repeat_count}.json" \
-            --output "$reports_dir/failures-r${repeat_count}-summary.json" || status=$?
-
-          if [ "$repeat_count" = "5" ]; then
-            swift run --package-path Tools/CodeModeEval codemode-eval report \
-              "$reports_dir/failures-r${repeat_count}.json" \
-              --baseline Tools/CodeModeEval/Baselines/failures-r5-summary.json \
-              --output "$reports_dir/failures-r${repeat_count}.md" || status=$?
-          else
-            swift run --package-path Tools/CodeModeEval codemode-eval report \
-              "$reports_dir/failures-r${repeat_count}.json" \
-              --output "$reports_dir/failures-r${repeat_count}.md" || status=$?
-          fi
+            --request-delay-ms "$request_delay_ms"
 
-          if [ "$repeat_count" = "5" ]; then
-            swift run --package-path Tools/CodeModeEval codemode-eval compare \
-              Tools/CodeModeEval/Baselines/core-r5-summary.json \
-              "$reports_dir/core-r5.json" \
-              --retry-tolerance 0.5 \
-              --turn-tolerance 0.5 || status=$?
-
-            swift run --package-path Tools/CodeModeEval codemode-eval compare \
-              Tools/CodeModeEval/Baselines/failures-r5-summary.json \
-              "$reports_dir/failures-r5.json" \
-              --retry-tolerance 0.25 \
-              --turn-tolerance 0.25 || status=$?
-          else
-            echo "Skipping baseline compare because repeat_count=$repeat_count does not match committed r5 baselines."
-          fi
-
-          exit "$status"
+          swift run --package-path Tools/CodeModeEval codemode-eval plan \
+            --suite catalog \
+            --repeat "$repeat_count" \
+            --request-delay-ms "$request_delay_ms"
 
-      - name: Upload LLM eval reports
-        if: always() && steps.wavelike.outputs.available == 'true'
-        uses: actions/upload-artifact@v4
-        with:
-          name: codemode-llm-eval-reports
-          path: |
-            Tools/CodeModeEval/.build/reports/*.json
-            Tools/CodeModeEval/.build/reports/*.md
-          if-no-files-found: warn
+          echo "Live Wavelike-backed LLM execution is disabled in this workflow while the default eval package avoids private dependencies."