117 changes: 117 additions & 0 deletions .agents/skills/codemode-evaluations/SKILL.md
@@ -0,0 +1,117 @@
---
name: codemode-evaluations
description: Use when adding, updating, or reviewing CodeMode.swift evaluation scenarios, deterministic eval tests, or Wavelike-backed LLM eval coverage in this repository.
metadata:
  short-description: Add CodeMode eval scenarios
---

# CodeMode Evaluations

Use this skill when the user asks to add evaluation coverage for CodeMode.swift tools, bridge behavior, catalog search, tool-call order, capability minimization, validation errors, permission behavior, or LLM agent performance.

## Source of Truth

Start with the local harness:

- `Sources/CodeModeEvaluation/EvalModels.swift`: scenario, seed file, permission, expectation, result models.
- `Sources/CodeModeEvaluation/EvalRunner.swift`: deterministic sandbox runner and transcript validator.
- `Sources/CodeModeEvaluation/EvalScenarios.swift`: built-in scenarios and `CodeModeEvalScenarios.all`.
- `Tests/CodeModeEvalTests/CodeModeEvalRunnerTests.swift`: regression tests for built-in scenarios and validator behavior.
- `Tools/CodeModeEval/Sources/CodeModeEvalCLI`: CLI for deterministic runs, LLM runs, planning, summaries, comparisons, and Markdown reports.

Apple Evaluations framework guidance maps to this repo as:

- Evaluation = one `CodeModeEvalScenario` plus its expectations.
- Dataset = categorized scenario set in `CodeModeEvalScenarios.all` and optional LLM suites.
- Subject = `CodeModeAgentTools` for deterministic runs or the Wavelike-backed agent loop for LLM runs.
- Evaluators = `CodeModeEvalExpectation` fields plus `CodeModeEvalRunner.validateTranscript`.
- Aggregation = test pass rate, CLI summaries, scenario summaries, capability pass rate, retry and turn counts.

Apple's Evaluations framework is useful as design guidance, but do not add `import Evaluations` to the package unless the repository is intentionally moving to a compatible Apple beta toolchain and platform. The current package is SwiftPM-based and already has a portable evaluation harness.

## Eval Design Checklist

Treat evaluations as the living specification for the feature under test:

1. Define the behavior in measurable terms before changing prompts or implementation.
2. Add golden, edge, adversarial, and known-failure cases where relevant.
3. Prefer code-based checks when the result is computable: exact JSON output, error code, diagnostic fragment, tool order, allowed capabilities, forbidden capabilities, and required code fragments.
4. Use LLM repeat runs only for model behavior that cannot be proven deterministically.
5. Keep each scenario narrow enough that a failure points to a specific behavior.
6. Include at least one negative assertion when there is a real risk, such as forbidden capabilities or an expected permission denial.
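
A negative assertion on capabilities can be expressed as plain set checks. The sketch below is an illustration only: `CapabilityCheck`, `checkCapabilities`, and the capability names are hypothetical and are not taken from the CodeMode harness.

```swift
// Hypothetical capability names, used only for illustration.
let allowed: Set<String> = ["fs.read"]
let forbidden: Set<String> = ["network.fetch", "fs.write"]

// Hypothetical result of checking one run's capability usage.
struct CapabilityCheck {
    let usesExactAllowedSet: Bool
    let avoidsForbidden: Bool
    var passed: Bool { usesExactAllowedSet && avoidsForbidden }
}

// Compares the capabilities a run actually used against the scenario's
// exact allowed set and its forbidden set.
func checkCapabilities(
    used: Set<String>,
    allowed: Set<String>,
    forbidden: Set<String>
) -> CapabilityCheck {
    CapabilityCheck(
        usesExactAllowedSet: used == allowed,
        avoidsForbidden: used.isDisjoint(with: forbidden)
    )
}

let minimalRun = checkCapabilities(used: ["fs.read"], allowed: allowed, forbidden: forbidden)
let broadRun = checkCapabilities(
    used: ["fs.read", "network.fetch"],
    allowed: allowed,
    forbidden: forbidden
)
print(minimalRun.passed, broadRun.passed)
```

Keeping the exact-set check and the forbidden-set check separate makes a failure report say which of the two rules was broken.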

## Adding a Deterministic Scenario

1. Find nearby examples in `EvalScenarios.swift`.
2. Add a `public static let` scenario with a stable dotted id.
3. Add it to `CodeModeEvalScenarios.all` in the appropriate section.
4. Set `task` as the model-facing instruction for LLM runs. Make it precise enough to evaluate.
5. Use `searchCode` when the scenario expects catalog discovery before execution.
6. Use `executeCode` for one-step workflows or `executeSteps` when order matters across multiple tool calls.
7. Set `allowedCapabilities` to the minimum needed by the execute step.
8. Use `seedFiles`, `permissions`, and `catalogPlatform` instead of ad hoc setup in test code.
9. Fill `CodeModeEvalExpectation` with measurable checks:
   - `toolOrder`
   - `exactAllowedCapabilities`
   - `forbiddenCapabilities`
   - `requiredSearchResultFragments`
   - `requiredExecuteCodeFragments` or `requiredExecuteCodeAlternativeFragments`
   - `expectedOutput`
   - `expectedErrorCode`
   - `requiredErrorSuggestionFragments`
   - diagnostic or log fragments when they are part of the contract

Keep expected JSON stable. If order is not semantically meaningful, assert fragments instead of exact arrays.
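
The difference between an exact comparison and a fragment check can be sketched as follows. `containsAllFragments` is a hypothetical helper written for this example, not the harness's own validator, and the JSON strings are made-up outputs.

```swift
import Foundation

// Hypothetical helper: true when every required fragment appears
// somewhere in the serialized output.
func containsAllFragments(_ output: String, fragments: [String]) -> Bool {
    fragments.allSatisfy { output.contains($0) }
}

// Two made-up outputs that list the same files in a different order.
let firstRun = #"{"files":["a.txt","b.txt"]}"#
let secondRun = #"{"files":["b.txt","a.txt"]}"#

// An exact comparison treats the reordering as a difference.
let exactMatch = (firstRun == secondRun)

// Fragment checks accept both orderings.
let required = [#""a.txt""#, #""b.txt""#]
let firstPasses = containsAllFragments(firstRun, fragments: required)
let secondPasses = containsAllFragments(secondRun, fragments: required)

print(exactMatch, firstPasses, secondPasses)
```

Fragment checks are weaker than exact output, so reserve them for cases where ordering really carries no meaning.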

## Adding LLM Eval Coverage

Use existing deterministic scenarios as the dataset for LLM runs. When adding a scenario that should be part of standard LLM suites:

1. Inspect `LLMEvalSuite` in `Tools/CodeModeEval/Sources/CodeModeEvalCLI/LLM.swift`.
2. Add the scenario id to `smoke`, `core`, `failures`, or `all` only when it matches that suite's purpose.
3. Run `swift run codemode-eval plan --suite <suite>` from `Tools/CodeModeEval` before live model calls.
4. For live calls, prefer a small repeat count first, then increase only after the scenario is stable.
5. Compare reports with the CLI when evaluating a prompt or model change.

Use these metrics as the main quality signals:

- `passRate`: scenario success.
- `exactCapabilityPassRate`: whether the agent used the exact minimal capability set.
- `averageTurns`: whether the task requires too much back-and-forth.
- `averageRetries`: whether transport or model stability regressed.
- failure categories and captured tool calls for root cause.
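
As a rough illustration of how these signals relate to individual runs, the sketch below aggregates a list of run records. `RunRecord`, `SuiteMetrics`, and `summarize` are hypothetical; the real report schema and the CLI's own computation may differ.

```swift
// Hypothetical per-run record; field names are assumptions.
struct RunRecord {
    let passed: Bool
    let usedExactCapabilities: Bool
    let turns: Int
    let retries: Int
}

// Hypothetical summary mirroring the metric names listed above.
struct SuiteMetrics {
    let passRate: Double
    let exactCapabilityPassRate: Double
    let averageTurns: Double
    let averageRetries: Double
}

// Returns nil for an empty run list so callers never divide by zero.
func summarize(_ runs: [RunRecord]) -> SuiteMetrics? {
    guard !runs.isEmpty else { return nil }
    let count = Double(runs.count)
    return SuiteMetrics(
        passRate: Double(runs.filter { $0.passed }.count) / count,
        exactCapabilityPassRate: Double(runs.filter { $0.usedExactCapabilities }.count) / count,
        averageTurns: Double(runs.reduce(0) { $0 + $1.turns }) / count,
        averageRetries: Double(runs.reduce(0) { $0 + $1.retries }) / count
    )
}

let sampleRuns = [
    RunRecord(passed: true, usedExactCapabilities: true, turns: 2, retries: 0),
    RunRecord(passed: true, usedExactCapabilities: true, turns: 3, retries: 0),
    RunRecord(passed: true, usedExactCapabilities: false, turns: 2, retries: 1),
    RunRecord(passed: false, usedExactCapabilities: false, turns: 5, retries: 1),
]
let metrics = summarize(sampleRuns)
```

A run can pass while still using a broader capability set than needed, which is why the two rates are tracked separately.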

## Validation Commands

From the repo root:

```sh
swift test --filter CodeModeEvalTests
```

From `Tools/CodeModeEval`:

```sh
swift run codemode-eval list
swift run codemode-eval run <scenario-id> --show-code
swift run codemode-eval plan --suite smoke
```

Use LLM runs only when credentials and budget are intentionally available:

```sh
swift run codemode-eval llm --suite smoke --repeat 1 --output /tmp/codemode-llm.json
swift run codemode-eval summarize /tmp/codemode-llm.json --include-failures
swift run codemode-eval report /tmp/codemode-llm.json --output /tmp/codemode-llm.md
```

## Review Heuristics

Reject or revise evals that:

- Only assert that a command runs without checking behavior.
- Request broader capabilities than the task requires.
- Depend on wall-clock time, network data, or host state when a seeded fixture can express the behavior.
- Combine unrelated bridge behavior in one scenario.
- Add LLM-only coverage for behavior that the deterministic runner can verify.
- Use polished natural-language expectations where a code-based check would be cheaper and more reproducible.
136 changes: 136 additions & 0 deletions .agents/skills/codemode-synthetic-eval-datasets/SKILL.md
@@ -0,0 +1,136 @@
---
name: codemode-synthetic-eval-datasets
description: Use when designing, generating, validating, or importing synthetic evaluation datasets for CodeMode.swift eval scenarios.
metadata:
  short-description: Build CodeMode synthetic eval data
---

# CodeMode Synthetic Eval Datasets

Use this skill when the user asks to generate, expand, validate, rebalance, or import synthetic evaluation data for CodeMode.swift.

The output should usually become typed Swift scenarios in `Sources/CodeModeEvaluation/EvalScenarios.swift`, plus tests in `Tests/CodeModeEvalTests`, not an unreviewed blob of generated data.

## Apple-Informed Principles

Follow these dataset design rules:

- Start with high-quality human-written seeds. Synthetic data amplifies seed quality and seed gaps.
- Categorize samples by purpose: golden, edge, adversarial, known failures, permission failures, platform-gated catalog discovery, and capability-minimization checks.
- Generate within focused categories instead of asking for one broad "diverse" set.
- Include hard and adversarial seeds. Aim for at least 20 to 30 percent difficult human-authored anchors when building a synthetic set.
- Keep human-written samples at 20 to 30 percent or more of any expanded dataset so they act as calibration anchors.
- Validate generated samples programmatically, then manually review a random slice before relying on them.
- Prefer 50 to 200 strong samples per feature or category over thousands of noisy, duplicate, or ambiguous cases.
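
The human-written share can be checked mechanically before a set is accepted. This is a minimal sketch: `Sample`, `isHumanWritten`, and the 0.2 default floor are assumptions made for the example, not harness types.

```swift
// Hypothetical sample record; `isHumanWritten` is an assumed label.
struct Sample {
    let id: String
    let isHumanWritten: Bool
}

// Fraction of human-written samples, or nil for an empty dataset.
func humanWrittenShare(of samples: [Sample]) -> Double? {
    guard !samples.isEmpty else { return nil }
    let humanCount = samples.filter { $0.isHumanWritten }.count
    return Double(humanCount) / Double(samples.count)
}

// True when the dataset keeps enough human-written anchors.
func hasEnoughAnchors(_ samples: [Sample], minimumShare: Double = 0.2) -> Bool {
    guard let share = humanWrittenShare(of: samples) else { return false }
    return share >= minimumShare
}

// 3 human-written seeds plus 7 generated samples.
let balanced = (0..<10).map {
    Sample(id: "fs.synthetic.case-\($0)", isHumanWritten: $0 < 3)
}
// 1 human-written seed diluted by 9 generated samples.
let diluted = (0..<10).map {
    Sample(id: "fs.synthetic.case-\($0)", isHumanWritten: $0 < 1)
}
print(hasEnoughAnchors(balanced), hasEnoughAnchors(diluted))
```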

## CodeMode Dataset Shape

Represent synthetic output in CodeMode terms:

- `id`: stable dotted id, grouped by domain, for example `fs.synthetic.long-path-read`.
- `title`: short human-readable title.
- `task`: model-facing instruction that can be evaluated.
- `searchCode`: catalog lookup the agent should perform, if discovery is part of the behavior.
- `executeCode` or `executeSteps`: expected tool-side behavior for deterministic replay.
- `allowedCapabilities`: exact minimum capability set.
- `seedFiles`: deterministic fixtures under `tmp:`, `documents:`, or `caches:`.
- `permissions`: explicit permission status and request behavior.
- `catalogPlatform`: platform override when testing platform-specific availability.
- `expectation`: measurable checks on tool order, capabilities, output, errors, diagnostics, and code fragments.

Do not import generated examples directly if expected outputs are ambiguous. Turn them into deterministic scenarios with explicit expected JSON or explicit failure expectations.

## Generation Workflow

1. Identify the feature and quality dimensions:
   - correctness
   - capability minimization
   - tool-call order
   - argument validation
   - permission behavior
   - platform pruning
   - diagnostic quality
   - recovery after invalid arguments
2. Write 5 to 15 seed scenarios by hand.
3. Label each seed with category, difficulty, feature area, required capabilities, and the expected failure mode if any.
4. Generate more cases one category at a time:
   - golden filesystem reads
   - adversarial path policy escapes
   - catalog alias lookups
   - permission denied flows
   - invalid argument repair cases
   - platform-pruned iOS-only or macOS-only helpers
5. Validate generated candidates:
   - unique id
   - unique or intentionally varied task
   - exact minimal capability list
   - no impossible host state
   - all seeded paths stay inside allowed roots
   - expected output is derivable from seeded data
   - no leaking the expected answer in a way that invalidates the task
6. Promote accepted samples into `CodeModeEvalScenario` declarations.
7. Add them to `CodeModeEvalScenarios.all` or to a new grouped collection if the set is large.
8. Run deterministic tests before any LLM run.
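
Part of the candidate validation in step 5 can be automated. The sketch below covers only two of the checks, unique ids and seeded paths under the `tmp:`, `documents:`, or `caches:` roots. `Candidate` and `validate` are hypothetical and the path format is assumed; this is not the harness's validator.

```swift
// Hypothetical candidate shape for generated samples before promotion.
struct Candidate {
    let id: String
    let task: String
    let seedPaths: [String]
}

// Roots named in this skill; the prefix format is an assumption.
let allowedRoots = ["tmp:", "documents:", "caches:"]

// Returns one human-readable issue per failed check.
func validate(_ candidates: [Candidate]) -> [String] {
    var issues: [String] = []
    var seenIDs = Set<String>()
    for candidate in candidates {
        // Set.insert reports whether the id was new.
        if !seenIDs.insert(candidate.id).inserted {
            issues.append("duplicate id: \(candidate.id)")
        }
        for path in candidate.seedPaths {
            let insideRoot = allowedRoots.contains { path.hasPrefix($0) }
            if !insideRoot {
                issues.append("seed path outside allowed roots: \(path)")
            }
        }
    }
    return issues
}

let candidates = [
    Candidate(id: "fs.synthetic.long-path-read", task: "Read a deeply nested file.",
              seedPaths: ["tmp:a/b/c.txt"]),
    Candidate(id: "fs.synthetic.long-path-read", task: "Read another nested file.",
              seedPaths: ["/etc/hosts"]),
]
let issues = validate(candidates)
issues.forEach { print($0) }
```

Checks such as "expected output is derivable from seeded data" still need manual review or a deterministic replay.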

## Candidate Review Checklist

For each generated candidate, answer:

- What exact behavior does this sample measure?
- Which category does it belong to?
- Is the expected result deterministic?
- Could a model pass by using a broader capability than necessary?
- Does `requiredExecuteCodeAlternativeFragments` allow legitimate API aliases without allowing unrelated implementations?
- Is this a near-duplicate of an existing scenario?
- Would failure produce a useful, localized signal?

Discard or rewrite candidates that fail these checks.

## Prompt Pattern for Synthetic Candidates

When asking a model to draft candidates, keep the prompt constrained:

```text
Generate CodeMode evaluation scenario candidates for <feature area>.
Category: <golden|edge|adversarial|known-failure|permission|platform>.
Each candidate must include: id, title, task, seedFiles, allowedCapabilities,
expectedOutput or expectedErrorCode, and rationale.
Do not include cases that require live network, clock time, user contacts,
real calendars, Photos library contents, or host filesystem state.
Vary phrasing, input length, and difficulty. Keep expected outputs deterministic.
```

Then convert accepted candidates to Swift manually or with a small script, preserving the local formatting style.

## Validation Commands

From the repo root:

```sh
swift test --filter CodeModeEvalTests
```

From `Tools/CodeModeEval`:

```sh
swift run codemode-eval run <scenario-id> --show-code
swift run codemode-eval plan --suite core
```

For LLM-backed validation after deterministic checks pass:

```sh
swift run codemode-eval llm <scenario-id> --repeat 3 --output /tmp/codemode-synthetic-smoke.json
swift run codemode-eval report /tmp/codemode-synthetic-smoke.json --all-runs --include-code
```

## When to Use Apple's Evaluations Framework Directly

Only add a direct `Evaluations` framework target when the user explicitly wants Apple's framework adoption and the local toolchain supports it. In that case:

- Model each CodeMode scenario as a `ModelSample` with explicit expected values.
- Put the CodeMode tool or agent under test in the `subject(from:)` implementation.
- Use code-based `Evaluator` checks for deterministic behavior.
- Use tool-call trajectory expectations for tool order and arguments.
- Aggregate pass/fail metrics with means and scored metrics with medians or maxima, depending on what the metric represents.
- Keep the existing CodeMode harness running until the new framework produces equivalent or better regression signal.
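
The aggregation choice can be illustrated without the framework. The sketch below uses plain Swift helpers, not the `Evaluations` API, and treating turn counts as a scored metric is an assumption made for the example.

```swift
// Arithmetic mean, or nil for an empty input.
func mean(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    return values.reduce(0, +) / Double(values.count)
}

// Median, averaging the two middle values for even-sized inputs.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0
        ? (sorted[mid - 1] + sorted[mid]) / 2
        : sorted[mid]
}

// Pass/fail encoded as 1 and 0: the mean is the pass rate.
let passFlags: [Double] = [1, 1, 0, 1]
// Turn counts with one outlier: the median is less sensitive to it.
let turnCounts: [Double] = [2, 3, 2, 9]

let passRate = mean(passFlags)
let typicalTurns = median(turnCounts)
let worstTurns = turnCounts.max()
```

The maximum is the better summary when the worst case is what matters, for example a hard ceiling on turns.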
112 changes: 15 additions & 97 deletions .github/workflows/codemode-evals.yml
@@ -36,23 +36,18 @@ jobs:
- name: Test package
run: swift test

- name: Build deterministic eval CLI
run: swift build --package-path Tools/CodeModeDeterministicEval
- name: Build eval CLI
run: swift build --package-path Tools/CodeModeEval

- name: Run deterministic evals
run: swift run --package-path Tools/CodeModeDeterministicEval codemode-deterministic-eval run
run: swift run --package-path Tools/CodeModeEval codemode-eval run

live-llm:
name: Live Wavelike LLM evals
llm-plan:
name: LLM eval planning
runs-on: macos-latest
needs: deterministic
if: github.event_name == 'workflow_dispatch' || github.event_name == 'schedule'
timeout-minutes: 90
env:
WAVELIKE_MODEL_ID: ${{ secrets.WAVELIKE_MODEL_ID }}
WAVELIKE_APP_ID: ${{ secrets.WAVELIKE_APP_ID }}
WAVELIKE_API_KEY: ${{ secrets.WAVELIKE_API_KEY }}
WAVELIKE_ENV: ${{ secrets.WAVELIKE_ENV }}
INPUT_REPEAT_COUNT: ${{ github.event.inputs.repeat_count }}
INPUT_REQUEST_DELAY_MS: ${{ github.event.inputs.request_delay_ms }}
steps:
@@ -62,104 +57,27 @@ jobs:
- name: Toolchain
run: swift --version

- name: Check Wavelike secrets
id: wavelike
run: |
if [ -n "$WAVELIKE_MODEL_ID" ] && [ -n "$WAVELIKE_APP_ID" ] && [ -n "$WAVELIKE_API_KEY" ]; then
echo "available=true" >> "$GITHUB_OUTPUT"
else
echo "available=false" >> "$GITHUB_OUTPUT"
echo "Wavelike secrets are not configured; skipping live LLM evals."
fi

- name: Build eval CLI
if: steps.wavelike.outputs.available == 'true'
run: swift build --package-path Tools/CodeModeEval

- name: Run live LLM eval suites
if: steps.wavelike.outputs.available == 'true'
- name: Preview LLM eval suites
run: |
set +e

repeat_count="${INPUT_REPEAT_COUNT:-5}"
request_delay_ms="${INPUT_REQUEST_DELAY_MS:-1000}"
reports_dir="Tools/CodeModeEval/.build/reports"
mkdir -p "$reports_dir"

status=0

swift run --package-path Tools/CodeModeEval codemode-eval llm \
swift run --package-path Tools/CodeModeEval codemode-eval plan \
--suite core \
--repeat "$repeat_count" \
--max-output-tokens 1600 \
--request-delay-ms "$request_delay_ms" \
--model-retries 5 \
--retry-delay-ms 5000 \
--output "$reports_dir/core-r${repeat_count}.json" || status=$?

swift run --package-path Tools/CodeModeEval codemode-eval summarize \
"$reports_dir/core-r${repeat_count}.json" \
--output "$reports_dir/core-r${repeat_count}-summary.json" || status=$?
--request-delay-ms "$request_delay_ms"

if [ "$repeat_count" = "5" ]; then
swift run --package-path Tools/CodeModeEval codemode-eval report \
"$reports_dir/core-r${repeat_count}.json" \
--baseline Tools/CodeModeEval/Baselines/core-r5-summary.json \
--output "$reports_dir/core-r${repeat_count}.md" || status=$?
else
swift run --package-path Tools/CodeModeEval codemode-eval report \
"$reports_dir/core-r${repeat_count}.json" \
--output "$reports_dir/core-r${repeat_count}.md" || status=$?
fi

swift run --package-path Tools/CodeModeEval codemode-eval llm \
swift run --package-path Tools/CodeModeEval codemode-eval plan \
--suite failures \
--repeat "$repeat_count" \
--max-output-tokens 1200 \
--request-delay-ms "$request_delay_ms" \
--model-retries 5 \
--retry-delay-ms 5000 \
--output "$reports_dir/failures-r${repeat_count}.json" || status=$?

swift run --package-path Tools/CodeModeEval codemode-eval summarize \
"$reports_dir/failures-r${repeat_count}.json" \
--output "$reports_dir/failures-r${repeat_count}-summary.json" || status=$?

if [ "$repeat_count" = "5" ]; then
swift run --package-path Tools/CodeModeEval codemode-eval report \
"$reports_dir/failures-r${repeat_count}.json" \
--baseline Tools/CodeModeEval/Baselines/failures-r5-summary.json \
--output "$reports_dir/failures-r${repeat_count}.md" || status=$?
else
swift run --package-path Tools/CodeModeEval codemode-eval report \
"$reports_dir/failures-r${repeat_count}.json" \
--output "$reports_dir/failures-r${repeat_count}.md" || status=$?
fi
--request-delay-ms "$request_delay_ms"

if [ "$repeat_count" = "5" ]; then
swift run --package-path Tools/CodeModeEval codemode-eval compare \
Tools/CodeModeEval/Baselines/core-r5-summary.json \
"$reports_dir/core-r5.json" \
--retry-tolerance 0.5 \
--turn-tolerance 0.5 || status=$?

swift run --package-path Tools/CodeModeEval codemode-eval compare \
Tools/CodeModeEval/Baselines/failures-r5-summary.json \
"$reports_dir/failures-r5.json" \
--retry-tolerance 0.25 \
--turn-tolerance 0.25 || status=$?
else
echo "Skipping baseline compare because repeat_count=$repeat_count does not match committed r5 baselines."
fi

exit "$status"
swift run --package-path Tools/CodeModeEval codemode-eval plan \
--suite catalog \
--repeat "$repeat_count" \
--request-delay-ms "$request_delay_ms"

- name: Upload LLM eval reports
if: always() && steps.wavelike.outputs.available == 'true'
uses: actions/upload-artifact@v4
with:
name: codemode-llm-eval-reports
path: |
Tools/CodeModeEval/.build/reports/*.json
Tools/CodeModeEval/.build/reports/*.md
if-no-files-found: warn
echo "Live Wavelike-backed LLM execution is disabled in this workflow while the default eval package avoids private dependencies."