
Commit d3d6040

planetf1 and ajbozarth authored
test: agent skills infrastructure and marker taxonomy audit (#727, #728) (#742)
* test: add granularity marker taxonomy infrastructure (#727)

  Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative marker reference. Sync AGENTS.md Section 3 with new taxonomy.

* test: add audit-markers skill for test classification (#728)

  Skill classifies tests as unit/integration/e2e/qualitative using general heuristics (Part 1) and project-specific rules (Part 2). Includes fixture chain tracing guidance, backend detection heuristics, and example file handling. References MARKERS_GUIDE.md for tables.

* chore: add CLAUDE.md and agent skills infrastructure

  Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author meta-skill for cross-compatible skill creation. The audit-markers skill was added in the previous commit.

* test: improve audit-markers skill quality and add resource predicates

  Resolve 8 quality issues from dry-run review of the audit-markers skill:

  - Add behavioural signal detection tables and Step 0 triage procedure for scaling to full-repo audits (grep for backend behaviour, not just existing markers)
  - Clarify unit/integration boundary with scope-of-mocks rule
  - Allow module-level qualitative when every function qualifies
  - Replace resource marker inference with predicate factory pattern
  - Make llm→e2e rule explicit for `# pytest:` comments in examples
  - Redesign report format: 3-tier output (summary table, issues-only detail, batch groups) instead of per-function listing
  - Remove stale infrastructure note (conftest hook already exists)

  Add test/predicates.py with reusable skipif decorators: require_gpu, require_ram, require_gpu_isolation, require_api_key, require_package, require_ollama, require_python.
  Update skill-author with dry-run review step and 4 new authoring guidelines (variable scope, category boundaries, temporal assertions, qualifying absolutes).

  Refs: #727, #728

* chore: remove issue references from audit-markers skill

  Epic/issue numbers are task context, not permanent skill knowledge.

* docs: align MARKERS_GUIDE.md with predicate factory pattern

  MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as the active convention while SKILL.md instructed migration to predicates — a direct conflict that would cause the audit agent to stall or produce incorrect edits.

  - Replace resource markers section with predicate-first documentation
  - Move legacy markers to deprecated subsection (conftest still handles them)
  - Update common patterns example to use predicate imports
  - Add test/predicates.py to related files
  - Add explicit dry-run enforcement to SKILL.md Step 4

  Refs: #727, #728

* fix: validate_skill.py schema mismatch and brittle YAML parsing

  Two bugs:

  - Required `version` at root level but skill-author guide nests it under `metadata` — guaranteed failure on valid skills
  - Naive `content.split('---')` breaks on markdown horizontal rules

  Fix: use yaml.safe_load_all for robust frontmatter extraction, check `name`/`description` at root and `version` under `metadata.version`.
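The safe_load_all approach from the fix above can be sketched in isolation. This is an illustrative snippet with an invented sample document, not the repository's actual validator:

```python
import yaml

skill_md = """---
name: demo-skill
description: Example frontmatter
metadata:
  version: "2026-03-25"
---
Body text.

---
A markdown horizontal rule: content.split('---') would cut here.
"""

# safe_load_all treats only YAML document boundaries as delimiters, so the
# horizontal rule inside the body never corrupts the frontmatter extraction.
frontmatter = next(yaml.safe_load_all(skill_md))
print(frontmatter["metadata"]["version"])  # → 2026-03-25
```

A naive `skill_md.split('---')` would have split on the trailing horizontal rule as well, returning body fragments instead of the frontmatter mapping.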
* fix: migrate deprecated llm markers to e2e, add backend registry, update audit-markers skill

  - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files and 87 example files (comment-based markers)
  - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single source of truth for backend marker registration
  - Register `bedrock` backend marker in conftest.py, pyproject.toml, MARKERS_GUIDE.md, and add missing marker to test_bedrock.py
  - Reclassify test_alora_train.py as integration (was unit); add importorskip for peft dependency
  - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py
  - Update audit-markers skill: report-first default, predicate migration as fix (not recommendation), backend registry gap detection

* feat: add estimate-vram skill and fix MPS VRAM detection

  - New /estimate-vram agent skill that analyses test files to determine correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by tracing model IDs and looking up parameter counts dynamically
  - Fix _gpu_vram_gb() in test/predicates.py to use torch.mps.recommended_max_memory() on macOS MPS instead of returning 0
  - Fix get_system_capabilities() in test/conftest.py with the same MPS path
  - Update test/README.md with predicates table and legacy marker deprecation
  - Add /estimate-vram cross-reference in audit-markers skill

* refactor: fold estimate-vram into audit-markers skill

  VRAM estimation is only useful during marker audits, not standalone. Move the model-tracing and VRAM computation procedure into the audit-markers resource gating section and delete the separate skill.

* docs: drop isolation refs and fix RAM guidance in markers docs

  requires_heavy_ram and requires_gpu_isolation are deprecated with no replacement — models load into VRAM, not system RAM, and GPU isolation is now automatic. require_ram() stays available for genuinely RAM-bound tests but has no current use case.
* docs: add legacy marker guidance for example files in audit-markers skill

* refactor: remove require_ollama() predicate — redundant with backend marker

  The ollama backend marker + conftest auto-skip already handles Ollama availability. No other backend has a dedicated predicate — consistent to let the marker system handle it.

* refactor: replace requires_heavy_ram gate with huggingface backend marker in examples conftest

  The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated VRAM with system RAM. Replace both the collection-time and runtime skip logic to gate on the huggingface backend marker instead, which accurately checks GPU availability.

* refactor: replace ad-hoc bedrock skipif with require_api_key predicate

* refactor: migrate legacy resource markers to predicates

  Replace deprecated pytest markers with typed predicate functions from test/predicates.py across all test files and example files:

  - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates
  - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed)
  - requires_gpu_isolation → removed (GPU isolation is now automatic)
  - requires_api_key → require_api_key("VAR1", "VAR2", ...) with explicit env vars

  Also removes spurious requires_gpu from ollama-backed tests (test_genslot, test_think_budget_forcing, test_component_typing) and adds missing integration marker to test_hook_call_sites.
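The predicate-factory idea behind test/predicates.py can be illustrated with a minimal sketch. This is hypothetical code, not the repository's implementation; only the call shape `require_api_key("VAR1", "VAR2", ...)` comes from the commit message:

```python
import os

import pytest


def require_api_key(*env_vars: str):
    """Return a skipif marker that fires unless every env var is set."""
    missing = [v for v in env_vars if not os.environ.get(v)]
    return pytest.mark.skipif(
        bool(missing),
        reason=f"missing API keys: {', '.join(missing) or 'none'}",
    )


# Usage sketch: decorate a test with the returned marker.
# @require_api_key("WATSONX_API_KEY", "WATSONX_PROJECT_ID")
# def test_watsonx_generate(): ...
```

Because the condition is evaluated when the factory is called, the skip reason can name exactly which variables are absent, unlike a blanket `requires_api_key` marker.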
  VRAM estimates computed from model parameter counts using the bf16 formula (params_B × 2 × 1.2, rounded up to next even GB):

  - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB
  - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom)
  - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB

* test: skip collection gracefully when optional backend deps are missing

  Add pytest.importorskip() guards to 14 test files that previously aborted the entire test run with a ModuleNotFoundError when optional extras were not installed:

  - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools, test_alora_train_integration, test_intrinsics_formatters, test_core, test_guardian, test_rag, test_spans
  - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx
  - ibm_watsonx_ai (mellea[watsonx]): test_watsonx
  - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument, test_transform

  With these guards, `uv run pytest` runs all collectable tests and reports skipped files with a clear reason instead of aborting at the first ImportError.

* test: refine integration marker definition and apply audit fixes

  Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader, InMemorySpanExporter, LoggingHandler) — tests that assert against a real third-party SDK contract, not just multi-component wiring. Updates SKILL.md and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and SDK-boundary signal tables.
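The bf16 sizing rule quoted above is simple enough to express directly. A sketch (the function name is mine; the formula and example figures come from the commit message):

```python
import math


def estimate_vram_gb(params_billion: float) -> int:
    """bf16 inference estimate: params x 2 bytes x 1.2 overhead, rounded up to the next even GB."""
    raw_gb = params_billion * 2 * 1.2
    return math.ceil(raw_gb / 2) * 2


print(estimate_vram_gb(8))  # granite-3.3-8b → 20
print(estimate_vram_gb(7))  # Mistral-7B → 18
print(estimate_vram_gb(3))  # granite-4.0-micro → 8
```

Note the Qwen3-0.6B figure (4 GB) is deliberately above this formula's output to leave vLLM KV-cache headroom, and training workloads use roughly 2x the inference estimate.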
  Applied fixes:

  - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration marker
  - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI test, remove redundant inline skip already covered by require_api_key predicate
  - test/cli/test_alora_train.py: add integration to test_imports_work (real LoraConfig)
  - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered block_network marker
  - test/stdlib/components/docs/test_richdocument.py: add integration pytestmark + e2e/huggingface/qualitative on skipped generation test
  - test/backends/test_openai_ollama.py: note inherited module marker limitation
  - docs/examples/plugins/testing_plugins.py: add `# pytest: unit`

* test: add importorskip guards and optional-dep skip logic for examples

  - test/plugins/test_payloads.py: importorskip("cpex") — skip module when mellea[hooks] not installed instead of failing mid-test with ImportError
  - test/telemetry/test_metrics_plugins.py: same cpex guard
  - docs/examples/conftest.py: extend _check_optional_imports to cover docling, pandas, cpex (mellea.plugins imports), and litellm; also call the check from pytest_pycollect_makemodule so directly-specified files are guarded too
  - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b)

* fix: convert example import errors to skips; add cpex importorskip guards

  Replace per-dep import checks in examples conftest with a runtime approach: ExampleModule (a pytest.Module subclass) is now returned by pytest_pycollect_makemodule for all runnable example files, preventing pytest's default collector from importing them directly. Import errors in the subprocess are caught in ExampleItem.runtest() and converted to skips, so no optional dependency needs to be encoded in conftest. Remove _check_optional_imports entirely — it was hand-maintained and would need updating for every new optional dep.
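The module-level guard pattern used by these commits looks like this. A sketch: the real guards target optional extras such as torch or litellm; `statistics` is used here only so the snippet runs anywhere:

```python
import pytest

# If the module imports, importorskip returns it; if not, pytest raises a
# Skipped outcome so the whole file is reported as skipped, not errored.
statistics = pytest.importorskip("statistics")


def test_mean():
    # The module object returned above is usable like a normal import.
    assert statistics.mean([2, 4]) == 3
```

This is what turns a hard ModuleNotFoundError at collection time into a skip with a readable reason in the test report.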
  Also:

  - test/plugins/test_payloads.py: importorskip("cpex")
  - test/telemetry/test_metrics_plugins.py: importorskip("cpex")
  - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b)

* test: skip OTel-dependent tests when opentelemetry not installed

  Locally running without mellea[telemetry] caused three tests to fail with assertion errors rather than skip cleanly. Add importorskip at module level for test_tracing.py and a skipif decorator for the single OTel-gated test in test_astream_exception_propagation.py.

* fix: use conservative heuristic for Apple Silicon GPU memory detection

  Metal's recommendedMaxWorkingSetSize is a static device property (~75% of total RAM) that ignores current system load. Replace it with min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted for. Also removes the torch dependency for GPU detection on Apple Silicon — sysctl hw.memsize is used directly. CUDA path on Linux is unchanged.

* test: add training memory signals to audit-markers skill; bump alora VRAM gate

  Training tests need ~2x the base model inference memory (activations, optimizer states, gradient temporaries). The skill now detects training signals (train_model, Trainer, epochs=) and checks that require_gpu min_vram_gb uses the 2x rule. Bump test_alora_train_integration from min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak + headroom) so it skips correctly on 32 GB Apple Silicon under typical load.

* fix: cache system capabilities result in examples conftest

  get_system_capabilities() was caching the function reference, not the result — causing the Ollama socket check (1 s timeout) and full capability detection to re-run for every example file during collection (~102 times). Cache the result dict instead so detection runs exactly once.
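The revised Apple Silicon heuristic is easy to state as code. A sketch under the commit's stated constants (the 0.75 factor and 16 GB reserve); the function and parameter names are mine:

```python
def usable_gpu_memory_gb(total_ram_gb: float) -> float:
    """Conservative unified-memory estimate for Apple Silicon.

    Metal reports ~75% of total RAM as the working-set ceiling regardless
    of current load, so cap at 75% but always reserve 16 GB for the OS,
    IDE, and desktop apps.
    """
    return min(total_ram_gb * 0.75, total_ram_gb - 16)


print(usable_gpu_memory_gb(32))  # → 16.0 (the 16 GB reserve dominates)
print(usable_gpu_memory_gb(64))  # → 48.0 (the 75% cap dominates)
```

On a 32 GB machine this is what makes a min_vram_gb=20 gate skip correctly under typical desktop load, as described in the alora bump above.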
* fix: cache get_system_capabilities() result in test/conftest.py

  The function was called once per test in pytest_runtest_setup (325+ calls) and once at collection in pytest_collection_modifyitems, each time re-running the Ollama socket check (1 s timeout when down), sysctl subprocess, and psutil query. Cache the result after the first call.

* fix: flush MPS memory pool in intrinsic test fixture teardown

  torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS allocator pool occupied after each module fixture tears down. The next module then loads a fresh model into an already-pressured pool, causing the process RSS to grow unboundedly across modules. Both calls are now guarded so CUDA and MPS runs each get the correct flush.

* fix: load LocalHFBackend model in config dtype to prevent float32 upcasting

  AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and leaving float32 remnants in the allocator pool. torch_dtype="auto" respects the model config (bfloat16 for Granite) for both the CPU load and the device transfer.
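The caching bug class fixed above (caching the function object rather than its result) is avoided with a result cache. A sketch, not the project's code; the probe and returned fields are stand-ins:

```python
import functools


@functools.cache  # memoises the returned dict, so detection runs exactly once
def get_system_capabilities() -> dict:
    # Stand-in for the real probes (Ollama socket check, sysctl, psutil).
    print("detecting...")
    return {"ollama_running": False, "ram_gb": 32}


get_system_capabilities()  # prints "detecting..."
get_system_capabilities()  # cache hit: no second probe, no second print
```

Storing the function reference instead of its return value gives a cache that is always "warm" but never short-circuits the call, which is exactly the ~102-call collection slowdown described above.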
* test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates

  - Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR #721
  - Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py
  - Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak
  - Add adapter accumulation signals to audit-markers skill
  - Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references

* test: migrate legacy markers in test_intrinsics_formatters.py

  Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu, @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new marker taxonomy (#727/#728). VRAM gate set to 12 GB matching the 3B-parameter model loaded across the parametrized test cases.
* test: add integration marker to test_dependency_isolation.py

* docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered test runs

* fix: suppress mypy name-defined for torch.Tensor after importorskip change

* fix: ruff format huggingface.py from_pretrained args

* fix: ruff format test_watsonx.py and test_huggingface_tools.py

* refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation markers and handlers

* refactor: remove --ignore-*-check override flags from conftest

* refactor: remove requires_api_key marker; fix api backend group to match watsonx+bedrock markers

* fix: address review

  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* test: mark test_image_block_in_instruction as qualitative

* chore: commit .claude/settings.json with skillLocations for skill discovery

* docs: broaden audit-markers skill description to cover diagnostic use cases

* docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
1 parent 243a161 commit d3d6040

152 files changed

Lines changed: 2044 additions & 1127 deletions


.agents/skills/audit-markers/SKILL.md

Lines changed: 902 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
---
name: skill-author
description: >
  Draft, validate, and install new agent skills. Use when asked to create a new
  skill, automate a workflow, or add a capability. Produces cross-compatible
  SKILL.md files that work in both Claude Code and IBM Bob.
argument-hint: "[skill-name]"
compatibility: "Claude Code, IBM Bob"
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file]
---

# Skill Authoring Meta-Skill

Create new agent skills that work across Claude Code (CLI/IDE) and IBM Bob.

## Skill Location

Skills live under `.agents/skills/<name>/SKILL.md`.

Discovery configuration varies by tool:
- **Claude Code:** Add `"skillLocations": [".agents/skills"]` to `.claude/settings.json`.
  Without this, Claude Code looks in `.claude/skills/` by default.
- **IBM Bob:** Discovers `.agents/skills/` natively per agentskills.io convention.

Both tools read the same `SKILL.md` format. Use the frontmatter schema below
to maximise compatibility.

## Workflow

1. **Name the skill** — kebab-case, max 64 chars (e.g. `api-tester`, `audit-markers`).

2. **Scaffold the directory:**

   ```
   .agents/skills/<name>/
   ├── SKILL.md     # Required — frontmatter + instructions
   ├── scripts/     # Optional — helper scripts
   └── templates/   # Optional — output templates
   ```

3. **Write SKILL.md** — YAML frontmatter + markdown body (see schema below).

4. **Dry-run review** — mentally execute the skill against a realistic scenario
   before finalising. Walk through the procedure on a concrete example (a real
   file in the repo, not a hypothetical) and check for:
   - **Scaling gaps:** Does the procedure work for 1 file AND 100 files? If the
     skill accepts a directory or glob, it needs a triage strategy (e.g., "grep
     first to find candidates, then deep-read only files with issues") — not
     just "read every file fully."
   - **Boundary ambiguity:** If the skill defines categories or classifications,
     test the boundaries between adjacent categories with a real example. The
     edges are where agents will disagree or ask the user. Sharpen definitions
     until two agents reading the same test would classify it the same way.
   - **Stale references:** If the skill describes project state ("this hook needs
     to be added", "this marker is not yet registered"), verify those statements
     are still true. Embed checks ("read conftest.py to confirm") rather than
     assertions that rot.
   - **Output format at scale:** Run the report template mentally against the
     largest expected input. A per-function report for 5 files is fine; for 165
     files it's unusable. Design output for the largest scope — summary table
     first, per-item detail only where issues exist.
   - **Format coverage:** If the skill operates on multiple input formats (e.g.,
     `pytestmark` lists AND `# pytest:` comments), verify each format is
     explicitly addressed in the procedure. Implicit coverage causes agents to
     skip or guess.
   - **Rigid rules:** If you wrote "always X" or "never Y", find the edge case
     where the rule is wrong. Add the escape hatch. E.g., "per-function only"
     should say "module-level is acceptable when every function qualifies."

5. **Validate:**
   - Check the skill is discoverable: list files in `.agents/skills/`.
   - Confirm no frontmatter warnings from the IDE.
   - Verify the skill does not conflict with existing skills or `AGENTS.md`.

## SKILL.md Frontmatter Schema

Use only fields from the **cross-compatible** set to avoid IDE warnings.

### Cross-compatible fields (use these)

| Field | Type | Purpose |
|-------|------|---------|
| `name` | string | Kebab-case identifier. Becomes the `/slash-command`. Max 64 chars. |
| `description` | string | What the skill does and when to trigger it. Be specific — agents use this to decide whether to invoke the skill automatically. |
| `argument-hint` | string | Autocomplete hint. E.g. `"[file] [--dry-run]"`, `"[issue-number]"`. |
| `compatibility` | string | Which tools support this skill. E.g. `"Claude Code, IBM Bob"`. |
| `disable-model-invocation` | boolean | `true` = manual `/name` only, no auto-invocation. |
| `user-invocable` | boolean | `false` = hidden from `/` menu. Use for background knowledge skills. |
| `license` | string | SPDX identifier if publishing. E.g. `"Apache-2.0"`. |
| `metadata` | object | Free-form key-value pairs for tool-specific or custom fields. |

### Tool-specific fields (put under `metadata`)

These are useful but not universally supported — nest them under `metadata`:

```yaml
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file] # Bob/agentskills.io
```

Claude Code's `allowed-tools` and `context`/`agent` fields are recognised by
Claude Code but may trigger warnings in Bob's validator. If needed, add them
to `metadata` or accept the warnings.

### Example frontmatter

```yaml
---
name: my-skill
description: >
  Does X when Y. Use when asked to Z.
argument-hint: "[target] [--flag]"
compatibility: "Claude Code, IBM Bob"
metadata:
  version: "2026-03-25"
  capabilities: [bash, read_file, write_file]
---
```

## SKILL.md Body Structure

After frontmatter, write clear markdown instructions the agent follows:

1. **Context section** — what the skill operates on, key reference files.
2. **Procedure** — numbered steps the agent follows. Be explicit about decisions and edge cases.
3. **Rules / constraints** — hard rules the agent must not break.
4. **Output format** — what the agent should produce (report, edits, summary).

### Guidelines

- **Be specific.** Vague instructions produce inconsistent results across models.
  "Check if markers are correct" is worse than "Compare the test's assertions
  to the qualitative decision rule in section 3."
- **Reference project files.** Point to docs, configs, and examples by relative
  path so the agent can read them. E.g. "See `test/MARKERS_GUIDE.md` for the
  full marker taxonomy."
- **Declare scope boundaries.** State what the skill does NOT do. E.g. "This
  skill does not modify conftest.py — flag infrastructure issues as notes."
- **Use `$ARGUMENTS`** for user input. `$ARGUMENTS` is the full argument string;
  `$1`, `$2` etc. are positional.
- **Keep SKILL.md under 500 lines.** Use supporting files for large reference
  material (link to them from the body).
- **Portability:** use relative paths from the repo root, never absolute paths.
- **Formatting:** use YYYY-MM-DD for dates, 24-hour clock for times, metric units.
- **Design for variable scope.** If the skill can operate on a single file or an
  entire directory, provide a triage strategy for the large case. Agents given
  "audit everything" with no prioritisation will either read every file (slow)
  or skip files (incomplete).
- **Sharpen category boundaries.** When defining classifications, the boundary
  between adjacent categories causes the most disagreement. Add a "key
  distinction from X" sentence for each pair of adjacent tiers.
- **Avoid temporal assertions.** Don't write "this conftest hook needs to be
  added" — write "check whether conftest.py already has the hook." State that
  goes stale silently is worse than no guidance at all.
- **Qualify absolutes.** "Always X" and "never Y" rules need escape hatches for
  the common exception. E.g., "per-function only — unless every function in the
  file qualifies, in which case module-level is acceptable."
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
"""Validate SKILL.md frontmatter for agent skills."""

import json
import os
import sys

import yaml


def validate_skill(skill_path: str) -> dict:
    """Check that a skill directory has valid SKILL.md with required frontmatter keys."""
    skill_file = os.path.join(skill_path, "SKILL.md")

    if not os.path.exists(skill_file):
        return {"status": "error", "message": "Missing SKILL.md"}

    try:
        with open(skill_file) as f:
            # safe_load_all handles the --- delimiters correctly and won't
            # break on markdown horizontal rules later in the file.
            frontmatter = next(yaml.safe_load_all(f))

        if not isinstance(frontmatter, dict):
            return {"status": "error", "message": "Frontmatter is not a YAML mapping"}

        # Root-level required keys
        for key in ("name", "description"):
            if key not in frontmatter:
                return {"status": "error", "message": f"Missing root key: {key}"}

        # version lives under metadata (per skill-author guide)
        meta = frontmatter.get("metadata")
        if not isinstance(meta, dict) or "version" not in meta:
            return {
                "status": "error",
                "message": "Missing nested key: metadata.version",
            }

        return {"status": "success", "data": frontmatter}

    except yaml.YAMLError as e:
        return {"status": "error", "message": f"Invalid YAML: {e}"}
    except StopIteration:
        return {"status": "error", "message": "No YAML frontmatter found"}


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 validate_skill.py <skill-directory>", file=sys.stderr)
        sys.exit(1)
    result = validate_skill(sys.argv[1])
    print(json.dumps(result))

.claude/settings.json

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
{
  "skillLocations": [".agents/skills"]
}

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -451,7 +451,8 @@ pyrightconfig.json
 
 # AI agent configs
 .bob/
-.claude/
+.claude/*
+!.claude/settings.json
 
 # Generated API documentation (built by tooling/docs-autogen/)
 docs/docs/api/

AGENTS.md

Lines changed: 36 additions & 41 deletions
@@ -25,7 +25,6 @@ uv run pytest # Default: qualitative tests, skip slow te
 uv run pytest -m "not qualitative" # Fast tests only (~2 min)
 uv run pytest -m slow # Run only slow tests (>5 min)
 uv run pytest --co -q # Run ALL tests including slow (bypass config)
-uv run pytest --isolate-heavy # Enable GPU process isolation (opt-in)
 uv run ruff format . # Format code
 uv run ruff check . # Lint code
 uv run mypy . # Type check
@@ -44,49 +43,44 @@ uv run mypy . # Type check
 | `cli/` | CLI commands (`m serve`, `m alora`, `m decompose`, `m eval`) |
 | `test/` | All tests (run from repo root) |
 | `docs/examples/` | Example code (run as tests via pytest) |
+| `.agents/skills/` | Agent skills ([agentskills.io](https://agentskills.io) standard) |
 | `scratchpad/` | Experiments (git-ignored) |
 
 ## 3. Test Markers
-All tests and examples use markers to indicate requirements. The test infrastructure automatically skips tests based on system capabilities.
-
-**Backend Markers:**
-- `@pytest.mark.ollama` — Requires Ollama running (local, lightweight)
-- `@pytest.mark.huggingface` — Requires HuggingFace backend (local, heavy)
-- `@pytest.mark.vllm` — Requires vLLM backend (local, GPU required)
-- `@pytest.mark.openai` — Requires OpenAI API (requires API key)
-- `@pytest.mark.watsonx` — Requires Watsonx API (requires API key)
-- `@pytest.mark.litellm` — Requires LiteLLM backend
-
-**Capability Markers:**
-- `@pytest.mark.requires_gpu` — Requires GPU
-- `@pytest.mark.requires_heavy_ram` — Requires 48GB+ RAM
-- `@pytest.mark.requires_api_key` — Requires external API keys
-- `@pytest.mark.qualitative` — LLM output quality tests (skipped in CI via `CICD=1`)
-- `@pytest.mark.llm` — Makes LLM calls (needs at least Ollama)
-- `@pytest.mark.slow` — Tests taking >5 minutes (skipped via `SKIP_SLOW=1`)
-
-**Execution Strategy Markers:**
-- `@pytest.mark.requires_gpu_isolation` — Requires OS-level process isolation to clear CUDA memory (use with `--isolate-heavy` or `CICD=1`)
-
-**Examples in `docs/examples/`** use comment-based markers for clean code:
+Tests use a four-tier granularity system (`unit`, `integration`, `e2e`, `qualitative`) plus backend and resource markers. The `unit` marker is auto-applied by conftest — never write it explicitly. The `llm` marker is deprecated; use `e2e` instead.
+
+See **[test/MARKERS_GUIDE.md](test/MARKERS_GUIDE.md)** for the full marker reference (tier definitions, backend markers, resource gates, auto-skip logic, common patterns).
+
+**Examples in `docs/examples/`** use comment-based markers:
 ```python
-# pytest: ollama, llm, requires_heavy_ram
+# pytest: e2e, ollama, qualitative
 """Example description..."""
-
-# Your clean example code here
 ```
 
-Tests/examples automatically skip if system lacks required resources. Heavy examples (e.g., HuggingFace) are skipped during collection to prevent memory issues.
+⚠️ Don't add `qualitative` to trivial tests — keep the fast loop fast.
+⚠️ Mark tests taking >1 minute with `slow`.
+
+## 4. Agent Skills
+
+Skills live in `.agents/skills/` following the [agentskills.io](https://agentskills.io) open standard. Each skill is a directory with a `SKILL.md` file (YAML frontmatter + markdown instructions).
+
+**Tool discovery:**
 
-**Default behavior:**
-- `uv run pytest` skips slow tests (>5 min) but runs qualitative tests
-- Use `pytest -m "not qualitative"` for fast tests only (~2 min)
-- Use `pytest -m slow` or `pytest` (without config) to include slow tests
+| Tool | Project skills | Global skills | Config needed |
+| ----------------- | ----------------- | ------------------- | ------------------------------------------------------------------ |
+| Claude Code | `.agents/skills/` | `~/.claude/skills/` | `"skillLocations": [".agents/skills"]` in `.claude/settings.json` |
+| IBM Bob | `.bob/skills/` | `~/.bob/skills/` | Symlink: `.bob/skills` → `.agents/skills` |
+| VS Code / Copilot | `.agents/skills/` | | None (auto-discovered) |
 
-⚠️ Don't add `qualitative` to trivial tests—keep the fast loop fast.
-⚠️ Mark tests taking >5 minutes with `slow` (e.g., dataset loading, extensive evaluations).
+**Bob users:** create the symlink once per clone:
 
-## 4. Coding Standards
+```bash
+mkdir -p .bob && ln -s ../.agents/skills .bob/skills
+```
+
+**Available skills:** `/audit-markers`, `/skill-author`
+
+## 5. Coding Standards
 - **Types required** on all core functions
 - **Docstrings are prompts** — be specific, the LLM reads them
 - **Google-style docstrings** — `Args:` on the **class docstring only**; `__init__` gets a single summary sentence. Add `Attributes:` only when a stored value differs in type/behaviour from its constructor input (type transforms, computed values, class constants). See CONTRIBUTING.md for a full example.
@@ -96,37 +90,38 @@ Tests/examples automatically skip if system lacks required resources. Heavy exam
 - **Friendly Dependency Errors**: Wraps optional backend imports in `try/except ImportError` with a helpful message (e.g., "Please pip install mellea[hf]"). See `mellea/stdlib/session.py` for examples.
 - **Backend telemetry fields**: All backends must populate `mot.usage` (dict with `prompt_tokens`, `completion_tokens`, `total_tokens`), `mot.model` (str), and `mot.provider` (str) in their `post_processing()` method. Metrics are automatically recorded by `TokenMetricsPlugin` — don't add manual `record_token_usage_metrics()` calls.
 
-## 5. Commits & Hooks
+## 6. Commits & Hooks
 [Angular format](https://github.com/angular/angular/blob/main/CONTRIBUTING.md#commit): `feat:`, `fix:`, `docs:`, `test:`, `refactor:`, `release:`
 
 Pre-commit runs: ruff, mypy, uv-lock, codespell
 
-## 6. Timing
+## 7. Timing
 > **Don't cancel**: `pytest` (full) and `pre-commit --all-files` may take minutes. Canceling mid-run can corrupt state.
 
-## 7. Common Issues
+## 8. Common Issues
 | Problem | Fix |
 |---------|-----|
 | `ComponentParseError` | Add examples to docstring |
 | `uv.lock` out of sync | Run `uv sync` |
 | Ollama refused | Run `ollama serve` |
 | Telemetry import errors | Run `uv sync` to install OpenTelemetry deps |
 
-## 8. Self-Review (before notifying user)
+## 9. Self-Review (before notifying user)
 1. `uv run pytest test/ -m "not qualitative"` passes?
 2. `ruff format` and `ruff check` clean?
 3. New functions typed with concise docstrings?
 4. Unit tests added for new functionality?
 5. Avoided over-engineering?
 
-## 9. Writing Tests
+## 10. Writing Tests
+
 - Place tests in `test/` mirroring source structure
 - Name files `test_*.py` (required for pydocstyle)
 - Use `gh_run` fixture for CI-aware tests (see `test/conftest.py`)
 - Mark tests checking LLM output quality with `@pytest.mark.qualitative`
 - If a test fails, fix the **code**, not the test (unless the test was wrong)
 
-## 10. Writing Docs
+## 11. Writing Docs
 
 If you are modifying or creating pages under `docs/docs/`, follow the writing
 conventions in [`docs/docs/guide/CONTRIBUTING.md`](docs/docs/guide/CONTRIBUTING.md).
@@ -144,7 +139,7 @@ Key rules that differ from typical Markdown habits:
   mellea source; mark forward-looking content with `> **Coming soon:**`
 - **No visible TODOs** — if content is missing, open a GitHub issue instead
 
-## 11. Feedback Loop
+## 12. Feedback Loop
 
 Found a bug, workaround, or pattern? Update the docs:
CLAUDE.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Claude Code Directives
@AGENTS.md

## Execution
- If instructed to create a new capability, strictly trigger the `skill-author` meta-skill to ensure cross-compatibility.
