
Commit d971776

csbobby, planetf1, and ajbozarth authored
feat: (m decomp) M Decompose Readme and Docstring Updates (#767)
* decompose docstring
* pipeline docstring
* logging docstring
* decomp README
* merge docstrings
* clean: pre-commit
* decomp guide
* fix: subtask tag
* clean: pre-commit
* clean: README
* merge docstrings
* clean: pre-commit
* decomp guide
* fix: subtask tag
* clean: pre-commit
* test: agent skills infrastructure and marker taxonomy audit (#727, #728) (#742)
* test: add granularity marker taxonomy infrastructure (#727)

  Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit
  auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym
  for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative
  marker reference. Sync AGENTS.md Section 3 with new taxonomy.

* test: add audit-markers skill for test classification (#728)

  Skill classifies tests as unit/integration/e2e/qualitative using general
  heuristics (Part 1) and project-specific rules (Part 2). Includes fixture
  chain tracing guidance, backend detection heuristics, and example file
  handling. References MARKERS_GUIDE.md for tables.

* chore: add CLAUDE.md and agent skills infrastructure

  Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author
  meta-skill for cross-compatible skill creation. The audit-markers skill was
  added in the previous commit.

* test: improve audit-markers skill quality and add resource predicates

  Resolve 8 quality issues from dry-run review of the audit-markers skill:

  - Add behavioural signal detection tables and Step 0 triage procedure for
    scaling to full-repo audits (grep for backend behaviour, not just existing
    markers)
  - Clarify unit/integration boundary with scope-of-mocks rule
  - Allow module-level qualitative when every function qualifies
  - Replace resource marker inference with predicate factory pattern
  - Make llm→e2e rule explicit for # pytest: comments in examples
  - Redesign report format: 3-tier output (summary table, issues-only detail,
    batch groups) instead of per-function listing
  - Remove stale infrastructure note (conftest hook already exists)

  Add test/predicates.py with reusable skipif decorators: require_gpu,
  require_ram, require_gpu_isolation, require_api_key, require_package,
  require_ollama, require_python.

  Update skill-author with dry-run review step and 4 new authoring guidelines
  (variable scope, category boundaries, temporal assertions, qualifying
  absolutes).

  Refs: #727, #728

* chore: remove issue references from audit-markers skill

  Epic/issue numbers are task context, not permanent skill knowledge.

* docs: align MARKERS_GUIDE.md with predicate factory pattern

  MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as
  the active convention while SKILL.md instructed migration to predicates — a
  direct conflict that would cause the audit agent to stall or produce
  incorrect edits.

  - Replace resource markers section with predicate-first documentation
  - Move legacy markers to deprecated subsection (conftest still handles them)
  - Update common patterns example to use predicate imports
  - Add test/predicates.py to related files
  - Add explicit dry-run enforcement to SKILL.md Step 4

  Refs: #727, #728

* fix: validate_skill.py schema mismatch and brittle YAML parsing

  Two bugs:

  - Required `version` at root level but skill-author guide nests it under
    `metadata` — guaranteed failure on valid skills
  - Naive `content.split('---')` breaks on markdown horizontal rules

  Fix: use yaml.safe_load_all for robust frontmatter extraction, check
  `name`/`description` at root and `version` under `metadata.version`.
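The validate_skill.py fix is easy to picture with a short sketch. Assuming a SKILL.md layout like the one described (the file contents below are illustrative), `yaml.safe_load_all` reads the leading frontmatter as its first document and never trips over horizontal rules in the body:

```python
import yaml

SKILL_MD = """\
---
name: audit-markers
description: Classify tests into unit/integration/e2e tiers.
metadata:
  version: 1.0.0
---

# Body text; a horizontal rule below would break content.split('---'):

---
"""

def load_frontmatter(text: str) -> dict:
    # The leading `---` starts the first YAML document; everything after the
    # closing `---` belongs to later documents we never materialize, because
    # safe_load_all parses lazily, one document at a time.
    first_doc = next(yaml.safe_load_all(text), None)
    return first_doc if isinstance(first_doc, dict) else {}

fm = load_frontmatter(SKILL_MD)
assert fm["name"] == "audit-markers"
assert fm["metadata"]["version"] == "1.0.0"
```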
* fix: migrate deprecated llm markers to e2e, add backend registry, update
  audit-markers skill

  - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files
    and 87 example files (comment-based markers)
  - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single
    source of truth for backend marker registration
  - Register `bedrock` backend marker in conftest.py, pyproject.toml,
    MARKERS_GUIDE.md, and add missing marker to test_bedrock.py
  - Reclassify test_alora_train.py as integration (was unit); add importorskip
    for peft dependency
  - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py
  - Update audit-markers skill: report-first default, predicate migration as
    fix (not recommendation), backend registry gap detection

* feat: add estimate-vram skill and fix MPS VRAM detection

  - New /estimate-vram agent skill that analyses test files to determine
    correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by
    tracing model IDs and looking up parameter counts dynamically
  - Fix _gpu_vram_gb() in test/predicates.py to use
    torch.mps.recommended_max_memory() on macOS MPS instead of returning 0
  - Fix get_system_capabilities() in test/conftest.py with same MPS path
  - Update test/README.md with predicates table and legacy marker deprecation
  - Add /estimate-vram cross-reference in audit-markers skill

* refactor: fold estimate-vram into audit-markers skill

  VRAM estimation is only useful during marker audits, not standalone. Move the
  model-tracing and VRAM computation procedure into the audit-markers resource
  gating section and delete the separate skill.

* docs: drop isolation refs and fix RAM guidance in markers docs

  requires_heavy_ram and requires_gpu_isolation are deprecated with no
  replacement — models load into VRAM not system RAM, and GPU isolation is now
  automatic. require_ram() stays available for genuinely RAM-bound tests but
  has no current use case.

* docs: add legacy marker guidance for example files in audit-markers skill

* refactor: remove require_ollama() predicate — redundant with backend marker

  The ollama backend marker + conftest auto-skip already handles Ollama
  availability. No other backend has a dedicated predicate — consistent to let
  the marker system handle it.

* refactor: replace requires_heavy_ram gate with huggingface backend marker in
  examples conftest

  The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated
  VRAM with system RAM. Replace both the collection-time and runtime skip logic
  to gate on the huggingface backend marker instead, which accurately checks
  GPU availability.

* refactor: replace ad-hoc bedrock skipif with require_api_key predicate
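A minimal sketch of the predicate factory pattern these commits describe (the detection logic, thresholds, and environment variable names below are illustrative, not the actual contents of test/predicates.py):

```python
import os

import pytest

def require_api_key(*env_vars: str):
    """Skip the test unless every named environment variable is set."""
    missing = [v for v in env_vars if not os.environ.get(v)]
    return pytest.mark.skipif(
        bool(missing),
        reason=f"missing API key env vars: {', '.join(missing)}",
    )

def require_gpu(min_vram_gb: float = 0.0):
    """Skip the test unless a GPU with at least min_vram_gb of VRAM is present."""

    def _detected_vram_gb() -> float:
        try:
            import torch

            if torch.cuda.is_available():
                return torch.cuda.get_device_properties(0).total_memory / 1024**3
        except ImportError:
            pass
        return 0.0

    return pytest.mark.skipif(
        _detected_vram_gb() < min_vram_gb,
        reason=f"requires a GPU with >= {min_vram_gb} GB VRAM",
    )

@require_api_key("WATSONX_API_KEY", "WATSONX_PROJECT_ID")  # hypothetical vars
def test_remote_backend(): ...
```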
* refactor: migrate legacy resource markers to predicates

  Replace deprecated pytest markers with typed predicate functions from
  test/predicates.py across all test files and example files:

  - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates
  - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed)
  - requires_gpu_isolation → removed (GPU isolation is now automatic)
  - requires_api_key → require_api_key("VAR1", "VAR2", ...) with explicit env vars

  Also removes spurious requires_gpu from ollama-backed tests (test_genslot,
  test_think_budget_forcing, test_component_typing) and adds missing
  integration marker to test_hook_call_sites.

  VRAM estimates computed from model parameter counts using bf16 formula
  (params_B × 2 × 1.2, rounded up to next even GB):

  - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB
  - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom)
  - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB
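The bf16 formula above can be checked mechanically; a small worked example reproducing the quoted gates:

```python
import math

def bf16_vram_gb(params_b: float) -> int:
    """Estimate VRAM for bf16 inference: params_B * 2 bytes/param * 1.2
    overhead, rounded up to the next even GB."""
    return math.ceil(params_b * 2 * 1.2 / 2) * 2

assert bf16_vram_gb(8) == 20  # granite-3.3-8b
assert bf16_vram_gb(7) == 18  # Mistral-7B
assert bf16_vram_gb(3) == 8   # granite-4.0-micro / granite-4.0-h-micro (3B)
# Qwen3-0.6B is gated at 4 GB rather than the formula's 2 GB, as a
# conservative allowance for vLLM KV-cache headroom.
```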
* test: skip collection gracefully when optional backend deps are missing

  Add pytest.importorskip() guards to 14 test files that previously aborted the
  entire test run with a ModuleNotFoundError when optional extras were not
  installed:

  - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools,
    test_alora_train_integration, test_intrinsics_formatters, test_core,
    test_guardian, test_rag, test_spans
  - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx
  - ibm_watsonx_ai (mellea[watsonx]): test_watsonx
  - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument,
    test_transform

  With these guards, `uv run pytest` runs all collectable tests and reports
  skipped files with a clear reason instead of aborting at first ImportError.

* test: refine integration marker definition and apply audit fixes

  Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader,
  InMemorySpanExporter, LoggingHandler) — tests that assert against a real
  third-party SDK contract, not just multi-component wiring. Updates SKILL.md
  and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and
  SDK-boundary signal tables.

  Applied fixes:

  - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration
    marker
  - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI
    test, remove redundant inline skip already covered by require_api_key
    predicate
  - test/cli/test_alora_train.py: add integration to test_imports_work (real
    LoraConfig)
  - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered
    block_network marker
  - test/stdlib/components/docs/test_richdocument.py: add integration
    pytestmark + e2e/huggingface/qualitative on skipped generation test
  - test/backends/test_openai_ollama.py: note inherited module marker limitation
  - docs/examples/plugins/testing_plugins.py: add # pytest: unit

* test: add importorskip guards and optional-dep skip logic for examples

  - test/plugins/test_payloads.py: importorskip("cpex") — skip module when
    mellea[hooks] not installed instead of failing mid-test with ImportError
  - test/telemetry/test_metrics_plugins.py: same cpex guard
  - docs/examples/conftest.py: extend _check_optional_imports to cover docling,
    pandas, cpex (mellea.plugins imports), and litellm; also call the check
    from pytest_pycollect_makemodule so directly-specified files are guarded too
  - docs/examples/image_text_models/README.md: add Prerequisites section
    listing models to pull (granite3.2-vision, qwen2.5vl:7b)

* fix: convert example import errors to skips; add cpex importorskip guards

  Replace per-dep import checks in examples conftest with a runtime approach:
  ExampleModule (a pytest.Module subclass) is now returned by
  pytest_pycollect_makemodule for all runnable example files, preventing
  pytest's default collector from importing them directly. Import errors in
  the subprocess are caught in ExampleItem.runtest() and converted to skips,
  so no optional dependency needs to be encoded in conftest. Remove
  _check_optional_imports entirely — it was hand-maintained and would need
  updating for every new optional dep.

  Also:

  - test/plugins/test_payloads.py: importorskip("cpex")
  - test/telemetry/test_metrics_plugins.py: importorskip("cpex")
  - docs/examples/image_text_models/README.md: add Prerequisites section
    listing models to pull (granite3.2-vision, qwen2.5vl:7b)

* test: skip OTel-dependent tests when opentelemetry not installed

  Locally running without mellea[telemetry] caused three tests to fail with
  assertion errors rather than skip cleanly. Add importorskip at module level
  for test_tracing.py and a skipif decorator for the single OTel-gated test in
  test_astream_exception_propagation.py.

* fix: use conservative heuristic for Apple Silicon GPU memory detection

  Metal's recommendedMaxWorkingSetSize is a static device property (~75% of
  total RAM) that ignores current system load. Replace it with
  min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted
  for. Also removes the torch dependency for GPU detection on Apple Silicon —
  sysctl hw.memsize is used directly. CUDA path on Linux is unchanged.

* test: add training memory signals to audit-markers skill; bump alora VRAM gate

  Training tests need ~2x the base model inference memory (activations,
  optimizer states, gradient temporaries). The skill now detects training
  signals (train_model, Trainer, epochs=) and checks that require_gpu
  min_vram_gb uses the 2x rule. Bump test_alora_train_integration from
  min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak +
  headroom) so it skips correctly on 32 GB Apple Silicon under typical load.

* fix: cache system capabilities result in examples conftest

  get_system_capabilities() was caching the function reference, not the
  result — causing the Ollama socket check (1s timeout) and full capability
  detection to re-run for every example file during collection (~102 times).
  Cache the result dict instead so detection runs exactly once.

* fix: cache get_system_capabilities() result in test/conftest.py

  The function was called once per test in pytest_runtest_setup (325+ calls)
  and once at collection in pytest_collection_modifyitems, each time
  re-running the Ollama socket check (1s timeout when down), sysctl
  subprocess, and psutil query. Cache the result after the first call.

* fix: flush MPS memory pool in intrinsic test fixture teardown

  torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS
  allocator pool occupied after each module fixture tears down. The next
  module then loads a fresh model into an already-pressured pool, causing the
  process RSS to grow unboundedly across modules. Both calls are now guarded
  so CUDA and MPS runs each get the correct flush.

* fix: load LocalHFBackend model in config dtype to prevent float32 upcasting

  AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in
  float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and
  leaving float32 remnants in the allocator pool. torch_dtype="auto" respects
  the model config (bfloat16 for Granite) for both the CPU load and the device
  transfer.
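Two of the memory fixes above reduce to short code shapes; a sketch assuming a torch/transformers environment (the model ID is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

def flush_accelerator_cache() -> None:
    """Guarded teardown flush: torch.cuda.empty_cache() is a no-op on MPS."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()  # releases the MPS allocator pool

# torch_dtype="auto" loads weights in the dtype named by the model config
# (bfloat16 for Granite) instead of float32-on-CPU followed by a device cast.
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-4.0-micro",  # illustrative model ID
    torch_dtype="auto",
)
```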
* test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates

  - Remove --isolate-heavy flag, _run_heavy_modules_isolated(),
    pytest_collection_finish(), and require_gpu_isolation() predicate —
    superseded by cleanup_gpu_backend() from PR #721
  - Remove dead requires_gpu/requires_api_key branches from
    docs/examples/conftest.py
  - Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag,
    test_spans — correct gate for 3B base model (6 GB) + adapters + inference
    overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak
  - Add adapter accumulation signals to audit-markers skill
  - Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove
    --isolate-heavy references

* test: migrate legacy markers in test_intrinsics_formatters.py

  Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu,
  @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with
  @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new
  marker taxonomy (#727/#728). VRAM gate set to 12 GB matching the
  3B-parameter model loaded across the parametrized test cases.

* test: add integration marker to test_dependency_isolation.py

* docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered
  test runs

* fix: suppress mypy name-defined for torch.Tensor after importorskip change

* fix: ruff format huggingface.py from_pretrained args

* fix: ruff format test_watsonx.py and test_huggingface_tools.py

* refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation
  markers and handlers

* refactor: remove --ignore-*-check override flags from conftest

* refactor: remove requires_api_key marker; fix api backend group to match
  watsonx+bedrock markers

* fix: address review

  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>

* test: mark test_image_block_in_instruction as qualitative

* chore: commit .claude/settings.json with skillLocations for skill discovery

* docs: broaden audit-markers skill description to cover diagnostic use cases

* docs: add diagnostic mode to audit-markers skill for troubleshooting
  skip/resource issues

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>

* clean: Readme

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Co-authored-by: csbobby <phdbobbywu.cs@gmail.com>
Co-authored-by: Nigel Jones <jonesn@uk.ibm.com>
Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
1 parent 951145d commit d971776

8 files changed

Lines changed: 476 additions & 166 deletions


cli/decompose/README.md

Lines changed: 99 additions & 0 deletions
@@ -1 +1,100 @@

# Mellea Decomp

This automatic pipeline demonstrates **task decomposition and execution** built with *Mellea generative programs*.

Instead of solving a complex task with a single prompt, the system first **decomposes the task into subtasks**, then executes them sequentially through an assembled pipeline.

This pattern improves reasoning quality, interpretability, and modularity in LLM-powered systems.

---

# Overview

Many complex tasks contain multiple reasoning steps.
The `m_decompose` pipeline handles this by splitting the task into smaller units.

```
User Request
     ↓
Task Decomposition
     ↓
  Subtasks
     ↓
Task Execution
     ↓
Final Result
```

Rather than writing a large prompt, the workflow uses **generative modules and reusable prompts**.

---

# Directory

```
m_decompose/
├── decompose.py
├── pipeline.py
├── prompt_modules
└── README.md
```

**decompose.py**

Generates the refined subtasks from the user request.

**pipeline.py**

Runs the full workflow:

1. decompose the task
2. execute subtasks
3. aggregate results

**prompt_modules**

Reusable prompt components used by the pipeline.

**m_decomp_result_v1.py.jinja2**

Template used to format the final output.

---

# Quick Start

Example usage:

```python
from mellea.cli.decompose.pipeline import decompose, DecompBackend
import json

query = """I will visit Grand Canyon National Park for 3 days in early May. Please create a travel itinerary that includes major scenic viewpoints and short hiking trails. The daily walking distance should stay under 6 miles, and each day should include at least one sunset or sunrise viewpoint."""

result = decompose(
    task_prompt=query,
    model_id="mistralai/Mistral-Small-3.2-24B-Instruct-2506",
    backend=DecompBackend.openai,
    backend_endpoint="http://localhost:8000/v1",
    backend_api_key="EMPTY",
)

print(json.dumps(result, indent=2, ensure_ascii=False))
```

The pipeline then executes each step and produces the final answer.

---

# Highlights

- **Task Decomposition and Execution** — break complex problems into smaller planning and execution steps.
- **Generative Mellea Program** — run LLM workflows as a programmatic pipeline instead of a single call.
- **Modular Instructions** — separate instruction design from execution logic using reusable modules.

---

# Summary

`m_decompose` shows how to build **LLM pipelines** using task decomposition.

cli/decompose/decompose.py

Lines changed: 39 additions & 39 deletions
```diff
@@ -27,14 +27,16 @@
 
 
 class DecompVersion(StrEnum):
-    """Available versions of the decomposition pipeline template.
+    """Available template versions for generated decomposition programs.
 
-    Newer versions must be declared last to ensure ``latest`` always resolves to
-    the most recent template.
+    Newer concrete versions must be declared after older ones so that
+    ``latest`` can resolve to the most recently declared template version.
 
-    Args:
-        latest (str): Sentinel value that resolves to the last declared version.
-        v1 (str): Version 1 of the decomposition pipeline template.
+    Attributes:
+        latest: Sentinel value that resolves to the last declared concrete
+            template version.
+        v1: Version 1 of the decomposition program template.
+        v2: Version 2 of the decomposition program template.
     """
 
     latest = "latest"
```
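The declaration-order rule in this docstring suggests a simple resolution mechanism; a sketch of that idea (Python 3.11+ for StrEnum; the resolver is hypothetical, not necessarily how decompose.py implements it):

```python
from enum import StrEnum

class DecompVersion(StrEnum):
    latest = "latest"  # sentinel, declared first
    v1 = "v1"
    v2 = "v2"

def resolve(version: DecompVersion) -> DecompVersion:
    """Resolve the sentinel to the most recently declared concrete version."""
    if version is DecompVersion.latest:
        # Enum iteration order matches declaration order, so the last member
        # is always the newest concrete template version.
        return list(DecompVersion)[-1]
    return version

assert resolve(DecompVersion.latest) is DecompVersion.v2
```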
```diff
@@ -225,44 +227,42 @@ def run(
         ),
     ] = LogMode.demo,
 ) -> None:
-    """Decompose one or more user queries into subtasks with constraints and dependency metadata.
-
-    Reads user queries either from a file or interactively, runs the LLM
-    decomposition pipeline to produce subtask descriptions, Jinja2 prompt templates,
-    constraint lists, and dependency metadata, and writes one ``.json`` result file
-    plus one rendered ``.py`` script per task job to the output directory.
-
-    If ``input_file`` contains multiple non-empty lines, each line is treated as a
-    separate task job.
+    """Runs the ``m decompose`` CLI workflow and writes generated outputs.
+
+    Reads user queries from a file or interactive input, runs the decomposition
+    pipeline for each task job, and writes one JSON file, one rendered Python
+    program, and any generated validation modules under a per-job output
+    directory.
 
     Args:
-        out_dir: Path to an existing directory where output files are saved.
-        out_name: Base name (no extension) for the output files. Defaults to
-            ``"m_decomp_result"``.
-        input_file: Optional path to a text file containing one or more user
-            queries. If the file contains multiple non-empty lines, each line is
-            treated as a separate task job. If omitted, the query is collected
-            interactively.
-        model_id: Model name or ID used for all decomposition pipeline steps.
-        backend: Inference backend -- ``"ollama"``, ``"openai"``, or ``"rits"``.
-        backend_req_timeout: Request timeout in seconds for model inference calls.
-        backend_endpoint: Base URL of the configured endpoint. Required when
-            ``backend="openai"`` or ``backend="rits"``.
-        backend_api_key: API key for the configured endpoint. Required when
-            ``backend="openai"`` or ``backend="rits"``.
-        version: Version of the decomposition pipeline template to use.
-        input_var: Optional list of user-input variable names (e.g. ``"DOC"``).
-            Each name must be a valid Python identifier. Pass this option
-            multiple times to define multiple variables.
-        log_mode: Logging detail mode for CLI and pipeline output.
+        out_dir: Existing directory under which per-job output directories are
+            created.
+        out_name: Base name used for the per-job output directory and generated
+            files.
+        input_file: Optional path to a text file containing one or more task
+            prompts. Each non-empty line is processed as a separate task job.
+            When omitted, the command prompts interactively for one task.
+        model_id: Model identifier used for all decomposition pipeline stages.
+        backend: Inference backend used to execute model calls.
+        backend_req_timeout: Request timeout in seconds for backend inference calls.
+        backend_endpoint: Endpoint URL or base URL required by remote backends.
+        backend_api_key: API key required by remote backends.
+        version: Template version used to render the generated Python program.
+            ``latest`` resolves to the most recently declared concrete version.
+        input_var: Optional user input variable names to expose in generated
+            prompts and programs. Each name must be a valid non-keyword Python
+            identifier.
+        log_mode: Logging verbosity for CLI and pipeline execution.
 
     Raises:
-        AssertionError: If ``out_name`` contains invalid characters, if
-            ``out_dir`` does not exist or is not a directory, or if any
-            ``input_var`` name is not a valid Python identifier.
-        ValueError: If the input file contains no non-empty task lines.
-        Exception: Re-raised from the decomposition pipeline after cleaning up
-            any partially written output directories.
+        AssertionError: If ``out_name`` is invalid, ``out_dir`` does not name an
+            existing directory, ``input_file`` does not name an existing file,
+            or any declared ``input_var`` is not a valid Python identifier.
+        ValueError: If ``input_file`` exists but contains no non-empty task
+            lines.
+        Exception: Propagates pipeline, rendering, parsing, or file-writing
+            failures. Any output directories created earlier in the run are
+            removed before the exception is re-raised.
     """
     created_dirs: list[Path] = []
```
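The ``Exception`` clause describes a cleanup-then-reraise contract; a minimal sketch of that shape (the job loop and directory naming below are hypothetical, not the actual run() body):

```python
import shutil
from pathlib import Path

def run_jobs(out_dir: Path, task_prompts: list[str]) -> None:
    """Remove partially written per-job output directories on failure."""
    created_dirs: list[Path] = []
    try:
        for i, prompt in enumerate(task_prompts):
            job_dir = out_dir / f"m_decomp_result_{i}"  # illustrative naming
            job_dir.mkdir(parents=True, exist_ok=False)
            created_dirs.append(job_dir)
            ...  # decompose, render the program, write JSON + validation modules
    except Exception:
        # Clean up everything this run created, then let the error propagate.
        for d in created_dirs:
            shutil.rmtree(d, ignore_errors=True)
        raise
```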

cli/decompose/logging.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -4,6 +4,13 @@
 
 
 class LogMode(StrEnum):
+    """Logging verbosity modes used across the CLI and pipeline.
+
+    Attributes:
+        demo: Standard readable logs with moderate detail.
+        debug: Verbose logs including internal state and intermediate outputs.
+    """
+
     demo = "demo"
     debug = "debug"
 
@@ -12,6 +19,17 @@ class LogMode(StrEnum):
 
 
 def configure_logging(log_mode: LogMode = LogMode.demo) -> None:
+    """Configures root logging handlers and log levels for the application.
+
+    Initializes a single stdout stream handler on first invocation and updates
+    log levels on subsequent calls. This function is safe to call multiple times.
+
+    Args:
+        log_mode: Logging verbosity mode controlling the global log level.
+
+    Returns:
+        None.
+    """
     global _CONFIGURED
 
     level = logging.DEBUG if log_mode == LogMode.debug else logging.INFO
@@ -34,10 +52,27 @@ def configure_logging(log_mode: LogMode = LogMode.demo) -> None:
 
 
 def get_logger(name: str) -> logging.Logger:
+    """Returns a logger instance with the given name.
+
+    Args:
+        name: Logger name, typically a module or component identifier.
+
+    Returns:
+        A configured ``logging.Logger`` instance.
+    """
     return logging.getLogger(name)
 
 
 def log_section(logger: logging.Logger, title: str) -> None:
+    """Emits a formatted section header to visually separate log output.
+
+    Args:
+        logger: Logger used to emit the section lines.
+        title: Section title displayed between separator lines.
+
+    Returns:
+        None.
+    """
     logger.info("")
     logger.info("=" * 72)
     logger.info(title)
```
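A short usage sketch of the documented logging API; the import path is assumed from the file location and may differ in the installed package:

```python
from cli.decompose.logging import (  # assumed module path
    LogMode,
    configure_logging,
    get_logger,
    log_section,
)

configure_logging(LogMode.debug)  # idempotent: later calls only adjust levels
logger = get_logger("decompose.demo")

log_section(logger, "Task Decomposition")
logger.debug("verbose internal state goes to stdout at DEBUG level")
```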
