Commit e1fa209

Add tests for appendix preferences, timeout handling, and refine model catalog checks
- Implement tests for deriving layout policy from LaTeX templates in `latexTemplateLoader.test.ts`.
- Add `parseAppendixPreferencesFromBrief` function tests in `manuscriptFormat.test.ts`.
- Update model catalog tests to ensure correct recommendations for Codex models in `modelCatalog.test.ts`.
- Introduce handling for preflight-only metrics in `objectiveMetricPropagation.test.ts`.
- Add tests for handling hanging responses in `ollama.test.ts` and `paperSelection.test.ts`.
- Refactor and clean up venue style references in `paperCritique.test.ts`, `writePaperPdfBuild.test.ts`, and related tests.
- Implement timeout fallback tests for constraint profiles and literature queries in `collectPlanningTimeouts.test.ts`.
1 parent 3996de1 commit e1fa209

65 files changed

Lines changed: 1784 additions & 750 deletions

ISSUES.md

Lines changed: 83 additions & 32 deletions
@@ -1,6 +1,6 @@
 # ISSUES.md
 
-Last updated: 2026-04-03
+Last updated: 2026-04-04
 
 This file was compacted on 2026-03-22 to remove duplicated template fragments, malformed partial entries, and conflicting reused LV identifiers. Detailed pre-cleanup prose remains in git history.
 
@@ -14,7 +14,8 @@ Usage rules:
 ## Current active status
 
 - Active live-validation defects:
-  - None currently open.
+  - `LV-085` implement-stage materialization boundary
+  - `LV-086` preflight metrics over-promoted as executed evidence
 - Active research/paper-readiness watchlist: see `Research and paper-readiness watchlist` below.
 - Current watchlist snapshot:
   - `R-001` Result-table discipline and claim→evidence linkage — `MITIGATED`
@@ -29,49 +30,99 @@ Usage rules:
 
 ## Active live validation issues
 
-- None currently open.
+## Issue: LV-085
 
----
+- Status: active
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief
+- Environment/session context: fresh live TUI run in `test/`, run id `1f46de0f-5beb-4de6-a219-abf483b74101`, current node `implement_experiments`
 
-## Research and paper-readiness watchlist
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed brief for the Mistral-7B LoRA rank/dropout sweep.
+  2. Let the run progress through `collect_papers`, `analyze_papers`, `generate_hypotheses`, and `design_experiments`.
+  3. Allow `implement_experiments` to repair after the earlier stale `peft` runner feedback.
+  4. Observe the third implementation attempt fail before `run_experiments` is allowed to rerun.
 
-These are not active interactive defects. They stay here as mitigated or watchlist-style research/paper-readiness risks so they do not get lost in the fixed live-validation timeline.
+- Expected behavior:
+  - `implement_experiments` should validate runnable public artifacts such as the experiment script, config, and docs.
+  - Run-owned execution outputs like `.autolabos/runs/<run-id>/metrics.json` should not be required to already exist before `run_experiments` executes.
 
-### R-001 — Result-table discipline and claim→evidence linkage
-- Status: MITIGATED
-- What was done: `design_experiments` writes `baseline_summary.json`; `analyze_results` writes `result_table.json`; `review` gate checks both and blocks when missing.
-- Remaining risk: quality of content inside these artifacts still depends on the generated analysis.
+- Actual behavior:
+  - `implement_experiments` fails with:
+    - `Implementer referenced artifact(s) that were not materialized: .autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/metrics.json`
+  - The node then retries instead of handing the current experiment harness back to `run_experiments`.
 
-### R-002 — Scientific gate warnings surfacing
-- Status: MITIGATED
-- What was done: gate warnings are grouped by category with severity labels and surfaced as limitation sentences in the manuscript.
-- Remaining risk: categories are still coarse and can require operator review.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown
 
-### R-003 — System-validation paper shape over-promotion
-- Status: MITIGATED
-- What was done: manuscript classification now downgrades missing-baseline / missing-results / missing-richness cases.
-- Remaining risk: a structurally complete fake-mode run can still look stronger than the underlying evidence.
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: the implement-stage artifact materialization check treats run-owned execution outputs such as `metrics.json` as if they must already be present during `implement_experiments`, even though those files are supposed to be produced by `run_experiments`.
 
-### P-001 — Baseline/comparator packaging
-- Status: MITIGATED
-- What was done: `baseline_summary.json` is written by `design_experiments`; review downgrades when missing.
+- Code/test changes:
+  - Code: pending
+  - Tests: pending
 
-### P-002 — Compact quantitative result packaging
-- Status: MITIGATED
-- What was done: `result_table.json` is written by `analyze_results`; review downgrades when missing.
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending
 
-### P-003 — Related-work depth signaling
-- Status: MITIGATED
-- What was done: `analyze_papers_richness_summary.json` tracks full-text coverage and feeds readiness classification.
-- Remaining risk: full-text grounding still depends on PDF availability.
+- Follow-up risks:
+  - The same validator boundary may also incorrectly require other run-owned execution artifacts before second-stage verification.
+- Evidence/artifacts:
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/events.jsonl`
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/run_record.json`
+  - `test/outputs/lora-rank-dropout-interaction-for-mistral-7b-ins-1f46de0f/experiment/experiment.py`
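
The stage boundary that LV-085 argues for can be sketched in TypeScript, the language of the test files this commit touches. This is a minimal illustration under stated assumptions, not the project's validator: `classifyArtifact`, `findUnmaterialized`, and the rule that every path under `.autolabos/runs/<run-id>/` counts as run-owned are hypothetical and introduced here only to make the expected behavior concrete.

```typescript
// Hypothetical sketch; none of these names come from the project's code.
type ArtifactOwner = "implement" | "run";

// Assumption: every path under .autolabos/runs/<run-id>/ is written by
// run_experiments and is therefore "run-owned".
const RUN_OWNED = /(^|\/)\.autolabos\/runs\/[^/]+\//;

function classifyArtifact(path: string): ArtifactOwner {
  return RUN_OWNED.test(path) ? "run" : "implement";
}

// Returns the referenced artifacts that implement_experiments should have
// materialized but did not. Run-owned outputs are skipped, not required.
function findUnmaterialized(
  referenced: readonly string[],
  exists: (path: string) => boolean,
): string[] {
  return referenced.filter(
    (path) => classifyArtifact(path) === "implement" && !exists(path),
  );
}

// Usage with an in-memory stand-in for the filesystem.
const onDisk = new Set(["outputs/demo/experiment/experiment.py"]);
const missing = findUnmaterialized(
  [
    "outputs/demo/experiment/experiment.py",
    "outputs/demo/experiment/config.yaml",
    ".autolabos/runs/demo-run/metrics.json",
  ],
  (path) => onDisk.has(path),
);
console.log(missing);
```

Under these assumptions a reference to `metrics.json` no longer blocks the implement stage, while a missing public artifact such as the config file still does. Passing `exists` as a parameter keeps the check testable without touching the real workspace.
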
 
----
+## Issue: LV-086
 
-## Historical archive
+- Status: active
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief after `run_experiments`
+- Environment/session context: same fresh live run `1f46de0f-5beb-4de6-a219-abf483b74101`, artifacts inspected after `run_experiments` completed and `analyze_results` paused
 
-Older fixed live-validation entries, compact archived summaries, and legacy draft items have been moved out of this main operator-facing file.
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed LoRA rank/dropout brief.
+  2. Let `implement_experiments` and `run_experiments` complete.
+  3. Inspect `.autolabos/runs/<run-id>/metrics.json` and `analysis/result_table.json`.
+  4. Observe that the recorded metrics come from `mode: "preflight"` with no training or evaluation executed.
+
+- Expected behavior:
+  - `run_experiments` should not treat preflight-only environment checks as successful executed experiment evidence for this paper-scale brief.
+  - Objective evaluation should not infer research success from hardware/resource fields such as `device.gpu_count` when the stated objective is benchmark accuracy on ARC-Challenge and HellaSwag.
+
+- Actual behavior:
+  - `metrics.json` contains:
+    - `mode: "preflight"`
+    - `notes: "No training/evaluation executed..."`
+    - `primary_metric: null`
+  - `run_experiments` still completes and summarizes:
+    - `Objective metric met: device.gpu_count=2 >= 0.015`
+  - `analyze_results` then builds a results table from hardware/resource fields and pauses only later with `incomplete_results_table`.
 
-If we need to resurrect one of those older cases, use git history rather than treating them as current active work.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown
+
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: `run_experiments` currently accepts preflight-only metrics as a successful execution artifact, and the best-effort objective matcher is willing to promote resource metrics (for example `device.gpu_count`) into the objective summary even when no task metric exists.
+
+- Code/test changes:
+  - Code: pending
+  - Tests: pending
+
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending
+
+- Follow-up risks:
+  - Even when later gates pause the workflow, the misleading “objective met” summary can contaminate operator interpretation, review context, and any quality-improvement loop that reads `paper_readiness`-adjacent artifacts.
+- Evidence/artifacts:
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/metrics.json`
+  - `test/outputs/lora-rank-dropout-interaction-for-mistral-7b-ins-1f46de0f/analysis/result_table.json`
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/run_record.json`
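
A guard of the kind LV-086 calls for might look like the following TypeScript sketch. The `RunMetrics` shape mirrors only the two fields quoted in the issue (`mode`, `primary_metric`). Everything else is an assumption made for illustration: `evaluateObjective`, the `RESOURCE_PREFIXES` list, and the reason strings are hypothetical names, and the real objective matcher may be structured quite differently.

```typescript
// Hypothetical sketch; only `mode` and `primary_metric` are taken from the
// issue text. All other names are assumptions.
interface RunMetrics {
  mode?: string;
  primary_metric?: number | null;
}

interface ObjectiveVerdict {
  met: boolean;
  reason:
    | "preflight_only"
    | "resource_metric"
    | "below_threshold"
    | "threshold_met";
}

// Assumption: hardware/resource fields share these key prefixes.
const RESOURCE_PREFIXES = ["device.", "hardware.", "resources."];

// Metrics count as executed evidence only when the run was not a preflight
// check and a numeric task metric was actually recorded.
function isExecutedEvidence(metrics: RunMetrics): boolean {
  return (
    metrics.mode !== "preflight" && typeof metrics.primary_metric === "number"
  );
}

function isResourceKey(key: string): boolean {
  return RESOURCE_PREFIXES.some((prefix) => key.startsWith(prefix));
}

function evaluateObjective(
  metrics: RunMetrics,
  key: string,
  value: number,
  threshold: number,
): ObjectiveVerdict {
  if (!isExecutedEvidence(metrics)) {
    return { met: false, reason: "preflight_only" };
  }
  if (isResourceKey(key)) {
    return { met: false, reason: "resource_metric" };
  }
  return value >= threshold
    ? { met: true, reason: "threshold_met" }
    : { met: false, reason: "below_threshold" };
}

// The case from the issue: preflight metrics matched against a GPU count.
const verdict = evaluateObjective(
  { mode: "preflight", primary_metric: null },
  "device.gpu_count",
  2,
  0.015,
);
console.log(verdict.met, verdict.reason);
```

With both checks in place, the `device.gpu_count=2 >= 0.015` comparison from the issue would be rejected twice over: once because the metrics are preflight-only, and once because the matched key is a resource field rather than a task metric.
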
 
 ## Issue: LV-ARCHIVE-ANCHOR
 
README.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ The brief is not just a startup note. It is the governed contract for a run.
 
 That makes the brief part of the audit trail, not just part of the prompt.
 
+In the current contract, `.autolabos/config.yaml` is primarily for provider/runtime defaults and workspace policy. Run-specific research intent, evidence bars, baseline expectations, manuscript-format targets, and manuscript template path belong in the brief. Persisted config may therefore omit brief-owned sections such as research defaults and some manuscript-profile or paper-template fields.
+
 ```bash
 /new
 /brief start --latest

docs/README.de.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Der Brief ist nicht nur ein Startdokument. Er ist der governed contract für ein
 
 Mit anderen Worten: Der Brief ist nicht nur Teil des Prompts. Er ist Teil des Audit Trails.
 
+Im aktuellen Vertrag speichert `.autolabos/config.yaml` vor allem Provider-/Runtime-Defaults und Workspace-Policy. Die run-spezifische Forschungsabsicht, Evidence-Schwellen, Baseline-Erwartungen, Manuscript-Format-Ziele und der Pfad zum Manuscript-Template gehören dagegen in den Brief. Deshalb kann der persistierte Config `research`-Defaults sowie einige manuscript-profile- bzw. paper-template-Felder auslassen.
+
 ```bash
 /new
 /brief start --latest

docs/README.es.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ El brief no es solo un documento de arranque. Es el governed contract de la corr
 
 Es decir, el brief no es solo parte del prompt. Es parte del audit trail.
 
+En el contrato actual, `.autolabos/config.yaml` guarda sobre todo valores por defecto de provider/runtime y workspace policy. La intención de investigación de cada run, los evidence bars, las expectativas de baseline, los objetivos de manuscript format y la ruta del manuscript template deben vivir en el Brief. Por eso, el config persistido puede omitir valores por defecto de `research` y algunos campos de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest

docs/README.fr.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Le brief n’est pas seulement un document de départ. C’est le governed contr
 
 En d’autres termes, le brief n’est pas seulement une partie du prompt. Il fait partie de l’audit trail.
 
+Dans le contrat actuel, `.autolabos/config.yaml` sert surtout à stocker les valeurs par défaut du provider/runtime et la workspace policy. L’intention de recherche propre à chaque run, les evidence bars, les attentes de baseline, les objectifs de manuscript format et le chemin du manuscript template doivent vivre dans le Brief. Le config persisté peut donc omettre les valeurs par défaut de `research` ainsi que certains champs de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest

docs/README.ja.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief は単なる開始文書ではありません。run の governed contract
 
 つまり brief は prompt の一部ではなく、audit trail の一部です。
 
+現在の契約では、`.autolabos/config.yaml` は主に provider/runtime の既定値と workspace policy を保持します。run ごとの research intent、evidence bar、baseline expectation、manuscript format target、manuscript template path は Brief 側で定義するのが原則です。そのため、persisted config では `research` の既定値や一部の manuscript-profile / paper-template フィールドが省略されることがあります。
+
 ```bash
 /new
 /brief start --latest

docs/README.ko.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief는 단순한 시작 문서가 아닙니다. 한 run의 governed contract
 
 즉, brief는 prompt 일부가 아니라 audit trail의 일부입니다.
 
+현재 계약에서 `.autolabos/config.yaml`은 주로 provider/runtime 기본값과 workspace 정책을 담습니다. run별 연구 의도, evidence 기준, baseline 기대치, manuscript format 목표는 Brief에 두는 것이 원칙입니다. 그래서 persisted config에서는 `research` 기본값이나 일부 manuscript-profile 필드가 생략될 수 있습니다.
+
 ```bash
 /new
 /brief start --latest

docs/README.pt.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ O brief não é apenas um documento inicial. Ele é o governed contract do run.
 
 Em outras palavras, o brief não é apenas parte do prompt. Ele é parte do audit trail.
 
+No contrato atual, `.autolabos/config.yaml` guarda principalmente defaults de provider/runtime e workspace policy. A intenção de pesquisa específica de cada run, os evidence bars, as expectativas de baseline, os objetivos de manuscript format e o caminho do manuscript template devem ficar no Brief. Por isso, o config persistido pode omitir defaults de `research` e alguns campos de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest

docs/README.ru.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief — это не просто стартовый документ. Это g
 
 Иными словами, brief — это не просто часть prompt. Это часть audit trail.
 
+В текущем контракте `.autolabos/config.yaml` в основном хранит provider/runtime defaults и workspace policy. Исследовательское намерение для конкретного run, evidence bar, ожидания по baseline, цели manuscript format и путь к manuscript template должны задаваться в Brief. Поэтому сохранённый config может не содержать `research` defaults и часть полей manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest

docs/README.zh-CN.md

Lines changed: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief 不只是启动文档。它是 run 的 governed contract。
 
 也就是说,brief 不是 prompt 的一部分,而是 audit trail 的一部分。
 
+在当前契约里,`.autolabos/config.yaml` 主要保存 provider/runtime 默认值和 workspace policy。每个 run 的研究意图、evidence 门槛、baseline 预期、manuscript format 目标以及 manuscript template 路径,原则上应放在 Brief 中。因此,持久化后的 config 可能会省略 `research` 默认值以及部分 manuscript-profile / paper-template 字段。
+
 ```bash
 /new
 /brief start --latest
