Add tests for appendix preferences, timeout handling, and refine model catalog checks
- Implement tests for deriving layout policy from LaTeX templates in `latexTemplateLoader.test.ts`.
- Add `parseAppendixPreferencesFromBrief` function tests in `manuscriptFormat.test.ts`.
- Update model catalog tests to ensure correct recommendations for Codex models in `modelCatalog.test.ts`.
- Introduce handling for preflight-only metrics in `objectiveMetricPropagation.test.ts`.
- Add tests for handling hanging responses in `ollama.test.ts` and `paperSelection.test.ts`.
- Refactor and clean up venue style references in `paperCritique.test.ts`, `writePaperPdfBuild.test.ts`, and related tests.
- Implement timeout fallback tests for constraint profiles and literature queries in `collectPlanningTimeouts.test.ts`.
ISSUES.md: 83 additions & 32 deletions
@@ -1,6 +1,6 @@
 # ISSUES.md

-Last updated: 2026-04-03
+Last updated: 2026-04-04

 This file was compacted on 2026-03-22 to remove duplicated template fragments, malformed partial entries, and conflicting reused LV identifiers. Detailed pre-cleanup prose remains in git history.
+- `LV-086` preflight metrics over-promoted as executed evidence
 - Active research/paper-readiness watchlist: see `Research and paper-readiness watchlist` below.
 - Current watchlist snapshot:
   - `R-001` Result-table discipline and claim→evidence linkage — `MITIGATED`
@@ -29,49 +30,99 @@ Usage rules:

 ## Active live validation issues

-- None currently open.
+## Issue: LV-085

----
+- Status: active
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief
+- Environment/session context: fresh live TUI run in `test/`, run id `1f46de0f-5beb-4de6-a219-abf483b74101`, current node `implement_experiments`

-## Research and paper-readiness watchlist
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed brief for the Mistral-7B LoRA rank/dropout sweep.
+  2. Let the run progress through `collect_papers`, `analyze_papers`, `generate_hypotheses`, and `design_experiments`.
+  3. Allow `implement_experiments` to repair after the earlier stale `peft` runner feedback.
+  4. Observe the third implementation attempt fail before `run_experiments` is allowed to rerun.

-These are not active interactive defects. They stay here as mitigated or watchlist-style research/paper-readiness risks so they do not get lost in the fixed live-validation timeline.
+- Expected behavior:
+  - `implement_experiments` should validate runnable public artifacts such as the experiment script, config, and docs.
+  - Run-owned execution outputs like `.autolabos/runs/<run-id>/metrics.json` should not be required to already exist before `run_experiments` executes.

-### R-001 — Result-table discipline and claim→evidence linkage
-- Status: MITIGATED
-- What was done: `design_experiments` writes `baseline_summary.json`; `analyze_results` writes `result_table.json`; `review` gate checks both and blocks when missing.
-- Remaining risk: quality of content inside these artifacts still depends on the generated analysis.
+- Actual behavior:
+  - `implement_experiments` fails with:
+    - `Implementer referenced artifact(s) that were not materialized: .autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/metrics.json`
+  - The node then retries instead of handing the current experiment harness back to `run_experiments`.

-### R-002 — Scientific gate warnings surfacing
-- Status: MITIGATED
-- What was done: gate warnings are grouped by category with severity labels and surfaced as limitation sentences in the manuscript.
-- Remaining risk: categories are still coarse and can require operator review.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown

-### R-003 — System-validation paper shape over-promotion
-- Status: MITIGATED
-- What was done: manuscript classification now downgrades missing-baseline / missing-results / missing-richness cases.
-- Remaining risk: a structurally complete fake-mode run can still look stronger than the underlying evidence.
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: the implement-stage artifact materialization check treats run-owned execution outputs such as `metrics.json` as if they must already be present during `implement_experiments`, even though those files are supposed to be produced by `run_experiments`.

-### P-001 — Baseline/comparator packaging
-- Status: MITIGATED
-- What was done: `baseline_summary.json` is written by `design_experiments`; review downgrades when missing.
+- Code/test changes:
+  - Code: pending
+  - Tests: pending

-### P-002 — Compact quantitative result packaging
-- Status: MITIGATED
-- What was done: `result_table.json` is written by `analyze_results`; review downgrades when missing.
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending

-### P-003 — Related-work depth signaling
-- Status: MITIGATED
-- What was done: `analyze_papers_richness_summary.json` tracks full-text coverage and feeds readiness classification.
-- Remaining risk: full-text grounding still depends on PDF availability.
+- Follow-up risks:
+  - The same validator boundary may also incorrectly require other run-owned execution artifacts before second-stage verification.
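
The LV-085 hypothesis can be illustrated with a minimal TypeScript sketch of a materialization check that defers run-owned execution outputs instead of reporting them as missing. Everything named here (`checkMaterializedArtifacts`, `isRunOwnedExecutionOutput`, `RUN_OWNED_OUTPUT_NAMES`, and the path rule) is a hypothetical illustration and not code from the repository; the fix for this issue is still listed as pending.

```typescript
// Hypothetical sketch: none of these names come from the repository.
interface MaterializationCheck {
  missing: string[];  // public artifacts that should already exist
  deferred: string[]; // run-owned outputs expected only after run_experiments
}

// Assumption: files that run_experiments, not implement_experiments, produces.
const RUN_OWNED_OUTPUT_NAMES: string[] = ["metrics.json"];

function isRunOwnedExecutionOutput(path: string): boolean {
  const normalized = path.replace(/\\/g, "/");
  const segments = normalized.split("/");
  const fileName = segments[segments.length - 1];
  // Assumption: run-owned outputs live under .autolabos/runs/<run-id>/
  const insideRunDir = /(^|\/)\.autolabos\/runs\/[^/]+\//.test(normalized);
  return insideRunDir && RUN_OWNED_OUTPUT_NAMES.indexOf(fileName) !== -1;
}

function checkMaterializedArtifacts(
  referenced: string[],
  existing: string[]
): MaterializationCheck {
  const missing: string[] = [];
  const deferred: string[] = [];
  for (const path of referenced) {
    if (existing.indexOf(path) !== -1) continue; // already on disk
    if (isRunOwnedExecutionOutput(path)) {
      deferred.push(path); // not an implement-stage failure
    } else {
      missing.push(path);
    }
  }
  return { missing, deferred };
}
```

Under this assumed rule, a referenced `.autolabos/runs/<run-id>/metrics.json` would land in `deferred`, so only genuinely absent public artifacts would block the hand-off to `run_experiments`.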
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief after `run_experiments`
+- Environment/session context: same fresh live run `1f46de0f-5beb-4de6-a219-abf483b74101`, artifacts inspected after `run_experiments` completed and `analyze_results` paused

-Older fixed live-validation entries, compact archived summaries, and legacy draft items have been moved out of this main operator-facing file.
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed LoRA rank/dropout brief.
+  2. Let `implement_experiments` and `run_experiments` complete.
+  3. Inspect `.autolabos/runs/<run-id>/metrics.json` and `analysis/result_table.json`.
+  4. Observe that the recorded metrics come from `mode: "preflight"` with no training or evaluation executed.
+
+- Expected behavior:
+  - `run_experiments` should not treat preflight-only environment checks as successful executed experiment evidence for this paper-scale brief.
+  - Objective evaluation should not infer research success from hardware/resource fields such as `device.gpu_count` when the stated objective is benchmark accuracy on ARC-Challenge and HellaSwag.
+
+- Actual behavior:
+  - `metrics.json` contains:
+    - `mode: "preflight"`
+    - `notes: "No training/evaluation executed..."`
+    - `primary_metric: null`
+  - `run_experiments` still completes and summarizes:
+  - `analyze_results` then builds a results table from hardware/resource fields and pauses only later with `incomplete_results_table`.

-If we need to resurrect one of those older cases, use git history rather than treating them as current active work.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown
+
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: `run_experiments` currently accepts preflight-only metrics as a successful execution artifact, and the best-effort objective matcher is willing to promote resource metrics (for example `device.gpu_count`) into the objective summary even when no task metric exists.
+
+- Code/test changes:
+  - Code: pending
+  - Tests: pending
+
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending
+
+- Follow-up risks:
+  - Even when later gates pause the workflow, the misleading “objective met” summary can contaminate operator interpretation, review context, and any quality-improvement loop that reads `paper_readiness`-adjacent artifacts.
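
The two expectations in LV-086 can be sketched in TypeScript as a pair of guards: one that refuses to count preflight-only metrics as executed evidence, and one that keeps hardware/resource fields out of objective matching. The function names, the `RESOURCE_KEY_PREFIXES` list, and the flattened-key convention are assumptions made for illustration; only `mode`, `primary_metric`, and `device.gpu_count` appear in the issue text.

```typescript
// Hypothetical sketch: only `mode`, `primary_metric`, and `device.gpu_count`
// are taken from the issue text; everything else is assumed.
interface MetricsFile {
  mode?: string;
  primary_metric?: number | null;
  [key: string]: unknown;
}

// Assumption: resource fields are recognizable by a dotted key prefix.
const RESOURCE_KEY_PREFIXES: string[] = ["device.", "hardware.", "resource."];

function isResourceKey(key: string): boolean {
  return RESOURCE_KEY_PREFIXES.some((prefix) => key.indexOf(prefix) === 0);
}

// Preflight-only output, or output without a numeric primary metric,
// is not treated as executed experiment evidence.
function isExecutedEvidence(metrics: MetricsFile): boolean {
  if (metrics.mode === "preflight") return false;
  return typeof metrics.primary_metric === "number" && isFinite(metrics.primary_metric);
}

// Returns the first non-resource key that mentions an objective term,
// or null when no task metric exists (no fallback to resource fields).
function selectObjectiveMetric(
  flatMetrics: Record<string, number>,
  objectiveTerms: string[]
): string | null {
  for (const key of Object.keys(flatMetrics)) {
    if (isResourceKey(key)) continue;
    const lower = key.toLowerCase();
    if (objectiveTerms.some((term) => lower.indexOf(term.toLowerCase()) !== -1)) {
      return key;
    }
  }
  return null;
}
```

With guards like these, a metrics file that contains only `device.gpu_count` would yield no objective metric at all, which is the outcome the expected behavior describes for an accuracy-based objective.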
README.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ The brief is not just a startup note. It is the governed contract for a run.

 That makes the brief part of the audit trail, not just part of the prompt.

+In the current contract, `.autolabos/config.yaml` is primarily for provider/runtime defaults and workspace policy. Run-specific research intent, evidence bars, baseline expectations, manuscript-format targets, and manuscript template path belong in the brief. Persisted config may therefore omit brief-owned sections such as research defaults and some manuscript-profile or paper-template fields.
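
One way to picture the config/brief split described in the added README paragraph is the TypeScript sketch below. The interfaces, field names, and the `resolveRunContract` helper are all hypothetical and do not reflect the project's real schema; the sketch also assumes that brief-owned values are read only from the brief, which the README implies but does not state as a precedence rule.

```typescript
// Hypothetical shapes: field names are illustrative, not the real schema.
interface PersistedConfig {
  provider: string;
  workspacePolicy: string;
  research?: { evidenceBar?: string }; // brief-owned, may be omitted
  paperTemplatePath?: string;          // brief-owned, may be omitted
}

interface RunBrief {
  researchIntent: string;
  evidenceBar: string;
  baselineExpectation: string;
  manuscriptFormat: string;
  manuscriptTemplatePath?: string;
}

interface RunContract {
  provider: string;
  workspacePolicy: string;
  researchIntent: string;
  evidenceBar: string;
  baselineExpectation: string;
  manuscriptFormat: string;
  manuscriptTemplatePath?: string;
}

// Assumption: provider/runtime and policy come from config, while every
// run-specific field comes from the brief, even if config has a stale value.
function resolveRunContract(config: PersistedConfig, brief: RunBrief): RunContract {
  return {
    provider: config.provider,
    workspacePolicy: config.workspacePolicy,
    researchIntent: brief.researchIntent,
    evidenceBar: brief.evidenceBar,
    baselineExpectation: brief.baselineExpectation,
    manuscriptFormat: brief.manuscriptFormat,
    manuscriptTemplatePath: brief.manuscriptTemplatePath,
  };
}
```

Because the brief-owned fields are optional on `PersistedConfig`, a config file that omits them still type-checks and resolves cleanly.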
docs/README.de.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Der Brief ist nicht nur ein Startdokument. Er ist der governed contract für ein

 In other words, the brief is not just part of the prompt. It is part of the audit trail.

+In the current contract, `.autolabos/config.yaml` mainly stores provider/runtime defaults and workspace policy. Run-specific research intent, evidence thresholds, baseline expectations, manuscript-format targets, and the path to the manuscript template belong in the brief instead. The persisted config can therefore omit `research` defaults as well as some manuscript-profile and paper-template fields.
docs/README.es.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ El brief no es solo un documento de arranque. Es el governed contract de la corr

 That is, the brief is not just part of the prompt. It is part of the audit trail.

+In the current contract, `.autolabos/config.yaml` mostly stores provider/runtime defaults and workspace policy. Each run's research intent, evidence bars, baseline expectations, manuscript-format targets, and the manuscript template path should live in the Brief. For that reason, the persisted config may omit `research` defaults and some manuscript-profile / paper-template fields.
docs/README.fr.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Le brief n’est pas seulement un document de départ. C’est le governed contr

 In other words, the brief is not just a part of the prompt. It is part of the audit trail.

+In the current contract, `.autolabos/config.yaml` mainly serves to store provider/runtime defaults and the workspace policy. The research intent specific to each run, the evidence bars, the baseline expectations, the manuscript-format targets, and the manuscript template path must live in the Brief. The persisted config can therefore omit `research` defaults as well as certain manuscript-profile / paper-template fields.
docs/README.ko.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief는 단순한 시작 문서가 아닙니다. 한 run의 governed contract

 In other words, the brief is part of the audit trail, not part of the prompt.

+In the current contract, `.autolabos/config.yaml` mainly holds provider/runtime defaults and workspace policy. As a rule, per-run research intent, evidence criteria, baseline expectations, and manuscript-format targets belong in the Brief. As a result, `research` defaults and some manuscript-profile fields may be omitted from the persisted config.
docs/README.pt.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ O brief não é apenas um documento inicial. Ele é o governed contract do run.

 In other words, the brief is not just part of the prompt. It is part of the audit trail.

+In the current contract, `.autolabos/config.yaml` mainly stores provider/runtime defaults and workspace policy. Each run's specific research intent, evidence bars, baseline expectations, manuscript-format targets, and the manuscript template path should stay in the Brief. For that reason, the persisted config may omit `research` defaults and some manuscript-profile / paper-template fields.
docs/README.ru.md: 2 additions & 0 deletions
@@ -196,6 +196,8 @@ Brief — это не просто стартовый документ. Это g

 In other words, the brief is not just part of the prompt. It is part of the audit trail.

+In the current contract, `.autolabos/config.yaml` mainly stores provider/runtime defaults and workspace policy. The research intent for a specific run, the evidence bar, baseline expectations, manuscript-format targets, and the path to the manuscript template should be set in the Brief. The persisted config may therefore lack `research` defaults and some manuscript-profile / paper-template fields.