lhy0718
diff --git a/ISSUES.md b/ISSUES.md
Lines changed: 83 additions & 32 deletions
diff --git a/README.md b/README.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.de.md b/docs/README.de.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.es.md b/docs/README.es.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.fr.md b/docs/README.fr.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.ja.md b/docs/README.ja.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.ko.md b/docs/README.ko.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.pt.md b/docs/README.pt.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.ru.md b/docs/README.ru.md
Lines changed: 2 additions & 0 deletions
diff --git a/docs/README.zh-CN.md b/docs/README.zh-CN.md
Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,6 @@
 # ISSUES.md
 
-Last updated: 2026-04-03
+Last updated: 2026-04-04
 
 This file was compacted on 2026-03-22 to remove duplicated template fragments, malformed partial entries, and conflicting reused LV identifiers. Detailed pre-cleanup prose remains in git history.
 
@@ -14,7 +14,8 @@ Usage rules:
 ## Current active status
 
 - Active live-validation defects:
-  - None currently open.
+  - `LV-085` implement-stage materialization boundary
+  - `LV-086` preflight metrics over-promoted as executed evidence
 - Active research/paper-readiness watchlist: see `Research and paper-readiness watchlist` below.
 - Current watchlist snapshot:
   - `R-001` Result-table discipline and claim→evidence linkage — `MITIGATED`
@@ -29,49 +30,99 @@ Usage rules:
 
 ## Active live validation issues
 
-- None currently open.
+## Issue: LV-085
 
----
+- Status: active
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief
+- Environment/session context: fresh live TUI run in `test/`, run id `1f46de0f-5beb-4de6-a219-abf483b74101`, current node `implement_experiments`
 
-## Research and paper-readiness watchlist
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed brief for the Mistral-7B LoRA rank/dropout sweep.
+  2. Let the run progress through `collect_papers`, `analyze_papers`, `generate_hypotheses`, and `design_experiments`.
+  3. Allow `implement_experiments` to repair after the earlier stale `peft` runner feedback.
+  4. Observe the third implementation attempt fail before `run_experiments` is allowed to rerun.
 
-These are not active interactive defects. They stay here as mitigated or watchlist-style research/paper-readiness risks so they do not get lost in the fixed live-validation timeline.
+- Expected behavior:
+  - `implement_experiments` should validate runnable public artifacts such as the experiment script, config, and docs.
+  - Run-owned execution outputs like `.autolabos/runs/<run-id>/metrics.json` should not be required to already exist before `run_experiments` executes.
 
-### R-001 — Result-table discipline and claim→evidence linkage
-- Status: MITIGATED
-- What was done: `design_experiments` writes `baseline_summary.json`; `analyze_results` writes `result_table.json`; `review` gate checks both and blocks when missing.
-- Remaining risk: quality of content inside these artifacts still depends on the generated analysis.
+- Actual behavior:
+  - `implement_experiments` fails with:
+    - `Implementer referenced artifact(s) that were not materialized: .autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/metrics.json`
+  - The node then retries instead of handing the current experiment harness back to `run_experiments`.
 
-### R-002 — Scientific gate warnings surfacing
-- Status: MITIGATED
-- What was done: gate warnings are grouped by category with severity labels and surfaced as limitation sentences in the manuscript.
-- Remaining risk: categories are still coarse and can require operator review.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown
 
-### R-003 — System-validation paper shape over-promotion
-- Status: MITIGATED
-- What was done: manuscript classification now downgrades missing-baseline / missing-results / missing-richness cases.
-- Remaining risk: a structurally complete fake-mode run can still look stronger than the underlying evidence.
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: the implement-stage artifact materialization check treats run-owned execution outputs such as `metrics.json` as if they must already be present during `implement_experiments`, even though those files are supposed to be produced by `run_experiments`.
 
-### P-001 — Baseline/comparator packaging
-- Status: MITIGATED
-- What was done: `baseline_summary.json` is written by `design_experiments`; review downgrades when missing.
+- Code/test changes:
+  - Code: pending
+  - Tests: pending
 
-### P-002 — Compact quantitative result packaging
-- Status: MITIGATED
-- What was done: `result_table.json` is written by `analyze_results`; review downgrades when missing.
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending
 
-### P-003 — Related-work depth signaling
-- Status: MITIGATED
-- What was done: `analyze_papers_richness_summary.json` tracks full-text coverage and feeds readiness classification.
-- Remaining risk: full-text grounding still depends on PDF availability.
+- Follow-up risks:
+  - The same validator boundary may also incorrectly require other run-owned execution artifacts before second-stage verification.
+- Evidence/artifacts:
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/events.jsonl`
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/run_record.json`
+  - `test/outputs/lora-rank-dropout-interaction-for-mistral-7b-ins-1f46de0f/experiment/experiment.py`
 
----
+## Issue: LV-086
 
-## Historical archive
+- Status: active
+- Validation target: real `test/`-workspace governed run for the LoRA rank × dropout factorial brief after `run_experiments`
+- Environment/session context: same fresh live run `1f46de0f-5beb-4de6-a219-abf483b74101`, artifacts inspected after `run_experiments` completed and `analyze_results` paused
 
-Older fixed live-validation entries, compact archived summaries, and legacy draft items have been moved out of this main operator-facing file.
+- Reproduction steps:
+  1. Start the real run from `test/` with the governed LoRA rank/dropout brief.
+  2. Let `implement_experiments` and `run_experiments` complete.
+  3. Inspect `.autolabos/runs/<run-id>/metrics.json` and `analysis/result_table.json`.
+  4. Observe that the recorded metrics come from `mode: "preflight"` with no training or evaluation executed.
+
+- Expected behavior:
+  - `run_experiments` should not treat preflight-only environment checks as successful executed experiment evidence for this paper-scale brief.
+  - Objective evaluation should not infer research success from hardware/resource fields such as `device.gpu_count` when the stated objective is benchmark accuracy on ARC-Challenge and HellaSwag.
+
+- Actual behavior:
+  - `metrics.json` contains:
+    - `mode: "preflight"`
+    - `notes: "No training/evaluation executed..."`
+    - `primary_metric: null`
+  - `run_experiments` still completes and summarizes:
+    - `Objective metric met: device.gpu_count=2 >= 0.015`
+  - `analyze_results` then builds a results table from hardware/resource fields and pauses only later with `incomplete_results_table`.
 
-If we need to resurrect one of those older cases, use git history rather than treating them as current active work.
+- Fresh vs existing session comparison:
+  - Fresh session: reproduced in the active live run
+  - Existing session: not yet compared after this exact failure mode
+  - Divergence: unknown
+
+- Root cause hypothesis:
+  - Type: `persisted_state_bug`
+  - Hypothesis: `run_experiments` currently accepts preflight-only metrics as a successful execution artifact, and the best-effort objective matcher is willing to promote resource metrics (for example `device.gpu_count`) into the objective summary even when no task metric exists.
+
+- Code/test changes:
+  - Code: pending
+  - Tests: pending
+
+- Regression status:
+  - Automated regression test linked: no
+  - Re-validation result: pending
+
+- Follow-up risks:
+  - Even when later gates pause the workflow, the misleading “objective met” summary can contaminate operator interpretation, review context, and any quality-improvement loop that reads `paper_readiness`-adjacent artifacts.
+- Evidence/artifacts:
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/metrics.json`
+  - `test/outputs/lora-rank-dropout-interaction-for-mistral-7b-ins-1f46de0f/analysis/result_table.json`
+  - `test/.autolabos/runs/1f46de0f-5beb-4de6-a219-abf483b74101/run_record.json`
 
 ## Issue: LV-ARCHIVE-ANCHOR
 
 
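Note on LV-085 and LV-086: both entries list their code and test changes as pending. The sketch below is one possible shape for the two checks the entries describe, not the project's actual validator. The helper names, the `RUN_OWNED_OUTPUTS` set, the `"train"` mode value, and the `accuracy` metric key are assumptions made for illustration. Only the `.autolabos/runs/<run-id>/metrics.json` layout, the `mode: "preflight"` and `primary_metric: null` fields, and the `0.015` threshold come from the entries above.

```python
from pathlib import PurePosixPath

# Assumption: outputs that run_experiments (not implement_experiments) produces.
RUN_OWNED_OUTPUTS = {"metrics.json"}


def is_run_owned(path: str) -> bool:
    """True if path has the shape .autolabos/runs/<run-id>/<run-owned file>."""
    parts = PurePosixPath(path).parts
    return (
        len(parts) == 4
        and parts[0] == ".autolabos"
        and parts[1] == "runs"
        and parts[3] in RUN_OWNED_OUTPUTS
    )


def missing_public_artifacts(referenced, materialized):
    """LV-085: report only missing artifacts the implement stage owns."""
    return [
        p for p in referenced
        if p not in materialized and not is_run_owned(p)
    ]


def is_executed_evidence(metrics: dict) -> bool:
    """LV-086: preflight-only metrics do not count as executed evidence."""
    return (
        metrics.get("mode") != "preflight"
        and metrics.get("primary_metric") is not None
    )


def objective_met(metrics: dict, metric_key: str, threshold: float) -> bool:
    """Compare only the named task metric; never fall back to resource fields."""
    if not is_executed_evidence(metrics):
        return False
    value = metrics.get(metric_key)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value >= threshold
```

Under these assumptions, a referenced but not yet written `.autolabos/runs/<run-id>/metrics.json` no longer fails the implement stage, and a preflight record with `device.gpu_count=2` can no longer satisfy a `>= 0.015` accuracy objective, because the matcher reads only the metric key it was given.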
@@ -196,6 +196,8 @@ The brief is not just a startup note. It is the governed contract for a run.
 
 That makes the brief part of the audit trail, not just part of the prompt.
 
+In the current contract, `.autolabos/config.yaml` is primarily for provider/runtime defaults and workspace policy. Run-specific research intent, evidence bars, baseline expectations, manuscript-format targets, and manuscript template path belong in the brief. Persisted config may therefore omit brief-owned sections such as research defaults and some manuscript-profile or paper-template fields.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Der Brief ist nicht nur ein Startdokument. Er ist der governed contract für ein
 
 Mit anderen Worten: Der Brief ist nicht nur Teil des Prompts. Er ist Teil des Audit Trails.
 
+Im aktuellen Vertrag speichert `.autolabos/config.yaml` vor allem Provider-/Runtime-Defaults und Workspace-Policy. Die run-spezifische Forschungsabsicht, Evidence-Schwellen, Baseline-Erwartungen, Manuscript-Format-Ziele und der Pfad zum Manuscript-Template gehören dagegen in den Brief. Deshalb kann die persistierte Config `research`-Defaults sowie einige manuscript-profile- bzw. paper-template-Felder auslassen.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ El brief no es solo un documento de arranque. Es el governed contract de la corr
 
 Es decir, el brief no es solo parte del prompt. Es parte del audit trail.
 
+En el contrato actual, `.autolabos/config.yaml` guarda sobre todo valores por defecto de provider/runtime y workspace policy. La intención de investigación de cada run, los evidence bars, las expectativas de baseline, los objetivos de manuscript format y la ruta del manuscript template deben vivir en el Brief. Por eso, el config persistido puede omitir valores por defecto de `research` y algunos campos de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Le brief n’est pas seulement un document de départ. C’est le governed contr
 
 En d’autres termes, le brief n’est pas seulement une partie du prompt. Il fait partie de l’audit trail.
 
+Dans le contrat actuel, `.autolabos/config.yaml` sert surtout à stocker les valeurs par défaut du provider/runtime et la workspace policy. L’intention de recherche propre à chaque run, les evidence bars, les attentes de baseline, les objectifs de manuscript format et le chemin du manuscript template doivent vivre dans le Brief. Le config persisté peut donc omettre les valeurs par défaut de `research` ainsi que certains champs de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Brief は単なる開始文書ではありません。run の governed contract
 
 つまり brief は prompt の一部ではなく、audit trail の一部です。
 
+現在の契約では、`.autolabos/config.yaml` は主に provider/runtime の既定値と workspace policy を保持します。run ごとの research intent、evidence bar、baseline expectation、manuscript format target、manuscript template path は Brief 側で定義するのが原則です。そのため、persisted config では `research` の既定値や一部の manuscript-profile / paper-template フィールドが省略されることがあります。
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Brief는 단순한 시작 문서가 아닙니다. 한 run의 governed contract
 
 즉, brief는 prompt 일부가 아니라 audit trail의 일부입니다.
 
+현재 계약에서 `.autolabos/config.yaml`은 주로 provider/runtime 기본값과 workspace 정책을 담습니다. run별 연구 의도, evidence 기준, baseline 기대치, manuscript format 목표, manuscript template 경로는 Brief에 두는 것이 원칙입니다. 그래서 persisted config에서는 `research` 기본값이나 일부 manuscript-profile / paper-template 필드가 생략될 수 있습니다.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ O brief não é apenas um documento inicial. Ele é o governed contract do run.
 
 Em outras palavras, o brief não é apenas parte do prompt. Ele é parte do audit trail.
 
+No contrato atual, `.autolabos/config.yaml` guarda principalmente defaults de provider/runtime e workspace policy. A intenção de pesquisa específica de cada run, os evidence bars, as expectativas de baseline, os objetivos de manuscript format e o caminho do manuscript template devem ficar no Brief. Por isso, o config persistido pode omitir defaults de `research` e alguns campos de manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Brief — это не просто стартовый документ. Это g
 
 Иными словами, brief — это не просто часть prompt. Это часть audit trail.
 
+В текущем контракте `.autolabos/config.yaml` в основном хранит provider/runtime defaults и workspace policy. Исследовательское намерение для конкретного run, evidence bar, ожидания по baseline, цели manuscript format и путь к manuscript template должны задаваться в Brief. Поэтому сохранённый config может не содержать `research` defaults и часть полей manuscript-profile / paper-template.
+
 ```bash
 /new
 /brief start --latest
 
@@ -196,6 +196,8 @@ Brief 不只是启动文档。它是 run 的 governed contract。
 
 也就是说，brief 不是 prompt 的一部分，而是 audit trail 的一部分。
 
+在当前契约里，`.autolabos/config.yaml` 主要保存 provider/runtime 默认值和 workspace policy。每个 run 的研究意图、evidence 门槛、baseline 预期、manuscript format 目标以及 manuscript template 路径，原则上应放在 Brief 中。因此，持久化后的 config 可能会省略 `research` 默认值以及部分 manuscript-profile / paper-template 字段。
+
 ```bash
 /new
 /brief start --latest