Add comprehensive tests for exploration policies, status, failure memory, figure audit, and auditor
- Implement tests for exploration policies including budget checks, promotion evaluations, and scoring.
- Create tests for exploration status snapshots, ensuring correct handling of enabled/disabled states and artifact persistence.
- Add integration tests for failure memory, validating entry handling and blocking logic.
- Develop figure audit node tests to verify summary generation and artifact writing.
- Introduce tests for figure auditor functionality, checking linting, caption consistency, and issue severity.
ISSUES.md (+36 -1: 36 additions & 1 deletion)
@@ -13,7 +13,8 @@ Usage rules:

 ## Current active status

-- Active live-validation defects: none currently open in this file.
+- Active live-validation defects:
+  - `LV-084` Exploration status surfaces stay globally disabled in live TUI/web even when persisted `experiment_tree/` artifacts exist.
 - Active research/paper-readiness watchlist: see `Research and paper-readiness watchlist` below.
 - Current watchlist snapshot:
   - `R-001` Result-table discipline and claim→evidence linkage — `MITIGATED`
@@ -26,6 +27,40 @@ Usage rules:

 ---

+## Active live validation issues
+
+### LV-084 — `/explore` and `/api/exploration/status` ignore persisted exploration artifacts and always report the global disabled contract
+- Status: OPEN
+- Validation target: real `test/.live` TUI `/explore` output and real web `/api/exploration/status` / bootstrap state for a run that already has `experiment_tree/tree.json`, `manager_state.json`, `baseline_lock.json`, and `figure_audit/figure_audit_summary.json`
+- Environment/session context: repo head on 2026-04-02, live fixture workspace `/home/hanyong/AutoLabOS/test/.live/autolabos-live-explore-uhei2J`, run `run-explore-live`, launched with real `node /home/hanyong/AutoLabOS/dist/cli/main.js` and `node /home/hanyong/AutoLabOS/dist/cli/main.js web --host 127.0.0.1 --port 4318`
+- Reproduction steps:
+  1. Create a real `test/.live` workspace containing a paused review run plus persisted exploration artifacts under `.autolabos/runs/run-explore-live/experiment_tree/` and `figure_audit/`.
+  2. Launch a fresh TUI rooted at that workspace and run `/explore`.
+  3. Launch a fresh web server rooted at the same workspace.
+  4. Fetch `/api/exploration/status?run_id=run-explore-live` and `/api/bootstrap`.
+- Expected behavior: because the run already has persisted exploration tree state, baseline lock, and figure audit summary, the live TUI/web operator surfaces should report an enabled exploration status snapshot with the current stage (`main_agenda`), node counts, best defensible branch, lock state, and figure-audit warnings.
+- Actual behavior: both the real TUI and real web API returned the disabled contract instead:
+  - `GET /api/exploration/status?run_id=run-explore-live` returned `{\"enabled\":false,...,\"baseline_lock_status\":\"not_applicable\"}`.
+  - `GET /api/bootstrap` still anchored to the correct active run and showed the run graph paused at `review`, so the disabled exploration result was not caused by selecting the wrong run.
+- Fresh vs existing session comparison:
+  - Fresh session: the first real TUI process showed the disabled exploration contract.
+  - Existing/reopened session: reopening the same persisted workspace in a second TUI process produced the same disabled exploration contract.
+  - Divergence: none observed; the behavior appears stable across fresh and reopened sessions.
+- Root cause hypothesis:
+  - Type: `in_memory_projection_bug`
+  - Hypothesis: `src/core/exploration/status.ts` short-circuits on `loadExplorationConfig().enabled === false`, and `loadExplorationConfig()` currently reads only the repo-default YAML (`src/config/exploration.default.yaml`) instead of any run/workspace/runtime seam. That prevents the live status surfaces from reading persisted exploration artifacts even when they exist.
+- Code/test changes: none yet; this entry records the first real live-validation reproduction after the exploration status surfaces landed.
+- Regression status:
+  - Automated regression coverage exists for the enabled path via mocked config in `tests/explorationStatus.test.ts`, `tests/newSlashCommands.test.ts`, and `web/src/App.test.tsx`.
+  - Real live revalidation result: still reproduces.
+- Follow-up risks:
+  - Operators can be misled into thinking exploration never ran, even when `experiment_tree/` and `figure_audit/` artifacts are present.
+  - The current tests only verify the enabled path through explicit config mocking, so this runtime configuration seam can drift unnoticed.
+  - Direct browser rendering of the web card was not rechecked because the Playwright navigation approval was rejected during this validation loop; the live API behavior was verified instead.
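Review note: the LV-084 root-cause hypothesis suggests the status surface should consult persisted artifacts before falling back to the disabled contract. The following is a minimal sketch of that ordering, not the actual code in `src/core/exploration/status.ts`. The names `resolveExplorationStatus`, `ExplorationStatusSnapshot`, and the `artifactExists` callback are hypothetical, and placing `baseline_lock.json` under `experiment_tree/` is an assumption; only the `enabled` and `baseline_lock_status` fields and the `not_applicable` value come from the issue text.

```typescript
// Hypothetical snapshot shape; only `enabled` and `baseline_lock_status`
// appear in the LV-084 report. `source` is added here for illustration.
interface ExplorationStatusSnapshot {
  enabled: boolean;
  source: "config" | "persisted_artifacts" | "none";
  baseline_lock_status: "locked" | "missing" | "not_applicable";
}

// Paths relative to the run directory. The lock file location is an assumption.
const TREE_ARTIFACT = "experiment_tree/tree.json";
const BASELINE_LOCK_ARTIFACT = "experiment_tree/baseline_lock.json";

// Decide the status from both the config flag and what is on disk,
// instead of short-circuiting on the config flag alone.
function resolveExplorationStatus(
  configEnabled: boolean,
  artifactExists: (relativePath: string) => boolean,
): ExplorationStatusSnapshot {
  const hasTree = artifactExists(TREE_ARTIFACT);

  // Only report the disabled contract when nothing was ever persisted.
  if (!configEnabled && !hasTree) {
    return { enabled: false, source: "none", baseline_lock_status: "not_applicable" };
  }

  return {
    enabled: true,
    source: configEnabled ? "config" : "persisted_artifacts",
    baseline_lock_status: artifactExists(BASELINE_LOCK_ARTIFACT) ? "locked" : "missing",
  };
}
```

Passing the existence check in as a callback keeps the sketch free of filesystem access, so the same function could be exercised from a unit test without mocking the config loader.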
+
+---
+
 ## Competitor-Derived Backlog

 - [ ] P1 · Environment Bootstrapping: collect a sandbox environment snapshot and inject it into the system prompt before the implement_experiments node starts
-This 9-node structure is the default top-level workflow contract and must remain stable unless an explicit contract change is made.
+The historical 9-node contract remains the architectural baseline for the research loop. `figure_audit` is the one approved post-analysis checkpoint added for independent figure-quality and vision-critique resume behavior. Beyond that deliberate checkpoint, the top-level governed workflow must remain stable unless an explicit contract change is made.

 Do not casually add, remove, reorder, or redefine top-level nodes.

@@ -22,7 +22,7 @@ A top-level workflow change is allowed only when all of the following are true:
 - safe backtracking behavior is preserved
 - the change is reflected consistently in docs, runtime behavior, and validation expectations

-Until those conditions are met, treat the 9-node workflow as fixed.
+Until those conditions are met, treat the governed workflow shape as fixed.

 ## 2) Shared runtime surfaces

@@ -158,3 +158,67 @@ When applicable, validation should confirm:
 - No broad refactor of orchestration/runtime without contract justification.
 - No speculative replacement of existing node logic.
 - No weakening of review gating, evidence discipline, or reproducibility expectations for convenience.
+
+## 11) Exploration Engine (P2-9)
+
+### Why keep the fixed 9-node graph
+
+The core value of AutoLabOS is a governed, checkpointed, inspectable workflow.
+The Exploration Engine does not replace this graph.
+No new exploration-related top-level nodes are added, with the exception of `figure_audit`.
+
+### Why the Exploration Manager is an internal coordinator
+
+Adding a new top-level node would break the existing checkpoint/resume contract.
+ExplorationManager is initialized inside the existing node handlers and persists its state in its own filesystem area (`experiment_tree/`).
+In other words, it is a bounded coordinator for the `design_experiments ~ analyze_results` span, not a separate orchestrator that bypasses `StateGraphRuntime`.
+`evidenceSerializer` blocks unexecuted items at the preceding stage so that they cannot enter as claim sources.
+The two mechanisms are independent but complementary.
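Review note: the blocking behavior described for `evidenceSerializer` can be pictured as a filter applied before claims are assembled. This is only an illustrative sketch under stated assumptions: the `EvidenceItem` shape, its `executed` flag, and the `selectClaimSources` helper are hypothetical, since the real `evidenceSerializer` interface is not shown in this diff.

```typescript
// Hypothetical evidence record; the real serializer's types are not shown here.
interface EvidenceItem {
  id: string;
  executed: boolean; // assumed flag: true only if the experiment actually ran
  summary: string;
}

// Keep only executed items, so planned-but-unrun work can never back a claim.
function selectClaimSources(items: EvidenceItem[]): EvidenceItem[] {
  return items.filter((item) => item.executed);
}
```

Filtering at serialization time means later stages never see unexecuted items at all, which is what lets this check stay independent of the other mechanism mentioned above.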
+
+### Role of the Figure Auditor
+
+`figure_audit` is a quality gate that runs after `analyze_results` completes and before review input.
+Its role is not aesthetic improvement but judging evidence alignment (`evidence_alignment`), readability, and publication readiness (`publication_readiness`).
+`empirical_validity_impact` and `publication_readiness` are stored as separate fields.
+A severe mismatch escalates the review decision to `revise` or higher.
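Review note: the escalation rule for severe mismatches can be sketched as a floor on the review decision. The decision ladder (`accept` < `revise` < `reject`), the severity scale, and the `escalateForFigureAudit` helper are assumptions made for illustration; the diff only confirms the `revise` value and the two separately stored fields `empirical_validity_impact` and `publication_readiness`.

```typescript
// Assumed ordering of review decisions; only `revise` is named in the contract text.
const DECISION_RANK = { accept: 0, revise: 1, reject: 2 } as const;
type ReviewDecision = keyof typeof DECISION_RANK;

// Hypothetical issue record. The two impact fields are kept separate,
// mirroring the contract's statement that they are stored separately.
interface FigureAuditIssue {
  figure_id: string;
  severity: "info" | "warning" | "severe"; // assumed severity scale
  empirical_validity_impact: string;
  publication_readiness: string;
}

// Raise the decision to at least `revise` when any issue is severe.
// A decision that is already stricter is left unchanged.
function escalateForFigureAudit(
  current: ReviewDecision,
  issues: FigureAuditIssue[],
): ReviewDecision {
  const hasSevere = issues.some((issue) => issue.severity === "severe");
  if (!hasSevere) {
    return current;
  }
  return DECISION_RANK[current] >= DECISION_RANK.revise ? current : "revise";
}
```

Treating the rule as a floor rather than an overwrite keeps a stricter decision intact, which matches the "`revise` or higher" wording.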
+
+### Differences from AI-Scientist-v2
+
+Similarities:
+- experiment manager
+- tree-based exploration
+- stage-based policy
+- search budget
+
+Differences:
+- AutoLabOS keeps the governed fixed graph and embeds the exploration tree inside it. `figure_audit` is a node added because Gate 3 needs an independent checkpoint; the exploration engine itself does not grow the top-level workflow.