Commit 9da7c07

Add comprehensive tests for exploration policies, status, failure memory, figure audit, and auditor
- Implement tests for exploration policies including budget checks, promotion evaluations, and scoring.
- Create tests for exploration status snapshots, ensuring correct handling of enabled/disabled states and artifact persistence.
- Add integration tests for failure memory, validating entry handling and blocking logic.
- Develop figure audit node tests to verify summary generation and artifact writing.
- Introduce tests for figure auditor functionality, checking linting, caption consistency, and issue severity.
1 parent a7db4a7 commit 9da7c07

54 files changed

Lines changed: 4904 additions & 54 deletions

ISSUES.md

Lines changed: 36 additions & 1 deletion
@@ -13,7 +13,8 @@ Usage rules:
1313

1414
## Current active status
1515

16-
- Active live-validation defects: none currently open in this file.
16+
- Active live-validation defects:
17+
- `LV-084` Exploration status surfaces stay globally disabled in live TUI/web even when persisted `experiment_tree/` artifacts exist.
1718
- Active research/paper-readiness watchlist: see `Research and paper-readiness watchlist` below.
1819
- Current watchlist snapshot:
1920
- `R-001` Result-table discipline and claim→evidence linkage — `MITIGATED`
@@ -26,6 +27,40 @@ Usage rules:
2627

2728
---
2829

30+
## Active live validation issues
31+
32+
### LV-084 — `/explore` and `/api/exploration/status` ignore persisted exploration artifacts and always report the global disabled contract
33+
- Status: OPEN
34+
- Validation target: real `test/.live` TUI `/explore` output and real web `/api/exploration/status` / bootstrap state for a run that already has `experiment_tree/tree.json`, `manager_state.json`, `baseline_lock.json`, and `figure_audit/figure_audit_summary.json`
35+
- Environment/session context: repo head on 2026-04-02, live fixture workspace `/home/hanyong/AutoLabOS/test/.live/autolabos-live-explore-uhei2J`, run `run-explore-live`, launched with real `node /home/hanyong/AutoLabOS/dist/cli/main.js` and `node /home/hanyong/AutoLabOS/dist/cli/main.js web --host 127.0.0.1 --port 4318`
36+
- Reproduction steps:
37+
1. Create a real `test/.live` workspace containing a paused review run plus persisted exploration artifacts under `.autolabos/runs/run-explore-live/experiment_tree/` and `figure_audit/`.
38+
2. Launch a fresh TUI rooted at that workspace and run `/explore`.
39+
3. Launch a fresh web server rooted at the same workspace.
40+
4. Fetch `/api/exploration/status?run_id=run-explore-live` and `/api/bootstrap`.
41+
- Expected behavior: because the run already has persisted exploration tree state, baseline lock, and figure audit summary, the live TUI/web operator surfaces should report an enabled exploration status snapshot with the current stage (`main_agenda`), node counts, best defensible branch, lock state, and figure-audit warnings.
42+
- Actual behavior: both the real TUI and real web API returned the disabled contract instead:
43+
- TUI `/explore` showed `Enabled: false`, `Current Stage: n/a`, `Nodes: n/a`, `Baseline Lock: not_applicable`.
44+
- `GET /api/exploration/status?run_id=run-explore-live` returned `{"enabled":false,...,"baseline_lock_status":"not_applicable"}`.
45+
- `GET /api/bootstrap` still anchored to the correct active run and showed the run graph paused at `review`, so the disabled exploration result was not caused by selecting the wrong run.
46+
- Fresh vs existing session comparison:
47+
- Fresh session: the first real TUI process showed the disabled exploration contract.
48+
- Existing/reopened session: reopening the same persisted workspace in a second TUI process produced the same disabled exploration contract.
49+
- Divergence: none observed; the behavior appears stable across fresh and reopened sessions.
50+
- Root cause hypothesis:
51+
- Type: `in_memory_projection_bug`
52+
- Hypothesis: `src/core/exploration/status.ts` short-circuits on `loadExplorationConfig().enabled === false`, and `loadExplorationConfig()` currently reads only the repo-default YAML (`src/config/exploration.default.yaml`) instead of any run/workspace/runtime seam. That prevents the live status surfaces from reading persisted exploration artifacts even when they exist.
53+
- Code/test changes: none yet; this entry records the first real live-validation reproduction after the exploration status surfaces landed.
54+
- Regression status:
55+
- Automated regression coverage exists for the enabled path via mocked config in `tests/explorationStatus.test.ts`, `tests/newSlashCommands.test.ts`, and `web/src/App.test.tsx`.
56+
- Real live revalidation result: still reproduces.
57+
- Follow-up risks:
58+
- Operators can be misled into thinking exploration never ran, even when `experiment_tree/` and `figure_audit/` artifacts are present.
59+
- The current tests only verify the enabled path through explicit config mocking, so this runtime configuration seam can drift unnoticed.
60+
- Direct browser rendering of the web card was not rechecked because the Playwright navigation approval was rejected during this validation loop; the live API behavior was verified instead.
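The root cause hypothesis above can be illustrated with a minimal TypeScript sketch of a status resolver that consults persisted artifacts before falling back to the disabled contract. This is an illustration under stated assumptions, not the code in `src/core/exploration/status.ts`: `resolveExplorationEnabled`, `ArtifactProbe`, and `EnabledDecision` are hypothetical names, and treating `experiment_tree/tree.json` as the deciding artifact is an assumption.

```typescript
// Hypothetical sketch; none of these names are taken from the AutoLabOS source.
// The probe answers "does this path exist under the run directory?".
type ArtifactProbe = (relativePath: string) => boolean;

interface ExplorationConfig {
  enabled: boolean;
}

interface EnabledDecision {
  enabled: boolean;
  source: "config" | "persisted_artifacts" | "none";
}

// Artifact named in the LV-084 validation target, relative to the run directory.
const TREE_ARTIFACT = "experiment_tree/tree.json";

function resolveExplorationEnabled(
  config: ExplorationConfig,
  hasArtifact: ArtifactProbe,
): EnabledDecision {
  if (config.enabled) {
    return { enabled: true, source: "config" };
  }
  // Rather than returning the disabled contract immediately, check whether
  // the run already persisted an exploration tree.
  if (hasArtifact(TREE_ARTIFACT)) {
    return { enabled: true, source: "persisted_artifacts" };
  }
  return { enabled: false, source: "none" };
}
```

Injecting the probe keeps the decision testable without mocking the config loader, which is the gap the follow-up risks point at.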
61+
62+
---
63+
2964
## Competitor-Derived Backlog
3065

3166
- [ ] P1 · Environment Bootstrapping: collect a sandbox environment snapshot and inject it into the system prompt before the implement_experiments node starts

docs/architecture.md

Lines changed: 68 additions & 4 deletions
@@ -4,11 +4,11 @@ This document captures the runtime contracts that must remain stable while impro
44

55
## 1) Governed workflow contract
66

7-
AutoLabOS operates around a governed 9-node research workflow:
7+
AutoLabOS operates around a governed fixed research workflow:
88

9-
`collect_papers -> analyze_papers -> generate_hypotheses -> design_experiments -> implement_experiments -> run_experiments -> analyze_results -> review -> write_paper`
9+
`collect_papers -> analyze_papers -> generate_hypotheses -> design_experiments -> implement_experiments -> run_experiments -> analyze_results -> figure_audit -> review -> write_paper`
1010

11-
This 9-node structure is the default top-level workflow contract and must remain stable unless an explicit contract change is made.
11+
The historical 9-node contract remains the architectural baseline for the research loop. `figure_audit` is the one approved post-analysis checkpoint added for independent figure-quality and vision-critique resume behavior. Beyond that deliberate checkpoint, the top-level governed workflow must remain stable unless an explicit contract change is made.
1212

1313
Do not casually add, remove, reorder, or redefine top-level nodes.
1414

@@ -22,7 +22,7 @@ A top-level workflow change is allowed only when all of the following are true:
2222
- safe backtracking behavior is preserved
2323
- the change is reflected consistently in docs, runtime behavior, and validation expectations
2424

25-
Until those conditions are met, treat the 9-node workflow as fixed.
25+
Until those conditions are met, treat the governed workflow shape as fixed.
2626

2727
## 2) Shared runtime surfaces
2828

@@ -158,3 +158,67 @@ When applicable, validation should confirm:
158158
- No broad refactor of orchestration/runtime without contract justification.
159159
- No speculative replacement of existing node logic.
160160
- No weakening of review gating, evidence discipline, or reproducibility expectations for convenience.
161+
162+
## 11) Exploration Engine (P2-9)
163+
164+
### Why the fixed 9-node graph is kept
165+
166+
The core value of AutoLabOS is its governed, checkpointed, inspectable workflow.
167+
The Exploration Engine does not replace this graph.
168+
Apart from `figure_audit`, no new exploration-related top-level nodes are added.
169+
170+
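### Why the Exploration Manager is an internal coordinator
171+
172+
Adding a new top-level node would break the existing checkpoint/resume contract.
173+
ExplorationManager is initialized inside the existing node handlers and persists its state to its own filesystem directory (`experiment_tree/`).
174+
In other words, it is a bounded coordinator for the `design_experiments` through `analyze_results` span, not a separate orchestrator that bypasses `StateGraphRuntime`.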
175+
176+
### Where the bounded Exploration Engine plugs in
177+
178+
- `design_experiments` → initialize ExplorationManager, baseline proposal
179+
- `implement_experiments` → implement tree-node code
180+
- `run_experiments` → execute tree nodes
181+
- `analyze_results` → collect evidence, Gates 1+2 (deterministic), promotion gate, generate the writeup manifest
182+
- `figure_audit` → Gate 3 (vision LLM critique) + aggregate the full audit → `figure_audit_summary.json`
183+
- `review` → incorporate figure audit results, determine the strongest defensible branch
184+
185+
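### Why figure_audit was added as a separate node
186+
187+
Gates 1+2 are deterministic and run in under one second, so post-processing inside `analyze_results` is sufficient.
188+
Gate 3 (vision LLM) takes minutes to run, makes asynchronous LLM calls, and can time out or fail.
189+
A Gate 3 failure should not force a rerun of all of `analyze_results`, which would mix responsibilities, and the `analyze_results complete / figure_audit incomplete` state must be resumable as an independent checkpoint.
190+
When `figure_auditor.enabled=false`, the `figure_audit` node acts as a pass-through and produces the same result as the existing path.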
191+
192+
### Baseline Lock과 Single-Change Enforcement
193+
194+
A baseline lock is created when the `baseline_hardening` stage completes.
195+
After that, every branch may change only a single dimension, and only within the lock's `allowed_intervention_dimensions`.
196+
If two dimensions change at once, `singleChangeEnforcer` blocks the branch.
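
The single-change rule can be sketched as a pure comparison between the locked baseline and a branch's settings. This is a hedged TypeScript illustration, not the actual `singleChangeEnforcer`: `checkSingleChange`, the `baseline` field on the lock, and the result shape are hypothetical, and letting a zero-change branch pass is an assumption the text does not settle.

```typescript
// Hypothetical sketch; only `allowed_intervention_dimensions` comes from the text above.
type Dimensions = Record<string, unknown>;

interface BaselineLock {
  allowed_intervention_dimensions: string[];
  baseline: Dimensions; // assumed field: the locked baseline settings
}

interface SingleChangeResult {
  ok: boolean;
  changed: string[];
  reason?: string;
}

function checkSingleChange(lock: BaselineLock, branch: Dimensions): SingleChangeResult {
  const keys = Array.from(
    new Set(Object.keys(lock.baseline).concat(Object.keys(branch))),
  );
  // Values are compared by their JSON form; nested objects with different
  // key order would count as changed, which is a limitation of this sketch.
  const changed = keys
    .filter((k) => JSON.stringify(lock.baseline[k]) !== JSON.stringify(branch[k]))
    .sort();
  if (changed.length > 1) {
    return { ok: false, changed, reason: "more than one dimension changed" };
  }
  const disallowed = changed.filter(
    (k) => lock.allowed_intervention_dimensions.indexOf(k) === -1,
  );
  if (disallowed.length > 0) {
    return { ok: false, changed, reason: "dimension not allowed: " + disallowed[0] };
  }
  return { ok: true, changed };
}
```

Returning the list of changed dimensions alongside the verdict keeps the decision auditable, in line with the audit-trail requirement.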
197+
198+
### How Executed-Evidence-Only connects to the Claim Ceiling
199+
200+
The claim ceiling (`paperMinimumGate.ts`) checks claim-evidence consistency.
201+
`evidenceSerializer` acts at an earlier stage, blocking unexecuted items from entering as claim sources.
202+
The two mechanisms are independent but complementary.
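
How the two mechanisms complement each other can be sketched in TypeScript as a filter followed by a consistency check. This is a simplified stand-in, not the logic in `paperMinimumGate.ts` or `evidenceSerializer`: the `EvidenceItem` and `Claim` shapes and both function names are hypothetical.

```typescript
// Hypothetical shapes; the real serializer and gate are not shown in this commit.
interface EvidenceItem {
  id: string;
  executed: boolean; // assumed flag: true only if the experiment actually ran
}

interface Claim {
  id: string;
  evidence_ids: string[];
}

// Stage 1 (serializer-like): unexecuted items never become claim sources.
function executedEvidenceOnly(items: EvidenceItem[]): EvidenceItem[] {
  return items.filter((item) => item.executed);
}

// Stage 2 (ceiling-like): report claims that cite nothing, or cite evidence
// that did not survive stage 1.
function unsupportedClaims(claims: Claim[], evidence: EvidenceItem[]): string[] {
  const available = executedEvidenceOnly(evidence).map((item) => item.id);
  return claims
    .filter(
      (claim) =>
        claim.evidence_ids.length === 0 ||
        claim.evidence_ids.some((id) => available.indexOf(id) === -1),
    )
    .map((claim) => claim.id);
}
```

Either stage can be tested on its own, which matches the description of the mechanisms as independent.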
203+
204+
### Figure Auditor role
205+
206+
`figure_audit` is a quality gate that runs after `analyze_results` completes and before review input.
207+
Its role is not aesthetic improvement but judging evidence alignment (`evidence_alignment`), readability, and publication readiness (`publication_readiness`).
208+
`empirical_validity_impact` and `publication_readiness` are stored as separate fields.
209+
A severe mismatch escalates the review decision to `revise` or higher.
210+
211+
### Differences from AI-Scientist-v2
212+
213+
Similarities:
214+
- experiment manager
215+
- tree-based exploration
216+
- stage-based policy
217+
- search budget
218+
219+
Differences:
220+
- AutoLabOS keeps the governed fixed graph, with the exploration tree embedded inside it. `figure_audit` is a node added because Gate 3 needs an independent checkpoint; the exploration engine itself does not grow the top-level workflow.
221+
- Single-change enforcement and the baseline lock are mandatory gates.
222+
- The review gate is not a simple LLM score but a 5-specialist panel with a 2-layer structure.
223+
- Checkpointed resume and an audit trail are core requirements.
224+
- The Figure Auditor is split into a separate node so the asynchronous vision critique can be resumed independently.
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
enabled: false
2+
num_parallel_workers: 1
3+
max_nodes_per_stage: 6
4+
max_nodes_per_hypothesis: 3
5+
max_children_per_node: 2
6+
max_tree_depth: 4
7+
max_debug_depth: 2
8+
debug_probability: 0.2
9+
per_node_time_budget: 1800
10+
per_node_token_budget: 50000
11+
per_node_compute_budget: null
12+
stage_budgets:
13+
feasibility:
14+
max_nodes: 3
15+
max_time: 3600
16+
baseline_hardening:
17+
max_nodes: 4
18+
max_time: 7200
19+
main_agenda:
20+
max_nodes: 6
21+
max_time: 14400
22+
ablation:
23+
max_nodes: 4
24+
max_time: 7200
25+
promotion_thresholds:
26+
min_objective_gain: 0.01
27+
max_instability_penalty: 0.2
28+
max_confound_penalty: 0.15
29+
min_evidence_completeness: 0.8
30+
min_reproduction_runs: 2
31+
reproducibility_minimums:
32+
before_ablation: 2
33+
for_promotion: 2
34+
baseline_lock:
35+
required: true
36+
strict_hash_match: true
37+
strongest_defensible_only: true
38+
figure_auditor:
39+
enabled: true
40+
block_on_severe_mismatch: true
41+
require_caption_alignment: true
42+
require_reference_alignment: true
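
As a rough illustration of how the `promotion_thresholds` values might be applied, the TypeScript sketch below evaluates one candidate against them. The threshold names and numbers are copied from the config above; `CandidateMetrics`, `failedPromotionChecks`, and the inclusive comparisons are assumptions, not the repository's promotion logic.

```typescript
interface PromotionThresholds {
  min_objective_gain: number;
  max_instability_penalty: number;
  max_confound_penalty: number;
  min_evidence_completeness: number;
  min_reproduction_runs: number;
}

// Values copied from the `promotion_thresholds` block in the config above.
const DEFAULT_THRESHOLDS: PromotionThresholds = {
  min_objective_gain: 0.01,
  max_instability_penalty: 0.2,
  max_confound_penalty: 0.15,
  min_evidence_completeness: 0.8,
  min_reproduction_runs: 2,
};

// Assumed shape of a candidate branch's measured metrics.
interface CandidateMetrics {
  objective_gain: number;
  instability_penalty: number;
  confound_penalty: number;
  evidence_completeness: number;
  reproduction_runs: number;
}

// Returns the names of the thresholds the candidate fails; an empty list
// means every check passed. Boundary values are treated as passing.
function failedPromotionChecks(
  m: CandidateMetrics,
  t: PromotionThresholds = DEFAULT_THRESHOLDS,
): string[] {
  const failed: string[] = [];
  if (m.objective_gain < t.min_objective_gain) failed.push("min_objective_gain");
  if (m.instability_penalty > t.max_instability_penalty) failed.push("max_instability_penalty");
  if (m.confound_penalty > t.max_confound_penalty) failed.push("max_confound_penalty");
  if (m.evidence_completeness < t.min_evidence_completeness) failed.push("min_evidence_completeness");
  if (m.reproduction_runs < t.min_reproduction_runs) failed.push("min_reproduction_runs");
  return failed;
}
```

Returning the failed threshold names rather than a bare boolean leaves a record of why a branch was not promoted.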
