lhy0718
diff --git a/.gitignore b/.gitignore
Lines changed: 0 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 1 addition & 39 deletions
diff --git a/docs/README.de.md b/docs/README.de.md
Lines changed: 2 additions & 38 deletions
diff --git a/docs/README.es.md b/docs/README.es.md
Lines changed: 2 additions & 38 deletions
diff --git a/docs/README.fr.md b/docs/README.fr.md
Lines changed: 2 additions & 38 deletions
@@ -23,10 +23,6 @@ web/dist/
 # AutoLabOS runtime artifacts
 .autolabos/
 
-# Manual example / real-validation workspace
-/test/*
-!/test/smoke/
-
 # Logs / coverage
 coverage/
 .nyc_output/
 
@@ -175,7 +175,6 @@ Typical first-use flow:
 Notes:
 
 - Both UIs guide onboarding if `.autolabos/config.yaml` does not exist yet.
-- Do not run AutoLabOS from the repository root itself. Use a workspace such as `test/` or your own research workspace.
 - TUI and Web UI share the same runtime, artifacts, and checkpoints.
 
 ### Prerequisites
@@ -259,7 +258,6 @@ AutoLabOS is designed around governed execution rather than prompt-only orchestr
 | Claims | As strong as the model will generate | Bounded by evidence and a claim ceiling |
 | Review | Optional cleanup pass | Structural gate that can block writing |
 | Failures | Forgotten and retried | Fingerprinted in failure memory |
-| Validation | Secondary | First-class surface: `/doctor`, harnesses, smoke checks, live validation |
 | Interfaces | Separate code paths | TUI and Web share one runtime |
 
 This is why the system reads more like research infrastructure than a paper generator.
@@ -290,7 +288,6 @@ Failure fingerprints are persisted so structural errors and repeated equivalent
 
 ### Reproducibility Through Artifacts
 
-Reproducibility is enforced through artifacts, checkpoints, and inspectable transitions. Public-facing summaries mirror persisted run outputs rather than inventing a second source of truth.
 
 ---
 
@@ -299,9 +296,6 @@ Reproducibility is enforced through artifacts, checkpoints, and inspectable tran
 AutoLabOS treats validation surfaces as first-class.
 
 - `/doctor` checks environment and workspace readiness before a run starts
-- harness validation protects workflow, artifact, and governance contracts
-- smoke checks exist for targeted diagnostic coverage
-- live validation is used when interactive behavior matters
 
 Paper readiness is not a single binary prompt judgment.
 
@@ -311,13 +305,6 @@ Paper readiness is not a single binary prompt judgment.
 
 `paper_readiness.json` can include an `overall_score`. It should be read as a run-quality signal inside the system, not as a universal scientific benchmark. Some advanced evaluation and self-improvement flows use that score to compare runs or candidate prompt mutations.
 
-<details>
-<summary><strong>Why the validation model matters</strong></summary>
-
-Quality assumptions are turned into explicit checks. Real behavior matters more than prompt-level appearance. The intended result is not "the model wrote something convincing," but "the run can be inspected and defended."
-
-</details>
-
 ---
 
 ## Advanced Self-Improvement Capabilities
@@ -336,7 +323,7 @@ It can include:
 - `outputs/eval-harness/history.jsonl`
 - current `node-prompts/` files for the targeted node
 
-The LLM is instructed through `TASK.md` to return only `TARGET_FILE + unified diff`, and the target is constrained to `node-prompts/`. In apply mode, the candidate must pass `validate:harness`; otherwise the change is rolled back and an audit log is written. `--no-apply` builds context only. `--dry-run` shows the diff without modifying files.
+The LLM is instructed through `TASK.md` to return only `TARGET_FILE + unified diff`, and the target is constrained to `node-prompts/`. In apply mode, the candidate must pass validation checks; otherwise the change is rolled back and an audit log is written. `--no-apply` builds context only. `--dry-run` shows the diff without modifying files.
 
 ### `autolabos evolve`
 
@@ -428,29 +415,6 @@ AutoLabOS also has built-in harness presets such as `base`, `compact`, `failure-
 
 ---
 
-## Development
-
-```bash
-npm install
-npm run build
-npm test
-npm run test:web
-npm run validate:harness
-```
-
-Use the smallest honest validation set that covers the change. For interactive defects, tests are not a substitute for re-running the same TUI or Web flow when the environment allows it.
-
-Useful commands:
-
-```bash
-npm run test:watch
-npm run test:smoke:natural-collect
-npm run test:smoke:natural-collect-execute
-npm run test:smoke:all
-```
-
----
-
 ## Advanced Details
 
 <details>
@@ -574,10 +538,8 @@ outputs/<title-slug>-<run_id_prefix>/
 AutoLabOS is an active OSS research-engineering project. The canonical references for behavior and contracts are the repository docs under `docs/`, especially:
 
 - `docs/architecture.md`
-- `docs/tui-live-validation.md`
 - `docs/experiment-quality-bar.md`
 - `docs/paper-quality-bar.md`
 - `docs/reproducibility.md`
 - `docs/research-brief-template.md`
 
-If you are changing runtime behavior, treat those documents, the shipped tests, and the observable artifacts as the source of truth.
@@ -175,7 +175,7 @@ Typischer Einstieg:
 Hinweise:
 
 - wenn `.autolabos/config.yaml` fehlt, führen beide UIs durch das Onboarding
-- AutoLabOS nicht aus dem Repository-Root starten; verwende `test/` oder einen eigenen Workspace
+- AutoLabOS nicht aus dem Repository-Root starten; verwende ein separates Workspace-Verzeichnis für deine Research-Runs
 - TUI und Web UI teilen sich denselben Runtime, dieselben Artefakte und dieselben Checkpoints
 
 ### Voraussetzungen
@@ -259,7 +259,6 @@ AutoLabOS ist auf governed execution ausgelegt, nicht auf prompt-only orchestrat
 | Claims | so stark, wie das Modell sie formuliert | begrenzt durch Evidence und claim ceiling |
 | Review | optionale Cleanup-Passage | structural gate, die Schreiben stoppen kann |
 | Failures | werden vergessen und erneut versucht | mit Fingerprint in failure memory gespeichert |
-| Validation | nachrangig | `/doctor`, Harnesses, Smoke und Live Validation sind first-class |
 | Interfaces | getrennte Codepfade | TUI und Web teilen sich einen Runtime |
 
 Deshalb sollte dieses System eher als Research Infrastructure denn als Paper Generator gelesen werden.
@@ -299,9 +298,6 @@ Reproduzierbarkeit wird durch Artefakte, Checkpoints und inspectable transitions
 AutoLabOS behandelt Validation Surfaces als first-class.
 
 - `/doctor` prüft Environment und Workspace-Readiness vor dem Start eines Runs
-- Harness Validation schützt Workflow-, Artifact- und Governance-Contracts
-- Targeted Smoke Checks liefern diagnostische Regressionsabdeckung
-- wenn interaktives Verhalten wichtig ist, wird Live Validation verwendet
 
 Paper Readiness ist nicht das Ergebnis eines einzelnen Prompt-Eindrucks.
 
@@ -311,13 +307,6 @@ Paper Readiness ist nicht das Ergebnis eines einzelnen Prompt-Eindrucks.
 
 `paper_readiness.json` kann ein `overall_score` enthalten. Dieser Wert sollte als internes Run-Quality-Signal gelesen werden, nicht als allgemeiner wissenschaftlicher Benchmark. Einige fortgeschrittene evaluation / self-improvement paths nutzen ihn, um Runs oder Prompt-Mutation-Kandidaten zu vergleichen.
 
-<details>
-<summary><strong>Warum dieses Validation-Modell wichtig ist</strong></summary>
-
-Qualitätsannahmen werden in explizite Checks übersetzt. Reales Verhalten zählt mehr als Oberfläche auf Prompt-Ebene. Das Ziel ist nicht „das Modell hat etwas Überzeugendes geschrieben“, sondern „dieser Run lässt sich inspizieren und verteidigen“.
-
-</details>
-
 ---
 
 ## Fortgeschrittene Self-Improvement-Fähigkeiten
@@ -336,7 +325,7 @@ Es kann enthalten:
 - `outputs/eval-harness/history.jsonl`
 - aktuelle `node-prompts/`-Dateien für den Zielknoten
 
-Das LLM wird durch `TASK.md` darauf beschränkt, nur `TARGET_FILE + unified diff` zurückzugeben, und das Ziel ist auf `node-prompts/` begrenzt. Im Apply-Modus muss der Kandidat `validate:harness` bestehen; andernfalls erfolgt Rollback und ein Audit Log wird geschrieben. `--no-apply` erstellt nur den Context. `--dry-run` zeigt den Diff, ohne Dateien zu ändern.
+Das LLM wird durch `TASK.md` darauf beschränkt, nur `TARGET_FILE + unified diff` zurückzugeben, und das Ziel ist auf `node-prompts/` begrenzt. Im Apply-Modus muss der Kandidat validation checks bestehen; andernfalls erfolgt Rollback und ein Audit Log wird geschrieben. `--no-apply` erstellt nur den Context. `--dry-run` zeigt den Diff, ohne Dateien zu ändern.
 
 ### `autolabos evolve`
 
@@ -428,29 +417,6 @@ AutoLabOS bietet außerdem built-in harness presets wie `base`, `compact`, `fail
 
 ---
 
-## Entwicklung
-
-```bash
-npm install
-npm run build
-npm test
-npm run test:web
-npm run validate:harness
-```
-
-Wähle das kleinste Validation-Set, das die Änderung ehrlich abdeckt. Bei interaktiven Defects solltest du dich – wenn die Umgebung es erlaubt – nicht nur auf Tests verlassen, sondern denselben TUI- / Web-Flow erneut ausführen.
-
-Nützliche Befehle:
-
-```bash
-npm run test:watch
-npm run test:smoke:natural-collect
-npm run test:smoke:natural-collect-execute
-npm run test:smoke:all
-```
-
----
-
 ## Advanced Details
 
 <details>
@@ -523,12 +489,10 @@ flowchart TB
 AutoLabOS ist ein aktiv entwickeltes OSS-Research-Engineering-Projekt. Die kanonischen Referenzen für Verhalten und Contracts liegen unter `docs/`, insbesondere:
 
 - `docs/architecture.md`
-- `docs/tui-live-validation.md`
 - `docs/experiment-quality-bar.md`
 - `docs/paper-quality-bar.md`
 - `docs/reproducibility.md`
 - `docs/research-brief-template.md`
 
-Wenn du Runtime-Verhalten änderst, behandle diese Dokumente, die veröffentlichten Tests und die beobachtbaren Artefakte als source of truth.
 
 </details>
@@ -175,7 +175,7 @@ Flujo típico de primer uso:
 Notas:
 
 - si `.autolabos/config.yaml` no existe, ambas interfaces te guían en el onboarding
-- no ejecutes AutoLabOS desde la raíz del repositorio; usa `test/` u otro workspace
+- no ejecutes AutoLabOS desde la raíz del repositorio; usa un directorio de workspace separado para tu ejecución de investigación
 - TUI y Web UI comparten el mismo runtime, los mismos artifacts y los mismos checkpoints
 
 ### Requisitos previos
@@ -259,7 +259,6 @@ AutoLabOS está diseñado alrededor de governed execution, no de prompt-only orc
 | Claims | Tan fuertes como el modelo los escriba | Limitados por evidence y claim ceiling |
 | Review | Cleanup pass opcional | Structural gate que puede bloquear la escritura |
 | Failures | Se olvidan y se reintentan | Se registran con fingerprint en failure memory |
-| Validation | Secundaria | `/doctor`, harnesses, smoke y live validation son first-class |
 | Interfaces | Caminos de código separados | TUI y Web comparten un runtime |
 
 Por eso este sistema se entiende mejor como research infrastructure que como paper generator.
@@ -299,9 +298,6 @@ La reproducibilidad se impone mediante artifacts, checkpoints e inspectable tran
 AutoLabOS trata las validation surfaces como first-class.
 
 - `/doctor` comprueba environment y workspace readiness antes de iniciar un run
-- harness validation protege workflow, artifact y governance contracts
-- targeted smoke checks dan cobertura diagnóstica de regresión
-- cuando importa el comportamiento interactivo, se usa live validation
 
 Paper readiness no es una sola impresión producida por un prompt.
 
@@ -311,13 +307,6 @@ Paper readiness no es una sola impresión producida por un prompt.
 
 `paper_readiness.json` puede incluir `overall_score`. Debe leerse como una señal interna de calidad del run, no como un benchmark científico universal. Algunos caminos avanzados de evaluation / self-improvement usan esa señal para comparar runs o candidatos de prompt mutation.
 
-<details>
-<summary><strong>Por qué importa este modelo de validation</strong></summary>
-
-Los supuestos de calidad se convierten en checks explícitos. Importa más el comportamiento real que la apariencia a nivel de prompt. La meta no es “el modelo escribió algo convincente”, sino “el run se puede inspeccionar y defender”.
-
-</details>
-
 ---
 
 ## Capacidades avanzadas de Self-Improvement
@@ -336,7 +325,7 @@ Puede incluir:
 - `outputs/eval-harness/history.jsonl`
 - archivos actuales de `node-prompts/` para el nodo objetivo
 
-El LLM queda instruido por `TASK.md` para responder solo con `TARGET_FILE + unified diff`, y el target queda restringido a `node-prompts/`. En modo apply, el candidato debe pasar `validate:harness`; si falla, se hace rollback y se escribe un audit log. `--no-apply` solo genera el context. `--dry-run` muestra el diff sin cambiar archivos.
+El LLM queda instruido por `TASK.md` para responder solo con `TARGET_FILE + unified diff`, y el target queda restringido a `node-prompts/`. En modo apply, el candidato debe pasar validation checks; si falla, se hace rollback y se escribe un audit log. `--no-apply` solo genera el context. `--dry-run` muestra el diff sin cambiar archivos.
 
 ### `autolabos evolve`
 
@@ -428,29 +417,6 @@ AutoLabOS también tiene built-in harness presets como `base`, `compact`, `failu
 
 ---
 
-## Desarrollo
-
-```bash
-npm install
-npm run build
-npm test
-npm run test:web
-npm run validate:harness
-```
-
-Elige el conjunto mínimo de validation que cubra honestamente el cambio. Para defects interactivos, si el entorno lo permite, no te quedes solo con tests: vuelve a ejecutar el mismo flujo de TUI / Web.
-
-Comandos útiles:
-
-```bash
-npm run test:watch
-npm run test:smoke:natural-collect
-npm run test:smoke:natural-collect-execute
-npm run test:smoke:all
-```
-
----
-
 ## Advanced Details
 
 <details>
@@ -574,10 +540,8 @@ outputs/<title-slug>-<run_id_prefix>/
 AutoLabOS es un proyecto OSS activo de research engineering. Las referencias canónicas de comportamiento y contracts están en `docs/`, especialmente:
 
 - `docs/architecture.md`
-- `docs/tui-live-validation.md`
 - `docs/experiment-quality-bar.md`
 - `docs/paper-quality-bar.md`
 - `docs/reproducibility.md`
 - `docs/research-brief-template.md`
 
-Si cambias comportamiento de runtime, trata esos documentos, los tests publicados y los artifacts observables como source of truth.
@@ -175,7 +175,7 @@ Flux typique au premier usage :
 Notes :
 
 - si `.autolabos/config.yaml` n’existe pas, les deux interfaces guident l’onboarding
-- n’exécutez pas AutoLabOS depuis la racine du dépôt ; utilisez `test/` ou votre propre workspace
+- n’exécutez pas AutoLabOS depuis la racine du dépôt ; utilisez un répertoire de workspace séparé pour vos runs de recherche
 - le TUI et le Web UI partagent le même runtime, les mêmes artefacts et les mêmes checkpoints
 
 ### Prérequis
@@ -259,7 +259,6 @@ AutoLabOS est conçu autour de la governed execution, pas d’une prompt-only or
 | Claims | aussi fortes que le modèle les écrit | limitées par l’evidence et le claim ceiling |
 | Review | cleanup pass optionnel | structural gate capable de bloquer l’écriture |
 | Failures | oubliés puis réessayés | enregistrés avec fingerprint dans la failure memory |
-| Validation | secondaire | `/doctor`, harnesses, smoke et live validation sont first-class |
 | Interfaces | chemins de code séparés | TUI et Web partagent un seul runtime |
 
 Le système se lit donc davantage comme une research infrastructure que comme un paper generator.
@@ -299,9 +298,6 @@ La reproductibilité est imposée par les artefacts, les checkpoints et les tran
 AutoLabOS traite les validation surfaces comme first-class.
 
 - `/doctor` vérifie l’environnement et la readiness du workspace avant le démarrage d’un run
-- la harness validation protège les workflow, artifact et governance contracts
-- les targeted smoke checks fournissent une couverture diagnostique de régression
-- quand le comportement interactif compte, on utilise la live validation
 
 Le paper readiness n’est pas une simple impression produite par un prompt.
 
@@ -311,13 +307,6 @@ Le paper readiness n’est pas une simple impression produite par un prompt.
 
 `paper_readiness.json` peut inclure un `overall_score`. Cette valeur doit être comprise comme un signal interne de qualité du run, pas comme un benchmark scientifique universel. Certains chemins avancés d’evaluation / self-improvement utilisent ce signal pour comparer des runs ou des candidats de prompt mutation.
 
-<details>
-<summary><strong>Pourquoi ce modèle de validation est important</strong></summary>
-
-Les hypothèses de qualité sont transformées en checks explicites. Le comportement réel compte davantage que l’apparence au niveau du prompt. Le but n’est pas « le modèle a écrit quelque chose de convaincant », mais « ce run peut être inspecté et défendu ».
-
-</details>
-
 ---
 
 ## Capacités avancées de Self-Improvement
@@ -336,7 +325,7 @@ Il peut inclure :
 - `outputs/eval-harness/history.jsonl`
 - les fichiers `node-prompts/` actuels pour le nœud ciblé
 
-Le LLM est contraint par `TASK.md` à répondre uniquement avec `TARGET_FILE + unified diff`, et la cible est restreinte à `node-prompts/`. En mode apply, la proposition doit passer `validate:harness`; sinon elle est rollbackée et un audit log est écrit. `--no-apply` ne construit que le context. `--dry-run` affiche le diff sans modifier les fichiers.
+Le LLM est contraint par `TASK.md` à répondre uniquement avec `TARGET_FILE + unified diff`, et la cible est restreinte à `node-prompts/`. En mode apply, la proposition doit passer validation checks; sinon elle est rollbackée et un audit log est écrit. `--no-apply` ne construit que le context. `--dry-run` affiche le diff sans modifier les fichiers.
 
 ### `autolabos evolve`
 
@@ -428,29 +417,6 @@ AutoLabOS fournit aussi des built-in harness presets comme `base`, `compact`, `f
 
 ---
 
-## Développement
-
-```bash
-npm install
-npm run build
-npm test
-npm run test:web
-npm run validate:harness
-```
-
-Choisissez le plus petit ensemble de validation qui couvre honnêtement le changement. Pour un defect interactif, si l’environnement le permet, ne vous contentez pas des tests : relancez le même flux TUI / Web.
-
-Commandes utiles :
-
-```bash
-npm run test:watch
-npm run test:smoke:natural-collect
-npm run test:smoke:natural-collect-execute
-npm run test:smoke:all
-```
-
----
-
 ## Advanced Details
 
 <details>
@@ -574,10 +540,8 @@ outputs/<title-slug>-<run_id_prefix>/
 AutoLabOS est un projet OSS actif de research engineering. Les références canoniques pour le comportement et les contracts se trouvent dans `docs/`, en particulier :
 
 - `docs/architecture.md`
-- `docs/tui-live-validation.md`
 - `docs/experiment-quality-bar.md`
 - `docs/paper-quality-bar.md`
 - `docs/reproducibility.md`
 - `docs/research-brief-template.md`
 
-Si vous modifiez le comportement du runtime, traitez ces documents, les tests publiés et les observable artifacts comme source of truth.
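
Reviewer note: the apply-mode sentence that this change rewords in all four READMEs describes a validate-then-rollback contract. The candidate edit is kept only if validation passes; otherwise the file is restored and an audit log is written, and `--dry-run` leaves files untouched. A minimal Python sketch of that contract follows. It is not AutoLabOS code: `apply_candidate`, the `validate` callback, and the JSON-lines audit format are hypothetical names chosen for illustration, and the real tool may sequence these steps differently.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def apply_candidate(
    target: Path,
    new_text: str,
    validate: Callable[[], bool],
    audit_log: Path,
    dry_run: bool = False,
) -> bool:
    """Write new_text to target and keep it only if validate() passes.

    Hypothetical helper: the names and the audit-log shape are assumptions.
    Returns True when the change was applied and kept.
    """
    if dry_run:
        # Mirrors the described `--dry-run`: nothing on disk is modified.
        return False

    original = target.read_text(encoding="utf-8")
    target.write_text(new_text, encoding="utf-8")

    if validate():
        return True

    # Validation failed: restore the previous content and record the rollback.
    target.write_text(original, encoding="utf-8")
    with audit_log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"target": str(target), "result": "rolled_back"}) + "\n")
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt = Path(tmp) / "node-prompts" / "example.md"
        prompt.parent.mkdir()
        prompt.write_text("old prompt\n", encoding="utf-8")
        log = Path(tmp) / "audit.jsonl"

        # A validator that always fails forces the rollback path.
        kept = apply_candidate(prompt, "new prompt\n", lambda: False, log)
        print(kept, prompt.read_text(encoding="utf-8").strip(), log.exists())
```

Passing the validator in as a callback keeps the sketch independent of whichever concrete checks replace the `validate:harness` reference removed in this diff.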