
AutoLabOS

An Operating System for Autonomous Research

Autonomous research execution, not just research generation.
Governed, checkpointed, inspectable research runs from brief to manuscript.

English  ·  한국어  ·  日本語  ·  简体中文  ·  繁體中文  ·  Español  ·  Français  ·  Deutsch  ·  Português  ·  Русский

Localized README files are maintained translations of this document. The English README is updated first.



AutoLabOS is a governed operating system for research execution. It treats a run as checkpointed research state rather than a one-shot generation step.

The core loop is inspectable end to end: literature collection, hypothesis formation, experiment design, execution, analysis, figure audit, review, and manuscript drafting all produce auditable artifacts. Claims stay evidence-bounded through a claim ceiling. Review is a structural gate, not a polish pass.

Quality assumptions are turned into explicit checks. Real behavior matters more than prompt-level appearance. Reproducibility is enforced through artifacts, checkpoints, and inspectable transitions.


Why AutoLabOS Exists

Most research-agent systems are optimized around producing text. AutoLabOS is optimized around running a governed research process.

That difference matters when a project needs more than a plausible-looking draft:

  • a research brief that acts as an execution contract
  • explicit workflow gates instead of open-ended agent drift
  • checkpoints and artifacts that can be inspected after the fact
  • review that can stop weak work before manuscript generation
  • failure memory so the same failed experiment is not repeated blindly
  • evidence-bounded claims rather than prose that outruns the data

AutoLabOS is for teams that want autonomous help without giving up auditability, backtracking, or validation.


What Happens In One Run

One governed run follows the same research arc every time:

Brief.md → literature → hypothesis → experiment design → implementation → execution → analysis → figure audit → review → manuscript

In practice:

  1. /new creates or opens a research brief.
  2. /brief start --latest validates the brief, snapshots it into the run, and launches a governed run.
  3. The system moves through the fixed research workflow, checkpointing state and artifacts at each boundary.
  4. Weak evidence triggers backtracking or downgrade instead of automatic polishing.
  5. If the review gate passes, write_paper drafts a manuscript from bounded evidence.

In the current runtime, figure_audit sits between analyze_results and review so figure-quality critique can checkpoint and resume independently.

```mermaid
stateDiagram-v2
    [*] --> collect_papers
    collect_papers --> analyze_papers: complete
    analyze_papers --> generate_hypotheses: complete
    generate_hypotheses --> design_experiments: complete
    design_experiments --> implement_experiments: complete
    implement_experiments --> run_experiments: auto_handoff or complete
    run_experiments --> analyze_results: complete
    analyze_results --> figure_audit: auto_advance
    analyze_results --> implement_experiments: auto_backtrack_to_implement
    analyze_results --> design_experiments: auto_backtrack_to_design
    analyze_results --> generate_hypotheses: auto_backtrack_to_hypotheses
    figure_audit --> review: auto_advance
    review --> write_paper: auto_advance
    review --> implement_experiments: auto_backtrack_to_implement
    review --> design_experiments: auto_backtrack_to_design
    review --> generate_hypotheses: auto_backtrack_to_hypotheses
    write_paper --> [*]: auto_complete
```

All automation inside that flow runs in bounded node-internal loops. The workflow stays governed even in unattended modes.
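As a sketch of what "bounded node-internal loops" means in practice, the snippet below caps a node's retries at a hard limit and then surfaces the failure instead of drifting. The function and type names here are illustrative, not the actual AutoLabOS runtime API.

```typescript
// Illustrative sketch only: names and the attempt limit are hypothetical,
// not the real AutoLabOS API.
type NodeResult = { ok: boolean; detail: string };

// Run one workflow node with a hard attempt bound so unattended modes
// cannot loop forever inside a single node.
function runNodeBounded(
  execute: (attempt: number) => NodeResult,
  maxAttempts = 3,
): NodeResult {
  let last: NodeResult = { ok: false, detail: "not started" };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = execute(attempt);
    if (last.ok) return last; // success: hand control back to the governed graph
  }
  return last; // bound exhausted: surface the failure rather than retrying blindly
}
```

The key design point is that the loop ends in one of two auditable states: success handed back to the graph, or an explicit failure record.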


What You Get After A Run

AutoLabOS does not just emit a PDF. It emits a traceable research state.

| Output | What it contains |
|---|---|
| Literature corpus | Collected papers, BibTeX, extracted evidence store |
| Hypotheses | Literature-grounded hypotheses with skeptical review |
| Experiment plan | Governed design with contract, baseline lock, and consistency checks |
| Executed results | Metrics, objective evaluation, failure memory log |
| Result analysis | Statistical analysis, attempt decisions, transition reasoning |
| Figure audit | Figure lint, caption/reference consistency, optional vision critique summary |
| Review packet | 5-specialist panel scorecard, claim ceiling, pre-draft critique |
| Manuscript | LaTeX draft with evidence links, scientific validation, optional PDF |
| Checkpoints | Full state snapshots at every node boundary; resume anytime |

Everything lives under .autolabos/runs/<run_id>/, with public-facing outputs mirrored to outputs/.

That is the reproducibility model: artifacts, checkpoints, and inspectable transitions rather than hidden state.


Quick Start

```bash
# 1. Install and build
npm install
npm run build
npm link

# 2. Move to a research workspace
cd /path/to/your-research-workspace

# 3. Launch one interface
autolabos        # TUI
autolabos web    # Web UI
```

Typical first-use flow:

```
/new
/brief start --latest
/doctor
```

Notes:

  • Both UIs guide onboarding if .autolabos/config.yaml does not exist yet.
  • TUI and Web UI share the same runtime, artifacts, and checkpoints.

Prerequisites

| Item | When needed | Notes |
|---|---|---|
| SEMANTIC_SCHOLAR_API_KEY | Always | Paper discovery and metadata |
| OPENAI_API_KEY | When provider is api | OpenAI API model execution |
| Codex CLI login | When provider is codex | Uses your local Codex session |

Research Brief System

The brief is not just a startup note. It is the governed contract for a run.

/new creates or opens Brief.md. /brief start --latest validates it, snapshots it into the run, and starts execution from that snapshot. The run records the brief source path, the snapshot path, and any parsed manuscript format, so the provenance of the run remains inspectable even if the workspace brief changes later. An Appendix Preferences section can be structured with "Prefer appendix for:" and "Keep in main body:" lists, making appendix-routing intent explicit in the brief contract.

That makes the brief part of the audit trail, not just part of the prompt.

In practice, .autolabos/config.yaml holds provider and workspace defaults, while the brief carries run-specific research intent, evidence bars, baseline expectations, manuscript-format targets, and manuscript template path.

```
/new
/brief start --latest
```

Briefs are expected to define both research intent and governance constraints: topic, objective metric, baseline or comparator, minimum acceptable evidence, disallowed shortcuts, and the paper ceiling if evidence remains weak.

Brief sections and grading

| Section | Status | Purpose |
|---|---|---|
| ## Topic | Required | Research question in 1-3 sentences |
| ## Objective Metric | Required | Primary success metric |
| ## Constraints | Recommended | Compute budget, dataset limits, reproducibility rules |
| ## Plan | Recommended | Step-by-step experiment plan |
| ## Target Comparison | Governance | Proposed method vs. explicit baseline |
| ## Minimum Acceptable Evidence | Governance | Minimum effect size, fold count, decision boundary |
| ## Disallowed Shortcuts | Governance | Shortcuts that invalidate results |
| ## Paper Ceiling If Evidence Remains Weak | Governance | Maximum paper classification if evidence is insufficient |
| ## Manuscript Format | Optional | Column count, page budget, reference/appendix rules |

| Grade | Meaning | Paper-scale ready? |
|---|---|---|
| complete | Core complete + 4+ governance sections substantive | Yes |
| partial | Core complete + 2+ governance sections | Proceed with warnings |
| minimal | Only core sections | No |
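The grading rule in the table above can be sketched as a small function. The names are hypothetical; only the thresholds (4+ and 2+ substantive governance sections over a complete core) come from the table.

```typescript
// Sketch of the brief grading rule from the table above.
// Function and parameter names are illustrative, not the shipped API.
type BriefGrade = "complete" | "partial" | "minimal";

function gradeBrief(coreComplete: boolean, governanceCount: number): BriefGrade {
  if (!coreComplete) return "minimal"; // required sections missing or empty
  if (governanceCount >= 4) return "complete"; // paper-scale ready
  if (governanceCount >= 2) return "partial"; // proceed with warnings
  return "minimal"; // only core sections present
}
```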

Two Interfaces, One Runtime

AutoLabOS has two front ends over the same governed runtime.

| | TUI | Web UI |
|---|---|---|
| Launch | autolabos | autolabos web |
| Interaction | Slash commands, natural language | Browser dashboard and composer |
| Workflow view | Real-time node progress in terminal | Governed workflow graph with actions |
| Artifacts | CLI inspection | Inline preview for text, images, PDFs |
| Operations surfaces | /watch, /queue, /explore, /doctor | Jobs queue, live watch cards, exploration status, diagnostics |
| Best for | Fast iteration and direct control | Visual monitoring and artifact browsing |

The important constraint is that both surfaces see the same checkpoints, the same runs, and the same underlying artifacts.


What Makes AutoLabOS Different

AutoLabOS is designed around governed execution rather than prompt-only orchestration.

| | Typical research tools | AutoLabOS |
|---|---|---|
| Workflow | Open-ended agent drift | Governed fixed graph with explicit review boundaries |
| State | Ephemeral | Checkpointed, resumable, inspectable |
| Claims | As strong as the model will generate | Bounded by evidence and a claim ceiling |
| Review | Optional cleanup pass | Structural gate that can block writing |
| Failures | Forgotten and retried | Fingerprinted in failure memory |
| Interfaces | Separate code paths | TUI and Web share one runtime |

This is why the system reads more like research infrastructure than a paper generator.


Core Guarantees

Governed Workflow

The workflow is bounded and auditable. Backtracking is part of the contract. Results that do not justify forward motion are sent back to hypotheses, design, or implementation rather than polished into stronger prose.

Checkpointed Research State

Every node boundary writes state you can inspect and resume. The unit of progress is not only text output. It is a run with artifacts, transitions, and recoverable state.
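A minimal sketch of what writing a checkpoint at a node boundary could look like, assuming a JSON-per-node layout under .autolabos/runs/<run_id>/. The directory layout matches the paths mentioned in this README, but the file format and field names are assumptions.

```typescript
// Illustrative checkpoint writer; the real on-disk format is not documented here.
import * as fs from "node:fs";
import * as path from "node:path";

interface Checkpoint {
  runId: string;
  node: string;
  completedAt: string; // ISO timestamp of the node boundary
  artifacts: string[]; // artifact paths relative to the run directory
}

// Persist a full state snapshot so the run can be resumed from this boundary.
function writeCheckpoint(workspaceRoot: string, cp: Checkpoint): string {
  const dir = path.join(workspaceRoot, ".autolabos", "runs", cp.runId, "checkpoints");
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, `${cp.node}.json`);
  fs.writeFileSync(file, JSON.stringify(cp, null, 2));
  return file;
}
```

Because every boundary leaves a readable file, resuming is a matter of loading the latest snapshot rather than replaying hidden state.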

Claim Ceiling

Claims are kept under the strongest defensible evidence ceiling. The system records blocked stronger claims and the evidence gaps required to unlock them.
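The mechanics of a claim ceiling can be sketched as clamping a requested claim level to the evidence level while recording what was blocked. The level names and shape below are hypothetical; the README only specifies the behavior, not the schema.

```typescript
// Hypothetical claim-ceiling sketch; level names are illustrative.
const LEVELS = ["exploratory", "suggestive", "supported", "strong"] as const;
type Level = (typeof LEVELS)[number];

interface CeilingResult {
  allowed: Level;
  blocked: Level | null; // the stronger claim that was held back, if any
}

// Clamp a requested claim to the strongest level the evidence defends,
// recording the blocked stronger claim so the gap stays auditable.
function applyClaimCeiling(requested: Level, evidence: Level): CeilingResult {
  const req = LEVELS.indexOf(requested);
  const ev = LEVELS.indexOf(evidence);
  return req > ev
    ? { allowed: evidence, blocked: requested }
    : { allowed: requested, blocked: null };
}
```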

Review As A Structural Gate

review is not a cosmetic cleanup stage. It is where readiness, methodology sanity, evidence linkage, writing discipline, and reproducibility handoff are checked before manuscript generation.

Failure Memory

Failure fingerprints are persisted so structural errors and repeated equivalent failures are not retried blindly.
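One plausible fingerprinting scheme is to normalize volatile details out of an error message and hash the result, so equivalent failures collide. This is a sketch under that assumption; AutoLabOS's actual hashing and normalization are not documented here.

```typescript
// Sketch of failure-memory fingerprinting; the scheme here is illustrative.
import { createHash } from "node:crypto";

// Normalize volatile details (paths, numbers) so equivalent failures collide.
function fingerprint(errorMessage: string): string {
  const normalized = errorMessage
    .replace(/\/[^\s]+/g, "<path>")
    .replace(/\d+/g, "<n>")
    .toLowerCase();
  return createHash("sha256").update(normalized).digest("hex").slice(0, 12);
}

// A seen-set over fingerprints is enough to stop blind retries of one failure.
const seen = new Set<string>();
function shouldRetry(errorMessage: string): boolean {
  const fp = fingerprint(errorMessage);
  if (seen.has(fp)) return false; // structurally the same failure: do not retry blindly
  seen.add(fp);
  return true;
}
```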

Reproducibility Through Artifacts

Runs stay inspectable because the system persists artifacts, checkpoints, and transitions instead of relying on hidden state.


Quality Model

AutoLabOS makes quality checks visible during a run.

  • /doctor checks environment and workspace readiness before a run starts

Paper readiness is not a single binary prompt judgment.

  • Layer 1 - deterministic minimum gate blocks under-evidenced work with explicit artifact and evidence-integrity checks
  • Layer 2 - LLM paper-quality evaluator adds structured critique over methodology, evidence strength, writing structure, claim support, and limitations honesty
  • Review packet + specialist panel determine whether the manuscript path should advance, revise, or backtrack

paper_readiness.json can include an overall_score. It should be read as a run-quality signal inside the system, not as a universal scientific benchmark. Some advanced evaluation and self-improvement flows use that score to compare runs or candidate prompt mutations.
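The two-layer structure described above can be sketched as a hard deterministic gate followed by a score threshold. The field names and the 0.6 threshold are assumptions for illustration; only the layering (deterministic block first, LLM score second) comes from the text.

```typescript
// Illustrative two-layer readiness check; thresholds and names are assumptions.
interface ReadinessInput {
  requiredArtifactsPresent: boolean; // layer 1: deterministic minimum gate
  evidenceIntegrityOk: boolean;      // layer 1
  overallScore: number;              // layer 2: LLM evaluator output in [0, 1]
}

type ReadinessDecision = "advance" | "revise" | "blocked";

function paperReadiness(input: ReadinessInput, minScore = 0.6): ReadinessDecision {
  // Layer 1 is a hard block: no score can rescue missing artifacts
  // or broken evidence links.
  if (!input.requiredArtifactsPresent || !input.evidenceIntegrityOk) return "blocked";
  // Layer 2 is a structured signal used to advance or send work back.
  return input.overallScore >= minScore ? "advance" : "revise";
}
```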


Advanced Self-Improvement Capabilities

AutoLabOS includes bounded self-improvement paths, but they are governed by validation and rollback rather than blind autonomous rewriting.

autolabos meta-harness

autolabos meta-harness builds a context directory from recent completed runs and evaluation history under outputs/meta-harness/<timestamp>/.

It can include:

  • filtered run events
  • node artifacts such as result_analysis.json or review/decision.json
  • paper_readiness.json
  • outputs/eval-harness/history.jsonl
  • current node-prompts/ files for the targeted node

The LLM is instructed through TASK.md to return only TARGET_FILE + unified diff, and the target is constrained to node-prompts/. In apply mode, the candidate must pass validation checks; otherwise the change is rolled back and an audit log is written. --no-apply builds context only. --dry-run shows the diff without modifying files.

autolabos evolve

autolabos evolve runs a bounded mutation-and-evaluation loop over .codex and node-prompts.

  • supports --max-cycles, --target skills|prompts|all, and --dry-run
  • reads run fitness from paper_readiness.overall_score
  • can mutate prompts and skills, run validation, and compare fitness across cycles
  • rolls back regressions by restoring .codex and node-prompts from the last good git tag

This is a self-improvement path, but not an unconstrained repo-wide rewrite path.
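A bounded mutate-evaluate-rollback cycle like the one autolabos evolve describes can be sketched as follows. The mutate and evaluate functions are placeholders: in the real system, evaluation would come from something like paper_readiness.overall_score and rollback from git tags, neither of which is modeled here.

```typescript
// Sketch of a bounded mutation-and-evaluation loop; placeholders, not the real implementation.
function evolve(
  mutate: (current: string) => string,
  evaluate: (candidate: string) => number, // fitness signal, e.g. a readiness score
  initial: string,
  maxCycles: number,
): { best: string; fitness: number } {
  let best = initial;
  let fitness = evaluate(initial);
  for (let i = 0; i < maxCycles; i++) {
    const candidate = mutate(best);
    const f = evaluate(candidate);
    if (f > fitness) {
      best = candidate; // keep the improvement
      fitness = f;
    } // otherwise: "roll back" by simply not adopting the regression
  }
  return { best, fitness };
}
```

The bound on cycles and the refusal to adopt regressions are what keep this a governed self-improvement path rather than an open-ended rewrite loop.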

Harness Preset Layer

AutoLabOS also has built-in harness presets such as base, compact, failure-aware, and review-heavy. These adjust artifact/context policy, failure-memory emphasis, prompt policy, and compression strategy for comparative evaluation paths without changing the governed production workflow.


Common Commands

| Command | Description |
|---|---|
| /new | Create or open Brief.md |
| /brief start <path\|--latest> | Start research from a brief |
| /runs [query] | List or search runs |
| /resume <run> | Resume a run |
| /agent run <node> [run] | Execute from a graph node |
| /agent status [run] | Show node statuses |
| /agent overnight [run] | Run unattended with conservative bounds |
| /agent autonomous [run] | Run open-ended bounded research exploration |
| /watch | Live watch view for active runs and background jobs |
| /explore | Show exploration-engine status for the active run |
| /queue | Show running, waiting, and stalled jobs |
| /doctor | Environment and workspace diagnostics |
| /model | Switch model and reasoning effort |

Full command list

| Command | Description |
|---|---|
| /help | Show command list |
| /new | Create or open workspace Brief.md |
| /brief start <path\|--latest> | Start research from workspace Brief.md or a brief path |
| /doctor | Environment + workspace diagnostics |
| /runs [query] | List or search runs |
| /run <run> | Select run |
| /resume <run> | Resume run |
| /agent list | List graph nodes |
| /agent run <node> [run] | Execute from node |
| /agent status [run] | Show node statuses |
| /agent collect [query] [options] | Collect papers |
| /agent recollect <n> [run] | Collect additional papers |
| /agent focus <node> | Move focus with safe jump |
| /agent graph [run] | Show graph state |
| /agent resume [run] [checkpoint] | Resume from checkpoint |
| /agent retry [node] [run] | Retry node |
| /agent jump <node> [run] [--force] | Jump node |
| /agent overnight [run] | Overnight autonomy (24h) |
| /agent autonomous [run] | Open-ended autonomous research |
| /model | Model and reasoning selector |
| /approve | Approve paused node |
| /queue | Show running / waiting / stalled jobs |
| /watch | Live watch view for active runs |
| /explore | Show exploration-engine status |
| /retry | Retry current node |
| /settings | Provider and model settings |
| /quit | Exit |

Who This Is For / Not For

Good fit

  • teams that want autonomous help with a governed workflow
  • research engineering work where checkpoints and artifacts matter
  • paper-scale or paper-adjacent projects that need evidence discipline
  • environments where review, traceability, and resumability matter as much as generation

Not a good fit

  • users who only want a fast one-shot draft
  • workflows that do not need artifact trails or review gates
  • projects that want free-form agent behavior more than governed execution
  • cases where a simple literature summary tool is enough

Status

AutoLabOS is an active OSS research-engineering project. For deeper details beyond this overview, see the documents under docs.
