
General Research + PubMed Pipeline

A user-facing research workflow for turning scattered papers into structured, auditable evidence.

Who This Is For

  • Research teams doing literature reviews
  • Clinical/scientific ops teams filling evidence tables
  • Analysts who need traceable quotes, not just summaries

What It Does

  • Finds relevant papers from multiple research sources
  • Prioritizes full-text evidence for core fields
  • Extracts structured values and links each value to evidence (see the sketch below)
  • Records a missing reason when a field cannot be filled
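
To make the evidence link concrete, a single filled value might be represented roughly as below. This is a minimal sketch; the field names (field, evidence, missing_reason) and the PMID are illustrative assumptions, not the pipeline's actual schema.

# Minimal sketch of an evidence-linked extraction record.
# All field names and the PMID are illustrative assumptions.
record = {
    "field": "primary_endpoint",
    "value": "overall survival",
    "evidence": {
        "quote": "The primary endpoint was overall survival.",
        "source": "PMID:12345678",           # hypothetical identifier
        "location": "Methods, paragraph 2",
    },
    "missing_reason": None,  # set to a reason string when the field cannot be filled
}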

Performance Snapshot (Measured)

Expert baseline (user-provided): a Harvard Medical School MD+PhD with pharma experience; the manual extraction workload was roughly 2 days.

| Metric | Expert manual baseline | Pipeline (measured) | Pipeline (conservative, 30 min) | Improvement |
|---|---|---|---|---|
| Time to deliver structured evidence | 2880 min (2 days) | 3.5 min | 30 min | 822x faster (measured) / 96x faster (conservative) |
| Core non-empty values | 186 | 198 | 198 | +12 (+6.5%) |
| Unique studies captured | 23 | 30 | 30 | +7 (+30.4%) |
| Traceability coverage proxy (evidence rows / core values) | 50.5% | 56.6% | 56.6% | +6.1 pp |
| Evidence mapping pass rate | N/A | 100% | 100% | Higher auditability |
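
The improvement column follows directly from the raw numbers. A quick arithmetic check, using only the values from the table above:

# Reproduce the improvement figures from the table above.
expert_minutes = 2880            # 2 days of manual extraction
measured_minutes = 3.5
conservative_minutes = 30.0

print(expert_minutes / measured_minutes)      # ~822.9 -> "822x faster (measured)"
print(expert_minutes / conservative_minutes)  # 96.0   -> "96x faster (conservative)"
print((198 - 186) / 186 * 100)                # ~6.45  -> "+12 (+6.5%)"
print((30 - 23) / 23 * 100)                   # ~30.43 -> "+7 (+30.4%)"
print(56.6 - 50.5)                            # ~6.1   -> "+6.1 pp"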

Benchmark Snapshot

Benchmark files:

  • assets/benchmark_snapshot_2026-02-09.csv
  • assets/benchmark_snapshot_2026-02-09.json
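
Both snapshots can be read with the Python standard library alone. A minimal sketch, assuming only the paths listed above; the column and key names depend on the snapshot schema:

import csv
import json

# Load the benchmark snapshot in both formats.
with open("assets/benchmark_snapshot_2026-02-09.json") as f:
    snapshot = json.load(f)

with open("assets/benchmark_snapshot_2026-02-09.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "rows")  # field names come from the CSV header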

Typical Use Cases

  • Endpoint extraction from clinical trial papers
  • Building evidence tables for internal review
  • Updating research FAQs with source-backed claims
  • Rapid gap analysis: what is known vs. what is not reported

Coverage (Current)

  • Source registry: 29 sources (shape sketched below)
  • Source families: literature, guidelines, regulatory, institutions
  • Evidence policy: full-text first for core data
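
One plausible way to picture the registry is a mapping from source family to source names. This is a hedged sketch only; the literature entries match the CLI example below, and the remaining families are elided because the real registry format is defined in the repo:

# Illustrative registry shape; not the repository's actual format.
SOURCE_REGISTRY = {
    "literature":   ["pubmed", "crossref", "semantic"],  # as in the search-multi example below
    "guidelines":   [...],    # remaining families elided
    "regulatory":   [...],
    "institutions": [...],
}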

How The Workflow Runs

  1. Define what needs to be filled and how success is measured.
  2. Retrieve candidate papers from multiple sources.
  3. Download and parse the full texts.
  4. Extract target fields and normalize terms.
  5. Output three aligned layers (see the sketch after this list):
    • Results (filled values)
    • Evidence (quote + source + location)
    • Missing reasons (why a field was left unfilled)
  6. Run quality checks for traceability and consistency.
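
A minimal sketch of the three aligned layers from step 5, written as JSONL-style rows that share a record id. The field names are assumptions for illustration, not the pipeline's schema:

# One illustrative row per output layer, aligned by record_id.
result = {"record_id": "r1", "field": "sample_size", "value": 412}

evidence = {
    "record_id": "r1",
    "quote": "A total of 412 patients were enrolled.",
    "source": "PMID:12345678",       # hypothetical identifier
    "location": "Results, Table 1",
}

missing = {
    "record_id": "r2",
    "field": "median_follow_up",
    "reason": "not reported in the full text",
}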

What You Get

  • Faster evidence extraction with less manual copy/paste
  • A clear audit trail for every filled value
  • Explicit handling of uncertainty and missing data
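
The audit-trail invariant from step 6 can be stated simply: every filled value must map to at least one evidence row. A sketch of such a check, under the same illustrative record_id schema as above (not the pipeline's actual checker):

# Sketch: enforce the "evidence mapping" invariant across two layers.
def check_traceability(results, evidence):
    evidenced = {e["record_id"] for e in evidence}
    untraced = [r["record_id"] for r in results if r["record_id"] not in evidenced]
    if untraced:
        raise ValueError(f"values without evidence: {untraced}")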

Start in 3 Steps

# 1) Retrieve candidates
python3 src/paper_hub.py search-multi \
  --query "your research question" \
  --sources pubmed,crossref,semantic

# 2) Download fulltext candidates
python3 src/paper_hub.py download-batch \
  --input-jsonl downloads/candidate_papers.jsonl \
  --output-dir downloads

# 3) Parse artifacts
python3 src/paper_hub.py parse \
  --mode bioc \
  --input-dir downloads \
  --output downloads/parsed_bioc.jsonl
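
After step 3, downloads/parsed_bioc.jsonl should hold one parsed document per line. A quick way to peek at it, assuming standard JSONL; the per-document keys depend on the BioC parse:

import json

# Print the top-level keys of the first parsed document.
with open("downloads/parsed_bioc.jsonl") as f:
    doc = json.loads(next(f))
print(sorted(doc.keys()))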

Core Docs

  • SOP: SOP_endpoint_extraction_standard.md
  • Main entrypoint: src/paper_hub.py

About

A PubMed-centric evidence operating system: multi-source retrieval, quality scoring, traceable extraction, and decision-ready Excel outputs.
