Flowa

Variant literature assessment pipeline with AI extraction.

Flowa's interactive evidence viewer: paper list on the left, aggregated assessment with inline citations in the centre, and the source PDF with bounding-box highlights on the right.

Each citation in the aggregated assessment links back to the exact highlighted quote in the source paper's PDF.

Quickstart

Run the whole stack locally against a captured example variant. The demo ships pre-seeded with a real assessment, so the evidence viewer is populated the moment it boots — you don't have to run the pipeline yourself to explore it.

git clone https://github.com/populationgenomics/flowa.git
cd flowa
pnpm install                                       # repo root — this is a pnpm workspace
cp examples/demo/.env.example examples/demo/.env   # the demo's only config file
# Edit examples/demo/.env: uncomment exactly one provider block and fill in your key.
pnpm demo

Then open http://localhost:7700. pnpm demo builds the workspace packages and boots three services — Next.js (UI + triage API), chat-service, and a Python pipeline gateway.

Prerequisites: Node 24+ (the demo uses the built-in node:sqlite) and uv on your PATH. For the full walkthrough, demo simplifications, and how to run a live pipeline, see examples/demo/README.md.

Architecture

Flowa is a single async pipeline that processes genetic variant literature:

query → download → convert → extract → aggregate
  • Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
  • Download: Fetch PDFs from PMC (main article + supplements)
  • Convert: PDF → Markdown via anchorite (LLM-based conversion)
  • Extract: Per-paper evidence extraction via LLM
  • Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via anchorite

Papers are processed in parallel. LLM concurrency is controlled via --llm-concurrency.
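The combination of per-paper parallelism and a cap on LLM calls can be pictured with a semaphore. This is only a minimal sketch of the general pattern, not Flowa's implementation: `process_paper`, `run_all`, and the sleep that stands in for an LLM call are all hypothetical names introduced here.

```python
import asyncio


async def process_paper(doi: str, llm_slots: asyncio.Semaphore, active: list[int]) -> str:
    """Hypothetical stand-in for one LLM-backed step on one paper."""
    async with llm_slots:  # wait until an LLM slot is free
        active[0] += 1
        active[1] = max(active[1], active[0])  # track peak concurrency
        await asyncio.sleep(0.01)  # placeholder for the real LLM call
        active[0] -= 1
    return f"processed {doi}"


async def run_all(dois: list[str], llm_concurrency: int) -> tuple[list[str], int]:
    """Start every paper at once, but let only `llm_concurrency` hold a slot."""
    llm_slots = asyncio.Semaphore(llm_concurrency)
    active = [0, 0]  # [currently running, peak]
    results = await asyncio.gather(
        *(process_paper(doi, llm_slots, active) for doi in dois)
    )
    return list(results), active[1]


results, peak = asyncio.run(
    run_all([f"10.1000/example.{i}" for i in range(10)], llm_concurrency=3)
)
print(len(results), "papers processed; peak concurrent LLM calls:", peak)
```

All papers are scheduled immediately, so non-LLM work could overlap freely, while the semaphore keeps the number of simultaneous LLM calls at or below the configured limit.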

Installation

The Quickstart above needs no separate install. To use the pipeline as a library or CLI in your own project, install flowapy from PyPI, opting into the provider extras you need (one or more of anthropic, bedrock, google, openai):

pip install 'flowapy[bedrock]'
# or
uv pip install 'flowapy[bedrock,anthropic]'

The PyPI distribution is named flowapy (the flowa name was already taken), but the import, module, and CLI are all flowa. The CLI is exposed as a console script — see Configuration for credentials and storage setup.

Usage

# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123

Configuration

Environment Variables

Model settings are nested config objects: set each sub-field by appending it to the variable name with the __ delimiter (for example, __NAME selects the model). The following three variables are required:

| Variable | Description | Example |
| --- | --- | --- |
| FLOWA_STORAGE_BASE | Storage path for PDFs, extractions, results | s3://bucket, gs://bucket, file:///path |
| FLOWA_CONVERT_MODEL__NAME | LLM for PDF→Markdown conversion (anchorite), in <provider>:<model> form | bedrock:au.anthropic.claude-sonnet-4-6 |
| FLOWA_EXTRACTION_MODEL__NAME | LLM for extraction and aggregation | bedrock:au.anthropic.claude-opus-4-6 |

Optional:

| Variable | Description |
| --- | --- |
| FLOWA_CONVERT_MODEL__BEDROCK_INFERENCE_PROFILE | Bedrock application inference profile ARN for cost attribution. When set, __NAME must point at the underlying foundation model. |
| FLOWA_EXTRACTION_MODEL__BEDROCK_INFERENCE_PROFILE | Same, for the extraction/aggregation model. |
| MASTERMIND_API_TOKEN | Required when querying with --source mastermind; use --source litvar (free, no token) otherwise. |
| NCBI_API_KEY | Optional NCBI key for higher PubMed rate limits. |
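To make the __ convention concrete, the sketch below groups prefixed variables into nested dictionaries using only the standard library. It resembles what nested-delimiter settings loaders (such as pydantic-settings with `env_nested_delimiter`) do, but whether Flowa loads its settings that way is an assumption; `nested_settings` is a hypothetical helper, not part of Flowa.

```python
def nested_settings(
    environ: dict[str, str], prefix: str = "FLOWA_", delimiter: str = "__"
) -> dict:
    """Group PREFIX_A__B=value variables into {"a": {"b": value}}."""
    settings: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated variables such as AWS_REGION
        *parents, leaf = key[len(prefix):].lower().split(delimiter)
        node = settings
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[leaf] = value
    return settings


example_env = {
    "FLOWA_STORAGE_BASE": "file:///tmp/flowa",
    "FLOWA_CONVERT_MODEL__NAME": "bedrock:au.anthropic.claude-sonnet-4-6",
    "FLOWA_EXTRACTION_MODEL__NAME": "bedrock:au.anthropic.claude-opus-4-6",
    "AWS_REGION": "ap-southeast-2",
}
settings = nested_settings(example_env)
print(settings["convert_model"]["name"])
```

Under this reading, FLOWA_CONVERT_MODEL__BEDROCK_INFERENCE_PROFILE would simply add a second key next to name inside the convert_model object.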

LLM Providers

Model names use the pydantic-ai <provider>:<model> format. Examples:

  • AWS Bedrock: bedrock:au.anthropic.claude-sonnet-4-6 (convert), bedrock:au.anthropic.claude-opus-4-6 (extraction)
  • Google Gemini: google-gla:gemini-3-pro
  • OpenAI: openai:gpt-5.2

Provider credentials:

| Provider | Required Variables |
| --- | --- |
| AWS Bedrock | AWS_PROFILE + AWS_REGION, or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION |
| Google Gemini | GOOGLE_API_KEY |
| OpenAI | OPENAI_API_KEY |
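A pre-flight check can catch missing credentials before a long run starts. The sketch below encodes the credential table above as data and reports what is missing for a given model name; `split_model_name` and `missing_credentials` are hypothetical helpers written for illustration, and the provider prefixes are taken from the example model names in this section.

```python
# Credential sets from the table above; any one complete set is sufficient.
REQUIRED = {
    "bedrock": [
        {"AWS_PROFILE", "AWS_REGION"},
        {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"},
    ],
    "google-gla": [{"GOOGLE_API_KEY"}],
    "openai": [{"OPENAI_API_KEY"}],
}


def split_model_name(name: str) -> tuple[str, str]:
    """Split '<provider>:<model>' on the first colon."""
    provider, sep, model = name.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected <provider>:<model>, got {name!r}")
    return provider, model


def missing_credentials(name: str, environ: dict[str, str]) -> set[str]:
    """Return the variables still needed for the closest-to-complete credential set."""
    provider, _ = split_model_name(name)
    options = REQUIRED.get(provider)
    if options is None:
        raise ValueError(f"no credential rule known for provider {provider!r}")
    present = {key for key, value in environ.items() if value}
    return min((option - present for option in options), key=len)


print(missing_credentials("bedrock:au.anthropic.claude-sonnet-4-6", {"AWS_PROFILE": "dev"}))
```

An empty set means at least one credential set is complete. In practice the second argument would be `os.environ`.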

Storage Backends

| Backend | FLOWA_STORAGE_BASE | Additional Variables |
| --- | --- | --- |
| AWS S3 | s3://bucket-name | AWS credentials (see above) |
| Google Cloud Storage | gs://bucket-name | GOOGLE_APPLICATION_CREDENTIALS or workload identity |
| S3-compatible (MinIO) | s3://bucket-name | FSSPEC_S3_ENDPOINT_URL, FSSPEC_S3_KEY, FSSPEC_S3_SECRET |
| Local filesystem | file:///path | |
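The URL scheme in FLOWA_STORAGE_BASE is what selects the backend. As a rough illustration, the sketch below splits a storage base into its parts with the standard library; `describe_storage_base` is a hypothetical helper, and treating anything after the bucket name as a key prefix is an assumption rather than documented Flowa behaviour.

```python
from urllib.parse import urlsplit


def describe_storage_base(base: str) -> dict[str, str]:
    """Break a storage base URL into scheme plus bucket/prefix or local path."""
    parts = urlsplit(base)
    if parts.scheme in ("s3", "gs"):
        if not parts.netloc:
            raise ValueError(f"missing bucket name in {base!r}")
        # Assumption: a path after the bucket acts as a key prefix.
        return {
            "scheme": parts.scheme,
            "bucket": parts.netloc,
            "prefix": parts.path.lstrip("/"),
        }
    if parts.scheme == "file":
        return {"scheme": "file", "path": parts.path}
    raise ValueError(f"unsupported storage scheme in {base!r}")


for base in ("s3://bucket-name", "gs://bucket-name", "file:///path"):
    print(describe_storage_base(base))
```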

Prompt Customization

Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.

| Variable | Description | Default |
| --- | --- | --- |
| FLOWA_PROMPT_SET | Name of the prompt set directory to use | generic |

Prompt Set Structure

prompts/{prompt_set}/
├── extraction_prompt.txt      # Prompt template for individual paper extraction
├── extraction_schema.py       # Pydantic model defining ExtractionResult
├── aggregation_prompt.txt     # Prompt template for cross-paper aggregation
└── aggregation_schema.py      # Pydantic model defining AggregationResult

Interface Requirements

Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:

extraction_schema.py must define ExtractionResult with:

  • evidence[].citations[].quote (str) — verbatim quote from the paper

aggregation_schema.py must define AggregationResult with:

  • results[].citations[].paper_id (str) — paper identifier
  • results[].citations[].quote (str) — verbatim quote resolved to PDF bounding boxes

All other fields can be customized freely. See prompts/generic/ for the default implementation.
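As an illustration of the extraction interface, a custom extraction_schema.py could look roughly like the sketch below. Only the evidence[].citations[].quote path comes from the requirements above; the class names `Citation` and `EvidenceItem` and the `summary` field are hypothetical placeholders for site-specific content, and prompts/generic/ remains the authoritative reference.

```python
from pydantic import BaseModel, ValidationError


class Citation(BaseModel):
    quote: str  # required by Flowa: verbatim quote from the paper


class EvidenceItem(BaseModel):
    summary: str  # hypothetical site-specific field, free to rename or drop
    citations: list[Citation]


class ExtractionResult(BaseModel):
    evidence: list[EvidenceItem]


valid = ExtractionResult(
    **{
        "evidence": [
            {
                "summary": "Reduced enzyme activity in patient fibroblasts",
                "citations": [{"quote": "activity was markedly reduced"}],
            }
        ]
    }
)

try:
    # A citation without a quote violates the required interface.
    ExtractionResult(**{"evidence": [{"summary": "x", "citations": [{}]}]})
except ValidationError as error:
    print("rejected:", len(error.errors()), "error(s)")
```

An aggregation schema would follow the same shape, with results[].citations[] carrying both paper_id and quote.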

Citation Format

The pipeline uses a unified citation format:

[display text](#cite:paperId "verbatim quote to highlight")
  • paperId = AuthorYear label (e.g., Smith2024) from paper_id_mapping
  • The title attribute carries a verbatim quote that scopes the PDF highlight
  • Display text is free-form

During aggregation, quotes are resolved against each paper's source PDF (via anchorite.PdfIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
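For consumers of the aggregated text, the citation links can be pulled out with a regular expression. The sketch below is a minimal parser for the format described above and is not Flowa's own code: `CitationLink` and `find_citations` are hypothetical names, and it assumes quotes never contain a double-quote character, since escaping rules are not specified here.

```python
import re
from typing import NamedTuple


class CitationLink(NamedTuple):
    display: str
    paper_id: str
    quote: str


# [display text](#cite:paperId "verbatim quote")
CITATION_RE = re.compile(
    r'\[(?P<display>[^\]]+)\]'        # [display text]
    r'\(#cite:(?P<paper_id>[^\s")]+)'  # (#cite:paperId
    r'\s+"(?P<quote>[^"]*)"\)'         #  "verbatim quote")
)


def find_citations(markdown: str) -> list[CitationLink]:
    """Return every citation link found in the text, in order of appearance."""
    return [
        CitationLink(m["display"], m["paper_id"], m["quote"])
        for m in CITATION_RE.finditer(markdown)
    ]


text = (
    'Reduced activity was reported '
    '([Smith2024](#cite:Smith2024 "enzyme activity was 5% of wild type")).'
)
for link in find_citations(text):
    print(link.paper_id, "->", link.quote)
```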

Storage Layout

papers/{encoded_doi}/
  source.pdf              # Downloaded PDF
  markdown.md             # LLM-generated Markdown
  metadata.json           # PubMed metadata (title, authors, date, etc.)

assessments/{variant_id}/
  workflow.json            # Pipeline run metadata
  variant_details.json     # VariantValidator output
  query.json               # Query results (DOI list)
  aggregation.json         # Aggregated assessment with pre-resolved bboxes
  aggregation_raw.json     # Raw LLM conversation
  extractions/
    {encoded_doi}.json     # Per-paper extraction (quotes + commentary)
    {encoded_doi}_raw.json # Raw LLM conversation
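When reading results back from storage, it can help to build these keys in one place. The sketch below mirrors the layout above with two hypothetical helpers, `paper_paths` and `extraction_path`. The DOI encoding scheme is not described here, so both take an already-encoded DOI rather than guessing at it.

```python
import posixpath


def paper_paths(base: str, encoded_doi: str) -> dict[str, str]:
    """Locations of the per-paper artefacts under papers/{encoded_doi}/."""
    root = posixpath.join(base.rstrip("/"), "papers", encoded_doi)
    return {
        "pdf": posixpath.join(root, "source.pdf"),
        "markdown": posixpath.join(root, "markdown.md"),
        "metadata": posixpath.join(root, "metadata.json"),
    }


def extraction_path(base: str, variant_id: str, encoded_doi: str, raw: bool = False) -> str:
    """Location of a per-paper extraction (or its raw LLM conversation)."""
    suffix = "_raw.json" if raw else ".json"
    return posixpath.join(
        base.rstrip("/"), "assessments", variant_id, "extractions", encoded_doi + suffix
    )


# "ENCODED_DOI" is a placeholder for whatever encoding the pipeline produces.
print(paper_paths("s3://bucket", "ENCODED_DOI")["pdf"])
print(extraction_path("file:///tmp/flowa", "VAR123", "ENCODED_DOI", raw=True))
```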

Development

This repo is a polyglot monorepo: a Python pipeline under src/flowa/, TypeScript packages under packages/, and worked examples under examples/. Each piece has its own dependency closure, and each Python project (the library and examples/demo-gateway/) is an independent uv project. Running pytest from the repo root would walk into the sibling project and fail on its venv-specific imports — always run pytest from the project that owns the tests, scoping it to the local tests/ directory:

# Library tests
uv run pytest tests/

# Demo-gateway tests
cd examples/demo-gateway && uv run pytest tests/

The TypeScript packages and examples share one pnpm workspace, so type checking and tests each run as a single recursive invocation:

pnpm -r typecheck
pnpm -r test

Lint and format checks are unified under pre-commit; CI runs the same hooks, so local and CI behaviour match:

uv run pre-commit run --all-files

Releasing

Bump [project].version in pyproject.toml, commit, then push a matching tag:

git tag flowapy-v0.1.0
git push origin flowapy-v0.1.0

The tag-driven workflow (.github/workflows/release-flowapy.yaml) builds the package and publishes to PyPI via OIDC trusted publishing. The pypi GitHub environment requires manual approval before the publish step runs.

Deployment

Local Development

Run the pipeline directly from a checkout (the Quickstart demo wraps this with a UI):

export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL__NAME=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL__NAME=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

Docker

docker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
  -e FLOWA_STORAGE_BASE=s3://bucket \
  -e FLOWA_CONVERT_MODEL__NAME=bedrock:au.anthropic.claude-sonnet-4-6 \
  -e FLOWA_EXTRACTION_MODEL__NAME=bedrock:au.anthropic.claude-opus-4-6 \
  -e AWS_REGION=ap-southeast-2 \
  flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

AWS Batch

Create a job definition with the flowa container image. A typical run processes up to 50 papers, with LLM calls for conversion, extraction, and aggregation, so set a generous timeout and a retry strategy. The example below allows two attempts of up to one hour (3600 seconds) each:

aws batch register-job-definition \
  --job-definition-name flowa-worker \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
    "resourceRequirements": [
      {"type": "VCPU", "value": "2"},
      {"type": "MEMORY", "value": "8192"}
    ],
    "environment": [
      {"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
      {"name": "FLOWA_CONVERT_MODEL__NAME", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
      {"name": "FLOWA_EXTRACTION_MODEL__NAME", "value": "bedrock:au.anthropic.claude-opus-4-6"}
    ]
  }' \
  --retry-strategy '{"attempts": 2}' \
  --timeout '{"attemptDurationSeconds": 3600}'

Submit a job:

aws batch submit-job \
  --job-name "flowa-VAR123" \
  --job-definition flowa-worker \
  --job-queue flowa-queue \
  --container-overrides '{
    "command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
  }'
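When submitting many variants, generating the arguments programmatically avoids hand-escaping HGVS strings inside JSON. The sketch below only builds the argument list for the same aws batch submit-job call shown above; `submit_job_argv` is a hypothetical helper, and the job definition and queue names are the example values from this section.

```python
import json


def submit_job_argv(
    variant_id: str,
    gene: str,
    hgvs_c: str,
    source: str = "litvar",
    job_definition: str = "flowa-worker",
    job_queue: str = "flowa-queue",
) -> list[str]:
    """Build the argv for `aws batch submit-job` without running it."""
    command = [
        "run",
        "--variant-id", variant_id,
        "--gene", gene,
        "--hgvs-c", hgvs_c,
        "--source", source,
    ]
    return [
        "aws", "batch", "submit-job",
        "--job-name", f"flowa-{variant_id}",
        "--job-definition", job_definition,
        "--job-queue", job_queue,
        "--container-overrides", json.dumps({"command": command}),
    ]


argv = submit_job_argv("VAR123", "GAA", "NM_000152.5:c.2238G>C")
print(argv[-1])
```

Passing the resulting list to `subprocess.run(argv, check=True)` bypasses the shell, so characters such as > in the HGVS string need no quoting.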
