Flowa

Variant literature assessment pipeline with AI extraction.

Each citation in the aggregated assessment links back to the exact highlighted quote in the source paper's PDF.

Quickstart

Run the whole stack locally against a captured example variant. The demo ships pre-seeded with a real assessment, so the evidence viewer is populated the moment it boots — you don't have to run the pipeline yourself to explore it.

git clone https://github.com/populationgenomics/flowa.git
cd flowa
pnpm install                                       # repo root — this is a pnpm workspace
cp examples/demo/.env.example examples/demo/.env   # the demo's only config file
# Edit examples/demo/.env: uncomment exactly one provider block and fill in your key.
pnpm demo

Then open http://localhost:7700. pnpm demo builds the workspace packages and boots three services — Next.js (UI + triage API), chat-service, and a Python pipeline gateway.

Prerequisites: Node 24+ (the demo uses the built-in node:sqlite) and uv on your PATH. For the full walkthrough, demo simplifications, and how to run a live pipeline, see examples/demo/README.md.

Architecture

Flowa is a single async pipeline that processes genetic variant literature:

query → download → convert → extract → aggregate

Query: Search Mastermind/LitVar for papers, resolve PMIDs to DOIs via PubMed
Download: Fetch PDFs from PMC (main article + supplements)
Convert: PDF → Markdown via anchorite (LLM-based conversion)
Extract: Per-paper evidence extraction via LLM
Aggregate: Cross-paper synthesis via LLM, resolving citation quotes to PDF bounding boxes via anchorite

Papers are processed in parallel; the --llm-concurrency option controls how many LLM calls may run at once.
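
The bounded fan-out described above can be pictured with a semaphore. This is a minimal sketch, not Flowa's actual implementation: `fake_llm_call`, `process_paper`, `process_all`, and the `llm_concurrency` parameter are hypothetical names standing in for the real stages and for the value passed to --llm-concurrency.

```python
import asyncio


async def fake_llm_call(doi: str) -> str:
    """Hypothetical stand-in for an LLM request; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)
    return f"extracted:{doi}"


async def process_paper(doi: str, llm_slots: asyncio.Semaphore) -> str:
    """Process one paper, holding an LLM slot only while the call is in flight."""
    async with llm_slots:
        return await fake_llm_call(doi)


async def process_all(dois: list[str], llm_concurrency: int) -> list[str]:
    """Start every paper at once; the semaphore caps simultaneous LLM calls."""
    llm_slots = asyncio.Semaphore(llm_concurrency)
    return await asyncio.gather(*(process_paper(doi, llm_slots) for doi in dois))


# Placeholder DOIs, used only to exercise the sketch.
results = asyncio.run(
    process_all(["10.1000/a", "10.1000/b", "10.1000/c"], llm_concurrency=2)
)
print(results)
```

All papers are scheduled immediately, but at most `llm_concurrency` of them hold an LLM slot at any moment, so download-style work could still overlap freely while model calls stay throttled.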

Installation

The Quickstart above needs no separate install. To use the pipeline as a library or CLI in your own project, install flowapy from PyPI, opting into the provider extras you need (one or more of anthropic, bedrock, google, openai):

pip install 'flowapy[bedrock]'
# or
uv pip install 'flowapy[bedrock,anthropic]'

The PyPI distribution is named flowapy (the flowa name was already taken), but the import package and the CLI are both named flowa. The CLI is exposed as a console script; see Configuration for credentials and storage setup.

Usage

# Full pipeline
flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

# Individual steps (for debugging)
flowa query --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar
flowa download --doi '10.1038/s41586-020-2308-7'
flowa convert --doi '10.1038/s41586-020-2308-7'
flowa extract --variant-id VAR123 --doi '10.1038/s41586-020-2308-7'
flowa aggregate --variant-id VAR123

Configuration

Environment Variables

The model settings are nested config objects: set their sub-fields with the __ delimiter (for example, __NAME selects the model). The following three variables are required:

Variable	Description	Example
`FLOWA_STORAGE_BASE`	Storage path for PDFs, extractions, results	`s3://bucket`, `gs://bucket`, `file:///path`
`FLOWA_CONVERT_MODEL__NAME`	LLM for PDF→Markdown conversion (anchorite), in `<provider>:<model>` form	`bedrock:au.anthropic.claude-sonnet-4-6`
`FLOWA_EXTRACTION_MODEL__NAME`	LLM for extraction and aggregation	`bedrock:au.anthropic.claude-opus-4-6`
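
To make the `__` convention concrete, the sketch below folds environment variables into nested dictionaries. It is an illustration only: `nest_settings` is a hypothetical helper, and the lowercasing of keys is an assumption made for readability, not a statement about how Flowa loads its settings.

```python
def nest_settings(
    environ: dict[str, str], prefix: str = "FLOWA_", delimiter: str = "__"
) -> dict:
    """Fold PREFIX_A__B=value entries into {"a": {"b": value}} (illustrative only)."""
    settings: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # e.g. AWS_REGION is not a Flowa setting
        path = key[len(prefix):].lower().split(delimiter)
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating nested objects
        node[path[-1]] = value
    return settings


example_env = {
    "FLOWA_STORAGE_BASE": "file:///tmp/flowa",
    "FLOWA_CONVERT_MODEL__NAME": "bedrock:au.anthropic.claude-sonnet-4-6",
    "FLOWA_EXTRACTION_MODEL__NAME": "bedrock:au.anthropic.claude-opus-4-6",
    "AWS_REGION": "ap-southeast-2",
}
settings = nest_settings(example_env)
print(settings["convert_model"]["name"])
```

A single underscore stays part of the field name (`convert_model`), while the double underscore steps into the nested object, which is why `FLOWA_CONVERT_MODEL__NAME` ends up as the `name` field of `convert_model`.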

Optional:

Variable	Description
`FLOWA_CONVERT_MODEL__BEDROCK_INFERENCE_PROFILE`	Bedrock application inference profile ARN for cost attribution. When set, `__NAME` must point at the underlying foundation model.
`FLOWA_EXTRACTION_MODEL__BEDROCK_INFERENCE_PROFILE`	Same, for the extraction/aggregation model.
`MASTERMIND_API_TOKEN`	Required when querying with `--source mastermind`; use `--source litvar` (free, no token) otherwise.
`NCBI_API_KEY`	Optional NCBI key for higher PubMed rate limits.

LLM Providers

Model names use the pydantic-ai `<provider>:<model>` format. Examples:

AWS Bedrock: bedrock:au.anthropic.claude-sonnet-4-6 (convert), bedrock:au.anthropic.claude-opus-4-6 (extraction)
Google Gemini: google-gla:gemini-3-pro
OpenAI: openai:gpt-5.2

Provider credentials:

Provider	Required Variables
AWS Bedrock	`AWS_PROFILE` + `AWS_REGION`, or `AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `AWS_REGION`
Google Gemini	`GOOGLE_API_KEY`
OpenAI	`OPENAI_API_KEY`
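
A preflight check built from the table above might look like the following. This is a sketch under assumptions: `CREDENTIAL_SETS` and `has_credentials` are hypothetical names, the provider prefixes are taken from the example model names in this section, and real SDKs may accept other credential sources (such as instance roles) that this check does not know about.

```python
# Credential alternatives per provider prefix, transcribed from the table above.
# Each inner tuple is one complete set; any one satisfied set is enough.
CREDENTIAL_SETS: dict[str, list[tuple[str, ...]]] = {
    "bedrock": [
        ("AWS_PROFILE", "AWS_REGION"),
        ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION"),
    ],
    "google-gla": [("GOOGLE_API_KEY",)],
    "openai": [("OPENAI_API_KEY",)],
}


def has_credentials(model_name: str, environ: dict[str, str]) -> bool:
    """Return True if environ satisfies one credential set for the model's provider."""
    # Split on the first colon only, so any later colon stays in the model part.
    provider, _, model = model_name.partition(":")
    if not model:
        raise ValueError(f"expected '<provider>:<model>', got {model_name!r}")
    if provider not in CREDENTIAL_SETS:
        raise KeyError(f"no credential rule listed for provider {provider!r}")
    return any(
        all(environ.get(var) for var in group) for group in CREDENTIAL_SETS[provider]
    )


print(has_credentials("openai:gpt-5.2", {"OPENAI_API_KEY": "example-key"}))
```

Running a check like this before a long pipeline run can surface a missing variable early rather than partway through conversion.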

Storage Backends

Backend	`FLOWA_STORAGE_BASE`	Additional Variables
AWS S3	`s3://bucket-name`	AWS credentials (see above)
Google Cloud Storage	`gs://bucket-name`	`GOOGLE_APPLICATION_CREDENTIALS` or workload identity
S3-compatible (MinIO)	`s3://bucket-name`	`FSSPEC_S3_ENDPOINT_URL`, `FSSPEC_S3_KEY`, `FSSPEC_S3_SECRET`
Local filesystem	`file:///path`	—
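
The backend is selected by the URL scheme of `FLOWA_STORAGE_BASE`. The sketch below shows that mapping with the standard library only; `BACKENDS` and `describe_storage_base` are hypothetical names, and the code does not reflect how Flowa itself opens storage.

```python
from urllib.parse import urlsplit

# Scheme-to-backend mapping, transcribed from the table above.
BACKENDS = {
    "s3": "AWS S3 or S3-compatible",
    "gs": "Google Cloud Storage",
    "file": "Local filesystem",
}


def describe_storage_base(storage_base: str) -> tuple[str, str]:
    """Return (backend label, bucket-or-path) for a FLOWA_STORAGE_BASE value."""
    parts = urlsplit(storage_base)
    if parts.scheme not in BACKENDS:
        raise ValueError(f"unsupported storage scheme in {storage_base!r}")
    # file:///path has an empty netloc; bucket URLs carry the bucket in netloc.
    location = parts.path if parts.scheme == "file" else parts.netloc + parts.path
    return BACKENDS[parts.scheme], location


print(describe_storage_base("s3://bucket-name"))
print(describe_storage_base("file:///tmp/flowa"))
```

Note that AWS S3 and MinIO share the `s3://` scheme; per the table, they differ only in the additional `FSSPEC_S3_*` variables.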

Prompt Customization

Flowa supports site-specific prompt sets. Each prompt set is a directory under prompts/ containing prompt templates and Pydantic schema modules.

Variable	Description	Default
`FLOWA_PROMPT_SET`	Name of the prompt set directory to use	`generic`

Prompt Set Structure

prompts/{prompt_set}/
├── extraction_prompt.txt      # Prompt template for individual paper extraction
├── extraction_schema.py       # Pydantic model defining ExtractionResult
├── aggregation_prompt.txt     # Prompt template for cross-paper aggregation
└── aggregation_schema.py      # Pydantic model defining AggregationResult

Interface Requirements

Schema modules must define Pydantic models with specific fields that Flowa's validation logic depends on:

extraction_schema.py must define ExtractionResult with:

evidence[].citations[].quote (str) — verbatim quote from the paper

aggregation_schema.py must define AggregationResult with:

results[].citations[].paper_id (str) — paper identifier
results[].citations[].quote (str) — verbatim quote resolved to PDF bounding boxes

All other fields can be customized freely. See prompts/generic/ for the default implementation.
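
As a rough illustration of the required fields, a custom prompt set's schemas could look like the sketch below. Both models are shown in one listing for brevity, although they would live in extraction_schema.py and aggregation_schema.py respectively. Only `evidence`, `results`, `citations`, `quote`, and `paper_id` come from the requirements above; the helper class names and the `summary` and `category` fields are hypothetical, and prompts/generic/ remains the authoritative reference.

```python
from pydantic import BaseModel


class Citation(BaseModel):
    quote: str  # required: verbatim quote from the paper


class EvidenceItem(BaseModel):
    summary: str  # hypothetical site-specific field
    citations: list[Citation]  # required path: evidence[].citations[].quote


class ExtractionResult(BaseModel):
    evidence: list[EvidenceItem]


class AggregateCitation(BaseModel):
    paper_id: str  # required: paper identifier
    quote: str  # required: verbatim quote, later resolved to bounding boxes


class AggregateItem(BaseModel):
    category: str  # hypothetical site-specific field
    citations: list[AggregateCitation]  # required path: results[].citations[]


class AggregationResult(BaseModel):
    results: list[AggregateItem]


# Placeholder values, used only to show that the nested structure validates.
extraction = ExtractionResult(
    evidence=[{"summary": "example summary", "citations": [{"quote": "example quote"}]}]
)
aggregation = AggregationResult(
    results=[
        {
            "category": "example category",
            "citations": [{"paper_id": "Smith2024", "quote": "example quote"}],
        }
    ]
)
print(extraction.evidence[0].citations[0].quote)
```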

Citation Format

The pipeline uses a unified citation format:

[display text](#cite:paperId "verbatim quote to highlight")

paperId = AuthorYear label (e.g., Smith2024) from paper_id_mapping
The title attribute carries a verbatim quote that scopes the PDF highlight
Display text is free-form

During aggregation, quotes are resolved against each paper's source PDF (via anchorite.PdfIndex) to produce bounding box coordinates. The aggregate output contains pre-resolved bboxes arrays for each citation. Quotes that cannot be resolved get empty bboxes.
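
For consumers of the aggregate text, the citation format above can be parsed with a regular expression. This is a minimal sketch, not Flowa's parser: `CITATION_RE` and `find_citations` are hypothetical names, the sample sentence is invented for illustration, and the pattern assumes the quote itself contains no double-quote characters.

```python
import re

# [display text](#cite:paperId "verbatim quote to highlight")
CITATION_RE = re.compile(
    r'\[(?P<text>[^\]]+)\]'           # display text
    r'\(#cite:(?P<paper_id>[^\s")]+)'  # paperId, e.g. Smith2024
    r'\s+"(?P<quote>[^"]*)"\)'        # quote carried in the title attribute
)


def find_citations(markdown: str) -> list[dict[str, str]]:
    """Return display text, paper ID, and quote for each citation link."""
    return [match.groupdict() for match in CITATION_RE.finditer(markdown)]


sample = (
    "Reduced activity was reported "
    '([Smith2024](#cite:Smith2024 "enzyme activity was reduced")).'
)
citations = find_citations(sample)
print(citations)
```

Each extracted `paper_id` and `quote` pair is what would be matched against the source PDF to look up the corresponding bboxes entry.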

Storage Layout

papers/{encoded_doi}/
  source.pdf              # Downloaded PDF
  markdown.md             # LLM-generated Markdown
  metadata.json           # PubMed metadata (title, authors, date, etc.)

assessments/{variant_id}/
  workflow.json            # Pipeline run metadata
  variant_details.json     # VariantValidator output
  query.json               # Query results (DOI list)
  aggregation.json         # Aggregated assessment with pre-resolved bboxes
  aggregation_raw.json     # Raw LLM conversation
  extractions/
    {encoded_doi}.json     # Per-paper extraction (quotes + commentary)
    {encoded_doi}_raw.json # Raw LLM conversation
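
The layout above maps naturally onto a few key-building helpers. The sketch below is illustrative only: `encode_doi`, `paper_key`, and `extraction_key` are hypothetical names, and the percent-encoding used for `{encoded_doi}` is an assumption, since this document does not specify how Flowa encodes DOIs.

```python
from urllib.parse import quote


def encode_doi(doi: str) -> str:
    """ASSUMPTION: percent-encode the DOI; Flowa's real encoding may differ."""
    return quote(doi, safe="")


def paper_key(doi: str, filename: str) -> str:
    """Key for a per-paper artifact such as source.pdf or markdown.md."""
    return f"papers/{encode_doi(doi)}/{filename}"


def extraction_key(variant_id: str, doi: str, raw: bool = False) -> str:
    """Key for a per-paper extraction, or its raw LLM conversation."""
    suffix = "_raw" if raw else ""
    return f"assessments/{variant_id}/extractions/{encode_doi(doi)}{suffix}.json"


print(paper_key("10.1038/s41586-020-2308-7", "source.pdf"))
print(extraction_key("VAR123", "10.1038/s41586-020-2308-7"))
```

Paper artifacts are keyed by DOI alone, so they can be shared across assessments, while extractions sit under a specific variant.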

Development

This repo is a polyglot monorepo: a Python pipeline under src/flowa/, TypeScript packages under packages/, and worked examples under examples/. Each piece has its own dependency closure, and each Python project (the library and examples/demo-gateway/) is an independent uv project. Running pytest from the repo root would walk into the sibling project and fail on its venv-specific imports, so always run pytest from the project that owns the tests, scoped to the local tests/ directory:

# Library tests
uv run pytest tests/

# Demo-gateway tests
cd examples/demo-gateway && uv run pytest tests/

The TypeScript packages and examples share one pnpm workspace, so type checking and tests each run as a single recursive invocation:

pnpm -r typecheck
pnpm -r test

Lint and format checks are unified under pre-commit; CI invokes the same hook so local and CI behaviour match:

uv run pre-commit run --all-files

Releasing

Bump [project].version in pyproject.toml, commit, then push a matching tag:

git tag flowapy-v0.1.0
git push origin flowapy-v0.1.0

The tag-driven workflow (.github/workflows/release-flowapy.yaml) builds the package and publishes to PyPI via OIDC trusted publishing. The pypi GitHub environment requires manual approval before the publish step runs.

Deployment

Local Development

Run the pipeline directly from a checkout (the Quickstart demo wraps this with a UI):

export FLOWA_STORAGE_BASE=file:///tmp/flowa
export FLOWA_CONVERT_MODEL__NAME=bedrock:au.anthropic.claude-sonnet-4-6
export FLOWA_EXTRACTION_MODEL__NAME=bedrock:au.anthropic.claude-opus-4-6
uv run flowa run --variant-id test --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

Docker

docker build --build-arg LLM_EXTRA=bedrock -t flowa .
docker run \
  -e FLOWA_STORAGE_BASE=s3://bucket \
  -e FLOWA_CONVERT_MODEL__NAME=bedrock:au.anthropic.claude-sonnet-4-6 \
  -e FLOWA_EXTRACTION_MODEL__NAME=bedrock:au.anthropic.claude-opus-4-6 \
  -e AWS_REGION=ap-southeast-2 \
  flowa run --variant-id VAR123 --gene GAA --hgvs-c "NM_000152.5:c.2238G>C" --source litvar

AWS Batch

Create a job definition with the flowa container image. A typical run processes up to 50 papers, with LLM calls for conversion, extraction, and aggregation, so set a generous timeout and retry count. The definition below allows two attempts of up to one hour (3600 seconds) each:

aws batch register-job-definition \
  --job-definition-name flowa-worker \
  --type container \
  --container-properties '{
    "image": "123456789.dkr.ecr.ap-southeast-2.amazonaws.com/flowa:latest",
    "resourceRequirements": [
      {"type": "VCPU", "value": "2"},
      {"type": "MEMORY", "value": "8192"}
    ],
    "environment": [
      {"name": "FLOWA_STORAGE_BASE", "value": "s3://flowa-data"},
      {"name": "FLOWA_CONVERT_MODEL__NAME", "value": "bedrock:au.anthropic.claude-sonnet-4-6"},
      {"name": "FLOWA_EXTRACTION_MODEL__NAME", "value": "bedrock:au.anthropic.claude-opus-4-6"}
    ]
  }' \
  --retry-strategy '{"attempts": 2}' \
  --timeout '{"attemptDurationSeconds": 3600}'

Submit a job:

aws batch submit-job \
  --job-name "flowa-VAR123" \
  --job-definition flowa-worker \
  --job-queue flowa-queue \
  --container-overrides '{
    "command": ["run", "--variant-id", "VAR123", "--gene", "GAA", "--hgvs-c", "NM_000152.5:c.2238G>C", "--source", "litvar"]
  }'
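
When submitting many variants, building the override JSON programmatically avoids hand-quoting HGVS strings (which contain `>`) inside shell arguments. The sketch below is an assumption-laden helper, not part of Flowa: `build_run_command` and `container_overrides` are hypothetical names, and the output is simply the JSON that would be passed to --container-overrides.

```python
import json


def build_run_command(
    variant_id: str, gene: str, hgvs_c: str, source: str = "litvar"
) -> list[str]:
    """Argument list for the container, mirroring the `flowa run` flags above."""
    return [
        "run",
        "--variant-id", variant_id,
        "--gene", gene,
        "--hgvs-c", hgvs_c,
        "--source", source,
    ]


def container_overrides(variant_id: str, gene: str, hgvs_c: str) -> str:
    """JSON string suitable for the --container-overrides option."""
    return json.dumps({"command": build_run_command(variant_id, gene, hgvs_c)})


overrides = container_overrides("VAR123", "GAA", "NM_000152.5:c.2238G>C")
print(overrides)
```

Because the command is passed as a list, each argument reaches the container intact without an extra layer of shell escaping.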
