README: modernize structure (logo, Quick start, MetaGraph Online) #635

Open
karasikov wants to merge 23 commits into master from mk/readme-modernization

@karasikov karasikov commented May 17, 2026

Member

Summary

Modernize the top-level README.md for the open-source repo: lead with a theme-aware logo and a one-command Quick start, surface what MetaGraph does before how to install it, push the long raw-metagraph recipes into collapsible sections, and advertise the hosted search service.

What changed

  • Header — centered, theme-aware SVG logo (light/dark via <picture> + prefers-color-scheme), an <h1> project title, and a consolidated badge row (platform, release, bioconda version + downloads, conda/docker/source install-nav, DOI, online docs).
  • Intro — short value proposition, a "search engine for sequencing data" framing, the two-component (de Bruijn graph + annotation matrix) explanation, and a Mermaid pipeline diagram.
  • Features — concise bulleted list (search public archives, index your own data, optional count/coordinate payloads, alignment, graph cleaning, differential assembly, Python API & HTTP server), with an "Under the hood" <details> for the data-structure internals.
  • MetaGraph Online — new section pointing to the hosted search engine (https://metagraph.ethz.ch) and to the prebuilt indexes on AWS Open Data.
  • Quick start — three numbered steps: 1. Install (fresh conda env + pip workflows wrapper), 2. Build an index (uses the bundled metagraph/tests/data/*.fa so a fresh checkout runs as-is; explains --primary and the graph.dbg / graph_small.dbg / graph.relax.row_diff_brwt.annodbg outputs), 3. Query. Includes a collapsible real-workload example with a hardware budget.
  • Minimal example — collapsible end-to-end CLI walkthrough (build → annotate → query → stats) on the small bundled dataset, with the actual query output and an explanation of the k-mer-match numbers.
  • More recipes — collapsible direct-CLI recipes (align / labeled align / coordinate-aware align / assemble / differential assembly / stats), a KMC build example, and common query flags.
  • More install options — Docker (with an "advanced usage" <details>) and a pointer to the install-from-source guide.
  • Contributing — dev notes (docker build, release process) kept inline in a <details> (no separate file).
  • Citation / License — Nature citation + BibTeX in a <details>; GPLv3 with links to AUTHORS/COPYRIGHT.
  • AUTHORS / COPYRIGHT — add contributors and bump the copyright year.
  • docs/quick_start.rst — add a note spelling out the rough RAM requirements of the two BRWT-clustering stages.

Net: 6 files, +486 / −225 lines (README ~342 lines).

Test plan

  • Render README on GitHub: light/dark logo, Mermaid diagram, badges, and all internal section anchors resolve.
  • Ran the Minimal example (build → annotate → query → stats) locally — output matches the README byte-for-byte (<kl_sample>:330, <zh_sample>:243, <tk_sample>:207).
  • Verified the Quick start output filenames (graph.dbg, graph_small.dbg, graph.relax.row_diff_brwt.annodbg) against the workflow's default config.
  • Verified every CLI flag/default cited in the recipes (--query-mode, --min-kmers-fraction-label 0.7, --labels-delimiter ":", --mmap, server_query --port 5555, assemble --unitigs/--diff-assembly-rules, …).
  • After merge, re-check the README links from master.

Base automatically changed from mk/workflows-resources to master May 17, 2026 23:36
@karasikov karasikov force-pushed the mk/readme-modernization branch from ab59afb to 3086e4b May 20, 2026 22:51
karasikov and others added 17 commits May 21, 2026 01:07
Re-introduces content trimmed from master as collapsible sections (design
choices, direct CLI, KMC build, manual Multi-BRWT with memory formulas,
common query flags, advanced Docker, contributing pointer, offline docs).

Adds a visible "Minimal example" section with a 4-command end-to-end demo
on the bundled transcripts_1000.fa, plus a sample query output showing
paralog matching. Aligns terminology with the project (de Bruijn graph,
annotation matrix, labels, ColumnCompressed/RowDiff<Multi-BRWT>, BOSS).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ures

- Add a Mermaid diagram (FASTA → build/annotate → graph + annotation →
  query/align → matches) as a visual anchor for the two-component model.
- Insert a "What you can do" use-case block before Features (search SRA,
  index a cohort, lossless source recovery).
- Split k-mer counts/coordinates into distinct Feature bullets; promote
  Python API & HTTP server to a Feature; move Custom alphabets into the
  Design choices collapsible.
- Add memory-mapped loading to Design choices.
- Gloss --primary inline; collapse the 4-stage pipeline list to a one-liner;
  add a "What next" subsection after Step 3 (align / Python / diff assembly
  / docs).
- Add a plain-English definition of "label" at the top of Minimal example.
- Drop the Manual Multi-BRWT recipe (with memory formulas); leave a link
  to the annotation-transform docs.
- Add a sysreqs + alphabet-picker line to Install (Linux/macOS x86-64/arm64,
  RAM scaling, bioconda binary is DNA-only).

Net: 334 -> 315 lines; first paint reframed from engineering inventory to
outcomes plus a working example.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-checked every flag, default, output name, and behavioral claim against
the code and docs. Corrections:

- Memory-mapped loading reframed as opt-in --mmap (was implied default);
  noted NVMe recommended, SSD slower.
- graph_small.dbg: "slower at access" not "slower-to-load" (BOSS SMALL state
  is smaller, slower-access).
- --primary rationale: "appropriate when read strand orientation is unknown"
  instead of "recommended for DNA data".
- Bioconda: ships DNA + Protein alphabets (metagraph symlinked to
  metagraph_DNA), not a single DNA binary; Docker image adds DNA5.
- MetaGraph Online datasets: list per resources.rst (RefSeq, UHGG, Tara
  Oceans, UniParc), not the previous SRA/ENA/DRA/UniProt guess.
- Pipeline chain now includes BRWT relaxation step (explains the .relax.
  prefix in the default annotation filename).

Simplifications:

- Merge "What you can do" + "Features" into one outcome-framed list
  (6 bullets, was 3 + 7).
- Trim Design choices ("Under the hood") to 4 sub-bullets; move Custom
  alphabets out (covered in Install).
- Move the real-workload build command (file list + hardware budget) into
  a collapsible.
- Replace "What next" 4-bullet subsection with a one-sentence post-script
  after Step 3.
- Tighten Install intro to a single sentence (Linux/macOS, no arch claims).

Net: -20 lines (315 -> 295).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Reword "search engine over 50+ petabases of ... data" -> "search engine
  over indexes built from 50+ petabases of ... sequencing data" in both
  places. The 50+ PB is the input to indexing, not the search target.
- Drop the lightbulb emoji from the Minimal example header (update the
  one cross-reference anchor accordingly).
- Use a file-cabinet icon (database-like) for "Index your own data".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Bring back the three install-method badges from master (install with conda / install with docker / install from source), with anchors updated to the current section IDs.

- Switch the Index your own data icon from file-cabinet to construction, matching the bullet's Build a k-mer index verb.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GitHub's slugger left the U+FE0F variation-selector from the wrench emoji as an invisible character in the generated anchor for '### 🛠️ Install from sources'. Drop the emoji from the header and point the badge at the clean #install-from-sources anchor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
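
The anchor problem described in the commit above can be reproduced with a rough approximation of heading slugging. This is a sketch, not GitHub's actual slugger: `approx_github_slug` is a hypothetical helper that assumes the slugger lowercases the heading, keeps letters, digits, combining marks, hyphens, underscores and spaces, and turns spaces into hyphens. Under that assumption the wrench emoji (U+1F6E0, a symbol) is dropped while the variation selector U+FE0F (a combining mark) survives.

```python
import unicodedata


def approx_github_slug(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (assumption, not the real slugger)."""
    kept = []
    for ch in heading.lower():
        category = unicodedata.category(ch)
        # Keep letters (L*), numbers (N*), marks (M*), plus '-', '_' and spaces.
        if category[0] in "LNM" or ch in "-_ ":
            kept.append(ch)
    return "".join(kept).replace(" ", "-")


# Heading with the wrench emoji followed by the U+FE0F variation selector.
with_emoji = approx_github_slug("\U0001F6E0\uFE0F Install from sources")
without_emoji = approx_github_slug("Install from sources")

print(ascii(with_emoji))     # '\ufe0f-install-from-sources'
print(ascii(without_emoji))  # 'install-from-sources'
```

The invisible leading character is why a badge link to `#install-from-sources` would not resolve until the emoji was removed from the header.
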
Switch from transcripts_1000.fa (1000 Ensembl headers with pipe-separated metadata) to metasub_fake_data_simple.fa (3 short sample-name labels: kl_sample / zh_sample / tk_sample). Query output is now human-readable at a glance.

Also drop the closing BRWT/RowDiff paragraph - operator-tier detail that doesn't belong in a minimal example.

Smoke-tested: output in README matches the actual command output exactly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Default query mode (labels) just lists matching labels without counts. Switching to --query-mode matches gives <label>:k-mer-count per match, which is far more informative and also matches what Quick Start step 3 already uses.

Updated explanation to highlight that zh_sample (243 k-mers) and tk_sample (207 k-mers) are fully contained in kl_sample (243/243, 207/207); kl_sample matches only itself because the shorter samples can't cover its 330 k-mers.

Smoke-tested: README output matches command output exactly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
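
The containment argument in the commit above can be illustrated with a toy k-mer count. This is a minimal sketch, not MetaGraph's implementation: the sequences, the label names, `k=3`, and the helper names `kmers` and `match_report` are all hypothetical, it counts distinct k-mers only, and it assumes that a label is reported when the fraction of matching query k-mers reaches a threshold (0.7 here, mirroring the `--min-kmers-fraction-label 0.7` default cited in the test plan).

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_report(query: str, labels: dict[str, str], k: int,
                 min_fraction: float = 0.7) -> dict[str, int]:
    """Count query k-mers found under each label; keep labels above the threshold."""
    q = kmers(query, k)
    if not q:
        return {}
    report = {}
    for label, seq in labels.items():
        hits = len(q & kmers(seq, k))
        if hits / len(q) >= min_fraction:
            report[label] = hits
    return report


# Toy data (hypothetical): short_sample is a substring of long_sample.
LABELS = {"long_sample": "ACGTACGGTCA", "short_sample": "GTACGG"}

short_hits = match_report(LABELS["short_sample"], LABELS, k=3)
long_hits = match_report(LABELS["long_sample"], LABELS, k=3)

print(short_hits)  # {'long_sample': 4, 'short_sample': 4}
print(long_hits)   # {'long_sample': 8}
```

The short query is fully covered by both labels (4 of 4 k-mers), while the long query matches only itself: the short label covers 4 of its 8 k-mers, which falls below the threshold.
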
…s feature; trim License badge and BRWT pointer

Issue #636 reported confusion about metagraph align vs metagraph query --align. Per Harun's clarification in the comments:

- metagraph align: sequence-to-graph alignment. With a coordinate-aware annotator, acts as a read mapper (walks are constrained to coordinate-consistent paths).

- metagraph query --align: aligns each query to the graph (ignoring annotations), then fetches labels for the highest-scoring walk. Use when exact k-mer matching is too strict.

Expanded the Direct CLI recipes to show all three modes with explanatory comments, and added a brief mention to the Quick Start step 3 closing line.

Other tweaks:

- Removed the License: GPLv3 badge.

- Removed the trailing 'manual Multi-BRWT clustering' pointer in More recipes.

- Added back the 🔤 Custom alphabets bullet to the visible Features list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Match master's Features-list coverage: give k-mer counts and k-mer coordinates their own bullets (was folded inline into Index your own data) and move Custom alphabets to last position.

Features list is now 9 bullets matching master's 'Main features' breakdown.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Combine the Linux and macOS support badges into a single platform badge.

Move docker build, Makefile, and release notes into a collapsed Developer notes section and remove the standalone CONTRIBUTING.md (restoring how master had it).

Trim the k-mer payload and differential-assembly Feature bullets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt Features punctuation, trim logo margins

Add a MetaGraph Online pointer to the AWS Open Data preconstructed indexes (download + offline query).

Use a globe icon for the public-archive search bullet (reworded to 'Search in public archives') and a cloud icon for the Python API & HTTP server bullet.

Unify the Features bullets to a single 'Title. Sentence.' punctuation style.

Trim transparent margins from the logo PNG (1138x569 -> 631x399).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After trimming the logo's transparent margins it rendered ~266px tall at width=420; reduce to 260 (~164px) so the header logo is proportionate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
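
The heights quoted in the commit above follow from scaling the trimmed logo (631x399) to the display width while preserving its aspect ratio. A small worked check, where `rendered_height` is a hypothetical helper name:

```python
def rendered_height(native_w: int, native_h: int, display_w: int) -> float:
    """Height of an image scaled to display_w with its aspect ratio preserved."""
    return display_w * native_h / native_w


# Trimmed logo is 631x399 (per the earlier commit).
height_at_420 = round(rendered_height(631, 399, 420))
height_at_260 = round(rendered_height(631, 399, 260))

print(height_at_420)  # 266
print(height_at_260)  # 164
```
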
…struction

Fold the hands-on Minimal example into a collapsed <details> (heading kept so the #minimal-example cross-link still resolves).

Relabel the build-section docs pointer as the 'Quick start guide' describing index construction, and note the git clone is only to fetch the example *.fa files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings back the contributor-facing design-choice from master (generic, extensible interfaces for adding custom representations/algorithms), closing the last completeness gap vs the old README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@adamant-pwn adamant-pwn left a comment

Member

Thanks for the PR! I've added a few comments.

Comment thread README.md Outdated
@@ -1,316 +1,342 @@
# Metagenome Graph Project
<p align="center">
<img src="metagraph/docs/source/images/metagraph_logo.png" alt="MetaGraph" width="260">

Member

Do we have a logo in vector format? I'd also suggest making a logo variant for dark backgrounds (so the main color would be white, or some other light color). Note that this logo is different from the one we use in the docs.

Member Author

added svg

Comment thread README.md Outdated
[![documentation](https://img.shields.io/badge/📖-online%20docs-blue.svg)](https://metagraph.ethz.ch/static/docs/index.html)

MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment.
**Scalable indexing and querying of annotated genome graphs — from a handful of genomes to petabase-scale sequence repositories.**

Member

"—", "…", "·" and heavy use of emojis are nowadays generally taken as a strong sign that a text is AI-generated. That's not necessarily bad, but some users may be put off if it is too prominent. Stupid as it is, I'd suggest asking Claude to "de-LLM" the text and keep it more professional (in particular in terms of the unusual characters used) for better perception.

I'm also a bit wary that LLM text is more "watery" than it should be, while the original one was reasonably concise and on-point, but that may be a matter of taste.

Member Author

I think emojis and dashes are not a big deal; they actually make the text easier to read. But watery text is not good. Where is it watery?

Comment thread README.md
## License
Metagraph is distributed under the GPLv3 License (see LICENSE).
Please find further information in the AUTHORS and COPYRIGHTS files.
MetaGraph is distributed under the GPLv3 License (see [LICENSE](LICENSE)). See also [AUTHORS](AUTHORS) and [COPYRIGHT](COPYRIGHT).

Member

Hmm, would be nice to update AUTHORS and COPYRIGHT files while we're at it.

Member Author

Updated

Comment thread README.md Outdated
conda create -n metagraph python
conda activate metagraph
conda install -c bioconda -c conda-forge metagraph
pip install --force-reinstall "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"

Member

This will force pip to check out all submodules of a heavy repo, which probably isn't needed for a lightweight wrapper around binaries that are installed from conda. It would be better to put this on PyPI or in a separate repo (one that wouldn't drag in all the dependencies).

Member Author

I think maintaining another repo or a PyPI package might add more overhead than installing this small lib from source like this.

Comment thread README.md Outdated
```

Then run the full pipeline as:
Other ways to use the index: [`metagraph align`](https://metagraph.ethz.ch/static/docs/sequence_search.html#sequence-to-graph-alignment) for sequence-to-graph alignment (acts as a read mapper when given a coordinate-aware annotator); `metagraph query --align` to find labels via alignment scoring instead of exact *k*-mer matching (useful for divergent or noisy queries); [`metagraph server_query`](https://metagraph.ethz.ch/static/docs/api.html) for Python/HTTP queries. The [Minimal example](#minimal-example) below walks through each step on a smaller dataset.

Member

Maybe make it an itemized list.

> metagraph align for sequence-to-graph alignment (acts as a read mapper when given a coordinate-aware annotator); metagraph query --align to find labels via alignment scoring instead of exact k-mer matching (useful for divergent or noisy queries);

It's unclear what the difference between the two is.

> metagraph server_query for Python/HTTP queries

It reads as if it sends queries, but it actually runs a server?

Comment thread README.md Outdated
**Build a docker container.** Run `docker build .`

Examples:
**Makefile.** The top-level `Makefile` conveniently wraps the common build / test invocations. Supported arguments:

Member

Is this still relevant? I thought we primarily use CMake in metagraph/. Is Makefile even covered by CI?

Member Author

Probably not needed in the README. But the Makefile is indeed used -- to build the Docker images.

changing the value passed with flag ``--subsample <INT>``. The 1M rows subsampled by default are usually enough
even for very large annotations. Increasing this value usually does not lead to any significantly better compression.

.. note::

Member

So... Do we know the current process to deploy changes to metagraph/docs/? It's not pulled automatically, right?

Member Author

Yes, it pulls and rebuilds the docs on redeployment

karasikov and others added 4 commits May 31, 2026 00:49
Address PR #635 review: remove decorative emoji and the em-dash/ellipsis/middot characters flagged as AI-style, keeping legitimate symbols (math/arrow/en-dash).

Itemize "Other ways to use the index" and clarify metagraph align vs query --align, and that server_query starts an HTTP server.

Replace the raster logo with a vector SVG (from the source PDF) plus a white dark-mode variant, served via a <picture> prefers-color-scheme swap; remove the now-unused PNG.

Drop the Makefile shortcuts from the developer notes (the Makefile is used only to build the Docker image).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add Oleksandr Kulkov, Marc Zimmermann, and Thomas Zhou to AUTHORS; bump the COPYRIGHT year to 2026 and add Oleksandr Kulkov to the copyright holders.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an H1 title "MetaGraph: The Metagenome Graph Project" under the logo.

Recolor the dark-mode logo variant from pure white to a soft light gray (#c9d1d9); pure white (~21:1 contrast) read as too aggressive on dark backgrounds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
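
The contrast figure in the commit above can be checked with the WCAG 2.x contrast-ratio formula. This is a sketch under one stated assumption: the commit does not say which background was measured, so pure black (#000000) is assumed here, which is the only background against which white reaches 21:1. The helper names are hypothetical.

```python
def _linear(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio, from 1:1 (identical) to 21:1 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


BLACK = (0, 0, 0)              # assumed background, not stated in the commit
WHITE = (255, 255, 255)        # original dark-mode logo color
SOFT_GRAY = (0xC9, 0xD1, 0xD9)  # #c9d1d9, the new dark-mode logo color

white_ratio = contrast_ratio(WHITE, BLACK)
gray_ratio = contrast_ratio(SOFT_GRAY, BLACK)
print(f"white: {white_ratio:.1f}:1, soft gray: {gray_ratio:.1f}:1")
```

Against the assumed black background the soft gray comes out well below white while remaining high-contrast; the exact ratio on a real dark theme depends on that theme's background color.
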
@karasikov karasikov requested a review from adamant-pwn May 30, 2026 22:59
- Correct the "Common query flags" comment: --labels-delimiter applies to
  --query-mode labels (default mode is matches), and show the flag with it.
- Drop --force-reinstall from the workflows pip install: the env is freshly
  created just above, so it only risks clobbering conda-managed deps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 6 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="README.md">

<violation number="1" location="README.md:182">
P2: Misstates `metagraph align -a` semantics; it requires a shared label across the path, not every k-mer in every reported label.</violation>
</file>

Comment thread README.md
Comment on lines +182 to +183
# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
# k-mer of the reported path lies in every reported label.

P2: Misstates metagraph align -a semantics; it requires a shared label across the path, not every k-mer in every reported label.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At README.md, line 182:

<comment>Misstates `metagraph align -a` semantics; it requires a shared label across the path, not every k-mer in every reported label.</comment>

<file context>
@@ -1,316 +1,342 @@
+metagraph align -v -i graph.dbg query.fa
 
-./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA
+# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
+# k-mer of the reported path lies in every reported label.
+metagraph align -v -i graph.dbg -a annotation.row_diff_brwt.annodbg reads.fq
</file context>
Suggested change
-# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
-# k-mer of the reported path lies in every reported label.
+# Labeled alignment: with -a, the walk is label-consistent, meaning at least one label
+# is shared by all nodes on the path.
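
The distinction the bot draws can be made concrete with set operations. This is an illustrative sketch only, not MetaGraph code: the function names and the toy path are hypothetical, and the "stricter reading" assumes that "every reported label" means every label seen anywhere along the path.

```python
def shared_labels(path: list[set[str]]) -> set[str]:
    """Labels carried by every node on the path (empty set if none)."""
    return set.intersection(*path) if path else set()


def is_label_consistent(path: list[set[str]]) -> bool:
    """Suggested wording: at least one label is shared by all nodes."""
    return bool(shared_labels(path))


def every_node_has_every_label(path: list[set[str]]) -> bool:
    """Stricter reading: each node carries every label seen along the path."""
    seen = set().union(*path)
    return all(node >= seen for node in path)


# Hypothetical path of three nodes, each with its own label set.
path = [{"A", "B"}, {"A"}, {"A", "C"}]

print(shared_labels(path))               # {'A'}
print(is_label_consistent(path))         # True
print(every_node_has_every_label(path))  # False
```

The toy path is label-consistent because all nodes share label `A`, yet it fails the stricter reading because labels `B` and `C` each appear on only one node.
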
