README: modernize structure (logo, Quick start, MetaGraph Online) #635
karasikov wants to merge 23 commits into
Re-introduces content trimmed from master as collapsible sections (design choices, direct CLI, KMC build, manual Multi-BRWT with memory formulas, common query flags, advanced Docker, contributing pointer, offline docs). Adds a visible "Minimal example" section with a 4-command end-to-end demo on the bundled transcripts_1000.fa, plus a sample query output showing paralog matching. Aligns terminology with the project (de Bruijn graph, annotation matrix, labels, ColumnCompressed/RowDiff<Multi-BRWT>, BOSS). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ures
- Add a Mermaid diagram (FASTA → build/annotate → graph + annotation → query/align → matches) as a visual anchor for the two-component model.
- Insert a "What you can do" use-case block before Features (search SRA, index a cohort, lossless source recovery).
- Split k-mer counts/coordinates into distinct Feature bullets; promote Python API & HTTP server to a Feature; move Custom alphabets into the Design choices collapsible.
- Add memory-mapped loading to Design choices.
- Gloss --primary inline; collapse the 4-stage pipeline list to a one-liner; add a "What next" subsection after Step 3 (align / Python / diff assembly / docs).
- Add a plain-English definition of "label" at the top of Minimal example.
- Drop the Manual Multi-BRWT recipe (with memory formulas); leave a link to the annotation-transform docs.
- Add a sysreqs + alphabet-picker line to Install (Linux/macOS x86-64/arm64, RAM scaling, bioconda binary is DNA-only).
Net: 334 -> 315 lines; first paint reframed from engineering inventory to outcomes plus a working example.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-checked every flag, default, output name, and behavioral claim against
the code and docs. Corrections:
- Memory-mapped loading reframed as opt-in --mmap (was implied default);
noted NVMe recommended, SSD slower.
- graph_small.dbg: "slower at access" not "slower-to-load" (BOSS SMALL state
is smaller, slower-access).
- --primary rationale: "appropriate when read strand orientation is unknown"
instead of "recommended for DNA data".
- Bioconda: ships DNA + Protein alphabets (metagraph symlinked to
metagraph_DNA), not a single DNA binary; Docker image adds DNA5.
- MetaGraph Online datasets: list per resources.rst (RefSeq, UHGG, Tara
Oceans, UniParc), not the previous SRA/ENA/DRA/UniProt guess.
- Pipeline chain now includes BRWT relaxation step (explains the .relax.
prefix in the default annotation filename).
Simplifications:
- Merge "What you can do" + "Features" into one outcome-framed list
(6 bullets, was 3 + 7).
- Trim Design choices ("Under the hood") to 4 sub-bullets; move Custom
alphabets out (covered in Install).
- Move the real-workload build command (file list + hardware budget) into
a collapsible.
- Replace "What next" 4-bullet subsection with a one-sentence post-script
after Step 3.
- Tighten Install intro to a single sentence (Linux/macOS, no arch claims).
Net: -20 lines (315 -> 295).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Reword "search engine over 50+ petabases of ... data" -> "search engine over indexes built from 50+ petabases of ... sequencing data" in both places. The 50+ PB is the input to indexing, not the search target.
- Drop the lightbulb emoji from the Minimal example header (update the one cross-reference anchor accordingly).
- Use a file-cabinet icon (database-like) for "Index your own data".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Bring back the three install-method badges from master (install with conda / install with docker / install from source), with anchors updated to the current section IDs.
- Switch the Index your own data icon from file-cabinet to construction, matching the bullet's Build a k-mer index verb.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GitHub's slugger left the U+FE0F variation-selector from the wrench emoji as an invisible character in the generated anchor for '### 🛠️ Install from sources'. Drop the emoji from the header and point the badge at the clean #install-from-sources anchor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
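The anchor problem described in this commit can be reproduced with a toy slug function. This is only a sketch under stated assumptions: `toy_slug` is a hypothetical stand-in, not GitHub's actual slugger, and it assumes the slugger drops symbol and punctuation characters by Unicode category. Under that assumption the wrench emoji (U+1F6E0, category `So`) is removed, while the variation selector U+FE0F (category `Mn`) survives as an invisible leading character.

```python
import unicodedata


def toy_slug(heading: str) -> str:
    """Hypothetical, simplified heading-to-anchor conversion (not GitHub's code)."""
    text = heading.lstrip("#").strip().lower()
    kept = [
        ch
        for ch in text
        # Drop punctuation (P*) and symbols (S*), but keep '-' and '_'.
        if ch in "-_" or not unicodedata.category(ch).startswith(("P", "S"))
    ]
    return "".join(kept).replace(" ", "-")


# Heading with the emoji: U+1F6E0 is dropped, U+FE0F stays in the anchor.
with_emoji = toy_slug("### \U0001F6E0\uFE0F Install from sources")
# Heading without the emoji: the clean anchor the badge now points at.
without_emoji = toy_slug("### Install from sources")

print(repr(with_emoji))     # the repr makes the hidden '\ufe0f' visible
print(repr(without_emoji))
```

Printing with `repr` is what exposes the leftover character; in a rendered page the two anchors look almost identical, which is why the broken link was easy to miss.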
Switch from transcripts_1000.fa (1000 Ensembl headers with pipe-separated metadata) to metasub_fake_data_simple.fa (3 short sample-name labels: kl_sample / zh_sample / tk_sample). Query output is now human-readable at a glance. Also drop the closing BRWT/RowDiff paragraph - operator-tier detail that doesn't belong in a minimal example. Smoke-tested: output in README matches the actual command output exactly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Default query mode (labels) just lists matching labels without counts. Switching to --query-mode matches gives <label>:k-mer-count per match, which is far more informative and also matches what Quick Start step 3 already uses. Updated explanation to highlight that zh_sample (243 k-mers) and tk_sample (207 k-mers) are fully contained in kl_sample (243/243, 207/207); kl_sample matches only itself because the shorter samples can't cover its 330 k-mers. Smoke-tested: README output matches command output exactly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
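The `<label>:k-mer-count` output described here can be illustrated with a toy model of k-mer matching. This is a sketch, not MetaGraph's implementation: the sequences, the label names, `k = 3`, and the `min_fraction` parameter are all made up for illustration. It shows why short sequences contained in a longer one match both themselves and the longer label, while the longer sequence only matches itself once a minimum matched fraction is required.

```python
def kmers(seq, k):
    """All k-mers of seq, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def query_matches(query, index, k, min_fraction=0.7):
    """Count, per label, how many query k-mers occur in that label's sequence.

    Labels are reported only if they cover at least min_fraction of the
    query's k-mers (a hypothetical threshold for this toy model).
    """
    query_kmers = kmers(query, k)
    result = {}
    for label, seq in index.items():
        label_kmers = set(kmers(seq, k))
        hits = sum(1 for km in query_kmers if km in label_kmers)
        if query_kmers and hits / len(query_kmers) >= min_fraction:
            result[label] = hits
    return result


# Toy index: both short sequences are substrings of the long one.
index = {
    "long": "ACGTACGGTCA",
    "short_1": "ACGTACG",
    "short_2": "CGGTCA",
}

for label, seq in index.items():
    matches = query_matches(seq, index, k=3)
    print(label, " ".join(f"<{name}>:{count}" for name, count in matches.items()))
```

In this toy setup, querying either short sequence reports the long label with a full count, while querying the long sequence reports only itself because neither short sequence covers enough of its k-mers.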
…s feature; trim License badge and BRWT pointer
Issue #636 reported confusion about metagraph align vs metagraph query --align. Per Harun's clarification in the comments:
- metagraph align: sequence-to-graph alignment. With a coordinate-aware annotator, acts as a read mapper (walks are constrained to coordinate-consistent paths).
- metagraph query --align: aligns each query to the graph (ignoring annotations), then fetches labels for the highest-scoring walk. Use when exact k-mer matching is too strict.
Expanded the Direct CLI recipes to show all three modes with explanatory comments, and added a brief mention to the Quick Start step 3 closing line.
Other tweaks:
- Removed the License: GPLv3 badge.
- Removed the trailing 'manual Multi-BRWT clustering' pointer in More recipes.
- Added back the 🔤 Custom alphabets bullet to the visible Features list.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Match master's Features-list coverage: give k-mer counts and k-mer coordinates their own bullets (was folded inline into Index your own data) and move Custom alphabets to last position. Features list is now 9 bullets matching master's 'Main features' breakdown. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Combine the Linux and macOS support badges into a single platform badge. Move docker build, Makefile, and release notes into a collapsed Developer notes section and remove the standalone CONTRIBUTING.md (restoring how master had it). Trim the k-mer payload and differential-assembly Feature bullets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt Features punctuation, trim logo margins Add a MetaGraph Online pointer to the AWS Open Data preconstructed indexes (download + offline query). Use a globe icon for the public-archive search bullet (reworded to 'Search in public archives') and a cloud icon for the Python API & HTTP server bullet. Unify the Features bullets to a single 'Title. Sentence.' punctuation style. Trim transparent margins from the logo PNG (1138x569 -> 631x399). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After trimming the logo's transparent margins it rendered ~266px tall at width=420; reduce to 260 (~164px) so the header logo is proportionate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
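The rendered heights quoted in this commit follow from the trimmed logo's aspect ratio (631x399, per the earlier commit). A minimal check, assuming the renderer scales the image proportionally and rounds to whole pixels; `rendered_height` is a helper name introduced only for this illustration:

```python
def rendered_height(width, native_width=631, native_height=399):
    """Height in pixels when an image is scaled proportionally to `width`."""
    return round(width * native_height / native_width)


print(rendered_height(420))  # height at the old width attribute
print(rendered_height(260))  # height at the new width attribute
```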
…struction Fold the hands-on Minimal example into a collapsed <details> (heading kept so the #minimal-example cross-link still resolves). Relabel the build-section docs pointer as the 'Quick start guide' describing index construction, and note the git clone is only to fetch the example *.fa files. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings back the contributor-facing design-choice from master (generic, extensible interfaces for adding custom representations/algorithms), closing the last completeness gap vs the old README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
adamant-pwn
left a comment
Thanks for the PR! I've added a few comments.
| @@ -1,316 +1,342 @@
| # Metagenome Graph Project
| <p align="center">
| <img src="metagraph/docs/source/images/metagraph_logo.png" alt="MetaGraph" width="260">
Do we have the logo in vector format? I'd also suggest making a logo variant for dark backgrounds (so the main color would be white or some other light color). Note that this logo differs from the one we use in the docs.
| [](https://metagraph.ethz.ch/static/docs/index.html)
|
| MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment.
| **Scalable indexing and querying of annotated genome graphs — from a handful of genomes to petabase-scale sequence repositories.**
Characters like "—", "…", and "·", together with heavy use of emojis, are these days generally taken as a strong sign that text is AI-generated. That's not necessarily bad, but some users may be put off if it is too prominent. Stupid as it is, I'd suggest asking Claude to "de-LLM" the text and keep it more professional (particularly regarding the unusual characters) for better perception.
I'm also a bit wary that the LLM text is more "watery" than it should be, while the original was reasonably concise and on point, but that may be a matter of taste.
Reply:
I think emojis and dashes are not a big deal, they actually make it easier to read. But watery text is not good. Where is it watery?
| ## License
| Metagraph is distributed under the GPLv3 License (see LICENSE).
| Please find further information in the AUTHORS and COPYRIGHTS files.
| MetaGraph is distributed under the GPLv3 License (see [LICENSE](LICENSE)). See also [AUTHORS](AUTHORS) and [COPYRIGHT](COPYRIGHT).
Hmm, would be nice to update AUTHORS and COPYRIGHT files while we're at it.
| conda create -n metagraph python
| conda activate metagraph
| conda install -c bioconda -c conda-forge metagraph
| pip install --force-reinstall "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"
This will force pip to check out all submodules of a heavy repo, which probably isn't needed for a lightweight wrapper around the binaries installed from conda. It would be better to put this on PyPI or in a separate repo (one that wouldn't drag in all the dependencies).
Reply:
I think maintaining another repo on PyPI might add a larger overhead than installing this small lib from source like this
| ```
|
| Then run the full pipeline as:
| Other ways to use the index: [`metagraph align`](https://metagraph.ethz.ch/static/docs/sequence_search.html#sequence-to-graph-alignment) for sequence-to-graph alignment (acts as a read mapper when given a coordinate-aware annotator); `metagraph query --align` to find labels via alignment scoring instead of exact *k*-mer matching (useful for divergent or noisy queries); [`metagraph server_query`](https://metagraph.ethz.ch/static/docs/api.html) for Python/HTTP queries. The [Minimal example](#minimal-example) below walks through each step on a smaller dataset.
Maybe make it an itemized list.
> `metagraph align` for sequence-to-graph alignment (acts as a read mapper when given a coordinate-aware annotator); `metagraph query --align` to find labels via alignment scoring instead of exact *k*-mer matching (useful for divergent or noisy queries);
It's unclear what the difference between the two is.
> `metagraph server_query` for Python/HTTP queries
It reads as if it sends queries, but it actually runs a server?
| **Build a docker container.** Run `docker build .`
|
| Examples:
| **Makefile.** The top-level `Makefile` conveniently wraps the common build / test invocations. Supported arguments:
Is this still relevant? I thought we primarily use CMake in metagraph/. Is Makefile even covered by CI?
Reply:
Probably not needed in the README. But the Makefile is indeed used -- to build the Docker images.
| changing the value passed with flag ``--subsample <INT>``. The 1M rows subsampled by default are usually enough
| even for very large annotations. Increasing this value usually does not lead to any significantly better compression.
|
| .. note::
So... Do we know the current process to deploy changes to metagraph/docs/? It's not pulled automatically, right?
Reply:
Yes, it pulls and rebuilds the docs on redeployment
Address PR #635 review: remove decorative emoji and the em-dash/ellipsis/middot characters flagged as AI-style, keeping legitimate symbols (math/arrow/en-dash). Itemize "Other ways to use the index" and clarify metagraph align vs query --align, and that server_query starts an HTTP server. Replace the raster logo with a vector SVG (from the source PDF) plus a white dark-mode variant, served via a <picture> prefers-color-scheme swap; remove the now-unused PNG. Drop the Makefile shortcuts from the developer notes (the Makefile is used only to build the Docker image). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add Oleksandr Kulkov, Marc Zimmermann, and Thomas Zhou to AUTHORS; bump the COPYRIGHT year to 2026 and add Oleksandr Kulkov to the copyright holders. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an H1 title "MetaGraph: The Metagenome Graph Project" under the logo. Recolor the dark-mode logo variant from pure white to a soft light gray (#c9d1d9); pure white (~21:1 contrast) read as too aggressive on dark backgrounds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
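The "~21:1" figure for pure white matches the WCAG contrast-ratio formula when the background is pure black. The sketch below applies that formula; treating the dark background as `#000000` is an assumption made here for illustration (the actual page background may be slightly lighter), and the helper names are hypothetical.

```python
def hex_to_rgb(code):
    """'#c9d1d9' -> (201, 209, 217)."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio between two colours, always >= 1."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


black = hex_to_rgb("#000000")  # assumed background for this sketch
white_ratio = contrast_ratio(hex_to_rgb("#ffffff"), black)
gray_ratio = contrast_ratio(hex_to_rgb("#c9d1d9"), black)
print(f"white: {white_ratio:.1f}:1, soft gray: {gray_ratio:.1f}:1")
```

Under this assumption the soft gray still has a high contrast ratio, just well below the maximum that pure white produces.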
- Correct the "Common query flags" comment: --labels-delimiter applies to --query-mode labels (default mode is matches), and show the flag with it.
- Drop --force-reinstall from the workflows pip install: the env is freshly created just above, so it only risks clobbering conda-managed deps.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 issue found across 6 files
| # Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
| # k-mer of the reported path lies in every reported label.
P2 (README.md, line 182): Misstates `metagraph align -a` semantics; it requires a shared label across the path, not every k-mer in every reported label.
<file context>
@@ -1,316 +1,342 @@
+metagraph align -v -i graph.dbg query.fa
-./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA
+# Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
+# k-mer of the reported path lies in every reported label.
+metagraph align -v -i graph.dbg -a annotation.row_diff_brwt.annodbg reads.fq
</file context>
| # Labeled alignment: with -a, the walk is label-trace-consistent, i.e. every
| # k-mer of the reported path lies in every reported label.
| # Labeled alignment: with -a, the walk is label-consistent, meaning at least one label
| # is shared by all nodes on the path.
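The corrected wording ("at least one label is shared by all nodes on the path") amounts to a non-empty intersection of per-node label sets. The following is a toy illustration of that condition only, not MetaGraph's aligner; the function names and the example path are hypothetical.

```python
def shared_labels(path_labels):
    """Labels carried by every node on the path (empty set if none)."""
    label_sets = [set(labels) for labels in path_labels]
    if not label_sets:
        return set()
    return set.intersection(*label_sets)


def is_label_consistent(path_labels):
    """True if at least one label is shared by all nodes on the path."""
    return bool(shared_labels(path_labels))


# Each entry is the label set of one node along a candidate walk.
consistent_path = [
    {"kl_sample", "zh_sample"},
    {"kl_sample"},
    {"kl_sample", "tk_sample"},
]
inconsistent_path = [{"zh_sample"}, {"tk_sample"}]

print(sorted(shared_labels(consistent_path)))
print(is_label_consistent(inconsistent_path))
```

Note that nodes on a consistent path may also carry labels that are not shared by the whole path; only the intersection has to be non-empty.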
Summary
Modernize the top-level
README.mdfor the open-source repo: lead with a theme-aware logo and a one-command Quick start, surface what MetaGraph does before how to install it, push the long raw-metagraphrecipes into collapsible sections, and advertise the hosted search service.What changed
<picture>+prefers-color-scheme), an<h1>project title, and a consolidated badge row (platform, release, bioconda version + downloads, conda/docker/source install-nav, DOI, online docs).<details>for the data-structure internals.pipworkflows wrapper), 2. Build an index (uses the bundledmetagraph/tests/data/*.faso a fresh checkout runs as-is; explains--primaryand thegraph.dbg/graph_small.dbg/graph.relax.row_diff_brwt.annodbgoutputs), 3. Query. Includes a collapsible real-workload example with a hardware budget.<details>) and a pointer to the install-from-source guide.<details>(no separate file).<details>; GPLv3 with links to AUTHORS/COPYRIGHT.Net: 6 files, +486 / −225 lines (README ~342 lines).
Test plan
build → annotate → query → stats) locally — output matches the README byte-for-byte (<kl_sample>:330,<zh_sample>:243,<tk_sample>:207).graph.dbg,graph_small.dbg,graph.relax.row_diff_brwt.annodbg) against the workflow's default config.--query-mode,--min-kmers-fraction-label0.7,--labels-delimiter":",--mmap,server_query --port5555,assemble --unitigs/--diff-assembly-rules, …).master.