Feature/qualitative analysis notebook #64
Conversation
laurejt left a comment
I was unable to run the notebook because there is something wrong with the eval data: the tr_ids do not appear to match the machine translations on the drive (perhaps the eval data was generated from the wrong files?).
Once the data issue is fixed, please update the notebook so that a variable pointing to the phase-1 data directory can be set at the top. All subsequent paths can then assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).
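For example, something like this at the top of the notebook (a minimal sketch; the example path is illustrative, and the `notion-sents` file name follows the layout used later in this PR):

```python
import pathlib

# Point this at a local copy of the phase-1 data
# (project drive or TigerData mount).
DATA_DIR = pathlib.Path("/path/to/phase-1-data")

# All subsequent paths assume the canonical layout, e.g.:
sents_path = DATA_DIR / "notion-sents" / "mt-sents-google_tllm.jsonl"
```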
Also, to make this notebook runnable: update pyproject.toml so that it includes the new package dependencies, and tell ruff to ignore files in the notebooks directory (we did something similar for remarx).
- Add DATA_DIR variable pointing to Phase 1 data directory
- Update all data paths to canonical Phase 1 structure
- Add scipy and pandas to pyproject.toml dependencies
- Exclude notebook from ruff linting
The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).
@laurejt Thanks for pointing this out. I found that the mismatch comes from my re-running the full pipeline locally (generating new translations and evaluations together) but only uploading the new eval data to Google Drive, not the updated translations. The local copies on my computer match each other. I can push the updated translation files to Google Drive to resolve this, unless there's a reason to keep the current versions there.
Why did you rerun all of the machine translations for the paragraph data? We should not be replacing what we originally created, especially when there weren't any issues with it. As a reminder, running Google TLLM costs money, so we shouldn't be rerunning this when it's unnecessary.
@tanhaow The notebook mentions there being machine translations in different languages than the professional reference translation. This should never be the case. Can you share specific examples where this occurs? This sounds like there might be a bug in the evaluation script, or some of the models may be failing to translate entirely.
laurejt left a comment
This notebook does not address #61 which is to qualitatively evaluate the quality of the metrics with respect to our music domain translations. However, this notebook can be revised to meet the initial specifications for #17. This will largely involve trimming down the notebook to section 5, but consider any additional features that would be valuable for completing #61 (the issue has been updated with a more detailed specification).
Before the next review, make sure to update the PR's description to match the shift in the notebook's scope / purpose.
| "marimo>=0.22.0", | ||
| "polars>=0.20.4", | ||
| "numpy", | ||
| "ipython", # Required by transformers for trainer functionality | ||
| "transformers[torch, sentencepiece, tiktoken]", | ||
| "google-cloud-translate", | ||
| "google-auth", | ||
| "orjsonl", | ||
| "ftfy", | ||
| "evaluate", | ||
| "datasets", | ||
| "sacrebleu", | ||
| "unbabel-comet", | ||
| "altair>=5", | ||
| "pandas", | ||
| "scipy", | ||
| "nbformat>=5.10.4", | ||
| "nbconvert>=7.17.0", | ||
| "playwright>=1.58.0", | ||
| ] |
Are these specific versions all the way to patch levels needed? How are these dependencies being added to the pyproject.toml?
If the versioning logic is needed, please document the reasons.
This question still needs to be addressed.
```toml
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
```
The comment suggests this is a general rule for all notebooks, so make that the case.
```diff
- extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
+ extend-exclude = [ "notebooks/*.py" ]
```
```toml
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]
```
Given that notebooks are ignored by ruff, why is this needed?
This still stands: if ruff is ignoring these files, then there is no need for this rule.
```python
import pathlib

import altair as alt
import pandas as pd
```
Why are you using pandas at all? Altair fully supports polars and does not require pandas.
https://pola.rs/posts/lightweight_plotting/
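For instance, a minimal sketch (this assumes Altair >= 5, which accepts polars DataFrames directly; the column names are made up for illustration):

```python
import altair as alt
import polars as pl

scores = pl.DataFrame(
    {"chrf": [52.1, 48.3, 61.0], "comet": [0.78, 0.74, 0.85]}
)

# Altair consumes the polars DataFrame as-is; no pandas conversion needed.
chart = alt.Chart(scores).mark_point().encode(x="chrf", y="comet")
```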
```python
def _(DATA_DIR, pl):
    # Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
    sents_meta = pl.read_ndjson(
        DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
```
The prodigy input is the wrong file to use for exploring our sentence-level machine translations since it is a small subset of the available sentence translations and drops backtranslations entirely.
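Something like the following would cover the full corpus instead (a sketch based on the `notion-sents/mt-sents-*.jsonl` layout that appears elsewhere in this PR; the model list is illustrative):

```python
# Concatenate the per-model sentence translation files rather than
# loading the prodigy annotation subset.
sents_meta = pl.concat(
    [
        pl.read_ndjson(DATA_DIR / f"notion-sents/mt-sents-{m}.jsonl")
        for m in ["google_tllm", "hymt", "gemma"]
    ]
)
```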
This still needs to be corrected.
- Replace quantitative analysis (sections 1-4) with a qualitative translation browser, aligning the notebook with issue #17
- Split browser into two sections: Translations and Backtranslations
- Add DATA_DIR variable at top for configurable data path
- Broaden ruff exclude to cover entire notebooks/ directory
@laurejt Hi Laure! The data browser is ready for your review. I’m using it to analyze the data and draft the Google Doc. I’ll send you the Doc too as soon as it’s ready.
laurejt left a comment
This is looking pretty good, but it will be difficult to use this for analysis until the full texts are displayed. Additionally, this notebook does not provide browsing of our full sentence-level machine translation corpus, which should cover all of the Notion concepts (not just the subset used in the annotation task) and include the sentence backtranslations.
Requested Changes:
- Rename the notebook file
- Remove the descriptions of the metrics at the beginning
- Use the full sentence translation corpus not the subset used for the annotation task
- Update the dataframe browsers so that the entire text of a field may be viewed. Marimo's built-in dataframe viewer supports this via hovering, but the way this notebook currently displays dataframes does not.
- Update `pyproject.toml` to address my comments
```python
def _(DATA_DIR, pl):
    # Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
    sents_meta = pl.read_ndjson(
        DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
```
This still needs to be corrected.
```toml
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]
```
This still stands: if ruff is ignoring these files, then there is no need for this rule.
| "marimo>=0.22.0", | ||
| "polars>=0.20.4", | ||
| "numpy", | ||
| "ipython", # Required by transformers for trainer functionality | ||
| "transformers[torch, sentencepiece, tiktoken]", | ||
| "google-cloud-translate", | ||
| "google-auth", | ||
| "orjsonl", | ||
| "ftfy", | ||
| "evaluate", | ||
| "datasets", | ||
| "sacrebleu", | ||
| "unbabel-comet", | ||
| "altair>=5", | ||
| "pandas", | ||
| "scipy", | ||
| "nbformat>=5.10.4", | ||
| "nbconvert>=7.17.0", | ||
| "playwright>=1.58.0", | ||
| ] |
This question still needs to be addressed.
Please rename this file to what it represents. Perhaps mt_browser.py?
```
but for non-literal, domain-specific translations it mostly measures surface similarity
to the reference rather than actual quality. The near-zero correlation with CometKiwi
on paragraphs (r ≈ −0.05) is a red flag.

## 2. Backtranslations
```
This is not a good way to display the dataframe since there is no way to read the entire text within a given column.
```python
n_sel = mo.ui.slider(5, 50, step=5, value=10, label="Show N")


def _(mo):
    mo.md("""
    ## 1. Translations
```
This is not a good way to display the dataframe since there is no way to read the entire text within a given column.
```
**Metrics:** chrF measures character n-gram overlap with the reference (unreliable for
cross-script directions such as `en→zh`). COMET and CometKiwi are neural metrics trained
on human judgements; CometKiwi is reference-free and does not penalise paraphrastic output.
```
These do not sufficiently describe the algorithms. Do not revise; simply delete.
```toml
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/" ]
```
```diff
- extend-exclude = [ "notebooks/" ]
+ extend-exclude = [ "notebooks/*.py" ]
```
@tanhaow When it comes to fixing the display issue, just simplify it. Don't use marimo ui components; simply display the dataframes directly. There's also no reason to combine the viewing space for sentences and paragraphs; just display a dataframe for each. While this approach only shows the dataframes in edit mode, that's fine for our current needs.
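In marimo's file format that could be as simple as ending the cell with the dataframe expression. A minimal sketch (the file name and variable names here are illustrative):

```python
import marimo

app = marimo.App()


@app.cell
def _(DATA_DIR, pl):
    sent_translations = pl.read_ndjson(
        DATA_DIR / "notion-sents/mt-sents-google_tllm.jsonl"
    )
    # The last expression in the cell is what marimo renders,
    # using its built-in dataframe viewer.
    sent_translations
    return
```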
- Rename to mt_browser.py
- Use full sentence corpus (mt-sents-*.jsonl, 854 rows, 4 models including madlad) instead of annotation subset (notion-concept-tasks.jsonl, 300 rows)
- Add sentence backtranslations section (en→* directions)
- Replace mo.ui widgets with native marimo dataframe viewer for full-text hover
- Remove metric descriptions from notebook header
- Split into 4 sections: sentence translations, sentence backtranslations, paragraph translations, paragraph backtranslations
altair, pandas, scipy, nbformat, nbconvert, and playwright were added for the previous notebook version and are no longer used.
- Change extend-exclude to notebooks/*.py (more precise)
- Remove per-file-ignores section for notebooks (redundant when excluded)
laurejt left a comment
This is looking really good. There's just one minor change: skip the "madlad" sentence corpus, since that model was dropped from our evaluation (i.e., replaced by TranslateGemma).
```python
_sents_meta = pl.concat(
    [
        pl.read_ndjson(DATA_DIR / f"notion-sents/mt-sents-{m}.jsonl")
        for m in ["google_tllm", "hymt", "gemma", "madlad"]
```
Drop madlad since it was dropped from our working set of models.
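i.e., something like:

```diff
-        for m in ["google_tllm", "hymt", "gemma", "madlad"]
+        for m in ["google_tllm", "hymt", "gemma"]
```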
madlad was replaced by TranslateGemma in our working set of models.
Associated Issue(s): resolves #17
Changes in this PR
A marimo notebook (`notebooks/mt_browser.py`) for browsing Phase 1 machine translations side-by-side with source text, reference translation, and evaluation scores (chrF, COMET, CometKiwi).

Four sections:

Uses the full sentence corpus (`mt-sents-*.jsonl`, 854 pairs, 4 models including madlad). Displays dataframes using marimo's native viewer, which supports hovering to view full text.

Notes

Set `DATA_DIR` at the top of the notebook to the phase-1 data directory (project drive or TigerData mount). All paths are resolved relative to that variable.

Reviewer Checklist