Skip to content

Feature/qualitative analysis notebook#64

Merged
tanhaow merged 11 commits intodevelopfrom
feature/qualitative-analysis-notebook
Apr 17, 2026
Merged

Feature/qualitative analysis notebook#64
tanhaow merged 11 commits intodevelopfrom
feature/qualitative-analysis-notebook

Conversation

@tanhaow
Copy link
Copy Markdown
Contributor

@tanhaow tanhaow commented Apr 6, 2026

Associated Issue(s): resolves #17

Changes in this PR

A marimo notebook (notebooks/mt_browser.py) for browsing Phase 1 machine translations side-by-side with source text, reference translation, and evaluation scores (chrF, COMET, CometKiwi).

Four sections:

  1. Sentence translations
  2. Sentence backtranslations
  3. Paragraph translations
  4. Paragraph backtranslations

Uses the full sentence corpus (mt-sents-*.jsonl, 854 pairs, 4 models including madlad). Displays dataframes using marimo's native viewer, which supports hovering to view full text.

Notes

Set DATA_DIR at the top of the notebook to the phase-1 data directory (project drive or TigerData mount). All paths are resolved relative to that variable.

Reviewer Checklist

  • Review the notebook

@tanhaow tanhaow marked this pull request as draft April 6, 2026 13:34
@tanhaow tanhaow requested a review from laurejt April 7, 2026 14:04
Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to run the notebook because there is something wrong with the eval data. They do not appear to have matching tr_id with the machine translations on the drive (perhaps the eval data was generated on the wrong files?)

Once the data issue is fixed, please update the notebook so at the top a variable to the phase-1 data directory can be provided. Then all subsequent paths can assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).

Also, to make this notebook runnable: update the pyproject.toml so that it includes the new package dependencies and tell ruff to ignore files in the notebooks directory (we did something similar for remarx)

- Add DATA_DIR variable pointing to Phase 1 data directory
- Update all data paths to canonical Phase 1 structure
- Add scipy and pandas to pyproject.toml dependencies
- Exclude notebook from ruff linting
@laurejt
Copy link
Copy Markdown
Contributor

laurejt commented Apr 8, 2026

I was unable to run the notebook because there is something wrong with the eval data. They do not appear to have matching tr_id with the machine translations on the drive (perhaps the eval data was generated on the wrong files?)

Once the data issue is fixed, please update the notebook so at the top a variable to the phase-1 data directory can be provided. Then all subsequent paths can assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).

Also, to make this notebook runnable: update the pyproject.toml so that it includes the new package dependencies and tell ruff to ignore files in the notebooks directory (we did something similar for remarx)

The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).

@tanhaow
Copy link
Copy Markdown
Contributor Author

tanhaow commented Apr 8, 2026

I was unable to run the notebook because there is something wrong with the eval data. They do not appear to have matching tr_id with the machine translations on the drive (perhaps the eval data was generated on the wrong files?)
Once the data issue is fixed, please update the notebook so at the top a variable to the phase-1 data directory can be provided. Then all subsequent paths can assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).
Also, to make this notebook runnable: update the pyproject.toml so that it includes the new package dependencies and tell ruff to ignore files in the notebooks directory (we did something similar for remarx)

The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).

@laurejt Thanks for pointing this out. I found the mismatch is caused by that I re-ran the full pipeline locally (generating new translations and evaluations together), but only uploaded the new eval data to Google Drive, not the updated translations. The local copies on my computer match each other. I can push the updated translation files to Google Drive to resolve this, unless there's a reason to keep the current versions there.

@laurejt
Copy link
Copy Markdown
Contributor

laurejt commented Apr 8, 2026

Why did you rerun all of the machine translations for the paragraph data? We should not be replacing what we originally created, especially if there wasn't any issues with them. As a reminder, running Google TLLM costs money, so we shouldn't be rerunning this when it's unnecessary.

@tanhaow tanhaow marked this pull request as ready for review April 13, 2026 18:37
@tanhaow tanhaow requested a review from laurejt April 13, 2026 18:38
@laurejt
Copy link
Copy Markdown
Contributor

laurejt commented Apr 13, 2026

@tanhaow The notebook mentions there being machine translations in different languages than the professional reference translation. This should never be the case. Can you share specific examples where this occurs? This sounds like their might be a bug with the evaluation script or some of the models are failing to translate entirely.

Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This notebook does not address #61 which is to qualitatively evaluate the quality of the metrics with respect to our music domain translations. However, this notebook can be revised to meet the initial specifications for #17. This will largely involve trimming down the notebook to section 5, but consider any additional features that would be valuable for completing #61 (the issue has been updated with a more detailed specification).

Before the next review, make sure to update the PR's description to match the shift in the notebook's scope / purpose.

Comment thread pyproject.toml
Comment on lines +26 to 45
"marimo>=0.22.0",
"polars>=0.20.4",
"numpy",
"ipython", # Required by transformers for trainer functionality
"transformers[torch, sentencepiece, tiktoken]",
"google-cloud-translate",
"google-auth",
"orjsonl",
"ftfy",
"evaluate",
"datasets",
"sacrebleu",
"unbabel-comet",
"altair>=5",
"pandas",
"scipy",
"nbformat>=5.10.4",
"nbconvert>=7.17.0",
"playwright>=1.58.0",
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these specific versions all the way to patch levels needed? How are these dependencies being added to the pyproject.toml?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the versioning logic is needed, please document the reasons.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question still needs to be addressed.

Comment thread pyproject.toml Outdated
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment suggests this is a general rule for all notebooks, so make that the case.

Suggested change
extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
extend-exclude = [ "notebooks/*.py" ]

Comment thread pyproject.toml Outdated
Comment on lines +84 to +87
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that notebooks are ignored by ruff, why is this needed?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still stands, if ruff is ignoring these files then there is no need for this rule.

Comment thread notebooks/mt_eval_analysis.py Outdated
import pathlib

import altair as alt
import pandas as pd
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you using pandas at all? Altair fully supports polars and does not require pandas.
https://pola.rs/posts/lightweight_plotting/

Comment thread notebooks/mt_eval_analysis.py Outdated
def _(DATA_DIR, pl):
# Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
sents_meta = pl.read_ndjson(
DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prodigy input is the wrong file to use for exploring our sentence-level machine translations since it is a small subset of the available sentence translations and drops backtranslations entirely.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be corrected.

Comment thread notebooks/mt_eval_analysis.py Outdated
- Replace quantitative analysis (sections 1-4) with a qualitative
  translation browser, aligning the notebook with issue #17
- Split browser into two sections: Translations and Backtranslations
- Add DATA_DIR variable at top for configurable data path
- Broaden ruff exclude to cover entire notebooks/ directory
@tanhaow tanhaow requested a review from laurejt April 16, 2026 18:27
@tanhaow
Copy link
Copy Markdown
Contributor Author

tanhaow commented Apr 16, 2026

@laurejt Hi Laure! The data browser is ready for your review. I’m using it to analyze the data and draft the Google Doc. I’ll send you the Doc too as soon as it’s ready.

Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good, but it will be difficult to use this for analysis until the full texts are displayed. Additionally this notebook does not provide browsing of our full sentence-level machine translation corpus which should cover all of the Notion concepts, not just the subset used in the annotation task, and include the sentence backtranslations.

Requested Changes:

  • Rename the notebook file
  • Remove the descriptions of the metrics at the beginning
  • Use the full sentence translation corpus not the subset used for the annotation task
  • Update the dataframe browsers so that the entire text of a field may be viewed. The marimo's built-in viewer of dataframes supports this via hovering but this is not the case with the way this notebook is displaying.
  • Update pyproject.toml to address my comments

Comment thread notebooks/mt_eval_analysis.py Outdated
def _(DATA_DIR, pl):
# Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
sents_meta = pl.read_ndjson(
DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs to be corrected.

Comment thread pyproject.toml Outdated
Comment on lines +84 to +87
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still stands, if ruff is ignoring these files then there is no need for this rule.

Comment thread pyproject.toml
Comment on lines +26 to 45
"marimo>=0.22.0",
"polars>=0.20.4",
"numpy",
"ipython", # Required by transformers for trainer functionality
"transformers[torch, sentencepiece, tiktoken]",
"google-cloud-translate",
"google-auth",
"orjsonl",
"ftfy",
"evaluate",
"datasets",
"sacrebleu",
"unbabel-comet",
"altair>=5",
"pandas",
"scipy",
"nbformat>=5.10.4",
"nbconvert>=7.17.0",
"playwright>=1.58.0",
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question still needs to be addressed.

Comment thread notebooks/mt_eval_analysis.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename this file to what it represents. Perhaps mt_browser.py?

Comment thread notebooks/mt_eval_analysis.py Outdated
but for non-literal, domain-specific translations it mostly measures surface similarity
to the reference rather than actual quality. The near-zero correlation with CometKiwi
on paragraphs (r ≈ −0.05) is a red flag.
## 2. Backtranslations
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good way to display the dataframe since there is no way to read the entire text within a given column.

Comment thread notebooks/mt_eval_analysis.py Outdated
n_sel = mo.ui.slider(5, 50, step=5, value=10, label="Show N")
def _(mo):
mo.md("""
## 1. Translations
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good way to display the dataframe since there is no way to read the entire text within a given column.

Comment thread notebooks/mt_eval_analysis.py Outdated
Comment on lines +23 to +25
**Metrics:** chrF measures character n-gram overlap with the reference (unreliable for
cross-script directions such as `en→zh`). COMET and CometKiwi are neural metrics trained
on human judgements; CometKiwi is reference-free and does not penalise paraphrastic output.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are do not sufficiently describe the algorithms. Do not revise, simply delete.

Comment thread pyproject.toml Outdated
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/" ]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extend-exclude = [ "notebooks/" ]
extend-exclude = [ "notebooks/*.py" ]

Comment thread pyproject.toml
@laurejt
Copy link
Copy Markdown
Contributor

laurejt commented Apr 16, 2026

@tanhaow When it comes to fixing the displays issue, just simplify it. Don't use marimo ui components; simply display the dataframes directly. There's also no reason to combine the viewing space for sentences and paragraphs; just display a dataframe for each.

While this approach only shows the dataframes in edit mode, that's fine for our current needs.

tanhaow added 3 commits April 17, 2026 09:40
- Rename to mt_browser.py
- Use full sentence corpus (mt-sents-*.jsonl, 854 rows, 4 models including madlad)
  instead of annotation subset (notion-concept-tasks.jsonl, 300 rows)
- Add sentence backtranslations section (en→* directions)
- Replace mo.ui widgets with native marimo dataframe viewer for full-text hover
- Remove metric descriptions from notebook header
- Split into 4 sections: sentence translations, sentence backtranslations,
  paragraph translations, paragraph backtranslations
altair, pandas, scipy, nbformat, nbconvert, and playwright were added
for the previous notebook version and are no longer used.
- Change extend-exclude to notebooks/*.py (more precise)
- Remove per-file-ignores section for notebooks (redundant when excluded)
@tanhaow tanhaow requested a review from laurejt April 17, 2026 15:40
Copy link
Copy Markdown
Contributor

@laurejt laurejt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good. There's just one minor change: skipping the "madlad" sentence corpus, since that model was dropped (i.e., replaced by TranslateGemma) from our evaluation.

Comment thread notebooks/mt_browser.py Outdated
_sents_meta = pl.concat(
[
pl.read_ndjson(DATA_DIR / f"notion-sents/mt-sents-{m}.jsonl")
for m in ["google_tllm", "hymt", "gemma", "madlad"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop madlad since it was dropped from our working set of models.

madlad was replaced by TranslateGemma in our working set of models.
@tanhaow tanhaow merged commit 5eeb105 into develop Apr 17, 2026
1 check passed
@tanhaow tanhaow deleted the feature/qualitative-analysis-notebook branch April 17, 2026 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants