Feature/qualitative analysis notebook #64
Conversation
laurejt left a comment
I was unable to run the notebook because there is something wrong with the eval data: the tr_ids do not appear to match the machine translations on the drive (perhaps the eval data was generated from the wrong files?).
Once the data issue is fixed, please update the notebook so that a variable pointing to the phase-1 data directory can be set at the top. All subsequent paths can then assume the canonical organization (and names) dictated by TigerData and the project drive (which will likely become the staler copy).
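For example, something like this at the top of the notebook (a minimal sketch; the example path is illustrative, and the `notion-sents` file name follows the layout used later in this PR):

```python
import pathlib

# Point this at a local copy of the phase-1 data
# (project drive or TigerData mount).
DATA_DIR = pathlib.Path("/path/to/phase-1-data")

# All subsequent paths assume the canonical layout, e.g.:
sents_path = DATA_DIR / "notion-sents" / "mt-sents-google_tllm.jsonl"
```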
Also, to make this notebook runnable: update pyproject.toml so that it includes the new package dependencies, and tell ruff to ignore files in the notebooks directory (we did something similar for remarx).
- Add DATA_DIR variable pointing to Phase 1 data directory
- Update all data paths to canonical Phase 1 structure
- Add scipy and pandas to pyproject.toml dependencies
- Exclude notebook from ruff linting
The tr_ids do not match for the paragraph data, which suggests that the wrong data was evaluated (i.e., not the translation files available on Google Drive / TigerData).
@laurejt Thanks for pointing this out. I found that the mismatch comes from my re-running the full pipeline locally (generating new translations and evaluations together) but only uploading the new eval data to Google Drive, not the updated translations. The local copies on my computer match each other. I can push the updated translation files to Google Drive to resolve this, unless there's a reason to keep the current versions there.
Why did you rerun all of the machine translations for the paragraph data? We should not be replacing what we originally created, especially when there weren't any issues with it. As a reminder, running Google TLLM costs money, so we shouldn't be rerunning this when it's unnecessary.
@tanhaow The notebook mentions there being machine translations in different languages than the professional reference translation. This should never be the case. Can you share specific examples where this occurs? This sounds like there might be a bug in the evaluation script, or some of the models may be failing to translate entirely.
laurejt left a comment
This notebook does not address #61 which is to qualitatively evaluate the quality of the metrics with respect to our music domain translations. However, this notebook can be revised to meet the initial specifications for #17. This will largely involve trimming down the notebook to section 5, but consider any additional features that would be valuable for completing #61 (the issue has been updated with a more detailed specification).
Before the next review, make sure to update the PR's description to match the shift in the notebook's scope / purpose.
| "marimo>=0.22.0", | ||
| "polars>=0.20.4", | ||
| "numpy", | ||
| "ipython", # Required by transformers for trainer functionality | ||
| "transformers[torch, sentencepiece, tiktoken]", | ||
| "google-cloud-translate", | ||
| "google-auth", | ||
| "orjsonl", | ||
| "ftfy", | ||
| "evaluate", | ||
| "datasets", | ||
| "sacrebleu", | ||
| "unbabel-comet", | ||
| "altair>=5", | ||
| "pandas", | ||
| "scipy", | ||
| "nbformat>=5.10.4", | ||
| "nbconvert>=7.17.0", | ||
| "playwright>=1.58.0", | ||
| ] |
Are these specific versions all the way to patch levels needed? How are these dependencies being added to the pyproject.toml?
If the versioning logic is needed, please document the reasons.
This question still needs to be addressed.
```toml
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
```
The comment suggests this is a general rule for all notebooks, so make that the case.
```diff
- extend-exclude = [ "notebooks/mt_eval_analysis.py" ]
+ extend-exclude = [ "notebooks/*.py" ]
```
```toml
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]
```
Given that notebooks are ignored by ruff, why is this needed?
This still stands: if ruff is ignoring these files, then there is no need for this rule.
```python
import pathlib

import altair as alt
import pandas as pd
```
Why are you using pandas at all? Altair fully supports polars and does not require pandas.
https://pola.rs/posts/lightweight_plotting/
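For instance, a minimal sketch (this assumes Altair >= 5, which accepts polars DataFrames directly; the column names are made up for illustration):

```python
import altair as alt
import polars as pl

scores = pl.DataFrame(
    {"chrf": [52.1, 48.3, 61.0], "comet": [0.78, 0.74, 0.85]}
)

# Altair consumes the polars DataFrame as-is; no pandas conversion needed.
chart = alt.Chart(scores).mark_point().encode(x="chrf", y="comet")
```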
```python
def _(DATA_DIR, pl):
    # Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
    sents_meta = pl.read_ndjson(
        DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
```
The prodigy input is the wrong file to use for exploring our sentence-level machine translations since it is a small subset of the available sentence translations and drops backtranslations entirely.
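Something like the following would cover the full corpus instead (a sketch based on the `notion-sents/mt-sents-*.jsonl` layout that appears elsewhere in this PR; the model list is illustrative):

```python
# Concatenate the per-model sentence translation files rather than
# loading the prodigy annotation subset.
sents_meta = pl.concat(
    [
        pl.read_ndjson(DATA_DIR / f"notion-sents/mt-sents-{m}.jsonl")
        for m in ["google_tllm", "hymt", "gemma"]
    ]
)
```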
This still needs to be corrected.
- Replace quantitative analysis (sections 1-4) with a qualitative translation browser, aligning the notebook with issue #17
- Split browser into two sections: Translations and Backtranslations
- Add DATA_DIR variable at top for configurable data path
- Broaden ruff exclude to cover entire notebooks/ directory
@laurejt Hi Laure! The data browser is ready for your review. I’m using it to analyze the data and draft the Google Doc. I’ll send you the Doc too as soon as it’s ready.
laurejt left a comment
This is looking pretty good, but it will be difficult to use this for analysis until the full texts are displayed. Additionally, this notebook does not provide browsing of our full sentence-level machine translation corpus, which should cover all of the Notion concepts (not just the subset used in the annotation task) and include the sentence backtranslations.
Requested Changes:
- Rename the notebook file
- Remove the descriptions of the metrics at the beginning
- Use the full sentence translation corpus not the subset used for the annotation task
- Update the dataframe browsers so that the entire text of a field may be viewed. Marimo's built-in dataframe viewer supports this via hovering, but the way this notebook currently displays dataframes does not.
- Update `pyproject.toml` to address my comments
```python
def _(DATA_DIR, pl):
    # Load sentence eval CSVs and join with metadata from notion-concept-tasks.jsonl
    sents_meta = pl.read_ndjson(
        DATA_DIR / "prodigy/notion-concept/notion-concept-tasks.jsonl"
```
This still needs to be corrected.
```toml
[tool.ruff.lint.per-file-ignores]
# Notebooks use typographic Unicode in markdown strings intentionally
"notebooks/*.py" = ["RUF001"]
```
This still stands: if ruff is ignoring these files, then there is no need for this rule.
| "marimo>=0.22.0", | ||
| "polars>=0.20.4", | ||
| "numpy", | ||
| "ipython", # Required by transformers for trainer functionality | ||
| "transformers[torch, sentencepiece, tiktoken]", | ||
| "google-cloud-translate", | ||
| "google-auth", | ||
| "orjsonl", | ||
| "ftfy", | ||
| "evaluate", | ||
| "datasets", | ||
| "sacrebleu", | ||
| "unbabel-comet", | ||
| "altair>=5", | ||
| "pandas", | ||
| "scipy", | ||
| "nbformat>=5.10.4", | ||
| "nbconvert>=7.17.0", | ||
| "playwright>=1.58.0", | ||
| ] |
This question still needs to be addressed.
Please rename this file to what it represents. Perhaps mt_browser.py?
```
but for non-literal, domain-specific translations it mostly measures surface similarity
to the reference rather than actual quality. The near-zero correlation with CometKiwi
on paragraphs (r ≈ −0.05) is a red flag.

## 2. Backtranslations
```
This is not a good way to display the dataframe since there is no way to read the entire text within a given column.
```python
n_sel = mo.ui.slider(5, 50, step=5, value=10, label="Show N")


def _(mo):
    mo.md("""
    ## 1. Translations
```
This is not a good way to display the dataframe since there is no way to read the entire text within a given column.
```
**Metrics:** chrF measures character n-gram overlap with the reference (unreliable for
cross-script directions such as `en→zh`). COMET and CometKiwi are neural metrics trained
on human judgements; CometKiwi is reference-free and does not penalise paraphrastic output.
```
These do not sufficiently describe the algorithms. Do not revise; simply delete.
```toml
# Configure src path so ruff import fixes can identify local imports
src = [ "src" ]
# Ignore notebooks — marimo notebooks are not linted
extend-exclude = [ "notebooks/" ]
```
```diff
- extend-exclude = [ "notebooks/" ]
+ extend-exclude = [ "notebooks/*.py" ]
```
@tanhaow When it comes to fixing the display issue, just simplify it. Don't use marimo ui components; simply display the dataframes directly. There's also no reason to combine the viewing space for sentences and paragraphs; just display a dataframe for each. While this approach only shows the dataframes in edit mode, that's fine for our current needs.
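In marimo's file format that could be as simple as ending the cell with the dataframe expression. A minimal sketch (the file name and variable names here are illustrative):

```python
import marimo

app = marimo.App()


@app.cell
def _(DATA_DIR, pl):
    sent_translations = pl.read_ndjson(
        DATA_DIR / "notion-sents/mt-sents-google_tllm.jsonl"
    )
    # The last expression in the cell is what marimo renders,
    # using its built-in dataframe viewer.
    sent_translations
    return
```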
- Rename to mt_browser.py
- Use full sentence corpus (mt-sents-*.jsonl, 854 rows, 4 models including madlad) instead of annotation subset (notion-concept-tasks.jsonl, 300 rows)
- Add sentence backtranslations section (en→* directions)
- Replace mo.ui widgets with native marimo dataframe viewer for full-text hover
- Remove metric descriptions from notebook header
- Split into 4 sections: sentence translations, sentence backtranslations, paragraph translations, paragraph backtranslations
altair, pandas, scipy, nbformat, nbconvert, and playwright were added for the previous notebook version and are no longer used.
- Change extend-exclude to notebooks/*.py (more precise)
- Remove per-file-ignores section for notebooks (redundant when excluded)
laurejt left a comment
This is looking really good. There's just one minor change: skip the "madlad" sentence corpus, since that model was dropped from our evaluation (i.e., replaced by TranslateGemma).
```python
_sents_meta = pl.concat(
    [
        pl.read_ndjson(DATA_DIR / f"notion-sents/mt-sents-{m}.jsonl")
        for m in ["google_tllm", "hymt", "gemma", "madlad"]
```
Drop madlad since it was dropped from our working set of models.
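i.e., something like:

```diff
-        for m in ["google_tllm", "hymt", "gemma", "madlad"]
+        for m in ["google_tllm", "hymt", "gemma"]
```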
madlad was replaced by TranslateGemma in our working set of models.
Associated Issue(s): resolves #17
Changes in this PR
A marimo notebook (`notebooks/mt_browser.py`) for browsing Phase 1 machine translations side-by-side with source text, reference translation, and evaluation scores (chrF, COMET, CometKiwi).

Four sections:

Uses the full sentence corpus (`mt-sents-*.jsonl`, 854 pairs, 4 models including madlad). Displays dataframes using marimo's native viewer, which supports hovering to view full text.

Notes

Set `DATA_DIR` at the top of the notebook to the phase-1 data directory (project drive or TigerData mount). All paths are resolved relative to that variable.

Reviewer Checklist