🔗 Live App: sentiment-analysis-flipkart-reviews.streamlit.app | 📦 Base Project: Sentiment-Analysis-Flipkart-Product-Reviews
- Project Overview
- Relationship to Base Project
- Repository Structure
- MLflow Architecture & Tracking Setup
- Training Script 1 — Multi-Model Experiment Tracking
- Training Script 2 — Hyperparameter Tuning with Run Comparison
- What Gets Logged — Parameters, Metrics, Artifacts & Tags
- MLflow Model Registry & Version Management
- MLflow UI — Experiments & Visualizations
- Experiment Results & Run Comparison
- Streamlit Web Application
- Configuration Reference
- Local Setup & Execution
- Future Scope
This project extends an existing end-to-end Sentiment Analysis system by integrating MLflow 3.9.0 for full ML lifecycle management. The core objective is to demonstrate production-grade experiment tracking, reproducibility, model versioning, and governance — applied to the real-world problem of classifying Flipkart product reviews as Positive or Negative.
Rather than simply training models, this project treats every training run as a tracked, reproducible, auditable experiment, where all decisions — hyperparameter choices, feature engineering strategies, evaluation outcomes — are recorded and queryable through a local MLflow tracking server.
What this project specifically demonstrates:
- Setting up a named MLflow experiment and launching multiple tracked runs within it
- Logging parameters, metrics, models, and file artifacts per run using MLflow Tracking APIs
- Assigning descriptive custom run names to make the MLflow UI immediately readable
- Registering all model candidates under a single centralized MLflow Model Registry entry
- Applying structured tags to individual model versions for governance and lifecycle management
- Comparing run metrics and hyperparameters visually through MLflow's built-in UI charts
- Serving predictions through a Streamlit app backed by the best registered model
MLflow version used: 3.9.0 (confirmed from MLflow UI, running at http://127.0.0.1:5000)
This repository is a direct extension of the Sentiment Analysis of Real-time Flipkart Product Reviews project. The entire src/ package — config.py, data_loader.py, preprocessing.py, feature_engineering.py, train.py, evaluate.py, model_registry.py, and inference.py — is carried over unchanged.
The key additions in this repository are:
| New File | Purpose |
|---|---|
| run_training_mlflow.py | Tracks both Logistic Regression and LinearSVC as separate MLflow runs under a shared experiment |
| train_with_mlflow.py | Performs hyperparameter tuning over Logistic Regression configurations, with each configuration logged as a distinct MLflow run |
| mlflow.db | SQLite file used as the MLflow backend store for persisting experiment and run metadata locally |
| mlflow (in requirements.txt) | The only new dependency added over the base project |
run_training.py is retained as the non-MLflow baseline training script, providing a clean reference point for comparing instrumented vs. uninstrumented training flows.
MLflow_Flipkart_Sentiment_Project/
│
├── data/
│ └── data.csv # 8,518 Flipkart product reviews
│
├── artifacts/ # Joblib-serialized artifacts for Streamlit inference
│ ├── sentiment_model.pkl # Best model from baseline training (LinearSVC)
│ ├── vectorizer.pkl # Fitted TF-IDF vectorizer
│ └── model_metadata.json # Model name and F1 score metadata
│
├── src/ # Core ML pipeline package (shared with base project)
│ ├── config.py # Centralized path and hyperparameter constants
│ ├── data_loader.py # CSV ingestion with null filtering
│ ├── preprocessing.py # Text cleaning and lemmatization
│ ├── feature_engineering.py # TF-IDF vectorization (fit/transform)
│ ├── train.py # Model definitions — LogisticRegression & LinearSVC
│ ├── evaluate.py # F1-Score evaluation loop
│ ├── model_registry.py # Joblib artifact serialization and metadata export
│ └── inference.py # Prediction pipeline for Streamlit app
│
├── app.py # Streamlit web application
├── run_training.py # Baseline training pipeline (no MLflow)
├── run_training_mlflow.py # MLflow-tracked multi-model experiment
├── train_with_mlflow.py # MLflow-tracked hyperparameter tuning runs
├── mlflow.db # SQLite backend store for MLflow tracking server
└── requirements.txt # Python dependency manifest
MLflow is run as a local tracking server backed by a SQLite database. All experiment metadata — run IDs, parameters, metrics, tags, and artifact paths — is persisted in mlflow.db, a file committed to the repository root.
mlflow ui --backend-store-uri sqlite:///mlflow.db

The UI is accessible at http://127.0.0.1:5000. No remote server, cloud storage, or authentication is configured. This is a fully self-contained local setup suited to reproducible demonstration and development-time experimentation.
Both MLflow training scripts share a single experiment namespace:
mlflow.set_experiment("Flipkart_Sentiment_Analysis")

MLflow creates this experiment on first call if it does not already exist, assigning it a unique experiment ID. All runs from both scripts are grouped under this experiment, enabling cross-run comparison within the same UI view without any additional configuration.
As confirmed in the MLflow Experiments page, two experiments exist in the tracking server:
| Experiment Name | Created | Purpose |
|---|---|---|
| Flipkart_Sentiment_Analysis | 02/04/2026, 11:49 AM | All tracked ML runs for this project |
| Default | 02/04/2026, 11:27 AM | MLflow system default (auto-created on first use) |
Each call to mlflow.start_run(run_name=...) opens a new run context manager. All log_param(), log_metric(), log_artifact(), log_model(), and set_tag() calls within the with block are scoped to that specific run. The run is automatically closed with status FINISHED when the block exits cleanly, or FAILED if an exception propagates out.
mlflow.start_run()
├── mlflow.log_param() → Stored in run metadata (backend store)
├── mlflow.log_metric() → Stored as a time-series metric in backend store
├── mlflow.log_artifact() → File copied to artifact store (default: ./mlruns/)
├── mlflow.sklearn.log_model() → Model serialized + registered in Model Registry
└── mlflow.set_tag() → Key-value annotation on the run
File: run_training_mlflow.py
This script runs the complete baseline ML pipeline and wraps each model's logging in its own MLflow run. Its purpose is to compare Logistic Regression and LinearSVC side-by-side as distinct tracked experiments within the same experiment group.
import os

import joblib
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split

# Project helpers; module paths are inferred from the src/ layout.
from src.config import RANDOM_STATE, RATING_COLUMN, TEST_SIZE, TEXT_COLUMN
from src.data_loader import load_data
from src.evaluate import evaluate_models
from src.feature_engineering import tfidf_features
from src.preprocessing import clean_text
from src.train import train_models

mlflow.set_experiment("Flipkart_Sentiment_Analysis")

# Load, clean, label, split, vectorize, and train once, outside any MLflow run.
df = load_data()
df["cleaned"] = df[TEXT_COLUMN].apply(clean_text)
df["sentiment"] = df[RATING_COLUMN].apply(lambda x: 1 if x >= 4 else 0)

X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned"], df["sentiment"],
    test_size=TEST_SIZE, random_state=RANDOM_STATE
)

X_train_vec, X_test_vec, vectorizer = tfidf_features(X_train, X_test)
models = train_models(X_train_vec, y_train)
scores = evaluate_models(models, X_test_vec, y_test)

# Each model gets its own run; the run context is used only for logging.
for model_name, model in models.items():
    with mlflow.start_run(run_name=f"{model_name}_TFIDF"):
        mlflow.log_param("vectorizer", "TF-IDF")
        mlflow.log_param("model_name", model_name)
        mlflow.log_param("test_size", TEST_SIZE)
        mlflow.log_metric("f1_score", scores[model_name])
        mlflow.sklearn.log_model(
            sk_model=model,
            artifact_path="model",
            registered_model_name="FlipkartSentimentModel"
        )
        # joblib.dump raises if the target directory does not exist yet.
        os.makedirs("temp", exist_ok=True)
        joblib.dump(vectorizer, "temp/vectorizer.pkl")
        mlflow.log_artifact("temp/vectorizer.pkl")
        mlflow.set_tag("project", "Flipkart Sentiment Analysis")
        mlflow.set_tag("dataset", "Flipkart Reviews")
        mlflow.set_tag("model_type", model_name)
        print(f"Logged {model_name} with F1-score: {scores[model_name]:.4f}")

Training outside the run loop: Data loading, preprocessing, splitting, vectorization, and model training all happen before any mlflow.start_run() call. This ensures both models are trained on identical data splits with an identical TF-IDF vocabulary, making their F1 scores a strictly controlled comparison. The MLflow run contexts are used purely for logging, not for controlling execution.
Custom run names (run_name=f"{model_name}_TFIDF"): Produces human-readable identifiers (LogisticRegression_TFIDF, LinearSVM_TFIDF) in the MLflow Experiments UI, eliminating the need to decode auto-generated run UUIDs when browsing or comparing runs.
Vectorizer logged as a file artifact: The fitted TfidfVectorizer is serialized to temp/vectorizer.pkl via joblib and logged with mlflow.log_artifact(). This ensures the exact vocabulary mapping used in each run is permanently stored in MLflow's artifact store — critical for reproducibility, since loading a model without its corresponding vectorizer would produce incorrect sparse vector representations at inference time.
Shared registered model name: Both models are registered to registered_model_name="FlipkartSentimentModel". Each log_model() call auto-increments the version under this single registry entry, enabling side-by-side version comparison within the Model Registry without creating separate registry entries per model type.
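The binary target that both training scripts derive from the star rating (1 when the rating is 4 or higher, otherwise 0) can be sketched in isolation. The helper name and the sample ratings below are illustrative, not part of the repository:

```python
POSITIVE_THRESHOLD = 4  # same cut-off as the lambda used in the training scripts


def rating_to_label(rating: float) -> int:
    """Map a star rating to 1 (Positive) or 0 (Negative)."""
    return 1 if rating >= POSITIVE_THRESHOLD else 0


# Made-up ratings, only to show where the boundary falls.
ratings = [5, 4, 3, 1]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [1, 1, 0, 0]
```

Note that under this rule a 3-star review is counted as Negative, which is worth keeping in mind when reading the F1 scores.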
File: train_with_mlflow.py
This script focuses exclusively on Logistic Regression and demonstrates how MLflow is used to systematically track the effect of varying hyperparameters across multiple independent runs. Each invocation of train_with_mlflow() is a fully self-contained, isolated, and logged experiment run.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Project helpers; module paths are inferred from the src/ layout.
from src.data_loader import load_data
from src.feature_engineering import tfidf_features
from src.preprocessing import clean_text

EXPERIMENT_NAME = "Flipkart_Sentiment_Analysis"
mlflow.set_experiment(EXPERIMENT_NAME)


def train_with_mlflow(max_iter, C):
    # Everything, including data loading, happens inside the run context.
    with mlflow.start_run(run_name=f"LogReg_iter={max_iter}_C={C}"):
        df = load_data()
        df["clean_text"] = df["Review text"].apply(clean_text)
        X_train_text, X_test_text, y_train, y_test = train_test_split(
            df["clean_text"],
            df["Ratings"].apply(lambda x: 1 if x >= 4 else 0),
            test_size=0.2, random_state=42
        )
        X_train, X_test, vectorizer = tfidf_features(X_train_text, X_test_text)
        mlflow.log_param("max_iter", max_iter)
        mlflow.log_param("C", C)
        model = LogisticRegression(max_iter=max_iter, C=C)
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        f1 = f1_score(y_test, preds)
        mlflow.log_metric("f1_score", f1)
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="FlipkartSentimentModel"
        )


if __name__ == "__main__":
    train_with_mlflow(max_iter=200, C=1.0)
    train_with_mlflow(max_iter=300, C=0.5)

| Run Name | max_iter | C (Inverse Regularization Strength) | F1-Score |
|---|---|---|---|
| LogReg_iter=200_C=1.0 | 200 | 1.0 (weaker regularization, coefficients are penalized less) | 0.92 |
| LogReg_iter=300_C=0.5 | 300 | 0.5 (stronger regularization, penalizes large coefficients more) | 0.92 |
Unlike run_training_mlflow.py where data loading and training happen once outside the run loop, here each train_with_mlflow() call reloads, preprocesses, splits, vectorizes, and trains entirely within its own run context. This makes every run independently reproducible — any single call can be re-executed in isolation with only its logged max_iter and C values, without relying on shared external state.
Run naming convention (f"LogReg_iter={max_iter}_C={C}"): Encodes hyperparameter values directly into the run name, making the MLflow UI self-documenting. No lookup into parameter tables is needed to understand what each run tested.
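A small sketch of this convention follows. `make_run_name` reproduces the f-string from the script; `parse_run_name` is a hypothetical helper (not in the repository) that shows the name can be decoded back into the hyperparameters it encodes:

```python
import re


def make_run_name(max_iter: int, C: float) -> str:
    """Build the run name in the same format as train_with_mlflow.py."""
    return f"LogReg_iter={max_iter}_C={C}"


# Hypothetical inverse of make_run_name; accepts plain and scientific notation.
_RUN_NAME_RE = re.compile(r"^LogReg_iter=(\d+)_C=([-+0-9.eE]+)$")


def parse_run_name(name: str) -> dict:
    """Recover max_iter and C from a run name, or raise ValueError."""
    match = _RUN_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized run name: {name!r}")
    return {"max_iter": int(match.group(1)), "C": float(match.group(2))}


print(make_run_name(200, 1.0))                  # LogReg_iter=200_C=1.0
print(parse_run_name("LogReg_iter=300_C=0.5"))  # {'max_iter': 300, 'C': 0.5}
```

Because the values are also logged with mlflow.log_param(), the run name is a convenience for humans; the logged parameters remain the source of truth.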
C parameter (what it controls): In scikit-learn's LogisticRegression, C is the inverse of regularization strength. A smaller C applies stronger L2 regularization, shrinking coefficient magnitudes and reducing overfitting risk. A larger C relaxes regularization, giving the model more freedom to fit the training data. Both configurations tested here (C=1.0 and C=0.5) scored 0.92 when rounded to two decimal places. Because the two runs change max_iter and C together, this suggests, but does not isolate, that the model is insensitive to regularization strength in this range on this dataset.
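For reference, the L2-penalized objective that the scikit-learn user guide gives for binary logistic regression can be written as below, with labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$, and intercept $b$. The exact scaling differs slightly between scikit-learn versions, so treat this as a sketch of the role of C rather than the precise implementation:

```latex
\min_{w,\,b} \;\; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^{\top} w + b\right)\right)\right)
```

C multiplies the data-fit term, so halving it from 1.0 to 0.5 doubles the relative weight of the penalty $\tfrac{1}{2} w^{\top} w$, which is why the smaller value corresponds to stronger regularization.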
Parameters are scalar values that describe the configuration of a run. They are stored immutably once logged and are displayed in the MLflow UI's parameter comparison columns.
| Parameter Key | Logged In | Value(s) |
|---|---|---|
| vectorizer | run_training_mlflow.py | "TF-IDF" |
| model_name | run_training_mlflow.py | "LogisticRegression", "LinearSVM" |
| test_size | run_training_mlflow.py | 0.2 |
| max_iter | train_with_mlflow.py | 200, 300 |
| C | train_with_mlflow.py | 1.0, 0.5 |
Metrics are numeric values used to evaluate model performance. MLflow stores metrics as a time series (step-indexed), supporting metric history plots.
| Metric Key | Logged In | Values Recorded |
|---|---|---|
| f1_score | Both scripts | 0.9218 (LinearSVC), 0.92 (LogReg both configs) |
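The f1_score metric is the harmonic mean of precision and recall for the positive class, which is what scikit-learn's f1_score computes by default for binary labels. A dependency-free sketch, using made-up toy labels rather than project data:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_binary(y_true, y_pred, positive=1) -> float:
    """Count TP/FP/FN for the positive class and return its F1 score."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


# Toy example: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(f1_binary(y_true, y_pred))  # 0.75
```

Since only one value per run is logged here, the step-indexed history that MLflow keeps for each metric contains a single point per run.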
Every run logs its trained scikit-learn model using MLflow's sklearn flavor:
mlflow.sklearn.log_model(
    sk_model=model,
    artifact_path="model",
    registered_model_name="FlipkartSentimentModel"
)

This serializes the model in MLflow's standard format (wrapping joblib internally), stores it under the run's artifact directory at path model/, and simultaneously registers a new version in the Model Registry under the name FlipkartSentimentModel.
In run_training_mlflow.py, the fitted TF-IDF vectorizer is additionally saved as a raw file artifact:
joblib.dump(vectorizer, "temp/vectorizer.pkl")
mlflow.log_artifact("temp/vectorizer.pkl")

This makes the vectorizer retrievable from any run's artifact store, ensuring the complete inference pipeline (model + vectorizer) can be reconstructed from MLflow alone.
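A minimal sketch of that reconstruction follows. It assumes the local sqlite:///mlflow.db store, a run produced by run_training_mlflow.py (the only script that logs the vectorizer), and that vectorizer.pkl sits at the root of the run's artifacts. Both function names are hypothetical, and the run ID must be copied from the MLflow UI:

```python
def registry_model_uri(name: str, version: int) -> str:
    """Build a models:/ URI that points at one registered model version."""
    return f"models:/{name}/{version}"


def load_pipeline_from_mlflow(run_id: str, name: str, version: int):
    """Return (model, vectorizer) loaded from the registry and the run's artifacts."""
    # Imported lazily so the URI helper above works without MLflow installed.
    import joblib
    import mlflow
    import mlflow.artifacts
    import mlflow.sklearn

    # Assumption: the tracking store is the SQLite file in the repository root.
    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    model = mlflow.sklearn.load_model(registry_model_uri(name, version))
    vectorizer_path = mlflow.artifacts.download_artifacts(
        run_id=run_id, artifact_path="vectorizer.pkl"
    )
    return model, joblib.load(vectorizer_path)


print(registry_model_uri("FlipkartSentimentModel", 2))  # models:/FlipkartSentimentModel/2
```

The model version and the run ID passed in should belong to the same training run, since a model is only meaningful with the vocabulary it was trained on.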
Tags in run_training_mlflow.py annotate each run with project-level metadata:
| Tag Key | Value |
|---|---|
| project | "Flipkart Sentiment Analysis" |
| dataset | "Flipkart Reviews" |
| model_type | "LogisticRegression" or "LinearSVM" |
These are filterable in the MLflow UI and useful for querying runs programmatically via the MLflow client API.
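As an illustration of programmatic querying, the sketch below builds a tag filter and passes it to mlflow.search_runs. The helper names are hypothetical, the local sqlite:///mlflow.db URI is an assumption, and the filter builder deliberately handles only simple keys and values:

```python
def tag_filter(key: str, value: str) -> str:
    """Build a search filter that matches runs whose tag `key` equals `value`."""
    if "'" in value:
        raise ValueError("values containing single quotes are not handled in this sketch")
    return f"tags.{key} = '{value}'"


def find_runs_by_model_type(model_type: str):
    """Return matching runs as a pandas DataFrame, best f1_score first."""
    # Imported lazily so tag_filter can be used without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri("sqlite:///mlflow.db")  # assumed local store
    return mlflow.search_runs(
        experiment_names=["Flipkart_Sentiment_Analysis"],
        filter_string=tag_filter("model_type", model_type),
        order_by=["metrics.f1_score DESC"],
    )


print(tag_filter("model_type", "LinearSVM"))  # tags.model_type = 'LinearSVM'
```

Only runs from run_training_mlflow.py carry the model_type tag, so the hyperparameter tuning runs would not appear in such a query.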
Every mlflow.sklearn.log_model(..., registered_model_name="FlipkartSentimentModel") call creates a new version in the centralized model registry. Across both scripts and all runs, three versions were registered:
| Version | Source Run | Algorithm | Status |
|---|---|---|---|
| Version 1 | LogisticRegression_TFIDF | Logistic Regression | - |
| Version 2 | LinearSVM_TFIDF | Linear SVC | - |
| Version 3 | LogReg_iter=300_C=0.5 | Logistic Regression | Production ✅ |
As confirmed in the Registered Models page, FlipkartSentimentModel is at Version 3 (latest), last modified on 02/04/2026 at 11:57 AM.
Version 3, sourced from the LogReg_iter=300_C=0.5 run, carries the following model-version-level tags applied via the MLflow UI:
| Tag Name | Tag Value | Purpose |
|---|---|---|
| algorithm | logistic_regression | Documents the model class used in this version |
| features | tfidf | Records the text representation technique |
| metric | f1_score | Identifies the evaluation criterion used for selection |
| stage | production | Marks this version as the approved production model |
| use_case | sentiment_analysis | Scopes the model to its intended application domain |
These tags serve as model governance metadata — in a team setting, they communicate to other practitioners which model is live, what it uses, and how it was evaluated, without requiring access to training code or experiment logs.
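In this project the tags were applied through the MLflow UI. The same tags could in principle be set from code with MlflowClient.set_model_version_tag; the sketch below is one way to do that, not the project's method. `apply_version_tags` and `RecordingClient` are hypothetical names, and the stand-in client only records calls so the example runs without a tracking server:

```python
VERSION_TAGS = {
    "algorithm": "logistic_regression",
    "features": "tfidf",
    "metric": "f1_score",
    "stage": "production",
    "use_case": "sentiment_analysis",
}


def apply_version_tags(client, name: str, version: str, tags: dict) -> None:
    """Set every tag on one model version via an MlflowClient-like object."""
    for key, value in tags.items():
        client.set_model_version_tag(name, version, key, value)


class RecordingClient:
    """Stand-in for MlflowClient that records calls instead of writing tags."""

    def __init__(self):
        self.calls = []

    def set_model_version_tag(self, name, version, key, value):
        self.calls.append((name, version, key, value))


dry_run = RecordingClient()
apply_version_tags(dry_run, "FlipkartSentimentModel", "3", VERSION_TAGS)
print(len(dry_run.calls))  # 5
```

To write the tags for real, pass `MlflowClient(tracking_uri="sqlite:///mlflow.db")` (imported from `mlflow`) in place of the recording client, assuming the local SQLite store described earlier.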
Both Logistic Regression runs (iter=200,C=1.0 and iter=300,C=0.5) achieved identical F1 scores of 0.92. Version 3 (iter=300, C=0.5) was selected and tagged as production because stronger regularization (C=0.5) reduces the risk of overfitting on the training vocabulary, generally yielding better generalization on unseen review text — a pragmatic tiebreaker when metrics are equal.
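The tiebreaker described above can be expressed as a small selection rule: take the best F1, treat runs within a tolerance as tied, and prefer the smallest C among them. The function name and the tolerance are assumptions for illustration; the run values are the rounded figures reported in this document:

```python
runs = [
    {"run_name": "LogReg_iter=200_C=1.0", "f1_score": 0.92, "C": 1.0},
    {"run_name": "LogReg_iter=300_C=0.5", "f1_score": 0.92, "C": 0.5},
]


def pick_candidate(candidates, tolerance=1e-3):
    """Pick the best run by F1; among near-ties prefer stronger regularization."""
    best_f1 = max(run["f1_score"] for run in candidates)
    tied = [run for run in candidates if best_f1 - run["f1_score"] <= tolerance]
    return min(tied, key=lambda run: run["C"])


print(pick_candidate(runs)["run_name"])  # LogReg_iter=300_C=0.5
```

A run with a clearly higher F1 still wins regardless of its C value; the regularization preference only applies inside the tolerance band.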
The MLflow Experiments page (accessible at http://127.0.0.1:5000/#/experiments) lists all registered experiments with their creation timestamps, last-modified times, and optional tags.
Both Flipkart_Sentiment_Analysis and the system Default experiment are visible. Clicking into Flipkart_Sentiment_Analysis reveals all individual runs, their parameters, metrics, and status.
MLflow's built-in chart view renders a horizontal bar chart comparing f1_score across all runs selected in the UI.
The chart shows both Logistic Regression hyperparameter runs side by side:
- LogReg_iter=300_C=0.5 → 0.92
- LogReg_iter=200_C=1.0 → 0.92
The chart shows that neither hyperparameter configuration yields a measurable advantage on this dataset, so the production tag on Version 3 rests on the regularization tiebreaker rather than on a difference in the metric.
The Models section of the MLflow UI (http://127.0.0.1:5000/#/models) provides a centralized registry view.
FlipkartSentimentModel is listed at Version 3, with tags (stage: production, algorithm: logistic_regression, features: tfidf) visible directly in the registry overview — demonstrating that model governance metadata is surfaced without having to open individual versions.
The version detail page for FlipkartSentimentModel v3 (http://127.0.0.1:5000/#/models/FlipkartSentimentModel/versions/3) shows:
- Registered At: 02/04/2026, 11:57:46 AM
- Source Run: LogReg_iter=300_C=0.5 (hyperlink back to the originating experiment run)
- Tags: algorithm: logistic_regression, features: tfidf, metric: f1_score, stage: production, use_case: sentiment_analysis
- Stage (deprecated field): None. The newer MLflow registry UI uses tags for lifecycle management instead of the legacy Staging/Production/Archived stage enum
A consolidated summary of all runs logged under the Flipkart_Sentiment_Analysis experiment:
| Run Name | Script | Model | vectorizer | max_iter | C | F1-Score | Registry Version |
|---|---|---|---|---|---|---|---|
| LogisticRegression_TFIDF | run_training_mlflow.py | Logistic Regression | TF-IDF | 1000 | - | ~0.92 | Version 1 |
| LinearSVM_TFIDF | run_training_mlflow.py | LinearSVC | TF-IDF | - | - | 0.9218 | Version 2 |
| LogReg_iter=200_C=1.0 | train_with_mlflow.py | Logistic Regression | TF-IDF | 200 | 1.0 | 0.92 | - |
| LogReg_iter=300_C=0.5 | train_with_mlflow.py | Logistic Regression | TF-IDF | 300 | 0.5 | 0.92 | Version 3 (Production) |
Key observations:
- LinearSVC achieves the highest absolute F1 score (0.9218) but was not selected for the production tag in the registry — the registry's production version is a Logistic Regression model, reflecting a deliberate choice to prioritize the model explored through the hyperparameter tuning workflow.
- Both Logistic Regression hyperparameter configurations reach the same F1 to two decimal places, which suggests that a linear model on TF-IDF features is close to its ceiling on this dataset and that further gains may require richer representations (e.g., BERT embeddings).
File: app.py
The Streamlit app is identical to the base project's. It loads pre-serialized artifacts from the artifacts/ directory using joblib rather than from the MLflow artifact store, so serving predictions does not depend on the MLflow tracking server or the Model Registry.
import json

import streamlit as st

from src.inference import predict_sentiment

st.set_page_config(page_title="Flipkart Sentiment Analyzer", layout="centered")

review = st.text_area("Enter your review:")
if st.button("Analyze Sentiment"):
    if review.strip():
        result = predict_sentiment(review)  # From src/inference.py
        st.success(f"Sentiment: {result}")
    else:
        st.warning("Please enter a review.")

# Sidebar values come from the metadata file written by the baseline training run.
with open("artifacts/model_metadata.json") as f:
    metadata = json.load(f)
st.sidebar.write(f"Model: {metadata['model_name']}")  # LinearSVM
st.sidebar.write(f"F1 Score: {metadata['f1_score']:.4f}")  # 0.9218

The sidebar dynamically reads from model_metadata.json rather than hardcoding values, so any future retraining that updates the artifact will automatically be reflected in the UI. The inference chain (clean_text → vectorizer.transform → model.predict) is defined in src/inference.py and is structurally identical in the base project and this repository.
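The sidebar logic depends on two keys in model_metadata.json, model_name and f1_score. A small sketch of reading and validating that file is shown below; the helper names are hypothetical, and a temporary file with the values quoted in this document stands in for artifacts/model_metadata.json:

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"model_name", "f1_score"}


def load_metadata(path) -> dict:
    """Read the metadata JSON and fail early if an expected key is missing."""
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    missing = REQUIRED_KEYS - metadata.keys()
    if missing:
        raise KeyError(f"model metadata is missing keys: {sorted(missing)}")
    return metadata


def sidebar_lines(metadata: dict) -> list:
    """Format the two lines the app writes to the sidebar."""
    return [
        f"Model: {metadata['model_name']}",
        f"F1 Score: {metadata['f1_score']:.4f}",
    ]


# Demo against a temporary file instead of the real artifacts/ directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model_metadata.json"
    demo_path.write_text(json.dumps({"model_name": "LinearSVM", "f1_score": 0.9218}))
    lines = sidebar_lines(load_metadata(demo_path))

print(lines)  # ['Model: LinearSVM', 'F1 Score: 0.9218']
```

Validating the keys up front gives a clearer error than a KeyError raised in the middle of rendering the page.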
Live App: https://sentiment-analysis-flipkart-reviews.streamlit.app
File: src/config.py
DATA_PATH = "data/data.csv" # Input dataset path
MODEL_DIR = "artifacts" # Output directory for joblib artifacts
LOG_DIR = "logs" # Reserved for future logging integration
TEXT_COLUMN = "Review text" # Column used as model input feature
RATING_COLUMN = "Ratings" # Column used to derive sentiment labels
RANDOM_STATE = 42 # Global seed for reproducibility
TEST_SIZE = 0.2 # 80/20 train-test split ratio

run_training_mlflow.py imports TEST_SIZE and RANDOM_STATE from config.py, while train_with_mlflow.py hardcodes the same values (test_size=0.2, random_state=42) in its train_test_split call. Both scripts therefore use the same split as the baseline, which keeps cross-script metric comparisons valid.
- Python 3.8 or higher
- Git
git clone https://github.com/Avik-Das-567/Using-MLflow-for-Experiment-Tracking-and-Model-Management.git
cd Using-MLflow-for-Experiment-Tracking-and-Model-Management

python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

pip install -r requirements.txt

requirements.txt:
mlflow
pandas
numpy
scikit-learn
nltk
streamlit
joblib
matplotlib
seaborn
tqdm
mlflow ui --backend-store-uri sqlite:///mlflow.db

Open http://127.0.0.1:5000 in your browser. Keep this terminal running in the background while executing training scripts so all runs are captured in real time.
python run_training_mlflow.py

This logs two runs (LogisticRegression_TFIDF and LinearSVM_TFIDF) to the Flipkart_Sentiment_Analysis experiment and registers both models as Version 1 and Version 2 under FlipkartSentimentModel.
Expected output:
Logged LogisticRegression with F1-score: 0.9200
Logged LinearSVM with F1-score: 0.9218
python train_with_mlflow.py

This sequentially executes two Logistic Regression runs with different hyperparameters and logs each to the same experiment.
Expected output:
Run completed | F1-score: 0.9200
Run completed | F1-score: 0.9200
Navigate to http://127.0.0.1:5000 and:
- Open the Flipkart_Sentiment_Analysis experiment to view all runs
- Select multiple runs and click Compare to generate metric and parameter comparison charts
- Navigate to Models → FlipkartSentimentModel to inspect registered versions and tags
streamlit run app.py

The app loads artifacts from artifacts/ and serves predictions at http://localhost:8501.
| Enhancement | Description |
|---|---|
| Remote MLflow Tracking Server | Replace the local SQLite backend with a remote server (e.g., hosted on AWS EC2 with S3 artifact store) to support team-wide experiment sharing |
| Prefect Workflow Orchestration | Wrap the training pipeline in Prefect flows and tasks to enable scheduled auto-retraining, dependency management, and a visual Prefect Dashboard |
| Hyperparameter Search with Optuna | Replace manual grid search with Optuna's Bayesian optimization, logging each trial as an MLflow run for full search history tracking |
| MLflow Model Serving | Deploy the production-tagged model version directly via mlflow models serve as a REST endpoint, replacing the current joblib-based inference module |
| Automated Model Promotion | Write scripts using the MLflow client API (MlflowClient.set_registered_model_alias) to automatically promote the best-performing run to production based on metric thresholds |
| BERT/Transformer Embeddings | Integrate transformer-based feature extraction as an alternative to TF-IDF, tracked as a separate experiment branch within the same Flipkart_Sentiment_Analysis namespace |
| Data Versioning with DVC | Combine MLflow experiment tracking with DVC data versioning so each MLflow run is linked to a specific, reproducible snapshot of data.csv |
This project is licensed under the MIT License — you are free to use, modify, and distribute this project with attribution.