In a global marketplace like Amazon, the integrity of a "Verified Purchase" is the cornerstone of consumer trust. Fraudulent or unverified reviews can distort recommender systems and mislead millions of shoppers. This project tells the story of building a Scalable Big Data Pipeline—moving beyond local processing to the cloud—to automate the detection of review authenticity using distributed Machine Learning.
Processing ~8GB of raw JSONL data is impractical on a single standard machine, so I architected a distributed pipeline to handle the volume.
- Lazy Evaluation: Leveraged Spark's execution engine to optimize memory usage during high-volume ingestion.
- Schema Evolution: Handled complex, nested JSON structures by flattening metadata for relational compatibility.
- Sampling for Speed: During the development phase, I implemented a 10% sampling strategy (`fraction=0.1`) to iterate rapidly on the model without incurring unnecessary cloud costs.
To solve the computational challenge, I took the project to the Google Cloud Platform (GCP).
- Dataproc Orchestration: I provisioned a multi-node cluster specifically for compute-intensive Spark jobs.
- The Data Lake: Optimized I/O performance by integrating Google Cloud Storage (GCS) as the primary storage layer, so data flows directly between the lake and the Spark executors.
With the infrastructure set, I deployed PySpark MLlib to build a classifier that doesn't just run—it scales.
- Distributed Random Forest: Built a classification model to predict "Verified" status from features such as helpfulness ratios and rating distributions.
- Feature Vectors: Used `VectorAssembler` and `Tokenizer` to transform unstructured review text and metadata into high-dimensional feature vectors ready for distributed training.
- Relational Insights: Integrated Apache Hive to perform SQL-style analytics, identifying top-reviewed products and computing verified-purchase percentages across the dataset.
This project is architected to run on a multi-node distributed cluster. Follow these steps to provision the production environment:
- Create Bucket: Provision a Google Cloud Storage bucket (e.g., `gs://amazon-review-analytics/`).
- Data Ingestion: Upload the `Toys_and_Games.jsonl` file into a directory named `/raw_data/`.
- Script Upload: Move your `main.py` and `mllib_model.py` into a `/scripts/` folder.
Submit this command via the GCP Cloud Shell to create a managed Spark/Hadoop environment:
```shell
gcloud dataproc clusters create toy-analytics-cluster \
    --region=europe-west3 \
    --master-machine-type=n1-standard-2 \
    --worker-machine-type=n1-standard-2 \
    --num-workers=2 \
    --image-version=2.0-debian10 \
    --enable-component-gateway \
    --optional-components=HIVE_WEBHCAT
```
Once the cluster status shows "Running," submit the PySpark job:
```shell
gcloud dataproc jobs submit pyspark gs://[YOUR_BUCKET]/scripts/main.py \
    --cluster=toy-analytics-cluster \
    --region=europe-west3
```
- Source: Amazon Product Review Dataset (2018) - Toys & Games
- Volume: ~8GB of raw review metadata (JSONL format).
- KPIs Tracked: Average ratings, review helpfulness density, and stopword-filtered word frequency distributions.
- Scope: This project specifically processes the "Toys and Games" category to evaluate trust signals at scale.
| Tier | Tool / Platform | Purpose |
|---|---|---|
| Compute | Apache Spark (PySpark) | Distributed data processing and ML. |
| Cloud | GCP Dataproc | Managed Spark/Hadoop cluster infrastructure. |
| Storage | GCS / Hive | Scalable object storage and analytical warehousing. |
| Modeling | PySpark MLlib | Production-scale classification algorithms. |
- Trust Calibration: Automated the detection of unverified reviews to improve consumer confidence.
- Scalable ROI: Demonstrated ability to handle enterprise-level datasets (GB to TB scale), with throughput scaling near-linearly as worker nodes are added.
- End-to-End Ownership: Managed the complete data lifecycle—from cloud ingestion to predictive deployment.
- Lead Developer: Shreya Malogi (Founder @ Codemacrocosm)
- Status: Production-ready architectural proof-of-concept. Optimized for horizontal scalability within enterprise ecosystems.