Data-Engineering-track-with-Zachary

📚 Learning Goals & Assignment Summary

This repository contains assignments from the Data Engineering Track, each designed to build specific skills and competencies in data engineering. Below is a comprehensive summary of learning goals achieved through each assignment.

🎯 Assignment 1: Dimensional Data Modeling

Grade: A | Status: Complete

Learning Objectives:

Advanced SQL Techniques: Master CTEs (Common Table Expressions) for complex queries
Data Structure Design: Use structs and arrays to compact and organize data efficiently
Custom Data Types: Create and implement enumerated types for business logic
SCD Implementation: Design and implement Slowly Changing Dimensions (Type 2)
Historical Data Management: Track changes over time with proper temporal modeling
Window Functions: Apply LAG, LEAD, and other window functions for analytics
Incremental Processing: Implement both full and incremental data loading strategies

Key Skills Developed:

Complex SQL query design and optimization
Dimensional modeling principles
Data warehouse design patterns
Historical data tracking and management

🎯 Assignment 2: Fact Data Modeling

Grade: B | Status: Complete

Learning Objectives:

Fact vs Dimension Tables: Understand when to use each table type
Data Deduplication: Implement effective duplicate removal strategies
Binary Data Transformation: Convert date arrays to binary representations for analytics
Incremental Loading: Design efficient incremental data processing pipelines
Array Operations: Master PostgreSQL array functions and operations
Performance Optimization: Implement efficient data loading and querying strategies

Key Skills Developed:

Fact table design and implementation
Data quality and deduplication techniques
Binary data representation for analytics
Incremental data processing design

🎯 Assignment 3: Spark Fundamentals

Grade: A | Status: Complete

Learning Objectives:

PySpark Operations: Master DataFrame operations and transformations
Performance Optimization: Implement broadcast joins and bucketing strategies
Data Partitioning: Design efficient partitioning and sorting strategies
Join Optimization: Understand and implement different join strategies
Aggregation Techniques: Master complex aggregations and window functions
Performance Monitoring: Analyze and optimize query performance

Key Skills Developed:

Apache Spark/PySpark programming
Big data processing and optimization
Performance tuning and monitoring
Distributed computing concepts

🎯 Assignment 4: PySpark Testing

Grade: A | Status: Complete

Learning Objectives:

PostgreSQL to SparkSQL Conversion: Convert complex SQL queries to Spark
Testing Frameworks: Implement comprehensive testing for data pipelines
Data Validation: Create robust data quality checks and validations
Test Data Generation: Design realistic test datasets for validation
Pipeline Testing: Implement end-to-end testing for data workflows
Error Handling: Design graceful error handling and recovery mechanisms

Key Skills Developed:

SparkSQL query optimization
Data pipeline testing methodologies
Test-driven development for data engineering
Quality assurance and validation techniques

🎯 Assignment 5: Data Pipeline Maintenance

Grade: Pending | Status: In Progress

Learning Objectives:

Pipeline Ownership: Design ownership and responsibility matrices
On-Call Scheduling: Create fair and sustainable on-call schedules
Runbook Creation: Develop comprehensive operational documentation
Risk Assessment: Identify and document pipeline risks and mitigation strategies
SLA Definition: Design Service Level Agreements and monitoring strategies
Incident Response: Plan and document incident response procedures

Key Skills Developed:

Data pipeline operations management
Team coordination and scheduling
Technical documentation and runbooks
Risk management and incident response

🎯 Assignment 6: Data Pipeline Maintenance (Continued)

Grade: Pending | Status: In Progress

Learning Objectives:

Monitoring & Alerting: Design comprehensive monitoring strategies
Performance Metrics: Define and track key performance indicators
Capacity Planning: Plan for growth and scalability
Disaster Recovery: Design backup and recovery procedures
Compliance & Security: Implement data governance and security measures

Key Skills Developed:

Production data pipeline management
Monitoring and observability
Capacity planning and scaling
Data governance and security

🎯 Assignment 7: Product Experimentation & User Journey Analysis

Grade: Pending | Status: In Progress

Learning Objectives:

User Journey Mapping: Analyze and document user experiences
A/B Testing Design: Design comprehensive experimentation frameworks
Statistical Analysis: Apply statistical methods for experiment evaluation
Metrics Definition: Define leading and lagging indicators
Hypothesis Formation: Create testable hypotheses for product improvement
Data-Driven Decision Making: Use analytics to inform product decisions

Key Skills Developed:

Product analytics and user research
Experimental design and A/B testing
Statistical analysis for business decisions
Product strategy and optimization

🎯 Assignment 8: Tableau Dashboard Creation

Grade: Pending | Status: Complete

Learning Objectives:

Data Visualization: Create professional dashboards for stakeholders
Executive Dashboards: Design high-level KPI dashboards for business users
Exploratory Dashboards: Build interactive analysis tools with filters
Data Storytelling: Communicate insights through effective visualizations
Dashboard Design: Apply UX/UI principles to dashboard creation
Data Publishing: Publish and share dashboards on Tableau Public

Key Skills Developed:

Tableau Public dashboard creation
Data visualization best practices
Business intelligence and reporting
Interactive dashboard design

🏆 Overall Learning Journey Summary

Technical Skills Mastered:

SQL & Databases: Advanced SQL, dimensional modeling, SCD patterns
Big Data Processing: Apache Spark, PySpark, performance optimization
Data Quality: Testing, validation, deduplication, error handling
Data Visualization: Tableau, dashboard design, data storytelling
Pipeline Management: Operations, monitoring, risk assessment

Professional Skills Developed:

Problem Solving: Complex data engineering challenges
Documentation: Technical writing, runbooks, specifications
Testing: Test-driven development, quality assurance
Communication: Dashboard design, data storytelling
Project Management: Assignment planning and execution

Industry-Ready Competencies:

Data Engineering: End-to-end data pipeline development
Data Warehousing: Design and implementation of data warehouses
Big Data Processing: Scalable data processing with Spark
Business Intelligence: Dashboard creation and data visualization
Operations: Production pipeline management and maintenance

This learning journey demonstrates comprehensive data engineering skills development, from foundational SQL to advanced big data processing and visualization. Each assignment builds upon previous knowledge, creating a strong foundation for a data engineering career. 🚀📊

Dimensional data modeling

1- Use of CTEs (Common Table Expressions) for temporaray table quering

2- use of structs to compact data and create customized types

3- use of arrays to compact data

4- unnest of data for slice and dice operations

5- track changements and modeling for SCD (Slowly changing Dimensions)

6- incremental and one full methods for back filling tables

7- window functions for advanced data processing

8- creating graphs and advanced analytics (DDLs for vertices, edges and their use for analytics): check the memo

Review of the assignment:

Feedback on Dimensional Data Modeling Submission:

Data Definition Language (DDL):

The use of custom types like films and quality_class is a good approach to encapsulate information and manage enumerations, which makes your schema more expressive. However, it is worth noting that custom types might add complexity when altering schema data types or migrating databases in some systems. The primary key definition for the actors table is well-thought-out. Combining actorid and year ensures unique entries each year per actor, which is crucial for accumulating historical data. Cumulative Table Generation:

The use of common table expressions (CTEs) — years, actor_startings, actors_seasons, and movies_aggregated — is efficient for breaking down the query logic into manageable steps. This makes your SQL readable and modular, aiding in troubleshooting. The logic used to categorize quality_class is straightforward and follows a simplistic tier system based on avg_rating. This is good, assuming ratings fall within expected ranges. Actors History SCD Table:

Your DDL for the actors_history_scd table appropriately sets up the structure for type 2 slowly changing dimensions (SCD). Tracking is_active and quality_class aligns with good SCD principles. Throughout the backfill query, the use of window functions (LAG) to identify changes in quality or activity status is apt and demonstrates an understanding of SQL analytics. Backfill Query:

The query captures historical data changes well with streak identifiers and transition tracking, covering both the data integrity and iterative growth of states correctly. The decision to cap the current_season at 2020 in the outcome is a good placeholder for further incremental steps. Incremental Query for SCD:

The query effectively combines historical records with new data by using structures like unchanged_records, changed_records, and new_records, which is essential for maintaining accurate SCD. The use of set operations (UNION ALL) ensures completeness, gathering from all necessary data streams into one output, indicating apt use of SQL in handling multi-source data aggregation. General Constructive Feedback:

Consider including comments more thoroughly explaining each SQL section or step for enhanced clarity, especially beneficial for future maintainers or collaborators. While the SQL logic is sound, incorporating indexed strategies could be considered in your design, depending on scale, to boost performance, especially for tables that grow significantly over time. Final Evaluation:

Your submission demonstrates a comprehensive understanding of dimensional data modeling, especially in managing historical data changes with SCD. The provided queries are functional, with logical structuring and clear attempts at applying advanced SQL features (like window functions and CTEs) effectively.

FINAL GRADE:

{ "letter_grade": "A", "passes": true }

Fact Data Modeling

1- difference between Fact and dimensions tables

2- when a changing value is considered to belong to dimension or to fact table

3- transforming logging activities to fact table

4- compacting logging history into binary rows for later easy advanced data analytics

5- incremental data loading for logging activity

De-duplication Query:

You attempted to remove duplicates from game_details using SQL's ROW_NUMBER(). However, both game_id and team_id are considered, but player_id has been omitted, which was part of the requirement. The comment suggests comparing data size before and after deduplication, which is helpful for verification. User Devices Activity Datelist DDL:

The table schema for user_devices_cumulated is close to what's needed, but there is a slight misalignment with the requirement. The device_activity_datelist should ideally look like a MAP<STRING, ARRAY[DATE]>, which can be better represented with browser types as keys and date arrays as values. Your current schema seems to focus on having DATE[]. User Devices Activity Datelist Implementation:

Your incremental query approach for populating user_devices_cumulated is organized and uses CTEs effectively. The logic of coalescing between yesterday and today to append to device_activity_datelist is a valid approach for aggregation, though the date handling in coalesce might need careful adjustment for proper incrementation as + Interval '1 day' assumes continued daily processing. The use of array_append maintains consistency of aggregated dates. User Devices Activity Int Datelist:

The transformation of device_activity_datelist into datelist_int is correctly approached with cross joining and summing the results of powers of two, which respects the essence of base-2 transformation. It is well-structured, though the final outcome might be overly specific in using bit(32), which potentially limits longer date aggregation if the intention is to cover a more extended timeline. Host Activity Datelist DDL:

The schema outline is straightforward and aligns with requirements. Defining a primary key on host and date ensures unique record criteria. General Feedback:

Consider refining the deduplication approach to encompass all required identifiers. Ensure your schema captures the requirements of a MAP data structure for devices. Readability is aided by consistent comments; keep this up. Check alignment with data types that cater to broader use cases if required. Your submission covered the five necessary queries effectively. While there are some areas for precision and improvement, your logical structuring and understanding of SQL are evident.

FINAL GRADE:

{ "letter_grade": "B", "passes": true } Good work, and with slight adjustments, you'll be even stronger in these exercises. Let me know if there's anything unclear or if you need further feedback on this assignment!

Spark fundementals and testing

fundamentals

Disabling Default Broadcast Joins

You have correctly disabled the automatic broadcast join setting with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1"). Well done. Explicitly Broadcast Join

You correctly broadcast the medals and maps tables using F.broadcast, which properly aligns with the task requirements. Bucket Joins and Reading Bucketed Tables

It seems like your match_details, matches, and medals_matches_players tables are read as bucketed. However, it's not entirely clear how these tables were bucketed, as the creation of the bucket is assumed ("Read bucketed tables (assumes they were written with bucketBy match_id)"). It's essential to ensure these tables are bucketed by match_id with 16 buckets during their creation or writing as this affects join performance. Aggregations (Queries 4a, 4b, 4c, 4d)

Query 4a: The calculation for "Which player has the highest average kills per game?" is correctly implemented using groupBy and agg operations to get the average kills.

Query 4b: The most played playlist is correctly obtained with a grouped and ordered DataFrame.

Query 4c: You performed the aggregation for the most played map correctly.

Query 4d: The query for the map with the most "Killing Spree" medals is correctly filtered and aggregated.

Optimization (Partitioning and Sorting)

You executed the partitioning strategies correctly, including repurposing with low and high cardinality fields and adding comments to clarify your intentions.

Applying sortWithinPartitions to ensure data is optimized for query performance was correctly implemented.

Improvements/Suggestions:

Provide more details on the preparation of your bucketed tables (e.g., code or steps) to ensure correctness in setting up the bucket-based join strategy.

You could also expand on how performance could be monitored to verify the impact of each partitioning and sorting strategy.

Overall, your submission covers all tasks appropriately with a good understanding of Spark operations, joins, aggregations, partitioning, and sorting strategies.

Here is your final grade based on the rubric:

{ "letter_grade": "A", "passes": true }

Testing

Here's the feedback for your submission:

Conversion of PostgreSQL Queries to SparkSQL

Device Activity Analysis:
- You successfully converted a PostgreSQL query into a SparkSQL query for analyzing device activity.
- The logic for handling user device events, daily activity aggregation, and the final activity summary appears correct.
- The usage of CTEs (WITH clauses) enhances readability.
Actor Quality Analysis:
- The PostgreSQL-to-SparkSQL conversion for actor quality analysis is well implemented.
- You've categorized actors into quality classes based on ratings and checked for active status correctly.

PySpark Job Implementation

Device Activity Analysis Job:
- The analyze_device_activity function is well-structured with appropriate use of DataFrame registration for SparkSQL execution.
- The generate_datelist_int function correctly integrates binary computation for date list conversion.
Actor Quality Analysis Job:
- The analyze_actor_quality function is effectively structured to perform aggregations and classifications based on ratings.
Overall Code Quality:
- Best practices for Spark session management have been mostly followed, with create_spark_session encapsulating session creation logic.
- The main functions are appropriately configured to allow testing and result showcasing.

Tests

Test Structure & Data Generation:
- You have used pytest fixtures to create a Spark session and generate sample data, which allows for clean and modular test design.
- Test cases cover multiple aspects of the logic such as data structure validation, filtering, aggregation, classification, and ordering.
Coverage & Assertions:
- Device activity tests verify the structure, filtering, aggregations, and ordered output appropriately.
- Actor quality tests confirm the integrity of classifications and the correctness of computed averages and flags.
Test Naming & Clarity:
- Test methods are well-named, indicative of the functionality being tested, leading to clear intentions.

Overall, the submission reflects a strong understanding of the task requirements and demonstrates adept application of PySpark for large scale data processing tasks.

Areas for Improvement

Edge Cases: Consider adding more edge case handling, such as scenarios where expected averages might be on boundary values or when activity lists are empty.
Code Documentation: Additional inline comments can improve understanding of complex logic, especially around binary conversion logic in the date list generation.

FINAL GRADE

{
  "letter_grade": "A",
  "passes": true
}

Your work demonstrates a high level of proficiency and adherence to the assignment requirements. Keep up the good work! If you have any questions or need further clarification, feel free to reach out.

4-apache-flink-training

** This feedback is auto-generated from an LLM **

1. Flink Job Review:

Correctness:

Sessionization Gap: You've correctly implemented a 5-minute session gap based on IP and host, which aligns with the homework's requirements.
Data Persistence: The job writes output to specified PostgreSQL tables effectively.

Code Quality:

Structure and Readability: The script is well-structured, and your use of in-line comments adds clarity. However, consider using more descriptive variable names to enhance readability, especially in sessionize_events.
Flink Best Practices: The use of Flink's Table API to handle session windows is executed correctly, making the code efficient and logical.

2. SQL Script Review:

Correctness:

Calculations: Your SQL queries for calculating the average number of events per session and comparing them across different hosts are correct.
Use of CTEs: You’ve efficiently utilized Common Table Expressions (CTEs) to separate the aggregation logic.

Code Quality:

Readability and Performance: The queries are clean and show a good understanding of SQL best practices.

3. Documented Answers to Questions:

Average Events per Session: The detailed breakdown and comparison between hosts are well-documented, providing clear insights.
Final Insights: Your conclusions and business insights into session behavior are well-analyzed and presented succinctly.

4. Testing Instructions and Execution:

Completeness:

Execution: You provided clear execution instructions for running the Flink job and querying PostgreSQL, which would allow verification of the job's success.

5. Documentation:

Explanation: Your submission includes comprehensive technical implementation details and business insights that are well-documented in HOMEWORK_ANSWERS.md.
Suggestions for Improvement:
- While the technical documentation is strong, consider including a brief section on potential challenges or limitations encountered in the project, which can help in understanding the solution's depth.

Final Feedback:

This submission demonstrates a strong grasp of both the technical implementation of sessionization using Flink and SQL-based analysis of web sessions with PostgreSQL. The code is well-organized, efficient, and aligns with best practices. Your documentation effectively communicates your approach and findings.

Great work on this assignment!

FINAL GRADE:

{
  "letter_grade": "A",
  "passes": true
}

5- Analytical Patterns

Query Review

Query 1: State Change Tracking

Accuracy: Your state change tracking query correctly uses window functions like LAG, LEAD, FIRST_VALUE, and LAST_VALUE to determine player state changes. The logic in your CASE statement accurately categorizes players into the required states.
Efficiency: Your query is thoughtfully structured and should perform well on properly indexed tables.

Query 2: GROUPING SETS Aggregations

Accuracy: The use of GROUPING SETS is correct and effectively aggregates data across multiple dimensions.
Clarity: The use of COALESCE and GROUPING functions is well-executed, making identifying aggregation levels clear.

Query 3 and 4: Player Scoring

Accuracy: Your query to identify the player who scored the most points for a single team correctly implements aggregation logic.
Adjustment: It seems like queries for task 3 and task 4 were merged. Ensure each task is separated, i.e., one to find the player for most points for a single team and another for most points in a single season.

Query 5: Team with Most Wins

The team performance ranking query effectively identifies teams by their winning percentage, which should help identify the top-performing team by total wins.

Query 6: 90-Game Win Streak

Accuracy: Your window function use here is excellent, employing a rolling window with ROWS BETWEEN 89 PRECEDING AND CURRENT ROW.
Performance Consideration: Ensure performance by validating data is indexed properly on game_date.

Query 7: Longest Scoring Streak

Accuracy: The query correctly identifies streaks by using cumulative sums to detect changes in scoring patterns. It addresses the task requirement of identifying LeBron James' scoring streaks over 10 points.

Documentation & Execution Guides

Your README and EXECUTION_GUIDE documentation is thorough, providing clear instructions and rationales for your approach.
You demonstrated a strong connection to Week 1's learnings in your ASSIGNMENT_SUMMARY. This continuity helps bridge your prior knowledge with advanced concepts.

General Observations

Correctness: Your queries generally meet the requirements and produce the expected results. The detailed documentation shows a good understanding of SQL best practices and thoughtful design.
Efficiency: You've considered performance in window function use and grouping sets, though ensure all appropriate indexes are in place, as you noted.
Clarity and Structure: The queries are well-structured and readable, which makes understanding and modifying them easier.
Edge Cases: While your queries handle standard scenarios well, ensure any assumptions, like player names, are examined against actual dataset entries to confirm robustness in all possible cases.

Final Grade

Based on your comprehensive submission:

{
  "letter_grade": "A",
  "passes": true
}

6- Data Pipeline Maintenance

Strengths:

Comprehensive Ownership Assignment:
- You've clearly outlined a primary and secondary ownership structure for each pipeline. This ensures accountability and coverage during any periods of absence. It's good to see assignments based on the engineer's experience and specialization, maximizing the team's skill set effectively.
Fair On-call Schedule:
- The on-call schedule is logically structured and considers individual rotation cycles while also accounting for holiday periods and reduced on-call loads during major holidays. Your attention to distributing weekend and evening shifts evenly helps maintain fairness across the team.
Detailed Runbooks:
- Each pipeline's runbook is detailed, covering crucial aspects like purpose, technical details, monitoring strategies, and troubleshooting steps. The inclusion of business impact and SLA considerations adds valuable context for operational decision-making.
Risk Assessment:
- You've identified a range of potential risks, both technical and non-technical. Additionally, you've proposed relevant risk mitigation strategies and categorized them systematically, which will help in managing risks proactively.
Markdown Formatting:
- The use of Markdown is well-done; your document is well-structured, easy to read, and logically organized with headings and bullet points for clarity.

Areas for Improvement:

Enhanced Escalation Matrix:
- While the escalation matrix is clearly outlined, it could be beneficial to include an example of a scenario for each stage of escalation, demonstrating how to proceed through the steps practically.
More Detailed Testing Practices:
- Consider expanding on your testing practices for both business logic and data quality. Adding more specific examples of automated tests or tools used could deepen the understanding of your approach to ensuring pipeline reliability.
Monitoring Details:
- Although the runbooks mention success metrics and alerts, providing more explicit examples of the types of alerts and monitoring dashboards could offer a richer depiction of the monitoring setup's effectiveness.
Future Improvement Roadmap:
- You suggested some future improvements like automation and advanced monitoring, which is excellent. Providing a more detailed action plan or timelines for these advancements could illustrate a stronger vision for long-term pipeline improvement.

Additional Notes:

Your focus on team dynamics and workload balance is commendable, especially in a complex operational setting. These considerations are vital for maintaining team morale and high productivity.
The inclusion of business logic errors in your risks and the thought into regulatory compliance highlights a mature understanding of the broader consequences of data engineering work.

Final Grade:

Based on your submission, you've fulfilled the assignment requirements commendably and displayed an excellent grasp of managing data pipelines. Your work shows attention to detail and strategic thought in pipeline operations, ownership, scheduling, and risk management.

FINAL GRADE:
{
  "letter_grade": "A",
  "passes": true
}

KPIs and Experimentation

User Journey

Clarity and Detail:
Your description of Spotify's user journey from discovery to daily habit formation is engaging and detailed. You effectively highlight how specific features and interactions have enhanced your experience, such as personalized playlists, social sharing, and cross-device integration. You also accurately noted pain points like content overload and interface complexity as Spotify evolved.
Relevance:
The transitions between different phases of your journey (discovery, adoption, deep integration, and current usage) provide clear insights into Spotify's growing role in your daily life. These phases help contextualize your proposed experiments effectively.

Experiment Design

Experiment 1: Enhanced Social Discovery & Collaborative Listening:
- Clarity and Structure: The experiment design is well-structured with clear test cell allocations. You propose meaningful improvements by leveraging social engagement features.
- Metrics: The leading and lagging metrics align well with the hypothesis, aiming to enhance user engagement and session duration.
Experiment 2: AI-Powered Content Curation & Mood-Based Discovery:
- Test Design: The test groups and feature combinations are logical, exploring the impact of AI and mood-based features on user satisfaction.
- Innovation: The proposed use of biometry and mood detection is innovative and forward-thinking.
Experiment 3: Gamification & Achievement System:
- Hypothesis and Metrics: The hypothesis clearly addresses user engagement through gamification. Metrics such as achievement unlock rates and daily engagement aptly measure the impact of gamification elements.

Technical and Statistical Analysis

Technical Implementation:
Your technical documentation reflects a strong understanding of data architectures and real-time processing needs. The use of modern tools and frameworks (e.g., Apache Kafka, TensorFlow) is well-justified for implementing the proposed features.
Statistical Rigor:
You have demonstrated proficiency in statistical analysis by conducting power analysis and explaining sample size calculations. This shows a solid foundation in experimental design principles.

Overall Comments

Your submission excels in highlighting key areas for improvement within Spotify, while maintaining a clear focus on user experience and business impact. Each experiment is thoughtfully designed with actionable hypotheses and suitable metrics, positioned to drive meaningful insights and improvements.

In summary, your work showcases a comprehensive understanding of product experimentation, user journey analysis, and technical implementation. To make your submission even stronger, you might consider implementing user feedback loops within your analyses to iterate on feature designs in real-time.

FINAL GRADE:

{
  "letter_grade": "A",
  "passes": true
}

Data Visualization

Data and Files: You've provided insightful datasets—match_details.csv, medals_matches_players.csv, assignment_8_tableau_guide.md, export_gaming_data.py, matches.csv, assignment_8_submission.md, and medals.csv—that are well-organized and descriptively named. You've completed a thorough setup for dashboard creation with sample data generation and export scripts, which shows good preparation in handling data necessary for Tableau visualizations.
Dashboards: The assignment entailed creating two dashboards:
- Executive Dashboard - High-level metrics for stakeholders
- Exploratory Dashboard - Interactive analysis with filters
Tableau Public URLs: The submission provides URLs for both dashboards:
- Executive Dashboard: Executive Dashboard - Gaming Analytics
- Exploratory Dashboard: Exploratory Dashboard - Gaming Analytics
Upon checking, the URLs appear to be valid and are pointing to Tableau Public. They follow the format https://public.tableau.com/views/....
Documentation and Guides: Your submission includes detailed documentation and guidelines on implementing the dashboards. The assignment_8_tableau_guide.md contains clear instructions on setting up various visual elements and charts, which is great for ensuring that the dashboards are comprehensive and meet the visual analytical requirements.
Submission Notes and Completeness:
- The submission document, assignment_8_submission.md, outlines the steps completed and the successes in creating and publishing the dashboards.
- It includes a checklist indicating all tasks have been executed.
- The descriptions provided for both dashboards are clear and let viewers know what insights to expect from each dashboard.

Feedback: Your assignment is thorough and well-executed. The tables are prepared with the necessary data to meet the outlined requirements, and the dashboards seem to effectively answer the needs of stakeholders seeking both high-level metrics and more detailed, interactive insights. Your comprehensive approach in preparing, documenting, and submitting this assignment indicates a strong understanding of data visualization principles and the use of Tableau as a tool.

Areas for Improvement:

For future projects, consider adding examples or insights from the dashboards to demonstrate value-add through the visualizations created. Screenshots or walk-throughs can enhance understanding and showcase the practicality of the dashboards.
While the guide provides implementation details, additional contextual insights about the data (e.g., meanings of certain metrics in a competitive gaming setup) could contextualize the dashboard views further.

Overall, your submission meets the expectations as per the assignment guidelines. Here is your final grade in JSON format:

{
  "letter_grade": "A",
  "passes": true
}

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
assignment 3 testing		assignment 3 testing
assignment 4		assignment 4
assignment 6		assignment 6
assignment 7		assignment 7
assignment 8		assignment 8
.env.txt		.env.txt
.gitignore		.gitignore
Book1.twb		Book1.twb
Data-Engineering-track-with-Zachary.Rproj		Data-Engineering-track-with-Zachary.Rproj
Dimensional Data Modeling.sql		Dimensional Data Modeling.sql
README.md		README.md
assignment Analytical patterns.sql		assignment Analytical patterns.sql
assignment3.py		assignment3.py
docker-compose.txt		docker-compose.txt
fact data modeling.sql		fact data modeling.sql
graph design memo.sql		graph design memo.sql
memo fact modeling.sql		memo fact modeling.sql
memo for 2nd assigment.sql		memo for 2nd assigment.sql

Folders and files

Latest commit

History

Repository files navigation

Data-Engineering-track-with-Zachary

📚 Learning Goals & Assignment Summary

🎯 Assignment 1: Dimensional Data Modeling

Learning Objectives:

Key Skills Developed:

🎯 Assignment 2: Fact Data Modeling

Learning Objectives:

Key Skills Developed:

🎯 Assignment 3: Spark Fundamentals

Learning Objectives:

Key Skills Developed:

🎯 Assignment 4: PySpark Testing

Learning Objectives:

Key Skills Developed:

🎯 Assignment 5: Data Pipeline Maintenance

Learning Objectives:

Key Skills Developed:

🎯 Assignment 6: Data Pipeline Maintenance (Continued)

Learning Objectives:

Key Skills Developed:

🎯 Assignment 7: Product Experimentation & User Journey Analysis

Learning Objectives:

Key Skills Developed:

🎯 Assignment 8: Tableau Dashboard Creation

Learning Objectives:

Key Skills Developed:

🏆 Overall Learning Journey Summary

Technical Skills Mastered:

Professional Skills Developed:

Industry-Ready Competencies:

Dimensional data modeling

Fact Data Modeling

Spark fundementals and testing

fundamentals

Testing

Conversion of PostgreSQL Queries to SparkSQL

PySpark Job Implementation

Tests

Areas for Improvement

FINAL GRADE

4-apache-flink-training

1. Flink Job Review:

Correctness:

Code Quality:

2. SQL Script Review:

Correctness:

Code Quality:

3. Documented Answers to Questions:

4. Testing Instructions and Execution:

Completeness:

5. Documentation:

Final Feedback:

FINAL GRADE:

5- Analytical Patterns

Query Review

Query 1: State Change Tracking

Query 2: GROUPING SETS Aggregations

Query 3 and 4: Player Scoring

Query 5: Team with Most Wins

Query 6: 90-Game Win Streak

Query 7: Longest Scoring Streak

Documentation & Execution Guides

General Observations

Final Grade

6- Data Pipeline Maintenance

Strengths:

Areas for Improvement:

Additional Notes:

Final Grade:

KPIs and Experimentation

User Journey

Experiment Design

Technical and Statistical Analysis

Overall Comments

FINAL GRADE:

Data Visualization

Packages