Skip to content

alifradi/Data-Engineering-track-with-Zachary

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data-Engineering-track-with-Zachary

📚 Learning Goals & Assignment Summary

This repository contains assignments from the Data Engineering Track, each designed to build specific skills and competencies in data engineering. Below is a comprehensive summary of learning goals achieved through each assignment.


🎯 Assignment 1: Dimensional Data Modeling

Grade: A | Status: Complete

Learning Objectives:

  • Advanced SQL Techniques: Master CTEs (Common Table Expressions) for complex queries
  • Data Structure Design: Use structs and arrays to compact and organize data efficiently
  • Custom Data Types: Create and implement enumerated types for business logic
  • SCD Implementation: Design and implement Slowly Changing Dimensions (Type 2)
  • Historical Data Management: Track changes over time with proper temporal modeling
  • Window Functions: Apply LAG, LEAD, and other window functions for analytics
  • Incremental Processing: Implement both full and incremental data loading strategies

Key Skills Developed:

  • Complex SQL query design and optimization
  • Dimensional modeling principles
  • Data warehouse design patterns
  • Historical data tracking and management

🎯 Assignment 2: Fact Data Modeling

Grade: B | Status: Complete

Learning Objectives:

  • Fact vs Dimension Tables: Understand when to use each table type
  • Data Deduplication: Implement effective duplicate removal strategies
  • Binary Data Transformation: Convert date arrays to binary representations for analytics
  • Incremental Loading: Design efficient incremental data processing pipelines
  • Array Operations: Master PostgreSQL array functions and operations
  • Performance Optimization: Implement efficient data loading and querying strategies

Key Skills Developed:

  • Fact table design and implementation
  • Data quality and deduplication techniques
  • Binary data representation for analytics
  • Incremental data processing design

🎯 Assignment 3: Spark Fundamentals

Grade: A | Status: Complete

Learning Objectives:

  • PySpark Operations: Master DataFrame operations and transformations
  • Performance Optimization: Implement broadcast joins and bucketing strategies
  • Data Partitioning: Design efficient partitioning and sorting strategies
  • Join Optimization: Understand and implement different join strategies
  • Aggregation Techniques: Master complex aggregations and window functions
  • Performance Monitoring: Analyze and optimize query performance

Key Skills Developed:

  • Apache Spark/PySpark programming
  • Big data processing and optimization
  • Performance tuning and monitoring
  • Distributed computing concepts

🎯 Assignment 4: PySpark Testing

Grade: A | Status: Complete

Learning Objectives:

  • PostgreSQL to SparkSQL Conversion: Convert complex SQL queries to Spark
  • Testing Frameworks: Implement comprehensive testing for data pipelines
  • Data Validation: Create robust data quality checks and validations
  • Test Data Generation: Design realistic test datasets for validation
  • Pipeline Testing: Implement end-to-end testing for data workflows
  • Error Handling: Design graceful error handling and recovery mechanisms

Key Skills Developed:

  • SparkSQL query optimization
  • Data pipeline testing methodologies
  • Test-driven development for data engineering
  • Quality assurance and validation techniques

🎯 Assignment 5: Data Pipeline Maintenance

Grade: Pending | Status: In Progress

Learning Objectives:

  • Pipeline Ownership: Design ownership and responsibility matrices
  • On-Call Scheduling: Create fair and sustainable on-call schedules
  • Runbook Creation: Develop comprehensive operational documentation
  • Risk Assessment: Identify and document pipeline risks and mitigation strategies
  • SLA Definition: Design Service Level Agreements and monitoring strategies
  • Incident Response: Plan and document incident response procedures

Key Skills Developed:

  • Data pipeline operations management
  • Team coordination and scheduling
  • Technical documentation and runbooks
  • Risk management and incident response

🎯 Assignment 6: Data Pipeline Maintenance (Continued)

Grade: Pending | Status: In Progress

Learning Objectives:

  • Monitoring & Alerting: Design comprehensive monitoring strategies
  • Performance Metrics: Define and track key performance indicators
  • Capacity Planning: Plan for growth and scalability
  • Disaster Recovery: Design backup and recovery procedures
  • Compliance & Security: Implement data governance and security measures

Key Skills Developed:

  • Production data pipeline management
  • Monitoring and observability
  • Capacity planning and scaling
  • Data governance and security

🎯 Assignment 7: Product Experimentation & User Journey Analysis

Grade: Pending | Status: In Progress

Learning Objectives:

  • User Journey Mapping: Analyze and document user experiences
  • A/B Testing Design: Design comprehensive experimentation frameworks
  • Statistical Analysis: Apply statistical methods for experiment evaluation
  • Metrics Definition: Define leading and lagging indicators
  • Hypothesis Formation: Create testable hypotheses for product improvement
  • Data-Driven Decision Making: Use analytics to inform product decisions

Key Skills Developed:

  • Product analytics and user research
  • Experimental design and A/B testing
  • Statistical analysis for business decisions
  • Product strategy and optimization

🎯 Assignment 8: Tableau Dashboard Creation

Grade: Pending | Status: Complete

Learning Objectives:

  • Data Visualization: Create professional dashboards for stakeholders
  • Executive Dashboards: Design high-level KPI dashboards for business users
  • Exploratory Dashboards: Build interactive analysis tools with filters
  • Data Storytelling: Communicate insights through effective visualizations
  • Dashboard Design: Apply UX/UI principles to dashboard creation
  • Data Publishing: Publish and share dashboards on Tableau Public

Key Skills Developed:

  • Tableau Public dashboard creation
  • Data visualization best practices
  • Business intelligence and reporting
  • Interactive dashboard design

🏆 Overall Learning Journey Summary

Technical Skills Mastered:

  • SQL & Databases: Advanced SQL, dimensional modeling, SCD patterns
  • Big Data Processing: Apache Spark, PySpark, performance optimization
  • Data Quality: Testing, validation, deduplication, error handling
  • Data Visualization: Tableau, dashboard design, data storytelling
  • Pipeline Management: Operations, monitoring, risk assessment

Professional Skills Developed:

  • Problem Solving: Complex data engineering challenges
  • Documentation: Technical writing, runbooks, specifications
  • Testing: Test-driven development, quality assurance
  • Communication: Dashboard design, data storytelling
  • Project Management: Assignment planning and execution

Industry-Ready Competencies:

  • Data Engineering: End-to-end data pipeline development
  • Data Warehousing: Design and implementation of data warehouses
  • Big Data Processing: Scalable data processing with Spark
  • Business Intelligence: Dashboard creation and data visualization
  • Operations: Production pipeline management and maintenance

This learning journey demonstrates comprehensive data engineering skills development, from foundational SQL to advanced big data processing and visualization. Each assignment builds upon previous knowledge, creating a strong foundation for a data engineering career. 🚀📊


Dimensional data modeling

1- Use of CTEs (Common Table Expressions) for temporaray table quering

2- use of structs to compact data and create customized types

3- use of arrays to compact data

4- unnest of data for slice and dice operations

5- track changements and modeling for SCD (Slowly changing Dimensions)

6- incremental and one full methods for back filling tables

7- window functions for advanced data processing

8- creating graphs and advanced analytics (DDLs for vertices, edges and their use for analytics): check the memo

Review of the assignment:

Feedback on Dimensional Data Modeling Submission:

Data Definition Language (DDL):

The use of custom types like films and quality_class is a good approach to encapsulate information and manage enumerations, which makes your schema more expressive. However, it is worth noting that custom types might add complexity when altering schema data types or migrating databases in some systems. The primary key definition for the actors table is well-thought-out. Combining actorid and year ensures unique entries each year per actor, which is crucial for accumulating historical data. Cumulative Table Generation:

The use of common table expressions (CTEs) — years, actor_startings, actors_seasons, and movies_aggregated — is efficient for breaking down the query logic into manageable steps. This makes your SQL readable and modular, aiding in troubleshooting. The logic used to categorize quality_class is straightforward and follows a simplistic tier system based on avg_rating. This is good, assuming ratings fall within expected ranges. Actors History SCD Table:

Your DDL for the actors_history_scd table appropriately sets up the structure for type 2 slowly changing dimensions (SCD). Tracking is_active and quality_class aligns with good SCD principles. Throughout the backfill query, the use of window functions (LAG) to identify changes in quality or activity status is apt and demonstrates an understanding of SQL analytics. Backfill Query:

The query captures historical data changes well with streak identifiers and transition tracking, covering both the data integrity and iterative growth of states correctly. The decision to cap the current_season at 2020 in the outcome is a good placeholder for further incremental steps. Incremental Query for SCD:

The query effectively combines historical records with new data by using structures like unchanged_records, changed_records, and new_records, which is essential for maintaining accurate SCD. The use of set operations (UNION ALL) ensures completeness, gathering from all necessary data streams into one output, indicating apt use of SQL in handling multi-source data aggregation. General Constructive Feedback:

Consider including comments more thoroughly explaining each SQL section or step for enhanced clarity, especially beneficial for future maintainers or collaborators. While the SQL logic is sound, incorporating indexed strategies could be considered in your design, depending on scale, to boost performance, especially for tables that grow significantly over time. Final Evaluation:

Your submission demonstrates a comprehensive understanding of dimensional data modeling, especially in managing historical data changes with SCD. The provided queries are functional, with logical structuring and clear attempts at applying advanced SQL features (like window functions and CTEs) effectively.

FINAL GRADE:

{ "letter_grade": "A", "passes": true }

Fact Data Modeling

1- difference between Fact and dimensions tables

2- when a changing value is considered to belong to dimension or to fact table

3- transforming logging activities to fact table

4- compacting logging history into binary rows for later easy advanced data analytics

5- incremental data loading for logging activity

De-duplication Query:

You attempted to remove duplicates from game_details using SQL's ROW_NUMBER(). However, both game_id and team_id are considered, but player_id has been omitted, which was part of the requirement. The comment suggests comparing data size before and after deduplication, which is helpful for verification. User Devices Activity Datelist DDL:

The table schema for user_devices_cumulated is close to what's needed, but there is a slight misalignment with the requirement. The device_activity_datelist should ideally look like a MAP<STRING, ARRAY[DATE]>, which can be better represented with browser types as keys and date arrays as values. Your current schema seems to focus on having DATE[]. User Devices Activity Datelist Implementation:

Your incremental query approach for populating user_devices_cumulated is organized and uses CTEs effectively. The logic of coalescing between yesterday and today to append to device_activity_datelist is a valid approach for aggregation, though the date handling in coalesce might need careful adjustment for proper incrementation as + Interval '1 day' assumes continued daily processing. The use of array_append maintains consistency of aggregated dates. User Devices Activity Int Datelist:

The transformation of device_activity_datelist into datelist_int is correctly approached with cross joining and summing the results of powers of two, which respects the essence of base-2 transformation. It is well-structured, though the final outcome might be overly specific in using bit(32), which potentially limits longer date aggregation if the intention is to cover a more extended timeline. Host Activity Datelist DDL:

The schema outline is straightforward and aligns with requirements. Defining a primary key on host and date ensures unique record criteria. General Feedback:

Consider refining the deduplication approach to encompass all required identifiers. Ensure your schema captures the requirements of a MAP data structure for devices. Readability is aided by consistent comments; keep this up. Check alignment with data types that cater to broader use cases if required. Your submission covered the five necessary queries effectively. While there are some areas for precision and improvement, your logical structuring and understanding of SQL are evident.

FINAL GRADE:

{ "letter_grade": "B", "passes": true } Good work, and with slight adjustments, you'll be even stronger in these exercises. Let me know if there's anything unclear or if you need further feedback on this assignment!

Spark fundementals and testing

fundamentals

Disabling Default Broadcast Joins

You have correctly disabled the automatic broadcast join setting with spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1"). Well done. Explicitly Broadcast Join

You correctly broadcast the medals and maps tables using F.broadcast, which properly aligns with the task requirements. Bucket Joins and Reading Bucketed Tables

It seems like your match_details, matches, and medals_matches_players tables are read as bucketed. However, it's not entirely clear how these tables were bucketed, as the creation of the bucket is assumed ("Read bucketed tables (assumes they were written with bucketBy match_id)"). It's essential to ensure these tables are bucketed by match_id with 16 buckets during their creation or writing as this affects join performance. Aggregations (Queries 4a, 4b, 4c, 4d)

Query 4a: The calculation for "Which player has the highest average kills per game?" is correctly implemented using groupBy and agg operations to get the average kills.

Query 4b: The most played playlist is correctly obtained with a grouped and ordered DataFrame.

Query 4c: You performed the aggregation for the most played map correctly.

Query 4d: The query for the map with the most "Killing Spree" medals is correctly filtered and aggregated.

Optimization (Partitioning and Sorting)

You executed the partitioning strategies correctly, including repurposing with low and high cardinality fields and adding comments to clarify your intentions.

Applying sortWithinPartitions to ensure data is optimized for query performance was correctly implemented.

Improvements/Suggestions:

Provide more details on the preparation of your bucketed tables (e.g., code or steps) to ensure correctness in setting up the bucket-based join strategy.

You could also expand on how performance could be monitored to verify the impact of each partitioning and sorting strategy.

Overall, your submission covers all tasks appropriately with a good understanding of Spark operations, joins, aggregations, partitioning, and sorting strategies.

Here is your final grade based on the rubric:

{ "letter_grade": "A", "passes": true }

Testing

Here's the feedback for your submission:

Conversion of PostgreSQL Queries to SparkSQL

  1. Device Activity Analysis:

    • You successfully converted a PostgreSQL query into a SparkSQL query for analyzing device activity.
    • The logic for handling user device events, daily activity aggregation, and the final activity summary appears correct.
    • The usage of CTEs (WITH clauses) enhances readability.
  2. Actor Quality Analysis:

    • The PostgreSQL-to-SparkSQL conversion for actor quality analysis is well implemented.
    • You've categorized actors into quality classes based on ratings and checked for active status correctly.

PySpark Job Implementation

  1. Device Activity Analysis Job:

    • The analyze_device_activity function is well-structured with appropriate use of DataFrame registration for SparkSQL execution.
    • The generate_datelist_int function correctly integrates binary computation for date list conversion.
  2. Actor Quality Analysis Job:

    • The analyze_actor_quality function is effectively structured to perform aggregations and classifications based on ratings.
  3. Overall Code Quality:

    • Best practices for Spark session management have been mostly followed, with create_spark_session encapsulating session creation logic.
    • The main functions are appropriately configured to allow testing and result showcasing.

Tests

  1. Test Structure & Data Generation:

    • You have used pytest fixtures to create a Spark session and generate sample data, which allows for clean and modular test design.
    • Test cases cover multiple aspects of the logic such as data structure validation, filtering, aggregation, classification, and ordering.
  2. Coverage & Assertions:

    • Device activity tests verify the structure, filtering, aggregations, and ordered output appropriately.
    • Actor quality tests confirm the integrity of classifications and the correctness of computed averages and flags.
  3. Test Naming & Clarity:

    • Test methods are well-named, indicative of the functionality being tested, leading to clear intentions.

Overall, the submission reflects a strong understanding of the task requirements and demonstrates adept application of PySpark for large scale data processing tasks.

Areas for Improvement

  • Edge Cases: Consider adding more edge case handling, such as scenarios where expected averages might be on boundary values or when activity lists are empty.
  • Code Documentation: Additional inline comments can improve understanding of complex logic, especially around binary conversion logic in the date list generation.

FINAL GRADE

{
  "letter_grade": "A",
  "passes": true
}

Your work demonstrates a high level of proficiency and adherence to the assignment requirements. Keep up the good work! If you have any questions or need further clarification, feel free to reach out.

4-apache-flink-training

** This feedback is auto-generated from an LLM **

1. Flink Job Review:

Correctness:

  • Sessionization Gap: You've correctly implemented a 5-minute session gap based on IP and host, which aligns with the homework's requirements.
  • Data Persistence: The job writes output to specified PostgreSQL tables effectively.

Code Quality:

  • Structure and Readability: The script is well-structured, and your use of in-line comments adds clarity. However, consider using more descriptive variable names to enhance readability, especially in sessionize_events.
  • Flink Best Practices: The use of Flink's Table API to handle session windows is executed correctly, making the code efficient and logical.

2. SQL Script Review:

Correctness:

  • Calculations: Your SQL queries for calculating the average number of events per session and comparing them across different hosts are correct.
  • Use of CTEs: You’ve efficiently utilized Common Table Expressions (CTEs) to separate the aggregation logic.

Code Quality:

  • Readability and Performance: The queries are clean and show a good understanding of SQL best practices.

3. Documented Answers to Questions:

  • Average Events per Session: The detailed breakdown and comparison between hosts are well-documented, providing clear insights.
  • Final Insights: Your conclusions and business insights into session behavior are well-analyzed and presented succinctly.

4. Testing Instructions and Execution:

Completeness:

  • Execution: You provided clear execution instructions for running the Flink job and querying PostgreSQL, which would allow verification of the job's success.

5. Documentation:

  • Explanation: Your submission includes comprehensive technical implementation details and business insights that are well-documented in HOMEWORK_ANSWERS.md.
  • Suggestions for Improvement:
    • While the technical documentation is strong, consider including a brief section on potential challenges or limitations encountered in the project, which can help in understanding the solution's depth.

Final Feedback:

This submission demonstrates a strong grasp of both the technical implementation of sessionization using Flink and SQL-based analysis of web sessions with PostgreSQL. The code is well-organized, efficient, and aligns with best practices. Your documentation effectively communicates your approach and findings.

Great work on this assignment!


FINAL GRADE:

{
  "letter_grade": "A",
  "passes": true
}

5- Analytical Patterns

Query Review

Query 1: State Change Tracking

  • Accuracy: Your state change tracking query correctly uses window functions like LAG, LEAD, FIRST_VALUE, and LAST_VALUE to determine player state changes. The logic in your CASE statement accurately categorizes players into the required states.
  • Efficiency: Your query is thoughtfully structured and should perform well on properly indexed tables.

Query 2: GROUPING SETS Aggregations

  • Accuracy: The use of GROUPING SETS is correct and effectively aggregates data across multiple dimensions.
  • Clarity: The use of COALESCE and GROUPING functions is well-executed, making identifying aggregation levels clear.

Query 3 and 4: Player Scoring

  • Accuracy: Your query to identify the player who scored the most points for a single team correctly implements aggregation logic.
  • Adjustment: It seems like queries for task 3 and task 4 were merged. Ensure each task is separated, i.e., one to find the player for most points for a single team and another for most points in a single season.

Query 5: Team with Most Wins

  • The team performance ranking query effectively identifies teams by their winning percentage, which should help identify the top-performing team by total wins.

Query 6: 90-Game Win Streak

  • Accuracy: Your window function use here is excellent, employing a rolling window with ROWS BETWEEN 89 PRECEDING AND CURRENT ROW.
  • Performance Consideration: Ensure performance by validating data is indexed properly on game_date.

Query 7: Longest Scoring Streak

  • Accuracy: The query correctly identifies streaks by using cumulative sums to detect changes in scoring patterns. It addresses the task requirement of identifying LeBron James' scoring streaks over 10 points.

Documentation & Execution Guides

  • Your README and EXECUTION_GUIDE documentation is thorough, providing clear instructions and rationales for your approach.
  • You demonstrated a strong connection to Week 1's learnings in your ASSIGNMENT_SUMMARY. This continuity helps bridge your prior knowledge with advanced concepts.

General Observations

  • Correctness: Your queries generally meet the requirements and produce the expected results. The detailed documentation shows a good understanding of SQL best practices and thoughtful design.

  • Efficiency: You've considered performance in window function use and grouping sets, though ensure all appropriate indexes are in place, as you noted.

  • Clarity and Structure: The queries are well-structured and readable, which makes understanding and modifying them easier.

  • Edge Cases: While your queries handle standard scenarios well, ensure any assumptions, like player names, are examined against actual dataset entries to confirm robustness in all possible cases.

Final Grade

Based on your comprehensive submission:

{
  "letter_grade": "A",
  "passes": true
}

6- Data Pipeline Maintenance

Strengths:

  1. Comprehensive Ownership Assignment:

    • You've clearly outlined a primary and secondary ownership structure for each pipeline. This ensures accountability and coverage during any periods of absence. It's good to see assignments based on the engineer's experience and specialization, maximizing the team's skill set effectively.
  2. Fair On-call Schedule:

    • The on-call schedule is logically structured and considers individual rotation cycles while also accounting for holiday periods and reduced on-call loads during major holidays. Your attention to distributing weekend and evening shifts evenly helps maintain fairness across the team.
  3. Detailed Runbooks:

    • Each pipeline's runbook is detailed, covering crucial aspects like purpose, technical details, monitoring strategies, and troubleshooting steps. The inclusion of business impact and SLA considerations adds valuable context for operational decision-making.
  4. Risk Assessment:

    • You've identified a range of potential risks, both technical and non-technical. Additionally, you've proposed relevant risk mitigation strategies and categorized them systematically, which will help in managing risks proactively.
  5. Markdown Formatting:

    • The use of Markdown is well-done; your document is well-structured, easy to read, and logically organized with headings and bullet points for clarity.

Areas for Improvement:

  1. Enhanced Escalation Matrix:

    • While the escalation matrix is clearly outlined, it could be beneficial to include an example of a scenario for each stage of escalation, demonstrating how to proceed through the steps practically.
  2. More Detailed Testing Practices:

    • Consider expanding on your testing practices for both business logic and data quality. Adding more specific examples of automated tests or tools used could deepen the understanding of your approach to ensuring pipeline reliability.
  3. Monitoring Details:

    • Although the runbooks mention success metrics and alerts, providing more explicit examples of the types of alerts and monitoring dashboards could offer a richer depiction of the monitoring setup's effectiveness.
  4. Future Improvement Roadmap:

    • You suggested some future improvements like automation and advanced monitoring, which is excellent. Providing a more detailed action plan or timelines for these advancements could illustrate a stronger vision for long-term pipeline improvement.

Additional Notes:

  • Your focus on team dynamics and workload balance is commendable, especially in a complex operational setting. These considerations are vital for maintaining team morale and high productivity.
  • The inclusion of business logic errors in your risks and the thought into regulatory compliance highlights a mature understanding of the broader consequences of data engineering work.

Final Grade:

Based on your submission, you've fulfilled the assignment requirements commendably and displayed an excellent grasp of managing data pipelines. Your work shows attention to detail and strategic thought in pipeline operations, ownership, scheduling, and risk management.

FINAL GRADE:
{
  "letter_grade": "A",
  "passes": true
}

KPIs and Experimentation

User Journey

  1. Clarity and Detail:
    Your description of Spotify's user journey from discovery to daily habit formation is engaging and detailed. You effectively highlight how specific features and interactions have enhanced your experience, such as personalized playlists, social sharing, and cross-device integration. You also accurately noted pain points like content overload and interface complexity as Spotify evolved.

  2. Relevance:
    The transitions between different phases of your journey (discovery, adoption, deep integration, and current usage) provide clear insights into Spotify's growing role in your daily life. These phases help contextualize your proposed experiments effectively.

Experiment Design

  1. Experiment 1: Enhanced Social Discovery & Collaborative Listening:

    • Clarity and Structure: The experiment design is well-structured with clear test cell allocations. You propose meaningful improvements by leveraging social engagement features.
    • Metrics: The leading and lagging metrics align well with the hypothesis, aiming to enhance user engagement and session duration.
  2. Experiment 2: AI-Powered Content Curation & Mood-Based Discovery:

    • Test Design: The test groups and feature combinations are logical, exploring the impact of AI and mood-based features on user satisfaction.
    • Innovation: The proposed use of biometry and mood detection is innovative and forward-thinking.
  3. Experiment 3: Gamification & Achievement System:

    • Hypothesis and Metrics: The hypothesis clearly addresses user engagement through gamification. Metrics such as achievement unlock rates and daily engagement aptly measure the impact of gamification elements.

Technical and Statistical Analysis

  1. Technical Implementation:
    Your technical documentation reflects a strong understanding of data architectures and real-time processing needs. The use of modern tools and frameworks (e.g., Apache Kafka, TensorFlow) is well-justified for implementing the proposed features.

  2. Statistical Rigor:
    You have demonstrated proficiency in statistical analysis by conducting power analysis and explaining sample size calculations. This shows a solid foundation in experimental design principles.

Overall Comments

Your submission excels in highlighting key areas for improvement within Spotify, while maintaining a clear focus on user experience and business impact. Each experiment is thoughtfully designed with actionable hypotheses and suitable metrics, positioned to drive meaningful insights and improvements.

In summary, your work showcases a comprehensive understanding of product experimentation, user journey analysis, and technical implementation. To make your submission even stronger, you might consider implementing user feedback loops within your analyses to iterate on feature designs in real-time.

FINAL GRADE:

{
  "letter_grade": "A",
  "passes": true
}

Data Visualization

  1. Data and Files: You've provided insightful datasets—match_details.csv, medals_matches_players.csv, assignment_8_tableau_guide.md, export_gaming_data.py, matches.csv, assignment_8_submission.md, and medals.csv—that are well-organized and descriptively named. You've completed a thorough setup for dashboard creation with sample data generation and export scripts, which shows good preparation in handling data necessary for Tableau visualizations.

  2. Dashboards: The assignment entailed creating two dashboards:

    • Executive Dashboard - High-level metrics for stakeholders
    • Exploratory Dashboard - Interactive analysis with filters
  3. Tableau Public URLs: The submission provides URLs for both dashboards:

    Upon checking, the URLs appear to be valid and are pointing to Tableau Public. They follow the format https://public.tableau.com/views/....

  4. Documentation and Guides: Your submission includes detailed documentation and guidelines on implementing the dashboards. The assignment_8_tableau_guide.md contains clear instructions on setting up various visual elements and charts, which is great for ensuring that the dashboards are comprehensive and meet the visual analytical requirements.

  5. Submission Notes and Completeness:

    • The submission document, assignment_8_submission.md, outlines the steps completed and the successes in creating and publishing the dashboards.
    • It includes a checklist indicating all tasks have been executed.
    • The descriptions provided for both dashboards are clear and let viewers know what insights to expect from each dashboard.

Feedback: Your assignment is thorough and well-executed. The tables are prepared with the necessary data to meet the outlined requirements, and the dashboards seem to effectively answer the needs of stakeholders seeking both high-level metrics and more detailed, interactive insights. Your comprehensive approach in preparing, documenting, and submitting this assignment indicates a strong understanding of data visualization principles and the use of Tableau as a tool.

Areas for Improvement:

  • For future projects, consider adding examples or insights from the dashboards to demonstrate value-add through the visualizations created. Screenshots or walk-throughs can enhance understanding and showcase the practicality of the dashboards.
  • While the guide provides implementation details, additional contextual insights about the data (e.g., meanings of certain metrics in a competitive gaming setup) could contextualize the dashboard views further.

Overall, your submission meets the expectations as per the assignment guidelines. Here is your final grade in JSON format:

{
  "letter_grade": "A",
  "passes": true
}

About

Assignments for advanced SQL, flink, KAFKA and Spark for data engineers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages