Premier League Database Engineering & Analytics Project

Project Overview

This project focused on designing and building a normalized relational SQL Server database using historical Premier League football data spanning from the 1993–1994 season through 2024–2025.

The project involved importing raw CSV datasets into SQL Server staging tables, designing a normalized schema using dimension and fact tables, migrating data through SQL transformations, validating referential integrity, and performing analytical queries across multiple related tables.

The final solution demonstrates practical data engineering, relational modelling, ETL workflows, and sports analytics using SQL Server.

Project Objectives

Transform denormalized CSV data into a normalized SQL database
Design a relational schema using dimension and fact tables
Build and populate SQL Server tables using ETL-style workflows
Enforce relationships using primary and foreign keys
Validate migrated data and relationships
Perform advanced analytical SQL queries across multiple tables
Generate insights from Premier League historical data

Tools & Technologies

Tool	Purpose
SQL Server Management Studio (SSMS)	Database development and querying
SQL Server	Relational database engine
Python / Jupyter Notebook	CSV profiling and preprocessing
CSV Files	Raw data source
Power BI / Excel	Data visualisation
T-SQL	Database creation, migration, and analysis

Datasets Used

1. Historical Premier League Dataset

Main dataset containing:

Match results
Team statistics
Referee information
Betting odds
Match outcomes
Seasonal data

Dataset Scale

12,160 matches
43 columns
Seasons:
- 1993–1994 → 2024–2025

2. 2024–2025 Fixtures Dataset

Additional fixture dataset containing:

Match schedule
Stadium/location
Teams
Fixture results

Dataset Scale

380 fixtures

Database Design Strategy

The original CSV data was highly denormalized, with repeated:

team names
referee names
season values
match statistics

To improve:

storage efficiency
query performance
relational integrity
analytical flexibility

the database was normalized into dimension and fact tables.

Database Schema

Dimension Tables

`dim_season`

Stores unique Premier League seasons.

Column
season_id (PK)
season_name

`dim_team`

Stores unique football clubs.

Column
team_id (PK)
team_name

`dim_referee`

Stores unique referees.

Column
referee_id (PK)
referee_name

Fact Tables

`fact_match`

Stores match-level metadata.

Column
match_id (PK)
season_id (FK)
match_week
match_date
match_time
home_team_id (FK)
away_team_id (FK)
referee_id (FK)

`fact_match_result`

Stores goals and outcomes.

Column
match_id (FK)
full_time_home_goals
full_time_away_goals
full_time_result
half_time_home_goals
half_time_away_goals
half_time_result
home_team_points
away_team_points

`fact_match_stats`

Stores performance statistics.

Column
match_id (FK)
shots
corners
fouls
yellow_cards
red_cards

`fact_betting_odds`

Stores betting market data.

Column
match_id (FK)
Bet365 odds
Market averages
Over/under odds

`fact_fixture`

Stores 2024–2025 fixture information.

Column
fixture_id (PK)
season_id (FK)
match_number
round_number
fixture_date
location
home_team_id (FK)
away_team_id (FK)
result_text

ETL & Data Migration Workflow

Step 1 — CSV Profiling (Python)

Python/Jupyter was used to:

inspect column structures
identify missing values
validate data types
compare team naming conventions
identify import risks

Example:

import pandas as pd

df = pd.read_csv("PremierLeague.csv")
df.info()
df.isna().sum()

## Step 2 — Staging Tables

Raw CSV files were imported into SQL Server staging tables:

```text
staging_premier_league
staging_fixtures_2024_25

The staging layer preserved raw data prior to normalization.

Step 3 — Data Cleaning

Key cleaning activities:

Fixed team naming inconsistencies:
- Man Utd → Man United
- Spurs → Tottenham
Corrected CSV auto-formatting issues
Prevented Excel from converting football scores into date/time values
Managed nullable historical statistics
Standardised column data types

Step 4 — Dimension Population

Dimension tables were populated using:

INSERT INTO ... SELECT DISTINCT

This eliminated duplicated:

seasons
teams
referees

Step 5 — Fact Table Population

Fact tables were populated using:

joins to dimension tables
foreign key relationships
normalized references

Validation & Integrity Checks

Validation	Result
Duplicate Match IDs	0
Missing Core Match Fields	0
Missing Fixture Fields	0
Broken Foreign Key Relationships	0
Duplicate Fixture Match Numbers	0

Analytical Queries & Insights

Top Scoring Teams

Key Findings

Manchester United recorded the highest historical goal total
Arsenal and Liverpool closely followed
Manchester City demonstrated exceptional scoring efficiency despite fewer historical matches

Insight

Traditional “Big Six” clubs dominate historical offensive output across Premier League history.

Home Advantage Analysis

Outcome	Matches
Home Wins	5,565
Draws	3,093
Away Wins	3,502

Insight

Home teams historically win significantly more matches than away teams, confirming a strong Premier League home advantage.

Highest Scoring Seasons

Key Findings

Recent seasons dominate the highest-scoring rankings:

2023–2024
2024–2025
2022–2023

Insight

Modern Premier League football appears increasingly attack-oriented, with higher total goal output in recent seasons.

Referee Analysis

Key Findings

Most officiated referees:

Mike Dean
Martin Atkinson
Anthony Taylor
Michael Oliver

Data Engineering Insight

Historical referee data revealed inconsistent naming conventions:

M Dean
Mike Dean
M. L. Dean

This highlighted the importance of master-data standardisation in ETL workflows.

Shot Conversion Analysis

Key Findings

Elite clubs maintained strong conversion rates across very high shot volumes.

Insight

Smaller clubs occasionally showed inflated efficiency due to limited sample sizes, while elite clubs demonstrated sustained attacking efficiency over long historical periods.

Visualisations Created

Dashboard Visuals

Top Scoring Teams
Home vs Away Wins
Goals by Season
Shot Conversion Rates
Referee Match Counts

Challenges Faced

1. CSV Import Issues

Excel automatically converted football scores into date/time values during preprocessing.

Solution

Used VS Code and Python to preserve score values as text
Imported staging fields as nvarchar

2. Team Naming Inconsistencies

Fixture datasets used abbreviated team names.

Solution

Standardised naming using SQL update transformations.

3. Historical Data Gaps

Older seasons lacked:

referee data
betting odds
advanced match statistics

Solution

Preserved NULL values rather than imputing artificial data.

Skills Demonstrated

Relational database design
SQL normalization
ETL workflows
Data migration
Foreign key modelling
SQL Server schema development
Advanced SQL querying
Data validation
Sports analytics
Handling messy real-world datasets

Conclusion

This project demonstrates the complete lifecycle of building a normalized relational database from raw sports data using SQL Server. The project combined database engineering, ETL processes, relational modelling, and advanced analytical querying to transform denormalized Premier League datasets into a scalable analytical database structure.

The final solution supports reliable multi-table analysis while showcasing practical data engineering and SQL development skills applicable to real-world analytical systems.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
datasets		datasets
notebook		notebook
schema-erd		schema-erd
screenshots		screenshots
sql-outputs		sql-outputs
sql		sql
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Premier League Database Engineering & Analytics Project

Project Overview

Project Objectives

Tools & Technologies

Datasets Used

1. Historical Premier League Dataset

Dataset Scale

2. 2024–2025 Fixtures Dataset

Dataset Scale

Database Design Strategy

Database Schema

Dimension Tables

dim_season

dim_team

dim_referee

Fact Tables

fact_match

fact_match_result

fact_match_stats

fact_betting_odds

fact_fixture

ETL & Data Migration Workflow

Step 1 — CSV Profiling (Python)

Step 3 — Data Cleaning

Key cleaning activities:

Step 4 — Dimension Population

Step 5 — Fact Table Population

Validation & Integrity Checks

Analytical Queries & Insights

Top Scoring Teams

Key Findings

Insight

Home Advantage Analysis

Insight

Highest Scoring Seasons

Key Findings

Insight

Referee Analysis

Key Findings

Data Engineering Insight

Shot Conversion Analysis

Key Findings

Insight

Visualisations Created

Dashboard Visuals

Challenges Faced

1. CSV Import Issues

Solution

2. Team Naming Inconsistencies

Solution

3. Historical Data Gaps

Solution

Skills Demonstrated

Conclusion

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`dim_season`

`dim_team`

`dim_referee`

`fact_match`

`fact_match_result`

`fact_match_stats`

`fact_betting_odds`

`fact_fixture`

Packages