Skip to content

StevenTapscott/SQL_Premier-League-Database-Project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Premier League Database Engineering & Analytics Project

Project Overview

This project focused on designing and building a normalized relational SQL Server database using historical Premier League football data spanning from the 1993–1994 season through 2024–2025.

The project involved importing raw CSV datasets into SQL Server staging tables, designing a normalized schema using dimension and fact tables, migrating data through SQL transformations, validating referential integrity, and performing analytical queries across multiple related tables.

The final solution demonstrates practical data engineering, relational modelling, ETL workflows, and sports analytics using SQL Server.


Project Objectives

  • Transform denormalized CSV data into a normalized SQL database
  • Design a relational schema using dimension and fact tables
  • Build and populate SQL Server tables using ETL-style workflows
  • Enforce relationships using primary and foreign keys
  • Validate migrated data and relationships
  • Perform advanced analytical SQL queries across multiple tables
  • Generate insights from Premier League historical data

Tools & Technologies

Tool Purpose
SQL Server Management Studio (SSMS) Database development and querying
SQL Server Relational database engine
Python / Jupyter Notebook CSV profiling and preprocessing
CSV Files Raw data source
Power BI / Excel Data visualisation
T-SQL Database creation, migration, and analysis

Datasets Used

1. Historical Premier League Dataset

Main dataset containing:

  • Match results
  • Team statistics
  • Referee information
  • Betting odds
  • Match outcomes
  • Seasonal data

Dataset Scale

  • 12,160 matches
  • 43 columns
  • Seasons:
    • 1993–1994 → 2024–2025

2. 2024–2025 Fixtures Dataset

Additional fixture dataset containing:

  • Match schedule
  • Stadium/location
  • Teams
  • Fixture results

Dataset Scale

  • 380 fixtures

Database Design Strategy

The original CSV data was highly denormalized, with repeated:

  • team names
  • referee names
  • season values
  • match statistics

To improve:

  • storage efficiency
  • query performance
  • relational integrity
  • analytical flexibility

the database was normalized into dimension and fact tables.


Database Schema

Dimension Tables

dim_season

Stores unique Premier League seasons.

Column
season_id (PK)
season_name

dim_team

Stores unique football clubs.

Column
team_id (PK)
team_name

dim_referee

Stores unique referees.

Column
referee_id (PK)
referee_name

Fact Tables

fact_match

Stores match-level metadata.

Column
match_id (PK)
season_id (FK)
match_week
match_date
match_time
home_team_id (FK)
away_team_id (FK)
referee_id (FK)

fact_match_result

Stores goals and outcomes.

Column
match_id (FK)
full_time_home_goals
full_time_away_goals
full_time_result
half_time_home_goals
half_time_away_goals
half_time_result
home_team_points
away_team_points

fact_match_stats

Stores performance statistics.

Column
match_id (FK)
shots
corners
fouls
yellow_cards
red_cards

fact_betting_odds

Stores betting market data.

Column
match_id (FK)
Bet365 odds
Market averages
Over/under odds

fact_fixture

Stores 2024–2025 fixture information.

Column
fixture_id (PK)
season_id (FK)
match_number
round_number
fixture_date
location
home_team_id (FK)
away_team_id (FK)
result_text

ETL & Data Migration Workflow

Step 1 — CSV Profiling (Python)

Python/Jupyter was used to:

  • inspect column structures
  • identify missing values
  • validate data types
  • compare team naming conventions
  • identify import risks

Example:

import pandas as pd

df = pd.read_csv("PremierLeague.csv")
df.info()
df.isna().sum()

## Step 2 — Staging Tables

Raw CSV files were imported into SQL Server staging tables:

```text
staging_premier_league
staging_fixtures_2024_25

The staging layer preserved raw data prior to normalization.


Step 3 — Data Cleaning

Key cleaning activities:

  • Fixed team naming inconsistencies:
    • Man UtdMan United
    • SpursTottenham
  • Corrected CSV auto-formatting issues
  • Prevented Excel from converting football scores into date/time values
  • Managed nullable historical statistics
  • Standardised column data types

Step 4 — Dimension Population

Dimension tables were populated using:

INSERT INTO ... SELECT DISTINCT

This eliminated duplicated:

  • seasons
  • teams
  • referees

Step 5 — Fact Table Population

Fact tables were populated using:

  • joins to dimension tables
  • foreign key relationships
  • normalized references

Validation & Integrity Checks

Validation Result
Duplicate Match IDs 0
Missing Core Match Fields 0
Missing Fixture Fields 0
Broken Foreign Key Relationships 0
Duplicate Fixture Match Numbers 0

Analytical Queries & Insights

Top Scoring Teams

Key Findings

  • Manchester United recorded the highest historical goal total
  • Arsenal and Liverpool closely followed
  • Manchester City demonstrated exceptional scoring efficiency despite fewer historical matches

Insight

Traditional “Big Six” clubs dominate historical offensive output across Premier League history.


Home Advantage Analysis

Outcome Matches
Home Wins 5,565
Draws 3,093
Away Wins 3,502

Insight

Home teams historically win significantly more matches than away teams, confirming a strong Premier League home advantage.


Highest Scoring Seasons

Key Findings

Recent seasons dominate the highest-scoring rankings:

  • 2023–2024
  • 2024–2025
  • 2022–2023

Insight

Modern Premier League football appears increasingly attack-oriented, with higher total goal output in recent seasons.


Referee Analysis

Key Findings

Most officiated referees:

  • Mike Dean
  • Martin Atkinson
  • Anthony Taylor
  • Michael Oliver

Data Engineering Insight

Historical referee data revealed inconsistent naming conventions:

  • M Dean
  • Mike Dean
  • M. L. Dean

This highlighted the importance of master-data standardisation in ETL workflows.


Shot Conversion Analysis

Key Findings

Elite clubs maintained strong conversion rates across very high shot volumes.

Insight

Smaller clubs occasionally showed inflated efficiency due to limited sample sizes, while elite clubs demonstrated sustained attacking efficiency over long historical periods.


Visualisations Created

Dashboard Visuals

  • Top Scoring Teams
  • Home vs Away Wins
  • Goals by Season
  • Shot Conversion Rates
  • Referee Match Counts

Challenges Faced

1. CSV Import Issues

Excel automatically converted football scores into date/time values during preprocessing.

Solution

  • Used VS Code and Python to preserve score values as text
  • Imported staging fields as nvarchar

2. Team Naming Inconsistencies

Fixture datasets used abbreviated team names.

Solution

Standardised naming using SQL update transformations.


3. Historical Data Gaps

Older seasons lacked:

  • referee data
  • betting odds
  • advanced match statistics

Solution

Preserved NULL values rather than imputing artificial data.


Skills Demonstrated

  • Relational database design
  • SQL normalization
  • ETL workflows
  • Data migration
  • Foreign key modelling
  • SQL Server schema development
  • Advanced SQL querying
  • Data validation
  • Sports analytics
  • Handling messy real-world datasets

Conclusion

This project demonstrates the complete lifecycle of building a normalized relational database from raw sports data using SQL Server. The project combined database engineering, ETL processes, relational modelling, and advanced analytical querying to transform denormalized Premier League datasets into a scalable analytical database structure.

The final solution supports reliable multi-table analysis while showcasing practical data engineering and SQL development skills applicable to real-world analytical systems.

About

Designed and built a normalized Premier League SQL Server database from raw CSV data using ETL workflows, dimension/fact tables, and relational modelling. Performed data cleaning, validation, and advanced analytical SQL queries to uncover insights into scoring trends, home advantage, referee activity, and team performance.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors