This project focused on designing and building a normalized relational SQL Server database using historical Premier League football data spanning from the 1993–1994 season through 2024–2025.
The project involved importing raw CSV datasets into SQL Server staging tables, designing a normalized schema using dimension and fact tables, migrating data through SQL transformations, validating referential integrity, and performing analytical queries across multiple related tables.
The final solution demonstrates practical data engineering, relational modelling, ETL workflows, and sports analytics using SQL Server.
- Transform denormalized CSV data into a normalized SQL database
- Design a relational schema using dimension and fact tables
- Build and populate SQL Server tables using ETL-style workflows
- Enforce relationships using primary and foreign keys
- Validate migrated data and relationships
- Perform advanced analytical SQL queries across multiple tables
- Generate insights from Premier League historical data
| Tool | Purpose |
|---|---|
| SQL Server Management Studio (SSMS) | Database development and querying |
| SQL Server | Relational database engine |
| Python / Jupyter Notebook | CSV profiling and preprocessing |
| CSV Files | Raw data source |
| Power BI / Excel | Data visualisation |
| T-SQL | Database creation, migration, and analysis |
Main dataset containing:
- Match results
- Team statistics
- Referee information
- Betting odds
- Match outcomes
- Seasonal data
- 12,160 matches
- 43 columns
- Seasons:
- 1993–1994 → 2024–2025
Additional fixture dataset containing:
- Match schedule
- Stadium/location
- Teams
- Fixture results
- 380 fixtures
The original CSV data was highly denormalized, with repeated:
- team names
- referee names
- season values
- match statistics
To improve:
- storage efficiency
- query performance
- relational integrity
- analytical flexibility
the database was normalized into dimension and fact tables.
Stores unique Premier League seasons.
| Column |
|---|
| season_id (PK) |
| season_name |
Stores unique football clubs.
| Column |
|---|
| team_id (PK) |
| team_name |
Stores unique referees.
| Column |
|---|
| referee_id (PK) |
| referee_name |
Stores match-level metadata.
| Column |
|---|
| match_id (PK) |
| season_id (FK) |
| match_week |
| match_date |
| match_time |
| home_team_id (FK) |
| away_team_id (FK) |
| referee_id (FK) |
Stores goals and outcomes.
| Column |
|---|
| match_id (FK) |
| full_time_home_goals |
| full_time_away_goals |
| full_time_result |
| half_time_home_goals |
| half_time_away_goals |
| half_time_result |
| home_team_points |
| away_team_points |
Stores performance statistics.
| Column |
|---|
| match_id (FK) |
| shots |
| corners |
| fouls |
| yellow_cards |
| red_cards |
Stores betting market data.
| Column |
|---|
| match_id (FK) |
| Bet365 odds |
| Market averages |
| Over/under odds |
Stores 2024–2025 fixture information.
| Column |
|---|
| fixture_id (PK) |
| season_id (FK) |
| match_number |
| round_number |
| fixture_date |
| location |
| home_team_id (FK) |
| away_team_id (FK) |
| result_text |
Python/Jupyter was used to:
- inspect column structures
- identify missing values
- validate data types
- compare team naming conventions
- identify import risks
Example:
import pandas as pd
df = pd.read_csv("PremierLeague.csv")
df.info()
df.isna().sum()
## Step 2 — Staging Tables
Raw CSV files were imported into SQL Server staging tables:
```text
staging_premier_league
staging_fixtures_2024_25The staging layer preserved raw data prior to normalization.
- Fixed team naming inconsistencies:
Man Utd→Man UnitedSpurs→Tottenham
- Corrected CSV auto-formatting issues
- Prevented Excel from converting football scores into date/time values
- Managed nullable historical statistics
- Standardised column data types
Dimension tables were populated using:
INSERT INTO ... SELECT DISTINCTThis eliminated duplicated:
- seasons
- teams
- referees
Fact tables were populated using:
- joins to dimension tables
- foreign key relationships
- normalized references
| Validation | Result |
|---|---|
| Duplicate Match IDs | 0 |
| Missing Core Match Fields | 0 |
| Missing Fixture Fields | 0 |
| Broken Foreign Key Relationships | 0 |
| Duplicate Fixture Match Numbers | 0 |
- Manchester United recorded the highest historical goal total
- Arsenal and Liverpool closely followed
- Manchester City demonstrated exceptional scoring efficiency despite fewer historical matches
Traditional “Big Six” clubs dominate historical offensive output across Premier League history.
| Outcome | Matches |
|---|---|
| Home Wins | 5,565 |
| Draws | 3,093 |
| Away Wins | 3,502 |
Home teams historically win significantly more matches than away teams, confirming a strong Premier League home advantage.
Recent seasons dominate the highest-scoring rankings:
- 2023–2024
- 2024–2025
- 2022–2023
Modern Premier League football appears increasingly attack-oriented, with higher total goal output in recent seasons.
Most officiated referees:
- Mike Dean
- Martin Atkinson
- Anthony Taylor
- Michael Oliver
Historical referee data revealed inconsistent naming conventions:
M DeanMike DeanM. L. Dean
This highlighted the importance of master-data standardisation in ETL workflows.
Elite clubs maintained strong conversion rates across very high shot volumes.
Smaller clubs occasionally showed inflated efficiency due to limited sample sizes, while elite clubs demonstrated sustained attacking efficiency over long historical periods.
- Top Scoring Teams
- Home vs Away Wins
- Goals by Season
- Shot Conversion Rates
- Referee Match Counts
Excel automatically converted football scores into date/time values during preprocessing.
- Used VS Code and Python to preserve score values as text
- Imported staging fields as
nvarchar
Fixture datasets used abbreviated team names.
Standardised naming using SQL update transformations.
Older seasons lacked:
- referee data
- betting odds
- advanced match statistics
Preserved NULL values rather than imputing artificial data.
- Relational database design
- SQL normalization
- ETL workflows
- Data migration
- Foreign key modelling
- SQL Server schema development
- Advanced SQL querying
- Data validation
- Sports analytics
- Handling messy real-world datasets
This project demonstrates the complete lifecycle of building a normalized relational database from raw sports data using SQL Server. The project combined database engineering, ETL processes, relational modelling, and advanced analytical querying to transform denormalized Premier League datasets into a scalable analytical database structure.
The final solution supports reliable multi-table analysis while showcasing practical data engineering and SQL development skills applicable to real-world analytical systems.