This project analyses the NYC Squirrel Census dataset using SQL Server to uncover patterns in squirrel population distribution and behaviour across different parks and environmental conditions.
The analysis focuses on population density, behavioural trends, and environmental influences such as location and vertical positioning.
- Analyse squirrel population distribution across parks and areas
- Identify common behavioural patterns (running, foraging, eating, etc.)
- Examine how environmental factors influence behaviour
- Apply SQL for real-world data cleaning and analysis
- Develop insights from an unconventional and unstructured dataset
- SQL Server Management Studio (SSMS)
- T-SQL
- CSV Data Import (Flat File Wizard)
- Data Cleaning & Transformation
- Aggregation & Analytical Queries
NYC Squirrel Census Dataset (NYC Open Data)
Includes:
- Location data (Park, Area, Coordinates)
- Behavioural indicators (Running, Foraging, Eating, Climbing)
- Environmental attributes (Location, Above Ground height)
- Qualitative observations (notes and stories)
- Imported datasets using SSMS Flat File Import Wizard
- Ensured correct encoding (UTF-8) for text-heavy fields
- Adjusted column data types to prevent truncation (e.g.
nvarchar(MAX)for notes)
- Resolved ParkID inconsistencies
- Float values converted to integer-based
ParkID_Clean
- Float values converted to integer-based
- Handled invalid numeric values
- Cleaned
AboveGround_feetusingTRY_CAST
- Cleaned
- Managed missing data
- Categorised unknown or invalid values appropriately
- Standardised structure
- Clean column names and consistent data types
Logical relationships established using: squirrel_data.ParkID_Clean → park_data.ParkID_Clean
Due to non-unique values in the park dataset, relationships were handled via validated SQL joins rather than enforced constraints.
- Analysed squirrel counts by:
- Park
- Area
➡️ Squirrel populations are concentrated in a small number of parks.
- Analysed:
- Running
- Foraging
- Eating
- Climbing
➡️ Foraging and movement behaviours dominate, reflecting survival-driven activity.
- Behaviour analysed across:
- Different parks
- Location types
- Vertical positioning
➡️ Behaviour varies depending on environmental context.
- Categories:
- Above ground
- Unknown
A large proportion (~78%) of observations were classified as Unknown, limiting accurate vertical behaviour analysis.
- Squirrel populations are highly concentrated in specific parks
- Foraging is the most common behaviour, followed by running and eating
- Behaviour varies across locations, indicating environmental influence
- Vertical behaviour analysis is limited due to missing data
- Despite limitations, behavioural patterns remain consistent
-
Data quality issues
- ParkID imported as float → required cleaning
- Invalid entries in height data
-
Missing data
- High proportion of "Unknown" values
-
Schema limitations
- Non-unique keys prevented PK/FK enforcement
-
Text-heavy fields
- Required careful handling to avoid truncation and encoding issues
Data was exported and visualised using Excel / Power BI:
- Top Parks by Squirrel Count
- Behaviour Distribution
- Behaviour by Park
- Vertical Position Analysis
This project demonstrates how SQL can be used to analyse real-world datasets with significant data quality challenges. The findings highlight the importance of environmental factors in shaping squirrel behaviour, while also emphasising the need for careful data cleaning and validation.
- SQL data import and transformation
- Data cleaning and validation
- Aggregation and analytical querying
- Handling imperfect real-world datasets
- Deriving insights from behavioural and spatial data
- Logical data modelling