Project: 3
Difficulty: Beginner
This project analyses global demographic data from the CIA World Factbook using a combination of Python (pandas), SQLite, and SQL queries within Jupyter Notebook.
The objective was to transform a highly unstructured dataset into a clean, queryable format and extract meaningful insights about:
- Population distribution
- Population density
- Growth trends
- Demographic patterns across countries
The project demonstrates an end-to-end data workflow, from data cleaning and transformation to SQL-based analysis and interpretation.
- Python
- Pandas
- SQLite
- Jupyter Notebook
- SQL
-
Source: CIA World Factbook (CSV format)
-
Initial dataset characteristics:
- ~980 columns (highly unstructured)
- Mixed data types (text, numeric, units, annotations)
- Required significant preprocessing
-
Extracted relevant demographic fields:
- Population
- Area
- Birth rate
- Death rate
- Population growth
The dataset required extensive cleaning due to inconsistent formatting:
- Removed commas and text from numeric fields
- Extracted numeric values using regular expressions
- Converted columns to appropriate numeric types
- Handled units (e.g. “million sq km”)
- Removed missing and invalid values
- Excluded aggregate rows such as “World”
- Created a SQLite database directly from Jupyter
- Loaded cleaned data into a structured table (
facts) - Enabled SQL-based analysis for all subsequent steps
- Population and area distributions are highly skewed
- Mean values are significantly higher than medians due to outliers
- Median provides a more representative measure of a typical country
- Extremely large populations (China, India) dominate global totals
- Very small populations (microstates) create lower-end extremes
- These are valid real-world outliers, not data errors
-
Highest densities observed in:
- Microstates (e.g. Monaco)
- Highly urbanised countries (e.g. Singapore)
-
Demonstrates how:
- Small land area drives extreme density
- Large populations increase spatial pressure
-
Countries with both:
- High density
- High population growth
Indicate increasing pressure on:
- Infrastructure
- Housing
- Public services
-
Calculated using: Birth Rate − Death Rate
-
High natural increase observed in developing countries
-
Indicates:
- Younger populations
- Rapid demographic expansion
-
Large countries with low or negative growth identified
-
Examples include developed economies with:
- Aging populations
- Lower birth rates
Potential challenges:
- Workforce decline
- Economic sustainability
- Original dataset contained ~980 columns
- Required manual selection and transformation of relevant fields
- Numeric values stored as text with units and annotations
- Required regex extraction and custom cleaning logic
- Extreme values significantly impacted averages
- Required switching from mean to median for accurate interpretation
- Presence of “World” row distorted analysis
- Required explicit exclusion in SQL queries
- Data cleaning and preprocessing (real-world messy data)
- SQL querying within Jupyter Notebook
- SQLite database creation and management
- Use of aggregate functions and subqueries
- Statistical reasoning (mean vs median, skewness)
- Derived metrics (population density, natural increase)
- Multi-variable analysis and interpretation