CIA World Factbook Analysis (SQL + Jupyter Project)

Project: 3
Difficulty: Beginner

Project Overview

This project analyses global demographic data from the CIA World Factbook using a combination of Python (pandas), SQLite, and SQL queries within Jupyter Notebook.

The objective was to transform a highly unstructured dataset into a clean, queryable format and extract meaningful insights about:

Population distribution
Population density
Growth trends
Demographic patterns across countries

The project demonstrates an end-to-end data workflow, from data cleaning and transformation to SQL-based analysis and interpretation.

Tools & Technologies

Python
Pandas
SQLite
Jupyter Notebook
SQL

Dataset

Source: CIA World Factbook (CSV format)
Initial dataset characteristics:
- ~980 columns (highly unstructured)
- Mixed data types (text, numeric, units, annotations)
- Required significant preprocessing

Data Processing Workflow

1. Data Selection

Extracted relevant demographic fields:
- Population
- Area
- Birth rate
- Death rate
- Population growth

2. Data Cleaning & Transformation

The dataset required extensive cleaning due to inconsistent formatting:

Removed commas and text from numeric fields
Extracted numeric values using regular expressions
Converted columns to appropriate numeric types
Handled units (e.g. “million sq km”)
Removed missing and invalid values
Excluded aggregate rows such as “World”

3. Database Creation (SQLite)

Created a SQLite database directly from Jupyter
Loaded cleaned data into a structured table (facts)
Enabled SQL-based analysis for all subsequent steps

Key Analysis & Insights

1. Summary Statistics

Population and area distributions are highly skewed
Mean values are significantly higher than medians due to outliers
Median provides a more representative measure of a typical country

2. Outlier Analysis

Extremely large populations (China, India) dominate global totals
Very small populations (microstates) create lower-end extremes
These are valid real-world outliers, not data errors

3. Population Density

Highest densities observed in:
- Microstates (e.g. Monaco)
- Highly urbanised countries (e.g. Singapore)
Demonstrates how:
- Small land area drives extreme density
- Large populations increase spatial pressure

4. Population Pressure (Density + Growth)

Countries with both:
- High density
- High population growth

Indicate increasing pressure on:

Infrastructure
Housing
Public services

5. Natural Population Growth

Calculated using: Birth Rate − Death Rate
High natural increase observed in developing countries
Indicates:
- Younger populations
- Rapid demographic expansion

6. Aging Economies

Large countries with low or negative growth identified
Examples include developed economies with:
- Aging populations
- Lower birth rates

Potential challenges:

Workforce decline
Economic sustainability

Challenges & Limitations

1. Unstructured Dataset

Original dataset contained ~980 columns
Required manual selection and transformation of relevant fields

2. Data Cleaning Complexity

Numeric values stored as text with units and annotations
Required regex extraction and custom cleaning logic

3. Outliers & Skewness

Extreme values significantly impacted averages
Required switching from mean to median for accurate interpretation

4. Aggregate Data

Presence of “World” row distorted analysis
Required explicit exclusion in SQL queries

Key Skills Demonstrated

Data cleaning and preprocessing (real-world messy data)
SQL querying within Jupyter Notebook
SQLite database creation and management
Use of aggregate functions and subqueries
Statistical reasoning (mean vs median, skewness)
Derived metrics (population density, natural increase)
Multi-variable analysis and interpretation

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
database		database
dataset		dataset
docs		docs
notebook		notebook
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CIA World Factbook Analysis (SQL + Jupyter Project)

Project Overview

Tools & Technologies

Dataset

Data Processing Workflow

1. Data Selection

2. Data Cleaning & Transformation

3. Database Creation (SQLite)

Key Analysis & Insights

1. Summary Statistics

2. Outlier Analysis

3. Population Density

4. Population Pressure (Density + Growth)

5. Natural Population Growth

6. Aging Economies

Challenges & Limitations

1. Unstructured Dataset

2. Data Cleaning Complexity

3. Outliers & Skewness

4. Aggregate Data

Key Skills Demonstrated

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

CIA World Factbook Analysis (SQL + Jupyter Project)

Project Overview

Tools & Technologies

Dataset

Data Processing Workflow

1. Data Selection

2. Data Cleaning & Transformation

3. Database Creation (SQLite)

Key Analysis & Insights

1. Summary Statistics

2. Outlier Analysis

3. Population Density

4. Population Pressure (Density + Growth)

5. Natural Population Growth

6. Aging Economies

Challenges & Limitations

1. Unstructured Dataset

2. Data Cleaning Complexity

3. Outliers & Skewness

4. Aggregate Data

Key Skills Demonstrated

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages