Skip to content

StevenTapscott/SQL-CIA_Factbook_Analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CIA World Factbook Analysis (SQL + Jupyter Project)

Project: 3
Difficulty: Beginner

Project Overview

This project analyses global demographic data from the CIA World Factbook using a combination of Python (pandas), SQLite, and SQL queries within Jupyter Notebook.

The objective was to transform a highly unstructured dataset into a clean, queryable format and extract meaningful insights about:

  • Population distribution
  • Population density
  • Growth trends
  • Demographic patterns across countries

The project demonstrates an end-to-end data workflow, from data cleaning and transformation to SQL-based analysis and interpretation.


Tools & Technologies

  • Python
  • Pandas
  • SQLite
  • Jupyter Notebook
  • SQL

Dataset

  • Source: CIA World Factbook (CSV format)

  • Initial dataset characteristics:

    • ~980 columns (highly unstructured)
    • Mixed data types (text, numeric, units, annotations)
    • Required significant preprocessing

Data Processing Workflow

1. Data Selection

  • Extracted relevant demographic fields:

    • Population
    • Area
    • Birth rate
    • Death rate
    • Population growth

2. Data Cleaning & Transformation

The dataset required extensive cleaning due to inconsistent formatting:

  • Removed commas and text from numeric fields
  • Extracted numeric values using regular expressions
  • Converted columns to appropriate numeric types
  • Handled units (e.g. “million sq km”)
  • Removed missing and invalid values
  • Excluded aggregate rows such as “World”

3. Database Creation (SQLite)

  • Created a SQLite database directly from Jupyter
  • Loaded cleaned data into a structured table (facts)
  • Enabled SQL-based analysis for all subsequent steps

Key Analysis & Insights

1. Summary Statistics

  • Population and area distributions are highly skewed
  • Mean values are significantly higher than medians due to outliers
  • Median provides a more representative measure of a typical country

2. Outlier Analysis

  • Extremely large populations (China, India) dominate global totals
  • Very small populations (microstates) create lower-end extremes
  • These are valid real-world outliers, not data errors

3. Population Density

  • Highest densities observed in:

    • Microstates (e.g. Monaco)
    • Highly urbanised countries (e.g. Singapore)
  • Demonstrates how:

    • Small land area drives extreme density
    • Large populations increase spatial pressure

4. Population Pressure (Density + Growth)

  • Countries with both:

    • High density
    • High population growth

Indicate increasing pressure on:

  • Infrastructure
  • Housing
  • Public services

5. Natural Population Growth

  • Calculated using: Birth Rate − Death Rate

  • High natural increase observed in developing countries

  • Indicates:

    • Younger populations
    • Rapid demographic expansion

6. Aging Economies

  • Large countries with low or negative growth identified

  • Examples include developed economies with:

    • Aging populations
    • Lower birth rates

Potential challenges:

  • Workforce decline
  • Economic sustainability

Challenges & Limitations

1. Unstructured Dataset

  • Original dataset contained ~980 columns
  • Required manual selection and transformation of relevant fields

2. Data Cleaning Complexity

  • Numeric values stored as text with units and annotations
  • Required regex extraction and custom cleaning logic

3. Outliers & Skewness

  • Extreme values significantly impacted averages
  • Required switching from mean to median for accurate interpretation

4. Aggregate Data

  • Presence of “World” row distorted analysis
  • Required explicit exclusion in SQL queries

Key Skills Demonstrated

  • Data cleaning and preprocessing (real-world messy data)
  • SQL querying within Jupyter Notebook
  • SQLite database creation and management
  • Use of aggregate functions and subqueries
  • Statistical reasoning (mean vs median, skewness)
  • Derived metrics (population density, natural increase)
  • Multi-variable analysis and interpretation

About

SQL demographic analysis project exploring CIA World Factbook data using SQL queries, aggregations, subqueries, and statistical analysis to examine population trends, population density, growth rates, and global country-level demographic patterns.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages