Startup Fundraising Analysis (Crunchbase Dataset)
Project: 2
Difficulty: Beginner
Tools: SQLite, Juypter Notebook, Python,
Focus: Data Cleaning, Analysis, Using SQLite through a Python import, using pandas import.
Project Overview
This project explores startup investment data from the Crunchbase dataset using a combined pandas + SQLite workflow. The objective was to transform raw, unstructured data into a clean, queryable format and derive insights into fundraising trends, sector performance, and funding patterns.
The project follows a full data pipeline:
- Data ingestion from CSV
- Data cleaning and transformation
- Storage in a relational database (SQLite)
- Analytical querying using SQL
- Interpretation of business insights
Tools & Technologies
- Python
- Pandas
- SQLite
- Jupyter Notebook
Dataset
-
Source: Crunchbase Startup Investments dataset
-
Records:
- Raw dataset: ~54,294 rows
- Cleaned dataset: 40,907 rows
Data Processing Workflow
-
Inspected dataset structure, data types, and missing values
-
Identified key issues:
- Funding amounts stored as text
- Date fields in string format
- Inconsistent column naming
- Missing and placeholder values
- Data Loading (SQLite Integration)
-
Loaded CSV into SQLite using chunk-based processing
-
Prevented duplication by using:
replacefor the first chunkappendfor subsequent chunks
-
Created a structured relational table for querying
- Data Cleaning & Transformation
Key transformations included:
- Renaming inconsistent columns (e.g.
_funding_total_usd_→funding_total_usd) - Converting funding values to numeric format
- Handling invalid placeholders (e.g. '-', empty values)
- Converting date columns to datetime
- Standardising
founded_yearto integer - Removing records with missing or invalid funding values
Result:
- ~25% of records removed due to poor data quality
Key Analysis & Insights
- Funding Trends Over Time
- Startup activity increased significantly from the early 2000s
- Peak funding observed around 2010–2012
- Average funding per company decreased over time, indicating broader distribution of capital
- Most Successful Startup Verticals
- Biotechnology received the highest total funding
- Mobile had the highest average funding per company
- Software had many companies but lower average funding, indicating a competitive market
- Funding Round Distribution
- Majority of startups received only 1–2 funding rounds
- Sharp drop-off in later rounds
- Indicates high attrition beyond early-stage funding
- Funding Type Analysis
- Venture capital dominated total funding
- Private equity and debt financing contributed significantly at later stages
- Early-stage funding (seed, angel) occurred frequently but with lower capital amounts
- Investment Concentration
- Funding is highly concentrated among a small number of companies
- Top firms received disproportionately large amounts of capital
- Indicates unequal distribution of investment across the startup ecosystem
Challenges & Limitations
- Data Quality Issues
- Missing and inconsistent values required significant cleaning
- Funding column stored as text with symbols and placeholders
- Category fields were not fully normalised
- Database Duplication Issue
-
Initial data loading caused duplicate records due to repeated
appendoperations -
Resolved by:
- Rebuilding the database from scratch
- Implementing controlled chunk loading (
replace+appendlogic)
- Lack of Investor-Level Data
-
Dataset does not include investor names
-
Prevented direct analysis of investor activity
-
Adapted approach by analysing:
- Funding types
- Company-level investment patterns
- Outliers in Funding Data
- Extremely large funding values skewed averages
- Required careful interpretation of results
Key Skills Demonstrated
- Data cleaning and preprocessing
- SQL querying and aggregation
- Database management (SQLite)
- Handling large datasets with chunking
- Data validation and debugging
- Translating data into business insights
Conclusion
This project demonstrates an end-to-end data analysis workflow, from raw data ingestion to actionable insights. It highlights real-world challenges such as data quality issues, duplication errors, and dataset limitations, while showcasing the ability to adapt analysis techniques accordingly.
Future Improvements
- Normalise category fields for more accurate sector analysis
- Incorporate additional datasets with investor-level detail
- Apply visualisation tools (e.g. Power BI, Tableau) for dashboard reporting