This repository contains sample Comma-Separated Values (CSV) files. CSV is a generic flat-file format used to store structured data. Datasets are split into five categories: Customers, People, Organizations, Leads and Products. Sample CSV files range from 100 to 2 million records, depending on the category. These CSV files can be used for testing purposes. They can be opened by any application compatible with CSV files or with a CSV editor.
The datasets are generated from random values, mostly using the Python Faker package.
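As a rough illustration of random-value generation, the sketch below builds a few customer-like rows. It is not the repository's generator: it swaps Faker for the standard library `random` module to stay dependency-free, and the value pools, the function names `make_customer_rows` and `rows_to_csv`, and the seed are all assumptions made for this example.

```python
import csv
import io
import random

# Hypothetical value pools; the real generator mostly relies on Faker instead.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dana"]
LAST_NAMES = ["Martin", "Okafor", "Silva", "Tanaka"]
COUNTRIES = ["France", "Japan", "Brazil", "Kenya"]


def make_customer_rows(count, seed=0):
    """Return `count` dicts with a subset of the Customers columns."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    rows = []
    for index in range(1, count + 1):
        rows.append({
            "Index": index,
            "Customer Id": f"{rng.getrandbits(60):015x}",  # 15 hex digits
            "First Name": rng.choice(FIRST_NAMES),
            "Last Name": rng.choice(LAST_NAMES),
            "Country": rng.choice(COUNTRIES),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = make_customer_rows(5)
sample_csv = rows_to_csv(sample_rows)
```

Seeding the generator keeps test fixtures reproducible, which is usually what matters for sample data.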
- customers-100.csv - Zip version - Customers CSV with 100 records
- customers-1000.csv - Zip version - Customers CSV with 1000 records
- customers-10000.csv - Zip version - Customers CSV with 10000 records
- customers-100000.csv - Zip version - Customers CSV with 100000 records
- customers-500000.csv - Customers CSV with 500000 records
- customers-1000000.csv - Customers CSV with 1000000 records
- customers-2000000.csv - Customers CSV with 2000000 records
- Index
- Customer Id
- First Name
- Last Name
- Company
- City
- Country
- Phone 1
- Phone 2
- Subscription Date
- Website
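A minimal sketch of reading a Customers file with the standard library `csv` module, assuming the header row uses exactly the column names listed above. The inline sample row is invented for illustration and stands in for a real file such as customers-100.csv; `read_customers` is a hypothetical helper name.

```python
import csv
import io

CUSTOMER_COLUMNS = [
    "Index", "Customer Id", "First Name", "Last Name", "Company", "City",
    "Country", "Phone 1", "Phone 2", "Subscription Date", "Website",
]

# Invented sample data; for a real file use open("customers-100.csv", newline="").
SAMPLE = "\n".join([
    ",".join(CUSTOMER_COLUMNS),
    '1,abc123,Jane,Doe,"Doe, Smith and Co",Lyon,France,'
    "555-0100,555-0101,2021-04-17,https://example.com/",
]) + "\n"


def read_customers(handle):
    """Yield one dict per record, checking the header first."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != CUSTOMER_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        row["Index"] = int(row["Index"])  # the other fields stay strings
        yield row


customers = list(read_customers(io.StringIO(SAMPLE)))
```

Because the function is a generator, it can also stream the larger files row by row instead of loading them fully into memory.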
- people-100.csv - Zip version - People CSV with 100 records
- people-1000.csv - Zip version - People CSV with 1000 records
- people-10000.csv - Zip version - People CSV with 10000 records
- people-100000.csv - Zip version - People CSV with 100000 records
- people-500000.csv - People CSV with 500000 records
- people-1000000.csv - People CSV with 1000000 records
- people-2000000.csv - People CSV with 2000000 records
- Index
- User Id
- First Name
- Last Name
- Sex
- Phone
- Date of birth
- Job Title
- organizations-100.csv - Zip version - Organizations CSV with 100 records
- organizations-1000.csv - Zip version - Organizations CSV with 1000 records
- organizations-10000.csv - Zip version - Organizations CSV with 10000 records
- organizations-100000.csv - Zip version - Organizations CSV with 100000 records
- organizations-500000.csv - Organizations CSV with 500000 records
- organizations-1000000.csv - Organizations CSV with 1000000 records
- organizations-2000000.csv - Organizations CSV with 2000000 records
- Index
- Organization Id
- Name
- Website
- Country
- Description
- Founded
- Industry
- Number of employees
- leads-100.csv - Zip version - Leads CSV with 100 records
- leads-1000.csv - Zip version - Leads CSV with 1000 records
- leads-10000.csv - Zip version - Leads CSV with 10000 records
- leads-100000.csv - Zip version - Leads CSV with 100000 records
- Index
- Account Id
- Lead Owner
- First Name
- Last Name
- Company
- Phone 1
- Phone 2
- Email 1
- Email 2
- Website
- Source
- Deal Stage
- Notes
- products-100.csv - Zip version - Products CSV with 100 records
- products-1000.csv - Zip version - Products CSV with 1000 records
- products-10000.csv - Zip version - Products CSV with 10000 records
- products-100000.csv - Zip version - Products CSV with 100000 records
- products-1000000.csv - Zip version - Products CSV with 1000000 records
- products-2000000.csv - Zip version - Products CSV with 2000000 records
- Index
- Name
- Description
- Brand
- Category
- Price
- Currency
- Stock
- EAN
- Color
- Size
- Availability
- Internal ID
Create a Python virtual env:
python3 -m venv venv/sample-csv
Activate it:
source venv/sample-csv/bin/activate
Install the dependencies:
pip install -r requirements.txt
Run the generator:
python src/main.py
The generator also creates datasets for testing AI workflows on spreadsheet rows:
- support-tickets: ticket classification, sentiment, and priority routing.
- customer-reviews: sentiment analysis and topic classification.
- messy-company-data: company name cleanup and industry classification.
- product-catalog-ai: ecommerce classification, translation, and attribute extraction.
- product-translation-ai: AI translation testing with realistic product names, product descriptions, feature bullets, glossary terms, and target languages.
- lead-scoring-ai: ICP fit and lead scoring prompts.
- web-page-extraction-ai: structured extraction from page text.
- research-questions-ai: AI agent and web research prompt testing.
AI datasets include synthetic Expected ... columns. Use them to compare prompt output with reference labels during testing.
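One way to use the reference labels is to compute a simple agreement rate between prompt output and an Expected column. The sketch below assumes a hypothetical column pair, `Expected Sentiment` and `Predicted Sentiment`; the actual column names in the generated files may differ, and the rows shown are invented.

```python
import csv
import io

# Invented rows; "Expected Sentiment" / "Predicted Sentiment" are assumed names.
SAMPLE = """Ticket,Expected Sentiment,Predicted Sentiment
Refund took three weeks,negative,negative
Great onboarding call,positive,Positive
App crashes on login,negative,neutral
"""


def agreement_rate(handle, expected_col, predicted_col):
    """Return the share of rows where the prediction matches the reference label."""
    matches = 0
    total = 0
    for row in csv.DictReader(handle):
        total += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if row[expected_col].strip().lower() == row[predicted_col].strip().lower():
            matches += 1
    return matches / total if total else 0.0


rate = agreement_rate(
    io.StringIO(SAMPLE), "Expected Sentiment", "Predicted Sentiment"
)
```

Exact-match agreement suits label columns; free-text outputs such as translations would need a different comparison.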
The broken_csv generator creates intentionally malformed CSV files for parser and repair-tool testing:
- broken encodings (Windows-1252, ISO-8859-1)
- mixed or wrong delimiters
- unescaped and missing quotes
- unquoted multiline fields
- ragged rows
- duplicate headers
- mixed line endings
When the repair is deterministic, a clean version is generated in files/broken_csv/expected-fixed.
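A few of the defects listed above can be detected with the standard library alone. The following sketch flags ragged rows and duplicate headers, and falls back from UTF-8 to Windows-1252 when decoding. The helpers `decode_with_fallback` and `inspect_csv_bytes` and the report format are assumptions made for illustration, not part of the broken_csv generator.

```python
import csv
import io


def decode_with_fallback(raw):
    """Decode as UTF-8, falling back to Windows-1252 on failure."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers the few bytes cp1252 leaves undefined.
        return raw.decode("cp1252", errors="replace"), "cp1252"


def inspect_csv_bytes(raw, delimiter=","):
    """Return a small report of structural problems found in the data."""
    text, encoding = decode_with_fallback(raw)
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    header = rows[0] if rows else []
    duplicates = sorted({name for name in header if header.count(name) > 1})
    # Record numbers are 1-based and count the header as record 1.
    ragged = [
        number
        for number, row in enumerate(rows[1:], start=2)
        if len(row) != len(header)
    ]
    return {
        "encoding": encoding,
        "duplicate_headers": duplicates,
        "ragged_rows": ragged,
    }


# Invented example: "é" encoded as Windows-1252, a repeated header, a short row.
broken = "Name,City,Name\nZoé,Paris,Z\nAli,Cairo\n".encode("cp1252")
report = inspect_csv_bytes(broken)
```

Detection is kept separate from repair here, since several of these defects have no single deterministic fix.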
Create and activate a local virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run the test suite:
pytest
Generate all configured CSV files and upload manifests:
python src/main.py
python src/broken_csv.py