
datablist/sample-csv-files

Sample CSV files

This repository contains sample Comma-Separated Values (CSV) files. CSV is a generic flat-file format used to store structured data. Datasets are split into 3 categories: Customers, Users and Organizations. For each, sample CSV files range from 100 to 2 million records. These CSV files can be used for testing purposes. They can be opened by any application compatible with CSV files or with a CSV editor.

The datasets are generated from random values, mostly using the Python Faker package.

Customers CSV Samples

Customer Schema

  • Index
  • Customer Id
  • First Name
  • Last Name
  • Company
  • City
  • Country
  • Phone 1
  • Phone 2
  • Email
  • Subscription Date
  • Website
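A quick way to confirm that a downloaded file matches this schema is to compare its header row with the column list. The sketch below uses only the standard library. It assumes the header uses exactly the names listed above, and the inline sample row is made up for illustration rather than taken from the repository; `check_header` is a hypothetical helper name.

```python
import csv
import io

# Column names copied from the Customer Schema list above.
CUSTOMER_COLUMNS = [
    "Index", "Customer Id", "First Name", "Last Name", "Company", "City",
    "Country", "Phone 1", "Phone 2", "Email", "Subscription Date", "Website",
]


def check_header(csv_file, expected_columns):
    """Return (missing, unexpected) column names for the file's header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty file -> empty header
    missing = [c for c in expected_columns if c not in header]
    unexpected = [c for c in header if c not in expected_columns]
    return missing, unexpected


# Made-up sample standing in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "Index,Customer Id,First Name,Last Name,Company,City,Country,"
    "Phone 1,Phone 2,Email,Subscription Date,Website\n"
    "1,abc123,Ada,Lovelace,Example Ltd,London,United Kingdom,"
    "555-0100,555-0101,ada@example.com,2021-01-01,https://example.com\n"
)
missing, unexpected = check_header(sample, CUSTOMER_COLUMNS)
print("missing:", missing, "unexpected:", unexpected)
```

The same function can be reused for the other schemas by passing their column lists.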

People CSV Samples

People Schema

  • Index
  • User Id
  • First Name
  • Last Name
  • Sex
  • Email
  • Phone
  • Date of birth
  • Job Title

Organizations CSV Samples

Organization Schema

  • Index
  • Organization Id
  • Name
  • Website
  • Country
  • Description
  • Founded
  • Industry
  • Number of employees

Leads CSV Samples

Lead Schema

  • Index
  • Account Id
  • Lead Owner
  • First Name
  • Last Name
  • Company
  • Phone 1
  • Phone 2
  • Email 1
  • Email 2
  • Website
  • Source
  • Deal Stage
  • Notes

Products CSV Samples

Products Schema

  • Index
  • Name
  • Description
  • Brand
  • Category
  • Price
  • Currency
  • Stock
  • EAN
  • Color
  • Size
  • Availability
  • Internal ID

Local setup to generate files

Python Environments

Create a Python virtual env:

python3 -m venv venv/sample-csv

Activate it:

source venv/sample-csv/bin/activate

Then install the dependencies:

pip install -r requirements.txt

Run the script:

python src/main.py

AI Processing CSV Samples

The generator also creates datasets for testing AI workflows on spreadsheet rows:

  • support-tickets: ticket classification, sentiment, and priority routing.
  • customer-reviews: sentiment analysis and topic classification.
  • messy-company-data: company name cleanup and industry classification.
  • product-catalog-ai: ecommerce classification, translation, and attribute extraction.
  • product-translation-ai: AI translation testing with realistic product names, product descriptions, feature bullets, glossary terms, and target languages.
  • lead-scoring-ai: ICP fit and lead scoring prompts.
  • web-page-extraction-ai: structured extraction from page text.
  • research-questions-ai: AI agent and web research prompt testing.

AI datasets include synthetic Expected ... columns. Use them to compare prompt output with reference labels during testing.
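One possible way to use the reference labels is to compute the share of rows where the prompt output matches the expected value. The sketch below is only an illustration: the column names `Predicted Sentiment` and `Expected Sentiment`, the helper `label_accuracy`, and the sample rows are hypothetical and should be replaced with the actual columns of the dataset being tested.

```python
import csv
import io


def label_accuracy(rows, predicted_col, expected_col):
    """Share of rows whose predicted label equals the reference label.

    Comparison ignores surrounding whitespace and letter case.
    """
    total = 0
    matches = 0
    for row in rows:
        total += 1
        if row[predicted_col].strip().lower() == row[expected_col].strip().lower():
            matches += 1
    return matches / total if total else 0.0


# Made-up rows; column names are assumptions, not the repository's schema.
sample = io.StringIO(
    "Ticket,Predicted Sentiment,Expected Sentiment\n"
    "Refund not received,negative,negative\n"
    "Great support,Positive,positive\n"
    "Where is my order?,negative,neutral\n"
)
accuracy = label_accuracy(
    csv.DictReader(sample), "Predicted Sentiment", "Expected Sentiment"
)
print(f"accuracy: {accuracy:.2f}")  # 2 of 3 rows match -> 0.67
```

Exact string matching suits categorical labels such as sentiment or priority. Free-text outputs such as translations would need a different comparison.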

Broken CSV Fixtures

The broken_csv generator creates intentionally malformed CSV files for parser and repair-tool testing:

  • broken encodings (Windows-1252, ISO-8859-1)
  • mixed or wrong delimiters
  • unescaped and missing quotes
  • unquoted multiline fields
  • ragged rows
  • duplicate headers
  • mixed line endings
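Two of the defects listed above, ragged rows and duplicate headers, can be detected with the standard `csv` module alone. The following is a minimal sketch with a made-up sample; `diagnose` is a hypothetical helper and not part of the broken_csv generator.

```python
import csv
import io
from collections import Counter


def diagnose(text, delimiter=","):
    """Report duplicate header names and records whose field count
    differs from the header's field count."""
    # newline="" leaves line endings untouched so the csv module handles them.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, [])
    duplicates = sorted(name for name, n in Counter(header).items() if n > 1)
    ragged = [
        (record_no, len(row))  # record 1 is the header
        for record_no, row in enumerate(reader, start=2)
        if len(row) != len(header)
    ]
    return {"duplicate_headers": duplicates, "ragged_rows": ragged}


sample = "Id,Name,Name\n1,Ada,Lovelace\n2,Grace\n3,Alan,Turing,extra\n"
report = diagnose(sample)
print(report)
# {'duplicate_headers': ['Name'], 'ragged_rows': [(3, 2), (4, 4)]}
```

Blank lines are also reported as ragged because they parse to zero fields, which may or may not be the desired behavior for a given parser test.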

When the repair is deterministic, a clean version is generated in files/broken_csv/expected-fixed.
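As an illustration of what a deterministic repair can look like, the sketch below decodes a file from a known source encoding, unifies line endings, and re-encodes it as UTF-8, then compares the result with a reference. This is not the repository's repair logic: the `normalise` helper, the default `cp1252` encoding, and the byte-for-byte comparison are assumptions made for the example.

```python
def normalise(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode with a known source encoding, convert CRLF and lone CR
    to LF, and re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


# Made-up fixture: Windows-1252 bytes with mixed line endings.
broken = "Name,City\r\nZoé,Orléans\rAna,Málaga\n".encode("cp1252")
# Made-up reference, standing in for a file from expected-fixed.
expected = "Name,City\nZoé,Orléans\nAna,Málaga\n".encode("utf-8")

fixed = normalise(broken)
print(fixed == expected)  # True
```

Note that this blanket replacement also rewrites line breaks inside quoted multiline fields, so it only suits fixtures where that is acceptable.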

Tests

Create and activate a local virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run the test suite:

pytest

Generate all configured CSV files and upload manifests:

python src/main.py
python src/broken_csv.py
