This repository hosts the public website for AtlasNLP, a country-aware atlas of dataset representation in NLP.
Website: https://anonymous.4open.science/w/AtlasNLP-6D06/
AtlasNLP maps NLP datasets by the countries and populations they represent, the locations where they are produced, and the NLP tasks they cover. The project is designed to make geographic gaps in NLP dataset representation more visible and to support more transparent, country-aware dataset documentation and evaluation.
NLP datasets are often organized by language, task, or benchmark, but these groupings do not always reveal which countries or populations are represented. AtlasNLP addresses this gap by organizing datasets around country-level metadata.
The resource includes:
- AtlasNLP-Core: a large-scale ACL-derived collection of over 18,000 NLP datasets constructed through automated extraction and validation.
- AtlasNLP-Gold: a human-curated reference set used for validation and to expand coverage of underrepresented regions.
- Country-aware metadata: content countries, producer countries, task categories, languages, modality, synthetic status, and related dataset properties.
- Interactive visualizations: maps, country-task coverage, language concentration, producer-content relationships, and dataset summary statistics.
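
As a rough illustration of how the country-aware metadata listed above could feed views such as country-task coverage and producer-content relationships, here is a minimal Python sketch. The `DatasetRecord` class, its field names, the two helper functions, and the example records are all hypothetical placeholders; they are not the actual AtlasNLP schema or data.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical record; field names are illustrative, not the AtlasNLP schema."""
    name: str
    content_countries: list[str]   # countries or populations the data represents
    producer_countries: list[str]  # countries where the dataset was produced
    tasks: list[str]
    languages: list[str]
    modality: str = "text"
    synthetic: bool = False


def country_task_coverage(records: list[DatasetRecord]) -> Counter:
    """Count datasets per (content country, task) pair."""
    counts: Counter = Counter()
    for rec in records:
        for country in set(rec.content_countries):
            for task in set(rec.tasks):
                counts[(country, task)] += 1
    return counts


def producer_content_pairs(records: list[DatasetRecord]) -> Counter:
    """Count datasets per (producer country, content country) pair."""
    counts: Counter = Counter()
    for rec in records:
        for producer in set(rec.producer_countries):
            for content in set(rec.content_countries):
                counts[(producer, content)] += 1
    return counts


# Invented placeholder records, for illustration only.
records = [
    DatasetRecord("example-qa", ["KE", "NG"], ["US"], ["question answering"], ["sw", "en"]),
    DatasetRecord("example-sentiment", ["KE"], ["KE"], ["sentiment analysis"], ["sw"]),
]

coverage = country_task_coverage(records)
pairs = producer_content_pairs(records)
```

Keeping content countries and producer countries as separate fields is what lets the two counts differ: the first shows which countries are represented for each task, the second shows who produces data about whom.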