ETL German Charging Infrastructure

This project contains an ETL pipeline that downloads, validates, and transforms the German EV charging station registry (Bundesnetzagentur Ladesäulenregister) into a clean and analyzable dataset.

Source: https://www.bundesnetzagentur.de/DE/Fachthemen/ElektrizitaetundGas/E-Mobilitaet/start.html

Task

The ETL pipeline completes the following tasks:

  1. Fetch the Bundesnetzagentur registry automatically.
  2. Reduce the dataset to the most meaningful columns (both Nennleistung Ladeeinrichtung [kW] and Kreis/kreisfreie Stadt are mandatory) and clean it into a table that is easy to query and aggregate.
  3. Write the resulting dataset to CSV.

Architecture

The raw registry contains approximately 95,000 rows and 47 columns, depending on extract date. The pipeline keeps a fixed output schema; for each logical field the source column and transformations are:

  1. station id: from Ladeeinrichtungs-ID, coerced to a positive integer; duplicate ids are detected and the run fails with an error if any remain
  2. operator: from Betreiber, stripped and lowercased; empty strings are replaced with unknown
  3. status: from Status, restricted to the allowed statuses from config
  4. device type: from Art der Ladeeinrichtung, maps the allowed values from config to shorter canonical strings
  5. charging point count: from Anzahl Ladepunkte, validated as an integer greater than zero and within the predefined range
  6. rated power in kW: from Nennleistung Ladeeinrichtung [kW]; comma decimals are parsed into floats, then range-checked
  7. district or independent city: from Kreis/kreisfreie Stadt, stripped and split into type and geographical unit
  8. payment method: from Bezahlsysteme, a semicolon-separated list of tokens that is split, trimmed, and expanded into 0/1 indicator (one-hot encoded) columns
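As a rough sketch of these transformations (column names are taken from the list above; the function name, output names, and `pay_` prefix are illustrative, and the real pipeline performs far more validation):

```python
import pandas as pd

def transform_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch of the per-column transformations described above."""
    out = pd.DataFrame()
    # station id: positive integer, duplicates are a hard failure
    out["station_id"] = pd.to_numeric(raw["Ladeeinrichtungs-ID"], errors="raise").astype(int)
    if out["station_id"].duplicated().any():
        raise ValueError("duplicate station ids detected")
    # operator: strip, lowercase, empty string -> "unknown"
    out["operator"] = raw["Betreiber"].str.strip().str.lower().replace("", "unknown")
    # rated power: German comma decimals ("22,0") -> floats
    out["rated_power_kw"] = (
        raw["Nennleistung Ladeeinrichtung [kW]"].str.replace(",", ".", regex=False).astype(float)
    )
    # payment methods: semicolon-separated tokens -> 0/1 indicator columns
    tokens = raw["Bezahlsysteme"].fillna("").str.split(";")
    cleaned = tokens.apply(lambda ts: [t.strip() for t in ts if t.strip()])
    dummies = cleaned.str.join("|").str.get_dummies()
    return pd.concat([out, dummies.add_prefix("pay_")], axis=1)
```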

For the column-level analysis behind these choices, see eda/reports/EDA_REPORT.md.

Pipeline steps

The configuration imported by the pipeline is managed by settings.py and edited by humans in etl_config.yaml. The pipeline then runs these steps sequentially:

  1. Extract: retrieves the ArcGIS FeatureServer with requests, paginating through the results. If API-only mode is off, an existing local CSV path is used instead
  2. Parse: each row is parsed with a Pydantic model that handles types, ranges, and format quirks
  3. Validate: applies the config-driven business rules
  4. Quality gate: the validation score is checked against the configured threshold (0.8 by default); the pipeline only continues if the score passes
  5. Transform and load: maps device types, expands payment tokens into indicator columns, and writes the CSV
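The paginated extract step can be sketched as follows. The layer URL, parameter set, and page size here are assumptions for illustration; the actual endpoint and settings come from the project config:

```python
import requests

def fetch_all_features(layer_url: str, page_size: int = 1000) -> list[dict]:
    """Paginate an ArcGIS FeatureServer layer query (illustrative sketch)."""
    records, offset = [], 0
    while True:
        resp = requests.get(
            f"{layer_url}/query",
            params={
                "where": "1=1",            # no filter: fetch everything
                "outFields": "*",
                "f": "json",
                "resultOffset": offset,     # pagination cursor
                "resultRecordCount": page_size,
            },
            timeout=30,
        )
        resp.raise_for_status()
        features = resp.json().get("features", [])
        records.extend(f["attributes"] for f in features)
        if len(features) < page_size:  # short page -> last page
            return records
        offset += page_size
```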

Setup

Python 3.10 or newer. From the repo root:

pip install -e ".[dev]"
python -m src.pipeline
pytest

For the EDA notebook: pip install -e ".[eda]".

Engineering and CI/CD

GitHub Actions runs on every push and pull request to main: dependency install from pyproject, flake8 (fatal errors plus complexity and line-length stats), black in check mode at 79 characters, mypy on the package, and pytest with coverage over src. That keeps formatting, typing, and behaviour aligned before merge.

The project is packaged as an installable module via pyproject.toml, uses a shared logger with rotation to disk and console, and keeps tool configuration in pyproject so local runs and CI use the same rules. Tests and lint targets live beside the code under tests/ and src/.

What I learnt

  1. Working with pydantic opens up a new world of configuration via models and field validators, but each new power comes with a responsibility: the order of validator execution, and how that order produces clearer per-record errors, has to be managed properly.

  2. Data quality and safeguards: a solid pipeline needs visible outcomes and guardrails—reports that surface what happened, a quality-score threshold that aborts before load when too much of the feed is bad, quarantine files for rejected rows instead of silent drops, and hard failures on rules such as duplicate ids so broken identity never ships downstream.

  3. Modularization, testing, and CI/CD: the ETL uses small transform steps that are easy to reason about, log, and test. Settings are loaded once into a typed object so misconfiguration fails at startup. Linting, formatting, typing, and coverage checks also run in GitHub Actions

  4. Working with a national registry meant dealing with real-world CSV and API quirks like pagination, field naming in German, and semicolon-separated multivalued fields that do not map cleanly to a flat table without an explicit encoding step
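The validator-ordering point from the first item can be made concrete. A minimal sketch with Pydantic v2 (field names and the model itself are illustrative, not the project's actual schema): a `mode="before"` validator runs prior to type coercion, which is what lets a German comma decimal survive coercion to float, while a plain `field_validator` runs after coercion and can assume the target type.

```python
from pydantic import BaseModel, field_validator

class Station(BaseModel):
    """Illustrative model showing validator execution order."""
    rated_power_kw: float
    charging_points: int

    @field_validator("rated_power_kw", mode="before")
    @classmethod
    def parse_comma_decimal(cls, v):
        # runs BEFORE float coercion: "22,5" -> "22.5"
        if isinstance(v, str):
            return v.replace(",", ".")
        return v

    @field_validator("charging_points")
    @classmethod
    def must_be_positive(cls, v: int) -> int:
        # runs AFTER coercion, so v is already an int here
        if v <= 0:
            raise ValueError("charging point count must be positive")
        return v
```

Because validation raises per field, a failing record yields an error message naming the offending field and value rather than an opaque row-level failure.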

What I'd Do With More Time

  • Replace pydantic's row-by-row parsing (which does not scale well to large datasets) with vectorized validation and batch construction
  • Containerization
  • Implementation of retry logic
  • Incremental loads to avoid recreating the full dataset from scratch
  • Scheduling and orchestration (Airflow for example)
  • Monitoring and alerting, for example a dashboard on Grafana with defined alerts
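For the first bullet, a vectorized alternative might look like the sketch below (column names and ranges are illustrative assumptions): boolean masks over whole columns replace per-row model construction, and rejected rows are still separated out so they could feed the quarantine file.

```python
import pandas as pd

def vectorized_validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a frame into (valid, rejected) rows using vectorized checks."""
    # comma decimals -> floats; unparseable values become NaN and fail the range check
    power = pd.to_numeric(
        df["Nennleistung Ladeeinrichtung [kW]"].astype(str).str.replace(",", ".", regex=False),
        errors="coerce",
    )
    points = pd.to_numeric(df["Anzahl Ladepunkte"], errors="coerce")
    # illustrative bounds; the real limits would come from etl_config.yaml
    valid = power.between(1.0, 1000.0) & points.between(1, 50)
    return df[valid], df[~valid]
```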
