ETL German Charging Infrastructure

This project contains an ETL pipeline that downloads, validates, and transforms the German EV charging station registry (Bundesnetzagentur Ladesäulenregister) into a clean and analyzable dataset.

Source: https://www.bundesnetzagentur.de/DE/Fachthemen/ElektrizitaetundGas/E-Mobilitaet/start.html

Task

The ETL pipeline completes the following tasks:

  1. Fetch the Bundesnetzagentur registry automatically.
  2. Reduce the dataset to the most meaningful columns (both Nennleistung Ladeeinrichtung [kW] and Kreis/kreisfreie Stadt are mandatory) and clean it into a table that is easy to query and aggregate.
  3. Write the resulting dataset to CSV.

Architecture

The raw registry contains approximately 95,000 rows and 47 columns, depending on extract date. The pipeline keeps a fixed output schema; for each logical field the source column and transformations are:

  1. station id: from Ladeeinrichtungs-ID, coerced to a positive integer; duplicate ids are detected and the run fails with an error if any remain
  2. operator: from Betreiber, stripped and lowercased; empty strings are replaced with unknown
  3. status: from Status, restricted to the allowed statuses from config
  4. device type: from Art der Ladeeinrichtung, maps the allowed values from config to shorter canonical strings
  5. charging point count: from Anzahl Ladepunkte, validated as an integer greater than zero and within the predefined range
  6. rated power in kW: from Nennleistung Ladeeinrichtung [kW]; comma decimals are parsed into floats, then range-checked
  7. district or independent city: from Kreis/kreisfreie Stadt, stripped and split into type and geographical unit
  8. payment method: from Bezahlsysteme, a semicolon-separated list of tokens that is split, trimmed, and expanded into 0/1 indicator (one-hot encoded) columns
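As a rough sketch of these transformations (column names are taken from the list above; the function name, output names, and `pay_` prefix are illustrative, and the real pipeline performs far more validation):

```python
import pandas as pd

def transform_columns(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch of the per-column transformations described above."""
    out = pd.DataFrame()
    # station id: positive integer, duplicates are a hard failure
    out["station_id"] = pd.to_numeric(raw["Ladeeinrichtungs-ID"], errors="raise").astype(int)
    if out["station_id"].duplicated().any():
        raise ValueError("duplicate station ids detected")
    # operator: strip, lowercase, empty string -> "unknown"
    out["operator"] = raw["Betreiber"].str.strip().str.lower().replace("", "unknown")
    # rated power: German comma decimals ("22,0") -> floats
    out["rated_power_kw"] = (
        raw["Nennleistung Ladeeinrichtung [kW]"].str.replace(",", ".", regex=False).astype(float)
    )
    # payment methods: semicolon-separated tokens -> 0/1 indicator columns
    tokens = raw["Bezahlsysteme"].fillna("").str.split(";")
    cleaned = tokens.apply(lambda ts: [t.strip() for t in ts if t.strip()])
    dummies = cleaned.str.join("|").str.get_dummies()
    return pd.concat([out, dummies.add_prefix("pay_")], axis=1)
```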

For the column-level analysis behind these choices, see eda/reports/EDA_REPORT.md.

Pipeline steps

The configuration imported by the pipeline is managed by settings.py and edited by humans in etl_config.yaml. The pipeline then runs these steps sequentially:

  1. Extract: retrieves the ArcGIS FeatureServer with requests, paginating through the results. If API-only mode is off, an existing local CSV path is used instead
  2. Parse: each row is parsed with a Pydantic model that handles types, ranges, and format quirks
  3. Validate: applies the config-driven business rules
  4. Quality gate: the validation score is checked against the configured threshold (0.8 by default); the pipeline only continues if the score passes
  5. Transform and load: maps device types, expands payment tokens into indicator columns, and writes the CSV
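The paginated extract step can be sketched as follows. The layer URL, parameter set, and page size here are assumptions for illustration; the actual endpoint and settings come from the project config:

```python
import requests

def fetch_all_features(layer_url: str, page_size: int = 1000) -> list[dict]:
    """Paginate an ArcGIS FeatureServer layer query (illustrative sketch)."""
    records, offset = [], 0
    while True:
        resp = requests.get(
            f"{layer_url}/query",
            params={
                "where": "1=1",            # no filter: fetch everything
                "outFields": "*",
                "f": "json",
                "resultOffset": offset,     # pagination cursor
                "resultRecordCount": page_size,
            },
            timeout=30,
        )
        resp.raise_for_status()
        features = resp.json().get("features", [])
        records.extend(f["attributes"] for f in features)
        if len(features) < page_size:  # short page -> last page
            return records
        offset += page_size
```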

Setup

Python 3.10 or newer. From the repo root:

pip install -e ".[dev]"
python -m src.pipeline
pytest

For the EDA notebook: pip install -e ".[eda]".

Engineering and CI/CD

GitHub Actions runs on every push and pull request to main: dependency install from pyproject, flake8 (fatal errors plus complexity and line-length stats), black in check mode at 79 characters, mypy on the package, and pytest with coverage over src. That keeps formatting, typing, and behaviour aligned before merge.

The project is packaged as an installable module via pyproject.toml, uses a shared logger with rotation to disk and console, and keeps tool configuration in pyproject so local runs and CI use the same rules. Tests and lint targets live beside the code under tests/ and src/.

What I learnt

  1. Working with pydantic opens up a new world of configuration via models and field validators, but each new power comes with a responsibility: the order of validator execution, and how that order produces clearer per-record errors, has to be managed properly.

  2. Data quality and safeguards: a solid pipeline needs visible outcomes and guardrails—reports that surface what happened, a quality-score threshold that aborts before load when too much of the feed is bad, quarantine files for rejected rows instead of silent drops, and hard failures on rules such as duplicate ids so broken identity never ships downstream.

  3. Modularization, testing, and CI/CD: the ETL uses small transform steps that are easy to reason about, log, and test. Settings are loaded once into a typed object so misconfiguration fails at startup. Linting, formatting, typing, and coverage checks also run in GitHub Actions

  4. Working with a national registry meant dealing with real-world CSV and API quirks like pagination, field naming in German, and semicolon-separated multivalued fields that do not map cleanly to a flat table without an explicit encoding step
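The validator-ordering point from the first item can be made concrete. A minimal sketch with Pydantic v2 (field names and the model itself are illustrative, not the project's actual schema): a `mode="before"` validator runs prior to type coercion, which is what lets a German comma decimal survive coercion to float, while a plain `field_validator` runs after coercion and can assume the target type.

```python
from pydantic import BaseModel, field_validator

class Station(BaseModel):
    """Illustrative model showing validator execution order."""
    rated_power_kw: float
    charging_points: int

    @field_validator("rated_power_kw", mode="before")
    @classmethod
    def parse_comma_decimal(cls, v):
        # runs BEFORE float coercion: "22,5" -> "22.5"
        if isinstance(v, str):
            return v.replace(",", ".")
        return v

    @field_validator("charging_points")
    @classmethod
    def must_be_positive(cls, v: int) -> int:
        # runs AFTER coercion, so v is already an int here
        if v <= 0:
            raise ValueError("charging point count must be positive")
        return v
```

Because validation raises per field, a failing record yields an error message naming the offending field and value rather than an opaque row-level failure.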

What I'd Do With More Time

  • Replace pydantic's row-by-row parsing (which does not scale well to large datasets) with vectorized validation and batch construction
  • Containerization
  • Implementation of retry logic
  • Incremental loads to avoid recreating the full dataset from scratch
  • Scheduling and orchestration (Airflow for example)
  • Monitoring and alerting, for example a dashboard on Grafana with defined alerts
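For the first bullet, a vectorized alternative might look like the sketch below (column names and ranges are illustrative assumptions): boolean masks over whole columns replace per-row model construction, and rejected rows are still separated out so they could feed the quarantine file.

```python
import pandas as pd

def vectorized_validate(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a frame into (valid, rejected) rows using vectorized checks."""
    # comma decimals -> floats; unparseable values become NaN and fail the range check
    power = pd.to_numeric(
        df["Nennleistung Ladeeinrichtung [kW]"].astype(str).str.replace(",", ".", regex=False),
        errors="coerce",
    )
    points = pd.to_numeric(df["Anzahl Ladepunkte"], errors="coerce")
    # illustrative bounds; the real limits would come from etl_config.yaml
    valid = power.between(1.0, 1000.0) & points.between(1, 50)
    return df[valid], df[~valid]
```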
