Some testing:
Each is about 1.5GB zipped, with over 38 million entries in shapes.txt and 50 million entries in stop_times.txt. The result indicates that there were 50+ million additions and about as many deletions. Examining shapes.txt, we can see that they use some kind of UUID for the shape_id, and it seems to be completely regenerated for each version. That means that trying to diff that kind of dataset is not possible.
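To illustrate why regenerated identifiers defeat a key-based diff, here is a minimal sketch. It assumes rows are matched on shape_id alone (a simplification, since shapes.txt rows are normally keyed on shape_id plus shape_pt_sequence). The function name `diff_by_key` and the sample IDs are hypothetical and are not taken from the engine.

```python
def diff_by_key(old_rows, new_rows, key):
    """Classify rows as added, deleted, or modified by matching on `key`."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = [k for k in new if k not in old]
    deleted = [k for k in old if k not in new]
    modified = [k for k in old if k in new and old[k] != new[k]]
    return added, deleted, modified


# Same geometry in both versions, but the shape_id was regenerated (made-up IDs).
v1 = [{"shape_id": "id-aaa", "shape_pt_lat": "45.5", "shape_pt_lon": "-73.6"}]
v2 = [{"shape_id": "id-bbb", "shape_pt_lat": "45.5", "shape_pt_lon": "-73.6"}]

added, deleted, modified = diff_by_key(v1, v2, "shape_id")
print(added, deleted, modified)  # ['id-bbb'] ['id-aaa'] []
```

Although the point itself is unchanged, it shows up as one addition plus one deletion, which is the pattern seen at scale in the large feeds.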
Tested with STM: mdb-2126-202511111837.zip vs mdb-2126-202511130041.zip. Problems:
Modified code to ignore extra zeros in coordinates.
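One way to ignore extra zeros is to compare coordinates numerically rather than as strings. The sketch below is only an illustration of that idea, not the change actually made in this PR; the helper name `coords_equal` is hypothetical.

```python
from decimal import Decimal, InvalidOperation


def coords_equal(a: str, b: str) -> bool:
    """Compare two coordinate strings numerically, so '45.500' equals '45.5'."""
    try:
        return Decimal(a.strip()) == Decimal(b.strip())
    except InvalidOperation:
        # Not numeric (for example an empty field): compare as plain strings.
        return a.strip() == b.strip()


print(coords_equal("45.500", "45.5"))   # True
print(coords_equal("-73.60", "-73.61"))  # False
```

Decimal is used here instead of float so that the comparison does not pick up binary rounding differences.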
@jcpitre Were the big feeds DELFI? I'm curious which datasets they were. IDFM told us their data is 120MB zipped.
mdb-2014 is the UK aggregate feed. Size is about 1.5GB zipped. Here are the datasets I used:
… value comparison, and refactor diff dispatch
    """
    a = a.strip()
    b = b.strip()
    if a == b:
We should do a case-insensitive comparison here.
Suggested change:
    - if a == b:
    + if a.lower() == b.lower():
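Put together, the suggestion amounts to something like the following sketch. The function name `values_equal` is an assumption, since the enclosing function is not visible in the quoted lines.

```python
def values_equal(a: str, b: str) -> bool:
    """Return True when two field values match, ignoring padding and case."""
    a = a.strip()
    b = b.strip()
    return a.lower() == b.lower()


print(values_equal(" Main St ", "main st"))  # True
print(values_equal("Main St", "Main Ave"))   # False
```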
Closes #2
This pull request introduces the initial release of the GTFS Diff Engine, a memory-efficient Python library and CLI for comparing two GTFS feeds and producing a structured diff conforming to the GTFS Diff v2 schema. The changes include a robust implementation of the core diff logic, a clear public API, a command-line interface, detailed documentation, and supporting scripts for end-to-end usage.
The most important changes are:
Core Functionality and API:
- engine.py, exposing a single diff_feeds() function that returns a typed Pydantic model representing the diff result.
- gtfs_definitions.py, with a helper for primary key lookup.

Command-Line Interface and Tooling:
- gtfs-diff in cli.py, supporting options for output file, row change cap, pretty-printing, and feed download timestamps.
- compare_feeds.sh to automate downloading two GTFS feeds by URL and running the diff tool, with argument parsing and error handling.

Documentation and Examples:
- README.md with a comprehensive overview, installation instructions, usage examples, API reference, supported files table, output schema example, and implementation notes on memory efficiency.
- docs/architecture.md detailing design goals, module structure, streaming diff algorithm, edge case handling, and future improvements.

Packaging and Project Setup:
- pyproject.toml for installation, development, and test dependencies, and sets up the CLI entry point.
- __init__.py and __main__.py.

These changes collectively deliver a ready-to-use, well-documented GTFS diff engine suitable for both programmatic and CLI-based workflows.
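For readers unfamiliar with streaming diffs, the general idea can be sketched as a merge over two inputs that are already sorted by primary key. This is a generic illustration under that sorting assumption and is not necessarily the algorithm described in docs/architecture.md; the name `stream_diff` and the sample rows are hypothetical.

```python
def stream_diff(old_rows, new_rows, key):
    """Yield (change, key_value) pairs from two iterables sorted by `key`.

    Only one row from each side is held in memory at a time.
    """
    old_iter, new_iter = iter(old_rows), iter(new_rows)
    old, new = next(old_iter, None), next(new_iter, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old[key] < new[key]):
            # Key exists only in the old feed.
            yield ("deleted", old[key])
            old = next(old_iter, None)
        elif old is None or new[key] < old[key]:
            # Key exists only in the new feed.
            yield ("added", new[key])
            new = next(new_iter, None)
        else:
            # Same key on both sides: report only if the row content differs.
            if old != new:
                yield ("modified", old[key])
            old, new = next(old_iter, None), next(new_iter, None)


old_stops = [
    {"stop_id": "A", "stop_name": "Alpha"},
    {"stop_id": "B", "stop_name": "Beta"},
    {"stop_id": "C", "stop_name": "Gamma"},
]
new_stops = [
    {"stop_id": "B", "stop_name": "Beta 2"},
    {"stop_id": "C", "stop_name": "Gamma"},
    {"stop_id": "D", "stop_name": "Delta"},
]

changes = list(stream_diff(old_stops, new_stops, "stop_id"))
print(changes)  # [('deleted', 'A'), ('modified', 'B'), ('added', 'D')]
```

Because the generator consumes its inputs lazily, the same function works on row iterators read from disk, provided both files are sorted by the same key.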