Merged
1 change: 1 addition & 0 deletions changelog.d/longwise-local-geography.md
@@ -0,0 +1 @@
- Add an OA-first long-format local geography weights artifact (`local_geography_weights.csv.gz`) so UK constituency and local-authority consumers can migrate away from dense area-by-household weight matrices.
4 changes: 4 additions & 0 deletions docs/oa_calibration_pipeline.md
@@ -132,15 +132,19 @@ Generate per-area H5 files from sparse L0-calibrated weights.

**Deliverables:**
- `policyengine_uk_data/calibration/publish_local_h5s.py` — extracts per-area H5 subsets from the sparse weight vector; each H5 contains only active households (non-zero weight) with their calibrated weights, plus the linked person and benunit rows
- `policyengine_uk_data/calibration/long_geography.py` — exports matrix-free local geography weights as an OA-first long table, with constituency and LA rows derived from assigned OA geography
- `datasets/create_datasets.py` — publish step wired in after calibration, before downrating
- `tests/test_publish_local_h5s.py` — 13 tests covering area-household mapping, H5 structure, pruned-household exclusion, weight correctness, person/benunit FK integrity, full publish cycle, summary statistics, and validation

**Key design:**
- `_get_area_household_indices()`: maps each area code to its household row indices via OA geography columns from clone-and-assign
- `write_long_geography_weights()`: writes `storage/local_geography_weights.csv.gz`, a long-format sidecar with `area_type`, `area_code`, household identifiers, source-year and source-household provenance, and weights. The production build writes assigned-geography rows only; explicit 2D H5 conversion is reserved for small compatibility checks, because expanding dense area-by-household matrices produces tables too large for routine builds
- `geography_support_report()`: summarizes low-support areas using unique source households and effective sample size, so clone count and future pooled-FRS builds can be evaluated without mistaking cloned rows for independent evidence
- `publish_area_h5()`: writes a single H5 per area — filters to active (non-zero weight) households, extracts linked persons and benunits via FK joins, stores as HDF5 groups with metadata attributes
- `publish_local_h5s()`: orchestrates the full publish cycle — loads L0 weight vector, iterates over all areas, writes H5 files to `storage/local_h5s/{area_type}/`, produces `_summary.csv` with per-area statistics
- `validate_local_h5s()`: post-publish validation checking file existence, HDF5 structure, and cross-area household ID uniqueness
- Supports both constituency (650) and LA (360) area types
- Zero-weight households (L0-pruned) are excluded from area H5 files — only active records are published
- The legacy `parliamentary_constituency_weights.h5` and `local_authority_weights.h5` artifacts are still produced during migration; new consumers should prefer the OA-first `local_geography_weights.csv.gz` sidecar.
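
As an illustration of the support summary described for `geography_support_report()`, here is a minimal sketch, not the PR's implementation. It assumes the long table has a weight column named `weight` (hypothetical; the docs only say "weights"), that provenance is stored as `source_year` and `source_household_id`, and that effective sample size means Kish's formula `(sum w)^2 / sum w^2` applied after pooling cloned rows by source household. The function name `support_by_area` and the area code `AREA_A` are made up for the example.

```python
from collections import defaultdict


def support_by_area(rows):
    """Summarize per-area support from long-format weight rows.

    Each row is a dict with area_type, area_code, source_year,
    source_household_id and weight (the weight column name is assumed).
    """
    # Pool weights by (area, source household) so clones of the same
    # survey household count as one piece of evidence.
    pooled = defaultdict(lambda: defaultdict(float))
    n_rows = defaultdict(int)
    for row in rows:
        area = (row["area_type"], row["area_code"])
        source = (row["source_year"], row["source_household_id"])
        pooled[area][source] += float(row["weight"])
        n_rows[area] += 1

    report = {}
    for area, by_source in pooled.items():
        weights = list(by_source.values())
        total = sum(weights)
        sum_sq = sum(w * w for w in weights)
        # Kish effective sample size; 0.0 when every weight is zero.
        ess = (total * total / sum_sq) if sum_sq > 0 else 0.0
        report[area] = {
            "n_rows": n_rows[area],
            "n_unique_sources": len(weights),
            "effective_sample_size": ess,
        }
    return report


# Three rows in one area, two of which are clones of the same source household.
rows = [
    {"area_type": "constituency", "area_code": "AREA_A", "household_id": 1,
     "source_household_id": 10, "source_year": 2023, "weight": 100.0},
    {"area_type": "constituency", "area_code": "AREA_A", "household_id": 2,
     "source_household_id": 10, "source_year": 2023, "weight": 100.0},
    {"area_type": "constituency", "area_code": "AREA_A", "household_id": 3,
     "source_household_id": 11, "source_year": 2023, "weight": 200.0},
]
report = support_by_area(rows)
```

Pooling before applying Kish's formula is one way to keep cloned rows from inflating the apparent support: the area above has three rows but only two independent source households.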

**US reference:** PR #465 (modal)
6 changes: 3 additions & 3 deletions policyengine_uk_data/calibration/clone_and_assign.py
@@ -20,7 +20,6 @@

from policyengine_uk_data.calibration.oa_assignment import (
assign_random_geography,
GeographyAssignment,
)

logger = logging.getLogger(__name__)
@@ -98,8 +97,6 @@ def clone_and_assign(
benunit = dataset.benunit

n_households = len(hh)
n_persons = len(person)
n_benunits = len(benunit)

logger.info(
"Cloning %d households x %d = %d total records",
@@ -192,6 +189,9 @@ def clone_and_assign(

# Clone household table
hh_clone = hh.copy()
hh_clone["source_household_id"] = hh_id_col
if "source_year" not in hh_clone.columns:
    hh_clone["source_year"] = dataset.time_period
hh_clone["household_id"] = new_hh_ids
hh_clone["household_weight"] = hh["household_weight"].values / n_clones
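
The hunk above records provenance on each clone and splits the original weight evenly across clones. A minimal stand-alone sketch of that idea, using plain dicts rather than the project's DataFrames, might look like the following. The function `clone_households`, the `default_year` parameter, and the sequential ID scheme are hypothetical; how the PR actually builds `new_hh_ids` is not shown in this diff.

```python
def clone_households(households, n_clones, default_year):
    """Clone each household n_clones times, keeping provenance columns.

    households: list of dicts with household_id and household_weight.
    default_year stands in for dataset.time_period (an assumption).
    """
    clones = []
    next_id = 1  # hypothetical ID scheme, for illustration only
    for _ in range(n_clones):
        for hh in households:
            clone = dict(hh)
            # Remember which survey household this clone came from.
            clone["source_household_id"] = hh["household_id"]
            # Only fill source_year when the input does not carry one.
            clone.setdefault("source_year", default_year)
            clone["household_id"] = next_id
            # Split the weight so the population total is unchanged.
            clone["household_weight"] = hh["household_weight"] / n_clones
            clones.append(clone)
            next_id += 1
    return clones


base = [
    {"household_id": 10, "household_weight": 300.0},
    {"household_id": 11, "household_weight": 150.0},
]
clones = clone_households(base, n_clones=3, default_year=2023)
total_before = sum(hh["household_weight"] for hh in base)
total_after = sum(hh["household_weight"] for hh in clones)
```

Dividing by `n_clones` keeps the weighted population total the same before and after cloning, and `source_household_id` lets downstream reports count unique source households rather than cloned rows.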
