Commit a30f4f9 (1 parent: 3f260a2)

add woodruff source, results, conclusion

11 files changed

Lines changed: 51 additions & 60 deletions

File tree

paper/bibliography/references.bib

Lines changed: 10 additions & 0 deletions
@@ -174,3 +174,13 @@ @article{pytorch2019
   volume = {32},
   year = {2019}
 }
+
+@techreport{woodruff2023survey,
+  title = {Surveying the (loss) landscape: using machine learning to improve household survey accuracy},
+  author = {Woodruff, Nikhil},
+  institution = {University of Durham},
+  year = {2023},
+  month = {April},
+  note = {Demonstrates superiority of machine learning approaches over traditional methods for survey enhancement through comprehensive benchmarking},
+  url = {https://github.com/policyengine/survey-enhance/blob/main/docs/paper/project_paper.pdf}
+}

paper/figures/data_flow.png

205 KB
Binary file not shown.

paper/figures/ecps_vs_cps_puf.png

64.3 KB
Binary file not shown.

paper/main.pdf

209 KB
Binary file not shown.

paper/sections/background.tex

Lines changed: 5 additions & 1 deletion
@@ -137,4 +137,8 @@ \subsection{Key Methodological Challenges}
 \item \textbf{Uncertainty Quantification}: Most models provide point estimates without formal measures of uncertainty from parameter estimates, data quality, or specification choices.
 \end{enumerate}
 
-Our methodology, detailed in Section~\ref{sec:methodology}, introduces novel approaches to these challenges while building on existing techniques that have proven successful. We particularly focus on quantifying and communicating uncertainty throughout the modeling process.
+Our methodology, detailed in Section~\ref{sec:methodology}, introduces novel approaches to these challenges while building on existing techniques that have proven successful. We particularly focus on quantifying and communicating uncertainty throughout the modeling process.
+
+\subsubsection{Empirical Evaluation of Enhancement Methods}
+
+Recent work has systematically compared different approaches to survey enhancement. \citet{woodruff2023survey} evaluated traditional techniques like percentile matching against machine learning methods including gradient descent reweighting and synthetic data generation. Their results showed ML-based approaches substantially outperforming conventional methods, with combined synthetic data and reweighting reducing error by 88\% compared to baseline surveys. Importantly, their cross-validation analysis demonstrated these improvements generalized to out-of-sample targets, suggesting the methods avoid overfitting to specific statistical measures. This empirical evidence informs our methodological choices, particularly around combining multiple enhancement techniques.

paper/sections/conclusion.tex

Lines changed: 8 additions & 0 deletions
@@ -1 +1,9 @@
 \section{Conclusion}
+
+This paper presents a novel approach to constructing enhanced microdata for tax-benefit microsimulation by combining survey and administrative data sources. Our methodology leverages machine learning techniques, specifically quantile regression forests and gradient descent optimization, to preserve the strengths of each source while mitigating their weaknesses. The resulting dataset outperforms both the Current Population Survey and IRS Public Use File across a majority of validation targets, with particularly strong improvements in areas crucial for policy analysis such as income distributions and program participation rates.
+
+The enhanced dataset addresses a key challenge in tax-benefit microsimulation: the need for both detailed demographic information and accurate tax/income data. By maintaining the CPS's rich household structure while incorporating the PUF's tax precision, our approach enables more reliable analysis of policies that depend on both demographic characteristics and economic circumstances. The systematic validation against hundreds of administrative targets provides confidence in the dataset's reliability while helping users understand its limitations.
+
+Our open-source implementation and automatically updated validation metrics establish a new standard for transparency in microsimulation data enhancement. This enables other researchers to build upon our work, adapt the methodology to other jurisdictions, or extend it to incorporate additional data sources. Future work could expand the approach to finer geographic levels, integrate data from additional surveys, or apply similar techniques to other domains requiring the combination of survey and administrative data.
+
+The enhanced CPS represents a significant advance in the quality of openly available microdata for tax-benefit analysis. By reducing error rates across a broad range of metrics while preserving essential relationships in the data, it provides a more reliable foundation for understanding the impacts of complex policy reforms on American households.

paper/sections/methodology.tex

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 \section{Methodology}\label{sec:methodology}
 
+\begin{figure}[h]
+\centering
+\includegraphics[width=\textwidth]{figures/data_flow.png}
+\caption{Data flow diagram for integrating CPS and PUF microdata. The process ages both datasets to a common year, integrates demographic and income information through quantile regression forests, and optimizes household weights using gradient descent.}
+\label{fig:data_flow}
+\end{figure}
+
 \input{sections/methodology/overview}
 \input{sections/methodology/demographic_variables}
 \input{sections/methodology/puf_preprocessing}

paper/sections/methodology/overview.tex

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 \subsection{Overview}
 
-Our approach enhances the Current Population Survey (CPS) with information from the IRS Public Use File (PUF) through a multi-stage process:
+Our approach enhances the Current Population Survey (CPS) with information from the IRS Public Use File (PUF) through a multi-stage process. This design is motivated by empirical evidence from \citet{woodruff2023survey} showing that combining synthetic data generation with weight optimization achieves substantially better results than either technique alone or traditional enhancement methods. Their comprehensive benchmarking demonstrated an 88\% reduction in survey error through this combined approach, with improvements that generalized across multiple validation metrics.
 
 \begin{enumerate}
 \item Train quantile regression forests on PUF tax records to learn distributions of tax-related variables

paper/sections/methodology/quantile_forests.tex

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 \subsection{Quantile Regression Forests}
 
-Our implementation uses quantile regression forests (QRF) \citep{meinshausen2006quantile}, which extend random forests to estimate conditional quantiles. We use the quantile-forest package \citep{zillow2024quantile}, a scikit-learn compatible implementation that provides efficient, Cython-optimized estimation of arbitrary quantiles at prediction time without retraining.
+Our implementation uses quantile regression forests (QRF) \citep{meinshausen2006quantile}, which extend random forests to estimate conditional quantiles. Building on \citet{woodruff2023survey}, we use the quantile-forest package \citep{zillow2024quantile}, a scikit-learn compatible implementation that provides efficient, Cython-optimized estimation of arbitrary quantiles at prediction time without retraining.
 
 QRF works by generating an ensemble of regression trees, where each tree recursively partitions the feature space. Unlike standard random forests that only store mean values in leaf nodes, QRF maintains the full empirical distribution of training observations in each leaf. To estimate conditional quantiles, the model identifies relevant leaf nodes for new observations, aggregates the weighted empirical distributions across all trees, and computes the desired quantiles from the combined distribution.
 
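The leaf-aggregation mechanism described in that paragraph can be sketched directly with scikit-learn's random forest. This is an illustrative stand-in for the quantile-forest package, not the paper's code: the `qrf_predict` helper and the toy data are assumptions, and the weighting follows Meinshausen's scheme (each training point is weighted by how often it shares a leaf with the query point, normalized by leaf size).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_predict(forest, X_train, y_train, x_new, quantiles):
    # Leaf membership of every training row in every tree.
    train_leaves = forest.apply(X_train)                # (n_train, n_trees)
    new_leaves = forest.apply(x_new.reshape(1, -1))[0]  # (n_trees,)
    # Meinshausen-style weights: a training point counts whenever it
    # shares a leaf with x_new, normalized by that leaf's size.
    weights = np.zeros(len(y_train))
    for tree, leaf in enumerate(new_leaves):
        in_leaf = train_leaves[:, tree] == leaf
        weights[in_leaf] += 1.0 / in_leaf.sum()
    weights /= forest.n_estimators
    # Read quantiles off the weighted empirical distribution.
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])
    return [y_train[order][min(np.searchsorted(cdf, q), len(cdf) - 1)]
            for q in quantiles]

# Toy data where the conditional distribution is known: y | x ~ N(x, 1).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = X[:, 0] + rng.normal(0.0, 1.0, size=500)
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                               random_state=0).fit(X, y)
q10, q50, q90 = qrf_predict(forest, X, y, np.array([5.0]), [0.1, 0.5, 0.9])
```

Unlike a standard forest prediction (a single conditional mean), the three returned values trace out the conditional distribution at x = 5, which is what lets the imputation step draw realistic values rather than regression-to-the-mean point estimates.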
paper/sections/methodology/reweighting.tex

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ \subsubsection{Optimization Implementation}
 
 \subsubsection{Dropout Application}
 
-The dropout process:
+We apply dropout regularization during optimization to prevent overfitting:
 \begin{itemize}
 \item Randomly masks p\% of weights each iteration (p = 5)
 \item Replaces masked weights with mean of unmasked weights
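The two bullets above can be sketched as a small NumPy gradient-descent loop. This is a simplified illustration, not the paper's implementation: the function name, toy targets, loss, and step size are assumptions, and the mean replacement is treated as a constant during differentiation for brevity. Optimizing log-weights keeps every household weight positive.

```python
import numpy as np

def reweight_with_dropout(M, targets, w0, iters=3000, lr=0.3, p=0.05, seed=0):
    """Illustrative sketch: fit household weights so that weighted
    aggregates M @ w match administrative targets, with the dropout
    scheme described above. M[i, j] is household j's contribution to
    target i."""
    rng = np.random.default_rng(seed)
    log_w = np.log(w0)
    for _ in range(iters):
        w = np.exp(log_w)
        # Randomly mask p% of weights each iteration and replace the
        # masked weights with the mean of the unmasked weights.
        mask = rng.random(w.size) < p
        w_used = np.where(mask, w[~mask].mean(), w)
        # Relative residuals against the targets.
        resid = (M @ w_used - targets) / targets
        # Gradient of sum(resid**2) w.r.t. log-weights; masked entries
        # receive no update this iteration (mean replacement treated
        # as constant for simplicity).
        grad = 2.0 * (M.T @ (resid / targets)) * w
        log_w -= lr * np.where(mask, 0.0, grad)
    return np.exp(log_w)

# Toy problem: 3 targets generated by hidden "true" weights over 50 households.
rng = np.random.default_rng(1)
M = rng.uniform(0.0, 1.0, size=(3, 50))
targets = M @ rng.uniform(1.0, 3.0, size=50)
w = reweight_with_dropout(M, targets, w0=np.ones(50))
rel_error = np.abs(M @ w - targets) / targets
```

Because each iteration zeroes the gradient for a random 5% of households, no single household's weight can be tuned continuously against the targets, which is the overfitting-prevention effect the bullets describe.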
