Commit a30f4f9 (1 parent: 3f260a2)

add woodruff source, results, conclusion

11 files changed

Lines changed: 51 additions & 60 deletions

File tree

paper/bibliography/references.bib

Lines changed: 10 additions & 0 deletions
@@ -174,3 +174,13 @@ @article{pytorch2019
   volume = {32},
   year = {2019}
 }
+
+@techreport{woodruff2023survey,
+  title = {Surveying the (loss) landscape: using machine learning to improve household survey accuracy},
+  author = {Woodruff, Nikhil},
+  institution = {University of Durham},
+  year = {2023},
+  month = {April},
+  note = {Demonstrates superiority of machine learning approaches over traditional methods for survey enhancement through comprehensive benchmarking},
+  url = {https://github.com/policyengine/survey-enhance/blob/main/docs/paper/project_paper.pdf}
+}

paper/figures/data_flow.png

205 KB
Binary file not shown.

paper/figures/ecps_vs_cps_puf.png

64.3 KB
Binary file not shown.

paper/main.pdf

209 KB
Binary file not shown.

paper/sections/background.tex

Lines changed: 5 additions & 1 deletion
@@ -137,4 +137,8 @@ \subsection{Key Methodological Challenges}
 \item \textbf{Uncertainty Quantification}: Most models provide point estimates without formal measures of uncertainty from parameter estimates, data quality, or specification choices.
 \end{enumerate}
 
-Our methodology, detailed in Section~\ref{sec:methodology}, introduces novel approaches to these challenges while building on existing techniques that have proven successful. We particularly focus on quantifying and communicating uncertainty throughout the modeling process.
+Our methodology, detailed in Section~\ref{sec:methodology}, introduces novel approaches to these challenges while building on existing techniques that have proven successful. We particularly focus on quantifying and communicating uncertainty throughout the modeling process.
+
+\subsubsection{Empirical Evaluation of Enhancement Methods}
+
+Recent work has systematically compared different approaches to survey enhancement. \citet{woodruff2023survey} evaluated traditional techniques like percentile matching against machine learning methods including gradient descent reweighting and synthetic data generation. Their results showed ML-based approaches substantially outperforming conventional methods, with combined synthetic data and reweighting reducing error by 88\% compared to baseline surveys. Importantly, their cross-validation analysis demonstrated these improvements generalized to out-of-sample targets, suggesting the methods avoid overfitting to specific statistical measures. This empirical evidence informs our methodological choices, particularly around combining multiple enhancement techniques.

paper/sections/conclusion.tex

Lines changed: 8 additions & 0 deletions
@@ -1 +1,9 @@
 \section{Conclusion}
+
+This paper presents a novel approach to constructing enhanced microdata for tax-benefit microsimulation by combining survey and administrative data sources. Our methodology leverages machine learning techniques, specifically quantile regression forests and gradient descent optimization, to preserve the strengths of each source while mitigating their weaknesses. The resulting dataset outperforms both the Current Population Survey and IRS Public Use File across a majority of validation targets, with particularly strong improvements in areas crucial for policy analysis such as income distributions and program participation rates.
+
+The enhanced dataset addresses a key challenge in tax-benefit microsimulation: the need for both detailed demographic information and accurate tax/income data. By maintaining the CPS's rich household structure while incorporating the PUF's tax precision, our approach enables more reliable analysis of policies that depend on both demographic characteristics and economic circumstances. The systematic validation against hundreds of administrative targets provides confidence in the dataset's reliability while helping users understand its limitations.
+
+Our open-source implementation and automatically updated validation metrics establish a new standard for transparency in microsimulation data enhancement. This enables other researchers to build upon our work, adapt the methodology to other jurisdictions, or extend it to incorporate additional data sources. Future work could expand the approach to finer geographic levels, integrate data from additional surveys, or apply similar techniques to other domains requiring the combination of survey and administrative data.
+
+The enhanced CPS represents a significant advance in the quality of openly available microdata for tax-benefit analysis. By reducing error rates across a broad range of metrics while preserving essential relationships in the data, it provides a more reliable foundation for understanding the impacts of complex policy reforms on American households.

paper/sections/methodology.tex

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,12 @@
 \section{Methodology}\label{sec:methodology}
 
+\begin{figure}[h]
+\centering
+\includegraphics[width=\textwidth]{figures/data_flow.png}
+\caption{Data flow diagram for integrating CPS and PUF microdata. The process ages both datasets to a common year, integrates demographic and income information through quantile regression forests, and optimizes household weights using gradient descent.}
+\label{fig:data_flow}
+\end{figure}
+
 \input{sections/methodology/overview}
 \input{sections/methodology/demographic_variables}
 \input{sections/methodology/puf_preprocessing}

paper/sections/methodology/overview.tex

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 \subsection{Overview}
 
-Our approach enhances the Current Population Survey (CPS) with information from the IRS Public Use File (PUF) through a multi-stage process:
+Our approach enhances the Current Population Survey (CPS) with information from the IRS Public Use File (PUF) through a multi-stage process. This design is motivated by empirical evidence from \citet{woodruff2023survey} showing that combining synthetic data generation with weight optimization achieves substantially better results than either technique alone or traditional enhancement methods. Their comprehensive benchmarking demonstrated an 88\% reduction in survey error through this combined approach, with improvements that generalized across multiple validation metrics.
 
 \begin{enumerate}
 \item Train quantile regression forests on PUF tax records to learn distributions of tax-related variables

paper/sections/methodology/quantile_forests.tex

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 \subsection{Quantile Regression Forests}
 
-Our implementation uses quantile regression forests (QRF) \citep{meinshausen2006quantile}, which extend random forests to estimate conditional quantiles. We use the quantile-forest package \citep{zillow2024quantile}, a scikit-learn compatible implementation that provides efficient, Cython-optimized estimation of arbitrary quantiles at prediction time without retraining.
+Our implementation uses quantile regression forests (QRF) \citep{meinshausen2006quantile}, which extend random forests to estimate conditional quantiles. Building on \citet{woodruff2023survey}, we use the quantile-forest package \citep{zillow2024quantile}, a scikit-learn compatible implementation that provides efficient, Cython-optimized estimation of arbitrary quantiles at prediction time without retraining.
 
 QRF works by generating an ensemble of regression trees, where each tree recursively partitions the feature space. Unlike standard random forests that only store mean values in leaf nodes, QRF maintains the full empirical distribution of training observations in each leaf. To estimate conditional quantiles, the model identifies relevant leaf nodes for new observations, aggregates the weighted empirical distributions across all trees, and computes the desired quantiles from the combined distribution.
 
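The leaf-aggregation mechanism described in that paragraph can be sketched directly with scikit-learn's random forest. This is an illustrative stand-in for the quantile-forest package, not the paper's code: the `qrf_predict` helper and the toy data are assumptions, and the weighting follows Meinshausen's scheme (each training point is weighted by how often it shares a leaf with the query point, normalized by leaf size).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_predict(forest, X_train, y_train, x_new, quantiles):
    # Leaf membership of every training row in every tree.
    train_leaves = forest.apply(X_train)                # (n_train, n_trees)
    new_leaves = forest.apply(x_new.reshape(1, -1))[0]  # (n_trees,)
    # Meinshausen-style weights: a training point counts whenever it
    # shares a leaf with x_new, normalized by that leaf's size.
    weights = np.zeros(len(y_train))
    for tree, leaf in enumerate(new_leaves):
        in_leaf = train_leaves[:, tree] == leaf
        weights[in_leaf] += 1.0 / in_leaf.sum()
    weights /= forest.n_estimators
    # Read quantiles off the weighted empirical distribution.
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])
    return [y_train[order][min(np.searchsorted(cdf, q), len(cdf) - 1)]
            for q in quantiles]

# Toy data where the conditional distribution is known: y | x ~ N(x, 1).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = X[:, 0] + rng.normal(0.0, 1.0, size=500)
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=20,
                               random_state=0).fit(X, y)
q10, q50, q90 = qrf_predict(forest, X, y, np.array([5.0]), [0.1, 0.5, 0.9])
```

Unlike a standard forest prediction (a single conditional mean), the three returned values trace out the conditional distribution at x = 5, which is what lets the imputation step draw realistic values rather than regression-to-the-mean point estimates.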
paper/sections/methodology/reweighting.tex

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ \subsubsection{Optimization Implementation}
 
 \subsubsection{Dropout Application}
 
-The dropout process:
+We apply dropout regularization during optimization to prevent overfitting:
 \begin{itemize}
 \item Randomly masks p\% of weights each iteration (p = 5)
 \item Replaces masked weights with mean of unmasked weights
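The two bullets above can be sketched as a small NumPy gradient-descent loop. This is a simplified illustration, not the paper's implementation: the function name, toy targets, loss, and step size are assumptions, and the mean replacement is treated as a constant during differentiation for brevity. Optimizing log-weights keeps every household weight positive.

```python
import numpy as np

def reweight_with_dropout(M, targets, w0, iters=3000, lr=0.3, p=0.05, seed=0):
    """Illustrative sketch: fit household weights so that weighted
    aggregates M @ w match administrative targets, with the dropout
    scheme described above. M[i, j] is household j's contribution to
    target i."""
    rng = np.random.default_rng(seed)
    log_w = np.log(w0)
    for _ in range(iters):
        w = np.exp(log_w)
        # Randomly mask p% of weights each iteration and replace the
        # masked weights with the mean of the unmasked weights.
        mask = rng.random(w.size) < p
        w_used = np.where(mask, w[~mask].mean(), w)
        # Relative residuals against the targets.
        resid = (M @ w_used - targets) / targets
        # Gradient of sum(resid**2) w.r.t. log-weights; masked entries
        # receive no update this iteration (mean replacement treated
        # as constant for simplicity).
        grad = 2.0 * (M.T @ (resid / targets)) * w
        log_w -= lr * np.where(mask, 0.0, grad)
    return np.exp(log_w)

# Toy problem: 3 targets generated by hidden "true" weights over 50 households.
rng = np.random.default_rng(1)
M = rng.uniform(0.0, 1.0, size=(3, 50))
targets = M @ rng.uniform(1.0, 3.0, size=50)
w = reweight_with_dropout(M, targets, w0=np.ones(50))
rel_error = np.abs(M @ w - targets) / targets
```

Because each iteration zeroes the gradient for a random 5% of households, no single household's weight can be tuned continuously against the targets, which is the overfitting-prevention effect the bullets describe.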
