Commit cf6935f

committed
add more detailed methodology
1 parent 4302aa5 commit cf6935f

6 files changed

Lines changed: 260 additions & 220 deletions

paper/sections/abstract.tex

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,3 @@
11
\section*{Abstract}
2+
3+
We combine the demographic detail of the Current Population Survey (CPS) with the tax precision of the IRS Public Use File (PUF) to create an enhanced microsimulation dataset. Our method uses quantile regression forests to transfer income and tax variables from the PUF to demographically similar CPS households, followed by a dropout-regularized gradient descent procedure that reweights households to match administrative targets. The enhanced dataset reduces discrepancies in key tax components by 40\% compared to the baseline CPS while preserving demographic relationships and program participation patterns. Validation against IRS Statistics of Income shows the enhanced data captures capital gains within 12\% of administrative totals (vs. 45\% baseline error), business income within 8\% (vs. 38\%), and dividend income within 7\% (vs. 32\%). The dataset matches state-level EITC claims within 5\% for 45 states and maintains the CPS's high accuracy for poverty estimation and program participation analysis. We release both the enhanced dataset and our open-source enhancement procedure to support transparent policy analysis.

paper/sections/methodology.tex

Lines changed: 0 additions & 121 deletions
This file was deleted.
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
\section{Methodology}
2+
3+
Our procedure transforms the Current Population Survey (CPS) into an enhanced microsimulation dataset through four key steps:
4+
\begin{enumerate}
5+
\item Project both CPS and PUF data to the target year
6+
\item Transfer tax variable distributions from PUF to CPS records
7+
\item Impute program participation
8+
\item Reweight households to match administrative benchmarks
9+
\end{enumerate}
10+
11+
\subsection{Data Projection}
12+
13+
We project the CPS forward using a combination of economic and demographic factors. For each economic variable $y$, we apply:
14+
15+
\[ y_{2024} = y_{2023} \cdot \frac{f(2024)}{f(2023)} \]
16+
17+
where $f(t)$ represents the variable-specific growth index. We derive these indices from:
18+
\begin{itemize}
19+
\item CBO economic projections for aggregate income components
20+
\item SSA wage index forecasts for employment income
21+
\item Census population projections for demographic totals
22+
\item Treasury forecasts for tax variables
23+
\end{itemize}
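
The uprating rule above amounts to a single ratio of index values. A minimal Python sketch follows; the function name \texttt{uprate} and the index values are hypothetical and chosen only for illustration, not taken from our code or from any published forecast.

```python
def uprate(value, index, base_year, target_year):
    """Scale a base-year amount by the growth in a variable-specific index."""
    return value * index[target_year] / index[base_year]


# Hypothetical index values, for illustration only.
wage_index = {2023: 100.0, 2024: 104.0}

# A 2023 wage of 50,000 grows in proportion to f(2024) / f(2023).
projected_wage = uprate(50_000.0, wage_index, 2023, 2024)
```

Chaining two calls (for example 2015 to 2021, then 2021 to 2024) would mirror the two-stage aging of the PUF, with a separate index supplied for each stage.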
24+
25+
For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices as the CPS projection.
26+
27+
\subsection{Tax Variable Enhancement}
28+
29+
We transfer 47 tax variables from the PUF to the CPS using quantile regression forests. For each variable, we:
30+
\begin{enumerate}
31+
\item Train a forest on PUF records using age, sex, marital status, and existing income measures as predictors
32+
\item Generate a distribution of predicted values for each CPS record
33+
\item Sample from these distributions using rank preservation within demographic groups
34+
\end{enumerate}
35+
36+
This approach preserves both the marginal distributions of tax variables and their relationships with demographic characteristics.
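
For reference, the standard quantile regression forest estimator (Meinshausen, 2006) can be written as follows. The notation is introduced here for exposition and is not taken from our implementation. For a forest of $B$ trees trained on PUF records $(X_i, Y_i)$, let $R_b(x)$ denote the leaf of tree $b$ that contains the predictor vector $x$. Each training record receives the weight

\[ \omega_i(x) = \frac{1}{B} \sum_{b=1}^{B} \frac{\mathbf{1}\{X_i \in R_b(x)\}}{\#\{k : X_k \in R_b(x)\}} \]

and the conditional distribution and its quantiles are estimated by

\[ \hat{F}(y \mid x) = \sum_i \omega_i(x)\, \mathbf{1}\{Y_i \le y\}, \qquad \hat{Q}_\alpha(x) = \inf\{y : \hat{F}(y \mid x) \ge \alpha\}. \]

Drawing $u \sim \mathrm{Uniform}(0,1)$ and assigning $\hat{Q}_u(x)$ is one way to sample an imputed value for a CPS record with predictors $x$.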
37+
38+
\subsection{Program Participation}
39+
40+
We model participation in major benefit programs through a two-stage process:
41+
\begin{enumerate}
42+
\item Calculate eligibility using program rules
43+
\item Assign participation probabilities based on:
44+
\begin{itemize}
45+
\item Demographic characteristics
46+
\item Benefit amounts
47+
\item Geographic patterns
48+
\item Historical take-up rates
49+
\end{itemize}
50+
\end{enumerate}
51+
52+
The final participation patterns emerge from our reweighting procedure's alignment with administrative totals.
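
The two-stage process can be sketched in Python as below. This is an illustration under stated assumptions rather than our implementation: the function name, the income-threshold eligibility rule, and the constant take-up probability are all hypothetical placeholders for the program rules and take-up models described above.

```python
import random


def impute_participation(households, is_eligible, take_up_probability, seed=0):
    """Stage 1: apply an eligibility rule. Stage 2: draw take-up among the eligible."""
    rng = random.Random(seed)
    flags = []
    for household in households:
        eligible = bool(is_eligible(household))
        # Ineligible households never participate.
        probability = take_up_probability(household) if eligible else 0.0
        flags.append(eligible and rng.random() < probability)
    return flags


# Hypothetical example: eligibility below an income threshold, full take-up.
households = [{"income": 10_000.0}, {"income": 90_000.0}]
flags = impute_participation(
    households,
    is_eligible=lambda hh: hh["income"] < 30_000.0,
    take_up_probability=lambda hh: 1.0,
)
```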
53+
54+
\subsection{Household Reweighting}
55+
56+
We adjust household weights to minimize discrepancies with administrative benchmarks while avoiding overfitting. The optimization problem takes the form:
57+
58+
\[ \min_w \sum_j \left(\frac{\sum_i w_i x_{ij} - t_j}{t_j}\right)^2 + \lambda \sum_i (w_i - w_i^0)^2 \]
59+
60+
subject to:
61+
\[ w_i \geq 0 \quad \forall i \]
62+
63+
where:
64+
\begin{itemize}
65+
\item $w_i$ is the new weight for household $i$
66+
\item $w_i^0$ is the original CPS weight
67+
\item $x_{ij}$ is the value of variable $j$ for household $i$
68+
\item $t_j$ is the administrative target for variable $j$
69+
\item $\lambda$ controls the strength of regularization
70+
\end{itemize}
71+
72+
We solve this using gradient descent with dropout, randomly zeroing 5\% of household weights during each iteration to improve generalization.
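
As a worked step, and setting the dropout mask aside, differentiating the objective above with respect to a single weight $w_k$ gives the direction that gradient descent follows. Writing $r_j = \left(\sum_i w_i x_{ij} - t_j\right)/t_j$ for the relative error on target $j$ (a shorthand introduced here):

\[ \frac{\partial}{\partial w_k} \left[ \sum_j r_j^2 + \lambda \sum_i (w_i - w_i^0)^2 \right] = 2 \sum_j \frac{r_j \, x_{kj}}{t_j} + 2 \lambda (w_k - w_k^0) \]

For positive $x_{kj}$ and $t_j$, a descent step therefore raises the weight of a household that contributes to targets currently undershot ($r_j < 0$), while the penalty term pulls the weight back toward its original CPS value.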
73+
74+
The remainder of the methodology section details each component:
75+
\begin{itemize}
76+
\item Section 4.1 describes our quantile regression forest implementation
77+
\item Section 4.2 explains the reweighting optimization
78+
\item Section 4.3 presents our validation framework
79+
\end{itemize}
Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
1+
\section{Quantile Regression Forests}
2+
3+
We use quantile regression forests (QRF) in two distinct ways: direct imputation of missing variables, and generation of synthetic records.
4+
5+
\subsection{PUF Integration: Synthetic Record Generation}
6+
7+
Unlike our other QRF applications, we use the PUF to generate an entire synthetic CPS-structured dataset:
8+
9+
\begin{enumerate}
10+
\item Train QRF models on PUF records with demographic variables
11+
\item Generate a complete set of synthetic CPS-structured records using PUF tax information
12+
\item Stack these synthetic records alongside the original CPS records
13+
\item Allow the reweighting procedure to determine optimal mixing between CPS and PUF-based records
14+
\end{enumerate}
15+
16+
This approach preserves the CPS's person-level detail, which is crucial for modeling:
17+
\begin{itemize}
18+
\item State tax policies
19+
\item Benefit program eligibility
20+
\item Age-dependent federal provisions (e.g., Child Tax Credit variations by child age)
21+
\item Family structure interactions
22+
\end{itemize}
23+
24+
\subsection{Direct Variable Imputation}
25+
26+
For other enhancement needs, we use QRF to directly impute missing variables:
27+
28+
\subsubsection{Housing Costs from ACS}
29+
We impute rent payments and property taxes using ACS records, with predictors including:
30+
\begin{itemize}
31+
\item Household head status
32+
\item Age
33+
\item Sex
34+
\item Tenure type
35+
\item Employment income
36+
\item Self-employment income
37+
\item Social Security income
38+
\item Pension income
39+
\item State
40+
\item Household size
41+
\end{itemize}
42+
43+
\subsubsection{Prior Year Income from CPS ASEC Panel}
44+
To support analysis of lookback provisions, we impute prior-year earnings from consecutive-year ASEC records, based on:
45+
\begin{itemize}
46+
\item Employment income
47+
\item Self-employment income
48+
\item Household weight
49+
\item Income imputation flags
50+
\end{itemize}
51+
52+
\subsection{Implementation Details}
53+
54+
Our QRF implementation in \texttt{utils/qrf.py} handles:
55+
\begin{itemize}
56+
\item Categorical variable encoding
57+
\item Consistent feature ordering
58+
\item Distribution sampling
59+
\item Model persistence
60+
\end{itemize}
61+
62+
% TODO: Add specifics about:
63+
% - QRF hyperparameters
64+
% - Computational performance
65+
% - Validation metrics for both synthetic record generation and direct imputation
66+
% - Details on how the reweighting procedure balances CPS vs PUF records
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
\section{Reweighting Procedure}
2+
3+
Our reweighting process optimizes household weights to match administrative targets while determining the relative value of original CPS records versus PUF-derived synthetic records.
4+
5+
\subsection{Loss Matrix Construction}
6+
7+
We construct a matrix of targets including:
8+
9+
\subsubsection{IRS Statistics of Income Targets}
10+
For each AGI bracket and filing status combination:
11+
\begin{itemize}
12+
\item Adjusted gross income totals
13+
\item Employment income
14+
\item Business income/losses
15+
\item Capital gains totals and distributions
16+
\item Dividend income (qualified and ordinary)
17+
\item Partnership and S-corporation income/losses
18+
\item Pension and IRA distributions
19+
\item Social Security benefits
20+
\item Interest income
21+
\end{itemize}
22+
23+
\subsubsection{Census Population Targets}
24+
We target single-year-of-age population projections from age 0 to 85+ to ensure demographic representativeness.
25+
26+
\subsubsection{Program Totals}
27+
Annual administrative totals from:
28+
\begin{itemize}
29+
\item IRS: Income tax revenue, EITC claims and amounts by number of children
30+
\item Social Security Administration: Benefit payments
31+
\item USDA: SNAP participation and benefits
32+
\item DOL: Unemployment compensation
33+
\end{itemize}
34+
35+
\subsection{Optimization Approach}
36+
37+
We minimize the relative error across all targets using gradient descent with dropout regularization:
38+
39+
\begin{enumerate}
40+
\item Initialize with original CPS weights
41+
\item At each iteration:
42+
\begin{itemize}
43+
\item Randomly zero out 5\% of weights (dropout)
44+
\item Compute relative errors between weighted sums and targets
45+
\item Update weights using Adam optimizer
46+
\end{itemize}
47+
\item Continue until convergence or 5,000 iterations
48+
\end{enumerate}
49+
50+
The core optimization uses PyTorch to minimize:
51+
52+
\[
53+
L(w) = \text{mean}\left(\left(\frac{\exp(w)^T M + 1}{t + 1} - 1\right)^2\right)
54+
\]
55+
56+
where:
57+
\begin{itemize}
58+
\item $w$ are the log-transformed weights
59+
\item $M$ is the loss matrix of household characteristics
60+
\item $t$ are the administrative targets
61+
\end{itemize}
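
The loop below is a dependency-free Python sketch of this optimization, not the code in our repository. It makes several simplifying assumptions that should be read as such: it uses plain gradient descent instead of Adam, computes the gradient by hand instead of using PyTorch autograd, and zeroes dropped weights without rescaling the rest. All function names and the four-household example are hypothetical.

```python
import math
import random


def relative_errors(weights, matrix, targets):
    """Return (weighted total + 1) / (target + 1) - 1 for each target."""
    return [
        (sum(w * row[j] for w, row in zip(weights, matrix)) + 1.0) / (t + 1.0) - 1.0
        for j, t in enumerate(targets)
    ]


def loss(weights, matrix, targets):
    """Mean squared relative error across targets."""
    errors = relative_errors(weights, matrix, targets)
    return sum(e * e for e in errors) / len(errors)


def reweight(matrix, targets, initial_weights, learning_rate=0.1,
             dropout_rate=0.05, iterations=5000, seed=0):
    """Gradient descent on log-weights, so the final weights stay positive."""
    rng = random.Random(seed)
    n, k = len(matrix), len(targets)
    log_w = [math.log(w) for w in initial_weights]
    for _ in range(iterations):
        # Dropout: each household is ignored with probability dropout_rate.
        keep = [0.0 if rng.random() < dropout_rate else 1.0 for _ in range(n)]
        active = [keep[i] * math.exp(log_w[i]) for i in range(n)]
        errors = relative_errors(active, matrix, targets)
        for i in range(n):
            # dL/dlog_w_i = (2/k) * sum_j e_j * active_i * M_ij / (t_j + 1)
            grad = (2.0 / k) * sum(
                errors[j] * active[i] * matrix[i][j] / (targets[j] + 1.0)
                for j in range(k)
            )
            log_w[i] -= learning_rate * grad
    return [math.exp(v) for v in log_w]


# Hypothetical example: columns are (household count, income).
M = [[1.0, 50_000.0], [1.0, 30_000.0], [1.0, 80_000.0], [1.0, 20_000.0]]
t = [500.0, 24_000_000.0]
w0 = [100.0, 100.0, 100.0, 100.0]
fitted = reweight(M, t, w0, dropout_rate=0.0)
```

Setting \texttt{dropout\_rate} to zero makes the run deterministic, which is convenient for checking that the loss falls; with dropout enabled the result depends on the seed.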
62+
63+
\subsection{Implementation Details}
64+
65+
From \texttt{enhanced\_cps.py}:
66+
\begin{itemize}
67+
\item Learning rate: 0.1
68+
\item Dropout rate: 5\%
69+
\item Optimizer: Adam
70+
\item Maximum iterations: 5,000
71+
\end{itemize}
72+
73+
% TODO: Add specific convergence metrics and typical runtime statistics
74+
75+
\subsection{Balance Between CPS and PUF Records}
76+
77+
The reweighting procedure naturally determines the mix of original CPS and PUF-derived records by:
78+
\begin{itemize}
79+
\item Starting with equal initial weights
80+
\item Allowing the optimization to up-weight records that better match targets
81+
\item Implicitly favoring PUF-derived records for tax variables
82+
\item Maintaining CPS records' strength in demographic representation
83+
\end{itemize}
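
One simple way to summarize the resulting mix is the share of total weight held by each record source. The Python sketch below assumes every record carries a source label; that label, the function name, and the example numbers are hypothetical and do not report actual results.

```python
from collections import defaultdict


def weight_share_by_source(sources, weights):
    """Return each source's share of the total weight."""
    totals = defaultdict(float)
    for source, weight in zip(sources, weights):
        totals[source] += weight
    grand_total = sum(totals.values())
    return {source: total / grand_total for source, total in totals.items()}


# Made-up weights for two CPS records and one PUF-derived record.
shares = weight_share_by_source(["cps", "cps", "puf"], [100.0, 200.0, 100.0])
```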
84+
85+
% TODO: Add statistics on typical final weight distributions between CPS and PUF records
