paper/sections/data.tex
Lines changed: 17 additions & 2 deletions
@@ -1,6 +1,5 @@
1
1
\section{Data}\label{sec:data}
2
2
3
-
4
3
\subsection{Current Population Survey}
5
4
6
5
The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) provides comprehensive demographic and economic information for a nationally representative sample of U.S. households. For tax year 2024, our base dataset contains approximately 150,000 households representing the U.S. civilian non-institutional population.
@@ -24,7 +23,23 @@ \subsection{Current Population Survey}
24
23
25
24
\subsection{IRS Public Use File}
26
25
27
-
The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, containing approximately 200,000 records. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties. Our analysis uses the 2015 PUF, the most recent available, aged to 2024.
26
+
The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \cite{bryant2022}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
27
+
28
+
The Public Use Tax Demographic File supplements the PUF with:
29
+
\begin{itemize}
30
+
\item Age ranges for primary taxpayers (different ranges for dependent vs non-dependent filers)
31
+
\item Dependent age information in six categories (under 5, 5 to under 13, 13 to under 17, 17 to under 19, 19 to under 24, 24 or older)
32
+
\item Gender of primary taxpayer
33
+
\item Earnings splits for joint filers (categorizing primary earner share)
34
+
\end{itemize}
35
+
36
+
Key disclosure protections include:
37
+
\begin{itemize}
38
+
\item Demographic information limited to returns in strata 7--13
39
+
\item Suppression of dependent ages for returns with farm income or homebuyer credits
40
+
\item Minimum population thresholds for dependent age reporting
41
+
\item Sequential limits on dependent counts by filing status
Our procedure transforms the Current Population Survey (CPS) into an enhanced microsimulation dataset through four key steps:
3
+
Following \cite{bryant2022}, our procedure enhances the Current Population Survey (CPS) with tax information from the Public Use File (PUF) through four key steps:
4
4
\begin{enumerate}
5
5
\item Project both CPS and PUF data to the target year
6
6
\item Transfer tax variable distributions from PUF to CPS records
7
-
\item Impute program participation
7
+
\item Generate dependent age, primary age, and earnings split variables
8
8
\item Reweight households to match administrative benchmarks
9
9
\end{enumerate}
10
10
11
11
\subsection{Data Projection}
12
12
13
-
We project the CPS forward using a combination of economic and demographic factors. For each economic variable $y$, we apply:
13
+
We project both datasets forward using variable-specific growth indices $f(t)$:
where $f(t)$ represents the variable-specific growth index. We derive these indices from:
17
+
The indices come from:
18
18
\begin{itemize}
19
-
\item CBO economic projections for aggregate income components
19
+
\item CBO economic projections for income components
20
20
\item SSA wage index forecasts for employment income
21
-
\item Census population projections for demographic totals
21
+
\item Census population projections for demographics
22
22
\item Treasury forecasts for tax variables
23
23
\end{itemize}
24
24
25
-
For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices as the CPS projection.
25
+
For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices.
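
The two-stage aging of the PUF can be sketched as follows. The diff does not show the projection equation itself, so this sketch assumes the common ratio form $y_t = y_{t_0} \, f(t) / f(t_0)$; the function name and all index values are hypothetical and serve only as illustration.

```python
def age_amount(amount, index, base_year, target_year):
    """Scale an amount by the growth of an index between two years.

    Assumes the ratio form y_t = y_t0 * f(t) / f(t0); the source does not
    show its projection equation, so this form is an assumption.
    """
    return amount * index[target_year] / index[base_year]


# Hypothetical index values, for illustration only.
soi_index = {2015: 1.00, 2021: 1.20}         # stage 1: SOI-based aging
projection_index = {2021: 1.00, 2024: 1.10}  # stage 2: shared projection index

amount_2021 = age_amount(50_000.0, soi_index, 2015, 2021)
amount_2024 = age_amount(amount_2021, projection_index, 2021, 2024)
```

Chaining the two calls mirrors the description above: the 2015 amount is first brought to 2021 and only then projected to 2024 with the index shared with the CPS.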
26
26
27
-
\subsection{Tax Variable Enhancement}
27
+
\subsection{Demographic Variable Construction}
28
+
29
+
Following \cite{bryant2022}, we construct several key demographic variables:
30
+
31
+
\subsubsection{Dependent Ages}
32
+
We create three dependent age variables (AGEDP1/2/3) capturing:
33
+
\begin{itemize}
34
+
\item Up to 3 dependents for joint and head-of-household (HOH) returns
35
+
\item Up to 2 dependents for single returns
36
+
\item Up to 1 dependent for married-filing-separately (MFS) returns
37
+
\end{itemize}
38
+
39
+
Ages are categorized as:
40
+
\begin{itemize}
41
+
\item Under 5
42
+
\item 5 under 13
43
+
\item 13 under 17
44
+
\item 17 under 19
45
+
\item 19 under 24
46
+
\item 24 or older
47
+
\end{itemize}
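
As a minimal sketch of this categorization, the six ranges can be encoded as lower-bound cutoffs. The integer codes 1 through 6 and the function name are assumptions made for illustration; the source does not state how the categories are coded.

```python
from bisect import bisect_right

# Lower bounds of the second through sixth dependent age categories.
DEPENDENT_AGE_CUTOFFS = [5, 13, 17, 19, 24]


def dependent_age_category(age: int) -> int:
    """Return an illustrative 1-based category code for a dependent's age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the cutoffs that are <= age, so an age of exactly 5
    # falls in "5 under 13" rather than "Under 5".
    return bisect_right(DEPENDENT_AGE_CUTOFFS, age) + 1
```

Treating each range as closed on the left and open on the right matches the "5 under 13" wording, where the upper bound is excluded.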
28
48
29
-
We transfer 47 tax variables from the PUF to the CPS using quantile regression forests. For each variable, we:
49
+
Dependents are ordered sequentially by type:
30
50
\begin{enumerate}
31
-
\item Train a forest on PUF records using age, sex, marital status, and existing income measures as predictors
32
-
\item Generate a distribution of predicted values for each CPS record
33
-
\item Sample from these distributions using rank preservation within demographic groups
51
+
\item Children living at home
52
+
\item Children living away from home
53
+
\item Other dependents
54
+
\item Parents
34
55
\end{enumerate}
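
The ordering rule and the per-status limits can be combined in a short sketch. The dictionary keys and the function name are hypothetical labels, and each dependent is represented as a (type, age) pair purely for illustration.

```python
# Maximum number of dependent age slots (AGEDP1-AGEDP3) by filing status,
# as listed in the text. The keys are hypothetical labels.
MAX_DEPENDENTS = {"joint": 3, "hoh": 3, "single": 2, "mfs": 1}

# Priority of dependent types, following the enumerated order in the text.
TYPE_ORDER = {"child_home": 0, "child_away": 1, "other": 2, "parent": 3}


def select_dependents(dependents, filing_status):
    """Order (type, age) pairs by dependent type and keep the allowed number."""
    # sorted() is stable, so dependents of the same type keep their input order.
    ordered = sorted(dependents, key=lambda dep: TYPE_ORDER[dep[0]])
    return ordered[: MAX_DEPENDENTS[filing_status]]
```

For example, a single return with a parent, a child at home, and a child away from home would keep the two children and drop the parent.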
35
56
36
-
This approach preserves both the marginal distributions of tax variables and their relationships with demographic characteristics.
57
+
\subsubsection{Primary Taxpayer Age}
58
+
We construct age ranges differently for:
59
+
60
+
Non-dependent returns:
61
+
\begin{itemize}
62
+
\item Under 26
63
+
\item 26 under 35
64
+
\item 35 under 45
65
+
\item 45 under 55
66
+
\item 55 under 65
67
+
\item 65 or older
68
+
\end{itemize}
69
+
70
+
Dependent returns:
71
+
\begin{itemize}
72
+
\item Under 18
73
+
\item 18 under 26
74
+
\item 26 or older
75
+
\end{itemize}
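
A sketch of how the two sets of ranges might be applied, switching cutoffs on whether the return is a dependent return. The function name, the boolean flag, and the 1-based codes are assumptions for illustration.

```python
from bisect import bisect_right

# Lower bounds of each range after the first, taken from the lists above.
NON_DEPENDENT_CUTOFFS = [26, 35, 45, 55, 65]
DEPENDENT_CUTOFFS = [18, 26]


def primary_age_range(age: int, is_dependent_return: bool) -> int:
    """Return an illustrative 1-based age-range code for the primary taxpayer."""
    cutoffs = DEPENDENT_CUTOFFS if is_dependent_return else NON_DEPENDENT_CUTOFFS
    # Ranges are closed on the left: age 26 falls in "26 under 35".
    return bisect_right(cutoffs, age) + 1
```

Note that the same code means different ranges in the two schemes, so the dependent-return flag has to be carried alongside the code.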
76
+
77
+
\subsubsection{Earnings Splits}
78
+
For joint returns, we calculate earnings splits using:
79
+
\[\text{Primary Share} = \frac{\text{Primary Wages} + \text{Primary SE Income}}{\text{Total Wages} + \text{Total SE Income}} \]
80
+
81
+
Where:
82
+
\begin{itemize}
83
+
\item Primary wages and self-employment (SE) income = \texttt{E30400} $-$ \texttt{E30500}
84
+
\item Secondary wages and SE income = \texttt{E30500}
85
+
\end{itemize}
86
+
87
+
We categorize the splits as:
88
+
\begin{itemize}
89
+
\item 75 percent or more earned by primary
90
+
\item Less than 75 percent but more than 25 percent earned by primary
91
+
\item Less than 25 percent earned by primary
92
+
\end{itemize}
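
The share and its category can be sketched as below. Because primary income is E30400 minus E30500 and secondary income is E30500, the combined total is E30400. Two points are assumptions of this sketch rather than statements from the source: the share is left undefined when the total is not positive, and a share of exactly 25 percent, which the list above leaves unassigned, is placed in the lowest category. The function names and the codes 1 through 3 are illustrative.

```python
from typing import Optional


def primary_share(e30400: float, e30500: float) -> Optional[float]:
    """Primary earner's share of combined wage and SE income on a joint return.

    E30500 is the secondary earner's amount and E30400 - E30500 the primary
    earner's, so E30400 is the combined total.
    """
    if e30400 <= 0:
        # Assumption: the share is undefined when the total is not positive.
        return None
    return (e30400 - e30500) / e30400


def split_category(share: float) -> int:
    """Map a primary share to an illustrative category code 1-3."""
    if share >= 0.75:
        return 1  # 75 percent or more earned by primary
    if share > 0.25:
        return 2  # between 25 and 75 percent earned by primary
    # Assumption: exactly 25 percent is grouped with "less than 25 percent".
    return 3
```

For a joint return with E30400 of 100,000 and E30500 of 20,000, the primary share is 0.8, which falls in the first category.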
93
+
94
+
\subsection{Tax Variable Enhancement}
37
95
38
-
\subsection{Program Participation}
96
+
We transfer tax variables from the PUF to the CPS using quantile regression forests trained on:
97
+
\begin{itemize}
98
+
\item Constructed demographic variables described above
99
+
\item Filing status
100
+
\item Existing income measures
101
+
\end{itemize}
39
102
40
-
We model participation in major benefit programs through a two-stage process:
103
+
For each variable, we:
41
104
\begin{enumerate}
42
-
\item Calculate eligibility using program rules
43
-
\item Assign participation probabilities based on:
44
-
\begin{itemize}
45
-
\item Demographic characteristics
46
-
\item Benefit amounts
47
-
\item Geographic patterns
48
-
\item Historical take-up rates
49
-
\end{itemize}
105
+
\item Train a forest on PUF records
106
+
\item Generate predicted distributions for CPS records
107
+
\item Sample preserving rank within demographic groups
50
108
\end{enumerate}
51
109
52
-
The final participation patterns emerge from our reweighting procedure's alignment with administrative totals.
53
-
54
110
\subsection{Household Reweighting}
55
111
56
-
We adjust household weights to minimize discrepancies with administrative benchmarks while avoiding overfitting. The optimization problem takes the form:
112
+
We adjust household weights to minimize discrepancies with administrative targets while avoiding overfitting: