Commit 09e5e4c

add to puf sourcing
1 parent e1fc3ba commit 09e5e4c

5 files changed

Lines changed: 112 additions & 40 deletions

paper/bibliography/references.bib

Lines changed: 10 additions & 0 deletions
@@ -122,3 +122,13 @@ @article{auerbach2018
 pages = {541--576},
 year = {2018}
 }
+
+@techreport{bryant2022,
+title = {General Description Booklet for the 2015 Public Use Tax File Demographic File},
+author = {Bryant, Victoria},
+institution = {Statistics of Income Division, Internal Revenue Service},
+year = {2022},
+month = {September},
+type = {Technical Documentation},
+url = {https://drive.google.com/file/d/1WoTU70GEjYMO0KHsHvTTH0NwCc-kN5cE/view}
+}

paper/main.pdf

3.5 KB
Binary file not shown.

paper/sections/data.tex

Lines changed: 17 additions & 2 deletions
@@ -1,6 +1,5 @@
 \section{Data}\label{sec:data}
 
-
 \subsection{Current Population Survey}
 
 The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) provides comprehensive demographic and economic information for a nationally representative sample of U.S. households. For tax year 2024, our base dataset contains approximately 150,000 households representing the U.S. civilian non-institutional population.
@@ -24,7 +23,23 @@ \subsection{Current Population Survey}
 
 \subsection{IRS Public Use File}
 
-The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, containing approximately 200,000 records. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties. Our analysis uses the 2015 PUF, the most recent available, aged to 2024.
+The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \cite{bryant2022}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
+
+The Public Use Tax Demographic File supplements the PUF with:
+\begin{itemize}
+\item Age ranges for primary taxpayers (different ranges for dependent vs non-dependent filers)
+\item Dependent age information in six categories (under 5, 5-13, 13-17, 17-19, 19-24, 24+)
+\item Gender of primary taxpayer
+\item Earnings splits for joint filers (categorizing primary earner share)
+\end{itemize}
+
+Key disclosure protections include:
+\begin{itemize}
+\item Demographic information limited to returns in strata 7-13
+\item Suppression of dependent ages for returns with farm income or homebuyer credits
+\item Minimum population thresholds for dependent age reporting
+\item Sequential limits on dependent counts by filing status
+\end{itemize}
 
 The PUF's key strengths include:
 \begin{itemize}
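
Reviewer note on the hunk above: the dependent age categories and the joint-filer earnings split are both simple threshold encodings, so a small sketch may help when checking the text against the methodology overview diff later in this commit. The following is a minimal Python illustration, not the project's code. The function names are hypothetical, the category boundaries are copied from the commit text ("5 under 13" is read as 5 inclusive to 13 exclusive), and the handling of a share of exactly 25 percent is an assumption, since the text assigns that value to neither the middle nor the lowest category.

```python
from bisect import bisect_right
from typing import Optional

# Lower bounds of the upper five dependent age categories, from the commit text:
# under 5, 5 under 13, 13 under 17, 17 under 19, 19 under 24, 24 or older.
DEPENDENT_AGE_BOUNDS = [5, 13, 17, 19, 24]
DEPENDENT_AGE_LABELS = [
    "Under 5",
    "5 under 13",
    "13 under 17",
    "17 under 19",
    "19 under 24",
    "24 or older",
]


def dependent_age_category(age: int) -> str:
    """Map an age in whole years to its category label (hypothetical helper)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds are <= age, which is the label index.
    return DEPENDENT_AGE_LABELS[bisect_right(DEPENDENT_AGE_BOUNDS, age)]


def earnings_split_category(primary: float, secondary: float) -> Optional[str]:
    """Classify the primary earner's share of combined wages and SE income.

    Returns None when combined earnings are not positive, because the share
    is undefined in that case (an assumption; the commit does not say).
    """
    total = primary + secondary
    if total <= 0:
        return None
    share = primary / total
    if share >= 0.75:
        return "75 percent or more earned by primary"
    if share > 0.25:
        return "Less than 75 percent but more than 25 percent earned by primary"
    # Assumption: a share of exactly 25 percent falls into the lowest category.
    return "Less than 25 percent earned by primary"
```

Writing the rule out this way makes the boundary question visible: whichever category is meant to include exactly 25 percent, the paper text should say so.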

paper/sections/methodology.tex

Lines changed: 0 additions & 2 deletions
@@ -1,5 +1,3 @@
-\section{Methodology}\label{sec:methodology}
-
 % Include methodology subsections
 \input{sections/methodology/overview}
 \input{sections/methodology/quantile_forests}
Lines changed: 85 additions & 36 deletions
@@ -1,59 +1,115 @@
-\section{Methodology}
+\section{Methodology}\label{sec:methodology}
 
-Our procedure transforms the Current Population Survey (CPS) into an enhanced microsimulation dataset through four key steps:
+Following \cite{bryant2022}, our procedure enhances the Current Population Survey (CPS) with tax information from the Public Use File (PUF) through four key steps:
 \begin{enumerate}
 \item Project both CPS and PUF data to the target year
 \item Transfer tax variable distributions from PUF to CPS records
-\item Impute program participation
+\item Generate dependent age, primary age, and earnings split variables
 \item Reweight households to match administrative benchmarks
 \end{enumerate}
 
 \subsection{Data Projection}
 
-We project the CPS forward using a combination of economic and demographic factors. For each economic variable $y$, we apply:
+We project both datasets forward using variable-specific growth indices $f(t)$:
 
 \[ y_{2024} = y_{2023} \cdot \frac{f(2024)}{f(2023)} \]
 
-where $f(t)$ represents the variable-specific growth index. We derive these indices from:
+The indices come from:
 \begin{itemize}
-\item CBO economic projections for aggregate income components
+\item CBO economic projections for income components
 \item SSA wage index forecasts for employment income
-\item Census population projections for demographic totals
+\item Census population projections for demographics
 \item Treasury forecasts for tax variables
 \end{itemize}
 
-For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices as the CPS projection.
+For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices.
 
-\subsection{Tax Variable Enhancement}
+\subsection{Demographic Variable Construction}
+
+Following \cite{bryant2022}, we construct several key demographic variables:
+
+\subsubsection{Dependent Ages}
+We create three dependent age variables (AGEDP1/2/3) capturing:
+\begin{itemize}
+\item Up to 3 dependents for joint/HOH returns
+\item Up to 2 dependents for single returns
+\item Up to 1 dependent for MFS returns
+\end{itemize}
+
+Ages are categorized as:
+\begin{itemize}
+\item Under 5
+\item 5 under 13
+\item 13 under 17
+\item 17 under 19
+\item 19 under 24
+\item 24 or older
+\end{itemize}
 
-We transfer 47 tax variables from the PUF to the CPS using quantile regression forests. For each variable, we:
+Dependents are ordered sequentially by type:
 \begin{enumerate}
-\item Train a forest on PUF records using age, sex, marital status, and existing income measures as predictors
-\item Generate a distribution of predicted values for each CPS record
-\item Sample from these distributions using rank preservation within demographic groups
+\item Children living at home
+\item Children living away from home
+\item Other dependents
+\item Parents
 \end{enumerate}
 
-This approach preserves both the marginal distributions of tax variables and their relationships with demographic characteristics.
+\subsubsection{Primary Taxpayer Age}
+We construct age ranges differently for:
+
+Non-dependent returns:
+\begin{itemize}
+\item Under 26
+\item 26 under 35
+\item 35 under 45
+\item 45 under 55
+\item 55 under 65
+\item 65 or older
+\end{itemize}
+
+Dependent returns:
+\begin{itemize}
+\item Under 18
+\item 18 under 26
+\item 26 or older
+\end{itemize}
+
+\subsubsection{Earnings Splits}
+For joint returns, we calculate earnings splits using:
+\[ \text{Primary Share} = \frac{\text{Primary Wages} + \text{Primary SE Income}}{\text{Total Wages} + \text{Total SE Income}} \]
+
+Where:
+\begin{itemize}
+\item Primary wages and SE income = E30400 - E30500
+\item Secondary wages and SE income = E30500
+\end{itemize}
+
+We categorize the splits as:
+\begin{itemize}
+\item 75 percent or more earned by primary
+\item Less than 75 percent but more than 25 percent earned by primary
+\item Less than 25 percent earned by primary
+\end{itemize}
+
+\subsection{Tax Variable Enhancement}
 
-\subsection{Program Participation}
+We transfer tax variables from PUF to CPS using quantile regression forests trained on:
+\begin{itemize}
+\item Constructed demographic variables described above
+\item Filing status
+\item Existing income measures
+\end{itemize}
 
-We model participation in major benefit programs through a two-stage process:
+For each variable, we:
 \begin{enumerate}
-\item Calculate eligibility using program rules
-\item Assign participation probabilities based on:
-\begin{itemize}
-\item Demographic characteristics
-\item Benefit amounts
-\item Geographic patterns
-\item Historical take-up rates
-\end{itemize}
+\item Train a forest on PUF records
+\item Generate predicted distributions for CPS records
+\item Sample preserving rank within demographic groups
 \end{enumerate}
 
-The final participation patterns emerge from our reweighting procedure's alignment with administrative totals.
-
 \subsection{Household Reweighting}
 
-We adjust household weights to minimize discrepancies with administrative benchmarks while avoiding overfitting. The optimization problem takes the form:
+We adjust household weights to minimize discrepancies with administrative targets while avoiding overfitting:
 
 \[ \min_w \sum_j \left(\frac{\sum_i w_i x_{ij} - t_j}{t_j}\right)^2 + \lambda \sum_i (w_i - w_i^0)^2 \]
 
@@ -66,14 +122,7 @@ \subsection{Household Reweighting}
 \item $w_i^0$ is the original CPS weight
 \item $x_{ij}$ is the value of variable $j$ for household $i$
 \item $t_j$ is the administrative target for variable $j$
-\item $\lambda$ controls the strength of regularization
+\item $\lambda$ controls regularization strength
 \end{itemize}
 
-We solve this using gradient descent with dropout, randomly zeroing 5\% of household weights during each iteration to improve generalization.
-
-The remainder of the methodology section details each component:
-\begin{itemize}
-\item Section 4.1 describes our quantile regression forest implementation
-\item Section 4.2 explains the reweighting optimization
-\item Section 4.3 presents our validation framework
-\end{itemize}
+We solve using gradient descent with 5\% dropout for regularization.
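
Reviewer note on the reweighting text: the shortened sentence no longer says what the dropout acts on (the removed line said it zeroes 5% of household weights each iteration), so a concrete sketch of one possible reading may be useful. The Python below is a minimal illustration under stated assumptions, not the project's solver. All function names and the toy data are hypothetical. It assumes inverted dropout, where surviving weights are scaled by 1/(1 - p) so that estimated totals stay unbiased, and it clamps weights at zero; neither choice is stated in the commit. The learning rate, step count, and value of lambda are tuned to the toy data only.

```python
import random


def relative_errors(w, x, t):
    """Return (sum_i w_i * x_ij - t_j) / t_j for every target j."""
    return [
        (sum(w_i * row[j] for w_i, row in zip(w, x)) - t_j) / t_j
        for j, t_j in enumerate(t)
    ]


def objective(w, w0, x, t, lam):
    """Squared relative target errors plus lam times squared weight changes."""
    fit = sum(r * r for r in relative_errors(w, x, t))
    penalty = sum((w_i - w0_i) ** 2 for w_i, w0_i in zip(w, w0))
    return fit + lam * penalty


def reweight(w0, x, t, lam, lr, steps, dropout=0.05, seed=0):
    """Gradient descent on the objective with dropout applied to household weights."""
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    rng = random.Random(seed)
    keep = 1.0 - dropout
    w = list(w0)
    for _ in range(steps):
        # Assumption: inverted dropout. Dropped households contribute nothing
        # this step; survivors are scaled up by 1 / keep.
        mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in w]
        masked = [m_i * w_i for m_i, w_i in zip(mask, w)]
        resid = relative_errors(masked, x, t)
        grads = [
            sum(2.0 * r_j * m_i * row[j] / t_j
                for j, (r_j, t_j) in enumerate(zip(resid, t)))
            + 2.0 * lam * (w_i - w0_i)
            for m_i, w_i, w0_i, row in zip(mask, w, w0, x)
        ]
        # Assumption: weights are kept non-negative by clamping at zero.
        w = [max(w_i - lr * g_i, 0.0) for w_i, g_i in zip(w, grads)]
    return w


# Toy data: four households, two targets (household count and total income).
toy_w0 = [100.0, 100.0, 100.0, 100.0]
toy_x = [[1.0, 10.0], [1.0, 20.0], [1.0, 30.0], [1.0, 40.0]]
toy_t = [440.0, 11500.0]
toy_w = reweight(toy_w0, toy_x, toy_t, lam=1e-7, lr=1e4, steps=1000, dropout=0.05)
```

With only four households the dropout noise is large, so the toy run is best read as a demonstration of the update rule rather than of convergence quality. If the paper's solver does not rescale surviving weights, the text should say so, because that choice changes the weights the procedure converges to.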
