add to puf sourcing

MaxGhenis · MaxGhenis · commit 09e5e4cea992 · 2024-11-10T22:59:20.000-05:00
diff --git a/paper/bibliography/references.bib b/paper/bibliography/references.bib
@@ -122,3 +122,13 @@ @article{auerbach2018
   pages   = {541--576},
   year    = {2018}
 }
+
+@techreport{bryant2022,
+  title       = {General Description Booklet for the 2015 Public Use Tax File Demographic File},
+  author      = {Bryant, Victoria},
+  institution = {Statistics of Income Division, Internal Revenue Service},
+  year        = {2022},
+  month       = {September},
+  type        = {Technical Documentation},
+  url         = {https://drive.google.com/file/d/1WoTU70GEjYMO0KHsHvTTH0NwCc-kN5cE/view}
+}
diff --git a/paper/main.pdf b/paper/main.pdf
diff --git a/paper/sections/data.tex b/paper/sections/data.tex
@@ -1,6 +1,5 @@
 \section{Data}\label{sec:data}
 
-
 \subsection{Current Population Survey}
 
 The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) provides comprehensive demographic and economic information for a nationally representative sample of U.S. households. For tax year 2024, our base dataset contains approximately 150,000 households representing the U.S. civilian non-institutional population.
@@ -24,7 +23,23 @@ \subsection{Current Population Survey}
 
 \subsection{IRS Public Use File}
 
-The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, containing approximately 200,000 records. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties. Our analysis uses the 2015 PUF, the most recent available, aged to 2024.
+The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns. It represents the 151.2 million Form 1040, Form 1040A, and Form 1040EZ federal individual income tax returns filed for Tax Year 2015. The file contains 119,675 records sampled at rates that vary across strata; returns in strata 7 through 13 are sampled at 0.07 percent~\cite{bryant2022}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
+
+The Public Use Tax Demographic File supplements the PUF with:
+\begin{itemize}
+    \item Age ranges for primary taxpayers (with different ranges for dependent and non-dependent filers)
+    \item Dependent age information in six categories (under 5, 5 under 13, 13 under 17, 17 under 19, 19 under 24, and 24 or older)
+    \item Gender of primary taxpayer
+    \item Earnings splits for joint filers (categorizing primary earner share)
+\end{itemize}
+
+Key disclosure protections include:
+\begin{itemize}
+    \item Demographic information limited to returns in strata 7 through 13
+    \item Suppression of dependent ages for returns with farm income or homebuyer credits
+    \item Minimum population thresholds for dependent age reporting
+    \item Sequential limits on dependent counts by filing status
+\end{itemize}
 
 The PUF's key strengths include:
 \begin{itemize}
diff --git a/paper/sections/methodology.tex b/paper/sections/methodology.tex
@@ -1,5 +1,3 @@
-\section{Methodology}\label{sec:methodology}
-
 % Include methodology subsections
 \input{sections/methodology/overview}
 \input{sections/methodology/quantile_forests}
diff --git a/paper/sections/methodology/overview.tex b/paper/sections/methodology/overview.tex
@@ -1,59 +1,115 @@
-\section{Methodology}
+\section{Methodology}\label{sec:methodology}
 
-Our procedure transforms the Current Population Survey (CPS) into an enhanced microsimulation dataset through four key steps:
+Following \cite{bryant2022}, our procedure enhances the Current Population Survey (CPS) with tax information from the Public Use File (PUF) through four key steps:
 \begin{enumerate}
     \item Project both CPS and PUF data to the target year
     \item Transfer tax variable distributions from PUF to CPS records
-    \item Impute program participation
+    \item Generate dependent age, primary age, and earnings split variables
     \item Reweight households to match administrative benchmarks
 \end{enumerate}
 
 \subsection{Data Projection}
 
-We project the CPS forward using a combination of economic and demographic factors. For each economic variable $y$, we apply:
+For each economic variable $y$, we project both datasets forward using a variable-specific growth index $f(t)$:
 
 \[ y_{2024} = y_{2023} \cdot \frac{f(2024)}{f(2023)} \]
 
-where $f(t)$ represents the variable-specific growth index. We derive these indices from:
+The indices come from:
 \begin{itemize}
-    \item CBO economic projections for aggregate income components
+    \item CBO economic projections for income components
     \item SSA wage index forecasts for employment income
-    \item Census population projections for demographic totals
+    \item Census population projections for demographics
     \item Treasury forecasts for tax variables
 \end{itemize}
 
-For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices as the CPS projection.
+For the PUF, we first age the 2015 data to 2021 using IRS Statistics of Income data, then project to 2024 using the same indices as for the CPS.
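+
+To illustrate the projection formula with hypothetical index values (not actual forecasts), suppose $f(2023) = 100$ and $f(2024) = 104$ for some variable. A record with $y_{2023} = 50{,}000$ would then be projected to
+\[ y_{2024} = 50{,}000 \cdot \frac{104}{100} = 52{,}000. \]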
 
-\subsection{Tax Variable Enhancement}
+\subsection{Demographic Variable Construction}
+
+Following \cite{bryant2022}, we construct several key demographic variables:
+
+\subsubsection{Dependent Ages}
+We create three dependent age variables (AGEDP1, AGEDP2, and AGEDP3), which capture:
+\begin{itemize}
+    \item Up to 3 dependents for joint and head-of-household (HOH) returns
+    \item Up to 2 dependents for single returns
+    \item Up to 1 dependent for married-filing-separately (MFS) returns
+\end{itemize}
+
+Ages are categorized as:
+\begin{itemize}
+    \item Under 5
+    \item 5 under 13
+    \item 13 under 17
+    \item 17 under 19
+    \item 19 under 24
+    \item 24 or older
+\end{itemize}
 
-We transfer 47 tax variables from the PUF to the CPS using quantile regression forests. For each variable, we:
+Dependents are ordered sequentially by type:
 \begin{enumerate}
-    \item Train a forest on PUF records using age, sex, marital status, and existing income measures as predictors
-    \item Generate a distribution of predicted values for each CPS record
-    \item Sample from these distributions using rank preservation within demographic groups
+    \item Children living at home
+    \item Children living away from home
+    \item Other dependents
+    \item Parents
 \end{enumerate}
 
-This approach preserves both the marginal distributions of tax variables and their relationships with demographic characteristics. 
+\subsubsection{Primary Taxpayer Age}
+We construct age ranges separately for non-dependent and dependent returns.
+
+Non-dependent returns:
+\begin{itemize}
+    \item Under 26
+    \item 26 under 35
+    \item 35 under 45
+    \item 45 under 55
+    \item 55 under 65
+    \item 65 or older
+\end{itemize}
+
+Dependent returns:
+\begin{itemize}
+    \item Under 18
+    \item 18 under 26
+    \item 26 or older
+\end{itemize}
+
+\subsubsection{Earnings Splits}
+For joint returns, we calculate the primary earner's share of wages and self-employment (SE) income as:
+\[ \text{Primary Share} = \frac{\text{Primary Wages} + \text{Primary SE Income}}{\text{Total Wages} + \text{Total SE Income}} \]
+
+where:
+\begin{itemize}
+    \item Primary wages and SE income $=$ E30400 $-$ E30500
+    \item Secondary wages and SE income $=$ E30500
+\end{itemize}
+
+We categorize the splits as:
+\begin{itemize}
+    \item 75 percent or more earned by primary
+    \item Less than 75 percent but more than 25 percent earned by primary
+    \item Less than 25 percent earned by primary
+\end{itemize}
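+
+As an illustration with hypothetical amounts, consider a joint return with E30400 $= 80{,}000$ and E30500 $= 20{,}000$. Assuming, as the definitions above imply, that the denominator equals E30400, the primary share is
+\[ \text{Primary Share} = \frac{80{,}000 - 20{,}000}{80{,}000} = 0.75, \]
+so this return falls in the first category (75 percent or more earned by primary).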
+
+\subsection{Tax Variable Enhancement}
 
-\subsection{Program Participation}
+We transfer tax variables from the PUF to the CPS using quantile regression forests trained on:
+\begin{itemize}
+    \item The constructed demographic variables described above
+    \item Filing status
+    \item Existing income measures
+\end{itemize}
 
-We model participation in major benefit programs through a two-stage process:
+For each variable, we:
 \begin{enumerate}
-    \item Calculate eligibility using program rules
-    \item Assign participation probabilities based on:
-        \begin{itemize}
-            \item Demographic characteristics
-            \item Benefit amounts
-            \item Geographic patterns
-            \item Historical take-up rates
-        \end{itemize}
+    \item Train a forest on PUF records
+    \item Generate a distribution of predicted values for each CPS record
+    \item Sample from these distributions, preserving ranks within demographic groups
 \end{enumerate}
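+
+One common way to draw from a quantile regression forest, offered here only as a sketch that may differ from our implementation, is inverse-CDF sampling. Writing $\hat{F}(y \mid x_i)$ for the forest's estimated conditional distribution given predictors $x_i$ (notation introduced only for this illustration), an imputed value for CPS record $i$ is
+\[ \tilde{y}_i = \hat{F}^{-1}(u_i \mid x_i), \qquad u_i \sim \mathrm{Uniform}(0, 1). \]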
 
-The final participation patterns emerge from our reweighting procedure's alignment with administrative totals.
-
 \subsection{Household Reweighting}
 
-We adjust household weights to minimize discrepancies with administrative benchmarks while avoiding overfitting. The optimization problem takes the form:
+We adjust household weights to minimize discrepancies with administrative targets while avoiding overfitting:
 
 \[ \min_w \sum_j \left(\frac{\sum_i w_i x_{ij} - t_j}{t_j}\right)^2 + \lambda \sum_i (w_i - w_i^0)^2 \]
 
@@ -66,14 +122,7 @@ \subsection{Household Reweighting}
     \item $w_i^0$ is the original CPS weight
     \item $x_{ij}$ is the value of variable $j$ for household $i$
     \item $t_j$ is the administrative target for variable $j$
-    \item $\lambda$ controls the strength of regularization
+    \item $\lambda$ controls regularization strength
 \end{itemize}
 
-We solve this using gradient descent with dropout, randomly zeroing 5\% of household weights during each iteration to improve generalization.
-
-The remainder of the methodology section details each component:
-\begin{itemize}
-    \item Section 4.1 describes our quantile regression forest implementation
-    \item Section 4.2 explains the reweighting optimization
-    \item Section 4.3 presents our validation framework
-\end{itemize}
+We solve this problem using gradient descent with dropout, randomly zeroing 5\% of household weights in each iteration as an additional regularizer.
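+
+For reference, writing $L(w)$ for the objective above, its derivative with respect to a single weight $w_k$ is
+\[ \frac{\partial L}{\partial w_k} = 2 \sum_j \frac{x_{kj}}{t_j^2} \left( \sum_i w_i x_{ij} - t_j \right) + 2 \lambda \left( w_k - w_k^0 \right). \]
+A plain gradient step, sketched here without dropout and with a step size $\eta$ introduced only for illustration, is then $w_k \leftarrow w_k - \eta \, \partial L / \partial w_k$.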
