Commit e89912c

Add initial paper version
Fixes #111
1 parent 728cc41 commit e89912c

14 files changed

Lines changed: 684 additions & 0 deletions

paper/.gitignore

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.fmt
*.fot
*.cb
*.cb2
.*.lb

## Generated if empty string is given at "Please type another file name for output:"
.pdf

## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.run.xml

## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex(busy)
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync

paper/bibliography/references.bib

Whitespace-only changes.

paper/macros.tex

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
% Custom commands and mathematics macros
\newcommand{\policyengine}{\textsc{PolicyEngine}}
\newcommand{\cps}{\textsc{CPS}}
\newcommand{\puf}{\textsc{PUF}}

paper/main.tex

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
\documentclass[12pt]{article}

\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{hyperref}

\input{macros}

\title{Enhancing Survey Microdata with Administrative Records: \\ A Novel Approach to Microsimulation Dataset Construction}
\author{PolicyEngine Team}
\date{\today}

\begin{document}

\maketitle

\input{sections/abstract}
\input{sections/introduction}
\input{sections/background}
\input{sections/data}
\input{sections/methodology}
\input{sections/results}
\input{sections/discussion}
\input{sections/conclusion}

\bibliography{bibliography/references}
\bibliographystyle{plainnat}

\end{document}

paper/sections/abstract.tex

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
\section*{Abstract}

paper/sections/background.tex

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
\section{Background}

paper/sections/conclusion.tex

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
\section{Conclusion}

paper/sections/data.tex

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
\section{Data}

\subsection{Current Population Survey}

The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) provides comprehensive demographic and economic information for a nationally representative sample of U.S. households. For tax year 2024, our base dataset contains approximately 150,000 households representing the U.S. civilian non-institutional population.

The CPS's key strengths include:
\begin{itemize}
\item Rich demographic detail including age, sex, race, ethnicity, and education
\item Complete household relationship matrices
\item Program participation indicators
\item State and sub-state geographic identifiers
\item Monthly employment and labor force status
\end{itemize}

However, the CPS has known limitations for tax modeling:
\begin{itemize}
\item Underreporting of income, particularly at the top of the distribution
\item Limited tax-relevant information (e.g., itemized deductions)
\item No direct observation of tax units within households
\item Imprecise measurement of certain income types (e.g., capital gains)
\end{itemize}

\subsection{IRS Public Use File}

The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, containing approximately 200,000 records. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties. Our analysis uses the 2015 PUF, the most recent available, aged to 2024.

The PUF's key strengths include:
\begin{itemize}
\item Precise income amounts derived from information returns
\item Complete tax return information including itemized deductions
\item Actual tax unit structure
\item Accurate income type classification
\end{itemize}

The PUF's limitations include:
\begin{itemize}
\item Limited demographic information
\item No household structure beyond the tax unit
\item Geographic detail limited to state
\item No program participation information
\item Privacy protections that mask extreme values
\end{itemize}

\subsection{External Validation Sources}

We validate our enhanced dataset against several external sources:

\subsubsection{IRS Statistics of Income}

The Statistics of Income (SOI) Division publishes detailed tabulations of tax return data, including:
\begin{itemize}
\item Income amounts by source and adjusted gross income bracket
\item Number of returns by filing status
\item Itemized deduction amounts and counts
\item Tax credits and their distribution
\end{itemize}

These tabulations serve as key targets in our reweighting procedure and validation metrics.
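
As a sketch of how such targets might enter the reweighting, assuming a squared relative-error objective (the notation and functional form here are illustrative, not a statement of the implementation), let $w_i > 0$ be the weight on household $i$, $x_{ik}$ its contribution to target $k$, and $t_k \neq 0$ the published total:
\begin{equation*}
  % Illustrative objective only; the symbols w_i, x_{ik}, t_k, K, n are assumptions.
  L(w) = \frac{1}{K} \sum_{k=1}^{K} \left( \frac{\sum_{i=1}^{n} w_i x_{ik} - t_k}{t_k} \right)^{2}.
\end{equation*}
Dividing by $t_k$ puts dollar totals and return counts on a common scale, so no single large aggregate dominates the fit.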

\subsubsection{CPS ASEC Public Tables}

Census Bureau publications provide demographic and program participation benchmarks, including:
\begin{itemize}
\item Age distribution by state
\item Household size distribution
\item Program participation rates
\item Employment status
\end{itemize}

\subsubsection{Administrative Program Totals}

We incorporate official totals from various agencies:
\begin{itemize}
\item Social Security Administration beneficiary counts and benefit amounts
\item SNAP participation and benefits from USDA
\item Earned Income Tax Credit statistics from IRS
\item Unemployment Insurance claims and benefits from Department of Labor
\end{itemize}

\subsection{Variable Harmonization}

A crucial preparatory step is harmonizing variables across datasets. We develop a detailed crosswalk between CPS and PUF variables, accounting for definitional differences. Key considerations include:
\begin{itemize}
\item Income timing (calendar year vs.\ tax year)
\item Income classification (e.g., business vs.\ wage income)
\item Geographic definitions
\item Family relationship categories
\end{itemize}

For some variables, direct correspondence is impossible, requiring imputation strategies described in the methodology section. The complete variable crosswalk is available in our open-source repository.

paper/sections/discussion.tex

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
\section{Discussion}

paper/sections/introduction.tex

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
\section{Introduction}

Microsimulation models are essential tools for analyzing the distributional impacts of tax and transfer policies. These models require microdata that accurately represent both the demographic composition of a population and its economic circumstances, particularly its tax situation. However, available data sources typically excel in one dimension while falling short in another.

The Current Population Survey (CPS), conducted by the U.S. Census Bureau, provides rich demographic detail and household relationships but suffers from underreporting of income and lacks tax information. Conversely, the Internal Revenue Service's Public Use File (PUF) offers precise tax data but contains limited demographic information and obscures household structure. This tradeoff between demographic detail and tax precision poses a significant challenge for policy analysis.

This paper presents a novel approach to combining these complementary data sources. We develop a methodology that preserves the demographic richness of the CPS while incorporating the tax precision of the PUF, creating an enhanced dataset that serves as the foundation for PolicyEngine's microsimulation capabilities. Our approach differs from previous efforts in three key ways:

First, we employ quantile regression forests to transfer distributions rather than point estimates between datasets, preserving the complex relationships between variables. Second, we maintain household structure throughout the enhancement process, ensuring that family relationships crucial for benefit calculations remain intact. Third, we implement a sophisticated reweighting procedure that simultaneously matches dozens of demographic and economic targets while avoiding overfitting through a dropout-enhanced gradient descent approach.
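
As an illustrative sketch of the first step, with notation introduced here for exposition rather than taken from the implementation, a quantile regression forest estimates the conditional distribution of a tax variable $Y$ given shared predictors $X = x$ as
\begin{equation*}
  % Standard quantile regression forest estimator; symbols are assumptions for exposition.
  \hat{F}(y \mid x) = \sum_{i=1}^{n} w_i(x) \, \mathbf{1}\{Y_i \le y\},
  \qquad
  \hat{Q}_{\alpha}(x) = \inf \{ y : \hat{F}(y \mid x) \ge \alpha \},
\end{equation*}
where $w_i(x) \ge 0$ are the forest weights on the $n$ training records and sum to one. Under the assumption that imputation samples from this distribution, each recipient record draws $u \sim \mathrm{Uniform}(0, 1)$ and receives $\hat{Q}_{u}(x)$, so the imputed values reproduce the spread of the donor data instead of collapsing to a conditional mean.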

The resulting dataset demonstrates superior performance in both tax and transfer policy simulation. When compared to administrative totals, our enhanced dataset reduces discrepancies in key tax components by an average of 40\% relative to the baseline CPS, while maintaining or improving the accuracy of demographic and program participation variables.

The remainder of this paper is organized as follows: Section 2 reviews related work in survey enhancement and microsimulation data construction. Section 3 describes our data sources and their characteristics. Section 4 presents our methodology in detail. Section 5 validates our results against external benchmarks. Section 6 discusses implications and limitations, and Section 7 concludes.

Our contributions include:
\begin{itemize}
\item A novel methodology for combining survey and administrative data while preserving distributional relationships
\item An open-source implementation that can be adapted for other jurisdictions and policy models
\item A validation framework comparing enhanced estimates against multiple external benchmarks
\item A new, publicly available microdata file suitable for U.S. tax and benefit policy analysis
\end{itemize}
