
Commit 3f260a2

committed
improve methodology section
1 parent 5c1e33d commit 3f260a2

14 files changed

Lines changed: 507 additions & 254 deletions

paper/bibliography/references.bib

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -150,3 +150,27 @@ @techreport{census2024
150150
year = {2024},
151151
url = {https://www2.census.gov/programs-surveys/cps/datasets/2024/march/asec2024_ddl_pub_full.pdf}
152152
}
153+
154+
@article{meinshausen2006quantile,
155+
title = {Quantile regression forests},
156+
author = {Meinshausen, Nicolai},
157+
journal = {Journal of Machine Learning Research},
158+
volume = {7},
159+
number = {6},
160+
year = {2006}
161+
}
162+
163+
@misc{zillow2024quantile,
164+
title = {quantile-forest: Scikit-learn compatible quantile regression forests},
165+
author = {{Zillow Group}},
166+
year = {2024},
167+
howpublished = {\url{https://zillow.github.io/quantile-forest/}}
168+
}
169+
170+
@article{pytorch2019,
171+
title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
172+
author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others},
173+
journal = {Advances in Neural Information Processing Systems},
174+
volume = {32},
175+
year = {2019}
176+
}

paper/main.pdf

14.7 KB
Binary file not shown.

paper/main.tex

Lines changed: 7 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,13 +2,17 @@
22

33
\usepackage{graphicx}
44
\usepackage{amsmath}
5-
\usepackage{natbib}
5+
\usepackage[round]{natbib} % Keep round option
66
\usepackage{hyperref}
77
\usepackage{booktabs}
88
\usepackage{geometry}
99
\usepackage{microtype}
1010
\usepackage{xcolor}
1111

12+
% Set citation style in preamble
13+
\bibpunct{(}{)}{;}{a}{,}{,} % Round brackets, semicolon between citations, author-year style
14+
\setcitestyle{authoryear,round} % Author-year citations in round brackets
15+
1216
\input{macros}
1317

1418
\geometry{margin=1in}
@@ -20,6 +24,7 @@
2024
citecolor=blue,
2125
}
2226

27+
2328
\title{Enhancing Survey Microdata with Administrative Records: \\ A Novel Approach to Microsimulation Dataset Construction}
2429
% Define the \samethanks command
2530
\newcommand*\samethanks[1][\value{footnote}]{\footnotemark[#1]}
@@ -47,4 +52,4 @@
4752
\bibliographystyle{plainnat}
4853
\bibliography{./bibliography/references}
4954

50-
\end{document}
55+
\end{document}

paper/sections/background.tex

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -16,16 +16,16 @@ \subsection{Government Agency Models}
1616

1717
The U.S. federal government maintains several microsimulation capabilities through its policy analysis agencies, which form the foundation for official policy analysis and revenue estimation.
1818

19-
The Congressional Budget Office's model emphasizes behavioral responses and their macroeconomic effects \cite{cbo2018}. Their approach uses a two-stage estimation process:
19+
The Congressional Budget Office's model emphasizes behavioral responses and their macroeconomic effects \citep{cbo2018}. Their approach uses a two-stage estimation process:
2020

2121
\begin{enumerate}
2222
\item Static scoring: calculating mechanical revenue effects assuming no behavioral change
2323
\item Dynamic scoring: incorporating behavioral responses calibrated to empirical literature
2424
\end{enumerate}
2525

26-
CBO's elasticity assumptions have evolved over time in response to new research, particularly regarding the elasticity of taxable income (ETI). Their current approach varies ETI by income level and type of tax change, broadly consistent with the academic consensus surveyed in \cite{saez2012}. The model also incorporates detailed projections of demographic change and economic growth from CBO's other forecasting models.
26+
CBO's elasticity assumptions have evolved over time in response to new research, particularly regarding the elasticity of taxable income (ETI). Their current approach varies ETI by income level and type of tax change, broadly consistent with the academic consensus surveyed in \citet{saez2012}. The model also incorporates detailed projections of demographic change and economic growth from CBO's other forecasting models.
2727

28-
The Joint Committee on Taxation employs a similar approach but with particular focus on conventional revenue estimates \cite{jct2023}. Their model maintains detailed imputations for:
28+
The Joint Committee on Taxation employs a similar approach but with particular focus on conventional revenue estimates \citep{jct2023}. Their model maintains detailed imputations for:
2929

3030
\begin{itemize}
3131
\item Business income allocation between tax forms
@@ -36,7 +36,7 @@ \subsection{Government Agency Models}
3636

3737
A distinguishing feature is their treatment of tax expenditure interactions - addressing both mechanical overlap (e.g., between itemized deductions) and behavioral responses (e.g., between savings incentives).
3838

39-
The Treasury's Office of Tax Analysis model features additional detail on corporate tax incidence and international provisions \cite{ota2012}. Their approach emphasizes the relationship between different types of tax instruments through a series of linked models:
39+
The Treasury's Office of Tax Analysis model features additional detail on corporate tax incidence and international provisions \citep{ota2012}. Their approach emphasizes the relationship between different types of tax instruments through a series of linked models:
4040

4141
\begin{itemize}
4242
\item Individual income tax model using matched administrative data
@@ -53,7 +53,7 @@ \subsubsection{Urban Institute Family of Models}
5353

5454
The Urban Institute maintains several complementary microsimulation models, each emphasizing different aspects of tax and transfer policy analysis.
5555

56-
The Urban-Brookings Tax Policy Center model \cite{tpc2022} combines the IRS Public Use File with Current Population Survey data through predictive mean matching, an approach similar to what we employ in Section~\ref{sec:methodology}. Their imputation strategy aims to preserve joint distributions across variables using regression-based techniques for:
56+
The Urban-Brookings Tax Policy Center model \citep{tpc2022} combines the IRS Public Use File with Current Population Survey data through predictive mean matching, an approach similar to what we employ in Section~\ref{sec:methodology}. Their imputation strategy aims to preserve joint distributions across variables using regression-based techniques for:
5757

5858
\begin{itemize}
5959
\item Wealth holdings (18 asset and debt categories)
@@ -63,7 +63,7 @@ \subsubsection{Urban Institute Family of Models}
6363
\item Retirement accounts (DB/DC split and contribution levels)
6464
\end{itemize}
6565

66-
TRIM3 emphasizes the time dimension of policy analysis, with sophisticated procedures for converting annual survey data into monthly variables \cite{trim2024}. Key innovations include:
66+
TRIM3 emphasizes the time dimension of policy analysis, with sophisticated procedures for converting annual survey data into monthly variables \citep{trim2024}. Key innovations include:
6767

6868
\begin{itemize}
6969
\item Allocation of employment spells to specific weeks using BLS benchmarks
@@ -74,11 +74,11 @@ \subsubsection{Urban Institute Family of Models}
7474

7575
This monthly allocation approach informs our treatment of time variation in Section~\ref{sec:data}.
7676

77-
The newer ATTIS model \cite{attis2024} focuses on interactions between tax and transfer programs. Building on the American Community Survey rather than the CPS provides better geographic detail at the cost of requiring additional tax variable imputations. Their approach to correcting for benefit underreporting in survey data parallels our methods in Section~\ref{sec:methodology}.
77+
The newer ATTIS model \citep{attis2024} focuses on interactions between tax and transfer programs. Building on the American Community Survey rather than the CPS provides better geographic detail at the cost of requiring additional tax variable imputations. Their approach to correcting for benefit underreporting in survey data parallels our methods in Section~\ref{sec:methodology}.
7878

7979
\subsubsection{Other Research Institution Models}
8080

81-
The Institute on Taxation and Economic Policy model \cite{itep2024} is unique in its comprehensive treatment of federal, state and local taxes. Key features include:
81+
The Institute on Taxation and Economic Policy model \citep{itep2024} is unique in its comprehensive treatment of federal, state and local taxes. Key features include:
8282

8383
\begin{itemize}
8484
\item Integration of income, sales, and property tax microsimulation
@@ -87,7 +87,7 @@ \subsubsection{Other Research Institution Models}
8787
\item Race/ethnicity analysis through statistical matching
8888
\end{itemize}
8989

90-
The Tax Foundation's Taxes and Growth model \cite{tf2024} emphasizes macroeconomic feedback effects through a neoclassical growth framework. Their approach includes:
90+
The Tax Foundation's Taxes and Growth model \citep{tf2024} emphasizes macroeconomic feedback effects through a neoclassical growth framework. Their approach includes:
9191

9292
\begin{itemize}
9393
\item Production function based on CES technology
@@ -100,7 +100,7 @@ \subsection{Open Source Initiatives}
100100

101101
Recent years have seen growing interest in open source approaches that promote transparency and reproducibility in tax policy modeling.
102102

103-
The Budget Lab at Yale \cite{budgetlab2024} maintains a fully open source federal tax model distinguished by:
103+
The Budget Lab at Yale \citep{budgetlab2024} maintains a fully open source federal tax model distinguished by:
104104

105105
\begin{itemize}
106106
\item Modular codebase with clear separation of concerns
@@ -111,7 +111,7 @@ \subsection{Open Source Initiatives}
111111

112112
Their approach to code organization and testing informs our own development practices.
113113

114-
The Policy Simulation Library's Tax-Data project \cite{psl2024} provides building blocks for tax microsimulation including:
114+
The Policy Simulation Library's Tax-Data project \citep{psl2024} provides building blocks for tax microsimulation including:
115115

116116
\begin{itemize}
117117
\item Data processing and cleaning routines

paper/sections/data.tex

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,7 @@ \subsection{Current Population Survey}
2222

2323
\subsection{IRS Public Use File}
2424

25-
The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \cite{bryant2023b}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
25+
The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \citep{bryant2023b}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
2626

2727
The Public Use Tax Demographic File supplements the PUF with:
2828
\begin{itemize}

paper/sections/methodology.tex

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,10 @@
1-
% Include methodology subsections
1+
\section{Methodology}\label{sec:methodology}
2+
23
\input{sections/methodology/overview}
4+
\input{sections/methodology/demographic_variables}
5+
\input{sections/methodology/puf_preprocessing}
6+
\input{sections/methodology/aging}
37
\input{sections/methodology/quantile_forests}
4-
\input{sections/methodology/reweighting}
8+
\input{sections/methodology/loss_matrix}
9+
\input{sections/methodology/reweighting}
10+
\input{sections/methodology/pipeline}
Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
\subsection{Data Aging and Indexing}
2+
3+
The process of projecting historical microdata involves both demographic aging and economic indexing based on US government forecasts. Our aging process occurs in two stages: first to reach our baseline year (2024), and then to project the calibrated dataset forward.
4+
5+
\subsubsection{Growth Factor Construction}
6+
7+
For each variable in the tax-benefit system with a specified growth parameter, we compute an index factor relative to the base year for every year through 2034:
8+
9+
\[ \text{Index Factor}_{t} = \frac{\text{Index}_{t}}{\text{Index}_{\text{base}}} \]
10+
11+
\subsubsection{Population Adjustment}
12+
13+
Most economic variables are adjusted for changes in total population:
14+
15+
\[ \text{Per Capita Factor}_{t} = \frac{\text{Index Factor}_{t}}{\text{Population Growth}_{t}} \]
16+
17+
Exceptions include:
18+
\begin{itemize}
19+
\item Weight variables, which keep their raw (not per capita) growth factors
20+
\item Population itself uses Census projections directly
21+
\end{itemize}
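As a rough illustration of the two formulas above, the following Python sketch computes index factors relative to a base year and then divides them by population growth. The input series and the function names are hypothetical, and the sketch assumes that population growth is measured relative to the same base year, which the text does not state explicitly.

```python
def index_factors(index_by_year, base_year):
    """Index Factor_t = Index_t / Index_base for every year in the series."""
    base = index_by_year[base_year]
    return {year: value / base for year, value in index_by_year.items()}


def per_capita_factors(index_by_year, population_by_year, base_year):
    """Per Capita Factor_t = Index Factor_t / Population Growth_t.

    Assumption: population growth is the population level relative to
    the same base year as the economic index.
    """
    idx = index_factors(index_by_year, base_year)
    pop = index_factors(population_by_year, base_year)
    return {year: idx[year] / pop[year] for year in idx}


# Hypothetical inputs, not actual CBO or Census figures.
wages = {2023: 100.0, 2024: 105.0, 2025: 110.25}
population = {2023: 330.0, 2024: 333.3, 2025: 336.633}

factors = per_capita_factors(wages, population, base_year=2023)
```

With these made-up inputs, aggregate growth of 5 percent and population growth of 1 percent give a 2024 per capita factor of roughly 1.05 / 1.01.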
22+
23+
\subsubsection{Data Sources}
24+
25+
Projection factors come from:
26+
\begin{itemize}
27+
\item Congressional Budget Office economic projections
28+
\item Census Bureau population estimates
29+
\item Social Security Administration wage index forecasts
30+
\item Treasury tax parameter indexing
31+
\end{itemize}
32+
33+
\subsubsection{Initial Aging Implementation}
34+
35+
For any variable $y$, the projected value to reach our baseline year is computed as:
36+
37+
\[ y_{2024} = y_{2023} \cdot \frac{f(2024)}{f(2023)} \]
38+
39+
where $f(t)$ represents the index factor for year $t$.
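The aging step can be sketched in a few lines of Python. This is a minimal illustration of the ratio formula only; the function name and the index factor values are hypothetical and do not come from the project's code or from any published forecast.

```python
def age_value(value, factor_from, factor_to):
    """Scale a value by the ratio of index factors: y_to = y_from * f(to) / f(from)."""
    if factor_from == 0:
        raise ValueError("index factor for the source year must be non-zero")
    return value * factor_to / factor_from


# Hypothetical index factors f(2023) and f(2024).
f = {2023: 1.20, 2024: 1.26}

# A 2023 amount of 50,000 grows by the ratio f(2024) / f(2023) = 1.05.
wages_2024 = age_value(50_000.0, f[2023], f[2024])
```

Because only the ratio of the two factors matters, the result does not depend on which base year the index itself was normalized to.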
40+
41+
\subsubsection{Forward Projection}
42+
43+
After constructing and calibrating the enhanced 2024 dataset, we project it to future years using the same indexing framework. This maintains the dataset's enhanced distributional properties while reflecting:
44+
45+
\begin{itemize}
46+
\item Economic growth forecasts for monetary variables
47+
\item Statutory adjustments to program parameters
48+
\item Population projections applied to household weights
49+
\end{itemize}
Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
\subsection{Demographic Variable Construction}
2+
3+
Following the IRS specifications for the Public Use File, we construct three key demographic variables: dependent ages, primary taxpayer age ranges, and earnings splits between spouses.
4+
5+
\subsubsection{Dependent Ages}
6+
7+
For each dependent, we construct age categories following IRS constraints:
8+
\begin{itemize}
9+
\item Under 5
10+
\item 5 under 13
11+
\item 13 under 17
12+
\item 17 under 19
13+
\item 19 under 24
14+
\item 24 or older
15+
\end{itemize}
16+
17+
The number of dependents is limited by filing status:
18+
\begin{itemize}
19+
\item Up to 3 dependents for joint returns and head of household returns
20+
\item Up to 2 dependents for single returns
21+
\item Up to 1 dependent for married filing separately returns
22+
\end{itemize}
23+
24+
Dependents are ordered sequentially by type:
25+
\begin{enumerate}
26+
\item Children living at home
27+
\item Children living away from home
28+
\item Other dependents
29+
\item Parents
30+
\end{enumerate}
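The filing-status caps and the ordering rule above could be combined as in the Python sketch below. The dictionary keys, the tuple layout, and the function name are hypothetical labels chosen for illustration, and applying the cap after sorting is an assumption; the text does not say in which order the two rules are applied.

```python
# Caps taken from the text; the filing-status keys are hypothetical labels.
MAX_DEPENDENTS = {
    "joint": 3,
    "head_of_household": 3,
    "single": 2,
    "married_filing_separately": 1,
}

# Ordering taken from the text; the type names are hypothetical labels.
DEPENDENT_ORDER = {
    "child_at_home": 0,
    "child_away": 1,
    "other": 2,
    "parent": 3,
}


def select_dependents(dependents, filing_status):
    """Sort dependents by type, then keep at most the cap for the filing status.

    `dependents` is a list of (dependent_type, age) tuples. Python's sort is
    stable, so dependents of the same type keep their original order.
    """
    ordered = sorted(dependents, key=lambda d: DEPENDENT_ORDER[d[0]])
    return ordered[: MAX_DEPENDENTS[filing_status]]


deps = [("parent", 70), ("child_at_home", 4), ("other", 30), ("child_away", 20)]
kept = select_dependents(deps, "single")
```

For the single return in this example, only the child at home and the child away from home are retained.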
31+
32+
\subsubsection{Primary Taxpayer Age}
33+
34+
Age ranges are constructed differently for dependent and non-dependent returns:
35+
36+
For non-dependent returns:
37+
\begin{itemize}
38+
\item Under 26
39+
\item 26 under 35
40+
\item 35 under 45
41+
\item 45 under 55
42+
\item 55 under 65
43+
\item 65 or older
44+
\end{itemize}
45+
46+
For dependent returns:
47+
\begin{itemize}
48+
\item Under 18
49+
\item 18 under 26
50+
\item 26 or older
51+
\end{itemize}
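One way to map an exact age to these ranges is a sorted list of cut points and a binary search, as in the Python sketch below. It assumes that a range such as ``26 under 35'' includes 26 and excludes 35; the constant and function names are hypothetical.

```python
from bisect import bisect_right

# Cut points and labels for non-dependent returns, following the text.
NON_DEPENDENT_BOUNDS = [26, 35, 45, 55, 65]
NON_DEPENDENT_LABELS = [
    "Under 26", "26 under 35", "35 under 45",
    "45 under 55", "55 under 65", "65 or older",
]

# Cut points and labels for dependent returns, following the text.
DEPENDENT_BOUNDS = [18, 26]
DEPENDENT_LABELS = ["Under 18", "18 under 26", "26 or older"]


def primary_age_range(age, is_dependent_return):
    """Return the age-range label for the primary taxpayer.

    bisect_right counts how many cut points are <= age, which is the index
    of the range when each range includes its lower bound.
    """
    if is_dependent_return:
        bounds, labels = DEPENDENT_BOUNDS, DEPENDENT_LABELS
    else:
        bounds, labels = NON_DEPENDENT_BOUNDS, NON_DEPENDENT_LABELS
    return labels[bisect_right(bounds, age)]
```

The same pattern would apply to the dependent age categories with a different list of cut points.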
52+
53+
\subsubsection{Earnings Splits}
54+
55+
For joint returns, we calculate the primary earner's share of total earnings:
56+
57+
\[ \text{Primary Share} = \frac{\text{Primary Wages} + \text{Primary SE Income}}{\text{Total Wages} + \text{Total SE Income}} \]
58+
59+
where:
60+
\begin{itemize}
61+
\item Primary wages and SE income = E30400 - E30500
62+
\item Secondary wages and SE income = E30500
63+
\end{itemize}
64+
65+
This share is categorized into:
66+
\begin{itemize}
67+
\item 75 percent or more earned by primary
68+
\item Less than 75 percent but more than 25 percent earned by primary
69+
\item Less than 25 percent earned by primary
70+
\end{itemize}
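The share calculation and its categories can be sketched in Python as follows. The sketch takes the variable definitions above at face value, so the denominator (primary plus secondary) reduces to E30400. The function names are hypothetical, and two choices are assumptions rather than rules from the text: returns with zero total earnings yield no share, and a share of exactly 25 percent is placed in the lowest category.

```python
def primary_share(e30400, e30500):
    """Primary earner's share of combined wages and SE income.

    Per the definitions in the text: primary = E30400 - E30500 and
    secondary = E30500, so the total equals E30400.
    Returns None when the total is zero (an assumption of this sketch).
    """
    total = e30400
    if total == 0:
        return None
    return (e30400 - e30500) / total


def share_category(share):
    """Assign one of the three categories listed in the text."""
    if share >= 0.75:
        return "75 percent or more earned by primary"
    if share > 0.25:
        return "Less than 75 percent but more than 25 percent earned by primary"
    # Assumption: a share of exactly 25 percent falls here.
    return "Less than 25 percent earned by primary"


share = primary_share(100_000, 20_000)
category = share_category(share)
```

In this example the primary earner accounts for 80 percent of combined earnings, which falls in the first category.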
71+
72+
\subsubsection{Implementation Details}
73+
74+
When decoding age ranges into specific ages, we use random assignment within the range to avoid unrealistic bunching. For example, when the PUF indicates age 80, we randomly assign an age between 80 and 84.
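A minimal Python sketch of this decoding step is shown below. It assumes a uniform draw over integer ages with both endpoints included, which the text does not specify; the function name is hypothetical and the seed is fixed only to make the example reproducible.

```python
import random


def decode_age(lower, upper, rng=random):
    """Draw an integer age uniformly from [lower, upper], both ends inclusive."""
    return rng.randint(lower, upper)


# Seeded generator so repeated runs produce the same draws.
rng = random.Random(0)

# Decode 1,000 records whose coded range is 80 to 84.
ages = [decode_age(80, 84, rng) for _ in range(1000)]
```

Drawing within the range spreads records across the individual ages instead of placing every record at the lower bound.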
75+
76+
The ordering of dependents is preserved when constructing synthetic tax units to maintain consistency with the original data structure.
