PolicyEngine
diff --git a/‎paper/bibliography/references.bib‎
Lines changed: 24 additions & 0 deletions b/‎paper/bibliography/references.bib‎
diff --git a/‎paper/main.pdf‎
14.7 KB b/‎paper/main.pdf‎
diff --git a/‎paper/main.tex‎
Lines changed: 7 additions & 2 deletions b/‎paper/main.tex‎
diff --git a/‎paper/sections/background.tex‎
Lines changed: 11 additions & 11 deletions b/‎paper/sections/background.tex‎
diff --git a/‎paper/sections/data.tex‎
Lines changed: 1 addition & 1 deletion b/‎paper/sections/data.tex‎
diff --git a/‎paper/sections/methodology.tex‎
Lines changed: 8 additions & 2 deletions b/‎paper/sections/methodology.tex‎
diff --git a/‎paper/sections/methodology/aging.tex‎
Lines changed: 49 additions & 0 deletions b/‎paper/sections/methodology/aging.tex‎
diff --git a/‎paper/sections/methodology/demographic_variables.tex‎
Lines changed: 76 additions & 0 deletions b/‎paper/sections/methodology/demographic_variables.tex‎
@@ -150,3 +150,27 @@ @techreport{census2024
   year        = {2024},
   url         = {https://www2.census.gov/programs-surveys/cps/datasets/2024/march/asec2024_ddl_pub_full.pdf}
 }
+
+@article{meinshausen2006quantile,
+  title   = {Quantile regression forests},
+  author  = {Meinshausen, Nicolai},
+  journal = {Journal of Machine Learning Research},
+  volume  = {7},
+  number  = {6},
+  year    = {2006}
+}
+
+@misc{zillow2024quantile,
+  title        = {quantile-forest: Scikit-learn compatible quantile regression forests},
+  author       = {{Zillow Group}},
+  year         = {2024},
+  howpublished = {\url{https://zillow.github.io/quantile-forest/}}
+}
+
+@inproceedings{pytorch2019,
+  title     = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
+  author    = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others},
+  booktitle = {Advances in Neural Information Processing Systems},
+  volume  = {32},
+  year    = {2019}
+}
@@ -2,13 +2,17 @@
 
 \usepackage{graphicx}
 \usepackage{amsmath}
-\usepackage{natbib}
+\usepackage[round]{natbib}
 \usepackage{hyperref}
 \usepackage{booktabs}
 \usepackage{geometry}
 \usepackage{microtype}
 \usepackage{xcolor}
 
+% Citation style: author-year citations in round brackets
+\bibpunct{(}{)}{;}{a}{,}{,}
+\setcitestyle{authoryear,round}
+
 \input{macros}
 
 \geometry{margin=1in}
@@ -20,6 +24,7 @@
     citecolor=blue,
 }
 
+
 \title{Enhancing Survey Microdata with Administrative Records: \\ A Novel Approach to Microsimulation Dataset Construction}
 % Define the \samethanks command
 \newcommand*\samethanks[1][\value{footnote}]{\footnotemark[#1]}
@@ -47,4 +52,4 @@
 \bibliographystyle{plainnat}
 \bibliography{./bibliography/references}
 
-\end{document}
+\end{document}
@@ -16,16 +16,16 @@ \subsection{Government Agency Models}
 
 The U.S. federal government maintains several microsimulation capabilities through its policy analysis agencies, which form the foundation for official policy analysis and revenue estimation.
 
-The Congressional Budget Office's model emphasizes behavioral responses and their macroeconomic effects \cite{cbo2018}. Their approach uses a two-stage estimation process:
+The Congressional Budget Office's model emphasizes behavioral responses and their macroeconomic effects \citep{cbo2018}. Their approach uses a two-stage estimation process:
 
 \begin{enumerate}
     \item Static scoring: calculating mechanical revenue effects assuming no behavioral change
     \item Dynamic scoring: incorporating behavioral responses calibrated to empirical literature
 \end{enumerate}
 
-CBO's elasticity assumptions have evolved over time in response to new research, particularly regarding the elasticity of taxable income (ETI). Their current approach varies ETI by income level and type of tax change, broadly consistent with the academic consensus surveyed in \cite{saez2012}. The model also incorporates detailed projections of demographic change and economic growth from CBO's other forecasting models.
+CBO's elasticity assumptions have evolved over time in response to new research, particularly regarding the elasticity of taxable income (ETI). Their current approach varies ETI by income level and type of tax change, broadly consistent with the academic consensus surveyed in \citet{saez2012}. The model also incorporates detailed projections of demographic change and economic growth from CBO's other forecasting models.
 
-The Joint Committee on Taxation employs a similar approach but with particular focus on conventional revenue estimates \cite{jct2023}. Their model maintains detailed imputations for:
+The Joint Committee on Taxation employs a similar approach but with particular focus on conventional revenue estimates \citep{jct2023}. Their model maintains detailed imputations for:
 
 \begin{itemize}
     \item Business income allocation between tax forms
@@ -36,7 +36,7 @@ \subsection{Government Agency Models}
 
 A distinguishing feature is their treatment of tax expenditure interactions - addressing both mechanical overlap (e.g., between itemized deductions) and behavioral responses (e.g., between savings incentives).
 
-The Treasury's Office of Tax Analysis model features additional detail on corporate tax incidence and international provisions \cite{ota2012}. Their approach emphasizes the relationship between different types of tax instruments through a series of linked models:
+The Treasury's Office of Tax Analysis model features additional detail on corporate tax incidence and international provisions \citep{ota2012}. Their approach emphasizes the relationship between different types of tax instruments through a series of linked models:
 
 \begin{itemize}
     \item Individual income tax model using matched administrative data
@@ -53,7 +53,7 @@ \subsubsection{Urban Institute Family of Models}
 
 The Urban Institute maintains several complementary microsimulation models, each emphasizing different aspects of tax and transfer policy analysis.
 
-The Urban-Brookings Tax Policy Center model \cite{tpc2022} combines the IRS Public Use File with Current Population Survey data through predictive mean matching, an approach similar to what we employ in Section~\ref{sec:methodology}. Their imputation strategy aims to preserve joint distributions across variables using regression-based techniques for:
+The Urban-Brookings Tax Policy Center model \citep{tpc2022} combines the IRS Public Use File with Current Population Survey data through predictive mean matching, an approach similar to what we employ in Section~\ref{sec:methodology}. Their imputation strategy aims to preserve joint distributions across variables using regression-based techniques for:
 
 \begin{itemize}
     \item Wealth holdings (18 asset and debt categories)
@@ -63,7 +63,7 @@ \subsubsection{Urban Institute Family of Models}
     \item Retirement accounts (DB/DC split and contribution levels)
 \end{itemize}
 
-TRIM3 emphasizes the time dimension of policy analysis, with sophisticated procedures for converting annual survey data into monthly variables \cite{trim2024}. Key innovations include:
+TRIM3 emphasizes the time dimension of policy analysis, with sophisticated procedures for converting annual survey data into monthly variables \citep{trim2024}. Key innovations include:
 
 \begin{itemize}
     \item Allocation of employment spells to specific weeks using BLS benchmarks
@@ -74,11 +74,11 @@ \subsubsection{Urban Institute Family of Models}
 
 This monthly allocation approach informs our treatment of time variation in Section~\ref{sec:data}.
 
-The newer ATTIS model \cite{attis2024} focuses on interactions between tax and transfer programs. Building on the American Community Survey rather than the CPS provides better geographic detail at the cost of requiring additional tax variable imputations. Their approach to correcting for benefit underreporting in survey data parallels our methods in Section~\ref{sec:methodology}.
+The newer ATTIS model \citep{attis2024} focuses on interactions between tax and transfer programs. Building on the American Community Survey rather than the CPS provides better geographic detail at the cost of requiring additional tax variable imputations. Their approach to correcting for benefit underreporting in survey data parallels our methods in Section~\ref{sec:methodology}.
 
 \subsubsection{Other Research Institution Models}
 
-The Institute on Taxation and Economic Policy model \cite{itep2024} is unique in its comprehensive treatment of federal, state and local taxes. Key features include:
+The Institute on Taxation and Economic Policy model \citep{itep2024} is unique in its comprehensive treatment of federal, state and local taxes. Key features include:
 
 \begin{itemize}
     \item Integration of income, sales, and property tax microsimulation
@@ -87,7 +87,7 @@ \subsubsection{Other Research Institution Models}
     \item Race/ethnicity analysis through statistical matching
 \end{itemize}
 
-The Tax Foundation's Taxes and Growth model \cite{tf2024} emphasizes macroeconomic feedback effects through a neoclassical growth framework. Their approach includes:
+The Tax Foundation's Taxes and Growth model \citep{tf2024} emphasizes macroeconomic feedback effects through a neoclassical growth framework. Their approach includes:
 
 \begin{itemize}
     \item Production function based on CES technology
@@ -100,7 +100,7 @@ \subsection{Open Source Initiatives}
 
 Recent years have seen growing interest in open source approaches that promote transparency and reproducibility in tax policy modeling.
 
-The Budget Lab at Yale \cite{budgetlab2024} maintains a fully open source federal tax model distinguished by:
+The Budget Lab at Yale \citep{budgetlab2024} maintains a fully open source federal tax model distinguished by:
 
 \begin{itemize}
     \item Modular codebase with clear separation of concerns
@@ -111,7 +111,7 @@ \subsection{Open Source Initiatives}
 
 Their approach to code organization and testing informs our own development practices.
 
-The Policy Simulation Library's Tax-Data project \cite{psl2024} provides building blocks for tax microsimulation including:
+The Policy Simulation Library's Tax-Data project \citep{psl2024} provides building blocks for tax microsimulation including:
 
 \begin{itemize}
     \item Data processing and cleaning routines
 
@@ -22,7 +22,7 @@ \subsection{Current Population Survey}
 
 \subsection{IRS Public Use File}
 
-The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \cite{bryant2023b}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
+The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, representing the 151.2 million Form 1040, Form 1040A, and Form 1040EZ Federal Individual Income Tax Returns filed for Tax Year 2015. The file contains 119,675 records sampled at varying rates across strata, with 0.07 percent sampling for strata 7 through 13 \citep{bryant2023b}. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties.
 
 The Public Use Tax Demographic File supplements the PUF with:
 \begin{itemize}
 
@@ -1,4 +1,10 @@
-% Include methodology subsections
+\section{Methodology}\label{sec:methodology}
+
 \input{sections/methodology/overview}
+\input{sections/methodology/demographic_variables}
+\input{sections/methodology/puf_preprocessing}
+\input{sections/methodology/aging}
 \input{sections/methodology/quantile_forests}
-\input{sections/methodology/reweighting}
+\input{sections/methodology/loss_matrix}
+\input{sections/methodology/reweighting}
+\input{sections/methodology/pipeline}
@@ -0,0 +1,49 @@
+\subsection{Data Aging and Indexing}
+
+The process of projecting historical microdata involves both demographic aging and economic indexing based on US government forecasts. Our aging process occurs in two stages: first to reach our baseline year (2024), and then to project the calibrated dataset forward.
+
+\subsubsection{Growth Factor Construction}
+
+For each variable in the tax-benefit system with a specified growth parameter, we compute change factors from the base year through 2034:
+
+\[ \text{Index Factor}_{t} = \frac{\text{Index}_{t}}{\text{Index}_{\text{base}}} \]
+
+\subsubsection{Population Adjustment}
+
+Most economic variables are adjusted for changes in total population:
+
+\[ \text{Per Capita Factor}_{t} = \frac{\text{Index Factor}_{t}}{\text{Population Growth}_{t}} \]
+
+Exceptions include:
+\begin{itemize}
+    \item Weight variables maintain raw growth
+    \item Population itself uses Census projections directly
+\end{itemize}
+
+\subsubsection{Data Sources}
+
+Projection factors come from:
+\begin{itemize}
+    \item Congressional Budget Office economic projections
+    \item Census Bureau population estimates 
+    \item Social Security Administration wage index forecasts
+    \item Treasury tax parameter indexing
+\end{itemize}
+
+\subsubsection{Initial Aging Implementation}
+
+For any variable $y$, the value projected to our baseline year is computed as:
+
+\[ y_{2024} = y_{2023} \cdot \frac{f(2024)}{f(2023)} \]
+
+where $f(t)$ is the index factor for year $t$.
+
+\subsubsection{Forward Projection}
+
+After constructing and calibrating the enhanced 2024 dataset, we project it to future years using the same indexing framework. This maintains the dataset's enhanced distributional properties while reflecting:
+
+\begin{itemize}
+    \item Economic growth forecasts for monetary variables
+    \item Statutory adjustments to program parameters
+    \item Population projections applied to household weights
+\end{itemize}
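
Review note on `aging.tex`: the indexing arithmetic described in this file can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation. The function names (`index_factor`, `per_capita_factor`, `age_value`) are hypothetical, and the sample series are made-up numbers rather than CBO or Census figures.

```python
def index_factor(index, base_year):
    """Index Factor_t = Index_t / Index_base for every year in the series."""
    base = index[base_year]
    return {year: value / base for year, value in index.items()}


def per_capita_factor(index_factors, population_factors):
    """Divide each year's index factor by that year's population growth factor."""
    return {
        year: index_factors[year] / population_factors[year]
        for year in index_factors
    }


def age_value(y, f, from_year, to_year):
    """Project a value between years: y_to = y_from * f(to) / f(from)."""
    return y * f[to_year] / f[from_year]


# Made-up illustrative series (assumption), keyed by year.
wage_index = {2023: 100.0, 2024: 104.0, 2025: 108.16}
population = {2023: 200.0, 2024: 202.0, 2025: 204.02}

wage_factor = index_factor(wage_index, base_year=2023)
pop_factor = index_factor(population, base_year=2023)
wage_per_capita = per_capita_factor(wage_factor, pop_factor)

# Age a 2023 wage amount to the 2024 baseline year.
wages_2024 = age_value(50_000.0, wage_per_capita, 2023, 2024)
```

With these sample numbers, aggregate wages grow 4 percent and population grows 1 percent, so the per-record amount is scaled by 1.04 / 1.01. Weight variables would skip the `per_capita_factor` step, as the exceptions list in the section states.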
@@ -0,0 +1,76 @@
+\subsection{Demographic Variable Construction}
+
+Following the IRS specifications for the Public Use File, we construct three key demographic variables: dependent ages, primary taxpayer age ranges, and earnings splits between spouses.
+
+\subsubsection{Dependent Ages}
+
+For each dependent, we construct age categories following IRS constraints:
+\begin{itemize}
+    \item Under 5
+    \item 5 under 13 
+    \item 13 under 17
+    \item 17 under 19
+    \item 19 under 24
+    \item 24 or older
+\end{itemize}
+
+The number of dependents is limited by filing status:
+\begin{itemize}
+    \item Up to 3 dependents for joint returns and head of household returns
+    \item Up to 2 dependents for single returns
+    \item Up to 1 dependent for married filing separately returns
+\end{itemize}
+
+Dependents are ordered sequentially by type:
+\begin{enumerate}
+    \item Children living at home
+    \item Children living away from home
+    \item Other dependents
+    \item Parents
+\end{enumerate}
+
+\subsubsection{Primary Taxpayer Age}
+
+Age ranges are constructed differently for dependent and non-dependent returns:
+
+For non-dependent returns:
+\begin{itemize}
+    \item Under 26
+    \item 26 under 35
+    \item 35 under 45
+    \item 45 under 55
+    \item 55 under 65
+    \item 65 or older
+\end{itemize}
+
+For dependent returns:
+\begin{itemize}
+    \item Under 18
+    \item 18 under 26
+    \item 26 or older
+\end{itemize}
+
+\subsubsection{Earnings Splits}
+
+For joint returns, we calculate the primary earner's share of total earnings:
+
+\[ \text{Primary Share} = \frac{\text{Primary Wages} + \text{Primary SE Income}}{\text{Total Wages} + \text{Total SE Income}} \]
+
+Where:
+\begin{itemize}
+    \item Primary wages and SE income = E30400 $-$ E30500
+    \item Secondary wages and SE income = E30500
+\end{itemize}
+
+This share is categorized into:
+\begin{itemize}
+    \item 75 percent or more earned by primary
+    \item Less than 75 percent but more than 25 percent earned by primary
+    \item Less than 25 percent earned by primary
+\end{itemize}
+
+\subsubsection{Implementation Details}
+
+When decoding age ranges into specific ages, we use random assignment within the range to avoid unrealistic bunching. For example, when the PUF indicates age 80, we randomly assign an age between 80 and 84.
+
+The ordering of dependents is preserved when constructing synthetic tax units to maintain consistency with the original data structure.
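
Review note on `demographic_variables.tex`: the earnings split and the age decoding can be sketched in Python as below. This is a hedged illustration, not the authors' code. It reads E30400 as combined earnings and E30500 as the secondary earner's amount, as the definitions in the section imply. The function names, the category codes 1 to 3, and the handling of a share of exactly 25 percent are assumptions made for the example.

```python
import random


def primary_share(e30400, e30500):
    """Primary earner's share of combined wages and SE income."""
    if e30400 <= 0:
        # Assumption: the share is undefined without positive total earnings.
        return None
    return (e30400 - e30500) / e30400


def earnings_split_category(share):
    """Map a share to a category code (codes 1-3 are hypothetical labels)."""
    if share >= 0.75:
        return 1  # 75 percent or more earned by primary
    if share > 0.25:
        return 2  # less than 75 but more than 25 percent
    # Assumption: a share of exactly 25 percent is grouped with the lowest band.
    return 3


def decode_age(lower, upper, rng):
    """Draw an integer age uniformly from [lower, upper), e.g. '26 under 35'."""
    return rng.randrange(lower, upper)


rng = random.Random(0)  # seeded so the draw is reproducible
share = primary_share(100_000.0, 20_000.0)
category = earnings_split_category(share)
age = decode_age(26, 35, rng)
```

Drawing ages uniformly within each range is one simple way to avoid bunching at the range boundaries. Passing an explicit `random.Random` instance keeps the assignment reproducible across runs.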