| 1 | +\section{Data} |
| 2 | + |
| 3 | +\subsection{Current Population Survey} |
| 4 | + |
| 5 | +The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) provides comprehensive demographic and economic information for a nationally representative sample of U.S. households. For tax year 2024, our base dataset contains approximately 150,000 households representing the U.S. civilian non-institutional population. |
| 6 | + |
| 7 | +The CPS's key strengths include: |
| 8 | +\begin{itemize} |
| 9 | + \item Rich demographic detail including age, sex, race, ethnicity, and education |
| 10 | + \item Complete household relationship matrices |
| 11 | + \item Program participation indicators |
| 12 | + \item State and sub-state geographic identifiers |
| 13 | + \item Monthly employment and labor force status |
| 14 | +\end{itemize} |
| 15 | + |
| 16 | +However, the CPS has known limitations for tax modeling: |
| 17 | +\begin{itemize} |
| 18 | + \item Underreporting of income, particularly at the top of the distribution |
| 19 | + \item Limited tax-relevant information (e.g., itemized deductions) |
| 20 | + \item No direct observation of tax units within households |
| 21 | + \item Imprecise measurement of certain income types (e.g., capital gains) |
| 22 | +\end{itemize} |
| 23 | + |
| 24 | +\subsection{IRS Public Use File} |
| 25 | + |
| 26 | +The Internal Revenue Service Public Use File (PUF) is a national sample of individual income tax returns, containing approximately 200,000 records. The data are extensively transformed to protect taxpayer privacy while preserving statistical properties. Our analysis uses the 2015 PUF, the most recent available, aged to 2024. |
| 27 | + |
| 28 | +The PUF's key strengths include: |
| 29 | +\begin{itemize} |
| 30 | + \item Precise income amounts derived from information returns |
| 31 | + \item Complete tax return information including itemized deductions |
| 32 | + \item Actual tax unit structure |
| 33 | + \item Accurate income type classification |
| 34 | +\end{itemize} |
| 35 | + |
| 36 | +The PUF's limitations include: |
| 37 | +\begin{itemize} |
| 38 | + \item Limited demographic information |
| 39 | + \item No household structure beyond the tax unit |
| 40 | + \item Geographic detail limited to state |
| 41 | + \item No program participation information |
| 42 | + \item Privacy protections that mask extreme values |
| 43 | +\end{itemize} |
| 44 | + |
| 45 | +\subsection{External Validation Sources} |
| 46 | + |
| 47 | +We validate our enhanced dataset against several external sources: |
| 48 | + |
| 49 | +\subsubsection{IRS Statistics of Income} |
| 50 | + |
| 51 | +The Statistics of Income (SOI) Division publishes detailed tabulations of tax return data, including: |
| 52 | +\begin{itemize} |
| 53 | + \item Income amounts by source and adjusted gross income bracket |
| 54 | + \item Number of returns by filing status |
| 55 | + \item Itemized deduction amounts and counts |
| 56 | + \item Tax credits and their distribution |
| 57 | +\end{itemize} |
| 58 | + |
| 59 | +These tabulations serve as key targets in our reweighting procedure and as benchmarks for our validation metrics. |
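As a minimal sketch of how a weighted survey aggregate might be compared with a published tabulation, the Python below computes a weighted total and its relative error against a target. The function names, record values, weights, and target are hypothetical placeholders chosen for illustration. They are not SOI figures and not necessarily the metric used in our validation.

```python
def weighted_total(values, weights):
    """Sum of value * weight over all records."""
    return sum(v * w for v, w in zip(values, weights))


def relative_error(estimate, target):
    """Signed relative deviation of an estimate from its target."""
    return (estimate - target) / target


# Hypothetical records: wage amounts and household weights (not real data).
wages = [50_000.0, 120_000.0, 0.0]
weights = [1_000.0, 500.0, 2_000.0]

# Placeholder target, standing in for a published aggregate.
target = 100_000_000.0

estimate = weighted_total(wages, weights)
error = relative_error(estimate, target)
```

A positive `error` means the weighted dataset overshoots the target. A reweighting procedure would adjust `weights` to shrink such deviations across many targets at once.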
| 60 | + |
| 61 | +\subsubsection{CPS ASEC Public Tables} |
| 62 | + |
| 63 | +Census Bureau publications provide demographic and program participation benchmarks, including: |
| 64 | +\begin{itemize} |
| 65 | + \item Age distribution by state |
| 66 | + \item Household size distribution |
| 67 | + \item Program participation rates |
| 68 | + \item Employment status |
| 69 | +\end{itemize} |
| 70 | + |
| 71 | +\subsubsection{Administrative Program Totals} |
| 72 | + |
| 73 | +We incorporate official totals from various agencies: |
| 74 | +\begin{itemize} |
| 75 | + \item Social Security Administration beneficiary counts and benefit amounts |
| 76 | + \item SNAP participation and benefits from the USDA |
| 77 | + \item Earned Income Tax Credit statistics from the IRS |
| 78 | + \item Unemployment Insurance claims and benefits from the Department of Labor |
| 79 | +\end{itemize} |
| 80 | + |
| 81 | +\subsection{Variable Harmonization} |
| 82 | + |
| 83 | +A crucial preparatory step is harmonizing variables across datasets. We develop a detailed crosswalk between CPS and PUF variables, accounting for definitional differences. Key considerations include: |
| 84 | +\begin{itemize} |
| 85 | + \item Income timing (calendar year vs.\ tax year) |
| 86 | + \item Income classification (e.g., business vs.\ wage income) |
| 87 | + \item Geographic definitions |
| 88 | + \item Family relationship categories |
| 89 | +\end{itemize} |
| 90 | + |
| 91 | +For some variables, direct correspondence is impossible, requiring imputation strategies described in the methodology section. The complete variable crosswalk is available in our open-source repository. |
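To illustrate the idea of a crosswalk, the Python sketch below maps harmonized variable names to a column in each source and uses `None` where no direct counterpart exists. All column names (`cps_wages`, `puf_itemized`, and so on) and the helper functions are hypothetical. The actual crosswalk in the repository may be organized quite differently.

```python
# Hypothetical crosswalk: harmonized name -> (CPS column, PUF column).
# None marks a variable with no direct counterpart in that source.
CROSSWALK = {
    "wage_income": ("cps_wages", "puf_wages"),
    "itemized_deductions": (None, "puf_itemized"),
    "snap_participation": ("cps_snap", None),
}

SOURCE_INDEX = {"cps": 0, "puf": 1}


def harmonize(record, source):
    """Rename one record's fields to harmonized names.

    Variables without a counterpart in `source` are set to None,
    flagging them for later imputation.
    """
    index = SOURCE_INDEX[source]
    out = {}
    for name, columns in CROSSWALK.items():
        column = columns[index]
        out[name] = record.get(column) if column is not None else None
    return out


def needs_imputation(source):
    """List harmonized variables that `source` cannot supply directly."""
    index = SOURCE_INDEX[source]
    return [name for name, cols in CROSSWALK.items() if cols[index] is None]


cps_row = harmonize({"cps_wages": 40_000.0, "cps_snap": True}, "cps")
```

Keeping the gaps explicit in the mapping, rather than silently dropping those variables, makes it straightforward to list what must be imputed for each source.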