Researchers have developed logical, demographic, and statistical strategies for imputing immigrants’ legal status, but these methods have never been empirically assessed. We used Monte Carlo simulations to test whether, and under what conditions, legal status imputation approaches yield unbiased estimates of the association of unauthorized status with health insurance coverage. We tested five methods under a range of missing data scenarios. Logical and demographic imputation methods yielded biased estimates across all missing data scenarios. Statistical imputation approaches yielded unbiased estimates only when unauthorized status was jointly observed with insurance coverage; when this condition was not met, these methods overestimated insurance coverage for unauthorized relative to legal immigrants. We next showed how bias can be reduced by incorporating prior information about unauthorized immigrants. Finally, we demonstrated the utility of the best-performing statistical method for increasing power. We used it to produce state/regional estimates of insurance coverage among unauthorized immigrants in the Current Population Survey, a data source that contains no direct measures of immigrants’ legal status. We conclude that commonly employed legal status imputation approaches are likely to produce biased estimates, but data and statistical methods exist that could substantially reduce these biases.
“Interest in immigrants’ socioeconomic characteristics from scientists, policy-makers, and the public has run ahead of the availability of the data to address these interests. The most serious omission from data sets is information on legal status. . . .”
—Clark and King (2008:295)
The capacity of immigration scholars to produce research of major social and policy significance remains hampered by a lack of population and survey data that identify immigrants’ legal status. Large-scale, nationally representative surveys that are most commonly used to study the foreign-born population, such as the American Community Survey (ACS), distinguish between naturalized citizens and noncitizens, but they do not inquire about the legal status of the latter. Surveys that have included such measures are typically small, regionally targeted, and/or focused on a particular subpopulation of immigrants (Bachmeier et al. 2014). As a result, important questions (such as the extent to which unauthorized status threatens the well-being of immigrant families, the role undocumented immigrants play in the labor market, and their economic and fiscal impacts) remain largely unaddressed (Clark and King 2008; Clark et al. 2009; Massey and Bartley 2005).
Faced with these data limitations, researchers have developed logical, demographic, and statistically based strategies for imputing the legal status of immigrants in the aforementioned nationally representative surveys (e.g., Batalova et al. 2014; Heer and Passel 1987; Marcelli 2004; Marcelli and Heer 1997, 1998; Passel and Cohn 2009; State Health Access Data Assistance Center 2013). At present, such methods are the only means through which much-needed avenues of research on legal status can be opened. However, the conditions under which these methods yield unbiased estimates of the characteristics of the unauthorized foreign-born population have never been tested.
We address this question using Monte Carlo simulations based on the Survey of Income and Program Participation (SIPP), an underutilized source of information on the foreign-born population and the only nationally representative survey with questions about immigrants’ legal status. We provide empirical tests of whether, and under what conditions, it is possible to impute legal status on the basis of commonly available socioeconomic and demographic survey items: to spin straw into gold, so to speak. We first review existing legal status imputation methods, including a recently developed method that employs multiple imputation using pooled survey samples. We subsequently present simulation results that compare the various approaches with respect to the degree that they yield unbiased estimates of the association of unauthorized status with insurance coverage, an important predictor of access to health care, and thus a potential source of cumulative disadvantage (e.g., Bustamante et al. 2012; Javier et al. 2010; Ku 2009; Sommers 2013; Stevens et al. 2010). Although insurance coverage serves primarily as an example used to test the various imputation methods, our analyses do provide new estimates of the level and geographic distribution of coverage among unauthorized immigrants.
The results show that it is not possible to spin straw into gold. All the approaches that we tested produced biased estimates. Some methods failed in all circumstances, and others failed only when the “joint observation” condition was not met, meaning that the imputation method was not informed by the association of unauthorized status with the dependent variable. Nevertheless, we also show that these methods could be improved if external (“prior”) information about legal status were available. Additionally, in an example using the Current Population Survey (CPS), we demonstrate the utility of the best-performing method for increasing statistical power when the joint observation condition is met.
Background: Methods for Measuring Immigrants’ Legal Status
Several panels of the SIPP (1996, 2001, 2004, and 2008) asked foreign-born respondents their immigration status when they entered the United States and whether they had since adjusted their status (U.S. Census Bureau 2013). Assessments of the quality of the legal status data from the SIPP further reveal that they are likely to produce an accurate portrayal of the unauthorized population. Despite moderately high levels of missing data, the demographic characteristics of the unauthorized immigrants in the SIPP closely match residual estimates produced by the Department of Homeland Security (DHS) and the Pew Hispanic Center (Pew) (Bachmeier et al. 2014).
However, the SIPP is not appropriate for all research questions involving unauthorized migration. Its sample is too small for some types of analyses (e.g., state-level). Additionally, although the SIPP includes detailed information about income, poverty, and public assistance receipt, it provides much less information about health, education, and fertility, all of which are highly significant topics. Other surveys collect data on legal status, but none are nationally representative. They are limited to a specific metropolitan area (e.g., Los Angeles Family and Neighborhood Study, Los Angeles County Mexican Immigrant Residency Status Survey, Los Angeles County Household Survey), state (e.g., California Health Interview Survey), occupational group (e.g., National Agricultural Workers Survey), immigrants who legalized (e.g., Legalized Population Survey, New Immigrant Survey), or immigrants who returned to Mexico (e.g., Mexican Migration Project).
To compensate for the scarcity of survey data on legal status, several researchers have developed creative ways to impute legal status in surveys lacking such measures. In the most basic terms, imputation methods involve assigning legal status to immigrants in a survey sample lacking measures of legal status on the basis of information provided by outside knowledge about the characteristics of legal immigrants or independent data sources, such as a survey with direct measures of legal status. Most of these methods do not account for uncertainty but rather treat imputed values as if they were true.
The most widely publicized imputation-based results are those produced by Passel and published by the Pew Hispanic Center. Originally developed by Passel and Clark and published by the Urban Institute (Passel and Clark 1998), this method imputes legal status for foreign-born respondents of the CPS in order to produce detailed descriptive profiles of the unauthorized population, such as poverty rates, unemployment rates, educational attainment, and occupational composition (e.g., Passel 2006; Passel and Cohn 2009).
The method on which Pew Hispanic Center estimates are based combines a variety of techniques—logical imputation, statistical imputation, and weighting adjustments—to assign legal status. Because it attempts to match legal status assignments with external information about immigration policy and residual (demographic) estimates of the unauthorized population, we classify it as a “demographic accounting method.” This method first identifies those who are very likely to be legally resident on the basis of indicators of legality, such as U.S. citizens, veterans, and those in occupations that make it nearly impossible for them to be unauthorized.1 The remaining noncitizens are the pool of potentially unauthorized immigrants: a group that contains a mixture of legal noncitizens and unauthorized immigrants. To further distinguish the unauthorized from legal noncitizens, the method assigns legal status to the remaining noncitizens based on an estimated probability of being unauthorized, which is calculated from the occupational distribution by age, sex, and state of residence of unauthorized immigrants in the Legalized Population Survey (LPS). The LPS is a 1989 survey of those who applied for legalization under the main provisions of the 1986 Immigration Reform and Control Act (IRCA). After additional data editing to ensure that the status assignments correspond with U.S. immigration law for families, the sampling weights of respondents are adjusted to match control totals, derived from residual estimates (e.g., Passel 2006) of the number of unauthorized immigrants by state and national origin.
Demographic accounting estimates, particularly those produced by Pew/Passel, are based on the meticulous application of demographic methods and have come to be trusted and widely cited outside of academia. However, the method has never been evaluated. The specific details of the Pew/Passel method are not publicly available, thus making it difficult for other researchers to replicate the method. Beyond the lack of transparency, its reliance on LPS data raises concern largely because the LPS was collected more than two decades ago, and it represents only the unauthorized who applied for legalization under the nonagricultural worker provisions of IRCA. The LPS thus excludes those who qualified for the other major legalization program under IRCA (the Special Agricultural Workers program), and it overrepresents Mexicans and those who arrived in the United States before 1982, and who were then concentrated in the American Southwest to a far greater degree than is true today (Durand et al. 2005).
Others have used survey-based statistical imputation to assign unauthorized status (Caponi and Plesca 2014; Capps et al. 2013; Heer and Passel 1987; Marcelli and Heer 1997, 1998). Statistical imputations use the associations between a set of predictors and unauthorized status from a survey that includes questions about immigrants’ legal status (the donor sample) to assign legal status to foreign-born respondents in surveys lacking such measures (the target sample). Statistical imputations have employed both single and multiple imputation approaches as the basis for prediction. Heer and Passel (1987) were among the first to use a single-imputation approach. In subsequent developments, Marcelli and Heer (1997) estimated a logistic regression model predicting unauthorized status as a function of duration of U.S. residence, educational attainment, age, and sex in the 1994 Los Angeles County Household Survey (the donor sample). They then used this model to estimate the predicted probability of being unauthorized among immigrants in the Los Angeles County 1990 Census (the target sample). Finally, they used the predicted probabilities in subsequent analyses that estimated the relationship between unauthorized status and labor force and welfare outcomes (Marcelli and Heer 1997, 1998).
In contrast to the single-imputation method, researchers have begun to employ cross-survey multiple imputation to impute variables that are completely missing in one data set but observed in another (Rässler 2004; Rendall et al. 2013; Resche-Rigon et al. 2013; Schenker et al. 2010). Multiple-imputation approaches are preferred because they account for the uncertainty in imputed data (Little and Rubin 2002). This approach has recently been used to impute legal status. To produce a profile of health insurance coverage and other social and economic characteristics among the unauthorized, Capps et al. (2013) pooled the SIPP (the donor sample) with the ACS (the target sample), and multiply imputed unauthorized status for all foreign-born observations in the ACS. A Minnesota policy analysis group (State Health Access Data Assistance Center 2013) used a similar approach in examining immigrants’ access to health insurance coverage.
As long as a suitable donor sample is available (such as the SIPP), statistical imputation methods can be readily replicated, unlike the detailed algorithms involved in logical and demographic accounting methods. However, the bias and precision of the resulting estimates remain unclear. Importantly, Rendall and his colleagues (2013) argued that the success of the cross-survey multiple-imputation method depends on two conditions. First, the target and donor samples must be drawn from the same universe. In other words, the population-level associations producing the donor sample should be statistically identical in the target sample. Second, to avoid identification problems, every pair of variables must be jointly observed in one data set or the other to enable the estimation of the covariance for all pairs. Taking an example developed by Rässler (2004), say we have two data sets. Variables X and Z are observed in the first, Z and Y are observed in the second, and X and Y are never jointly observed. If the correlation between Z and X is .9, and that between Z and Y is .8, then the estimated correlation between X and Y is mathematically bounded between 0.4585 and 0.9815—a wide range. Without additional information, no value in this range is a better estimate than another. Additionally, Rodgers (1984) showed that only very high correlations approaching 1.0 will narrow the range considerably.
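The bounds in the Rässler (2004) example can be checked directly. The sketch below assumes the standard result that a valid (positive semidefinite) correlation matrix for X, Y, and Z restricts the X–Y correlation to the product of the two observed correlations plus or minus the square root of (1 − r_XZ²)(1 − r_ZY²). The function name is ours and purely illustrative.

```python
import math


def correlation_bounds(r_xz, r_zy):
    """Return the (low, high) range of corr(X, Y) implied by corr(X, Z) and corr(Z, Y).

    The range follows from requiring the 3x3 correlation matrix of X, Y, Z
    to be positive semidefinite.
    """
    center = r_xz * r_zy
    half_width = math.sqrt((1 - r_xz ** 2) * (1 - r_zy ** 2))
    return center - half_width, center + half_width


low, high = correlation_bounds(0.9, 0.8)
print(f"{low:.4f} to {high:.4f}")  # 0.4585 to 0.9815, the range quoted in the text
```

Calling the same function with both correlations close to 1.0 shows the narrowing that Rodgers (1984) described.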
When applied to legal status imputations, the joint observation requirement effectively limits the analytic variables to those that are jointly observed with legal status. For example, if an analyst were interested in estimating the effects of unauthorized status on insurance coverage, and if legal status were completely missing in the target sample, then insurance coverage must be observed in both the donor and target samples. If insurance coverage were completely missing in the donor data, then legal status and insurance coverage would never be jointly observed. If both the same universe and joint observation conditions must always be met, this would cast doubt on methods that violate them, including most imputation methods employed in past research.
Here, we evaluate the prevailing approaches to imputing legal status. We do not attempt to replicate and evaluate specific imputation methods, such as the precise methodology from which Pew Hispanic Center estimates are derived, largely because such methods change over time as researchers refine their methodologies and data inputs—and, as noted, they can be difficult to replicate. Rather, we evaluate and compare five general approaches (explained in the Imputation Methods section). We tested multiple variations of each of these methods in preliminary analyses, but due to space constraints, we present the results for only the best-performing variants.
We conducted Monte Carlo simulations to evaluate whether, and under what conditions, estimates of the association between imputed unauthorized status and insurance coverage are unbiased. By varying the imputation method, we identified the best-performing method. We altered the missing data patterns in the simulation data to assess the performance of the methods when the joint observation condition is not met. We further assessed how much the methods would improve if prior information about immigrants’ legal status were available beyond that already included in most demographic surveys, whether through administrative record linkages, new survey questions, or information from an auxiliary survey.
Throughout, we assessed the robustness of the results across different dependent variables by varying, in simulated data, the magnitude of the association between unauthorized status and health insurance coverage. Imputation methods may perform well when the association between unauthorized status and the dependent variable is consistent with socioeconomic and demographic characteristics (e.g., the unauthorized have lower levels of insurance coverage than legal immigrants, which is consistent with their lower levels of education and income). However, imputation methods may be less able to detect “surprises,” such as when unauthorized immigrants exhibit unique or exceptional outcomes.
We used the SIPP as a basis for generating data and establishing true population values for the simulations. The SIPP is a longitudinal survey of the U.S. noninstitutionalized population conducted by the U.S. Census Bureau (2013). Every few years, the SIPP draws a new panel of households (i.e., 1996, 2001, 2004, and 2008). All individuals in these households are then followed up every four months for three to four years. Panel respondents in each wave are asked a set of core questions primarily about labor force activity, income, and program participation. In addition, respondents are administered wave-specific topical modules. In all panels from 1996–2008, including the 2004 panel on which we rely for our simulations, the second wave of data collection includes a series of questions about migration, which includes questions about country of birth, year of arrival, citizenship, and visa status. Although SIPP is longitudinal, each wave can be weighted with cross-sectional weights to represent the current U.S. population. The final SIPP sample from which our simulated data were generated was restricted to 8,898 foreign-born respondents age 16 and older who were interviewed in Wave 2 of the 2004 panel. Children as well as persons born abroad to U.S. citizen parents were excluded.
The weighted means for all analytic variables from the SIPP are shown in Table 1 for the total foreign-born sample in the SIPP as well as separately by three legal status designations—“probably legal,” “ambiguous,” and “unauthorized”—the definitions for which are provided in the Imputation Methods section. Further description of the two simulated insurance coverage variables—insurance 2 and 3—is also provided later herein.
Unauthorized Legal Status
The key independent variable used in our analyses is a dichotomous indicator of unauthorized legal status (=1 if unauthorized). The SIPP asked questions about immigrants’ legal status at the second interview. Foreign-born respondents were asked whether they were citizens, and all noncitizens were asked about their status upon arrival. Immigrants could select one of six arrival statuses: three types of legal permanent resident (LPR) status; and three non–LPR statuses, including refugee/asylee status, legal nonimmigrants (e.g., student or tourist visas), and “other.” Finally, noncitizen, non-LPR arrivals were asked whether they have adjusted to LPR status since first immigrating. Following others (Greenman and Hall 2013; Hall et al. 2010), we infer that the group of persons arriving with “other” status and who have not adjusted to LPR status overwhelmingly consists of unauthorized immigrants. To address data handling challenges presented both by relatively high rates of missing data on immigration-related items in the SIPP and by the suppression by the Census Bureau of detailed visa status categories in the arrival status item in the public-use data, we have employed similar methods reported in previous research (Bachmeier et al. 2014; Greenman and Hall 2013; Hall et al. 2010).
The dependent variable, “insurance 1,” is a dichotomous indicator of insurance coverage (employer, other private, Medicaid, and other public = 1).
All models of insurance coverage included the following controls: income-to-poverty ratio (logged), educational attainment (years), Mexican place of birth, years of U.S. residence, age, sex, number of functional limitations, and self-rated health (fair/poor vs. better health).
We conducted Monte Carlo simulations to evaluate the bias of the estimate of the association of imputed unauthorized status with the outcome (health insurance coverage) under different scenarios. For the simulation exercises, we assumed that the true association (i.e., the expected population value) of unauthorized status with insurance coverage is the association observed in the SIPP. Bias is the difference between the true and estimated association. The validity of the SIPP-based measure of legal status (or how close the SIPP-based measure comes to reality) is also an important question, but we set it aside here because it is not central to our question concerning bias and because we have already addressed it in another article (Bachmeier et al. 2014).
Each simulation involved three steps. First, we drew 10,000 cases with replacement from a self-weighted version of the 2004 SIPP (i.e., expanded in proportion to sampling weights). Second, we randomly divided the sample in half, with the first half representing the donor sample, and the second half representing the target sample. This randomization ensured that the donor and target samples are drawn from the same universe. We coded unauthorized status to missing in the target sample, and for some simulations, we coded the dependent variable and/or a key independent variable (the income-to-poverty ratio) to missing in the donor sample. Both samples included a common set of control variables. The data structure of the donor and target samples is illustrated in Table 2. Third, we imputed unauthorized status in the target sample using one of five imputation methods, and then estimated a multivariate model predicting insurance coverage as a function of imputed unauthorized status and controls on cases in the target sample.
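The first two steps can be sketched as follows. This is a minimal illustration, not our production code: it assumes the population has already been expanded to a self-weighted list of records, and the field names (`unauthorized`, `insured`) and the toy records are hypothetical.

```python
import random


def draw_donor_and_target(population, n_cases=10_000, seed=0):
    """Draw cases with replacement, then split them at random into donor and target halves."""
    rng = random.Random(seed)
    # Step 1: sample with replacement; copy each record so later edits stay local.
    sample = [dict(rng.choice(population)) for _ in range(n_cases)]
    # Step 2: random split, so both halves come from the same universe.
    rng.shuffle(sample)
    half = n_cases // 2
    donor, target = sample[:half], sample[half:]
    # Legal status is treated as unobserved in the target sample.
    for row in target:
        row["unauthorized"] = None
    return donor, target


# Hypothetical toy population standing in for the self-weighted SIPP file.
toy_population = [
    {"unauthorized": 1, "insured": 0},
    {"unauthorized": 0, "insured": 1},
    {"unauthorized": 0, "insured": 0},
]
donor, target = draw_donor_and_target(toy_population)
print(len(donor), len(target))
```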
For each simulation, we repeated all three steps 500 times to estimate the coefficient and standard error for imputed unauthorized status (i.e., the average of the coefficient and standard error across the 500 replications). We estimated bias as the average difference between the estimated coefficient and the expected population value; we estimated relative bias as bias divided by the expected value. The level of acceptable bias varies by application. As a rule of thumb, we favor methods that produce estimates with lower bias and relative bias, and we describe estimates as “unbiased” if they fall within 10 % of the expected value (i.e., relative bias falls in the range of –.10 to .10).
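The bias summaries defined above reduce to a few lines of arithmetic. In the sketch below, the replicate estimates are invented for illustration; only the expected value of -0.152 (the SIPP coefficient for “insurance 1”) comes from our results.

```python
from statistics import mean


def bias_summary(estimates, expected, tolerance=0.10):
    """Return (bias, relative bias, whether relative bias is within the tolerance)."""
    bias = mean(est - expected for est in estimates)
    relative_bias = bias / expected
    return bias, relative_bias, abs(relative_bias) <= tolerance


# Hypothetical coefficients from three replications (a real run would have 500).
replicate_estimates = [-0.150, -0.160, -0.149]
bias, relative_bias, unbiased = bias_summary(replicate_estimates, expected=-0.152)
print(round(bias, 4), round(relative_bias, 4), unbiased)
```

Because relative bias divides by the expected value, a negative relative bias with a negative expected coefficient means the estimate is too close to zero.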
As summarized in Table 3, we varied the imputation methodology to evaluate the comparative performance of a set of approaches under a range of conditions.
The logical-imputation method is similar to the first part of the Pew/Passel method. We tested it separately from the demographic accounting method because it is sometimes used to proxy legal status in policy analyses (e.g., Bohn et al. 2014; Bozick and Miller 2014; Flores 2010; Kaushal 2006; Potochnick 2014). It codes as legal those in the target data who have characteristics that make it very unlikely they are unauthorized (i.e., those who are “probably legal”), and all others as unauthorized. In our simulations, the “probably legal” included U.S. citizens and others with indicators of legality, such as employment by the U.S. government, a history of military service, or receipt of Social Security income; we used indicators of legality similar to those of the Pew/Passel method (see footnote 1 for a complete list). These indicators are measured in most major demographic and health surveys (e.g., ACS, CPS, NHIS). As already shown in Table 1, roughly one-half (51 %) of all foreign-born in the SIPP can be logically imputed as legal by these criteria. Among the remaining unclassified foreign-born respondents, one-half are actually unauthorized (24 % of all foreign-born), and the rest are neither unauthorized nor have characteristics that signify legality and therefore have “ambiguous” status.
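The logical rule is simple enough to state in code. The sketch below uses only the four indicators named in the text, not the complete list in footnote 1, and the field names are hypothetical.

```python
# Illustrative subset of legality indicators; field names are assumptions.
LEGALITY_INDICATORS = ("citizen", "federal_employee", "veteran", "social_security_income")


def logical_imputation(row):
    """Return 0 (legal) if any legality indicator is present, else 1 (unauthorized)."""
    probably_legal = any(row.get(flag, False) for flag in LEGALITY_INDICATORS)
    return 0 if probably_legal else 1


naturalized = {"citizen": True}
veteran_noncitizen = {"citizen": False, "veteran": True}
no_indicators = {"citizen": False}
print([logical_imputation(r) for r in (naturalized, veteran_noncitizen, no_indicators)])
```

The last record illustrates the weakness of the rule: every respondent with “ambiguous” status is coded as unauthorized, whether or not that is true.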
The demographic accounting method has similarities to the full Pew/Passel method in that it combines elements of the logical- and single-imputation methods, and its estimates are forced to match target values of the percentage of unauthorized immigrants. Those classified as “probably legal” in the target sample were coded as legal (51 %). Among the remaining foreign-born, the single-imputation method was employed to assign unauthorized status. To do this, we estimated a logistic regression model predicting unauthorized status as a function of several regressors2 in the donor data source. Importantly, the dependent variable (insurance coverage) was included in the prediction model in simulations in which the dependent variable was jointly observed with unauthorized status. The estimated coefficients were then applied to immigrants in the target data to derive for each person a predicted probability of being unauthorized. Each individual’s predicted probability was compared with a random draw from a uniform distribution: if the predicted probability was greater than the random draw, the individual was assigned unauthorized status. This assignment process continued until a targeted percentage (24 %, or about one-half of the non–probably legal, as observed in the SIPP) was coded as unauthorized.3
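One possible reading of the target-matching step is sketched below. It is an assumption-laden illustration rather than the exact procedure: it supposes that the remaining cases are visited in random order, that passes repeat until the target count is reached, and that a cap on passes (`max_passes`, our invention) guards against non-termination.

```python
import random


def assign_to_target(probabilities, probably_legal, target_share, seed=0, max_passes=1000):
    """Assign unauthorized status (1) among non-probably-legal cases until a target share is met."""
    rng = random.Random(seed)
    n = len(probabilities)
    goal = round(target_share * n)
    status = [0] * n  # the probably legal stay coded 0 (legal)
    candidates = [i for i in range(n) if not probably_legal[i]]
    assigned = 0
    for _ in range(max_passes):
        rng.shuffle(candidates)
        for i in candidates:
            if assigned >= goal:
                return status
            # Assign when the predicted probability exceeds a uniform draw.
            if status[i] == 0 and probabilities[i] > rng.random():
                status[i] = 1
                assigned += 1
    return status


# Hypothetical sample: 51 probably legal cases, 49 others, all with probability 0.5.
flags = [True] * 51 + [False] * 49
status = assign_to_target([0.5] * 100, flags, target_share=0.24)
print(sum(status))
```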
The single-imputation method is similar to the approach taken by Heer and Marcelli (Marcelli 2004; Marcelli and Heer 1997, 1998), and more recently by Caponi and Plesca (2014). A logistic regression model predicting unauthorized status was estimated on the donor data, using the same predictors as for the demographic accounting method. In the target data, foreign-born persons were coded as unauthorized if a random draw from a uniform distribution was less than their predicted probability of being unauthorized, and all others were coded as legal. Unlike the demographic accounting method, no attempt was made to first code people as “probably legal,” and the percentage assigned as unauthorized was derived from the percentage unauthorized in the donor data, not from a predetermined target.
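The assignment rule can be sketched as follows, assuming the logistic coefficients have already been estimated on the donor data. The coefficient values and the predictor name below are placeholders, not estimates from the SIPP.

```python
import math
import random


def predicted_probability(intercept, coefs, row):
    """Logistic prediction: inverse logit of the linear predictor."""
    linear = intercept + sum(coefs[name] * row[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-linear))


def impute_single(rows, intercept, coefs, seed=0):
    """Code a case as unauthorized (1) when a uniform draw is less than its predicted probability."""
    rng = random.Random(seed)
    return [
        1 if rng.random() < predicted_probability(intercept, coefs, row) else 0
        for row in rows
    ]


# Placeholder coefficients and target records, for illustration only.
coefs = {"years_in_us": -0.1}
target_rows = [{"years_in_us": 2}, {"years_in_us": 30}]
print(impute_single(target_rows, intercept=0.5, coefs=coefs))
```

Because each case receives one draw that is then treated as its true status, downstream standard errors ignore the uncertainty in the assignment.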
The cross-survey multiple-imputation (CSMI) method is similar to the approach taken by Capps et al. (2013). It pools the donor and target samples and treats the absence of an unauthorized status indicator in the latter as a missing data problem to be addressed by multiple-imputation techniques. Specifically, missing values in the target data were imputed using multiple chained equations (StataCorp 2013). Following common practice (Rubin 1987), 10 data sets were created, and results were summarized using the mi routines available in Stata version 12 or higher. We used the same predictors as in the single-imputation method to inform the imputations.
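Summarizing results across imputed data sets follows Rubin's (1987) combining rules, which a few lines of Python make explicit. This is a sketch of the pooling step only, not of the chained-equations imputation itself; the coefficients and standard errors are invented, and three data sets are shown instead of 10 for brevity.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine per-data-set estimates into one estimate and standard error."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficients and standard errors from three imputed data sets.
pooled, pooled_se = pool_rubin([-0.15, -0.16, -0.14], [0.02, 0.02, 0.02])
print(round(pooled, 3), round(pooled_se, 4))
```

The between-imputation term is what makes the pooled standard error larger than the standard error from any single completed data set.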
Finally, the logical cross-survey multiple-imputation (logical-CSMI) method combines elements of three methods. We tested this approach because we wondered whether the multiple-imputation method could be improved if it were informed by outside (i.e., logical) information about legal status. As with the CSMI method, the donor and target samples were pooled, and missing data were imputed with multiple chained equation techniques. In addition to the predictors used by the other methods, the imputation model was informed by the respondents’ classification as “probably legal” and predicted probability of being unauthorized (based on a single prediction model). Specifically, the predicted probability was coded to 0 for those who are “probably legal” and was then included as a predictor in the imputation model.4 Like the single-imputation and CSMI methods, the percentage assigned as unauthorized was derived from the donor data, not a predetermined target.
Joint Observation of the Dependent and Independent Variables With Unauthorized Status
We assessed whether bias depends on whether (1) the dependent variable, (2) an important independent variable (income-to-poverty ratio), and (3) both the dependent and independent variables are jointly observed with unauthorized status. In simulations in which these variables are treated as jointly observed, they are observed along with unauthorized status in the donor data set, as shown in Table 2. In simulations in which they are never jointly observed, we recoded them to missing in the donor sample prior to carrying out the imputation.
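The four missing data patterns can be expressed as a small lookup that blanks out the relevant donor variables. The scenario labels match those used in the results; the field names `insured` and `income_to_poverty` are hypothetical.

```python
# Variables recoded to missing in the donor sample under each scenario.
SCENARIOS = {
    "DV and IV jointly observed": (),
    "DV jointly observed, IV never jointly observed": ("income_to_poverty",),
    "DV never jointly observed, IV jointly observed": ("insured",),
    "DV and IV never jointly observed": ("insured", "income_to_poverty"),
}


def apply_scenario(donor_rows, scenario):
    """Return copies of the donor records with the scenario's variables set to None."""
    hidden = SCENARIOS[scenario]
    return [
        {key: (None if key in hidden else value) for key, value in row.items()}
        for row in donor_rows
    ]


donor_rows = [{"unauthorized": 1, "insured": 0, "income_to_poverty": 1.2}]
masked = apply_scenario(donor_rows, "DV never jointly observed, IV jointly observed")
print(masked)
```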
Strength of Association Between Unauthorized Status and Insurance Coverage
Robust imputation methods should produce unbiased estimates of the association of unauthorized status with the dependent variable regardless of how surprising the result is. To assess the robustness of the methods, we created two variants of insurance coverage, one with weak (“insurance 2”) and another with strong (“insurance 3”) associations with unauthorized status; we used these along with the original measure of insurance coverage as dependent variables in our models. We created the variants by recoding insurance coverage among those with ambiguous legal status (i.e., legal noncitizens in the donor sample who do not have characteristics of the “probably legal” population). As observed in the SIPP, insurance coverage is lowest for unauthorized immigrants, higher among those with “ambiguous” status, and highest among those who are “probably legal.” To weaken the association, we randomly reduced coverage among “ambiguous” immigrants, thus reducing the insurance gap between the unauthorized and all legal immigrants (see Table 1). To strengthen the association, we randomly increased coverage among “ambiguous” immigrants, thus increasing the gap.
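The recoding can be sketched as below. The share of “ambiguous” cases recoded (`flip_share`) is a hypothetical parameter chosen for illustration, as are the field names and the toy records; the sketch shows only the direction of each manipulation.

```python
import random


def recode_ambiguous(rows, direction, flip_share=0.3, seed=0):
    """Randomly lower ("weaken") or raise ("strengthen") coverage among ambiguous cases."""
    rng = random.Random(seed)
    new_values = []
    for row in rows:
        value = row["insured"]
        if row["status"] == "ambiguous" and rng.random() < flip_share:
            value = 0 if direction == "weaken" else 1
        new_values.append(value)
    return new_values


# Toy records: 1,000 ambiguous cases (half insured) and 200 uninsured unauthorized cases.
rows = (
    [{"status": "ambiguous", "insured": 1}] * 500
    + [{"status": "ambiguous", "insured": 0}] * 500
    + [{"status": "unauthorized", "insured": 0}] * 200
)
weak = recode_ambiguous(rows, "weaken")          # analogous to "insurance 2"
strong = recode_ambiguous(rows, "strengthen")    # analogous to "insurance 3"
print(sum(weak[:1000]), sum(strong[:1000]))
```

Only the “ambiguous” group is altered, so coverage among the unauthorized and the “probably legal” is the same in all three versions of the outcome.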
As noted earlier, the associations observed in the SIPP represent the expected population values. The first row of Table 4 reports ordinary least squares (OLS) coefficients from the model estimated on the SIPP data. Because insurance coverage is treated continuously, the interpretation of coefficients is that of a linear probability model. In the case of “insurance 1” (observed), the coefficient for unauthorized status is –0.152, indicating that insurance coverage is about 15 percentage points lower among the unauthorized than the authorized after accounting for the control variables. As designed, unauthorized status is more weakly associated with “insurance 2” (β = –0.058) and more strongly associated with “insurance 3” (β = –0.236).
The simulations presented in Table 4 assess the imputation methods under optimal circumstances: that is, when all variables are jointly observed with unauthorized status. For this scenario, both the logical and demographic accounting methods produce biased estimates for two of the three dependent variables. The logical-imputation method completely fails to pick up variations in the association between unauthorized status and the dependent variables, estimating the strongest association where the expected association is weakest (“insurance 2”; relative bias = 3.259), and the weakest association where the expected association is strongest (“insurance 3”; relative bias = –0.743). The demographic accounting method does not perform much better. For example, the bias in the model predicting “insurance 2” is –0.10 (relative bias = 1.757).
In contrast, the single-imputation, CSMI, and logical-CSMI methods yield virtually unbiased estimates for all three dependent variables, with bias never exceeding ±5.5 % of the expected value. The two methods using multiple imputation (CSMI and logical-CSMI) yield larger standard errors than the single-imputation method because single imputation treats imputed values as true and therefore underestimates the true variance (Little and Rubin 2002).
We next evaluate the effects of violating the joint observation assumption. As shown in Table 5, the logical and demographic accounting methods are less sensitive to the missing data pattern than the other methods because they do not rely heavily on the missing variables to impute legal status (and logical imputation does not rely on them at all). Nevertheless, both methods produce coefficients with large biases across all missing data scenarios for at least two of the three dependent variables.
In contrast, the single-imputation, CSMI, and logical-CSMI methods are sensitive to the missing data pattern. For these methods, the estimates are virtually unbiased when the dependent variable is jointly observed with unauthorized status (either “DV and IV jointly observed” or “DV jointly observed, IV never jointly observed”). Even when a key independent variable (income-to-poverty ratio) is missing from the donor data, bias remains somewhat low and less than ±20 % of the expected value. Supplementary analyses suggest that this holds only when the imputation model is “inclusive” (i.e., containing “everything but the kitchen sink”). When the imputation model is parsimonious and includes only the variables used in the model of insurance coverage, the estimates are more biased (results available upon request), which is consistent with Collins et al. (2001). Finally, when the dependent variable is not jointly observed with unauthorized status in the donor data (either “DV never jointly observed, IV jointly observed” or “DV and IV never jointly observed”), bias is much greater than for the other missing data scenarios—in one case, exceeding 80 % of the expected value. For “insurance 1” and “insurance 3” (for which the expected coefficient is large and negative), the coefficients overestimate health coverage for unauthorized immigrants relative to legal immigrants.
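The mechanism behind this pattern can be sketched with a toy simulation. This is not the simulation design used in this article: it assumes a single binary covariate (`x`, low income), made-up population parameters, and a simplified imputation that draws status from donor cell shares treated as known, rather than a full multiple-imputation model. All function and variable names are hypothetical.

```python
import random
from collections import defaultdict

def simulate(n, rng):
    """Draw (x, u, y) triples: x = low income, u = unauthorized, y = insured."""
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        u = int(rng.random() < (0.6 if x else 0.2))
        y = int(rng.random() < 0.8 - 0.2 * x - 0.2 * u)
        rows.append((x, u, y))
    return rows

def donor_shares(donor, key):
    """Share unauthorized within each donor cell defined by key(x, y)."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [unauthorized count, cell size]
    for x, u, y in donor:
        cell = tally[key(x, y)]
        cell[0] += u
        cell[1] += 1
    return {k: hits / size for k, (hits, size) in tally.items()}

def gap(pairs):
    """Coverage among status 1 minus coverage among status 0."""
    insured = {0: [], 1: []}
    for status, y in pairs:
        insured[status].append(y)
    return sum(insured[1]) / len(insured[1]) - sum(insured[0]) / len(insured[0])

def imputed_gap(target, shares, key, rng, m=5):
    """Average coverage gap over m stochastic imputations of status."""
    gaps = []
    for _ in range(m):
        draws = [(int(rng.random() < shares[key(x, y)]), y) for x, _, y in target]
        gaps.append(gap(draws))
    return sum(gaps) / m

rng = random.Random(0)
donor = simulate(20000, rng)
target = simulate(20000, rng)  # true status is kept only to compute the benchmark

without_dv = lambda x, y: x      # imputation ignores the dependent variable
with_dv = lambda x, y: (x, y)    # imputation uses the dependent variable

true_gap = gap([(u, y) for _, u, y in target])
gap_without_dv = imputed_gap(target, donor_shares(donor, without_dv), without_dv, rng)
gap_with_dv = imputed_gap(target, donor_shares(donor, with_dv), with_dv, rng)

print(f"benchmark gap:            {true_gap:.3f}")
print(f"imputed, DV excluded:     {gap_without_dv:.3f}")
print(f"imputed, DV included:     {gap_with_dv:.3f}")
```

Under these assumed parameters, the population gap is roughly –0.28. When the dependent variable is excluded, imputed status carries only the part of the association that runs through `x`, so the gap shrinks toward roughly –0.12 and coverage among the unauthorized is overstated; including the dependent variable recovers a gap close to the benchmark.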
Using Prior Information to Improve Imputations
The results thus far show that when the joint observation requirement is met, the statistical imputation approaches, particularly those employing multiple imputation, yield unbiased estimates, at least in the particular scenarios we tested. However, when the joint observation requirement is not met, none of the methods produce unbiased estimates across all three outcomes, including the logical-CSMI method and all its variants (see footnote 4). This led us to explore whether the incorporation of prior information into the imputation methods would improve the estimates even when the joint observation condition is not met.
We first considered the methods that rely on logical imputation: the logical, demographic accounting, and logical-CSMI methods. Our data permit us to logically impute about one-half (51 %) of the foreign-born as “probably legal,” but we wondered how much bias would be reduced if a higher percentage were logically imputed (i.e., if additional information enabled us to identify more legal immigrants). Logically imputing a higher percentage would be possible because additional indicators of legality from administrative record linkages are available in restricted data sets (e.g., possession of a valid Social Security number), and new survey questions could be added to surveys. To explore this question, we reran the simulations while randomly increasing the percentage of those with ambiguous status that are classified as “probably legal” to 50 %, 75 %, and 90 % (meaning that the non–“probably legal” group was composed of 66 %, 80 %, and 90 % unauthorized, respectively). We confined our tests to the most problematic scenario in which the dependent variable is never jointly observed with unauthorized status. As shown in Table 6, bias decreases substantially across all methods and dependent variables as the percentage logically imputed increases. Nevertheless, even when as many as 90 % of those with ambiguous status are logically imputed as “probably legal,” relative bias remains moderately high for “insurance 2,” reaching 46.2 % of the expected value in the case of the demographic accounting imputation method, although the absolute bias is low (–0.028).
This approach offers a potential solution to analysts who are unable to meet the joint observation requirement. However, most demographic and health data do not typically include enough indicators of legality to classify such a high proportion as “probably legal.” Additionally, federal statistical agencies appear to have become more, not less, restrictive in their willingness to release sensitive data and administrative record linkages, especially concerning immigrants and their statuses. We therefore explored yet another option available to researchers when the joint observation condition is not initially met. Following Rässler (2004), the CSMI method is likely to yield less-biased estimates if an auxiliary data set with jointly observed measures of unauthorized status and the dependent variable were pooled with the donor and target data sets. Typically, an auxiliary data set is one that contains the necessary variables but may not be drawn from the same universe as the target sample. To illustrate, one might pool the SIPP with a national health survey to estimate the association of legal status with chronic health conditions. Because legal status is measured only in the SIPP and chronic health conditions are measured only in the health survey, the two are never jointly observed. However, one could add an auxiliary data set that includes both unauthorized status and chronic health conditions, such as the California Health Interview Survey (2013), to satisfy the joint observation requirement.
To evaluate this approach, we pooled target and donor data (wherein unauthorized status and the dependent variable are never jointly observed) with a third equal-sized data set containing both variables. In practice, it would be difficult to locate an auxiliary data set that is drawn from the same universe as the other two data sets, so we assessed how the results differ when all three data sets are drawn from the same universe (the SIPP) versus when the auxiliary data set is drawn from a different universe (specifically, from Californians in the SIPP). The results, shown in Table 7, indicate that bias is reduced to nearly 0 when all three data sets are drawn from the same universe. Bias is somewhat higher (but still quite low) when the same-universe assumption is violated. It is likely that the level of bias depends on a variety of factors, such as the number of variables common to the three data sets, the relative sample sizes of the three data sets, and how different their universes are. Given space constraints, identifying and evaluating the impact of these factors extends beyond the scope of this article.
Using the Cross-Survey Multiple Imputation Method to Increase Statistical Power
In this last section, we demonstrate how the CSMI method could be used with currently available public-use data to increase statistical power when the joint observation requirement is met. We selected CSMI over other methods because it yielded less-biased estimates than the logical and demographic accounting methods and is easier to implement than the equally well-performing logical-CSMI method. Although CSMI generally should not be used to examine outcomes that are unobserved in the donor data (i.e., without prior information), it can be used to increase sample size and power, which is extremely valuable for producing estimates by detailed characteristics, such as state of residence, country of birth, or year of entry cohort.
To demonstrate, we applied the CSMI method to actual data, using the 2004 SIPP data as the donor sample and the 2004 March CPS as the target sample. The CPS is conducted throughout the year by the U.S. Census Bureau on approximately 60,000 civilian U.S. housing units; thus, the sample is very large and has the capacity to produce estimates for many states, something that is not feasible with the smaller SIPP sample. The CPS lacks a measure of legal status but includes many of the same variables as the SIPP (including all the predictors and the dependent variable in our example), which we coded identically to the SIPP variables. Thus, the joint observation condition can be met. Additionally, the CPS relies on nearly the same sampling frame as the SIPP. As with the analytical SIPP sample, our final CPS sample was restricted to foreign-born adults age 16 and older, excluding those born abroad of U.S.-born parents (N = 21,214). A comparative profile of the SIPP and CPS foreign-born samples on standard socioeconomic and demographic variables provides support for the same universe assumption. On nearly every dimension of comparison, the samples are virtually identically distributed (results available upon request).
To implement the multiple-imputation method, we pooled the SIPP and CPS, multiply imputed unauthorized status for cases in the CPS on the basis of a large set of predictors (including insurance coverage), and estimated the percentage of unauthorized with insurance coverage by state/region in the CPS data. To demonstrate the sensitivity of the estimates to the specification of the imputation model, we first excluded the dependent variable—insurance coverage—from the predictors in the imputation model (i.e., treating it as if it were never jointly observed). In a second analysis, we included it. Finally, in the third, we estimated the imputation model separately by state/region (i.e., essentially a fully interactive model), thus allowing for interactions between state/region and all other predictors.
Results are reported in Table 8. In the first set of columns, means and standard errors are reported for states and regions in the SIPP in which there are at least 250 observations. For example, as observed in the SIPP, approximately 42 % of unauthorized immigrants in California had health insurance coverage in 2004. The second set of columns reports the corresponding percentage estimated in the CPS when unauthorized status has been imputed without health insurance coverage being included in the model. Despite greater precision signified by lower standard errors, these estimates are significantly different from the SIPP-based estimates in approximately one-half (8 of 14) of the states/regions and tend to overestimate coverage among unauthorized immigrants. For example, the estimate for California derived from the CPS when insurance was not included in the imputation model is fully 10 percentage points higher than that estimated by the SIPP.
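One common way to compare two independent survey estimates is a two-sided z test on their difference, sketched below. The text does not state which test underlies the comparisons in Table 8, so this is an assumption for illustration only. The point estimates follow the California example (42 % in the SIPP and 10 points higher in the CPS); both standard errors are hypothetical, and `z_test_difference` is an illustrative helper.

```python
import math

def z_test_difference(est_a, se_a, est_b, se_b):
    """Two-sided z test for the difference between two independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# CPS-based estimate (hypothetical SE 0.015) versus SIPP-based estimate (hypothetical SE 0.035).
z, p = z_test_difference(0.52, 0.015, 0.42, 0.035)

print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

Because the smaller SIPP sample contributes most of the variance of the difference, larger CPS samples increase precision for the imputed estimate but do not by themselves make a biased estimate harder to detect.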
Health insurance coverage is allowed to be jointly observed in the final two columns of Table 8. Estimates in the third set of columns include state/region as a variable in the imputation model, and estimates in the fourth set of columns are derived from a model estimated separately within the 14 states/regions. In these two columns, estimates of insurance coverage among the unauthorized are much more in line with the SIPP-based estimates compared with the scenario in which coverage is never jointly observed with unauthorized status. Just 1 of the 14 states/regions has estimates that are significantly different from the SIPP-based estimates both when state/region is included as a variable and when separate models are estimated within states/regions; and in both instances, the magnitude of the difference is smaller compared with the scenario in which coverage is never jointly observed.
Research on immigration, immigrant incorporation, and immigration policy has been stymied by inadequate data on legal status. To compensate for the lack of individual-level indicators of legal status, researchers have tried to impute legal status in order to examine the socioeconomic and demographic characteristics of unauthorized immigrants. Although imputation-based results appear to be widely accepted (especially in policy settings), the degree to which such methods produce biased estimates has not been empirically tested. In this article, we used Monte Carlo simulations to evaluate a variety of imputation approaches under a range of conditions.
Our simulations revealed significant limitations in all the methods we tested. The logical and demographic accounting approaches produced biased estimates even under the most favorable scenarios regarding missing data. They were especially poor at detecting surprising results, such as unusually low or high insurance coverage among unauthorized relative to legal immigrants. The statistical imputation methods (single-imputation, CSMI, and logical-CSMI) produced unbiased estimates when unauthorized status was jointly observed with the dependent variable. However, when this condition was not met, these methods tended to overestimate coverage among the unauthorized.
These biases arose when the imputation method failed to incorporate information about the association of unauthorized status with the dependent variable. To understand how problematic this is, consider the surprising but well-established finding that Hispanics have lower mortality than non-Hispanic whites despite their lower socioeconomic status (SES) (Markides and Eschbach 2005). If we knew only that low SES is associated with higher mortality and that Hispanics have lower SES, we would erroneously conclude that Hispanics have higher mortality. Direct observation of the association between Hispanic origin and mortality (i.e., joint observation) is necessary to detect the truth. The critical nature of the joint observation requirement means that legal status imputation is not appropriate for analyses of outcomes that are not measured alongside legal status in the donor data set.
However, our simulations also showed that external data about legal status (i.e., prior information) could be used to reduce bias, even when joint observation does not occur. As the percentage of those classified as “probably legal” increased, bias in the estimates produced by the logical-imputation, demographic accounting, and logical-CSMI imputation methods decreased substantially. Similarly, CSMI estimates were improved by incorporating information about the association between unauthorized status and the dependent variable from an auxiliary data set, even when it was drawn from a different universe (California) than the other two data sets.
What do the results imply about prior research? Unfortunately, most legal status imputation methods violate the joint observation assumption and are therefore likely to produce estimates that are biased to an unknown degree. To our knowledge, no existing applications of the logical and single-imputation methods use information about the dependent variable to inform the imputation. Although the demographic accounting approach used by Pew/Passel is harder to evaluate, it also appears to violate the joint observation assumption because it imputes legal status on the basis of only a few predictors in the LPS survey (occupational distribution by age, sex, and state of residence), yet examines imputed legal status across a wide variety of outcomes.
Beyond this, prior approaches have clearly violated the same-universe requirement. For example, Marcelli and Heer (1997, 1998) used a local Los Angeles–based sample to impute unauthorized status for respondents in a sample that is representative of a larger geographic area; and Pew/Passel used an older sample (the LPS) to impute unauthorized status decades later. Similarly, Caponi and Plesca (2014) used data from the New Immigrant Survey (NIS) (Jasso et al. 2000), a nationally representative sample of legal permanent residents (LPRs) admitted to the United States in the early 2000s, to impute the legal status of all immigrants in the ACS. Although our tests involving the usage of an auxiliary data set suggest that the same-universe condition is less critical than the joint observation condition, we caution that more research is necessary to assess the conditions under which the same-universe assumption can be relaxed.
What do the results mean for future research on legal status? All the imputation methods we tested were limited in one way or another, suggesting that rather than continuing to try to spin gold from straw, the research community should increase efforts to improve data on immigrants. The most inexpensive and timely way to accomplish this would be to permit administrative record linkages to be used to logically impute legal status. Such data linkages already exist, but as we noted earlier, U.S. federal statistical agencies have been reluctant to permit researchers to use these linkages to proxy legal status. A more expensive, but perhaps more ethically acceptable, route would be to add questions about legal status to surveys. Recent evaluations of the quality of information gathered from survey questions on legal status are promising and suggest that the addition of such items to questionnaires is unlikely to compromise survey response rates (Bachmeier et al. 2014).
Absent better data, the CSMI method has promise, as long as the joint observation condition is met. As demonstrated in our CPS example, this method has the potential to increase statistical power, thus enabling analyses for detailed subgroups and geographies. Although the joint observation condition appears to constrain the applicability of the CSMI method, the SIPP nevertheless includes a very rich set of outcomes in many topical modules. To the extent that the SIPP can be pooled with larger data sets with common variables, there are numerous opportunities to expand research on the unauthorized population. Of course, the SIPP is not perfect. It appears to undercount unauthorized immigrants, and handling its data on legal status can be challenging. When using the SIPP in research on legal status, we recommend following the methods outlined in Bachmeier et al. (2014). Additionally, when implementing the CSMI method, it is important that donor and target data sample from the same universe, although as noted earlier, our tests suggest that this restriction could be less important than the joint observation requirement. We refer readers to Rendall et al. (2013) for guidance on how to test for violations of this assumption. Finally, it is important that analysts employ best practices for multiple imputation, such as specifying the correct functional form, including all analytically important variables, and appropriately accounting for clustered observations in the imputation model (Allison 2002; Rubin 1987).
Research on immigration to the United States remains plagued by the lack of data available to researchers, precisely at a time when public policy discussions are most in need of input by social scientists. Even more problematic is the fact that the limited existing knowledge that we have about the characteristics of the unauthorized population has been derived from imputation approaches that the simulation exercises reported here have shown to yield biased estimates. Nevertheless, the simulation exercises have also demonstrated that social scientists have at their disposal reliable data and statistical methods for imputing legal status in large-scale surveys lacking such measures, insofar as important conditions are met. Although these conditions might appear to limit the utility of the CSMI method, it can nevertheless be employed to substantially expand the body of literature on the incorporation of the unauthorized population beyond the limited number of studies that currently comprise it.
This research was supported by grants from the National Institutes of Health (RC2 HD064497, P01 HD062498, K01MH087219, and 2R24HD041025). We thank Michelle Frisco, Molly Martin, Nancy Landale, Claire Altman, Susana Sanchez, and the anonymous reviewers for helpful comments. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Indicators of legality include U.S. citizenship, migration from countries and periods that correspond with known patterns of refugee flows, being newly arrived with characteristics that would qualify for certain visa categories, working in occupations or industries that require legal status, receipt of public assistance or social services, and having moved to the United States before 1982 (thus qualifying for IRCA legalization).
Regressors include insurance coverage, all the controls described earlier, and several additional variables: marital status, spouse’s citizenship, occupational status, English proficiency, parental status, household size, homeownership, employment status, occupation, state of residence, and selected squared and interaction terms.
We tested variations where we altered which half of the non–probably legal were coded as unauthorized: those most likely to be unauthorized, the most disadvantaged, and a random half. None performed better than the demographic accounting method described here.
We also tried (1) coding unauthorized status to 0 for those who are “probably legal” before multiply imputing, (2) including “probably legal” and the predicted probability separately in the imputation model, and (3) coding the “probably legal” as legal and those with very high probabilities of being unauthorized (>.8) as unauthorized prior to multiple imputation. None of these variations outperformed the logical-CSMI method.