Abstract

In this study, we provide an assessment of data accuracy from the 2020 Census. We compare block-level population totals from a sample of 173 census blocks in California across three sources: (1) the 2020 Census, which has been infused with error to protect respondent confidentiality; (2) the California Neighborhoods Count, the first independent enumeration survey of census blocks; and (3) projections based on the 2010 Census and subsequent American Community Surveys. We find that, on average, total population counts provided by the U.S. Census Bureau at the block level for the 2020 Census are not biased in any consistent direction. However, subpopulation totals defined by age, race, and ethnicity are highly variable. Additionally, we find that inconsistencies across the three sources are amplified in large blocks defined in terms of land area or by total housing units, blocks in suburban areas, and blocks that lack broadband access.

Introduction

Censuses, which include a complete and total enumeration of a population at a single time point, are indispensable in the field of demography. In the United States, the first census took place in 1790 and was conducted every 10 years thereafter. Though every attempt to enumerate a diverse population across a large land area like the United States involves challenges that are distinct to the historical period in which it is undertaken, the 2020 Census faced a particularly unusual set of circumstances. Most notably, this was the first census to include an option to participate online, it was preceded by a contentious Supreme Court case regarding the suppression of immigrant participation, and it was conducted during a global health pandemic. Additionally, new measures to protect the privacy of individuals have elevated concerns about the accuracy of the data at lower levels of geography, such as census blocks.

In this study, we evaluate the quality of 2020 Census data across 173 census blocks in California. To do so, we compare population totals from the 2020 Census on these 173 blocks with estimates from two independent sources: (1) a survey administered to all households on those blocks and (2) population projections. The former is the first independently conducted survey designed to replicate census data collection at scale. The latter uses an array of datasets collected across the intercensal period to estimate the population in 2020. With these data, we can assess the comparability of population estimates across our 173 sample blocks. Additionally, we attempt to identify barriers to enumeration that may have contributed to discrepancies across the different sources. Our results provide insight into how well the 2020 Census enumerated the population of California, the viability of using block-level data artificially infused with error, and, consequently, considerations for researchers when analyzing 2020 Census data.

Challenges Facing the 2020 Census

The 2020 Census faced challenges that were both anticipated and unanticipated. Regarding anticipated challenges, the U.S. Census Bureau (herein referred to as the Bureau) changed data collection procedures between 2010 and 2020. Specifically, the 2020 Census was the first in which individuals could respond online. Households were invited to participate online as the first option, with traditional paper surveys delivered by mail, interview attempts by phone, and in-person visits deployed for households that did not respond. During these in-person visits, data collectors used smartphones to conduct the interview and broadband connections to transmit the data. Although online surveys and digital technology have been used for many years by survey researchers, this was the first time they were simultaneously implemented in a census. Funding constraints forced the Bureau to drastically scale back its field testing, which raised the risk of functionality problems, connectivity failures, and cybersecurity threats (Lapowsky 2019).

In addition to anticipated technological challenges, there were unanticipated ones as well, namely, attempts to suppress the participation of immigrants. In the United States, the census counts all individuals living in the country regardless of citizenship status. The Trump administration acted to include a question on the 2020 Census that would ask respondents to report their citizenship status, which research showed would deter the participation of immigrants and contribute to an undercount (Brown et al. 2018). The Supreme Court ultimately blocked the government from including the citizenship question (Howe 2019). Still, many critics expressed concern that the Bureau's credibility among immigrants may have already been damaged (Marimow et al. 2019).

The most serious operational challenge to the 2020 Census was undoubtedly the COVID-19 pandemic, which emerged as a national emergency just as the Bureau began data collection in March 2020. The pandemic delayed fieldwork by approximately three months amid a flurry of short-term moves and adjustments to living arrangements for health and safety reasons (Supan 2021). Such moves could have substantially affected block-level population totals, particularly for blocks that attracted temporary migrants or blocks that experienced considerable outmigration to mitigate exposure.

All these challenges are expected to affect data quality in some way. However, the accuracy of the data is further compromised by new procedures enacted to protect the privacy of survey respondents. A new initiative implemented in 2020, the Bureau's Disclosure Avoidance System (DAS), artificially injects noise into the Bureau's public-use data. Although top-line population counts at the state level are unaffected, lower levels of geography contain errors by design and, in some cases, produce erroneous estimates (e.g., individuals living in census blocks with no housing units). This practice has called into question the data quality at lower levels of geography, such as counties and census blocks, as well as for key subpopulations of interest (see, e.g., Hauer and Santos-Lozada 2021; Mueller and Santos-Lozada 2022; Winkler et al. 2021). Research has found that the quality of post-DAS data is less than satisfactory when it comes to accurately reproducing the distributions of racial and ethnic groups at lower levels of geography (Asquith et al. 2022; Kenny et al. 2021; Mueller and Santos-Lozada 2022; Santos-Lozada et al. 2020). As a result, demographers have expressed reservations about the ability of the 2020 Census to accurately characterize the population at the block level.

As the largest state in terms of population size and the third largest state in terms of land area, California has historically posed challenges for accurate enumeration during the decennial census. In the three previous censuses, the net undercount in California was estimated at 1.8% in 1990 (McGhee et al. 2018), 1.5% in 2000 (Ericksen 2001), and 0.26% in 2010 (McGhee et al. 2018). The overall undercount in California has thus improved in recent censuses. Nevertheless, scholars expressed concern that an undercount in 2020 might be more similar to that of the 1990 Census, given that more than 70% of California's population now belongs to sociodemographic subgroups, such as renters, children, young men, and racial and ethnic minorities, that have traditionally been undercounted in the state (McGhee et al. 2018).

Initial tallies of the 2020 Census show that the total population of California was 39,538,223, reflecting a 6.1% growth rate across the decade (America Counts 2021). This rate was lower than that for the country as a whole, which grew by 7.4% across the decade (America Counts 2021). When considering year-to-year intercensal trends, however, the state logged a net loss of 182,083 people between 2019 and 2020—its first-ever reduction (Christopher 2021). It is unclear whether this loss reflects a real shift in the state's population, an undercount in the 2020 Census, or a combination of the two.

Standard Approaches for Assessing the Accuracy of the Census

Two sources of data are typically used to assess the accuracy of the decennial census: estimates produced by Demographic Analysis (DA) and the Bureau's Post-Enumeration Survey (PES). DA uses administrative records on births and deaths (to ascertain natural increases) and estimates of migration from survey data (to ascertain net migration) to project the expected size of the population.1 This projected value is then compared with the enumerated population total from the census. If the projected value from DA exceeds the enumerated population total from the census, the census is considered to have an undercount. Unlike DA, which relies on existing administrative data, the PES requires additional data collection. The PES is essentially a form of test–retest reliability in which a sample of households who completed their census forms are invited to complete an identical form a few months after the census is taken. Responses from the household's actual census form are then compared with their responses to the PES.
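The DA logic just described amounts to the demographic balancing equation: the expected 2020 population equals the baseline population plus births, minus deaths, plus net migration, which is then compared with the enumerated total. The following sketch illustrates that arithmetic in Python; all figures are hypothetical, not actual Bureau data.

```python
# Illustrative sketch of the Demographic Analysis (DA) comparison
# described above; the input figures below are hypothetical.

def da_projection(base_pop, births, deaths, net_migration):
    """Project the expected population via the demographic balancing equation."""
    return base_pop + births - deaths + net_migration

def net_undercount(projected, enumerated):
    """Positive values indicate a census undercount relative to DA."""
    return projected - enumerated

# Hypothetical population of 1,000,000 with intercensal change.
projected = da_projection(base_pop=1_000_000, births=120_000,
                          deaths=90_000, net_migration=15_000)
print(projected)                                        # → 1045000
print(net_undercount(projected, enumerated=1_038_000))  # → 7000
```

If the projected value exceeds the enumerated total, as in this hypothetical case, the census would be judged to have an undercount.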

Initial findings from the Bureau's DA and PES suggest that despite all the challenges to undertaking an accurate enumeration of the population in 2020, data from the decennial count appear to be satisfactory. At the national level, both DA and the PES indicate that there was no significant undercount or overcount of the total U.S. population in 2020 (Hill et al. 2022; Khubba et al. 2022). The PES identified a slight overcount in California (0.47%), but this overcount was not significantly different from zero (Hill et al. 2022).

Though informative, both DA and the PES have limitations. DA uses data from birth and death certificates to measure natural increases and survey estimates on net migration to determine the population size at the national level. In general, data from birth and death certificates in the United States are relatively complete for population-level analysis. However, because migration totals are based on survey data, they are subject to measurement error. Although the DA is useful in determining the extent of an undercount at the national level, the Bureau does not produce DA at the state level. Moreover, DA is nearly impossible for small-area estimates, such as census blocks, because migration data are limited at more fine-grained levels of geography.

In contrast to DA, the PES is based on data collected from individuals who were willing to participate in the decennial census or a follow-up survey. These respondents tend to be a select group of households who are not representative of the full population. The PES is usually undertaken in the months immediately following the census. As a result of the pandemic, however, most fieldwork for the 2020 PES took place between November 2021 and March 2022. For many individuals, this survey was conducted nearly two years after they completed their original census form. The considerable amount of time that elapsed between the actual census and the PES increased the chances of recall error or the experience of a demographic event. Finally, the PES is based on a sample and cannot be used to evaluate the accuracy of estimates for small areas, such as census blocks.

Research Questions

In this study, we improve upon the PES and DA and contribute to the growing evidence base on the quality of the 2020 Census by addressing the following research questions:

  1. To what extent do population totals from the 2020 Census diverge from totals produced by an independent enumeration and from totals produced by demographic projections?

  2. What characteristics of census blocks are associated with divergent population totals when comparing the 2020 Census with an independent enumeration and with demographic projections?

To answer these questions, we compare official 2020 Census population totals from a sample of census blocks in California with population totals from an independent enumeration we conducted and with population totals based on advanced demographic projection methods. We conducted our independent enumeration at nearly the same time as the 2020 Census; thus, unlike the PES, it is not subject to recall error. Moreover, our independent enumeration invested heavily in on-the-ground presence in our sample blocks and attempted to enumerate every household, making block-level comparisons feasible. Our demographic projections improve upon the standard DA technique by more explicitly accounting for population trends across the intercensal period measured across an array of data sources, including changes to the number of housing units located on each census block. With our complete, independent enumeration at the block level and advanced projection methods, we are better positioned to assess the quality of the 2020 Census at the block level than either DA or the PES.

In answering the first question, we assess the accuracy of estimates from the 2020 Census at the block level and for key demographic groups across our sample blocks. In answering the second question, we identify features of census blocks that may have contributed to discrepancies in population totals. At its core, the census is an extensive survey operation that requires a direct physical accounting of all housing units. Blocks vary considerably in their size, safety, ease of entry and navigation, capacity for new construction, and broadband access. These differences create various challenges for field staff tasked with identifying the presence of housing units requiring enumeration and performing in-person follow-up visits to nonresponders. Similar challenges exist for postal carriers who deliver and return census forms. When census data are analyzed at the national or state level, these challenges tend to “average out in the wash.” Still, they can introduce substantial error in estimations of population totals at lower levels of geography. In identifying which features of census blocks are most strongly correlated with discrepant population totals, our analysis provides context for researchers and policymakers evaluating small-area estimates from the 2020 Census and informs efforts to enumerate hard-to-count neighborhoods in future censuses.

Methods and Materials

Sample

To answer our research questions, we analyzed data collected from a sample of 173 census blocks in California. To ensure geographic diversity, we sought a sample that would capture a high level of environmental variation in terms of land cover and climate. To ensure demographic diversity, we sought a sample that would facilitate the construction of representative populations based on age, race, and ethnicity. To meet these dual goals, we designed our sample with two complementary subsamples: a geographic and a demographic subsample. The common starting point for both subsamples is a division of the state into seven regions: the North Coast, the Northern Interior, the Eastern Sierra, the Central Coast, the San Joaquin Valley, Southern California, and the Inland Empire/South Desert.

To draw the geographic subsample, we created strata by interacting the seven regions with 15 land cover classifications defined by the European Space Agency (e.g., grassland, shrubland, urban area). This process resulted in 57 unique strata. To draw the demographic subsample, we first identified five major cities alongside the original seven regions to ensure that we would sample blocks from the state's major population centers. These cities included Los Angeles, Sacramento, San Diego, San Jose, and San Francisco. We used city boundaries to distinguish these cities from the rest of the region. Adding these five cities to the original seven regions resulted in 12 mutually exclusive areas spanning the entirety of the state. We stratified these 12 areas using tertiles derived from the California Hard-to-Count Index, a summary index correlated with the difficulty of enumerating different census tracts on the basis of response rates from previous censuses.2 This process resulted in 36 strata.

The 57 strata from the geographic subsample combined with the 36 strata from the demographic subsample yielded a total of 93 potential strata from which to sample census blocks. To maximize the number of census blocks drawn from populous areas while also including blocks with every possible land cover type, we identified and excluded 22 strata from the geographic subsample with land cover already represented in the demographic subsample, yielding a total of 71 eligible strata. We then randomly selected two census blocks within each stratum. The probability of selection was proportional to the block's share of the state's total housing units in the 2019 public block-level summary counts from the Bureau's Master Address File. Because we were most concerned with evaluating the hardest-to-count blocks, we sampled four census blocks (instead of two) in the demographic subsample strata defined by the tertile with the highest scores on the California Hard-to-Count Index.
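The within-stratum selection step described above is a standard probability-proportional-to-size (PPS) draw. The following minimal sketch illustrates the idea with a hypothetical stratum and housing-unit counts; it is not the authors' sampling code.

```python
# A minimal sketch of probability-proportional-to-size selection:
# within a stratum, blocks are drawn with probability proportional to
# their share of housing units. All data below are hypothetical.
import numpy as np

def pps_sample(block_ids, housing_units, n_draws, seed=0):
    """Draw n_draws blocks without replacement, P(selection) ∝ housing units."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(housing_units, dtype=float)
    probs = probs / probs.sum()
    return list(rng.choice(block_ids, size=n_draws, replace=False, p=probs))

# Hypothetical stratum with five candidate blocks.
blocks = ["B1", "B2", "B3", "B4", "B5"]
units = [10, 250, 40, 5, 120]  # housing-unit counts from an address file
print(pps_sample(blocks, units, n_draws=2))
```

Here block B2, with the largest housing stock, is the most likely to enter the sample, mirroring how high-unit blocks were favored in the design.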

Our final analytic sample consists of 173 census blocks: 70 blocks in the geographic subsample and 103 blocks in the demographic subsample. After consideration of previous California population totals, we determined via power analyses that 173 blocks would be sufficient to detect a potential undercount or overcount. For this analysis, we combined the geographic and demographic subsamples and analyzed them together. For each block, we determined population totals through three sources: the 2020 Census, the California Neighborhoods Count independent enumeration, and demographic projections.

Data Sources

2020 Census

The conduct of the 2020 Census is well-documented by the Bureau, with full details of its methodology made publicly available on the Bureau's website. Of particular relevance for our analysis is the Public Law 94-171 Redistricting Data File that the Bureau provides each state's governor to guide the redrawing of districts for the U.S. Congress and state legislatures. This file, delivered to each state approximately one year after the census is taken, shows the population totals by age, race, and ethnicity for all residents on each census block in each state. From this file, we extracted population totals for our 173 sample blocks. These population totals have been subjected to the new DAS procedures and thus contain an undiscernible degree of artificial error.

California Neighborhoods Count Enumeration

The California Neighborhoods Count (CNC) is the first independent enumeration of a population with the purpose of validating official totals from a U.S. decennial census. Sponsored by the California Complete Count Committee–Census 2020 Office and administered by the California Department of Finance, CNC was undertaken by the RAND Corporation and the California Center for Population Research at the University of California–Los Angeles. CNC was designed to emulate the methods used by the Bureau as closely as possible, with two distinct phases of data collection: an address canvass phase and an enumeration phase. During the address canvass phase, undertaken between January and March of 2020, a team of trained interviewers physically went door-to-door around each sample block to verify the street address and the number of separate housing units at each address. This effort was crucial in identifying accessory dwelling units often missed when relying on lists of addresses.

In previous decennial censuses, the Bureau undertook a complete in-person physical address canvass to establish their address frame in advance of the enumeration of the population. However, for cost-saving purposes, the 2020 Census deployed a two-tier approach to their address listing: approximately 35% of addresses were subjected to an in-person physical address canvass, as done in previous decennial censuses, and 65% of addresses were verified “in-office” using geospatial imaging software. A limitation of geospatial imaging software is that it cannot distinguish multifamily units that share the same address, and it is less capable of identifying accessory dwelling units. Such omissions in the address frame are known contributors to an undercount. CNC, however, included a 100% complete in-person physical address canvassing operation with specific protocols for the field staff to inquire directly about multifamily and accessory dwelling units when interviewing residents.

Across the 173 sample blocks, the address canvass identified 23,929 unique housing units, which is 1,261 more than the Bureau identified in their canvassing efforts (Burgette et al. 2022). During the enumeration phase, all addresses identified in the address canvass phase were sent a form that collected a household roster identical to that used on the official census form. Residents had the option of filling out the paper form or completing the form online. Nonresponders were contacted by telephone and in-person interviewers who visited each block. Whereas the Bureau makes up to six attempts at interviewing households during their nonresponse follow-up phase, CNC made more than six attempts, including up to 11 telephone calls and up to eight in-person visits. Data collection for the enumeration phase yielded a 54.0% response rate of the 23,929 sample addresses. The remaining 46.0% of cases were filled using administrative record allocation and imputation.

We imputed population totals for nonresponding households using three administrative data sources that could be linked directly with housing units: (1) real estate tax determinations made by the state, which include information on the housing unit, such as the number of bedrooms and its square footage; (2) California voter registration data, which include demographic characteristics of household residents registered to vote; and (3) eligibility data for Medi-Cal, which include demographic characteristics of household residents participating in the state's health insurance program. With these data sources, along with the data collected on the demographic composition of participating households on the same blocks as nonresponders, we applied chained equations imputation with random forests. This strategy is based on machine learning algorithms that can accommodate nonlinearities, interactions, and outliers. We took a staged approach to imputation. First, we imputed the total number of residents for each nonresponding household. Second, if the imputed number of residents was greater than zero, we imputed race, ethnicity, and age for the “primary respondent” (i.e., the individual who would have responded to the survey if the household had responded). Finally, if the imputed number of residents was greater than one, we imputed the demographic characteristics of all other individuals, conditional on the race, ethnicity, and age of the primary respondent. To improve the quality of our race and ethnicity imputations, we used Bayesian Improved Surname Geocoding (Elliott et al. 2009). This method formalizes the observation that knowing a person's name along with their neighborhood's racial and ethnic composition provides indirect information about household residents' self-identified race and ethnicity.
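The chained-equations approach with random forests can be illustrated, in much simplified form, with scikit-learn's IterativeImputer. The sketch below is not the CNC pipeline, which was a custom staged procedure drawing on linked administrative records; the household records and column choices here are hypothetical.

```python
# A simplified sketch of chained-equations imputation with random
# forests, in the spirit of the staged approach described above.
# The data and columns below are hypothetical illustrations.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Columns: n_residents, bedrooms, sqft (from tax data), registered_voters.
X = np.array([
    [2.0,    2, 900,  1],
    [4.0,    3, 1400, 2],
    [np.nan, 2, 950,  1],   # nonresponding household: residents unknown
    [1.0,    1, 500,  0],
    [np.nan, 3, 1500, 2],
])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed[:, 0])  # resident counts, with imputed values filled in
```

Each missing value is modeled conditionally on the observed columns, so a large unit with two registered voters is imputed more residents than a small unit with none, which is the intuition behind the staged approach in the text.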

We evaluated a wide range of candidate imputation models, including hot deck, k-nearest neighbors, and multiple imputation via chained equations. The conditional models we considered included classification and regression trees and random forests. In combination with the imputation model itself, we considered several design choices that are applicable to many of these methods, including whether imputing missing values in one variable depends on all other observed variables or only a subset; whether all variables are used to impute the total number of residents in the housing unit or only those sufficiently correlated and observed with low rates of missingness; and whether indicators of missingness are included in the imputation model.

These options combined to produce 83 distinct imputation approaches that we evaluated. The various approaches produce substantial differences in the total population estimates. We focused on the configurations that produced population estimates close to the 2010 population totals after applying a statewide inflation factor from 2010 to 2020. Among the top 10 algorithms producing totals that track with the 2010 population totals, we compared the distribution of housing unit counts for observed versus imputed counts. After this process, we selected random forests with a tree depth of one as our preferred estimate. Given the variability in estimates produced by the different imputation approaches, we supplemented our preferred estimate by also showing population tabulations produced using classification and regression trees or “CART” (our lower range estimate) and using random forests (our higher range estimate). More detailed information on these imputation approaches and the data collection procedures used in CNC are available in Burgette et al. (2022).

Demographic Projections

The goal of our demographic projections is to produce the best possible alternative small-area estimate without relying on data collected during 2020. We derived estimates as of April 1, 2020, for California census blocks by race and ethnicity from an array of survey and administrative data sources. The 2010 decennial census serves as a baseline for census block-level vacancy rates and average population per housing unit, as well as an initial race and ethnicity distribution. We tested household size and vacancy rates for statistically significant changes at the block group level from American Community Survey (ACS) data. When significant differences existed at the block group level between the most recent ACS and the last census, we used the most recent data.

In preparation for the 2020 census, the Bureau released Address Count operational data containing housing units per block from the Master Address File. We used these data to estimate total household population per block using average household size and vacancy rates generated in the previous step. We added the latest estimates of the population in group quarters to generate total population estimates per block. We adjusted block populations for consistency with the California Department of Finance's independent population estimates for cities and counties, as well as with the race and ethnicity distributions of the population observed in the recent ACS for tracts and public-use microdata areas (PUMAs).
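The block-level arithmetic described above, occupied housing units times average household size plus the group-quarters population, can be sketched as follows; all inputs are hypothetical illustrations rather than actual estimates.

```python
# A stylized version of the block-level projection arithmetic described
# above. All input values are hypothetical.

def project_block_population(housing_units, vacancy_rate,
                             avg_household_size, group_quarters_pop):
    """Estimate total block population from housing stock and occupancy."""
    occupied_units = housing_units * (1 - vacancy_rate)
    household_pop = occupied_units * avg_household_size
    return household_pop + group_quarters_pop

# Hypothetical block: 120 units, 8% vacant, 2.5 persons/household, no
# group quarters.
print(round(project_block_population(120, 0.08, 2.5, 0), 1))  # → 276.0
```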

The 2020 Census apportionment results contain the total California state resident population as enumerated in the 2020 Census: 39,538,223 people. This estimate was used for poststratification weighting to control the California Department of Finance's county-level estimates to sum to the total state population and to control ACS block group, tract, and PUMA totals to match the state enumeration. We controlled the extrapolated block total population counts to adjusted place and county totals, and the population by race or Hispanic/Latino/Spanish origin to the distributions observed in the adjusted ACS.3

Census block geographies change each decade, so it was necessary to translate the 2010 census blocks used in the analysis into 2020 blocks to compare counts. We used the percentage of built-up area in each 2010 block within each intersecting 2020 census block as weights to distribute the 2010 block's population and housing.4 We then calculated marginal totals and rounded numbers to the nearest integers. We performed resampling when necessary to ensure that the totals by 2020 geography were consistent with the totals stored before geographic conversion.
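The area-weighted conversion just described can be sketched as follows: each 2010 block's population is split across the 2020 blocks it intersects, in proportion to the share of built-up area falling in each. Block identifiers and weights below are hypothetical.

```python
# A minimal sketch of the area-weighted crosswalk from 2010 to 2020
# census blocks described above. Identifiers and weights are hypothetical.

def crosswalk_population(pop_2010, builtup_weights):
    """Distribute 2010 block populations onto 2020 blocks.

    pop_2010: {2010_block_id: population}
    builtup_weights: {2010_block_id: {2020_block_id: share of built-up area}}
    """
    pop_2020 = {}
    for old_id, pop in pop_2010.items():
        for new_id, share in builtup_weights[old_id].items():
            pop_2020[new_id] = pop_2020.get(new_id, 0.0) + pop * share
    return {k: round(v) for k, v in pop_2020.items()}

# Hypothetical 2010 block split 70/30 across two 2020 blocks.
pop_2010 = {"060010001001000": 100}
weights = {"060010001001000": {"A2020": 0.7, "B2020": 0.3}}
print(crosswalk_population(pop_2010, weights))  # → {'A2020': 70, 'B2020': 30}
```

The rounding step here stands in for the paper's rounding-plus-resampling procedure, which additionally ensures that post-conversion totals match the pre-conversion totals.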

The results are independent of the census field operations but leverage data from other programs of the Bureau, as well as other federal and state government sources. The small-area estimates are subject to errors in the California Department of Finance estimates or the Master Address File, as well as sampling error in the ACS; they are not necessarily more accurate than the 2020 Census but are subject mostly to different sources of error from the decennial census. The estimates may therefore enable us to evaluate the credibility of the 2020 census enumeration on the basis of its consistency with other data sources.

Empirical Strategy

To determine the extent to which population totals from the 2020 Census diverge from totals produced by our independent enumeration and our demographic projections, we tallied and compared population totals for our full sample and for subpopulations defined by age, race, and ethnicity. Additionally, we calculated net coverage ratios as follows:

(Census Estimate − Independent Estimate) / Census Estimate × 100.

We calculated and compared two sets of ratios: (1) ratios where CNC provides the independent estimate and (2) ratios where our demographic projections provide the independent estimate.
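The ratio defined above reduces to a one-line computation. The sketch below applies it to the full-sample totals reported in the Results (the 2020 Census count of 53,295 and the implied CNC count of 51,812, i.e., 1,483 fewer).

```python
# A direct implementation of the net coverage ratio defined above,
# applied to the full-sample totals reported in the Results section.

def net_coverage_ratio(census_total, independent_total):
    """Percent difference of the census from an independent estimate.

    Positive values indicate the census exceeds the independent estimate
    (a potential overcount); negative values indicate a potential undercount.
    """
    return (census_total - independent_total) / census_total * 100

# 2020 Census total vs. the implied CNC total for the 173 sample blocks.
print(round(net_coverage_ratio(53_295, 51_812), 1))  # → 2.8
```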

To determine structural characteristics of census blocks associated with divergent population totals when comparing the 2020 Census with CNC and demographic projections, we estimated an ordinary least-squares (OLS) regression model predicting variation in population totals across the 2020 Census, CNC, and our demographic projections. We measured this variation as the block-level standard deviation of the three estimates of the population total. Lower values of this outcome measure indicate greater consistency across the three sources, while higher values indicate greater discrepancy across the three sources. This variation is estimated as a function of six exogenous block-level factors that may pose challenges to an accurate enumeration: the size of the census block, the number of housing units on the census block, broadband access on the block, the urbanicity of the block, the presence of hard-to-count structures on the block, and the overall difficulty of accessing the block.

Size of the census block is a continuous measure taken from the California Public Utilities Commission and is expressed in square miles. Number of housing units is a continuous variable taken from the Bureau's official 2020 block-level totals. Broadband access is a binary variable taken from the California Public Utilities Commission indicating whether the block is wired for broadband. We represent urbanicity by a set of binary variables indicating whether the block is in an urban, suburban, or rural area, as classified by the Bureau. The presence of hard-to-count structures is a binary variable indicating whether the block had any gated communities, group quarters, or high-rise apartment buildings, as observed directly by CNC data collection staff. Lastly, difficulty of accessing the block is a continuous variable taken from direct observations of CNC data collection staff who rated each block on a scale of 1 to 5, with 1 indicating that the block was easy to access and 5 indicating that the block was difficult to access. We report descriptive statistics for these measures in Table 1.
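The outcome construction and OLS specification described above can be sketched with synthetic data standing in for the actual block file; only three of the six predictors are illustrated, and none of the values below are real.

```python
# A sketch of the outcome and design matrix described above, using
# synthetic stand-ins for the actual block-level data.
import numpy as np

rng = np.random.default_rng(0)
n = 173  # number of sample blocks

# Three population estimates per block (synthetic).
census = rng.poisson(300, n).astype(float)
cnc = census + rng.normal(0, 20, n)
projection = census + rng.normal(0, 20, n)

# Outcome: block-level standard deviation across the three sources.
y = np.column_stack([census, cnc, projection]).std(axis=1, ddof=1)

# Predictors: block size, housing units, broadband access (synthetic
# stand-ins for three of the six exogenous factors).
X = np.column_stack([
    rng.exponential(1.0, n),
    rng.poisson(120, n).astype(float),
    rng.integers(0, 2, n).astype(float),
])
A = np.column_stack([np.ones(n), X])           # add intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
print(beta.shape)  # → (4,): intercept plus three slopes
```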

Instead of using traditional t tests to assess the statistical significance of the parameter estimates in this regression model, we applied Monte Carlo permutation tests. Researchers often use these permutation tests for parametric inference from small, nonprobability samples such as ours (Good 2013). Following conventional standards, we based our analyses on 10,000 permutations (Good 2013).
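A Monte Carlo permutation test for a single regression coefficient works by refitting the model under random permutations of the outcome and comparing the observed slope with the resulting null distribution. The sketch below illustrates the procedure on synthetic data with a simple bivariate regression; it uses 2,000 permutations for speed, whereas the analysis in the text uses 10,000.

```python
# A sketch of a Monte Carlo permutation test for a regression slope.
# Data are synthetic; fewer permutations are used here than in the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 173
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)  # true slope of 0.5

def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

observed = ols_slope(x, y)
n_perm = 2_000
null = np.array([ols_slope(x, rng.permutation(y)) for _ in range(n_perm)])
p_value = (np.abs(null) >= abs(observed)).mean()  # two-sided p-value
print(p_value < 0.05)  # a slope of 0.5 is easily detected at n = 173
```

Because the permutation distribution is built from the data themselves, no distributional assumptions about the errors are required, which is what makes the approach attractive for a small, nonprobability sample.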

Results

Our first analytical task is to determine the extent to which population totals from the 2020 Census diverge from totals produced by an independent enumeration and from totals produced by demographic projections. To do so, we first show comparisons at the block level. In Figure 1, we plot 2020 Census population totals against our CNC population totals. In Figure 2, we plot 2020 Census population totals against our projected population totals. We fit a regression line to these bivariate distributions and report Pearson correlation coefficients.

Both figures show considerable alignment in the estimates, as evidenced by a nearly 45-degree regression line and strong correlations (r = .95 in both figures). However, the agreement is greater in census blocks with smaller populations. When the population is less than 500 residents, the points cluster tightly around the fitted line; when the population exceeds 500 residents, the estimated totals are more variable. This finding provides suggestive evidence that smaller blocks may produce more accurate totals or be less affected by the application of the DAS procedures than larger blocks, an issue we explore further in our multivariate analysis.

It is worth pointing out that both figures contain a noticeable outlier in which the 2020 Census total far exceeds the independently estimated total. The largest outlier when comparing the 2020 Census with the CNC in Figure 1 is for a block in Palo Alto, where the 2020 Census counted 1,246 individuals but CNC counted only 263. On this block, the Bureau address canvass identified 552 housing units, whereas our CNC address canvass identified only 124. Given this large discrepancy, the CNC field staff recanvassed this block to validate the total. Although we cannot unequivocally ascertain the reason for this discrepancy, we speculate that a block boundary change was not accounted for in either the CNC or the 2020 Census address canvass. The largest outlier when comparing the 2020 Census with the projected totals in Figure 2 is for a block in the Bay Area, where the 2020 Census counted 1,314 individuals but we projected only 69. On this block, the Bureau address canvass identified 569 housing units, whereas we projected only 25. When we remove these two outliers, the correlation in Figure 1 improves to .97, and the correlation in Figure 2 improves to .98.

Next, in Table 2, we tally and compare population totals for our full sample and for subpopulations defined by age, race, and ethnicity. The 2020 Census counted 53,295 individuals living in our 173 sample blocks, which is 1,483 more than counted by CNC (translating to a 2.8% potential overcount) and 32 fewer than we projected (translating to a 0.1% potential undercount).5 Though we focus on our preferred CNC estimate, it is important to note that our various imputation strategies yield considerable variation. Our low CNC estimate was 6,298 fewer individuals than the 2020 Census, and our high CNC estimate was 5,210 more individuals than the 2020 Census. In sum, the different sources yield nonnegligible variation, but the official 2020 Census enumeration is near the center of these ranges. Taken together, alongside the plots shown in Figures 1 and 2, our analysis suggests that block-level population totals from the 2020 Census are not biased in any notable direction.

The situation changes when moving from the total population to key demographic subpopulations. For ease of interpretation, we shift our focus from raw totals to the net coverage ratios presented in Table 3. Positive ratios indicate the percentage by which each subpopulation is potentially overcounted, and negative ratios indicate the percentage by which each subpopulation is potentially undercounted. Whereas the ratios are close to zero for the total population, the ratios are larger for subpopulations defined by age, race, and ethnicity. In some cases, they are considerably larger.
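Concretely, the ratios reported here behave like the percentage difference between the census count and the independent estimate, taken relative to the census count. The following one-line sketch is our reconstruction of that convention, not the authors' published code, and the function name is ours:

```python
def net_coverage_ratio(census_count, independent_count):
    """Percentage by which the census count exceeds (positive, a potential
    overcount) or falls short of (negative, a potential undercount) an
    independent estimate, relative to the census count.

    Assumption: this formula is our reading of the Table 3 convention.
    """
    return 100.0 * (census_count - independent_count) / census_count
```

For example, the 2020 Census total of 53,295 against the CNC total of 51,812 yields roughly +2.8, matching the 2.8% potential overcount reported above, and a denominator of the (small) census count is what allows undercount ratios to fall far below −100%.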

Regarding age, demographers are often concerned with the undercount of the young, but our independent sources point in different directions for our sample blocks in California. Our preferred CNC estimate shows an 11.4% potential overcount of those younger than 18, whereas our projected totals show a 27.5% potential undercount. As with our top-line population totals, the CNC low and high estimates span a considerable range, with our low estimate indicating an 18.6% potential overcount of the youngest ages and our high estimate indicating a 17.6% potential undercount. Regarding race and ethnicity, the ratios comparing the 2020 Census with CNC suggest that the 2020 Census potentially undercounted Native Hawaiians and other Pacific Islanders (−731.4%) and potentially overcounted those reporting two or more races (43.5%) and Asians (43.1%). The ratios comparing the 2020 Census with our projections suggest that the 2020 Census potentially overcounted those reporting two or more races (58.2%) or some other race (22.2%) while potentially undercounting Whites (−30.5%). Given that these ratios are large and produce inconsistent patterns across the different data sources, we have less faith in the validity of the 2020 Census subpopulation totals by age, race, and ethnicity at the block level.

Our second analytical task is to identify characteristics of census blocks associated with divergent population totals. To do this, we estimate an OLS regression predicting the average deviation in population totals across the 2020 Census, CNC, and our projections as a function of block-level characteristics that may have posed challenges to an accurate enumeration. We present parameter estimates from this model in Table 4.
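The outcome and estimator can be sketched as follows. The exact construction of the "average deviation" measure is not reproduced here; the mean absolute pairwise difference among the three block-level totals shown below is one plausible operationalization (an assumption, as are the function names), paired with a plain least-squares fit:

```python
import numpy as np
from itertools import combinations

def average_deviation(census, cnc, projection):
    """Mean absolute pairwise difference among the three block-level totals.

    Assumption: one plausible reading of the paper's 'average deviation'
    outcome; the authors' exact formula may differ.
    """
    diffs = [abs(a - b) for a, b in combinations((census, cnc, projection), 2)]
    return sum(diffs) / len(diffs)

def ols(X, y):
    """OLS coefficients (intercept first) via least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta
```

A block on which all three sources agree scores zero on this outcome; discrepant blocks score higher, so positive coefficients flag characteristics associated with less consistent totals.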

In the model, two variables yield coefficients that meet conventional levels of statistical significance: the size of the block in square miles (p < .05) and the number of housing units on the block (p < .01). The corresponding coefficients are positive, indicating that larger blocks—in terms of both land area and the number of housing units—show more variation in their population estimates across our different sources than smaller blocks. This result complements the findings in Figures 1 and 2, which show less consistent population totals for larger overall populations. Lastly, because the possibility of making a Type II error is elevated with small sample sizes such as ours, we highlight two coefficients that approach conventional levels of statistical significance (p < .10): blocks without broadband access produce more heterogeneous population totals than blocks with broadband access, and suburban blocks produce more heterogeneous population totals than rural blocks. We return to the implications of these four coefficients below.

Discussion

In this study, we assess the accuracy of block-level population totals in the 2020 Census using data from CNC, the first independent enumeration survey of census blocks, as well as from demographic projections. We also identify characteristics of census blocks that may lead to inaccurate population totals. By doing so, we can provide some background and guidance to social scientists who use block-level data from the 2020 Census for their analyses.

Our study has three key findings. First, we find that in our sample of 173 census blocks in California, population counts provided by the Bureau at the block level for the 2020 Census are not biased in any consistent direction. Our preferred CNC estimate indicates that the 2020 Census produced a 2.8% potential overcount while our projections indicate that the 2020 Census produced a 0.1% potential undercount. To put these findings in context, recall that the Bureau's PES detected a 0.5% potential overcount in California. Further, the correlation between the 2020 Census and CNC totals was .95, and the correlation between the 2020 Census and our projected totals was also .95. Given these strong correlations and given that the official 2020 Census total falls within the range of our CNC and projected totals, we conclude that block-level population total estimates from the 2020 Census are satisfactory.

Second, we find that block-level totals for subpopulations defined by age, race, and ethnicity are highly variable, with the largest discrepancies observed for Asians, Native Hawaiians and other Pacific Islanders, and those reporting multiple races. One plausible explanation for the discrepancies between the 2020 Census and our projections is that the latter are functionally derived from the 2010 Census and ACS data collected during the 2010–2020 intercensal period. Across the decade, the share of babies who are ethnoracially mixed increased (Alba 2020), and alongside the rise of genetic testing services, there is evidence that Americans are increasingly likely to report being of mixed race and ethnicity (Johfre et al. 2021). Relying on data collected earlier in the intercensal period may fail to capture these changes, instead allocating mixed-race individuals to single-race categories. Regardless of the reason for these differences, our findings cast doubt on the ability of 2020 Census data to accurately characterize the racial-ethnic composition of census blocks. These findings corroborate other research that has found inaccurate racial-ethnic statistics produced from the 2020 Census at lower levels of geography (Asquith et al. 2022; Kenny et al. 2021; Mueller and Santos-Lozada 2022; Santos-Lozada et al. 2020).

Third, we find that three structural features of census blocks are associated with discrepant population totals across our different sources: block size (measured in terms of land area or total housing units), urbanicity, and broadband access. Blocks that are large in land area and in housing stock, blocks in suburban areas, and blocks with no broadband access appear to have the least accurate population totals. These findings have implications for researchers using block-level data from the 2020 Census and for those at the Bureau planning future censuses. Researchers using block-level population totals as key variables in their analysis, either as a predictor or as an outcome, may want to consider including controls for block size, urbanicity, and broadband access in their models. Controlling for these factors will not eliminate error caused by faulty enumeration or by the application of the DAS, but it will help provide clearer estimates. In planning for future censuses, the Bureau should prioritize large blocks, blocks in suburban areas, and blocks lacking broadband in its address canvass, outreach activities, and nonresponse follow-up procedures. Additionally, developers of the DAS may want to consider these block-level factors when infusing noise so that data for such blocks are not distorted more than necessary to ensure respondent confidentiality.

There are two notable limitations to our analysis. First, in previous censuses, concerns about data quality centered entirely on the completeness of the data collection operation. However, with the introduction of the DAS, we cannot ascertain whether the discrepancies observed in our study are due to data collection failures or the artificial noise infused into the data. Therefore, our findings can speak only to the general accuracy of block-level totals provided by the Bureau from the 2020 enumeration. Second, CNC was by and large a successful effort by contemporary survey research standards but still lagged the 2020 Census in terms of participation. In the 2020 Census, data were directly collected from 76.1% of households, with administrative record allocation and imputation needed to estimate the population for the remaining 23.9%. In CNC, data were directly collected from 54.0% of households, with administrative record allocation and imputation needed to estimate the population for the remaining 46.0%. However, the findings we report from CNC, with its high administrative record allocation and imputation rate, are instructive for future censuses. This is especially true as initial planning for the 2030 Census indicates a move toward increasing reliance on administrative data instead of direct surveys administered to households (MITRE Corporation 2016).

In closing, if we liken demography to photography, censuses can be considered point-in-time snapshots of the population. If two photographers were to take a picture of the same scene at the exact same time, or if the same photographer shot the same scene twice but minutes apart, the resulting snapshots would be similar but not identical. The situation becomes even more complicated when the scene in question is a population: a fluid construct that changes size and shape by the second. In the case of the 2020 Census, the DAS can be thought of as applying a post hoc filter to a photograph: it maintains the overall composition of the scene but artificially obscures and amplifies different details. Our study, with its first-of-its-kind independent enumeration survey, provides a rare opportunity to compare multiple snapshots of the same population taken with different “cameras” and “filters.” Although we do not expect the results to be identical, they provide different angles from which to construct an overall evaluation of the block-level enumeration. In evaluating these various population snapshots, we highlight relative consistency at the population level alongside substantial inconsistencies at the subpopulation level. Further, we observe systematic patterning in these inconsistencies across different structural dimensions of census blocks. From our assessment of these multiple snapshots, we urge researchers to proceed with extreme caution when using block-level subpopulation totals from the 2020 Census.

Acknowledgments

This study was funded by the California Complete Count Committee–Census 2020 Office and administered by the California Department of Finance. All analyses and interpretations are those of the authors alone and do not express the opinions of the state of California.

Notes

1. Detailed information on the Bureau’s DA methodology is available in Jensen et al. (2020).

2. The California Hard-to-Count Index was developed to identify parts of the state that would require community outreach to encourage participation in the census. More information can be found at https://census.ca.gov/california-htc/.

3. The distribution of Hispanic/Latino/Spanish-origin persons by race was made using proportions from the 2010 decennial census, owing to deficiencies in the ACS. We relied on these proportions because they are the best estimates of ethnicity available at the block level.

4. This geographic allocation dataset, “2020 Census Block Crosswalk Data,” was provided by Amos (2021).

5. We use the expression “potential” undercount/overcount because we cannot definitively benchmark population totals as tallied by the 2020 Census against the true population total. All methods used to determine undercounts and overcounts—including DA and the PES—must use an independent estimate of the true population total. Because this true total is simply an estimate, we cannot state with certainty whether the observed undercount or overcount is real.

References

Alba, R. (2020). The great demographic illusion: Majority, minority, and the expanding American mainstream. Princeton, NJ: Princeton University Press.

America Counts. (2021, August 25). California remained most populous state but growth slowed last decade. U.S. Census Bureau. Retrieved from https://www.census.gov/library/stories/state-by-state/california-population-change-between-census-decade.html

Amos, B. (2021). 2020 Census Block Crosswalk Data, V2 [Dataset]. Harvard Dataverse. https://doi.org/10.7910/DVN/T9VMJO

Asquith, B. J., Hershbein, B. J., Kugler, T. A., Reed, S. M., Ruggles, S., Schroeder, J., . . . Van Riper, D. C. (2022). Assessing the impact of differential privacy on measures of population and racial residential segregation. Harvard Data Science Review, 2022(Special issue 2). https://doi.org/10.1162/99608f92.5cd8024e

Brown, J. D., Heggeness, M. L., Dorinski, S. M., Warren, L., & Yi, M. (2018). Understanding the quality of the alternative citizenship data sources of the 2020 census (Working Paper No. CES-18-38). Washington, DC: U.S. Census Bureau, Center for Economic Studies. Retrieved from https://www2.census.gov/ces/wp/2018/CES-WP-18-38.pdf

Burgette, L. F., Weidmer, B., Bozick, R., Kofner, A., Tzen, M., Brand, J. E., . . . Shih, R. A. (2022). California neighborhoods count: Validation of the U.S. Census population counts and housing characteristic estimates for California (Social and Economic Well-being Report, No. RR-A2028-1). Santa Monica, CA: RAND Corporation. Retrieved from https://www.rand.org/content/dam/rand/pubs/research_reports/RRA2000/RRA2028-1/RAND_RRA2028-1.pdf

Christopher, B. (2021, May 7). California's population shrank in 2020, but don't call it an exodus. Cal Matters. Retrieved from https://calmatters.org/politics/2021/05/california-population-shrink-exodus/

Elliott, M. N., Morrison, P. A., Fremont, A., McCaffrey, D. F., Pantoja, P., & Lurie, N. (2009). Using the Census Bureau's surname list to improve estimates of race/ethnicity and associated disparities. Health Services and Outcomes Research Methodology, 9, 69–83.

Ericksen, E. (2001). An evaluation of the 2000 census. In U.S. Census Monitoring Board (Ed.), Final report to Congress (pp. 15–42). Suitland, MD: U.S. Census Bureau Monitoring Board. Retrieved from https://govinfo.library.unt.edu/cmb/cmbp/reports/final_report/fin_sec3_evaluation.pdf

Good, P. (2013). Permutation tests: A practical guide to resampling methods for testing hypotheses (2nd illustrated ed.). New York, NY: Springer Science+Business Media.

Hauer, M. E., & Santos-Lozada, A. R. (2021). Differential privacy in the 2020 census will distort COVID-19 rates. Socius, 7. https://doi.org/10.1177/2378023121994014

Hill, C., Heim, K., Hong, J., & Phan, N. (2022). Census coverage estimates for people in the United States by state and census operations (U.S. Census Bureau 2020 Post-Enumeration Survey Estimation Report, No. PES20-G-02RV). Washington, DC: U.S. Government Publishing Office. Retrieved from https://www2.census.gov/programs-surveys/decennial/coverage-measurement/pes/census-coverage-estimates-for-people-in-the-united-states-by-state-and-census-operations.pdf

Howe, A. (2019, July 11). Trump administration ends effort to include citizenship question on 2020 census. SCOTUSblog. Retrieved from https://www.scotusblog.com/2019/07/trump-administration-ends-effort-to-include-citizenship-question-on-2020-census/

Jensen, E. B., Knapp, A., King, H., Armstrong, D., Johnson, S. L., Sink, L., & Miller, E. (2020). Methodology for the 2020 demographic analysis estimates (Report). Suitland, MD: U.S. Census Bureau, Population Division. Retrieved from https://www2.census.gov/programs-surveys/popest/technical-documentation/methodology/2020da_methodology.pdf

Johfre, S. S., Saperstein, A., & Hollenbach, J. A. (2021). Measuring race and ancestry in the age of genetic testing. Demography, 58, 785–810. https://doi.org/10.1215/00703370-9142013

Kenny, C. T., Kuriwaki, S., McCartan, C., Rosenman, E. T. R., Simko, T., & Imai, K. (2021). The use of differential privacy for census data and its impact on redistricting: The case of the 2020 U.S. Census. Science Advances, 7, eabk3283. https://doi.org/10.1126/sciadv.abk3283

Khubba, S., Heim, K., & Hong, J. (2022). National census coverage estimates for people in the United States by demographic characteristics (U.S. Census Bureau 2020 Post-Enumeration Survey Estimation Report, No. PES20-G-01). Washington, DC: U.S. Government Publishing Office. Retrieved from https://www2.census.gov/programs-surveys/decennial/coverage-measurement/pes/national-census-coverage-estimates-by-demographic-characteristics.pdf

Lapowsky, I. (2019, February 6). The challenge of America's first online census. Wired. Retrieved from https://www.wired.com/story/us-census-2020-goes-digital/

Marimow, A. E., Zapotosky, M., & Bahrampour, T. (2019, July 2). 2020 census will not include citizenship question, Justice Department confirms. The Washington Post. Retrieved from https://www.washingtonpost.com/local/social-issues/2020-census-will-not-include-citizenship-question-doj-confirms/2019/07/02/0067be4a-9c44-11e9-9ed4-c9089972ad5a_story.html

McGhee, E., Bohn, S., & Thorman, T. (2018). The 2020 census and political representation in California (Report). San Francisco, CA: Public Policy Institute of California. Retrieved from https://www.ppic.org/wp-content/uploads/the-2020-census-and-political-representation-in-california-october-2018.pdf

MITRE Corporation. (2016). Alternative futures for the conduct of the 2030 census (Report No. JSR-16-Task-009). McLean, VA: MITRE Corporation. Retrieved from https://www2.census.gov/programs-surveys/decennial/2020/program-management/final-analysis-reports/alternative-futures-2030-census.pdf

Mueller, J. T., & Santos-Lozada, A. R. (2022). The 2020 U.S. Census differential privacy method introduces disproportionate discrepancies for rural and non-White populations. Population Research and Policy Review, 41, 1417–1430.

Santos-Lozada, A. R., Howard, J. T., & Verdery, A. M. (2020). How differential privacy will affect our understanding of health disparities in the United States. Proceedings of the National Academy of Sciences, 117, 13405–13412.

Supan, J. (2021, May 27). Pandemic moving study: How remote work spurred moves out of big cities. Allconnect. Retrieved from https://www.allconnect.com/blog/covid-moving-trends

Winkler, R. L., Butler, J. L., Curtis, K. J., & Egan-Robertson, D. (2021). Differential privacy and the accuracy of county-level net migration estimates. Population Research and Policy Review, 41, 417–435.
This is an open access article distributed under the terms of a Creative Commons license (CC BY-NC-ND 4.0).