Self-reported race is generally considered the basis for racial classification in social surveys, including the U.S. census. Drawing on recent advances in human molecular genetics and social science perspectives of socially constructed race, our study takes into account both genetic bio-ancestry and social context in understanding racial classification. This article accomplishes two objectives. First, our research establishes geographic genetic bio-ancestry as a component of racial classification. Second, it shows how social forces trump biology in racial classification and/or how social context interacts with bio-ancestry in shaping racial classification. The findings were replicated in two racially and ethnically diverse data sets: the College Roommate Study (N = 2,065) and the National Longitudinal Study of Adolescent Health (N = 2,281).
For more than 200 years, the measurement of race has been a major component in the United States (U.S.) decennial censuses (Hirschman et al. 2000). Race and ethnicity are standard items in all contemporary population and social surveys. Since the passage of civil rights laws in the 1960s, this information has been used for monitoring racial and ethnic differences in areas such as equal opportunity, affirmative action, the redistricting provisions of the Voting Rights Act, access to health care, exposure to environmental hazards, and medical prevention and treatment strategies. The information is crucial for enforcing policies developed to reduce and eliminate racial and ethnic differences in these areas.
Contemporary surveys and the U.S. censuses since 1960 ask respondents to self-report their race/ethnic category or categories. The U.S. censuses ask household heads to report on other family members’ racial/ethnic category/categories. Farley (1991) interpreted self-report as ethnicity rather than ancestry. Perez and Hirschman (2009) did not consider the census responses on race and ethnicity as measuring ancestry, either, because these responses measure theoretically distinct identities. The consensus is that these measures are without an objective basis beyond self-report (Hirschman et al. 2000:390; Rosenberg et al. 2003:157). As Perlmann and Waters (2002:11) suggested, “the great irony is that the American government gathers data on people’s race through a more or less slippery and subjective procedure of self-identification and then must use these counts as the basis of legal status in an important domain of law and administrative regulation—namely, civil rights.”
The “scientific” racism of the early twentieth century, which held that races were biologically distinct peoples with differential abilities and behaviors, has long been discredited by the scientific community (Gould 1981). However, a socially influenced definition of race need not preclude any logical basis for race/ethnic classifications. Over the past two decades, advances in molecular genetics have yielded a body of evidence showing genetic clustering across geographically separated human populations (Li et al. 2008; Rosenberg et al. 2002). These developments present a prime opportunity to examine the links between bio-ancestry and survey measures of race/ethnicity and to study how bio-ancestry interacts with social factors to shape how individuals respond to survey questions on race/ethnicity.
Our overarching goal is to seek fresh insights into the understanding of racial classification in the contemporary United States by combining a social science perspective with recent advances in human molecular genetics. We aim to (1) establish geographic bio-ancestry as a component of racial classification, and (2) use bio-ancestry measures to examine whether, how much, and how racial self-classification departs from bio-ancestry because of social-contextual influences.
We demonstrate that bio-ancestry (the geographic origin of an individual based on genetic data) and social context interact to influence the classification of race and ethnicity. In other words, the effect of bio-ancestry depends on social, historical, and cultural context. To our knowledge, no social scientist has considered bio-ancestry when studying racial classification, and geneticists do not investigate social context that influences racial classification above and beyond bio-ancestry.
Our contribution is threefold. First, we replicate the match between genetic bio-ancestry and self-reported race across a number of independent data sources (two U.S. and two worldwide sources). We estimate bio-ancestry using saliva DNA in two racially and ethnically diverse data sets from the United States: the College Roommate Study (ROOM, N = 2,065) and the National Longitudinal Study of Adolescent Health (Add Health, N = 2,281).
A general match between genetic bio-ancestry and race has been shown using worldwide populations (Cavalli-Sforza et al. 1994; Li et al. 2008; Rosenberg et al. 2002) and clinical convenience samples in the United States (Fyr et al. 2007; Parra et al. 1998; Reiner et al. 2005; Tang et al. 2005; Yaeger et al. 2008). Others have concluded that the physical characteristics distinguishing East Asians were an adaptive response to living in the Mammoth Steppe environment in Central Asia (Guthrie 1996). However, a number of important differences exist between our work and previous research. Earlier studies focused mostly on the study of human migration spanning the past 50,000 to 100,000 years and population admixture in medical genetic association studies. Integrating bio-ancestry into a study of race and ethnicity requires data sources representative of U.S. ethnic and racial minorities and a social science perspective.
Tang et al. (2005) is a case in point. This study used a large data set of 3,636 U.S. patients with high blood pressure, and showed a 99.86 % match between cluster-analysis assignment and self-classification into white, African American, East Asian, or Hispanic. The study did not consider a social science perspective and did not use a diverse and representative sample. The study treated Hispanics as a race along with blacks and whites; however, Hispanics are considered an ethnicity in the current U.S. census and social surveys. Hispanics can be black, white, and/or Asian. The study obtained a “perfect” match, most likely because all Hispanics in the study are from Starr County, Texas. The Hispanic population in the United States, though, is much more heterogeneous than Hispanics from a single county in Texas. Tang and colleagues did not examine multiracial individuals. As mentioned earlier, the individuals in their study were assumed white, African American, East Asian, or Hispanic. Comparatively, our findings using U.S.-based, nationally representative, and racially and ethnically diverse population samples suggest that a substantial proportion of individuals in the United States is multiracial and cannot be readily assigned to a single racial category.
Second, we show in a test of the “one-drop rule” (the century-old U.S. social and legal practice of treating individuals with any amount of African ancestry as black) that the influence of bio-ancestry on racial classification depends on how black and white are historically and socially defined. In the absence of bio-ancestry, the “one drop” cannot be measured, and thus the rule cannot be tested directly and generally.
Third, we examine the fluidity of racial classification, providing evidence that social context influences whether individuals “change” their racial classification above and beyond bio-ancestry. A common finding in previous work is that multiracial individuals are more likely to change their reported race than mono-racial individuals across occasions (Hitlin et al. 2006) and under different social circumstances (Harris and Sim 2002). Adding the control of bio-ancestry enables us to conclude that given the same proportion of African or Caucasian ancestry, social contextual factors—such as the racial composition of youths’ friendship networks and neighborhoods—contribute to the fluidity of racial classification. Without taking bio-ancestry into account, these social influences cannot be isolated from the influences of bio-ancestry.
Why does bio-ancestry match self-classification of race? After all, individuals typically do not have access to their genetic information. An argument can be made that bio-ancestry underlies phenotypic features (e.g., skin tone, hair color, hair texture, and facial features) and family ancestral history (e.g., race of parents, grandparents, and great grandparents), and that genetic bio-ancestry can be more of a summary measure of bio-ancestry than a measure of phenotypic features and family history. Family history and phenotypic features are usually not measured or are crudely measured in social science studies. This reasoning explains why inaccessible bio-ancestry can be highly correlated with self-report of race.
Social Construction of Racial Classification
Race is much more than human phenotypic or biological characteristics. The meanings of race are grounded in historical, cultural, social, and legal processes (Bonilla-Silva 2001; Davis 1991; López 1996; Omi and Winant 1994; Williamson 1980). The role of bio-ancestry in racial classification must be understood in this larger sociohistorical context. In contemporary perspective, race is widely accepted as predominantly a social, rather than a biological, concept.
The One-Drop Rule or the Rule of Hypodescent
The one-drop rule, which originated in the American South, denoted that one drop of African blood or any amount of African ancestry would define an individual as black (Berry and Tischler 1978:97–98; Davis 1991:5; Myrdal et al. 1944:1–2; Williamson 1980:1–2). The rule implied that even a small amount of black ancestry contaminates, thus disqualifying an individual from being classified as white. Historically, the one-drop rule lay at the heart of socially constructed race for African Americans and, together with anti-miscegenation laws, was designed to preserve racial hierarchy. If all progeny of a black-white union were considered black, and thus those black-white (mixed) individuals could only ever bear (by definition) black children, a sharp color or racial line could be maintained. The one-drop rule was practiced widely in the decades following the Civil War. The rule was further entrenched in the first half of the twentieth century with legalized racial segregation under the Jim Crow system in the South and de facto racial segregation and discrimination in other parts of the United States.
Only individuals with African ancestry are subject to the one-drop rule (Davis 1991; Rockquemore and Brunsma 2001). In the United States, those with one-fourth or less American Indian, Mexican, Chinese, or Japanese ancestry are considered assimilating Americans. The one-drop rule does not apply as strictly to these individuals, and their nonwhite racial backgrounds become ethnic legacies. The one-drop rule is uniquely American. Other countries usually conceptualize race and ethnicity differently, resulting in different systems that determine race based not only on physical characteristics but also on social status, class, and other social circumstances (Surratt and Inciardi 1998; Telles 2006).
Traditional racial and ethnic boundaries have been blurred by the enormous gains in civil rights since the mid-century, by interracial marriage, immigration, and social mobility, and by the new options of multiracial categories introduced in the 2000 U.S. census (Hirschman et al. 2000; Perez and Hirschman 2009). Despite these developments, it remains an open question whether and to what extent the one-drop rule is still observed.
Without measures of bio-ancestry, previous empirical studies of the one-drop rule used “multirace” to measure “one drop” (Fairlie 2009; Roth 2005). Roth’s (2005) study examined the race-labeling patterns of black-white married parents for their children ages 15 and younger using the 5 % Integrated Public Use Microdata Series (IPUMS) of the 2000 U.S. census. The study considered only the special case in which the “one drop” is approximately 50 % African ancestry.
In this study, we investigated whether the one-drop rule is still observed by respondents in social surveys in the contemporary United States and the amount of African ancestry “required” for an individual to self-classify or be classified by interviewers as black. We also examined the amount of European ancestry required to self-classify or be classified by interviewers as white. Bio-ancestral measures allow a quantitative empirical test of the one-drop rule. Our analysis examined various proportions of African ancestry, including those with 50 % African ancestry as a special case.
It is important to consider external classification when examining the one-drop rule (Penner and Saperstein 2008). Our analysis included an external interviewer-classification of race/ethnicity. We also examined self-reports because they illuminate the historical consequences of the one-drop rule as both a process of external racial ascription and self-identification. One’s self-report is not independent of social settings. The classic social psychological concept of the “looking-glass self” is often invoked in the discussion of the fluidity of racial identity. Specifically, the concept states that an individual’s self-perception is shaped by others’ perception, and one learns to see oneself as society does (Cooley 1902). Previous work on racial identity has also considered self-reports (Harris and Sim 2002).
The Fluidity of Racial Classification
The fluidity of racial classification refers to the changeability of racial classification across cultures, historical periods, and everyday social contexts. Even the same individual may assume multiple racial classifications under different social circumstances. Racial fluidity is influenced and constrained by historical and contemporary political, legal, and other societal forces that tend to use racial grouping to maintain and perpetuate social stratification (Bonilla-Silva 2001; Gould 1981).
The fluidity and arbitrariness of racial boundaries have been a central theme in the literature on the social construction of race (Brown 1992; Brunsma 2006; Campbell and Troyer 2007; Hahn et al. 1992; Harris and Sim 2002; Herman 2010; Khanna 2004, 2010; Nagel 1994; Penner and Saperstein 2008; Saperstein 2006; Tashiro 2002; Thornton et al. 2000; Waters 1990). A respondent’s self-classification in social surveys may be shaped by the purpose of the survey, the explicit or implicit expectation of the circumstances surrounding the survey, and the characteristics of the interviewer (Harris and Sim 2002; Hill 2002). A number of studies have empirically investigated the fluidity of racial classification in the contemporary United States. For example, Harris and Sim (2002) reported that interview contexts when responding to the race/ethnicity questionnaire were related to whether mixed-race individuals rejected or accepted the one-drop rule. Hitlin et al. (2006) reported that multiracial youths were four times more likely to change their reported race between two interviews about eight years apart.
In this study, we empirically investigated social forces associated with a change in racial classification for youth in the United States between an occasion when they were allowed to mark more than one racial category and an occasion when they were asked to mark only one. The analysis controlled for bio-ancestry.
Race and Genetic Clustering Across Geographically Separated Human Populations
Analyzing data from 17 genetic loci, Lewontin (1972) discovered that 94 % of human genetic variation across individuals occurs within a racial group, while the remaining 6 % occurs among the racial groups of Caucasian, African, Mongoloid, South Asian Aborigines, Amerinds, Oceanians, and Australian Aborigines. He concluded that racial classification was of no genetic or taxonomic significance. Lewontin’s pioneering work on the distribution of genetic variance within a population and between populations was confirmed by work using more recent data and statistical methods (e.g., Rosenberg et al. 2002).
Without contradicting Lewontin’s findings, recent work reported that the main genetic clusters occur among Europeans/West Asians, sub-Saharan Africans, and East Asians/Pacific Islanders/American Indians (Li et al. 2008; Rosenberg et al. 2002). The genetic clustering or the structure of various populations today is largely a result of the history of human migration (Cavalli-Sforza et al. 1994). Starting about 100,000 years ago, humans migrated out of Africa and established themselves in new environments. The migrants possessed only a subset of the alleles of the parent population. The smaller the founder population or migrant group, the larger the genetic disparity from the parent population. Furthermore, the reproductive isolation among populations caused by geographical barriers ensures that any differences arising from genetic drift are maintained. As a result, the genetic differences across geographically separated populations would solidify into structured differences between populations.
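The link between founder-group size and genetic disparity can be made concrete with a textbook sampling argument. The expression below is a simplified illustration, not a model fitted in this study. It assumes that N diploid founders are drawn at random from a parent population in which an allele has frequency p; the symbols N, p, and the founder-group frequency \hat{p} are introduced here only for illustration.

```latex
\hat{p} \sim \frac{1}{2N}\,\mathrm{Binomial}(2N,\, p), \qquad
\mathbb{E}[\hat{p}] = p, \qquad
\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{2N}.
```

Under these assumptions the expected allele frequency is unchanged, but its sampling variance is inversely proportional to N, so a smaller migrant group is expected to deviate more from the parent population.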
Relevant to this body of work is the neutral theory of molecular evolution (Kimura 1968, 1983). The theory states that most mutations at the molecular level are selectively neutral or nearly neutral rather than Darwinian-selective. These selectively neutral mutations do not confer functions that increase or decrease evolutionary fitness. The theory is supported by evidence in molecular genetics, which allows comparative studies of amino acid change rates in evolution across related organisms. Frequently, random genetic mutations did not change the amino acid for which a given codon triplet was coding. The majority of mutant polymorphisms could not be functional polymorphisms; otherwise, the stable change rates in amino acids would be much higher. The recognition of a large number of such neutral polymorphisms led to increased attention to the role of random genetic drift in shaping population structure.
The recent work on human migration and the neutral theory together suggest that a small amount of genetic data, which can be much lower than 6 % of the total genetic differences across individuals, is sufficient to predict the continental origins of a person with reasonable accuracy. These genetic differences, however, are largely due to random drift and unrelated to natural selection.
For the recent work on human migration, skepticism in social science circles exists with regard to the representativeness of the analyzed samples (Duster 2005; Rotimi 2003, 2004) and whether the way ancestral informative markers (AIMs) are selected might have predetermined the results (Duster 2005). Our replication using the same set of AIMs across four independent data sets addresses the sample representativeness and the potential problem of predetermined results.
Europeans, Africans, and East Asians are important categories because they represent a majority of the human population and because they are the root categories of a great number of subpopulations (Li et al. 2008). However, these population categories are neither the only set nor the most important set of genetic classifications. Given a proper set of genetic markers, genetic clustering can be deciphered within Africans and African Americans (Tishkoff et al. 2009), Europeans (Novembre et al. 2008), Pacific Islanders (Friedlaender et al. 2008), and American Indians in both North and South America (Wang et al. 2007). Most importantly, genetically, although every individual is unique, we all belong to the same human species. All individuals are, to various extents, admixed or genetically mixed from previously isolated human populations.
Data, Measures, and Methods
Our project tapped a total of four data sources. The main analysis was performed on two U.S. data sets: ROOM and Add Health. The panel of ancestral informative markers was selected from the HapMap project (2005). The estimated bio-ancestry using the U.S. data was compared with that from the worldwide Human Genome Diversity Project (HGDP).
ROOM, carried out in the spring semester of 2008 at a large public university, was designed to investigate joint peer and genetic effects on health behaviors on a college campus. The study consisted of a survey component and a saliva-based DNA component; 2,664 (79.5 %) students in the targeted sample completed a Web-based survey, and 2,080 (78.7 % of the survey completers) provided a saliva sample.
Add Health is a nationally representative longitudinal study of the health-related behaviors of about 20,000 U.S. adolescents in grades 7–12 in 1994–1995 (Harris et al. 2003). Our Add Health analysis sample consisted of 2,281 individuals with valid genotype data from the Illumina 1,536 array, including a panel of 186 AIMs and valid survey data from Wave I. These 2,281 individuals represent 87 % of 2,612 individuals whose saliva DNA was collected in 2002 at Wave III. We also analyzed self-report of race and ethnicity from Waves II and III. The findings are similar and not presented. Table 1 shows that the DNA sample characteristics are similar to those in the full Add Health sample at Wave I, suggesting that the DNA sample is also representative of the U.S. population.
To cross-check our estimates of bio-ancestry, we reanalyzed the more than 1,000 individuals from 52 worldwide populations in HGDP and compared the estimates of bio-ancestry in HGDP with our estimates from the U.S. data. The HGDP populations are spread over most of the inhabited continents (Cann et al. 2002). The same set of AIMs that was genotyped in HGDP was also genotyped in our U.S. data sets. The HapMap project has yielded genotype data for 90 Caucasian individuals from Utah with ancestry in Northern and Western Europe, 45 Han Chinese from Beijing, 44 Japanese from Tokyo, and 90 Yoruban individuals from Ibadan, Nigeria on >6 million single nucleotide polymorphisms (SNPs) located across the genome.
In ROOM, DNA was extracted according to the manufacturer’s instructions from 2 ml of saliva (containing buccal epithelial and white blood cells) collected from participants in an Oragene DNA collection kit (DNA Genotek; Ottawa, Ontario, Canada). DNA was plated for Illumina genotyping at 30 μl at >50 ng/μl. Our median DNA yield was 27.33 μg, with a minimum of 0 μg (six individuals) and a maximum of 71.32 μg.
For ROOM, we designed an Illumina GoldenGate assay for 384 candidate SNPs, including 186 ancestral informative markers. Hardy-Weinberg equilibrium tests were performed on each SNP within each race and ethnicity. Fewer than 1 % of the SNPs yielded a p value smaller than .001. The genetic analysis was based on the 162 of 186 AIMs that were successfully genotyped.
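For readers unfamiliar with Hardy-Weinberg equilibrium testing, the following is a minimal sketch of one common form of the test, the one-degree-of-freedom chi-square test on genotype counts for a single biallelic SNP. The text does not state which variant of the test was used (an exact test is another common choice), so the chi-square form, the function name `hwe_chi_square`, and its arguments are assumptions made for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium for one SNP.

    n_aa, n_ab, n_bb are observed counts of the three genotypes.
    Returns (chi2, p_value). Illustrative sketch, not the study's code.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0, 1.0  # monomorphic SNP: nothing to test
    # Genotype counts expected under Hardy-Weinberg proportions
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Applied within each racial/ethnic group separately, a SNP would be flagged when the returned p value falls below the chosen threshold (.001 in the text).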
In Add Health, genomic DNA was isolated from buccal cells at the Institute of Behavior Genetics at the University of Colorado, Boulder. The average yield of DNA was 58 ± 1 μg. We designed and genotyped an Illumina GoldenGate assay for 1,536 candidate SNPs, including the same 186 AIMs genotyped in ROOM. In Add Health, 121 of 186 AIMs were successfully genotyped. The literature (briefly described herein) on AIMs suggests that 121 are still likely sufficient for differentiating the continental groups, given our sample sizes.
Race, Ethnicity, and Other Sample Characteristics
ROOM has two sets of self-reported race and ethnicity: one from the housing application form submitted by students when requesting a dorm room to the university housing department before their freshman year, and the second from an online survey. The university housing form allowed students to self-classify as only one of six racial/ethnic groups: white, black, Hispanic, Asian and Pacific Islander, Native Indian, and Other; comparatively, the online questionnaire allowed respondents to mark one or more races.
At Wave I, Add Health’s main race/ethnicity questions predate the format followed in the 2000 U.S. census, allowing identification of more than one racial group. When a respondent selected more than one race during the home interview, the respondent was asked to indicate a single race category that would best describe him or her. Importantly, interviewers were instructed to record the single-best race of the respondent from their observations—not from what the respondent reported. The categories available for interviewers included only single-race categories of white, black, American Indian or Alaska Native, and Asian or Pacific Islander; Hispanic was not an option for interviewers.
The single-race responses in ROOM were recorded from housing application forms submitted to the university’s housing department before the freshman year. In ROOM, the race questionnaire allowing multirace categories was filled out in the spring of 2008. In Add Health, the single-race responses and the multirace responses were recorded in the same survey almost immediately one after the other.
In Add Health, “Southern States” was coded as 1 for individuals who lived in one of the following states at Wave I: Maryland, Virginia, Delaware, Tennessee, Arkansas, Louisiana, Missouri, North Carolina, South Carolina, Mississippi, Alabama, Georgia, Florida, Texas, Oklahoma, West Virginia, and Kentucky. In ROOM, “Southern States” was coded as 1 for those whose permanent address on the housing application form is one of the aforementioned states. The much higher percentage (89 %) of Southern States in ROOM than in Add Health (36 %) is due to the location of the study university (Table 1).
Our estimation of bio-ancestry relies on a panel of AIMs, rather than one or two distinguishing genetic variants, to detect genetic differentiation across human populations. AIMs are sets of genetic polymorphisms whose allele frequencies differ significantly across populations (Frudakis et al. 2003; Parra et al. 1998; Shriver et al. 1997). Our panel of AIMs consists of 186 SNPs and was developed to detect and correct population stratification for genetic association studies (Enoch et al. 2006). The AIMs were selected according to four criteria:
1. Each AIM differed in allele frequency by a range of 0.7–10 times between at least a pair of continental populations of Europeans, sub-Saharan Africans, and East Asians.
2. The absolute value of log (RAF1/RAF2) was >1, where RAF1 and RAF2 are the reference allele frequency in continental populations 1 and 2, respectively.
3. Each AIM was a genetically independent HapMap SNP with a minimum distance from any other AIM of at least 100 kilo-base pair (kb) to ensure that the AIMs were not in linkage disequilibrium.
4. The AIMs were evenly distributed throughout the genome for the three continental populations.
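Two of these criteria, the log-ratio threshold and the 100 kb spacing, lend themselves to a short illustration. The sketch below is not the procedure used to build the panel. It assumes a natural logarithm (the text does not give the base), applies the threshold to any pair of populations, and enforces spacing with a simple greedy pass in genomic order. The record layout and the names `is_informative` and `select_aims` are hypothetical.

```python
import math
from itertools import combinations

MIN_DISTANCE_BP = 100_000  # 100 kb minimum spacing, as stated in the text


def is_informative(freqs, threshold=1.0):
    """True if |log(RAF1/RAF2)| > threshold for at least one population pair.

    freqs maps population label -> reference allele frequency.
    Natural log is an assumption; zero frequencies are skipped to avoid
    dividing by zero or taking the log of zero.
    """
    for a, b in combinations(freqs.values(), 2):
        if a > 0 and b > 0 and abs(math.log(a / b)) > threshold:
            return True
    return False


def select_aims(snps):
    """Greedy selection of informative SNPs at least 100 kb apart.

    snps is a list of dicts with keys 'id', 'chrom', 'pos', 'freqs'
    (a hypothetical layout). SNPs are visited in genomic order and kept
    only if far enough from the last SNP kept on the same chromosome.
    """
    chosen = []
    last_pos = {}  # chromosome -> position of the last SNP kept
    for snp in sorted(snps, key=lambda s: (s["chrom"], s["pos"])):
        if not is_informative(snp["freqs"]):
            continue
        prev = last_pos.get(snp["chrom"])
        if prev is not None and snp["pos"] - prev < MIN_DISTANCE_BP:
            continue
        chosen.append(snp["id"])
        last_pos[snp["chrom"]] = snp["pos"]
    return chosen
```

A greedy pass is only one way to enforce spacing; it favors whichever informative SNP comes first on each chromosome and does not address the even-distribution criterion.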
The AIM selection was based on the observed reference allele frequencies of the European, African, and Chinese/Japanese populations of the HapMap Project (HapMap data release #16c.1, June 2005). The AIMs were specifically designed for detecting continental populations. As such, these AIMs are much less effective in detecting substructures within a continental population of Europeans, Africans, or East Asians.
Factors such as the minimum number of markers and sample size also affect an AIM panel’s accuracy and informativeness. Bamshad et al. (2004) found that African American populations had roughly 4,700 SNPs that were potentially private to the population (and thus potential AIMs), while Europeans had 580 such SNPs. Rosenberg et al. (2002) found that 100–160 SNPs were sufficient when the sample size was roughly 1,000; other studies have generally used 150–200, with samples of at least 400 (Halder et al. 2008; Smith et al. 2001; Yang et al. 2005).
We used the AIM panel to estimate biogeographical ancestry via three statistical procedures: PLINK-based cluster analysis (Purcell et al. 2007), STRUCTURE-based cluster analysis (Pritchard et al. 2000), and principal components analysis implemented in the software EIGENSTRAT (Price et al. 2006). All three procedures estimated ancestral population membership without using information from self-report of race.
Cluster analysis has been used to infer population structures and to assign individuals to clusters or groups according to the degree of similarity of genetic data between individuals. Individuals within each cluster share more genetic variants than those in different clusters. However, the traditional cluster analysis assumes that each individual comes from only one population. Pritchard et al. (2000) proposed a method that allows each individual’s ancestral composition to represent a mixture of multiple unobserved populations. This method has been implemented in the software package STRUCTURE.
The particular PLINK procedure we used sets a fixed cluster size or the fixed number of ancestral populations. It assigns individuals into one and only one ancestral population, and the individuals assigned to the same ancestral population are relatively homogeneous with respect to AIM frequencies. To estimate the precision of our PLINK estimates, 95 % bootstrapping confidence intervals (Efron and Tibshirani 1993) were calculated.
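A bootstrap confidence interval of the kind mentioned above can be sketched as follows. The text does not specify which bootstrap interval was used, so the percentile method shown here, the number of resamples, and the function name `bootstrap_ci` are assumptions for illustration.

```python
import random


def bootstrap_ci(values, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values).

    Resamples the observations with replacement n_boot times, recomputes
    the statistic on each resample, and returns the empirical alpha/2 and
    1 - alpha/2 quantiles of the resampled statistics.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

For a percentage, `values` could be a list of 0/1 indicators (for example, whether an individual's cluster assignment agrees with the self-report) and `stat` the sample mean.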
The STRUCTURE analysis considers each individual’s genome having potentially arisen from an admixture of multiple populations; it also estimates relative contributions to each individual from multiple ancestral populations. The STRUCTURE analysis assumes a K value that represents the hypothesized number of ancestral populations. It then uses the differences in allele frequencies in the AIMs to predict how much each ancestral population contributed to the genetic ancestry of a given individual. The K contributions from K ancestral populations for each individual sum to 1.
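The admixture idea can be written compactly. The expression below is a standard way of stating the admixture model of Pritchard et al. (2000); the notation is introduced here for illustration. Let q_{ik} be the proportion of individual i's ancestry from population k, and let p_{klj} be the frequency of allele j at locus l in population k. Then the probability that allele copy a of individual i at locus l is allele j is

```latex
\Pr\!\left(x_{il}^{(a)} = j \mid q_i, P\right) = \sum_{k=1}^{K} q_{ik}\, p_{klj},
\qquad q_{ik} \ge 0, \qquad \sum_{k=1}^{K} q_{ik} = 1 .
```

The vector (q_{i1}, ..., q_{iK}) is the set of K contributions that sum to 1 for each individual.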
Each STRUCTURE run used a burn-in period of 10,000 iterations, followed by 20,000 iterations from which estimates of bio-ancestry were obtained. To take into account precision of estimates, we performed 20 replicate STRUCTURE runs. All pairwise symmetric similarity coefficients (SSC) were greater than 0.995. An SSC measures the similarity of two sets of population structure estimates. Our final figures for bio-ancestry were averaged over the results of the 20 sets of estimates. Our approach is similar to that used in studies of genetic structure among American Indians (Wang et al. 2007) and Pacific Islanders (Friedlaender et al. 2008).
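Averaging over replicate runs can be sketched as below. Replicate runs of a clustering program may list the same clusters in a different order, so the sketch first aligns each run's columns to the first run before averaging. The text does not describe how alignment was handled, and the squared-distance criterion used here is a simple stand-in, not the SSC. The function names and the list-of-lists layout for the ancestry estimates are hypothetical.

```python
from itertools import permutations


def align_to_reference(ref, run):
    """Reorder the columns of run so that it is closest to ref.

    ref and run are lists of rows, one row of K ancestry proportions per
    individual. All K! column orders are tried, which is feasible only
    for small K.
    """
    k = len(ref[0])

    def squared_distance(perm):
        return sum(
            (r[j] - x[perm[j]]) ** 2
            for r, x in zip(ref, run)
            for j in range(k)
        )

    best = min(permutations(range(k)), key=squared_distance)
    return [[row[best[j]] for j in range(k)] for row in run]


def average_runs(runs):
    """Element-wise mean of replicate estimates after column alignment."""
    ref = runs[0]
    aligned = [ref] + [align_to_reference(ref, r) for r in runs[1:]]
    n, k = len(ref), len(ref[0])
    return [
        [sum(a[i][j] for a in aligned) / len(aligned) for j in range(k)]
        for i in range(n)
    ]
```

Because each aligned row sums to 1, the averaged proportions for each individual also sum to 1.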
Both the PLINK and STRUCTURE procedures assume that the individuals in the analysis have originated from K populations. K is chosen for each analysis run, but it can be varied across different runs. Because our panel of AIMs was designed to differentiate continental populations of Europeans, Africans, and East Asians, we set K = 3. However, to test the robustness of our results to choice of K, we performed analyses assuming K = 3, 4, 5, 6, and 7.
The third method, implemented in the software EIGENSTRAT (Price et al. 2006), identifies bio-ancestry through principal components (PCs). Principal component analysis is one of the most widely used techniques to reduce the dimensionality while retaining most of the variation in a data set. In other words, the technique summarizes a large number of variables by a small number of new linearly independent variables. Principal component analysis ranks the relative importance of those components in a descending way, so that the first component contains the largest variation of the original variables. A large number of AIMs provide rich and detailed ancestry-related information for each individual. However, such high-dimensional data make it difficult to visualize the patterns of genetic distances between individuals. When we plot the first and second principal components, genetic distances between individuals (thus genetic clusters) are displayed. The first two principal components represent a significant portion of ancestral information contained in the set of AIMs.
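The principal components idea can be illustrated with a small, dependency-free sketch. This is not EIGENSTRAT itself. It assumes genotypes coded as 0, 1, or 2 copies of a reference allele, standardizes each SNP by its mean and by sqrt(p(1 - p)) in the spirit of that software, and extracts the leading components of the individual-by-individual matrix by power iteration with deflation. The function names, the start vector, and the iteration count are arbitrary choices made for illustration.

```python
import math


def standardize(genotypes):
    """Center each SNP column and scale it by sqrt(p(1 - p)).

    genotypes is a list of rows (individuals) of 0/1/2 allele counts;
    p is the estimated allele frequency (column mean / 2).
    Monomorphic SNPs carry no information and are dropped.
    """
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0
        scale = math.sqrt(p * (1.0 - p))
        if scale == 0.0:
            continue
        cols.append([(g - mean) / scale for g in col])
    return [[c[i] for c in cols] for i in range(n)]


def top_components(x, n_components=2, n_iter=500):
    """Leading eigenvectors of X X^T (one score per individual).

    Power iteration; earlier components are projected out on every
    iteration so that later components are orthogonal to them.
    """
    n = len(x)
    gram = [[sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
            for i in range(n)]
    comps = []
    for c in range(n_components):
        v = [math.sin(i + 1 + c) for i in range(n)]  # deterministic start
        for _ in range(n_iter):
            for u in comps:
                d = sum(a * b for a, b in zip(v, u))
                v = [a - d * b for a, b in zip(v, u)]
            w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0.0:
                break
            v = [a / norm for a in w]
        comps.append(v)
    return comps
```

Plotting the first returned vector against the second gives the kind of two-dimensional display described above. In practice a linear algebra library would replace the hand-written power iteration.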
Social Construction of Race
To examine the practice of the one-drop rule, we calculated the percentage of the sample with a proportion of African ancestry that reports itself as black and the percentage’s 95 % bootstrapping confidence interval. We expect that the higher the proportion of African ancestry, the more likely it is that individuals will self-classify or be classified by an interviewer as black. However, the important question is, at what proportions of African ancestry do substantial percentages of individuals begin to self-classify or be classified by an interviewer as black? We also calculated percentage of the sample with a proportion of European ancestry that reports itself as white as well as the percentage’s 95 % bootstrapping confidence interval. Comparing black and white calculations would reveal the likely asymmetry between these two groups: that is, does it take a much higher proportion of European ancestry to self-classify or be classified as white than the proportion of African ancestry needed to self-classify or be classified as black?
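A percentile bootstrap for one ancestry bin might look like the sketch below. The text does not specify which bootstrap variant or how many resamples were used, so the percentile method, the 2,000 resamples, the function name, and the toy counts (12 of 40 reporting as black) are all assumptions.

```python
import numpy as np

def bootstrap_percent_ci(flags, n_boot=2000, level=0.95, seed=0):
    """Percent of a bin with flag == 1, plus a percentile bootstrap interval.

    flags: 0/1 indicators, e.g. 1 if the individual reports as black.
    """
    flags = np.asarray(flags, dtype=float)
    rng = np.random.default_rng(seed)
    n = flags.size
    # Resample individuals with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_pcts = 100.0 * flags[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.percentile(boot_pcts, [100 * alpha, 100 * (1 - alpha)])
    return 100.0 * flags.mean(), float(lower), float(upper)

# Toy bin (hypothetical counts): 12 of 40 individuals report as black.
toy_flags = np.array([1] * 12 + [0] * 28)
pct, ci_low, ci_high = bootstrap_percent_ci(toy_flags)
```

When every individual in a bin gives the same answer, each resample is identical and the interval collapses to a single point, which is one reason small bins are collapsed before reporting.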
Our analysis also takes into consideration three factors expected to affect the practice of the one-drop rule: an individual’s ancestral composition, whether a race questionnaire contains a multiracial option, and whether an individual self-classifies or is classified by an interviewer. Our main analysis sample on the one-drop rule included only individuals who self-reported as non-Hispanic black, white, or black-white. A separate analysis using only Hispanics was performed so that Hispanics and non-Hispanics could be compared.
For ROOM, we calculated two sets of percentages and their confidence intervals: one set using the self-reported race on the college housing application form that did not have multirace categories; and the second set using the online survey responses, which did allow selection of multiple race categories. For Add Health, we analyzed two samples: the first sample included non-Hispanics, and the second included only Hispanics. Using the first sample, we calculated three sets of percentages and their confidence intervals: the first used self-reported single race, the second used single race recorded by interviewers, and the third used the 2000 U.S. census self-reported questionnaire that allowed the selection of multiple races.
To examine the fluidity of racial classification, we restricted our analysis sample to individuals who were classified by our PLINK analysis as blacks and whites; Hispanics were excluded. First, we investigated the extent to which these individuals “switch” to a multirace category when presented with this option; second, we explored which social circumstances might make some individuals more likely than others to switch racial classification. In all analyses, we controlled for bio-ancestry.
Table 2 presents results from PLINK cluster analysis, showing both the percentage and case distribution of self-reported race by PLINK-estimated genetic cluster or bio-ancestry. These PLINK estimates (as well as other estimates based on genetic data) are placed in quotation marks to differentiate them from self-reports. The samples were assumed to have derived from three ancestral populations (K = 3). We repeated the analysis, assuming K = 3, 4, 5, 6, and 7, and using a fuller range of self-reported racial classification groupings. The findings from these additional analyses are substantively identical to those in Table 2 and are also available upon request.
In ROOM, of those who self-reported as white, 99.5 % were assigned into the “white” category by the cluster analysis. Of those who self-reported as black, 99.3 % were classified as “black.” We separated South Asians from non–South Asians; previous work suggests that South Asians share substantial bio-ancestry with Europeans (e.g., Rosenberg et al. 2002). Of those self-classifying as non–South Asians (including Chinese, Japanese, Koreans, Filipinos, and Vietnamese), 97.7 % were assigned as “non–South Asians.” Three of the four self-reported American Indians were classified as “white.” The bootstrapping 95% confidence intervals for the three key groups of whites, blacks, and non–South Asians were [99.0, 99.9], [94.7, 100], and [89.5, 100], respectively, indicating that the correspondence between bio-ancestry and self-reports for the three main racial groups is estimated with precision.
The results from Add Health are comparable. Of individuals who self-classified as white, black, or non–South Asian, 99.4 %, 100.0 % and 93.7 %, respectively, were assigned by cluster analysis into the “white,” “black,” and “non–South Asian” categories. The only two self-reported South Asians in Add Health were excluded from the analysis. All self-reported American Indians were classified as “white.” The three confidence intervals for Add Health were [97.7, 99.9], [96.9, 100], and [88.1, 97.5], respectively.
Assuming three ancestral populations, we performed a STRUCTURE analysis (Pritchard et al. 2000) on data from ROOM and Add Health (Fig. 1). This analysis allows each individual to have memberships in as many as three ancestral populations. The bar graph shows the ancestral proportional composition for each individual. Each individual is represented by a vertical line partitioned into as many as three segments; the length of each segment measures the contribution of one of the three ancestral groups to the individual’s genome. The three continental ancestries are European (red in Fig. 1), African (blue in the figure), and Asian (yellow in the figure). The labels of self-reported race/ethnicity were used to order the individuals or vertical lines in the graph and were added only after each individual’s ancestry had been estimated. There are two sets of labels for white, black, Hispanic, Asians, and so on, with one set above the graph and the other below. The two sets of labels indicate the self-reported single-race and mixed-race individuals.
The results from the STRUCTURE analysis not only confirm the findings described in Table 2 but also demonstrate a close match between the estimated bio-ancestry and self-reported race of multiracial individuals. For example, the bar graph for ROOM shows that the vertical lines for individuals who self-reported as black-white are mostly composed of blue and red colors; the lines for those who self-reported as East Asian-white are largely composed of yellow and red colors. In Add Health, there are fewer respondents who are black-white; the lines of these individuals are composed of red and blue colors. Panel 3 of Fig. 1 magnifies the section of Hispanics in Panel 2, showing that Cubans in Add Health contain a high percentage of European ancestry, that Puerto Ricans contain a significant portion of African ancestry, and that Chicanos are similar in ancestral composition to Mexicans.
Table 3 gives the distribution of average ancestry for each self-reported race/ethnicity assuming three ancestral populations. The results in Table 3 were averaged over the estimates presented in Fig. 1. The results across ROOM and Add Health are consistent. For example, in the two studies, respectively, the average percentage of Caucasian ancestry among self-reported whites is 98.1 % and 98.3 %; the percentage of African ancestry among self-reported blacks is 89.7 % and 93.2 %; and the percentage of East Asian ancestry among self-reported East Asians is 95.5 % and 92.7 %. The ancestry distribution for subgroups within Hispanics in Add Health is also presented.
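The averaging behind a table of this kind can be sketched as follows: each individual contributes a vector of three ancestry proportions, and the vectors are averaged within each self-reported group. The function name and the toy records are hypothetical and do not reproduce any figure in Table 3.

```python
from collections import defaultdict

def mean_ancestry_by_group(records):
    """Average (european, african, east_asian) proportions within each group.

    records: iterable of (self_reported_group, (european, african, east_asian)).
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for group, proportions in records:
        for k, value in enumerate(proportions):
            totals[group][k] += value
        counts[group] += 1
    return {g: tuple(t / counts[g] for t in totals[g]) for g in totals}

# Toy individual-level estimates (hypothetical values).
toy_records = [
    ("white", (1.00, 0.00, 0.00)),
    ("white", (0.50, 0.25, 0.25)),
    ("black", (0.25, 0.75, 0.00)),
]
means = mean_ancestry_by_group(toy_records)
```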
Figure 2 displays the genetic distances among the individuals in ROOM (Panels 1a–1d) and Add Health (Panels 2a–2d) in the context of 52 world populations consisting of more than 1,000 individuals from HGDP. We analyzed the U.S. participants and reanalyzed the HGDP study participants in order to compare the two sets of results. Each panel plots the two largest principal components obtained from analyzing the same set of AIMs, and the resulting figure reveals patterns of genetic distances among individuals. Panel 1a plots bio-ancestral distances among the HGDP individuals only. Africans and East Asians are the furthest from each other; American Indians and individuals from Oceania are much closer to East Asians than to Europeans and Africans; and Central Asians and Middle Eastern individuals are closer to Europeans than to East Asians.
In Panels 1b–1d, the HGDP map of ancestral locations in Panel 1a is used as a backdrop with the U.S. sample (black symbols) imposed onto the HGDP map. The U.S. sample self-classified as African Americans (Panel 1b), East Asians (Panel 1c), and Europeans (Panel 1d). Self-classified East Asians and Europeans in the U.S. sample overlap almost completely with the HGDP East Asians and Europeans, respectively, while self-classified African Americans are located slightly away from the HGDP Africans and closer to the HGDP North Africans and Europeans, which is consistent with the presence of some European ancestry in African Americans. The Add Health results (2a–2d), based on a smaller set of AIMs (121 vs. 162 for ROOM) are similar to those in ROOM. These findings have thus established an agreement among our bio-ancestral results from the PLINK, STRUCTURE, and EIGENSTRAT analyses. We also demonstrate an agreement among the findings based on the U.S. data (ROOM and Add Health), the HGDP, and HapMap.
The One-Drop Rule
Table 4 shows the percentage of a sample with a proportion of African ancestry that reports itself as black for ROOM and Add Health. The related 95 % bootstrapping confidence intervals are given in parentheses. The point estimates are boldfaced to highlight the general patterns across the proportion of African ancestry. We display the information in deciles, but we collapse several deciles where sample sizes are small.
In ROOM, when only a single race was allowed to be self-reported on the housing application, individuals with 30 % to 40 % or more African ancestry always self-classified as black. After the questionnaire in the online survey allowed multiracial categories, the percentages that self-classified as black lowered considerably in comparison with those in the housing form. The lowering or the weakening of the one-drop rule is particularly conspicuous near the 50 % African ancestry mark. Among those with 40 % to 70 % African ancestry (N = 39), when single race was the only choice, 100 % self-identified as black; when offered multiracial options, 24 of the 39 did not self-classify as black in the online survey (column 3 vs. column 2). The 95 % bootstrapping confidence intervals for the online estimates are almost always below those for the housing form (column 3 vs. column 2). On the other hand, the point estimates and confidence intervals in column 3 show that large proportions of individuals with 40 % to 70 % African ancestry still self-classified as black, indicating a cultural influence of the one-drop rule in spite of multiracial options.
The non-Hispanic data from Add Health Wave I displayed a pattern similar to that from ROOM. The large majority of individuals with >30 % African ancestry self-classified as black. The percentages of individuals who self-classified as black also dropped considerably when multiracial categories were an option (column 7). Interviewer classification did not differ markedly from self-classification. The Hispanic data from Add Health include only a small number of persons with >30 % African ancestry, too few to be informative on the one-drop rule (columns 9–11).
Table 5, a mirror image of Table 4, gives the percentage of a sample with a proportion of European ancestry that reports itself as “white” for both ROOM and Add Health. The contrast between Tables 4 and 5 among non-Hispanic individuals is evident. A much larger proportion of individuals with 30 % to 70 % African ancestry self-classified as black (Table 4: 100 % and 38 % in response to a single-race question and a multirace question for ROOM; 82 % and 42 % for Add Health) than the proportion of individuals with 30 % to 70 % European ancestry self-classified as white (Table 5: 3 % and 0 % in response to a single-race question and a multirace question for ROOM; 27 % and 13 % for Add Health). The asymmetry between Tables 4 and 5 is that it takes a higher proportion of European ancestry to self-classify or be classified by an interviewer as white than the proportion of African ancestry needed to self-classify or be classified as black. When multiracial categories come into play, some individuals with a high proportion of European ancestry (columns 3 and 7) switched classification from white to multiracial. Again, interviewer classification does not differ from self-classification noticeably.
The Hispanics from Add Health in Table 5 show a distinct pattern. Those with 30 % to 60 % European ancestry are more likely than non-Hispanics with the same proportion of European ancestry to self-classify as white (column 9 vs. column 5). For example, about 45 % of Hispanics with 40 % to 50 % European ancestry self-classified as white, compared with about 14 % of non-Hispanics with 40 % to 50 % European ancestry. Hispanics with >60 % European ancestry were less likely than their non-Hispanic counterparts to self-classify as white and more likely to self-classify as multiracial (column 9 vs. column 5). For example, only about 50 % of Hispanics with 80 % to 90 % European ancestry self-classified as white, compared with 100 % of non-Hispanics with 80 % to 90 % European ancestry.
Tables 4 and 5 record another asymmetry from both ROOM and Add Health. In the column of the number of individuals by proportion of African ancestry in Table 4, individuals with 10 % to 50 % African ancestry (N = 14 for ROOM and N = 26 for Add Health) are considerably less numerous than individuals with 50 % to 90 % African ancestry (N = 131 for ROOM and N = 94 for Add Health).
The Fluidity of Racial Classification
Table 6 shows the number and percentage of blacks and whites who switched racial classification between the single-race and multirace options. In ROOM, 16.8 % and 2.6 % of blacks and whites, respectively, switched racial classification. The black switchers and nonswitchers scored .76 and .91, respectively, on African ancestry. The white switchers and nonswitchers scored .96 and .98, respectively, on European ancestry. In Add Health, 5.03 % and 2.8 % of blacks and whites, respectively, changed their racial classifications. The switchers and nonswitchers scored, respectively, .68 and .93 on African ancestry among blacks and .94 and .98 on European ancestry among whites. Among individuals who changed racial classification, more than 70 % of both blacks and whites switched to a multiracial category. Overall, those who changed classification scored lower on bio-ancestry than the nonswitchers within both the African and European samples. The higher probability of classification switching among blacks than whites could be partially attributed to bio-ancestry, suggesting that bio-ancestry needs to be accounted for when examining sociocontextual sources of classification switching.
Logistic regression was used to examine the sociocontextual sources of classification switching (Table 7). The descriptive statistics of the variables used in the regression models are given in Table 8. The outcome variable was coded as 1 for classification-changers and 0 for nonchangers. In ROOM, Model 1 (which is based on the combined sample of blacks and whites) contains a statistical significance test for the exploratory results described in Table 6, indicating that blacks were about seven times as likely to switch racial classification as whites. This finding is highly statistically significant. However, after primary ancestry—that is, an individual’s most prominent ancestry (African, Caucasian, or Asian bio-ancestry)—is controlled, the odds ratio is reduced from 7.33 to 2.94 (Model 2). Primary ancestry has proved important; an increase of 1 % bio-ancestry reduces the likelihood of classification change by (1 – .94) = 6 %. This result applies to those whose primary ancestry is African and those whose primary ancestry is European. Model 3 shows that students from the South are about 42 % as likely or 58 % less likely to change racial classification as the non-Southern students. Age and gender are not related to classification switching. The findings were obtained after African ancestry was controlled.
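The arithmetic behind these interpretations can be made explicit with a short sketch. It assumes the reported figures (.94 per percentage point of primary ancestry, .42 for students from the South) are odds ratios from the logistic models, so the percentage changes apply to the odds of switching; the helper names are hypothetical.

```python
import math

def odds_ratio_from_coefficient(beta):
    """Convert a logistic regression coefficient to an odds ratio."""
    return math.exp(beta)

def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

def odds_ratio_for_increment(odds_ratio_per_unit, units):
    """Odds ratio for a change of several units (effects multiply)."""
    return odds_ratio_per_unit ** units

# One more percentage point of primary ancestry: about a 6 % reduction.
per_point = percent_change_in_odds(0.94)

# Students from the South: about 58 % lower odds of switching.
south = percent_change_in_odds(0.42)

# Ten more percentage points compound to roughly 0.54, not 1 - 10 * 0.06.
ten_points = odds_ratio_for_increment(0.94, 10)
```

Because odds ratios multiply rather than add, a 10-point difference in primary ancestry corresponds to roughly a 46 % reduction in the odds of switching under this reading, not 60 %.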
For self-reported white participants in ROOM (Model 4), an increase of 1 % European ancestry reduced the likelihood of classification switching by 12 %. Model 4 indicates that in addition to bio-ancestry, social environment also influences classification switching among white students. Those whose neighborhoods were mostly white were 77 % less likely to switch racial classification than those whose neighborhoods were completely or mostly nonwhite. The coefficient estimate for those whose neighborhoods were completely white is similar (.25), but the estimate is statistically significant only at the .10 level. White participants whose friends were 76 % to 100 % white were 70 % less likely to change racial classification than those whose friends were 0 % to 50 % white. The neighborhood and friend effects were estimated in the same model.
In Add Health, black adolescents were about twice as likely to switch racial classification as white adolescents when bio-ancestry was not controlled. The black-white difference disappeared after bio-ancestry was included in the model (Model 6). Among blacks, “Southern State” as measured in Add Health was not related to classification switching. Among both blacks and whites, living in a census block group in which the modal racial category was the same as one’s own race was associated with a 70 % lower likelihood of racial classification switching. We replaced the measure of neighborhood racial composition with a measure of racial heterogeneity in the respondent’s friendship network, created from nominated friends in the in-school study at Wave I (Models 9 and 10). Racial heterogeneity ranges from 0 (where all in the network are of the same race/ethnicity) to .8 (where all five racial/ethnic groups (black, Asian, Hispanic, white, and other) are equally represented). Higher racial heterogeneity is associated with a higher likelihood of classification change for both blacks and whites. The marginally significant result for blacks could be due to the reduction in sample size.
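The text does not give the formula for the heterogeneity measure. One index consistent with the stated range (0 for a single-race network, .8 when five groups are equally represented) is one minus the sum of squared group shares; the sketch below assumes that form, and the function name and counts are hypothetical.

```python
def racial_heterogeneity(counts):
    """One minus the sum of squared group shares (assumed form of the index).

    counts: number of nominated friends in each racial/ethnic group,
    e.g. [black, Asian, Hispanic, white, other].
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("the friendship network has no nominated friends")
    return 1.0 - sum((c / total) ** 2 for c in counts)

# All friends from one group: no heterogeneity.
same_race = racial_heterogeneity([5, 0, 0, 0, 0])

# Five groups equally represented: the maximum for five groups.
evenly_mixed = racial_heterogeneity([2, 2, 2, 2, 2])
```

Under this assumed form the maximum for G groups is 1 − 1/G, which equals .8 when G = 5.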
Discussion and Conclusion
Our research demonstrates a close match between estimated bio-ancestry and self-reported race among self-reported blacks, whites, and East Asians in ROOM and Add Health. Our overall analytical strategy for estimating bio-ancestry resembles that used for estimating the links between genetic variations and human traits. That strategy is composed of two essential components. The first is an association between a genetic variant and a human trait, and the second is a replication in one or more independent data sets. This strategy was used in a number of influential publications that identified genetic variants associated with human diseases (e.g., Frayling et al. 2007). In this project, the same panel of AIMs that differentiates European, African, and East Asian populations was first selected in the HapMap data set and then replicated in three independent data sets: the U.S. ROOM, the U.S. Add Health study, and the worldwide HGDP. If either sample representativeness or result predetermination were a serious threat, the replication of these findings across four independent data sources would be unlikely. Our results were also replicated across three different methods (as implemented in PLINK, STRUCTURE, and EIGENSTRAT) that estimate genetic clustering across continental populations.
The extent to which bio-ancestry matches self-classification of race, however, varies across social and cultural contexts. The one-drop rule represents an important case in which social context trumps bio-ancestry. When asked to classify into a single race, most individuals with 30 % to 60 % African ancestry self-report as black; virtually all respondents with >60 % African ancestry self-classify as black. In contrast, a substantially higher proportion of European ancestry is “required” to self-classify or to be classified by an interviewer as white than the proportion of African ancestry necessary to self-classify or be classified as black. However, when given the option of identifying as multiracial, the majority of individuals with 40 % to 60 % African ancestry in both ROOM and Add Health and substantial proportions of individuals with >60 % African ancestry in ROOM stopped self-classifying as only black and primarily chose a multiracial classification.
In summary, although the cultural legacy of the one-drop rule is still evident among the youth in survey responses, the practice has been eroded by recent modifications in survey questions of race and ethnicity. Given the choice of multiracial categories, large proportions of black-white mixed individuals self-classify as multiracial rather than black. This tendency to follow the one-drop rule is observed only among non-Hispanic white, black, and black-white individuals—not among Hispanics. This observation is consistent with the black-nonblack divide discussed recently by Bean et al. (2009) and Lee and Bean (2007). The recent nonwhite racial/ethnic diversity from immigration, the growth of intermarriage, and the rise of multiracial births have not erased the traditional black-white color line. Instead, the United States may simply be redrawing a color line that divides blacks from other racial/ethnic groups.
The fluidity in racial classification represents another major case in which social forces interact with bio-ancestry to shape racial classification. In both ROOM and Add Health, the racial composition of an individual’s social environment is important. In ROOM, white students from a mostly white neighborhood and with mostly white friends are less likely to change racial classification from white to a multiracial category. In Add Health, both black and white students from neighborhoods composed mostly of own-race residents are less likely to change racial classification. Replacing racial composition in neighborhoods with racial composition in one’s friend networks yielded similar results.
After bio-ancestry is adjusted for, blacks are more likely than whites to opt for another racial classification when multiracial categories are an option. This pattern was found only in ROOM, not in Add Health. In ROOM, black students from a southern state were less likely than those from other parts of the country to change racial classification. This result may be explained by the observation that the American South is the region where the one-drop rule originated (Davis 1991) and where racial discrimination and segregation were practiced legally and overtly.
A cautionary note should be made about the comparison between the housing form and the online survey in ROOM, and between ROOM and Add Health. The different responses to the two surveys in ROOM could have resulted from factors other than differences in the questions. Factors such as college education could play a role. Similarly, the differences in the results between ROOM and Add Health could be due to the differences in how responses on racial classification were obtained in the two studies. Students ages 12–18 in Add Health might have treated a race/ethnicity question in a survey less seriously than incoming college freshmen treated a similar question on a housing application form. The information on the housing form would be part of the official university database. Even though the university housing authority did not use race and ethnicity for assigning a dormitory room, students may not have known this. In addition, students may be concerned about whether the expectation created by self-reported race and ethnicity on the housing form would be in agreement with their prospective roommates’ conceptualization of race and ethnicity.
Another case in which self-reports did not match bio-ancestry occurred among those who self-classified as American Indian. Averaging a European ancestry of 67 % and 63 %, respectively, in ROOM and Add Health, and with distal ties to American Indians, these individuals were predominantly of European ancestry. These findings are consistent with the view that the drastic rise in the number of American Indians reported in the U.S. census over the past few decades is a result of ethnic re-identification (Eschbach 1993; Kelly and Nagel 2002; Nagel 1995).
The analysis reveals many fewer individuals with an African ancestry of 10 % to 50 % than individuals with an African ancestry of 50 % to 90 %. This imbalanced distribution is unlikely to result from the fact that there are many more whites than blacks. As long as a mixed union requires a white person and a black person, the marginal distribution in terms of the number of persons (not the proportions) should be balanced. This imbalanced distribution is likely a result of the one-drop rule and/or the minimal miscegenation between African and European Americans since 1865 (Davis 1991: chapters 3–4; Williamson 1980:188). For many decades, mixed-race individuals with one black parent and one white parent were treated as blacks rather than mixed-race individuals. Under such racial exclusion, these mixed-race individuals partnered predominantly with other mixed-race or black individuals rather than whites. These patterns of marriages redistributed the European ancestry in the original mixed-race individuals, “whitening” the general black population and yielding few individuals of more than 50 % European ancestry.
Our findings apply only to the contemporary United States. The dynamics of racial classification in other countries could be quite different. Race is fluid. The racial and ethnic categories as we know them in the contemporary United States are constantly changing. Ongoing immigration, intermarriage, and social mobility are likely to blur contemporary racial and ethnic divisions and boundaries (Perez and Hirschman 2009); therefore, the racial categories we use today may no longer be relevant, or as relevant, in the future.
Our work has a larger theoretical significance for identity studies. Brubaker and Cooper (2000) criticized the overuse of the word “identity” in the social analysis of such concepts as race, gender, and sexual orientation in social sciences, cultural studies, ethnic studies, literature, and political philosophy. They argued: “. . . that the prevailing constructivist stance on identity—the attempt to ‘soften’ the term, to acquit it of the charge of ‘essentialism’ by stipulating that identities are constructed, fluid, and multiple—leaves us without a rationale for talking about ‘identities’ at all and ill-equipped to examine the ‘hard’ dynamics and essentialist claims of contemporary identity politics” (p. 1). For example, they asked, “If [identity] is constructed, how can we understand the sometimes coercive force of external identifications?” (p. 1).
Brubaker and Cooper were not opposed to social construction per se. In the particular case of “race” in the United States, for example, they promoted a detailed analysis of how particular forms of social construction of race “emerge, crystallize, and fade away in particular social and political circumstances” (p. 30). They maintained that construction analysis should not be reduced to an oversimplified and flattened identity account.
Our work demonstrates that in the case of race, social construction could be analyzed and examined against a measurable continental and biological ancestry. Race is, indeed, multiple and fluid, but not all identifications of race are equally constructed. Some deviate more and some less from bio-ancestry. Capitalizing on bio-ancestry, social construction analysis can lay bare whether, how much, and under what social circumstances racial identification departs from bio-ancestry.
Two grants to Guang Guo supported the College Roommate Study (the William T. Grant Foundation) and the Illumina 1536 genotyping in Add Health (NSF’s Human and Social Dynamics program BCS-0826913). Data from Add Health were funded by the National Institute of Child Health and Human Development, with cooperative funding from 17 other agencies (www.cpc.unc.edu/addhealth/contract.html) to Kathleen Mullan Harris (P01-HD31921). Special acknowledgment is due Rick Bradley of the Housing Department, Kirk Wilhelmsen of the Genetics Department, Patricia Basta of the Bio-Specimen Process Center, Jason Luo of the Mammalian Genotyping Center, and the Odum Institute at the University of North Carolina, Chapel Hill. We received important assistance in SNP selection and the analysis of HGDP data from David Goldman and his Neurogenetics lab at NIAAA. Many hearty thanks go to Greg Duncan for his important role in the project and his helpful comments on the manuscript. We are grateful to the Carolina Population Center (R24 HD050924) for general support.
See Rosenberg et al. (2003) for a technical justification.