Abstract
Although Wikipedia has a widely studied gender gap, almost no research has attempted to discover if it has a comparable race and ethnicity gap among its editors or its articles. No such comprehensive analysis of Wikipedia's editors exists because legal, cultural, and social structures complicate surveying them about race and ethnicity. Nor is it possible to precisely measure how many of Wikipedia's biographies are about people from indigenous and nondominant ethnic groups, because most articles lack ethnicity information. While it seems that many of these uncategorized biographies are about white people, these biographies are not categorized by ethnicity because policies require reliable sources to do so. These sources do not exist for white people because whiteness is a social construct that has historically been treated as a transparent default. Thus, these biographies cannot be categorized as white because whiteness is unverifiable in Wikipedia's white epistemology. In the absence of a precise analysis of the gaps in its editors or its articles, I present a quantitative and qualitative analysis of these structures that prevent such an analysis. I examine policy discussions about categorization by race and ethnicity, demonstrating persistent anti-Black racism. Turning to Wikidata, I reveal how the ontology of whiteness shifts as it enters the database, functioning differently than existing theories of whiteness account for. While the data does point toward a significant race and ethnicity gap, the data cannot definitively reveal meaning beyond its inability to reveal quantitative meaning. Yet the unverifiability of whiteness is itself an undeniable verification of Wikipedia's whiteness.
Though Wikipedia has a widely studied gender gap, very little research has attempted to discover if it has a comparable race and ethnicity gap in its content and its contributors. Among the thousands of articles about Wikipedia in academic journals and the popular press, only three recently published articles have begun to analyze this race and ethnicity gap.1 While these texts show evidence of a gap in a comparative analysis of a small subset of articles, none was able to make a comprehensive analysis, which would answer two questions: (1) What percentage of Wikipedia's editors are from indigenous and historically nondominant ethnic groups? (2) What percentage of Wikipedia's biographies are about people from indigenous and historically nondominant ethnic groups? I set out to try to answer these two questions, but in the process I discovered three interrelated problems that prevent such a comprehensive analysis: (a) the multitude of different cultural understandings of race, ethnicity, nationality, and caste throughout the world prevents surveying the editors about their race and ethnicity; (b) Wikipedia's category structures, combined with the policies that constrain their use, limit their analytic usefulness; and (c) the unverifiability of whiteness leaves the majority of articles without any race or ethnicity metadata. These constraints prevent determining either the numerator or the denominator necessary to calculate a percentage.
In the absence of a precise, comprehensive analysis of the gaps in content and editors, I present a quantitative and qualitative analysis of these structures that prevent such an analysis. Through close readings of policy debates, I present qualitative evidence of how white English Wikipedia really is. And through an exploration of how theories of whiteness function differently in the database, I reveal how the ontology of whiteness shifts as it enters the database, functioning differently than existing theories of whiteness account for. This shift holds significant meaning for theorists, who need to understand how the ontology of whiteness functions differently in a database. It also impacts data scientists, who need to understand that the data is far less neutral than it is being treated as and that the presumed neutrality of the data parallels the presumed neutrality of whiteness. Data structures the emerging political economy, amplifying the importance of these nuances. While the data does point toward a significant race and ethnicity gap, it cannot definitively reveal meaning beyond its inability to reveal quantitative meaning. Yet the unverifiability of whiteness is itself an undeniable verification of Wikipedia's whiteness.
Wikipedia has become one of the internet's key sources of data, in part because it is a digital commons. Data is so central to our contemporary political economy that some argue that capitalism has been superseded by a new economic system in which control of the means of production has been surpassed by control over information and its means of distribution.2 Such analyses of data focus on the private ownership of user data extracted via ubiquitous surveillance built into such tools as Gmail and Facebook, and the ways in which these data are used to predict and modify our behavior, with terrible consequences for society.3 Yet Wikipedia's users remain unsurveilled. Wikipedia has steadfastly refused to put advertisements on the site, track user information, or even collect nonvoluntary user demographic data from IP addresses. Nor is Wikipedia's data privately owned. Wikipedia is produced under a Creative Commons Attribution-ShareAlike license, allowing anyone to reuse its content as part of the digital commons. Through this license, Wikipedia is reused throughout the internet; for instance, it appears on Google's Knowledge Panel in the sidebar of internet searches, and it helps YouTube and Facebook fact-check disinformation. Wikidata, which is like Wikipedia for structured data, feeds Alexa and Siri their answers. Thus, absences on Wikipedia and Wikidata echo across the internet.4
How Wikipedia is written shapes its content and the value it creates. Wikipedia inherits its Enlightenment belief in objectivity from L'Encyclopédie and Britannica, whose empirical rationalism reinscribed entrenched power relations in a colonialist effort to describe empire.5 And yet its content, processes, and political economy reveal key differences. Wikipedia is a history of the now, dominated by sportspeople and popular media, with more articles about people born in 1947 than any other year.6 Wikipedia's processes are collaborative and radically transparent, showing every edit made to a page over its entire history.7 It aspires to inclusivity, allowing anyone to edit, and though its technical and social processes exclude people, Wikipedia seeks to be, and retains the radical potential to be, something different.8 Lastly, Wikipedia's editors contribute as volunteers, providing free labor, and the content they create is freely available.9 The nonprofit project does, of course, produce value. This value is shared equally through its Creative Commons license but is exploited unevenly. The value Apple extracts from the data by displaying Wikipedia summaries whenever an iOS user chooses to “look up” a term is vastly larger than the value average readers extract when they read an article.
Who edits Wikipedia matters, because it determines which articles are written and from what perspective.10 This impact is twofold: more obviously, if certain groups are underrepresented among the editors, content about those groups will be underrepresented in the articles; more insidiously, those groups will not be present for the policy discussions that govern Wikipedia.11 Famous for its openness, the encyclopedia that anyone can write actually has a lot of rules that govern content and conduct. Most significantly for this analysis, “Wikipedia is written from a neutral point of view”; thus “all articles must strive for verifiable accuracy, citing reliable, authoritative sources, especially when the topic is controversial or is about a living person.”12 The requirement for verifiability, as well as the interpretations of the bounds of neutrality, and what constitutes reliable sources, has been agreed on by consensus through extensive discussions that take place on talk pages—essentially Wikipedia's peer review process, with a dash of social media chatter. Who participates in these discussions matters. As I show through close readings of persistent anti-Black rhetoric in the policy discussions about the guidelines for categorization by race and ethnicity on English Wikipedia, if only white cisgender men are participating in these decisions, they will encode their own implicit biases into the construction and interpretation of these rules that govern how Wikipedia is built.
Wikipedia's gender gap has been studied extensively, but the quantitative methodologies used to analyze gender are unable to answer similar questions about race and ethnicity. The gender gap among editors has been measured by surveys, but no such comprehensive data exist about race and ethnicity. Researchers calculate the gender gap in Wikipedia articles through a combination of Wikipedia and Wikidata.13 Much of the content in Wikidata comes from Wikipedia's categories, which are the metadata that appears at the bottom of articles. Nearly every biography has a gender, but as I show here, the majority lack an ethnicity on Wikipedia or Wikidata. While it seems that many of these uncategorized biographies are about white people, these biographies are not categorized by ethnicity, because Wikipedia policies significantly limit such categorization. The verifiability policies on both sites require all claims of ethnicity be verifiable (e.g., cited) by reliable sources, which do not exist in the case of biographies about white people.
Critical whiteness studies helps explain that authoritative sources that could attest to a person's whiteness often do not exist because whiteness is a “set of cultural practices that are usually unmarked and unnamed.”14 Particularly relevant to this analysis is Barbara J. Flagg's concept of the transparency phenomenon, which she defines as “the tendency of whites not to think about whiteness, or about norms, behaviors, experiences, or perspectives that are white-specific.”15 Whiteness is normalized and becomes an invisible default. For Wikipedia, the transparency phenomenon means more than just that whiteness is the default in US society; it means that, because whiteness is the default, it is rendered transparent, which is to say, unrecorded by history and thus unverifiable by reliable sources.
The ontology of whiteness behaves differently in the database than existing theories account for. Building on Flagg, Ian Haney López argues that “whites do not exist as a natural group, but only as a social and legal creation.”16 Analyzing US case law, Haney López shows how “the courts established not so much the parameters of whiteness as the non-whiteness of Chinese, South Asians, and so on.” Because whiteness was defined by courts deciding who was not white, “whites exist as category of people subject to a double negative: they are those who are not non-white.”17 This paradox of the double negative stymies any effort to measure a race and ethnicity gap on Wikipedia. You cannot measure a category of articles defined by not being categorized as nonwhite. This problem becomes even more wicked on Wikidata because you cannot enter a double negative into its database.
Without a source to verify someone's ethnicity, it cannot be included on Wikipedia or Wikidata. Thus, these biographies cannot be categorized as white because whiteness is ultimately unverifiable inside Wikipedia's dominant Western epistemologies. The inability to categorize biographies as white produces a critically incomplete data set. The vast majority of the biographies have no ethnicity data, leaving two false choices for any potential analysis. It is statistically fallacious either to analyze only the biographies that have ethnicity data, or to treat the uncategorized biographies as white. The unverifiability of whiteness prevents any comprehensive analysis of indigenous and historically nondominant ethnic groups, and yet in the process it reveals Wikipedia's whiteness.18
Race and Ethnicity in Its Many Global Contexts
The social and historical implications of race and ethnicity vary from country to country, and cannot be extricated from histories of colonialism.19 As I show here, questions or principles that may make sense in one national context often produce disagreement on Wikipedia and Wikidata because of these varying cultural norms and contexts and prevent surveying Wikipedia's editors about their race and ethnicity. Wikipedia is a global project, with over three hundred active Wikipedia language editions. This analysis focuses on English Wikipedia, but even English Wikipedia is a global project, with contributors across and beyond the Anglosphere.20 Wikidata is an open data repository that is interoperable with all Wikimedia projects, including all the Wikipedia language editions. Policy discussions across these projects usually involve editors from different continents negotiating guidelines from the perspectives of their many nations and cultures. Conflicts between these local cultural contexts and legal systems prevent any comprehensive demographic analysis of its editors.
The terms used to describe race or ethnicity vary significantly from country to country. The census category itself does not have a consistent name, varying among race, ethnic origin, nationality, ancestry, and indigenous, tribal, or aboriginal group.21 The specific terms also vary widely, and one country's terminology may seem nonsensical or outright offensive in another national context. For example, in Brazil, the census options for “race” are branca (white), parda (brown), preta (black), amarela (yellow), or indígena (indigenous); in Bulgaria, the “ethnic group” options are Bulgarian, Turkish, Gypsies, and Other; in Japan, the census form asks for “nationality,” offering two choices, Nihon (Japan) and Gaikoku (foreign country);22 and the United Kingdom offers eighteen choices for “ethnic group.”23
In the midst of writing this article, I experienced this difference firsthand during a call from the Camden London Borough Council, where I was living for the year. The caller asked a few survey questions about my satisfaction with the trash collection (I had no complaints) and some optional demographic questions. They did not ask any follow-up questions when I confirmed that yes, I was disabled; offered no comment or objection when I said my gender was nonbinary; but quickly and politely told me that “Jewish is not an ethnicity, it is a religion” and started reading choices from the list of eighteen recognized ethnic groups.24 Apparently unwilling to check off either “Any other White background” or “Any other ethnic group,” the caller and I agreed to disagree and skip the question. No wonder the bagels in London are terrible.
I am making a global argument that it is difficult to cohesively describe the related phenomena of race, ethnicity, and nationality on a global scale, and yet I still need language to make this argument.25 I am conscious that I am making it from a specific positionality, as described above. The terms people of color (POC) and Black, Indigenous, and people of color (BIPOC) are widely used in the United States, and Black, Asian, and minority ethnic (BAME) is used in the United Kingdom, but as global terms they are limited and reinscribe colonial power/knowledge relationships.26 Furthermore, none of these terms can encompass Basque or Catalan nationalism or describe the caste oppression of Dalits in India, nor can they account for formerly colonized countries where indigenous ethnic groups maintain asymmetric power relationships, all of which were subject to European colonial domination (whose traces may yet remain). In this article I use the term indigenous and historically nondominant ethnic groups.27 This language is not meant to be authoritative; rather, it is a provisional term to allow me to make this argument about the difficulty of describing or analyzing these groupings.
The laws and customs about racial and ethnic data also vary from country to country. In Europe collecting demographic data on race or ethnicity is controversial. When implemented, it primarily tracks country of origin, thus focusing on immigration and citizenship status. This is in keeping with the history of Europe's post-WWI ethnostates and the stateless people who fled them.28 While the United Kingdom, Netherlands, and Finland have comprehensive data collection programs, the majority do not.29 For example, French law explicitly forbids the collection of racial, ethnic, or religious data, in keeping with its republican constitutional law, which embraces equality as a founding principle, as manifest in its policy of laïcité (secularism).30 Though Germany does track some “migrant background” data, its history of Nazi eugenics and the Holocaust, as well as the earlier Herero and Namaqua genocide in Africa, creates a strong cultural prohibition against the use of ethnic data to identify people.31 Language compounds this historical sensitivity, as Rasse, the German word used to describe race, also means animal breed. Because the word refers to race as a biological essence it forecloses the ability to understand race and ethnicity as a social construct.32 For these reasons, it is culturally unacceptable to survey most Europeans about their ethnicity.
These considerations make it very challenging to design a global survey about race and ethnicity that successfully captures specific cultural meaning and allows for analysis at local and global scales. As a result, prior to this year, the Wikimedia Foundation has never included race or ethnicity in any of its community surveys. The community has discussed the difficulty of including such questions on its surveys at least as far back as 2013, though the Wikimedia Foundation has recently proposed a taxonomy that uses the category “cultural background” to assess contributors based on the broad spectrum of “political, religious, or ethnic background.”33 Ethnicity was included in the 2020–21 survey, but only for editors in the United States and the United Kingdom. As I was finishing this article, the Wikimedia Foundation had begun analyzing that data and revealed that less than 1 percent of editors in the United States are Black.34 The demographic data about the gender gap in Wikipedia's editors has come from these surveys, but no comprehensive survey data exist about what race and ethnicity gap may or may not exist among the entire English Wikimedia community.
The available data does show Wikipedia's geographical gap, because the surveys ask about geographical location, most content is geolocatable, and the Wikimedia Foundation has limited access to contributors’ IP address locations. From this data, the Wikimedia Foundation knows it has a marked underrepresentation of content about and contributors from the continent of Africa and from the larger global majority.35 Concurrently, African editors and groups have become leaders in the movement, demanding more support from the Wikimedia Foundation.36 While the foundation has increased the number of grants given to support African participation, these grants comprise only a small percentage of the overall community funding.37
Wikipedia's Guidelines for Categorization by Race and Ethnicity
Reading the policy debates over the guidelines for categorization by race and ethnicity reveals evidence of Eurocentric anti-Black bias. Wikipedia categories are metadata that live at the bottom of articles and cover an extraordinarily wide range of topics, from “AAAA-rated tourist attractions” to “ZZ Top members” (fig. 1). In its early years, Wikipedia did not have a consistent practice for categorization by race, ethnicity, or gender.38 As with all processes on Wikipedia, decisions about the practice were made through open conversations where editors ask questions and make arguments for or against a proposal. Editors are invited, but not required, to state whether they support, oppose, or remain neutral on the proposal. After a period of at least a week (though it can go on for months), an administrator will review the arguments and make a decision.39 This process is not a vote, and if the administrator finds both sides remain opposed and have presented compelling arguments, they can decide there was no consensus, which means no action will be taken, but a similar proposal could be put forward in the future.
In 2005 a group of editors began discussing if and how to create a set of guidelines for categorizing by gender, race, and sexuality.40 While these discussions covered gender, race, sexuality, and religion, my analysis focuses primarily on discussions regarding race and ethnicity, to show the transparent anti-Black hostility toward such categorization by the majority of participants in the discussion, as well as the schism between the North American and European editors.41
In this discussion the majority of editors opposed categorization, with clear breakdowns between European and American editors, as well as between the majority, who are likely cisgender heterosexual white men, and the few editors who self-identify as Black, gay, or women. Thirteen editors participated in the conversation; nine opposed categorizations by race or ethnicity, and four supported it.42 All six European editors opposed such categorization, with one offering very limited support that “inclusion of blacks/white/greens should depend on their noteworthy achievements. Everything else particular about their life or person should be included only, if it had proven impact on these achievements.”43
The arguments against categorization by race and ethnicity mostly fall into two categories: a belief that Wikipedia should be color-blind, and a resistance to universalizing US culture. The arguments that Wikipedia should be color-blind assert that including race or ethnicity information is itself a form of racism; that race is a social construct, and thus not real; or simply that it is not important or does not exist. As a European editor argues, “We should not do categorisation by ‘race.’ At the moment, some people are categorised by ‘race’ (Michael Jackson), while others are not (George W. Bush). It seems only ‘black’ people are categorised by ‘race.’ This is a racist bias. To remove the bias, either no, or all, people should be categorised by ‘race.’ ”44 Another European editor argues, “I can not imagine that anyone would seriously use ‘race’ for categorization of articles on the wikipedia. First of all ‘race’ isn't very clear, then there's all kind of mixtures, maybe not in the USA, but in other parts of the world different ‘races’ do interbreed (and produce fertile offspring), and last but not least, it is not very important to which ‘race’ somebody belongs.” This editor's use of the dehumanizing word interbreed and the casual reference to nineteenth-century eugenic theories of racial difference clearly displays the editor's racist point of view. That same editor later argues for a universal humanism that supersedes all categories in the most preposterous of ways, saying, “Another example is Bill Cosby. I fail to see why he should be categorized as an ‘African-American’ when it's a cultural category and not a race-based category. His TV-shows represent human culture, not African-American.” These Wikipedians’ insistence on removing race or ethnicity as a category enforces color-blindness—which is to say, it erases race. 
These editors use the framework of neutrality and the requirement for a neutral point of view to enforce a white supremacist point of view that prohibits the possibility of acknowledging race, reifying the neutrality of whiteness.
These same European editors argue against such categorization because they perceive it will universalize American culture. One editor writes that “US-culture is not world-culture. I will resist against pushing the US-view on the world as being the way the wikipedia should be categorized.” Another editor writes, “This issue seems mostly centered on America.” The geographic division is even present in the title of the discussion, which proposed an end to “Categorisation by race” using the British spelling with an s rather than a z. In these editors’ views, race is an American problem, and any attempt to acknowledge it is a form of American cultural imperialism.
The first substantive argument in support of categorization by race or ethnicity comes from one of two editors to take part in the discussion who identify themselves as African American, who writes,
I say categorize by nationality and culture, not race. Michael Jackson is an African-American, so he should be classified as such. Robert Blake is an Italian-American, and should be categorized as such. To the best of my knowledge, George W. Bush has no strong ethnic ties to another culture, so he is properly termed an American. It is not POV to identify an African-American person as an African-American; some articles read rather ridiculously (Rosa Parks, for example) without the mention.
He argues for categorization not by race but by “nationality and culture,” which is to say, by ethnicity.
The four editors strongly in favor of categorization by race or ethnicity are all American. Of these, two identify as African Americans, and one identifies as a gay man.45 This positionality schism among Wikipedia's editors is reflected in the other areas of this discussion thread, where LGBTQ+ and women editors speak for themselves or their communities in opposition to the implied white cisgender straight male majority.
While none of the implied white majority declares their own positionality (e.g., “As a white heterosexual cisgender man, I believe . . . ”), the language and tone speak clearly, including when an editor mentions “a good friend of mine who is ‘Black’ ”; putting terms like Black in scare quotes; the repeated use of third-person grammar, as in, “If their work is different, we should be able to have an article African-American social science, which we don't and probably couldn't”; frequent digressions into physiognomy; and the casual racism of dismissing the discussion topic as the “inclusion of blacks/white/greens.” The language and tone of these editors are frequently insensitive or offensive in a way that clearly indicates their own biases and unfamiliarity with the issues at play. Thus, the process of forming these guidelines itself articulates some of the contours of the race and ethnicity gap among Wikipedia's editors: the majority in the discussion are white, with unexamined implicit biases, who casually dismiss or shout down the voices of editors of color.
The two editors who identify as Black engage in a telling side-channel conversation on a separate talk page in a section titled “These folks done lost they minds.” One writes, “Until I came here, I never realized just how badly some people want to just whitewash everthing and smother it with a blanket of sameness. Now I know, and it's highly upsetting.” The other editor commiserated: “I hear ya . . . Wikipedia is an ignorant/ arrogant-white-male-dominated microcosm. They know what they think they know, and we, of course, couldn't possibly know jack.”46
Despite the numerical opposition, the arguments for such categorization were more persuasive to the two editors who summarized the discussions in January 2006. These new guidelines affirmed that “general categorization by race or sexuality is permitted” but required that “inclusion should be justifiable by external references.” They also emphasized that categorization should take place only when it is relevant to the subject's notability.47 These two restrictions both enable categorization of articles about people from indigenous and historically nondominant ethnic groups and prevent such categorization of biographies about white people.
Wikipedia's Categories and the Unverifiability of Whiteness
Wikipedia does categorize biographical articles based on the ethnicity of the subject, but this does not aid in determining what percentage of all biographies are about people from indigenous and nondominant ethnic groups, because the guidelines limit when they can be applied, with particular restrictions on categorizing majoritarian groups, and because they are designed for discovery, not sorting. In their current form, the guidelines for categorization by ethnicity, gender, religion, sexuality, or disability allow for such a categorization if it is “relevant to the topic,” a “defining characteristic . . . that reliable sources commonly and consistently define the subject as having”; and of course, such “inclusion must be based on reliable sources.”48 These requirements make it possible to categorize Jackie Robinson as an African American baseball player, because his ethnicity is relevant, defining, and verifiable in reliable sources, and also make it nearly impossible to categorize Branch Rickey by ethnicity, because it is not considered “relevant” given his majoritarian status, and it is never mentioned in reliable sources as defining the subject, despite the fact that Rickey was the team manager who hired Robinson to end baseball's racial segregation. Despite (or rather because of) the patently obvious fact that Rickey would have had to be white to be able to manage a racially segregated team, it is nearly impossible to locate a reliable source to verify the claim that Rickey had an ethnicity. In Wikipedia's white epistemology, whiteness is unverifiable.
Wikipedia's messy categories are unsuitable for sorting data because they were intended to be a finding aid. (Wikidata was created for sorting data, a topic addressed later.) Wikipedia categories were intended to help readers discover articles similar to the one they were reading, by following the category links to see other articles in those categories. For this reason, guidelines prohibit categories that are too big or too small.49 Even if it were verifiable, a category that might contain whiteness is discouraged because it is ineffective as a finding aid, as it would contain over a million articles.
Because categories were designed for discovery, the branches of the category tree further hinder any use for sorting. The pages in the category “People of African descent” and its subcategories include numerous incongruous results, not the least of which are the numerous nonbiographical articles (fig. 2).50 But the biographical articles have other anomalies, such as the inclusion of abolitionist Amos Bronson Alcott because his page is in the category “Underground Railroad people.” This is a result of the branching category structure:
• “People of African descent”
  ○ “North American people of African descent”
    ▪ “American people of African descent”
      • “African-American people”
        ○ “African-American refugees”
          ▪ “Underground Railroad”
            • “Underground Railroad people”
Though a data scientist might object to this taxonomy, it follows Wikipedia's intention to use categories as a tool for discovering related content. Furthermore, this category does not differentiate between people of indigenous African descent and descendants of European colonists in Africa. Albert Camus appears in the results because he was born in Algeria to French colonial parents. Charlize Theron also appears in the results, as she is South African from an Afrikaner family. Because the categories blur nationality and ethnicity, the data will not allow the precise measurement of people of indigenous African descent.
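The way a discovery-oriented tree produces such anomalies can be illustrated with a minimal sketch. It assumes a hand-copied slice of the branch quoted above and a single hypothetical page listing; the names `SUBCATEGORIES`, `PAGES`, and `pages_under` are illustrative only, and nothing here is fetched from Wikipedia. Walking every descendant category of “People of African descent” sweeps in any page filed seven levels down, whatever the subject's own ancestry.

```python
# Toy slice of the category branch quoted in the text (an assumption for
# illustration; the live category tree is far larger and changes over time).
SUBCATEGORIES = {
    "People of African descent": ["North American people of African descent"],
    "North American people of African descent": ["American people of African descent"],
    "American people of African descent": ["African-American people"],
    "African-American people": ["African-American refugees"],
    "African-American refugees": ["Underground Railroad"],
    "Underground Railroad": ["Underground Railroad people"],
}

# Hypothetical page membership, limited to the example named in the text.
PAGES = {
    "Underground Railroad people": ["Amos Bronson Alcott"],
}


def pages_under(root):
    """Return {page: category path} for every page in root or its descendants."""
    found = {}
    seen = set()
    stack = [(root, [root])]
    while stack:
        category, path = stack.pop()
        if category in seen:  # guard against cycles in the category graph
            continue
        seen.add(category)
        for page in PAGES.get(category, []):
            found.setdefault(page, path)
        for child in SUBCATEGORIES.get(category, []):
            stack.append((child, path + [child]))
    return found


result = pages_under("People of African descent")
for page, path in result.items():
    print(page, "<-", " > ".join(path))
```

Each link in the path is reasonable as a finding aid, yet the transitive result reads as a claim about ancestry that no single category asserts.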
Even if it were possible to accurately count the number of articles about people from the African diaspora, it would still not be possible to accurately analyze them, because it is not possible to establish what number to divide these articles by. There are 1.8 million biographical articles on Wikipedia,51 though there are only 550,000 that are categorized by ethnicity or nationality.52 The 1.25 million articles without any categorization defy countability. It is not correct to assume all the uncategorized are white and divide the number of articles about people in the “people of African descent” category by all 1.8 million articles, nor is it correct to divide that number by the 550,000 that are categorized by ethnicity or nationality, which assumes that the remaining uncategorized articles will follow the breakdown of the categorized articles.
While it may be slightly less complicated to determine the numbers with more nationally specific ethnicities, the problem of the uncategorized majority remains. Approaching the question from the national specificity of the United States, it is not correct to divide the 31,834 articles in the category “African-American people” by the 245,289 total articles in the category “American people by ethnic or national origin” or the 984,670 total in the category “American people.”53 Assuming that all articles without an ethnicity are white gives us 3.2 percent of biographies about African American people, a very low percentage.54 Ignoring these uncategorized articles results in an improbable 13 percent.
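The two percentages can be reproduced with the counts reported above. The short Python sketch below only restates that arithmetic; the variable names are mine, and neither result should be read as a valid measurement, since each depends on an assumption about the uncategorized articles that the data cannot support.

```python
# Counts as reported in the text for English Wikipedia categories.
african_american_people = 31_834
american_by_ethnic_or_national_origin = 245_289
american_people = 984_670


def percent(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return 100 * numerator / denominator


# Assumes every article without an ethnicity category is about a white person.
all_uncategorized_white = percent(african_american_people, american_people)

# Assumes the uncategorized articles mirror the categorized ones.
uncategorized_ignored = percent(
    african_american_people, american_by_ethnic_or_national_origin
)

print(f"Uncategorized treated as white: {all_uncategorized_white:.1f} percent")
print(f"Uncategorized ignored: {uncategorized_ignored:.1f} percent")
```

The same numerator yields two results that differ by a factor of four, depending solely on which unverifiable assumption sets the denominator.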
To negotiate such data, social scientists sometimes call upon markedness, a linguistic theory developed by Roman Jakobson.55 Markedness is the condition of being different from a dominant or general form. For example, actress is marked, whereas actor is unmarked, as it can refer to either form.56 This methodology uses the markedness of one variable as a predictor of the other variable; in this case, the markedness of articles categorized as racialized would be used as a predictor of the unmarked transparent whiteness.57
This approach is less complicated for gender than for race and ethnicity.58 Race and ethnicity produce a more complicated scenario, not because either is any more or less real or constructed but because absences in who is counted and how they are counted have historical and political ramifications, and because of how these data function on Wikipedia at scale.59 One of the studies that have attempted to assess Wikipedia's race and ethnicity gap used markedness to define the white people in the data set. The authors of this study acknowledged that, while this approach “is not without problems of reification and misclassification,” it was justifiable because their data set included only 2,978 items, and they were able to manually code each entry.60 As data scales up beyond human evaluation, this assumption becomes less viable. Furthermore, Wikipedia and Wikidata are always in the process of being built. Over 1 million of the 1.8 million biographies are assessed as a “stub,” the lowest level of development that “provides very little meaningful content.” Another 600,000 are assessed as a “start,” an “article that is developing but still quite incomplete.” Fewer than 140,000 (i.e., less than 8 percent) of all the articles are “mostly complete and without major problems.” Thus, the absence of an ethnic group category or property might simply mean that the article or item has not been developed beyond the most basic stages. Given these considerations, the 1.8 million biographies in English Wikipedia seem like too many to make assumptions about their markedness.
The data that comes from analyzing these categories is doubly meaningless. The categorization structure prevents a precise measurement of the number of articles about people by race or ethnicity, and the unverifiability of whiteness prevents certainty about the total number of articles to compare that number to. Speaking mathematically, it is not possible to confidently measure either the numerator or the denominator to calculate a percentage.
Wikidata's Null Set: Whiteness as an Undefined Variable
The problem of the unverifiability of whiteness present on Wikipedia becomes even more manifest on Wikidata because whiteness functions differently in the database than existing theories of whiteness account for. Many analyses of Wikipedia use Wikidata, because its linked open data structure is built for computation, allowing researchers to parse large amounts of information quickly and precisely. Growing rapidly, it currently contains over 90 million items, fifteen times more than English Wikipedia's 6 million articles.61 All Wikipedia articles have Wikidata items, and Wikidata meshes with all language versions of Wikipedia, allowing for cross-language comparisons and analyses of content gaps. As discussed earlier, Wikidata occupies a growing position in the political economy of data; for example, Siri and Alexa find their answers among its structured data. Other important databases use Wikidata as the central reference to sync their data with one another.
As with Wikipedia, Wikidata items are built over time, gaining information with each edit. Each entry typically starts out with a name, such as “Rose Park” or “Rosa Parks,” and a statement of the type of data it is an instance of, which in this case are “suburb” (in Adelaide, South Australia) and “human,” respectively. Some of the most common statements added to the entries for humans are the properties sex or gender, country of citizenship, date of birth, and occupation. Ethnic group is not one of the most common properties. These properties correspond to Wikipedia's categories, and much of this content has been migrated from these categories. These items are always in the process of being built and thus always inherently incomplete. In fact, many items consist of only the name and type of data, because they are obscure and no one has bothered to develop them. Thus, the absence of an assigned ethnic group statement could be an instance of the transparency phenomenon (only nonwhite people have their race or ethnicity property defined), but it could also be because the entry is simply underdeveloped. Put another way, the transparency function behaves differently in the database because data requires affirmative assignment. If, as Haney López argues, whiteness is defined by a double negative (e.g., “those who are not nonwhite”), a database cannot assign a double negative value.
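The ambiguity of an absent statement can be made concrete with a small Python sketch. The three records below are hypothetical and greatly simplified (real Wikidata items use property and item identifiers rather than plain labels), and the function name `classify` is my own.

```python
from collections import Counter

# Hypothetical, simplified items; labels stand in for Wikidata identifiers.
items = [
    {
        "label": "hypothetical person A",
        "instance of": "human",
        "ethnic group": ["African Americans"],
    },
    {
        # A developed item that simply has no ethnic group statement.
        "label": "hypothetical person B",
        "instance of": "human",
        "occupation": "politician",
        "date of birth": "1900-01-01",
    },
    {
        # A bare item: name and type only.
        "label": "hypothetical person C",
        "instance of": "human",
    },
]


def classify(item):
    """Report only what the data affirmatively states about ethnic group."""
    if item.get("ethnic group"):
        return "marked"
    # Absence cannot be read as whiteness: it may reflect the transparency
    # phenomenon, or it may reflect an item nobody has developed yet.
    return "unknown"


counts = Counter(classify(item) for item in items)
print(counts)
```

Persons B and C land in the same bucket even though their items are absent for different reasons, and no value exists that would let an editor record “not nonwhite.”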
Each item in the database is referred to by a unique identifier called a QID, which is a number preceded by a Q. For example, the universe is Q1, Earth is Q2, life is Q3, death is Q4, and so on. The items for President Obama (Q76) and President Trump (Q22686) are among the more developed. The item for Obama (fig. 3) has four different values for the ethnic group property, while the item for Trump (fig. 4) has none, despite extensive discussion of his whiteness.62 Furthermore, three of the seven discussion threads on the Obama talk page deal with ethnicity, while Trump's ethnicity is discussed only implicitly, in someone's trollish effort to add “instance of racist” to this item.63 Other items for people who are historically associated with whiteness and yet have no ethnicity include George Washington (Q23), Elizabeth II (Q9682), Cecil John Rhodes (Q19825), Jefferson Davis (Q162269), David Duke (Q163042), and Francis Galton (Q191026), among many more.
As on Wikipedia, the creation of the ethnic group property was considered controversial, and its use requires a citation. Shortly after its creation, a German user started a discussion titled “Controversial” with the following statement:
Ethnic group is not a well defined property. We should delete it. To the examples: Bloomberg is US-citizen and his religion is Judaism. That's it. His family has roots in russia and poland. What ethnic group is he in? The jewish-russian-poland-living-in-US-group?
Guliani: Why is he an ethnical italian? Because of his name??? I would not say that Guliani has more in common with somebody living in Italy than me. Ethnical group is a minefield.64
Because Wikidata is a multilingual project, serving Wikipedias in more than three hundred languages, its editor base is even more international than that of English Wikipedia, increasing the geographic division in the policy discussion described above. Other European editors echoed this editor's concerns, and the consensus led to the requirement that any use have at least one citation. As a result, only 46,033 (or 2 percent) of the 1.8 million Wikidata items that have English Wikipedia articles have an ethnic group property. Periodically, European editors will “clean up” unsourced ethnic group claims, removing them en masse, hundreds or thousands at a time.65 While the term clean up is used elsewhere on Wikidata, in this context its relationship to the language of genocide makes this removal feel a bit more like an ethnic purge.
Given this context, it is not surprising that the unverifiability of whiteness is even more pronounced on Wikidata. As shown in figure 5, of the 46,033 Wikidata items that have English Wikipedia articles and have an ethnic group property, 17,546 are for African Americans. Very little can be extrapolated from this about the race and ethnicity gap on English Wikipedia.66 It is obviously incorrect to divide the number of items with the “African Americans” ethnic group property by the total number of items with an ethnic group property and claim that 38 percent of all articles on Wikipedia are about African Americans, and even more patently incorrect to divide that number by the 20,110 Americans with an ethnic group property and claim that 87 percent of articles about Americans are about African Americans.67 But it is also incorrect to divide these items by the previously established 984,670 articles about Americans and claim that articles about African Americans comprise 1.8 percent of all articles about Americans, or to divide by all 1.8 million biographies and claim that 0.9 percent of the biographies are about African Americans. The most an analysis can do is claim that at least 1.8 percent of the articles about Americans are about African Americans. Though the analysis of Wikipedia categories showed a larger percentage, either is quite small.
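The first three of these divisions can be restated in Python using the counts given above. As before, this is a sketch of the arithmetic only, with variable names of my own choosing; the point is that the first two results are artifacts of how few items carry the property, and the third is a lower bound rather than a measurement.

```python
# Counts as reported in the text for Wikidata items with English Wikipedia articles.
african_american_items = 17_546
items_with_ethnic_group = 46_033
american_items_with_ethnic_group = 20_110
articles_about_americans = 984_670


def percent(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return 100 * numerator / denominator


share_of_all_marked = percent(african_american_items, items_with_ethnic_group)
share_of_marked_americans = percent(
    african_american_items, american_items_with_ethnic_group
)
floor_among_americans = percent(african_american_items, articles_about_americans)

print(f"Share of all items with an ethnic group: {share_of_all_marked:.0f} percent")
print(f"Share of American items with an ethnic group: {share_of_marked_americans:.0f} percent")
print(f"Lower bound among articles about Americans: {floor_among_americans:.1f} percent")
```

Only the last figure supports any claim at all, and only in the form “at least”: every item missing the property could in principle belong in the numerator.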
Conclusion: Whiteness in the Database
A comprehensive analysis of English Wikipedia's race and ethnicity gap would answer two questions: (1) What percentage of Wikipedia's editors are from indigenous and nondominant ethnic groups? (2) What percentage of Wikipedia's biographies are about people from indigenous and nondominant ethnic groups? As I have shown, multiple data points suggest the possibility of a significant race and ethnicity gap, yet it is not possible at this time to answer either of these questions because of cultural norms, limitations of the data, and the unverifiability of whiteness.
Cultural norms prevent the Wikimedia Foundation from conducting a comprehensive survey of the community's race or ethnicity, but all signs point to a significant gap among the editors of English Wikipedia. From a qualitative viewpoint, the actions of the editors debating the policy for categorization by race and ethnicity reveal their own white positionality and anti-Black racism. This close reading of the policy discussions does provide a rough sense of just how white Wikipedia is.
Limitations in the race and ethnicity data on Wikipedia and Wikidata prevent a comprehensive count of articles about people from indigenous and nondominant ethnic groups. The data on Wikipedia is not reliable, first, because the category structure is intended as a finding aid, not a sorting mechanism; thus, category trees will often include branches that diverge from the parent category in ways that include erroneous entries. Second, these results group together people of indigenous origins and colonists in the same categories, eliding the legacy of colonialism and its impact on the power relationships that construct race and ethnicity. Nor is the data on Wikidata reliable, because its more international editor base mandates significantly higher expectations for verifiability for ethnic group claims than Wikipedia does, resulting in so few data that no accurate count is possible. Thus, it is not possible to determine the numerator.
Lastly, the unverifiability of whiteness prevents a comprehensive count of articles about people from dominant and nonindigenous ethnic groups. Because whiteness is a transparent unmentioned default, no sources exist to verify categorization by race or ethnicity for most dominant and nonindigenous (e.g., white) subjects. Thus, in Wikipedia's white ontology, these articles cannot be categorized. Because Wikipedia articles and Wikidata items are mostly in an incomplete state of development, it is not possible to make calculations based on the 1.25 million Wikipedia articles or the 1.75 million Wikidata items without any race or ethnicity metadata. Thus, it is not possible to determine the denominator.
Without a numerator or a denominator, the data reveal little knowledge. Any attempt to perform large-scale comprehensive quantitative analysis on the ethnicity data will reproduce this bad math. But it is important to show this bad math to help theorists think about how these concepts/categories, which they have extensively theorized, shift as they are quantified in database structures, and to help data-centric researchers understand that these categories they compute in their research are not as stable or transparent as they may believe.
While it is possible to perform a small-scale comparative analysis between small samples, as recent reports have done, researchers will continue to query Wikidata because it is designed to make these queries so easy, and the data can be updated in real time. While Wikidata queries reveal meaningful data about gender, at present they produce deceptively meaningless data about race and ethnicity.68 But ultimately, none of these calculations are accurate because whiteness functions differently in database logic than in legal or semiotic contexts. The difference is subtle, but important.
The transparency phenomenon that Flagg articulates describes a world where whiteness is the default, is normalized, and goes without mentioning. As Haney López has made clear, the category of whiteness is constructed from a double negative: the process of naming nonwhiteness. Thus, in legal or semiotic contexts, persons are considered white as long as they are not described as nonwhite.
In the database, in contrast, because data requires affirmative assignment—a database cannot assign a value of a double negative—the transparency function behaves differently. The absence of an ethnic group value cannot be transparently read to indicate whiteness, as it could also be understood as absent data. In the database whiteness is an undefined variable, uncountable and unaccountable, which produces an incomplete and incompletable data set that corrupts the ability to calculate its relationship to its Other. The transparency phenomenon becomes a null set when turned into data. And you cannot divide by zero.
Notes
Adams, Brückner, and Naslund, “Who Counts as a Notable Sociologist on Wikipedia?”; Field, Park, and Tsvetkov, “Controlled Analyses of Social Biases”; Bjork-James, “New Maps for an Inclusive Wikipedia.” For comparison, a Google Scholar search for “gender gap” AND “Wikipedia” returned 6,280 results. A search for “ethnicity gap” AND “Wikipedia” returned no results that included both terms, as did a search for “race gap” AND “Wikipedia.”
This is a paraphrase of something Art+Feminism has long argued. Art+Feminism is a global project that builds a community of activists committed to closing information gaps related to gender, feminism, and the arts, beginning with Wikipedia. I am one of the cofounders, and my thinking on these topics has been formed working and talking with my collaborators, including Mohammed Sadat Abdulai, Alice Backer, Amber Berson, Siân Evans, Heather Hart, Richard Knipel, Jacqueline Mabey, McKensie Mack, Melissa Tamani Becerra, Kira Wisniewski, and Nina Yeboah. See https://artandfeminism.org.
Wikidata Query Service, https://w.wiki/4evD.
For example, Wikipedia has a well-organized collection of twenty pages that cover multiple aspects of the gender gap in society, but no organized group of pages exists to cover the race and ethnicity gap in society. Instead, there are three isolated pages: two newer articles on the United States titled “Racial pay gap” and “Racial achievement gap,” and a long-standing article titled “Racial gaps in intelligence,” which redirects to “Race and intelligence.” The contrast in quantity and subject matter is stark. Originating in 2002 as a page paraphrasing Richard J. Herrnstein and Charles Murray's Bell Curve, “Race and intelligence” has over one hundred archives on its Talk Page where experienced editors that range “from far out disruptive racists to merely ignorant, ill-informed ‘expert in everything’ engineer-types, who missed the forest for the trees” push arguments redolent with the stench of scientific racism (Wikipedia, “Wikipedia:Arbitration”). These debates boiled over into an arbitration case that ran over 22,000 words long. It is these policy-level metadiscussions that define the terrain on which the content is created. Wikipedia, “Gender Gap”; Wikipedia, “Racial Gaps in Intelligence”; Wikipedia, “Wikipedia:Arbitration.” To avoid repetition and unnecessary detail, I have used Wikipedia or Wikidata as author, preserved all spellings in context, and provided stable links to specific revision IDs.
Delgado and Stefancic, Critical White Studies; Frankenberg, White Women, Race Matters; Dyer, White.
This article seeks to answer if Wikipedia has a race and ethnicity gap in its content and its contributors. Why Wikipedia might have such a gap is a separate question, the answer to which shares many of the factors surfaced in gender gap analyses but intersects with other forms of oppression. Wikipedia's guidelines on notability and reliable sources constitute a possessive investment in whiteness, in which the value of whiteness as property is reified. These guidelines place higher value on the types of sources that have historically placed higher value on whiteness: other encyclopedias, newspapers of record like the New York Times, and peer-reviewed articles and books that have historically been written by, for, and about white people. For example, at the Schomburg Center's 2015 Black Lives Matter edit-a-thon I helped a new editor, a Black woman, create an article for the Harlem Book Fair. Our research on the fair highlighted how Wikipedia's guidelines reproduce structural racism and sexism. Looking for a page to model the draft on, we turned to the Brooklyn Book Festival. Comparing the reliable sources available for both fairs revealed the structural biases baked into the sources, which in turn are incorporated into Wikipedia by virtue of its investment in these structures. While the Harlem Book Fair is older and of equivalent size to the Brooklyn Book Festival, it has been mentioned in the New York Times only four times, whereas a search for the Brooklyn Book Festival turns up eighty-six results. While the Amsterdam News has sixteen stories that mention the fair, this source carries far less weight than the New York Times in the subjective calculations that determine whether an article will be included. Thankfully for the Harlem Book Fair, this sourcing was enough to establish notability, but for other articles these structural biases lead to exclusion. The end result is a kind of information suburb whose proxy racial covenant requires the approval of white gatekeepers for admittance. See Harris, “Whiteness as Property”; Lipsitz, “Possessive Investment in Whiteness”; Roediger, Wages of Whiteness; and Katznelson, When Affirmative Action Was White.
For example, people of African descent have different contextual relationships to majority or minority status in the United States, Haiti, Ghana, and South Africa and a concomitant historical relationship to power and autonomy.
As a case in point, this is the first reliable source to explicitly state my ethnicity. You can find traces of it in my past artistic work, and maybe in some tweets, but none of that would be considered acceptable verification on Wikipedia or Wikidata.
I am grateful to Tilman Bayer, Marc Miquel-Ribé, Eliza Myrie, and Kira Wisniewski for helping me think through this linguistic challenge.
The adverb historically clarifies that dominant refers to power relations rather than to the total number of people, placing power relations in the historical context of European imperialism (which can include Basque nationalism). This is a version of the language Art+Feminism has arrived at, after thinking through this problem for some time with organizers from many different countries and cultural contexts.
Bayer, “African Americans Are Vastly Underrepresented.” I am grateful to Tilman Bayer for sharing his insights on the Wikipedia datasphere.
Nartley and Ndubane, “Letter to Katherine.” I am grateful to Kelly Foster for helping me understand the Wikimedia Foundation's relationship to the African Wikimedia community.
Wikimedia Foundation, “Grantmaking Report . . . 2019−2020”; Wikimedia Foundation, “Grantmaking Report . . . 2020−2021.”
For a timeline and all quotations from discussions, see Mandiberg, “User:Theredproject.”
An administrator is an experienced editor who has gone through a community peer review that grants access to the software tools to fulfill administrative responsibilities.
Wikipedia, “Wikipedia Talk.” This formal discussion followed from a series of isolated decisions regarding categorization by race, ethnicity, and gender in 2004–5, many of which reached no consensus. In 2004 the community made its first decision about categorizing by race and ethnicity, agreeing to keep the category “Jews.” In June 2005 an editor nominated the entire category “People by race/ethnicity” and all of its subcategories (including “Jews”) for deletion, arguing, “It's racist to make a subdivision on this one.” Though the majority of editors supported deleting the category, in the end it was closed as no consensus; simultaneously, editors discussed deleting the categories “Women scientists,” “Women biologists,” “Women chemists,” “Women mathematicians,” and “Women physicists” on the grounds that these categories “enforce gender inequality” but also reached no consensus. A month later an editor successfully nominated the categories “White people” and “Mulattos” for deletion, asking, “Do we really have to categorize people by race? What purpose does it serve?” Notably, that editor did not include the category “Black people” in the nomination.
In this paper I use the term anti-Blackness to describe racism directed toward Black people, people of African descent, or people with darker skin in all societies, including nonwhite ones, following the lead of Art+Feminism. See Art+Feminism, “Art+Feminism User Group Anti-racism Policy.”
Three other editors made procedural comments but did not state a position and are not included here.
These editors’ user pages indicate they are from the Netherlands, France, Germany, and the United Kingdom.
I do not mention individual editors by username, as this analysis is about a pervasive culture.
Additionally, one American opposed, and one offered limited support. Two editors from unknown locations opposed, with one of these offering very limited support and suggesting such categorization should be used only when someone is a “race activist.”
In the following years, after the repeated deletion of the categories “White people,” “Black people,” and “Multiracial people,” several discussions achieved consensus to shift the categorization from race to ethnicity. Two sticking points were vigorous disagreements about whether to categorize people of more than one race/ethnicity, and the use of nuanced terms for transnational diasporas that differed from nation to nation. These conversations emphasized the importance of upholding the article subject's own self-identification. With established practice for categorizing based on ethnicity, and not race, most recent discussions have addressed continental versus national origins and their relationship to race and ethnicity. These include a discussion of nationality versus ethnicity and whether these are proxy for race, as well as preventing the generalized categorization of someone by continent of origin. Several discussions have further articulated the reason for not categorizing by majority groups, including a third deletion of the category “White people.”
It is possible to remove these from the results, which yields 54,819 biographical articles. Wikipedia, “PetScan, People of African Descent.”
This is likely an undermeasurement, but by how much we can't know.
As another example, in the case of cosmetic and hair care products, products for women are unmarked, while those for men explicitly say “for men.”
When counting gender in a historical context, women are consistently marked; though they are subject to forms of erasure, trans, nonbinary, intersex, agender, and other nonconforming genders are also marked. In the context of an encyclopedia, men are not marked. Additionally, gender can be cross-referenced by evaluating pronouns (in the process, again, erasing trans and nonbinary people). Furthermore, the compulsion to gender people crosses cultures, and even in the cultures that name and recognize genders beyond the binary, these data points do not complicate the unmarkedness of men in this equation.
The Wikipedia Diversity Observatory website is one of the most advanced tools for analyzing Wikipedia and Wikidata. Alongside its comprehensive data about gender and other gaps, it presents a version of this flawed data, showing the number of articles for people with an ethnic group Wikidata property (https://wdo.wmcloud.org/ethnic_groups_gap/). While the Wikipedia Diversity Observatory does not calculate a percentage, the ratios are implied visually by presenting this data in graphic form as a stacked bar chart, which prompts viewers to infer a comparison of ratios. Because approximately 16,000 items from ethnic groups with fewer than 380 items are cut off, the stacked bar chart creates the false appearance that articles about African Americans are the majority of all articles.
It is likely that the disproportionate number of items is a result of the Wikidata Game, which gamified the addition of information to Wikidata. A popular channel of the game was configured to add values for the African American ethnic group property; see Manske, “Wikidata.”
As evidenced by the Wikipedia Diversity Observatory, discussed in n. 66.