Abstract

Despite computer vision's extensive mobilization of cameras, photographers, and viewing subjects, photography's place in machine vision remains undertheorized. This article illuminates an operative theory of photography that exists in a latent form, embedded in the tools, practices, and discourses of machine vision research and enabling the methodological imperatives of dataset production. Focusing on the development of the canonical object recognition dataset ImageNet, the article analyzes how the dataset pipeline translates the radical polysemy of the photographic image into a stable and transparent form of data that can be portrayed as a proxy of human vision. Reflecting on the prominence of the photographic snapshot in machine vision discourse, the article traces the path that made this popular cultural practice amenable to the dataset. Following the evolution from nineteenth-century scientific photography to the acquisition of massive sets of online photos, the article shows how dataset creators inherit and transform a form of “instrumental realism,” a photographic enterprise that aims to establish a generalized look from contingent instances in the pursuit of statistical truth. The article concludes with a reflection on how the latent photographic theory of machine vision we have advanced relates to the large image models built for generative AI today.

In 2015, leading Stanford University computer scientist Fei-Fei Li appeared in a TED talk in which she outlined the challenge faced by those engaged in “teaching computers to understand pictures” (Fei-Fei 2015). Offering a celebratory public account of the experimental development of machine vision, Li's talk centers on her creation of the object recognition dataset ImageNet, which became a pivotal benchmark and breakthrough in the shift toward data-driven machine learning. To emphasize the scale and complexity of this task, Li begins by comparing the visual capacity of machines to that of children, who must learn to recognize common objects in order to navigate their environment. Continuing her analogy, Li conflates the fleshy apparatus of the human eye with the machinery of the camera lens: if we consider the child's eyes as “biological cameras” that capture “one picture about every two hundred milliseconds,” she says, visual experience can be understood as the accumulation of “training examples” in the form of photographs. Because the child is positioned by Li as a naive and innocent subject, the object of their perception can be modeled as self-evident to the senses. In presenting visual data as, by definition, theory-free, and assimilating the camera into the process of human seeing, Li's account translates the photograph into visual information and naturalizes it as a measure of vision.

In the years following the 2009 release of ImageNet, this understanding of photography has largely persisted in machine vision discourse. Photographs are understood to function as self-standing documents free from the contexts from which they emerged, and they are positioned as straightforward visual “samples” of the real world. In the literature on dataset production, “photograph” and “real-world scene” are often used interchangeably.1 Absent is an explicit understanding of photography as a versatile technology put to work in the service of art, science, journalism, and historiography, let alone as a complex sociocultural practice with implications for epistemology, ideology, and ontology. This understanding of the photograph as an uncomplicated window onto the real world is problematic in part because it has tended to coincide with a host of biases that occur when research assumes that snapshots provide a functional proxy for the world “out there.”2 Such “datafication” of photography also obscures the various ways in which computer vision reshapes the visual world through practices of classification and surveillance3 and the critical role of machine vision researchers in shaping contemporary visuality.

In what follows, we argue that machine vision and photography have become so entwined that researchers need to acknowledge photography as a significant epistemic agent for their field. To put this differently, despite the field's extensive mobilization of cameras, photographers, and viewing subjects, photography's place in machine vision remains undertheorized; or, as we propose, there is an operative theory of photography in machine vision, but in a latent form, embedded in tools, practices, and discourses that enable the methodological imperatives and routines we describe. By analyzing data practices, we illuminate this latent theory, demonstrating the tendency of machine vision researchers to conflate photography with an objectifiable world, operationalized as the “ground truth” for reality.4

From this basis, we contribute an analysis of the epistemic configuration of machine learning by directing attention to the photographic cultures that researchers encounter and appropriate. While rarely described as such, machine vision has historically relied on an array of photographic practices (e.g., composing, capturing, labeling, and categorizing photographic images) and has engineered complex curatorial pipelines5 that translate the labor of millions of photographers and perceiving subjects into datasets: ImageNet alone includes fourteen million images that were viewed and annotated by twenty-five thousand Amazon “Mechanical Turkers” (Fei-Fei 2010).6 From this perspective, scraping a photo from the web into a dataset, annotating it, and binding it to a taxonomy is a significant operation with far-reaching cultural and epistemic consequences, not a mere practicality.

In focusing on the experimental practices of Li and her colleagues, we observe how the dataset pipeline is designed to translate the radical polysemy of the photographic image into a stable and transparent form of data that can be portrayed as a neutral or “representative” proxy of human vision. As we show, the experimental practices of Li and her colleagues were instrumental in developing modes of human perception amenable to the efficient production of datasets, particularly a model of the perceiving subject who would encounter photography as a “stimulus” for a taxonomy. Dataset curation of this kind functions to stabilize the photographic image by eliding questions about the material conditions and social relations that enable this onto-epistemic flattening. Thus, while the question of biased systems is an urgent social priority—not least when the biases in question create harms for people and the world—we suggest that bias is a symptom of how “photography” operates as an onto-epistemic force of alignment that can be operationalized as a “normal” or self-evident view of the world, requiring the least perceptual effort to decode. Moreover, when these assumptions about “photography” are paired with the sociometrics of the search engine and the scale of the web, they offer a false promise of democratic consensus. In this way, machine vision researchers have relied upon and activated an unacknowledged—a latent—theory of photography.

How the Photograph Came to Stand for Reality: The Snapshot Enters the Dataset

The tendency of machine vision to regard photographic images as ostensible proxies for the natural world has a long history in the sciences. As Lorraine Daston and Peter Galison (1992: 120) have documented, nineteenth-century scientists regarded the camera as a rational and objective means of generating pictures “uncontaminated by interpretation”; the camera's mechanical automatism promised an image that was “miraculously free from the inner temptation to theorize, anthropomorphize, beautify, or otherwise interpret nature.” Crucially, such “mechanical objectivity” was a product not simply of photography's verisimilitude (secured by the imprint of light on a chemically sensitive substrate) but also of the camera's supposed capacity to bypass the subjectivity of the operator (Daston and Galison 2007: 187). Having bound photography to a framework of optics and light, on the one hand, and automation, on the other, science could mobilize visual representation as objective knowledge, relegating the photographer to the role of naive operator, a functionary of the camera.

As photographer and critical theorist Allan Sekula (1984: 56) argues, this scientific appropriation of photography occluded the social processes that undergird the photographic capture of the world. Within this paradigm, “the camera serves to ideologically naturalize the eye of the observer” and is reduced to “an engine of fact” whose task is to produce “a duplicate world of fetishized appearances, independent of human practice.” The same naturalization, we contend, underlies the imagery that Li and her colleagues chose in the design of the Stanford Vision Lab logo (fig. 1). The logo adapts The False Mirror (fig. 2), a 1928 painting by Belgian surrealist René Magritte, an artist known for challenging the transparency of vision. Magritte's work depicts the outline of a gigantic eye against a luminous sky that also occupies the space of the iris, creating an ambiguous perceptual effect. Viewers of The False Mirror may wonder whether their gaze is being met by the eye, or whether they are being surveyed by a translucent signifier. Such nuance is lost in the painting's assimilation into Stanford's logo, which crops and disembodies the eye and replaces the retina with a camera lens, a gesture that seeks to restore the pellucidity of vision.

If the iris can be considered a camera lens, then the world can be grasped as a collection of photographs. One consequence of this conflation of world and photograph is that the sourcing of photographic images became a practical and epistemological problem for engineers. Researchers historically acquired imagery in-house, photographing colleagues or commissioning photo shoots. Another canonical source of imagery was Corel's release of eight hundred photo CDs containing stock photography (fig. 3).7 However, this dependence on professional photography raised concerns because the cultural conventions of such images (smiling faces, centered subjects, careful lighting) were perceived as sources of “capture bias” that would limit the dataset's representation of statistically diverse objects and scenes, affecting the model's accuracy. Dataset creators thus required a form of authentic “real-world photography” that originated from outside the lab—a problem that the internet appeared to solve. With the advent of popular platforms for online image sharing, computer scientists turned to the “World ‘Wild’ Web” as a “freely available” source of unconstrained imagery that promised “a dense sampling of the visual” (Torralba, Fergus, and Freeman 2008).8 Key datasets such as ImageNet, Pascal VOC (2012), COCO (2014), and YFCC100M (2016) collected a significant portion, if not the entirety, of their contents from the photo-sharing site Flickr, making amateur photography a defining trait of machine vision's photographic culture.9

To understand this privileging of the amateur photograph, one must recognize that machine vision dataset discourse—in contrast, say, to media studies—does not regard the “real” as a contested and political condition of photographic ontology. Rather, in such discourse realism operates on an axis of authenticity wherein bias—a key term to which we will return—is a product of human cognition, intervention, and control. The photographic genre most amenable to this ideal was the “snapshot.” Emerging in the nineteenth century, the “snapshooter” was a particular kind of photographer, a naive camera operator and enthusiast who used simplified roll-film cameras to document family life and leisure time. These amateur photographers were largely invented by camera manufacturers such as Kodak, who encouraged a “snapshot view” of the world, promoting Instamatic cameras that promised to collapse the distinction between eye, camera, and world (fig. 4).

The snapshot's excessive ordinariness and claim to authenticity offered a store of “natural” imagery of obvious utility to machine vision. As Fei-Fei et al. (2007: 2) argued, amateur photography represents the world as it is “commonly seen” by “most people.” If, as Daston and Galison (2007: 187) suggest, the ideal for mechanical objectivity is “non-intervention,” it is hardly surprising that Flickr snapshots gradually became the favored genre of uploaded photographic production for use in datasets. Understood as the product of a spontaneous and distracted engagement with the world, the snapshot extends the automatism of the camera to its operator, who, as an amateur, lacks the expertise to intervene in “real-world” image capture. Reinforcing the notion of a “common” view, images found on the internet could be imagined as authentic, random, and neutral.10

In keeping with this framing of the snapshot as neutral data, unmediated by culture or human intervention, dataset creators rarely acknowledge the contribution of photographers or their communities, and the scientific literature largely ignores them. Even as photo-sharing communities have become the main providers of “ground truth” for machine vision (as well as a source of data for large image models such as DALL·E 2), photographers are largely unaware of their contribution to the field.11 If the photographer's task is to produce objective “data” for a machine vision pipeline—without any complicating social interaction or history—then the question of creative authorship can be sidelined entirely. A telling exception to this tendency is Andy E. Nystrom, a Canadian photographer who has shared a staggering 6.4 million images on Flickr since 2008 (fig. 5). Thomee et al. (2016), whose YFCC100M dataset scraped 200,000 of Nystrom's images, cite him in their paper, reproducing his snapshot of downtown Toronto as a canonical example of a “real-world scene.” Setting his camera to shoot continuously as he performs his daily life, Nystrom rejects the cultural and technical conventions of photography, reproducing a feed of blurry, banal, and habitual images—precisely the kind of “real-world photography” (2016: 5) that machine vision researchers want to operationalize.

Another claim for the supposed neutrality of snapshots rests on their ubiquitous circulation on social media platforms. Although Li and her colleagues recognized that “everyone has a bias when taking a picture,” they “believe[d] that the large number of images from different unknown sources would help average out these biases” (Fei-Fei et al. 2007: 5). Flickr's endless stream of birthday parties, cats, holidays, and sunsets could thus be upheld as a kind of social consensus: through the repetition of commonplaces and stereotypes shared on the internet, snapshot photography proffered an agreed-upon representation of the world.12 To harness this standing reserve, Li's team deployed scrapers and search engines to automate the process of locating millions of photos.13 The use of search engines provided yet another claim for neutrality: algorithmic ranking would help to ensure that the most popular and socially relevant images would make their way into the pipeline, without the conscious intervention of the researcher.14
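
To make this automated sourcing concrete, the following sketch is our illustration (not the ImageNet team's code) of the query-expansion step described in the project's documentation, in which each WordNet synset is unpacked into the synonyms that become queries for an image search engine. The search-engine call itself is omitted, and the example term is chosen for illustration only.

```python
# A minimal sketch of search-query expansion from a WordNet synset.
# Requires nltk with the WordNet corpus downloaded: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def queries_for(synset):
    """Turn a WordNet synset into a list of candidate image search queries."""
    # Lemma names are the synonyms that jointly define the category.
    terms = [lemma.replace("_", " ") for lemma in synset.lemma_names()]
    # Hypernyms (parent categories) can disambiguate polysemous terms.
    parents = [h.lemma_names()[0].replace("_", " ") for h in synset.hypernyms()]
    return terms + [f"{term} {parent}" for term in terms for parent in parents]

# Sense selection matters: "hammerhead" names a shark, a part of a hammer,
# and a person; ImageNet pins each category to one specific sense.
for synset in wn.synsets("hammerhead"):
    print(synset.name(), "->", queries_for(synset))
```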

Machine Vision's Instrumental Realism

ImageNet's determination to “average out” the contingencies of photography, a complex and problematic form of representation, is revealed in the depiction of marine life across three of Li's synsets.15 Figure 6 portrays a hammerhead shark as a shadowy figure glimpsed through the lens of a tourist's underwater camera; a dead trout in the hands of a fisherman; and a lobster cooked, garnished, and destined to become someone's dinner. The zoological category of marine creatures thus encompasses diverse ways of seeing and an array of photographic genres: underwater photography, the amateur trophy snapshot, and commercial food photography. While the hammerhead shark appears to be the object of scientific observation or touristic pleasure, the lifeless trout, given the social media culture of Flickr, is probably a trophy pic. Meanwhile, the plated lobster could be intended to draw business to somebody's restaurant. ImageNet, as a photographic proxy for the visual world, indirectly incorporates cultural and semiotic practices that include differing conventions, genres, social situations, and relations of power.

These images were not handpicked by dataset curators taking direct responsibility for the depiction of marine life; rather, a curatorial pipeline was engineered to secure a statistically representative distribution. The ImageNet team did not intend to convey any normative investment in dead trout or swimming sharks. But as these images of fish were dominant online, the pipeline reproduced that dominance in the training data. In this move from representation (the photographic mediation of marine life) to representativeness (a sample of marine life photography), ImageNet's curators performed a sleight of hand enabling them to sidestep the cultural context of each individual image in order to privilege the problem of statistical distribution.16 This approach implicates ImageNet in the “instrumental realism” that Sekula (1981: 16) has defined as a photographic operation that aims to establish a generalized look from contingent instances in the pursuit of “abstract, statistical truth.” Furthermore, it cements the perception that bias can be “cured” through the inclusion of more diverse and better images, rather than being immanent to the ontological condition of photography.

And yet Li and her team also transformed instrumental realism. Instrumental realism, Sekula (1986: 16–17) writes, relies crucially on the ostensibly objective character of the photograph—secured by its indexical relation to the material world, and reinforced by bureaucratic structures of science and government. As we have seen, Li augmented photography's alleged mechanical objectivity by conceiving the snapshooter as a passive reproducer of agreed-upon representations of the world and regarding popular photo-sharing sites as further evidence of consensus. This somewhat murky positivism would require the team to reconceptualize the agency not just of those involved in the production of photographs, but also of those who are required to interpret them.

To instrumentalize a semiotically unstable sociotechnical process, the ImageNet team devised an industrial-scale classificatory operation to pair each image with metadata (e.g., tags, categories, descriptions, bounding boxes). Li's team engaged legions of on-demand workers to “clean” and categorize each image at high speed. That enterprise in turn required Li to operationalize a perceiving subject who could accurately classify snapshots at a machine-like pace. According to her “back of the envelope” calculation (forty thousand categories, multiplied by ten thousand images per category, multiplied by the number of annotators necessary to confirm a given result), it would take six hundred million seconds of human labor. This roughly equates to nineteen years in the life of a person who “doesn't eat or sleep,” based on a worker who spends five hundred milliseconds in front of each displayed image. Because students could not be relied upon to undertake this onerous labor, Li and her colleagues engaged on-demand workers using Amazon's Mechanical Turk (Fei-Fei 2011).17
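
The arithmetic behind this estimate is easy to reconstruct. In the sketch below, the figure of three annotators per image is our assumption (Li specifies only “the number of annotators necessary to confirm a given result”), but it reproduces the cited totals:

```python
# A back-of-the-envelope reconstruction of Li's labor estimate. The
# three-annotator consensus figure is assumed; the other numbers come
# from her account.
categories = 40_000
images_per_category = 10_000
annotators_per_image = 3       # assumed consensus requirement
seconds_per_image = 0.5        # one image every 500 milliseconds

total_seconds = (categories * images_per_category
                 * annotators_per_image * seconds_per_image)
print(f"{total_seconds:,.0f} seconds of annotation")           # 600,000,000
print(f"about {total_seconds / (365 * 24 * 3600):.1f} years")  # ~19.0 years
```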

Through a set of preparatory experiments, Li and her colleagues tested a regime for this categorization at a glance.18 As described in a 2007 paper, Li and Pietro Perona engaged students from the California Institute of Technology to describe snapshots presented on screen for durations ranging from twenty-seven to five hundred milliseconds.19 The aim of this research was to pinpoint an ideal perceptual threshold for classifying images through words with the least possible cognitive effort. These experiments imbued the snapshot with a new dimension—a stimulus that provokes an automatic response.20 Notably, the psychologist James J. Gibson (1979: xiii), an early critic of this model of instantaneous perception, dubbed it “snapshot vision.” Hence, like the “snapshot” itself, which reduces photography to an automated output, Gibson's “snapshot vision” extends this instrumentalizing outlook to the on-demand annotator.

The Lingering Problem of Bias

So far we have argued that machine vision's instrumental realism is rooted in a long history of scientific positivism that has been onto-epistemologically amplified and remediated by the vast affordances of the internet, photo-sharing platforms, search engines, on-demand labor, and the rise of powerful techniques for the computational modeling of data. The need to harness photography to machine vision's project of assembling an objectifiable world depends on the presumption that these very affordances enable the creation of bias-free curatorial pipelines and databases. It is perhaps ironic, then, that two years after the release of ImageNet, Antonio Torralba and Alexei A. Efros (2011: 1523) demonstrated in a much-cited article that machine vision's long-standing quest for “a better, more authentic representation of the visual world” remained subject to troubling biases of a technical kind. That is, despite the field's desire for a statistically robust ground truth, datasets including ImageNet could not “generalize” to a different context without a significant loss in accuracy.21
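
The experiment behind this finding is, in schematic form, simple: train a model on one dataset and test it on another that depicts the “same” categories. The sketch below is our illustration of such a cross-dataset test, with synthetic Gaussian features standing in for image descriptors and an arbitrary shift standing in for capture bias; it is not Torralba and Efros's own protocol.

```python
# Cross-dataset generalization, schematically: a classifier trained on
# "dataset A" is scored on a held-out split of A and on "dataset B,"
# whose distribution is shifted to mimic capture bias.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_dataset(n, shift=0.0):
    """Two classes of 64-dim synthetic 'image features'."""
    class0 = rng.normal(0.0 + shift, 1.0, (n, 64))
    class1 = rng.normal(1.0 + 2 * shift, 1.0, (n, 64))  # bias hits classes unevenly
    X = np.vstack([class0, class1])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = make_dataset(1000)              # dataset A, training split
X_test_a, y_test_a = make_dataset(500)             # dataset A, held-out split
X_test_b, y_test_b = make_dataset(500, shift=0.6)  # dataset B, same classes

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("within-dataset accuracy:", clf.score(X_test_a, y_test_a))  # near 1.0
print("cross-dataset accuracy: ", clf.score(X_test_b, y_test_b))  # much lower
```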

Torralba and Efros's article is unusual in directly acknowledging the epistemological issues raised when photographic datasets attempt to stand in for a complex and multidimensional reality. In their brief history of image processing research, Torralba and Efros offer a heroic narrative of scientific struggle in which the release of each new dataset by the machine vision community is “a reaction against the biases and inadequacies of the previous datasets in explaining the visual world” (1523).22 Confronting the same paradox of representation that Jorge Luis Borges memorably depicted in a famous short story,23 Torralba and Efros admit that datasets cannot escape producing partial views of the world; even measuring “a dataset's bias,” they write, “would require comparing it to the real visual world, which would have to be in the form of a dataset, which could also be biased” (1524). Nonetheless, even as Torralba and Efros recognize that “a bias-free representation of the visual world” is “not a viable option” (1524), they retain the positivistic view of the individual photograph as a “sample” of reality. As such, their article authorized the field of machine vision research to continue sidestepping the diverse sociotechnical dynamics and power relations that constitute photography.24 Moreover, when Torralba and Efros (1528) suggested that the best approach to datasets might involve “start[ing] with a large collection of unannotated images,” they adumbrated the field's increasing turn to scale. The idea that increasing access to troves of data scraped from the internet might eventually represent something approaching a totality underwrote a discourse of “big data” that was already dominant in 2011 and would soon spur a corresponding investment in ever-larger models.25

The pursuit of “a more authentic representation” was not the expression of a rallying call to confront the politics of photographic mediation and its ideological claims on reality but rather of the desire for better sampling methods and a better distribution of images—one reason why calls for more data (including Torralba and Efros's) are almost always calls for more diverse data. But with the turn to deep learning that became the dominant paradigm after 2012, concerns for the dataset's statistical representativeness would become conflated with the belief that scale alone could deliver representativeness. That is, with the turn to scale, engineers would substitute a fetishization of size for what, in Li's paradigm, remains a pronounced concern with statistical distributions, behind which lies a vestigial assumption that photographic signs import their “real world” referents into machine vision.

Machine Vision's Latent Theory of Photography

In this essay we have documented how a theory of photography immanent to the machine vision pipeline emerges from a variety of ontological conflations between “real world” and photograph, human eye and camera, snapshot and visual stimulus, statistical distribution and neutrality, and (eventually) scale and representativeness of data. Although never fully spelled out or specified, this theory is embedded in the logistics of machine vision research, beginning with the curatorial pipeline and informing the rhetoric of papers, TED talks, and corporate branding. The same theory organizes and disavows the appropriation of photographic and experimental cultures for the production of datasets. The pipeline modeled by ImageNet recurs in many dataset projects, carrying with it a specific photographic configuration that conditions the level at which the world is described, how regularities are extracted, and how a supposed consensus is reached.

Standing back from Li's account of the data pipeline, we find the complex sociotechnical practice of photography and its unstable semiology stabilized to conform to the demands of statistical positivism. As photography stands in for the capture of “real-world” samples and for human vision itself, the dynamics between camera, operator, subject, viewer, and world are collapsed—with no need to acknowledge the labor, authorship, subjectivity, and power relations that underpin them. In the curatorial pipeline of machine vision, dataset annotators only respond to a stimulus (rather than a cultural object), and the stimulus is treated as a real-world scene. There is no cultural complexity granted to the act of interpretation, which is simply bypassed.

For machine vision research, photography is never just a practical tool; it is also seen as a remedy, a conceptual device that suspends the contradiction between indexicality and social consensus in order to grant social consensus the attributes of indexicality. The first step in acknowledging this latent theory is to open it up for discussion. Computer vision is, among other things, a fundamentally cultural practice, although it is rarely described as such. As machine vision technologies appropriate cultural work to reap private profit, the producers of these systems need more than ever to account for their actions. The curatorial function at the core of these technologies has become an urgent topic of research and public discussion.

Coda: After ImageNet?

Our discussion of machine vision's curatorial pipeline has been grounded in the paradigmatic example of ImageNet, a dataset that became inseparable from the machine learning technique that would come to be called “deep learning” and would help to create confidence in the promise of “AI.” More than ten years after the introduction of this groundbreaking paradigm, is our analysis relevant to the current state of the art? In particular, what are the continuities and breaks between the latent photographic theory of machine vision that we have advanced and the large image models (LIMs) built for “generative AI” today? Is there a difference in degree or in kind?

At this early stage in the development of a proprietary technology, our answers to these questions must be provisional. But it is worth setting forth some salient comparisons. Consider that ImageNet was created to advance academic research on the long-standing task of machine vision. By contrast, LIMs such as Stable Diffusion and DALL·E 2 emphasize the algorithmic generation of visual content as part of the broader development of “generative AI,” or commercial systems that claim to model and “disrupt” human creativity in visual as well as language domains. In the years since the ImageNet benchmark and the advent of LIMs, dataset curation has begun to operate in a very different political economy, dominated by the largest tech companies or by start-ups funded by venture capital. The scale of computing necessary for building large models is increasingly beyond the resources of academic researchers—and of many small companies or NGOs.

Despite our limited ability to peer into the production of proprietary systems, it is clear that the new wave of LIMs diverges from the ImageNet paradigm in scale, motivation, and provenance. While ImageNet contained fourteen million images in 2015 (Russakovsky et al. 2015), the current crop of LIMs is trained on billions of images. Li's intuition that machine vision's history would be anchored to the history of its datasets implied that the crucial questions to be answered would not lie in algorithms per se but in enlarging the scale of access to the visual world. Nonetheless, Li's curatorial pipeline, premised on the perceptual labor of on-demand annotators, stands in stark contrast to massive datasets obtained by scraping images with no documented human effort to taxonomize them.

As we have seen, ImageNet's curatorial pipeline distributed curation across various social media platforms, search engines, and annotators who “cleaned” the data by aligning each image to a preexisting taxonomy. By contrast, CLIP, the OpenAI image-text model on which DALL·E 2 is based, does not train on images labeled retrospectively by annotators. CLIP's solution to the problem of generalization involved a turn to scale: using captioned images scraped from the internet, CLIP was trained on four hundred million image-text pairs.26 The result of this “unsupervised” mode of learning is a much-celebrated ability to generalize; that is, CLIP needs no additional training in order to recognize a class of image that was not in the training data. CLIP-based image models are thus understood to be less mechanically dependent on their training datasets.
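
The openly released CLIP weights make this zero-shot behavior easy to observe. The sketch below is our illustration, using the Hugging Face transformers library; the image file and the candidate captions are placeholders, and the model simply ranks the captions against the image without any additional training:

```python
# Zero-shot classification with CLIP: compare one image embedding against
# the embeddings of candidate captions; no fine-tuning is involved.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("snapshot.jpg")  # hypothetical input photograph
captions = [
    "a photo of a hammerhead shark",
    "a photo of a trout",
    "a photo of a lobster",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-caption similarity scores; softmax turns
# them into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```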

While this vaunted capacity to generalize may appear to have mitigated the technical bias that troubled Torralba and Efros, it has not precluded the problem of ingrained stereotypes gleaned from the internet (Birhane, Prabhu, and Kahembwe 2021). A growing body of literature documents the enduring presence of racist and sexist stereotypes in LIMs.27 If, as may be the case, LIMs based on CLIP are as socioculturally biased as models trained on ImageNet, then their capacity to generalize results with less dependence on training sets may be overstated. Despite an inflated rhetoric of unstoppable technological progress, there are signs that engineers realize they have overestimated CLIP-based models' ability to overcome bias by scaling up their training data. As Andy Baio (2022) suggests, the improvements to DALL·E 2 targeting algorithmic bias focused on manipulating the contents of users' submitted prompts rather than the model's behavior. Reflecting on this move from debiasing models to so-called prompt engineering, Fabian Offert and Thao Phan (2022: 2) conclude that “these models are figured as beyond improvement, or at least, less easy to improve upon than users themselves.”

On these grounds one might speculate that despite the divergence from the paradigm that Li introduced with ImageNet, LIMs extend and perhaps even radicalize the onto-epistemological flattening at work in machine vision's latent theory. One striking continuity concerns the epistemic role assigned to scale. As the sizes of models and datasets increase exponentially, computer vision researchers continue to invest their energy in turning quantity of training data into quality of output, with the expectation of better performance. However, quantitative increase never means simply more of the same. When visual datasets jump in scale from millions to billions of entries, the jump precipitates a tendency we have already analyzed: the dataset's claims to objectivity depend less on the relations between subject (photographer) and object (world), and more on the internet's claims to universalism, underwritten by statistical mediation.

In this way, the scraped datasets for LIMs have radicalized the reliance on online images while shifting the empirical encounter secured by the camera's indexical capture to the somewhat murkier positivism of downloaded images at scale. This shift away from indexicality may partly explain the eerie relation to photorealism exhibited by many of the images generated by DALL·E and DALL·E 2. As scale turns the internet into a posited locus of so-called ground truth, the supposed verisimilitude of images produced by LIMs appears irremediably divorced from any perceptual and taxonomic control exerted by glancing annotators. And yet, the celebrated ability of LIMs to generate synthetic photorealistic imagery is congruent with a tendency in machine vision that Sarah Kember (2012: 337) once described as being “far more invested in photographic codes, conventions and rituals than in photography as an enduring medium.” While LIMs cannibalize and disrupt the received cultural form of the photograph, they paradoxically serve to maintain its powerful hold on visual production (Dewdney 2021).28

The fact that ImageNet's curation relied on heavily automated processes such as web scrapers likewise helps us to see how LIMs radicalize the latent theory of photography explored in this article. In their recent investigation of a subset of the LAION dataset that was used to train Google's Imagen and Stable Diffusion, Andy Baio and Simon Willison established that every image includes not only a URL and a description scraped from a webpage but also what its creators call “predicted attributes” (Baio 2022). These include an aesthetic score and the estimated likelihood that a watermark is imprinted on the image.29 The presence of these two predicted attributes is important in suggesting that Stable Diffusion does not generate visuals purely by drawing from a relation between textual descriptions and images (as in CLIP's captioned pairings). It also includes in the mix an index of ownership and provenance (the watermark) and a model of perception (aesthetic score).
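
In practice, these predicted attributes function as curatorial filters. The following sketch is our approximation of that filtering step: the column names are modeled on LAION's released metadata, while the thresholds and the filename are illustrative assumptions rather than the documented training recipe.

```python
# Filtering image-text pairs by model-predicted attributes. Column names
# are assumptions modeled on LAION's released metadata; the thresholds
# and the parquet shard are illustrative, not the documented recipe.
import pandas as pd

rows = pd.read_parquet("laion_subset.parquet")  # hypothetical local shard

kept = rows[
    (rows["aesthetic"] >= 5.0)     # predicted aesthetic score (1-10 scale)
    & (rows["pwatermark"] < 0.5)   # predicted probability of a watermark
]
print(f"kept {len(kept)} of {len(rows)} image-text pairs")
print(kept[["URL", "TEXT", "aesthetic", "pwatermark"]].head())
```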

The production of these metrics is also revealing. The aesthetic score is determined by a model trained on a supervised dataset, the Simulacra Aesthetic Captions (Pressman, Crowson, and Simulacra Captions Contributors 2022), the design of which involved human annotators who rated the images on a scale of 1 to 10. Even if Stable Diffusion did not use human annotation for the labeling of its very large dataset, it used a model that was itself reliant on human annotation. The predictive algorithm that “automatically” classifies the “scraped” data still relies on—or indeed, is formed from—a dataset curated by human annotators, just as the LIMs that supposedly depart from or supersede earlier machine vision paradigms incorporate them into their curatorial pipeline.

Of course, this is a provisional account of the breaks and continuities with ImageNet, which is severely limited by the growing concentration and privatization of machine vision's means of production. And therein lies perhaps the most important difference, one that should be scrutinized closely as the world grapples with “generative AI.” If we are to understand machine learning's relationship to hegemonic systems of power and discrimination, we need to acknowledge that the biggest obstacle that precludes easy answers is not only an epistemic one. Since the arrival of ImageNet, the project of machine vision as a whole has evolved into a resource-intensive and increasingly proprietary enterprise. There is a struggle awaiting scholars and activists: to resist the corporate capture of the production processes that machine vision research helped bring into being.

Notes

1. See, for example, Fei-Fei et al. (2007). Similarly, researchers in cognitive science who maintain close relations with machine vision scientists “will often use the phrase ‘real world’ to imply that some feature of the image matches some aspect of reality . . . even though the stimuli are not real objects” (Snow and Culham 2021: 506).

2. See, for example, Buolamwini and Gebru (2018); Crawford and Paglen (2019); Denton et al. (2021); Birhane and Prabhu (2021); Harvey and LaPlace (2021).

3. On datafication as a process, see the introduction to this issue.

4. Ground truth is a term used in machine learning to describe factual data originating in the real world, signifying the ideal expected result of the model, and used to assess the accuracy of predictions. Jaton (2017) offers a study of the sociotechnical practices that enable the constitution of ground truth in image processing. For a discussion of the term's genealogy and origins in geographic information systems, see Gil-Fournier and Parikka (2021).

5. Computer science uses the term pipeline to signify a technical infrastructure that allows engineers to manage and automate the various steps in assembling, processing, cleaning, and modeling data. The notion of a curatorial data pipeline includes many variants, from a relatively indiscriminate process of “scraping” data from the web to elaborate linkages between platforms, as we will show. We advance a parallel argument for the cultural significance of engineers as photography curators in Malevé, Sluis, and Tedone (2023).

6. For a critical account of how tech companies employ crowd workers through sites such as Amazon's Mechanical Turk to satisfy their needs for on-demand labor, see Irani (2015).

7. On the evolution of photographic sources for facial recognition datasets, see Raji and Fried (2021). Müller, Marchand-Maillet, and Pun (2002) discuss the earlier prevalence of Corel imagery in image processing.

8. As Harvey and LaPlace (2022) note, such sources of data “in the wild” are neither “natural” nor “wild” and do not exist outside systemic social inequalities. Representing datasets as if the data that constitutes them exists outside of systemic bias in this way “simplifies complexities in the real world where nothing is free of bias.”

9. The mobilization of snapshots as “real-world” photography in the YFCC100M dataset is developed further in Sluis (2022).

10. Photography scholars have long argued that such claims for the universalism of the snapshot are ideological. As Schroeder (2013) argues, today the snapshot continues to be exploited by artists and advertisers as a photographic style designed to appear unstylized and authentic.

11. Researchers Harvey and LaPlace developed a search engine (available at https://exposing.ai) to allow Flickr users to discover if their snapshots have been used in dozens of the most influential public face and biometric image datasets. Similarly, “Have I Been Trained?” is a search tool developed by art collective Spawning that enables creators to query whether their work has been incorporated into the LAION-5B dataset used to train large image models including Stable Diffusion and Google's Imagen.

12. Through a somewhat different rationale that emphasizes the “human in the loop,” Gurari et al. (2018) argue that “images curated from the web intrinsically pass a human quality assessment of ‘worthy to upload to the internet.’”

13. This is a typical trait of large-scale datasets, the curation of which involves many automated processes. Both ImageNet and Common Crawl (which underlies many large models) rely on crawling and scraping mechanisms through which images are extracted from web pages. Crawlers and scrapers determine what counts as an image or description, the importance given to structural elements in a webpage, the periodicity of scraping, and the significance granted to the number of links to a page to make it a candidate for inclusion in a dataset. In this respect curation is a distributed process encompassing human and nonhuman agents.

14. This is a point later echoed by the creators of the Yahoo Flickr Creative Commons 100M dataset (Thomee et al. 2016: 1), who suggest that scale can transcend “what is captured in any individual snapshot and provide information on trends, evidence of phenomena or events, social context, and societal dynamics.”

15. ImageNet categories are called synsets in reference to the WordNet taxonomy from which they are borrowed. A category is named by a set of synonyms (a synset) rather than by a unique term.

16. For a comprehensive discussion of representativeness as it is encountered in statistics, politics, and machine learning, see Chasalow and Levy (2021).

17. As Li explains in a Google TechTalk: “Let's just say a person just looks at two images per second and doesn't eat and sleep and so on. . . . It will take nineteen human years. So I ask my graduate student, ‘do you want to do this?’ [laughter] He said, ‘no. I need to graduate’” (Fei-Fei 2011).

18. For an extended analysis of the social ontology of the machine vision lab and this experimental practice, see Malevé (2021, 2022).

19. In a later experiment, described in Krishna et al. (2016), Li and her colleagues studied how workers subjected to rapid serial visual presentation techniques could increase their productivity, albeit at the cost of more errors.

20. Gibson (1979: xiii) coined the expression “snapshot vision” to critique the assumption that vision is best studied when “the eye is held still, as a camera has to be, so that a picture is formed that can be transmitted to the brain.” In Li's experiment, students sit before a screen where the stimulus/photograph will appear for a predetermined time. These viewing conditions, which Gibson (1979: xiv) calls “aperture vision,” assume “that each fixation of the eye is analogous to an exposure of the film in a camera, so that what the brain gets is something like a sequence of snapshots.” In the experiment, the rationale for using a snapshot as a proxy for the real becomes extended to the annotator, a viewer who performs snapshot vision.

21. Torralba and Efros's concern with “technical” bias relates to the problem of statistical overfitting (a situation wherein models that perform well on their training data cannot “generalize” that performance to a fresh dataset). Bias in this technical sense does not refer to training data that embeds racist or sexist discrimination.

22. In this teleology Torralba and Efros position the famous “Lena” image, widely used in image processing to the present day, as “a reaction against all the carefully controlled lab stock images” and, tellingly, as the first “real” image to be used as a dataset (1523). The 1972 Playboy centerfold of Lena—posed, polished, and constructed for the male gaze—is understood by Torralba and Efros, somewhat paradoxically, as a wild and unimpeded fragment of the visual world, “found” in the computer lab.

23. Jorge Luis Borges's “Del rigor en la ciencia” (1946) tells the tale of an empire that, in the effort to be exact, produced a map the same size as its territory. See Borges (1999: 325).

24. Torralba withdrew his Tiny Images dataset from circulation in 2020 in response to media coverage of an anonymous preprint of Birhane and Prabhu (2021) that criticized its racist and offensive content and its dependence (like that of ImageNet) on the stagnant vocabulary of the WordNet taxonomy.

25. As the introduction to this issue also discusses, the rise of data-driven deep learning amplified the belief in scale and spurred a technological and commercial revolution. As Meta's chief AI scientist and deep learning pioneer Yann LeCun explained to Forbes (2018): “Data is important for making a business out of machine learning. . . . You need data to train your system, and the more data you have, the more accurate your system will be. So, from a technology goal and business point of view, having more data is better.”

26. In a similar curatorial vein, LAION-5B, the dataset behind Stable Diffusion, derives from Common Crawl: it was obtained by scraping a very large array of sources and using alt tags as authoritative descriptions of images.

27. Hundt et al. (2022) showed that robots trained on CLIP mimic the “malignant stereotypes” of the web, associating “criminal” with random images of dark-skinned males and “housekeeper” with brown and Black females. Bianchi et al. (2023) have documented that recent LIMs cannot render an image of a disabled woman leading a meeting or an African man owning an impressive house, even when the “house” in question is inputted as a “fancy” house or a “mansion.”

28. Engineers would do well to remember when looking at the outputs of LIMs that what they perceive is neither a photograph nor an illustration but a raster graphic designed to remediate the look of these analogue cultural forms. As a measure of this confusion, Yonghui Wu and David Fleet's 2022 blog post celebrating the capacity of Google's Imagen and Parti to generate “high-quality, photorealistic images” is accompanied by images of cartoon robots and sci-fi illustrations of futuristic cities, rendered as though they were illustrated by hand.

29. Digital watermarking refers to a set of techniques to verify the integrity of a file's content. The watermark attribute in this case refers to the presence of the logo or the name of an image's author overlaid on the image.

Works Cited

Baio, Andy. 2022. “Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion's Image Generator.” Waxy, August 30. https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/.
Bianchi, Federico, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, and Aylin Caliskan. 2023. “Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale.” In FAccT ’23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 1493–504. https://doi.org/10.1145/3593013.3594095.
Birhane, Abeba, and Vinay Uday Prabhu. 2021. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” In 2021 IEEE Winter Conference on Applications of Computer Vision, 1536–46. https://doi.org/10.1109/WACV48630.2021.00158.
Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.” arXiv, October 5. http://arxiv.org/abs/2110.01963.
Borges, Jorge Luis. 1999. Collected Fictions. Translated by Andrew Hurley. London: Penguin Books.
Buolamwini, Joy, and Timnit Gebru. 2018. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” In Proceedings of Machine Learning Research, vol. 81, Conference on Fairness, Accountability and Transparency, 23–24 February 2018, New York, NY, USA, edited by Sorelle A. Friedler and Christo Wilson, 77–91. New York: PMLR.
Chasalow, Kyla, and Karen Levy. 2021. “Representativeness in Statistics, Politics, and Machine Learning.” arXiv, January 11. http://arxiv.org/abs/2101.03827.
Crawford, Kate, and Trevor Paglen. 2019. “Excavating AI: The Politics of Images in Machine Learning Training Sets.” https://www.excavating.ai.
Daston, Lorraine, and Peter Galison. 1992. “The Image of Objectivity.” Representations 40 (October): 81–128. https://doi.org/10.2307/2928741.
Daston, Lorraine, and Peter Galison. 2007. Objectivity. New York: Zone Books.
Denton, Emily, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole. 2021. “On the Genealogy of Machine Learning Datasets: A Critical History of ImageNet.” Big Data and Society 8, no. 2. https://doi.org/10.1177/20539517211035955.
Dewdney, Andrew. 2021. Forget Photography. London: Goldsmiths.
Fei-Fei, Li. 2010. “Crowdsourcing, Benchmarking, and Other Cool Things.” https://www.image-net.org/static_files/papers/ImageNet_2010.pdf.
Fei-Fei, Li. 2011. “Large-Scale Image Classification: ImageNet and ObjectBank.” YouTube, May 5. https://www.youtube.com/watch?v=qdDHp29QVdw.
Fei-Fei, Li. 2015. “How We Teach Computers to Understand Pictures.” YouTube, March 24. https://www.youtube.com/watch?v=40riCqvRoMs.
Fei-Fei, Li, Asha Iyer, Christof Koch, and Pietro Perona. 2007. “What Do We Perceive in a Glance of a Real-World Scene?” Journal of Vision 7, no. 1. https://doi.org/10.1167/7.1.10.
Gibson, James J. 1979. The Ecological Approach to Visual Perception. Boston: Houghton-Mifflin.
Gil-Fournier, Abelardo, and Jussi Parikka. 2021. “Ground Truth to Fake Geographies: Machine Vision and Learning in Visual Practices.” AI and Society 36, no. 4: 1253–62. https://doi.org/10.1007/s00146-020-01062-3.
Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. “VizWiz Grand Challenge: Answering Visual Questions from Blind People.” In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3608–17. Salt Lake City: IEEE. https://doi.org/10.1109/CVPR.2018.00380.
Harvey, Adam, and Jules LaPlace. 2021. “Exposing.Ai.” https://exposing.ai.
Harvey, Adam, and Jules LaPlace. 2022. “Researchers Gone Wild: Origins and Endpoints of Image Training Datasets Created ‘in the Wild.’” In Practicing Sovereignty: Digital Involvement in Times of Crises, edited by Bianca Herlo, Daniel Irrgang, Gesche Joost, and Andreas Unteidig, 289–309. Bielefeld: Transcript. https://doi.org/10.14361/9783839457603.
Hundt, Andrew, William Agnew, Vicky Zeng, Severin Kacianka, and Matthew Gombolay. 2022. “Robots Enact Malignant Stereotypes.” In FAccT ’22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 743–56. https://doi.org/10.1145/3531146.3533138.
Irani, Lilly. 2015. “The Cultural Work of Microwork.” New Media and Society 17, no. 5: 720–39. https://dx.doi.org/10.1177/1461444813511926.
Jaton, Florian. 2017. “We Get the Algorithms of Our Ground Truths: Designing Referential Databases in Digital Image Processing.” Social Studies of Science 47, no. 6: 811–40. https://doi.org/10.1177/0306312717730428.
Kember, Sarah. 2012. “Ubiquitous Photography.” Philosophy of Photography 3, no. 2: 331–48. https://doi.org/10.1386/pop.3.2.331_1.
Krishna, Ranjay A., Kenji Hata, Stephanie Chen, Joshua Kravitz, David A. Shamma, Li Fei-Fei, and Michael S. Bernstein. 2016. “Embracing Error to Enable Rapid Crowdsourcing.” In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. New York: ACM. https://doi.org/10.1145/2858036.2858115.
Malevé, Nicolas. 2021. “Algorithms of Vision: Human and Machine Learning in Computational Visual Culture.” PhD diss., London South Bank University.
Malevé, Nicolas. 2022. “The Computer Vision Lab: The Epistemic Configuration of Machine Vision.” In The Networked Image in Post-Digital Culture, edited by Andrew Dewdney and Katrina Sluis, 83–101. New York: Routledge. https://doi.org/10.4324/9781003095019.
Malevé, Nicolas, Katrina Sluis, and Gaia Tedone. 2023. “Curating in the Wild.” In Curating Superintelligences: Speculations on the Future of Curating, AI and Hybrid Realities, edited by Joasia Krysa and Magda Tyżlik-Carver. London: Open Humanities Press.
Müller, Henning, Stephane Marchand-Maillet, and Thierry Pun. 2002. “The Truth about Corel—Evaluation in Image Retrieval.” In Image and Video Retrieval, edited by Michael S. Lew, Nicu Sebe, and John P. Eakins, 38–49. Lecture Notes in Computer Science no. 2383. Berlin: Springer. https://doi.org/10.1007/3-540-45479-9_5.
Offert, Fabian, and Thao Phan. 2022. “A Sign That Spells: DALL·E 2, Invisual Images, and the Racial Politics of Feature Space.” arXiv. http://arxiv.org/abs/2211.06323.
Pressman, John David, Katherine Crowson, and Simulacra Captions Contributors. 2022. Simulacra Aesthetic Captions (Version 1.0). Stability AI. https://github.com/JD-P/simulacra-aesthetic-captions.
Raji, Inioluwa Deborah, and Genevieve Fried. 2021. “About Face: A Survey of Facial Recognition Evaluation.” arXiv. http://arxiv.org/abs/2102.00813.
Russakovsky, Olga, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115, no. 3: 211–52. https://doi.org/10.1007/s11263-015-0816-y.
Schroeder, Jonathan E. 2013. “Snapshot Aesthetics and the Strategic Imagination.” Invisible Culture, no. 18. https://papers.ssrn.com/abstract=2377848.
Sekula, Allan. 1981. “The Traffic in Photographs.” Art Journal 41, no. 1: 15–25. https://doi.org/10.2307/776511.
Sekula, Allan. 1984. Photography against the Grain: Essays and Photo Works, 1973–1983. The Nova Scotia Series: Source Materials of the Contemporary Arts, vol. 16. Halifax: Press of the Nova Scotia College of Art and Design.
Sekula, Allan. 1986. “The Body and the Archive.” October 39: 3–64. https://doi.org/10.2307/778312.
Sluis, Katrina. 2022. “The Networked Image after Web 2.0: Flickr and the ‘Real-World’ Photography of the Dataset.” In The Networked Image in Post-Digital Culture, edited by Andrew Dewdney and Katrina Sluis, 41–59. London: Routledge. https://doi.org/10.4324/9781003095019.
Snow, Jacqueline C., and Jody C. Culham. 2021. “The Treachery of Images: How Realism Influences Brain and Behavior.” Trends in Cognitive Sciences 25, no. 6: 506–19. https://doi.org/10.1016/j.tics.2021.02.008.
Thomee, Bart, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. 2016. “YFCC100M: The New Data in Multimedia Research.” Communications of the ACM 59, no. 2: 64–73. https://doi.org/10.1145/2812802.
Torralba, Antonio, and Alexei A. Efros. 2011. “Unbiased Look at Dataset Bias.” In CVPR 2011, 1521–28. https://doi.org/10.1109/CVPR.2011.5995347.
Torralba, Antonio, Rob Fergus, and William T. Freeman. 2008. “80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence 30, no. 11: 1958–70. https://doi.org/10.1109/TPAMI.2008.128.
Wu, Yonghui, and David Fleet. 2022. “How AI Creates Photorealistic Images from Text.” Google: The Keyword (blog), June 22. https://blog.google/technology/research/how-ai-creates-photorealistic-images-from-text/.