Abstract
In this essay, written in dialogue with the introduction to this special issue, the authors offer a critical history of the development of large language models (LLMs). The essay's goal is to clearly explicate their functionalities and illuminate the effects of their “generative” capacities—particularly the troubling divergences between how these models came into being, how they are currently developed, and how they are marketed. The evolution of LLMs and of their deployment as chatbots was not rooted in the design of interactive systems or in robust frameworks for humanlike communication or information access. Instead LLMs—in particular, generative pretrained transformers (GPTs)—arose through the steady advance of statistical proxies for predicting the plausibility of automated transcriptions and translations. Buoyed by their increasing faith in scale and “data positivism,” researchers adapted these powerful models for the probabilistic scoring of text to chat interaction and other “generative” applications—even though the models generate convincingly humanlike output without any means of tracking its provenance or ensuring its veracity. The authors contrast this technical trajectory with other intellectual currents in AI research that aimed to create empowering tools to help users to accomplish explicit goals by augmenting their capabilities to think, act, and communicate, through mechanisms that were transparent and accountable. The comparison to this “road not taken” positions the weaknesses of LLMs, chatbots, and LLM-based digital assistants—including their well-known “misalignment” with helpful and safe human use—as a reflection of developers’ failure to conceptualize and pursue their ambitions for intelligent assistance as responsible to and engaged with a broader public.
In a surprising early moment in their long technical essay “Training Language Models to Follow Instructions with Human Feedback,” a team of OpenAI researchers (Ouyang et al. 2022) says the quiet part out loud. Large language models (LLMs), they write, “often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following instructions. . . . This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective ‘follow the user's instructions helpfully and safely.’ Thus we say that the language modeling objective is misaligned” (emphasis added). That is to say, according to OpenAI's own researchers, models such as the company's GPT-3—the model that evolved into ChatGPT—are “misaligned” with helpful and safe human use. It follows that “averting these unintended behaviors is especially important for language models that are deployed and used in hundreds of applications”—precisely what OpenAI's signature ChatGPT soon became. In the discussion that follows, we elaborate the process through which OpenAI's “Alignment Team” undertook to mitigate the core “misalignment” of a product estimated to have about two hundred million users worldwide at the time of this writing.
Arguably one of the most fundamental deficits of current AI techniques, the problem of misalignment is often framed in the existential rhetoric discussed in the introduction to this special issue—in terms of the catastrophic consequences that a highly capable or “superintelligent” AI agent might someday bring about. According to this logic, even if the advanced technologies of the future do not actually extinguish the human species, the possibility of their producing unintended outcomes means that developers and governments must proactively ensure that future AI systems are “aligned” with, and act in pursuit of, “human values.” Of course, it is one thing to contemplate the preservation of human thriving in the face of some imagined far-off calamity. But as the work of Long Ouyang and colleagues demonstrates, in the actually existing world, alignment is already at the fore of developing LLMs for commercial use, and the “human values” with which they are meant to align are already being decided by teams like OpenAI's. Yet, to speak of “human values” as if they are a predecided universal and inclusive category is, of course, to pass over the crucial question of “Whose goals count?” (Abebe and Kasy 2021).
According to Ouyang et al., an aligned language model is one that is “helpful,” “honest,” and “harmless” (cf. Askell et al. 2021). That GPT-3, the model from which ChatGPT was first developed, did not meet the bar for OpenAI's values—let alone those of the broader public—thus marks a key moment in the “misalignment” story that we now aim to tell. It is a story that begins with the history of LLMs: complicated statistical models that, as we saw in the introduction to this special issue, give rise to myriad and little-understood social, ethical, and environmental problems. Those problems are amplified, moreover, when LLMs are developed and marketed as chatbots. Indeed, as we prepare this essay for publication in May 2024, OpenAI has just released an updated version of its chatbot, ChatGPT-4 Omni, equipped with a free-of-charge voice assistant that, according to the company's chief technology officer (quoted in Metz 2024), marks “the future of the interaction between ourselves and machines.”1
In what follows, we throw light on this murky terrain with special emphasis on the history of LLMs’ development. As OpenAI deliberately courts what has come to be known as the ELIZA effect (Hofstadter 1996)—that is, the tendency of people to project humanlike status onto computational systems that articulate even minimally sensible language—our analysis demystifies the functionalities of these statistical leviathans and explores the effects of their “generative” capacities. Throughout this discussion, we emphasize the troubling divergences between how these models first came into being, how they are currently developed, and how they are marketed not only as consumer-facing assistants and “foundations” for sundry downstream applications, but also as supposed “frontiers” of the very superintelligence that the public is now called on both to fear and anticipate.
In contrast to a long tradition of research in conversational AI that this essay will review, LLMs were not designed to support human language use as it has been understood since the advent of the cognitive revolution in the social sciences (see, e.g., Chomsky 1959; Miller, Galanter, and Pribram 1960). Nor do LLMs frame linguistic exchange through any of the many theorizations of human communication as an interactive, dialogic, and potentially emancipatory practice (e.g., Freire 1970; Habermas 1981; Clark 1996; Fraser 2000; Scott 2000; Tomasello 1999, 2019). Still less were these massive statistical models conceived as tools for transparent problem-solving or robust information retrieval in the mode of curated databases or internet search engines. Rather, LLMs originated as—and remain firmly rooted in—statistical methods for resolving lexico-syntactic ambiguities in practical tasks in natural language processing (NLP), especially machine transcription and translation (a history that we will elaborate). Thus, as OpenAI begins to market LLM-based chatbots as voice assistants—a format calculated to have maximum appeal for consumers—it injects a “misaligned” technology into an arena that hitherto has rested on careful engineering. At the same time, OpenAI and its competitors point to an era of full-fledged “AI agents” that can, for example, “schedule a meeting or book a plane flight” (Metz 2024), even though LLM-based technologies remain highly unsuited for the execution of such tasks. In this way, commercial products that claim to herald the future are developed and marketed with almost no dialogue with or accountability to the publics whose needs and interests they allegedly serve.
Although “AI” research dates to the 1950s, the history of LLMs as we know them is more compact. One major turning point was the invention (announced in a 2018 preprint by Alec Radford et al.) of the generative pretrained transformer (GPT), a machine learning (ML) architecture that—in conjunction with massive data sets and the application of ever-greater computational power—markedly improved the predictive capacity and potential use cases of language models. When a GPT was trained on gigabytes of textual data (largely scraped from the internet), its ability to predict a probable sequence of words in response to a given input helped to resolve ambiguities in voice recognition and statistical machine translation. The same architecture also proved useful for tasks including the summarization of text and the correction of grammar. However, it was above all the ability of these models to produce paragraphs of plausibly humanlike text in response to a prompt that foretold the GPT's starring role in something called “generative AI.” In this way, the “generative” architecture of statistical models that were designed to improve machine transcription and translation evolved—almost by accident—into the “foundations” of more ambitious forays into human communicative activities.
Nonetheless, as Ouyang et al. (2022) remind us, to generate plausibly humanlike text is not necessarily to follow a user's intended instructions; to output text that is consistently “helpful,” “honest,” and “harmless”; or, still less, to execute tasks such as scheduling a meeting with specific individuals. Thus, the decision to repurpose these volatile next-token predictors as mass-market chatbots through a process called reinforcement learning from human feedback (RLHF) marks a major turn in the history of conversational applications. In this essay, we place the development of ChatGPT against a backdrop that includes Vannevar Bush's (1945a) forward-thinking advocacy for technologically enhanced scientific research and communication and John McCarthy's (1959) prescient idea that AI systems might “learn” from conversational interactions with human users. We also call out the important legacy of Joseph Weizenbaum (1966, 1976) and his ELIZA program for mimicking the role of a Rogerian psychotherapist. Because of what came to be called the ELIZA effect (Hofstadter 1996), “chat” was for many years regarded as a concerning mode of human-computer interaction. Rather than an empowering tool that helps users to accomplish explicit goals by augmenting their capabilities through an accountable process, systems for “chat” prioritize the kind of simulation that Alan Turing (1950) famously called an “imitation game.” As users of chatbots (from colleagues of Weizenbaum to executives at Google) have illustrated (e.g., Christian 2022), when people engage automated systems in humanlike conversations, they often misapprehend the mechanisms and limitations of the systems that make these exchanges possible. To court the ELIZA effect is thus to manipulate socio-technical conditions of possibility for commercial gain at the expense of public interest.
At a time when the idea of “moving fast and breaking things” has lost its allure, we believe that researchers across disciplines must scrutinize corporate “alignment” protocols and look past the misguided commercial fantasies of simulated intimacy, limitless productivity, oracular decision-making, and frictionless knowing that generative AI now underwrites. Projects of digital or conversational assistance that claim to serve the public interest should be developed in dialogue with stakeholders and communities. Design choices that propose to delineate “the future of the interaction between ourselves and machines” should not be dictated by any single company or motivated by the business interests of a particular industry. Instead, technologies intended to assist human endeavors should be designed to encounter the myriad challenges of pluralistic global polities that are already struggling to contend with the damaging effects of underregulated social media. It is not too late. Other roads are possible and, indeed, once seemed obvious. Our research suggests that the technology industry and its developers must work with diverse communities to design reliable communicative tools premised on transparent models for problem-solving, knowledge-sharing, and accountable communication. We call on the broader public (including researchers, students, educators, citizens and their legislators) to invest in and demand a community-centered and bottom-up approach to technology “alignment.”
The Rise and Fall of the “Cognitive Assistant”
During the roughly eighty years that academics have pursued computational research, they have often looked to natural language interaction as a platform for enhancing the problem-solving capabilities of people working with digital tools. At the dawn of the field, John McCarthy (1959), who coined the term “artificial intelligence” (McCarthy et al. 1955), envisioned an “advice taker” program (premised on John von Neumann's [1993] insights into “general purpose” computation, first drafted in 1945, and Turing's [1950] insight that natural language conversation could enable interlocutors to address open-ended topics in a general way). McCarthy imagined a computer system that could interpret natural language instructions as specifications for programs that the system itself would synthesize and run. (It is possible that his vision was partly inspired by the intelligent robotic assistants on view in the fictional works of Isaac Asimov [1950], whose classic stories, written over the course of a decade, were republished as I, Robot, the collection that popularized a set of “laws of robotics” that continues to resonate with policy thinkers in our own day [e.g., Pasquale 2020]). In McCarthy's vision, designing systems that could follow instructions boiled down to natural language processing, problem-solving, and computation. These integrated capabilities, he believed, would enable the system to bootstrap new and better behaviors through a virtuous feedback loop that could increase the system's intelligence indefinitely. This ambitious vision befits the coauthor of a document (McCarthy et al. 1955) that had proposed a two-month, ten-person study that would “conjecture that every aspect of learning or any feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” Machines, according to this view, would be able to “use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves.”
In retrospect, McCarthy's understanding of AI as involving the following of user instructions, problem-solving, and automated self-improvement may seem to resemble the expansive rhetoric around LLMs in our own time. But in fact, his “advice taker” has little in common with the data-driven deep learning architectures of today's “generative AI.” As an instance of what (in the introduction to this special issue) Lauren M. E. Goodlad and Matthew Stone describe as “symbolic AI” (though sometimes referred to as first-wave or Good Old-Fashioned AI),2 the advice taker was expected to rest upon hard-coded approximations of a human interlocutor's commonsense understanding of speech. In this way, the system's mechanical interpretation of its interactions with human users could be read by engineers who would access and decode (straightforwardly off the system's internal state) the system's analyses of user utterances, its solutions to user problems, or its plans for future actions. That is to say, anyone concerned about the safety of these systems could look to engineers who were tasked to design the system's goals legibly—with the potential evaluation of other researchers (and, perhaps, public stakeholders) firmly in mind. Moreover, the users of such tools were envisioned as skilled workers who would be held professionally accountable for the system's inputs, uses, and ultimately its decisions. McCarthy's speculative paper thus set the stage for decades of high-profile work in what is today called conversational AI, a field that for several decades prioritized the building of transparent systems whose behaviors and functionalities could be tracked and evaluated with relative ease.3
By the early 2000s, these approaches were ripe for scaling and potential commercialization. The Cognitive Assistant that Learns and Organizes (CALO) project at SRI was funded to explore these challenges, and eventually led to the start-up that became Apple's Siri (see also Goodlad and Stone in this special issue). To that extent, McCarthy's vision—tempered by Weizenbaum's caution about the dangers of “chat”—guided the development of the technology embedded in every smartphone (along with a wide range of other devices, from smart speakers and televisions to home automation, cars, and beyond). By that time, however, the turn to data-driven machine learning was already in force, harnessing the internet's troves of data for projects of statistical pattern-finding at scales that neither McCarthy nor Turing ever envisioned.4 Digital assistants like Siri thus combined data-driven statistical models for tasks such as recognizing a user's voice commands with hard-coded skills for the execution of useful tasks (e.g., searching the web, setting a user's alarm, or entering a date on a user's calendar). When “conversation” took place, it took the form of brief scripts that supported these functions (e.g., “Would you like me to search the web?” “What time?”). Siri's “female-coded” voice (Wilkinson 2024), though sometimes described as robotic, has been aptly likened to “the timbre and tone of a self-assured middle-aged hotel concierge.”
In their introduction to this special issue, Goodlad and Stone thus call out the CALO project's judicious avoidance of deploying or marketing digital assistants in ways likely to trigger the ELIZA effect. As they write, “With no sales pitch pressuring them to believe otherwise, users with little technological expertise could understand Siri as a tool whose delivery of information involves searching the web. . . . The idea that Siri might become an intentional or superintelligent agent was virtually unheard of.” At the same time, however, Goodlad and Stone describe the confusing anthropomorphisms that computer scientists had begun to use in their research as a rhetorical “slippery slope.” Although the engineers of this era understood that the methods they had developed for machine “reasoning” or “learning from experience” were computational proxies for human cognitive faculties, these important provisos were largely implicit. “From an ML standpoint,” Goodlad and Stone explain, “‘experience’ equates to the acquisition of new data,” while “‘learning’ from . . . ‘experience’ involves modes of statistical optimization informed by access to this new data during subsequent rounds of training or fine-tuning.” The intention of these engineers was thus to build reliable tools: not to herald a triumph for “AI” or to claim that machine “experience” was equivalent to human experience like that which the cultural theorist Raymond Williams (1985: 128) has defined in terms both of “past ‘lessons’” and “full and active ‘awareness.’”
But by 2009, when Google published Alon Halevy, Peter Norvig, and Fernando Pereira's well-known essay “The Unreasonable Effectiveness of Data,” the climate and culture of technology research had considerably altered. With their unparalleled access to data, computational power, and investment capital, the largest tech companies had begun to dominate research agendas in academia as well as industry. The combination of their lucrative business models (rooted in the accumulation and monetization of data), the utility of large data sets in improving ML performance, and concomitant hardware advances all conduced toward a concentrated political economy that favored resource-intensive research and celebrated the powers of data positivism. As elaborated by Katherine Bode and Lauren M. E. Goodlad (2023), data positivism names an outlook so confident of and invested in the affordances of massive data sets that it perceives scale as the mechanism behind a wholly new onto-epistemic substrate that only computational methods can fathom. Such an outlook confirms the “ever more highly rationalistic” views of people and the world that concerned Weizenbaum (1976: 11) when, in dialogue with the philosophy of Hannah Arendt, he rejected the confluence between calculation and human judgment that, he argued, had become central to computational research. By the 2010s, powered by notable successes in machine vision (Malevé and Sluis 2023), “deep learning” (DL), a scaled-up mode of data-driven statistical modeling organized through multilayered architectures, had become the dominant paradigm for ML, largely displacing symbolic approaches to machine intelligence like that which McCarthy had envisioned. Notably, this was also the period when the term “AI” made a major comeback in the popular parlance (Whittaker 2021).
From the standpoint of our history of LLMs, this paradigm shift drove a wedge between the transparency of the hard-coded “cognitive assistant” and the highly fluent but unpredictable language models that massive data sets were enabling. Whereas academic researchers had avoided the problems of “chat” from the 1970s into the first decade of the twenty-first century—as when the computer scientist Stuart Shieber (1994) argued against the utility of conversational “Turing tests”—the turn to DL and its romance with scale and “big data” meant that infrastructures for practical human-computer interaction in natural language were growing. A new generation of entrepreneurs and developers, in sync with the surveillant business models of Silicon Valley—some of whom believed that data-driven architectures were a pathway to human-level AI—had begun to envision chatbots as a lucrative means of expanding commercial opportunities and harvesting data across a wide range of online domains.5 The stage was set for an ambitious start-up such as OpenAI—technically a nonprofit but already drawing on billions in investment from Microsoft—to transform a language model designed for the probabilistic scoring of text into the world's first “generative” chatbot.
The Origins of Large Language Models
At a time when the MIT Technology Review sees fit to publish articles in which the chief scientist of OpenAI imputes quasi-consciousness to a state-of-the-art chatbot (Heaven 2023), it is easy to overlook that the techniques for building LLMs originated in statistical approaches to automatic speech recognition in the 1970s and 1980s—contemporaneous with an engineering mentality that by and large eschewed loose thinking about consciousness or humanlike intelligence. In this milieu, language models developed—and continue to function—as procedures for the probabilistic scoring of text. Consider that speech recognizers (whether human or machine) must constantly draw inferences about the plausibility of different possible utterances, since acoustic inputs (spoken words) give rise to a plethora of temporary ambiguities.
For example, imagine hearing an utterance that begins “The shopkeeper threw out . . .” The signals corresponding to “threw out” and to “throughout” will be acoustically very similar, if not identical. To identify which words its user has spoken, a speech recognition system needs a mechanism of some kind for assessing the relevant linguistic contexts and background knowledge.
A “general” approach to this task, modeled on how humans interpret language, would conceive it as a challenge of comprehension that requires a robust framework for systematizing contextual influences and necessary knowledge. For most humans, this interpretive framework is so integral to the communicative situation as to be almost imperceptible—but that does not mean that it is easy (or even possible) to simulate computationally. To the contrary, the considerations that a human listener invokes as she prefers one interpretation of auditory signals over another presuppose what researchers sometimes describe as an internal model of the world—an affordance far beyond the capacities of any current machine.6 Rooted in the evolutionary history of the human species, in conjunction with interactive learning that acculturates people to think, reason, and communicate about themselves and their surroundings through language—a model of the world enables a person's recognition that, for example, “the shopkeeper throughout” is unpromising English because the preposition “throughout” implies a condition of spatiotemporal distributedness that is unlikely to pertain to any “shopkeeper.” In this way, human language use draws on lived experience in the sense of Williams's “past ‘lessons.’” Moreover, latent in these embodied interactions with a myriad world is that fuller “awareness”—onto-epistemological, moral, affective, and aesthetic—which, as Williams saw it, gives rise to human intelligence at its most capacious. Comparable concerns prompted John Searle (1980) to evoke a high bar for language understanding through his Chinese room experiment and led the philosopher Brian Cantwell Smith (2019), a successor to Weizenbaum's legacy, to distinguish judgment from machine calculation or reckoning.
Needless to say, full and active awareness of a complex world was not and is not the practical object of speech recognition technology. Rather, speech recognition set out to resolve a straightforward impasse (“throughout” or “threw out”?) by approaching it as a focused problem of linguistic transcription—as distinct from more exacting frameworks for linguistic comprehension or from the expression of communication through language. As a problem for transcription, the linguistic input is an audio stream from a single speaker and the targeted output is a standardized textual record. This highly useful but decidedly narrow approach to “natural” language locates speech recognition technology squarely in the world of the dictaphones and typing pools of the midcentury office. The model, in this case, is not an agent in dialogue with engineering professionals, as with McCarthy's “advice taker,” but instead occupies the feminized role of a typist who is subject to the norms of authoritarian hierarchy and of impersonal, decontextualized, and homogeneous communication.7 That is, for an automated transcription system, ambiguities need not be resolved with reference to the intentions of a particular speaker communicating in relation to a specific audience or history. They can instead be resolved at the level of the signifying text, through statistics and scoring. The modest language models of the 1980s, 1990s, and early 2000s estimated the probability of a text in a flat-footed way, by breaking it down into short sequences and comparing these “n-grams” to the counts found in a recognized “corpus” or data set.8 While such simple models could disambiguate homonyms like “throughout” and “threw out,” other ambiguities turned out to require more sophisticated frameworks.
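To make this “flat-footed” procedure concrete, the following minimal sketch in Python scores two candidate transcriptions with a smoothed bigram model. The toy corpus, the candidate sentences, and the add-one smoothing are our own illustrative assumptions rather than a reconstruction of any historical system; the point is only that counting short word sequences suffices to prefer “threw out” over “throughout” in a context like the one above.

```python
import math
from collections import Counter, defaultdict

# A toy corpus standing in for the transcribed text a 1980s-90s recognizer
# might have counted. The sentences are invented for illustration.
corpus = (
    "the shopkeeper threw out the stale bread . "
    "the market was busy throughout the day . "
    "the clerk threw out the receipts . "
    "it rained throughout the night ."
).split()

unigram_counts = Counter(corpus)
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def score(words, alpha=1.0):
    """Average log-probability per bigram, with add-alpha smoothing.

    Averaging (rather than multiplying raw probabilities) keeps candidate
    transcriptions of different lengths comparable.
    """
    vocab_size = len(unigram_counts)
    log_prob = 0.0
    bigrams = list(zip(words, words[1:]))
    for w1, w2 in bigrams:
        p = (bigram_counts[w1][w2] + alpha) / (unigram_counts[w1] + alpha * vocab_size)
        log_prob += math.log(p)
    return log_prob / len(bigrams)

# Two candidate transcriptions of the same (ambiguous) acoustic input.
threw_out = "the shopkeeper threw out the stale bread".split()
throughout = "the shopkeeper throughout the stale bread".split()
print(score(threw_out))   # higher: every bigram in this candidate appears in the corpus
print(score(throughout))  # lower: "shopkeeper throughout" never appears in the corpus
```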
Enter the large language model. At bottom, an LLM is a virtual mathematical space in which “vectors” (strings of numbers) articulate and spatialize patterns of language that have been identified and mapped during training. The model's main role is to operationalize this statistical “representation” in order to resolve ambiguities and make plausible predictions about suitable replies to, or completions of, user inputs. Because machine translation requires particularly powerful capabilities for disambiguation and prediction, the term “LLM” originates in work on machine translation (Brants et al. 2007)—as does the term transformer, which is the modeling technique most closely associated with contemporary chatbots (Vaswani et al. 2017).
Consider, for instance, this standard example in the machine translation literature (Hutchins and Somers 1992; Jurafsky and Martin 2023), which involves translating the English word “leg” into French—a language in which the leg of a person translates to jambe, the leg of a dog translates to patte, and the leg of a table translates to pied.
To contend with this complication, a modern system for statistical machine translation (SMT) must rely on a capable LLM to assign the highest probability to the correct French option (whether jambe, patte, or pied). This scoring provides a data-driven proxy for a mode of knowledge that, in a human context, we might associate with taking a course in French or spending a semester studying abroad. For a machine system, by contrast, what is particularly challenging is the ability to navigate a “long-distance” correlation in the surface text (given that the relevant antecedent—peintre, chien, or table—may occur several words earlier in the sentence). When DL made the modeling of such relationships feasible, it enabled SMT systems to capture increasingly longer dependencies and more fine-grained syntactic, semantic, and pragmatic preferences. As bigger models drove these improvements, technical progress came to be measured through quantitative metrics that track system performance automatically, with no role for human evaluation, in massive tests (Papineni et al. 2002). If this turn to scale and quantification meant that benchmarks and tests were becoming ever more prone to a lack of “construct validity,” the same trends nonetheless facilitated platforms like Google Translate, which, as of 2022, works across 133 languages (Caswell 2022).9
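As a schematic illustration of the language model's role in this choice, consider the toy decoder below. The invented log-probabilities and the pick_leg_translation function are hypothetical stand-ins for statistics that a genuinely large model would estimate from vast quantities of French text; they are meant only to show how probabilistic scoring resolves the jambe/patte/pied ambiguity by consulting a long-distance antecedent.

```python
import math

# Invented scores standing in for what a large French language model would
# absorb from its training data: people have "jambes," dogs have "pattes,"
# tables have "pieds." These numbers are illustrative only.
toy_lm_logprob = {
    ("peintre", "jambe"): math.log(0.60), ("peintre", "patte"): math.log(0.05), ("peintre", "pied"): math.log(0.35),
    ("chien", "jambe"): math.log(0.05),   ("chien", "patte"): math.log(0.90),   ("chien", "pied"): math.log(0.05),
    ("table", "jambe"): math.log(0.02),   ("table", "patte"): math.log(0.03),   ("table", "pied"): math.log(0.95),
}

def pick_leg_translation(antecedent: str) -> str:
    """Choose among jambe/patte/pied by asking the (toy) language model which
    pairing with the long-distance antecedent is most probable."""
    return max(["jambe", "patte", "pied"],
               key=lambda word: toy_lm_logprob[(antecedent, word)])

print(pick_leg_translation("peintre"))  # jambe
print(pick_leg_translation("chien"))    # patte
print(pick_leg_translation("table"))    # pied
```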
To be sure, Google Translate is a substantial resource that millions of people refer to on a daily basis: a testimony to the “effectiveness of data” at its most affordant. But it is important to recognize that the ability of LLMs to improve such systems according to quantified metrics encouraged researchers to regard translation as a task comparable to transcription—that is, as a matter of data-driven signal processing. In turning to the statistical scoring of LLMs, researchers thus embraced an approach that zeroes in on the most probable response while setting aside the complexities of cross-linguistic difference and the diverse materialities of language use. Accordingly, in a well-known essay on Google Translate, the cognitive scientist Douglas Hofstadter (2018) demonstrated that the platform's reliance on “shallow” pattern-finding produces flawed translations that suffer from a systemic lack of understanding, or even basic awareness “that words stand for things.” Hofstadter also worried that humans “with a lifetime of experience” of “using words in a meaningful way” were poorly prepared to observe the deficits of the system's probabilistic language outputs—a variation on the ELIZA effect. “It's almost irresistible,” he wrote, “for people to presume that a piece of software that deals so fluently with words must surely know what they mean.” Indeed, the adjective deep in “deep learning,” he argued, was being “exploited” to imply that language models deliver profundity, wisdom, or insight, when the “depth” in question merely refers to additional layers in a software architecture.
It is worth considering in this context that works of literature are often translated by humans multiple times: not only because of linguistic differences between, for instance, modern English and ancient Greek but also because literary works themselves are, almost by definition, plurivalent. In other words, what makes certain texts “literary” (a characterization that may apply to nonfictional works of, say, history or philosophy) typically relies on linguistic and formal features that are integral to the experience of reading and interpreting them. These might involve the affective or aesthetic charge of certain words or phrases; the rhythmic dimensions of poetic texts; the multiple meanings that words inscribe due to etymological histories (including puns or embedded cultural contexts); or the impact of character arcs, storyworlds, and other narrative features. Attentive to such variables, scholars of literature and language have created bodies of theory to elaborate the challenges of translation.10 One of the best-known modern theorists of human language—Jacques Derrida—was also famously deferential to the translator's work. In an essay on the topic (as translated into English by Lawrence Venuti), Derrida opens with a cautious French gloss on Shakespeare, after which he describes translation as both a “sublime and impossible” task and a “beautiful and terrifying responsibility” (Derrida and Venuti 2001: 174).
To remark on the “sublime” difficulties of translation is not, however, to imply that Google Translate should be scuttled. It is, rather, to emphasize that the transformation of machine translation into data-driven signal processing has contributed to an ecosystem (potentially understood as a core “misalignment”) that militates against human language learning, while falsely conveying the impression that translation is a “problem” that technology has “solved.” SMT thus became a paradigm example of how technologies that rely on quantitative proxies are subject to lack of construct validity: that is to say, whereas such quantitative metrics ostensibly measure qualitatively robust translation from one language to another, in actuality these proxies measure superficial similarity between SMT-generated outputs and a small number of human-generated “references.”11
Moreover, even setting aside the limited quality of machine translations—to say nothing of the ontological, ethical, and aesthetic differences between a crude proxy and a “beautiful and terrifying responsibility”—it is important to recognize how the perception that SMT models sufficiently enhance cross-lingual communication obscures the dilemma of declining foreign language study and growing monolingualism. Like all technologies built through data scraping, platforms such as Google Translate privilege the cultural perspectives of native English speakers and counterparts who speak a handful of European languages.12 As the organizers of a 2024 Global Humanities Institute on “Design Justice AI” at the University of Pretoria query: “What would be lost from human creativity and diversity if writers or visual artists come to rely on predictive models trained on selective datasets that exclude the majority of the world's many cultures and languages?”13
The Rise of GPTs
The importance of transformer architectures to projects of machine translation after 2017 helped to establish data-driven DL as the dominant paradigm through which researchers came to model “natural” language, applying the same methods of scaling and scoring that had been so propitious (according to quantified metrics) for transcription and SMT. As part of that paradigm, the growing confidence in data positivism and model size to improve performance reinforced a mindset that had begun to celebrate the “unreasonable effectiveness of data” a decade before. As commercialization provided the economic and institutional incentives to supercharge these beliefs, researchers inside and outside industry began to leverage discourses of “mitigation” and “alignment” to paper over the increasing evidence for a fundamental disconnect between data-driven prediction and the criteria necessary for robustly “human-centered” or “responsible” technologies. In what follows, we describe the antecedents and techniques that powered the rise of GPTs in preparation for discussing some key implications.
One early landmark in the history of large-scale language modeling was the word2vec method (Mikolov et al. 2013) for constructing “word embeddings” that capture the statistical behavior of words in vast collections of text.14 The rationale is that words with similar distributions are assigned nearby vectors, so the model can use the observed contexts for one word to inform its predictions about another.15 Vector embeddings thus enabled researchers to probe these structures for evidence of diverse relationships. By focusing on crisp statistical signals (see, e.g., Mikolov et al. 2013; Finley et al. 2017), researchers discovered correlates for morphological relations (like that between singular and plural nouns), semantics (e.g., the marking of gender in words such as king and queen), or general knowledge (e.g., the relations between countries and their capitals). As researchers soon recognized, the contextual relationships discovered through the geometry of word embeddings turned out not only to capture but also to amplify the biases and stereotypes inherent in training data. For example, comparing the embeddings for “sewing” and “carpentry” reveals that word2vec attributes their difference in large part to correlations with gender, as measured by the embeddings for “she” and “he” (Bolukbasi et al. 2016). Researchers like Tolga Bolukbasi et al. (2016) hastened to explore techniques for “mitigating” the “disturbing” propagation of harmful stereotypes in these early systems for textual analysis.16 In truth, however, the undesirable side effects of large-scale statistical modeling are almost impossible to eradicate—a point to which we will return.
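Both kinds of probe can be reproduced in miniature with off-the-shelf tools. The sketch below assumes the gensim library and its downloadable "glove-wiki-gigaword-50" vectors, a small public embedding set used here only as a stand-in for word2vec's much larger Google News vectors; exact numbers and nearest neighbors will vary with the embedding set chosen.

```python
# Assumes: pip install gensim numpy. The embedding set named below is a small,
# publicly downloadable stand-in; results differ across embedding sets.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Analogy probe: vector arithmetic recovers relational structure,
# e.g., "king" - "man" + "woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Bias probe (after Bolukbasi et al. 2016): project the gap between two
# occupation words onto a gender direction defined by "she" - "he".
# A positive value means the "sewing"/"carpentry" difference leans toward "she".
gender_direction = vectors["she"] - vectors["he"]
occupation_gap = vectors["sewing"] - vectors["carpentry"]
print(cosine(occupation_gap, gender_direction))
```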
Whereas word embeddings could capture data-driven similarities at the level of individual words, LLMs work across sequences of words and can offer statistical proxies for the syntax through which words compose grammatical structures. Transformer architectures—the “T” in GPT—excel at such mappings. Both computationally tractable and robust,17 transformers perform a variety of tasks (e.g., sentiment analyses that involve predicting emotional attitudes from textual utterance). That said, only certain transformers are suitable for generating text. The “G” in GPT thus indicates a generative transformer architecture—which is to say, one that predicts successive words in sequences as those sequences move from the beginning of a sentence to its end. Hence, in completing the sentence “The weather tomorrow in New York City will be . . . ,” a generative transformer might mimic the style of weather forecasts observed in its training data by outputting a continuation such as “ . . . overcast with occasional snow flurries.” The larger the model, the more likely that the predicted output will be syntactically correct as well as potentially relevant to other variables such as common weather patterns in a given city. (This, of course, does not mean that a chatbot's learned patterns about regional weather enable it to pronounce the actual weather at any given time.)
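For readers who want to see this left-to-right prediction at work, the sketch below uses the Hugging Face transformers library and the small public GPT-2 checkpoint as a stand-in for the far larger proprietary GPTs discussed here. The continuation it prints is a statistical prediction in the style of the model's training data, not a forecast.

```python
# Assumes: pip install transformers torch. "gpt2" is a small public checkpoint,
# used here only as a stand-in for larger generative pretrained transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "The weather tomorrow in New York City will be"
# The model continues the prompt token by token, left to right, by repeatedly
# predicting a plausible next word; it has no access to actual weather data.
print(generator(prompt, max_new_tokens=15, do_sample=True)[0]["generated_text"])
```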
While the point to bear in mind for the moment is that generative transformers turned out to be highly useful approaches for improving textual analysis and machine translation, we call on readers to observe this little-remarked fact: although generative transformers were designed as tools for resolving ambiguity through the probabilistic scoring of text—not for interactive conversation, robust machine understanding, or reliable information access—this nimble architecture unexpectedly ushered in what some have upheld as the long-anticipated realization of Turing's imitation game, if not a “solution” to the “problem” of human communication itself, a “reverse engineering” of the human brain (e.g., Sejnowski 2018: ix), or a pathway to human-level or even superhuman-level intelligence.18
The “P” in GPT stands for that architecture's pretraining. As figure 1 demonstrates, the process of building a generative transformer suitable for development into a chatbot begins with collecting a massive amount of data (step 1) and proceeds to “training” the model on that huge data set (step 2)—a lengthy and resource-intensive process during which the system develops a detailed statistical description (“representation”) of the training set that is formalized (by the number of mathematical elements or “parameters” that compose the description) and optimized for the prediction of plausible textual completions. Figure 1's two-step process roughly corresponds to that used to develop OpenAI's GPT-3 (Brown et al. 2020), which, at 175 billion parameters and at an estimated $4.6 million in computing costs alone (Li 2020), was, upon its release in May 2020, the industry's top generative pretrained transformer.19
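Step 2 can be summarized, in schematic form, in a few lines of code. In the sketch below, written against PyTorch, "model" stands in for any autoregressive transformer that maps token ids to vocabulary logits, "batch" for token ids drawn from the scraped corpus, and "optimizer" for a standard gradient-based optimizer; it illustrates the next-token objective that pretraining optimizes, not the distributed infrastructure that training at GPT-3 scale actually requires.

```python
import torch.nn.functional as F

def pretraining_step(model, batch, optimizer):
    """One schematic optimization step of generative pretraining.

    batch: tensor of token ids, shape (batch_size, sequence_length).
    model: assumed to return logits of shape (batch_size, seq_len - 1, vocab_size).
    """
    inputs, targets = batch[:, :-1], batch[:, 1:]        # predict each next token
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))          # next-token cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```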
Though figure 1 cannot elaborate on the proprietary development of ChatGPT, the process is known to have encompassed GPT-3's years-long transition into a model that was eventually released as GPT-3.5. This process required significant fine-tuning: a technique through which special data sets are used to improve a model's performance for one or more purposes, and one that may include the human-intensive annotation practices that we will return to in our discussion of step 3 below.
From Text “Generators” to Chatbots
So far we have discussed the first two steps in the development of LLM-based products such as ChatGPT as sketched in figure 1. In doing so, we have accentuated the tremendous expenditure of resources—from petabyte-size data sets and multibillion-dollar investments to premium chips (Griffith 2023), “eye-watering” computational costs (Altman 2022), and escalating carbon emissions and water consumption. As we move to step 3 in figure 1, which roughly encompasses the processes through which LLMs become more potentially marketable for deployment as chatbots, we see in effect how a technology driven by data positivism and scale is ostensibly “aligned” through a massive investment in largely secretive and often exploitative regimes of human labor.
As schematized in step 3 of figure 1 (“Use RLHF to Improve LLM Behavior”), RLHF is a human-intensive form of “fine-tuning.” Whereas standard fine-tuning of a model the size of GPT-3 for the breadth of capabilities required for chat would require a huge data set custom-built to prioritize an array of “aligned” responses to user prompts, Ouyang et al.'s method significantly automates the task by training a much smaller “reward model” (RM) that requires much less data (the RM comprises 6 billion parameters, as compared to GPT-3's 175 billion). To create the RM training set, OpenAI's researchers used crowd workers to identify exemplary outputs that the system should be fine-tuned to emulate, as well as problematic outputs that it should avoid. The company collected prompts from contractors as well as its own users, focusing on the kinds of tasks known to create problems for GPT-3. Through iteration of this process, over successive rounds of specialized fine-tuning, GPT-3.5 eventually came into being.20
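A schematic of the reward model's training objective helps to make the mechanics concrete. In the sketch below (written against PyTorch), reward_model stands in for a scorer that returns a single number per prompt-response pair, and chosen and rejected for two responses to the same prompt that a labeler has ranked; the pairwise loss follows the comparison at the heart of the formulation Ouyang et al. (2022) report, pushing the preferred response's score above the other's. The trained reward model's scores then serve as the reinforcement signal used to fine-tune the larger model.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise preference loss for reward-model training (schematic).

    reward_model is assumed to return a scalar score per (prompt, response) pair;
    chosen/rejected are the labeler-preferred and dispreferred responses.
    """
    score_chosen = reward_model(prompt, chosen)
    score_rejected = reward_model(prompt, rejected)
    # Minimized when the labeler-preferred response receives the higher score.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```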
According to Ouyang et al.’s evaluation, RLHF produced little or no reduction of bias but successfully lessened toxicity, enhanced factuality, and improved instruction-following. However, we believe that—even setting aside the significant concerns over worker exploitation discussed in the introduction to this special issue—RLHF entails serious limitations. First, despite the partial automation, the task of creating a reward model capable of altering GPT-3’s behavior involves an enormous expenditure of human labor and other resources. Second, the preferences of human labelers are themselves prone to errors, bias, and misconceptions. Moreover, as ChatGPT's persistent flaws confirm, human reinforcement does not root out the serious limitations to which LLMs are subject. The improvements in factuality that Ouyang et al. report likely derived from RLHF's helping GPT-3 to identify questions that it should not confidently answer—for example, questions that are too subjective, or whose answers vary over time (such as the weather in New York City). Nevertheless, confabulations remain rife in ChatGPT outputs. As Goodlad and Stone detail in their introductory essay, “hallucinations” arise because LLMs make confident predictions based on common patterns, not curated stores of fine-grained knowledge.
By “rewarding” the LLM for outputs that resemble the human-preferred content collected in the RM, RLHF creates models that prioritize uncontroversial but humanlike content: formulaic language that (at best) resembles encyclopedia entries.21 One concerning by-product of this approach is the erasure of particularities of usage that track the experiences and perspectives of underrepresented communities—a tendency to which probabilistic models are already subject (see, e.g., Weidinger et al. 2021). In effect, human-reinforced models cannot mitigate stereotypes and sanitize language about marginalized people without further reinforcing the dominant discourse. Yet another problem is ChatGPT's tendency to project the illusion of an authoritative system through the use of human-crafted scripts (e.g., “It depends on context”), when a less deceptive and less anthropomorphized reply—one less likely to cultivate false confidence in “AI”—would be forthright about limitations (e.g., “This system is not designed to provide reliable answers to qualitative questions involving multiple contexts”). To put this differently, in contrast to the norms still prevalent during the CALO era, OpenAI's use of human reinforcement has been implemented to encourage, not discourage, the ELIZA effect.
Indeed, the intention to promote chatbots by making them more like human companions seems to have motivated the choice of a Scarlett Johansson–like voice for “Sky,” the first of OpenAI's new voice assistant options. Widely perceived to be mimicking the role of the fictional digital assistant that Johansson performed in Spike Jonze's 2013 film Her, “Sky” simulates feelings and responds “flirtatiously” (Knight 2024) while delivering a Her-like fantasy through a “deferential” and girlish persona that is “wholly focused on the user” (Wilkinson 2024). Although the company denied using Johansson's voice, CEO Altman, who is an admirer of Jonze's film, tweeted a single word, “her,” after the Sky demonstration. As Kyle Chayka (2024) writes in the New Yorker, OpenAI's gambit places the company in the terrain of startups that, like Replika AI, specialize in humanlike AI “companions” (see also Knight and Rogers 2024). However, whereas such downstream companies sell “the semblance of emotional connecting,” OpenAI is now attempting to merge companion technology with information retrieval. The problem, Chayka notes, is that it is easier for LLM-based systems to simulate conversation than to deliver “reliable information.” The result is “a tool that sounds far more convincingly intelligent than it is.”22
Chayka's observation takes us back to the fundamental “misalignment” that Ouyang et al. (2022) document when they start with the premise that LLMs do not “helpfully and safely” inform or converse with human users. Thus, even as their RLHF technique mitigates some of these harms, it exacerbates others: concealing the flaws of a systematically unreliable tool, while amplifying the hazards and courting the likelihood of the ELIZA effect. This brings us back to insights that were once conventional wisdom in AI research: “chat” itself is a problematic affordance. It invites users to look for answers to questions without their knowing the domains sufficiently to assess these outputs for factuality and soundness. It conceals the fact that many of the bot's responses actually recapitulate, paraphrase, or summarize human-authored content (and, since the introduction of ChatGPT-4o, humanlike vocal behaviors); it undermines the attention to provenance that is crucial for research and information literacy (Shah and Bender 2022, 2024; see also Allison and DeRewal in part 2 of this special issue); and it frustrates users' efforts to understand the system well enough to exploit its strengths while appreciating its underpinnings and shortcomings. As such, ChatGPT not only diverges from the rigorous engineering and safety of the CALO era, it also (as we will now demonstrate) stands apart from an even earlier midcentury vision of cognitive assistance.
The Road Not Taken
In surveying both LLMs’ unprecedented capabilities for the probabilistic scoring of text and the roiling controversies around their deployment as commercial tools, we have worked to position this essay (in tandem with the introduction to this special issue of Critical AI) as a forum to initiate a broader discussion that reorients future technological development toward the public interest. Twenty years ago, when new modes of social media took the internet by storm, the general public was unprepared to recognize potential harms and predisposed to have faith in technolibertarian nostrums (Lubin and Gilbert 2023). But the desire to rein in big tech is, by now, commonplace. Might it be possible to bridle “misaligned” technologies, democratize decision-making, and distribute the benefits of worthwhile affordances widely? History has shown that such watersheds can bring profound opportunities as well as perils.
Consider the legacy of Vannevar Bush, the former dean of the School of Engineering at MIT who directed the US military research and development effort during World War II. In taking stock of the prospects for science and technology in peacetime, his Science: The Endless Frontier (Bush 1945b) compellingly argued that access to new knowledge and technological infrastructure were indispensable to human flourishing and should be supported by government funding. The report helped to institute the National Science Foundation. Just as influential was a “thinkpiece,” so to speak, in the Atlantic Monthly that advanced a prescient vision for future practices of machine-mediated reading and writing (Bush 1945a). Bush envisaged interactive tools, such as a hypothetical “memex,” that would enable researchers to track their thinking through annotations, marking critique and justifying conclusions by integrated references to primary sources, relevant data, and other supplementary information. Such products, he imagined, could be shared across research communities.
Bush's description of the memex illustrates how much of his postwar vision has been technically realized—and indeed improved—through hypertext, graphical user interfaces, and networked computing. But it also shows how much remains beyond our grasp—and, just as important, how much of the underlying social outlook has been undermined or forgotten:
The owner of the memex, let us say, is interested in the origin and properties of the bow and arrow. Specifically he is studying why the short Turkish bow was apparently superior to the English long bow in the skirmishes of the Crusades. He has dozens of possibly pertinent books and articles in his memex. First he runs through an encyclopedia, finds an interesting but sketchy article, leaves it projected. Next, in a history, he finds another pertinent item, and ties the two together. Thus he goes, building a trail of many items. Occasionally he inserts a comment of his own, either linking it into the main trail or joining it by a side trail to a particular item. When it becomes evident that the elastic properties of available materials had a great deal to do with the bow, he branches off on a side trail which takes him through textbooks on elasticity and tables of physical constants. He inserts a page of longhand analysis of his own. Thus he builds a trail of his interest through the maze of materials available to him.
And his trails do not fade. Several years later, his talk with a friend turns to the queer ways in which a people resist innovations, even of vital interest. He has an example, in the fact that the outraged Europeans still failed to adopt the Turkish bow. In fact he has a trail on it. A touch brings up the code book. Tapping a few keys projects the head of the trail. A lever runs through it at will, stopping at interesting items, going off on side excursions. It is an interesting trail, pertinent to the discussion. So he sets a reproducer in action, photographs the whole trail out, and passes it to his friend for insertion in his own memex, there to be linked into the more general trail.
To be sure, one may assume that both Bush's researcher and his friend are upper middle-class men: probably white, highly educated, and American (though the details leave open the tantalizing possibility of the memexer enjoying a Turkish coffee with a Middle Eastern friend). Nonetheless, without idealizing the Cold War era, we perceive that Bush's vignette implies a capacious understanding of textual communication and the research tools that support it as intersubjective affordances. The memex is a robust and dynamic digital infrastructure for collaborative efforts to establish interactive relationships of (and socio-technical conditions for) dialogue, shared inquiry, and mutual understanding.
In this sense, Bush's vision adumbrated those rich bodies of late twentieth-century social-scientific research to which we alluded earlier, animated by the perception that human communication is an interactive, dialogic, and potentially emancipatory practice (e.g., Freire 1970; Habermas 1981; Clark 1996; Fraser 2000; Scott 2000; Tomasello 1999, 2019). Although Bush is sketchy about the precise technical underpinnings, his scenario illustrates how computational technologies might realize interactive capabilities that are front and center in theories of human communication but marginal in today's chatbot landscape. For example, Bush's “trail” anticipates a notion of public inquiry and argumentation that holds speakers responsible for their contributions (e.g., Brandom 1994) and emphasizes the need for knowledge tools that link conclusions to their sources, evidence, and implications. His memex evinces a recognition that the products of research and engagement are resources to be remembered, consulted, and shared with others. Likewise, Bush's interactive focus suggests a dyadic, networked, and relational model of human-computer “chat,” one that underscores the necessity for clarification, paraphrase, elaboration, and summarization both to refine the results of knowledge work and to overcome potential misconceptions in communication (e.g., Clark and Schaefer 1989). The memex thus highlights the complex role that systems might play as active co-constructors and mediators of users' written ideas.
Pursuing capabilities of this kind, we contend, reflects a “road not taken” in research, especially in the years that coincided with the rise of concentrated corporate power, surveillant business models, and data positivism. Consider, for example, the features an AI-powered memex might require in order to help a user to construct her “trails.” What happens, for instance, when such a system encounters multiple options with regard to a user's query about a reference document? Perhaps it might include the means to operationalize this complexity by offering options to move ahead with a probable interpretation, to present an explicit confirmation, or to ask for clarification (cf. Paek and Horvitz 2000). Such design choices would carry significant ramifications. The user contributes an utterance; the system is not confident of how to process it and suggests a possible paraphrase; the user accepts it and records it. The user consults the document later to make decisions and distributes it through her organization; meanwhile, she learns to anticipate, adapt to, and perhaps even to trust the system's behavior and its contributions to her work. Although the technological mediations of such a system, depending on circumstances, might not always conduce toward high-quality work, the system's built-in attention to complexity and uncertainty recognizes that human communication and research are seldom reducible to straightforward signal processing problems or measurable through matching a corpus of “correct” answers or through a “reward model” derived from harried crowd-worker preferences. And yet these are the proxies that today's most powerful corporations rely on as they aim to tame “misalignment,” inject seductively deferential 24/7 “companionship,” and even usher in superintelligent machines.
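To make this design alternative concrete, the toy sketch below implements a version of the mixed-initiative pattern just described (cf. Paek and Horvitz 2000): depending on its confidence in an interpretation, the system proceeds, surfaces a paraphrase for confirmation, or asks for clarification. The thresholds, the Interpretation class, and the wording are illustrative assumptions of our own, not a description of any deployed system.

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    paraphrase: str    # the system's reading of the user's request, in plain language
    confidence: float  # estimated probability that this reading is correct

# Illustrative thresholds: act only when highly confident, otherwise involve the user.
ACT_THRESHOLD = 0.90
CONFIRM_THRESHOLD = 0.60

def respond(interpretation: Interpretation) -> str:
    if interpretation.confidence >= ACT_THRESHOLD:
        return f"Proceeding: {interpretation.paraphrase}"
    if interpretation.confidence >= CONFIRM_THRESHOLD:
        # Surface the paraphrase so the user can accept, correct, or record it.
        return f"I understood this as: '{interpretation.paraphrase}'. Is that right?"
    return "I'm not confident I understood. Could you rephrase or point me to the passage you mean?"

print(respond(Interpretation("link this note to the article on Turkish bows", 0.72)))
```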
Comparing the hype of today to Bush's vision—or even to the practical research objectives that produced Siri about a decade ago—one perceives that, somewhere along the line, the idea that digital technologies facilitate dialogue and share access to knowledge gave way to the misguided presumption that language, communication, and written expression are problematic inefficiencies in need of (lucrative) commercial solutions. Needless to say, Bush does not imagine that the “code book” will be hosted on a commercial platform, that the platform will surveil the online habits of his network of “friends,” or that it will incorporate design features that optimize for the most attention-grabbing (and profitable) of his “trails.” To this extent, the millennial rise of an online surveillance economy, rooted in the monetization of social relations, reconceives the communication of the many as a financial gold mine for the few.23
Conclusion: Whose Goals Count?
In this essay, we have questioned whether the misalignment of LLMs can ever be rectified through large-scale RLHF. We have deliberately set aside most of the myriad problems of the technology laid out more systematically in the introduction to this special issue, as well as in important precursors such as research by Sasha Costanza-Chock (2020), Emily M. Bender et al. (2021), Inioluwa Deborah Raji et al. (2022), and many others. Thus, instead of situating this essay in relation to recent analyses of environmental harm, labor exploitation, violation of data privacy, and algorithmic bias, we have focused largely on the particular problems that Ouyang and colleagues set out to mitigate through the techniques laid out in their 2022 essay.
Our conclusion is that no amount of further scaling supplemented by RLHF will make GPT-based chatbots sufficiently “aligned” with “human values” in the pluralistic and democratic way that we understand them. That is true if the values in question derive from the modestly progressive humanism that Weizenbaum set forth in his critique of chatbots and his Arendtian case for the empowering of judgment. It is true from the perspective of Bush's memex: a communicative assistant rooted in the values of human cooperation, collaboration, and the sharing and cocreation of knowledge. It is likewise true from the vantage of McCarthy's ambitions for AI tools rooted in the combination of transparent design, accountable engineering, and the insights of commonsense reasoning. It is true, moreover, from the vantage of Ouyang and colleagues’ express goals, since the modest improvements of their RLHF techniques do not suffice to make chatbots consistently “helpful,” “honest,” and “harmless,” even as they introduce a host of new dangers that court the ELIZA effect. Indeed, according to the essay from which OpenAI's team borrows their alignment framework (Askell et al. 2021), “people and organizations that deploy AI systems need to take responsibility for their behavior.” Although we agree, it seems clear to us that the big tech companies now deploying LLM-based chatbots have hardly begun to contemplate what that might mean.
As the computer scientist Rediet Abebe and the economist Maximilian Kasy (2021) put it: “Left to industry, ethical considerations” are likely either to “remain purely cosmetic” or “play only an instrumental role.” According to Abeba Birhane, conversations over AI ethics remain “dominated by abstract and hypothetical speculations” about AI intelligence “at the cost of questions of responsibility” including the “uneven distribution of harm and benefits from these systems” (Birhane et al. 2023). We agree with these colleagues. Abebe and Kasy outline the social interventions necessary not only “to think more deeply about who controls data and algorithms” but also to leverage the collective power of workers and their unions, “civil society actors, nongovernmental organizations,” journalists and their readers, “government policymakers,” regulatory agencies, and politicians. “Those of us who work on the ethics and social impact of AI and related technologies should think hard about who our audience is and whose interests we want to serve.” Again we agree.
In this essay (in dialogue with the larger contexts of the introduction to this special issue), we have focused on a socio-technical analysis from the annals of AI research. Our conclusions are clear: GPTs were already misaligned with safe human use, and their fine-tuning for chatbot use has made them more dangerous and deceptive. For robust alternatives to these unreliable systems, we can begin by going back to the future: reviewing the insights of Bush, McCarthy, and Weizenbaum and renewing them through community-centered tools, informed by interdisciplinary research and designed in dialogue with the publics they are meant to serve.
Acknowledgments
We are grateful to Christopher Newfield for helpful feedback on this essay and to Sabrina Burns, Andi Craciun, Kelsey Keyes, and Jai Yadav for technical assistance with the works cited. We acknowledge the support of NEH RZ-292740–23 (Design Justice Labs) in the preparation of this essay and are grateful as well to Mellon-CHCI for the Design Justice AI global humanities institute funding that has expanded our research into the impact of LLMs on local languages. Stone's work on this essay was also supported by NSF IIS-211926 and DGE-202162 and by a sabbatical leave award from Rutgers.
Notes
As we edit proof pages, Knight and Rogers (2024) report that OpenAI’s “system card” for GPT-4o warns that the company’s use of an “anthropomorphic voice may lure some users into becoming emotionally attached to their chatbot.”
One way to track the centrality and influence of the approach is through the IJCAI Computers and Thought Award, given to up-and-coming AI researchers for their early contributions to the field. The award honored Terry Winograd (in 1971) for his work on referential communication, and it honored Henry Kautz (in 1989), Martha Pollack (in 1991), and Sarit Kraus (in 1995) for watershed work in formalizing commonsense accounts of goals, plans, deliberation, and teamwork in collaborative interactions (among AI agents and possibly human users as well). In 2007, a team led by computer scientist James Allen was recognized with the best paper award at the AAAI conference for a system—echoing the advice taker—that built explicit programs for computer automation tasks from a user demonstration narrated in colloquial English (Allen et al. 2007).
Turing (1950) avers, “I should be surprised if more than 10⁹ [bits of storage] was required for satisfactory playing of the imitation game.” By comparison, GPT-3 includes 175 billion parameters, each carrying sixteen bits of precision, requiring storage on the order of 10¹² bits.
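The comparison can be made explicit with a back-of-the-envelope calculation (our arithmetic, using only the figures cited in this note):

```latex
% Turing's estimate versus GPT-3's parameter storage (figures cited in this note)
\[
10^{9}\ \text{bits} \approx 0.125\ \text{gigabytes}
\qquad\text{versus}\qquad
175 \times 10^{9}\ \text{parameters} \times 16\ \text{bits/parameter}
  = 2.8 \times 10^{12}\ \text{bits} \approx 350\ \text{gigabytes},
\]
\[
\text{a ratio of roughly}\quad \frac{2.8 \times 10^{12}}{10^{9}} = 2800.
\]
```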
Potential commercial applications in entertainment (Mateas and Stern 2006) and customer service (McTear 2004) marked an earlier instance of this shift toward the profit-making opportunities of chatbots. The idea of chat—in contrast to AI assistants that can act in support of user goals—is to provide an engaging conversational experience simply through presenting suitable utterances in response to user contributions. This opens up the possibility of building a chatbot using an “end-to-end” model that predicts what the system should say next given the dialogue history, by mining patterns from conversational data without even attempting to identify users’ meanings, goals, or expectations (Vinyals and Le 2015).
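To illustrate how little such pattern mining requires, here is a minimal sketch, in Python, of a retrieval-style chatbot that simply returns whatever response followed the most lexically similar context in its corpus. The toy corpus and function names are our own invention; Vinyals and Le (2015) instead train a neural network to generate responses, but the underlying wager—that a plausible next utterance can be predicted from conversational data without modeling users' meanings or goals—is similar.

```python
# A minimal sketch (our own illustration) of a retrieval-style "end-to-end"
# chatbot: it returns whatever response followed the most lexically similar
# context in its corpus, with no representation of the user's goals or meaning.
# The tiny corpus and helper names are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    ("how are you today", "doing great thanks for asking"),
    ("what is the weather like", "it looks sunny this afternoon"),
    ("tell me a joke", "why did the chicken cross the road"),
]

contexts = [context for context, _ in corpus]
vectorizer = TfidfVectorizer().fit(contexts)
context_vectors = vectorizer.transform(contexts)

def reply(user_turn: str) -> str:
    """Return the stored response whose context best matches the user's turn."""
    similarity = cosine_similarity(vectorizer.transform([user_turn]), context_vectors)
    return corpus[similarity.argmax()][1]

print(reply("how is the weather"))  # echoes a remembered pattern, not an understanding
```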
For example, Evelina Leivada, Elliot Murphy, and Gary Marcus (2022) explain the shortcomings of the DALL-E2 image tool in terms of the system's lack of a “general cognitive model of the world.” See also Stone's response to Bowman in part 2 of this special issue.
Of course, today, automatically transcribed speech need not appear in text documents. Synergies with other technological developments have given transcripts a broad utility in making information accessible by, for example, improving search, retrieval, and captioning of audio content. For more on the problematic legacies of feminized conceptions of intelligent assistive technology, see Yolande Strengers and Jenny Kennedy (2021).
For example, a bigram language model scores a text by totaling the scores for successive pairs of words. Each component score is based on how likely the second item was to follow the first in the training data. Larger models (in practice, up to about 5-grams) can capture more linguistic context but increasingly query patterns that were rarely seen in training, if at all. Joshua T. Goodman (2001) describes and evaluates the classical signal processing techniques, with rigorous guarantees, that such models can use to “smooth” the counts of rarely seen patterns (avoiding memorization of individual texts) and to “back off” from unseen patterns to find more reliable evidence from training.
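A minimal sketch, in Python, of bigram scoring with simple add-one smoothing, one of the most basic of the techniques Goodman surveys; the toy corpus and function names are ours, purely for illustration.

```python
# A minimal sketch (our own illustration) of bigram language-model scoring with
# add-one ("Laplace") smoothing, among the simplest techniques Goodman (2001)
# surveys. The toy corpus is illustrative only.

from collections import Counter
from math import log

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_log_prob(text: list[str]) -> float:
    """Total log-probability of a text as a sum over successive word pairs."""
    score = 0.0
    for first, second in zip(text, text[1:]):
        # Add-one smoothing keeps unseen pairs from receiving zero probability.
        count_pair = bigrams[(first, second)] + 1
        count_first = unigrams[first] + vocab_size
        score += log(count_pair / count_first)
    return score

print(bigram_log_prob("the cat sat on the rug .".split()))  # mostly seen patterns
print(bigram_log_prob("the rug sat on the cat .".split()))  # unseen pairs score lower
```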
As discussed in the introduction, the problem of “construct validity” as theorized by Inioluwa Deborah Raji et al. (2021) concerns the pervasive use of benchmarks to measure progress in AI through metrics that do not adequately measure or verify what the researchers are claiming. To be clear, we are not arguing that the turn to scale, quantified metrics, and data-centrism awaited the advent of DL. Machine translation researchers were already characterizing translation as data-driven signal processing before the widespread adoption of DL (e.g., Brown et al. 1993; Och and Ney 2004). These techniques dispensed with human-intensive processes of collecting, curating, and preparing input data and of enlisting people to evaluate system performance, turning instead to largely automated processes of scraping and evaluation that stripped language data of social, communicative, and embodied contexts and set aside cross-linguistic differences in grammar and lexicon. One landmark in this transition was Brants et al. 2007, an influential Google publication among speech recognition and translation researchers, which marked both a decided turn to scale and automation and the building of the first true LLMs. The “bigger is better” logic thus articulated was, as we have suggested, further codified in “The Unreasonable Effectiveness of Data” (Halevy, Norvig, and Pereira 2009). According to Bode and Goodlad (2023), the latter essay, in conjunction with a much-cited essay from Wired, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” (C. Anderson 2008), became the loci classici of data positivism.
For a brief discussion of some relevant work, see Coundouriotis and Goodlad 2020. For an essay that confirms our view of literary complexity, see Elam 2023, which argues that “literature does not aspire to seamless user experience” and instead “turns our attention to those seams we are seduced into not seeing. After all, fiction is not frictionless; poetry will not optimize.”
While better performance as measured by such metrics is correlated with better performance as observed by human users, (1) there is no guarantee that, of two possible SMT output translations for a given input, the one assigned the higher score by such a metric would be the one preferred by a given human annotator (Callison-Burch, Osborne, and Koehn 2006; Mathur, Baldwin, and Cohn 2020); and (2) the metric struggles to account for semantic similarity, so it will militate toward a particular subset of possible translations (Sellam, Das, and Parikh 2020). Whether this second limitation represents a problem for a given application depends on the exact requirements of the task it must perform, but it certainly indicates that such heuristics are a poor fit for understanding-based mapping of surface forms.
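The second limitation is easy to demonstrate with BLEU, one widely used member of this family of n-gram overlap metrics, as implemented in the NLTK library; the sentences below are invented for this note, and the sketch illustrates the phenomenon rather than offering a systematic evaluation.

```python
# A minimal sketch (our own illustration) of how n-gram overlap metrics of the
# BLEU family reward surface similarity rather than meaning: a near-verbatim
# candidate outscores a faithful paraphrase. Sentences are invented.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "delegates", "approved", "the", "proposal", "unanimously"]
near_copy = ["the", "delegates", "approved", "the", "proposal", "quickly"]
paraphrase = ["every", "delegate", "voted", "in", "favor", "of", "the", "plan"]

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], near_copy, smoothing_function=smooth))   # high overlap
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # far lower, despite equivalent meaning
```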
See Ramati and Pinchevski 2017 for the argument that the transition to “neural” (DL-facilitated) LLMs precipitated a transition from the domination of English to the domination of a Google-mediated multilingual “universal.”
“Guiding Questions,” Design Justice AI, December 13, 2023, https://criticalai.org/designjustice/guiding-questions/.
Another earlier method is the statistical topic modeling technique of David M. Blei, Andrew Y. Ng, and Michael I. Jordan (2003), which sparked substantial interest in the digital humanities (see, e.g., Meeks and Weingart 2012). The approach builds a simple language model by simultaneously clustering words and documents into associations or “topics”: clustering words together if they appear in related documents and clustering documents together if they contain related words. Because they ignore word order completely, topic models could help to analyze corpora but could not improve NLP tasks like speech recognition or translation.
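A minimal sketch, in Python, of bag-of-words topic modeling in this spirit, using scikit-learn's implementation of latent Dirichlet allocation on a toy corpus of our own devising; the documents and parameter choices are illustrative only.

```python
# A minimal sketch (our own illustration) of bag-of-words topic modeling in the
# spirit of Blei, Ng, and Jordan (2003), using scikit-learn's LDA on a toy
# corpus. Word order is discarded entirely: each document becomes word counts.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "parliament passed the budget after a long debate",
    "the senate debated taxes and the budget",
    "the striker scored twice in the final match",
    "the goalkeeper saved a penalty late in the match",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)   # documents as bags of word counts
vocab = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[-4:]]
    print(f"topic {i}: {top_words}")  # highest-weighted words per inferred topic
```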
The technique maps each word to a vector—a sequence of numerical values. Since these vectors identify points in a high-dimensional virtual space, the mapping collectively embeds the vocabulary of a corpus into an abstract geometry with intrinsic notions of similarity corresponding to distance and direction. These vectors are optimized in such a way that a simple neural network can use them to predict rates of co-occurrences of pairs of words. Previous data-driven models of word similarity, including Bengio et al. 2003, Schwenk 2007, and Mnih and Hinton 2008, involved computations that were too complex to train on massive datasets and so could not leverage the improvements in scale that are characteristic of work in statistical NLP.
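A minimal sketch, in Python, of learning word vectors from co-occurrence patterns, using the Word2Vec implementation in the gensim library as one widely available member of this family of techniques (we do not claim it is the specific model under discussion); the toy corpus is far too small to yield meaningful geometry and serves only to show the mechanics.

```python
# A minimal sketch (our own illustration) of learning word vectors from
# co-occurrence patterns with gensim's Word2Vec. The toy corpus only shows the
# mechanics of embedding words and querying the resulting vector space.

from gensim.models import Word2Vec

sentences = [
    "athletes who run long distances develop endurance".split(),
    "athletes run quickly in short races".split(),
    "runners who train daily avoid knee injuries".split(),
]

model = Word2Vec(sentences, vector_size=25, window=3, min_count=1, seed=0, workers=1)

print(model.wv["athletes"][:5])                    # a word as a point in 25 dimensions
print(model.wv.similarity("athletes", "runners"))  # similarity as distance in that space
```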
The grammatical basis for linguistic similarity is clear from the partially overlapping continuations of the sequences “athletes run” and “athletes who run.” Either phrase might well be continued with descriptions of running: “quickly,” “long distances,” and so forth. But only “athletes who run” can be directly continued with further descriptions of the athletes: “develop endurance,” “get knee injuries,” and so on. Before the invention of transformers, researchers explored a range of DL methods to find commonalities across language sequences: fixed windows modeled on n-gram methods (Bengio et al. 2003), “convolutional” methods designed for propagating local influences in tasks such as computer vision (Collobert and Weston 2008), and “short-term memory” methods suited to tracking change in the world (Sundermeyer, Schlüter, and Ney 2012). These methods met with mixed success.
Although some commentators make the case for successful Turing tests of one kind or another, we concur with Sasha Luccioni and Gary Marcus (2023) in holding (1) that no chatbot has as yet passed any robust Turing test (see, for example, Jones and Bergen 2023, as well as the as yet unmet Alexa Prize challenge described in the introduction to this special issue), and (2) that so-called Turing tests are, in any case, poor measures of humanlike intelligence for a number of reasons (including inappropriate anthropomorphization and the above-discussed ELIZA effect). Machine intelligence and human intelligence are best understood comparatively.
In this essay we focus on GPT-3, simply because it was the original foundation for what became OpenAI's first iteration of ChatGPT. GPT-4, the system on which ChatGPT was built until November 2023, models huge datasets through an estimated 1.76 trillion parameters (Bastian 2023). As discussed in the introduction to this special issue, GPT-5 is already under discussion, while GPT-4 Turbo was OpenAI's best-performing commercial product until May 2024, when the company released GPT-4 Omni (largely focused on the integration of multimodal features).
Ouyang et al.’s three-step process began by tasking human annotators with producing examples of appropriate output. The resulting dataset was used to fine-tune a new version of GPT-3 that was more likely to produce “aligned” outputs and was named SFT (supervised fine-tuning). Next, a second set of human-mediated prompts was used to elicit outputs from both the original GPT-3 model and the new fine-tuned SFT model. These paired outputs provided yet another dataset that human annotators labeled (indicating which of the two outputs was preferable). This new dataset was then used to train the “Reward Model” (RM), an ML model that can produce scores for proposed LLM outputs that are broadly consistent with the human preferences encoded in the RM's data. The RM is so called because it can provide numerical scores for arbitrary outputs and therefore can be used as part of a reinforcement learning system that automatically generates variations of outputs for given prompts and uses the RM as a proxy for human labeling—that is, as supervision for the final fine-tuning process that yields InstructGPT. Note, however, that this use of the RM as a proxy for real-world quality is a weakening of the standard reinforcement learning paradigm, which also encompasses the ability to take costly actions with little or no immediate reward so that the agent can go on to choose and perform subsequent actions that achieve greater rewards (Paek and Horvitz 2000; Knox et al. 2023). As such, it is reminiscent of proxy scoring metrics used for translation quality.
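A minimal sketch, in Python with PyTorch, of the pairwise preference objective typically used to train such a reward model from human comparisons; the tiny linear “reward model,” the feature vectors, and the dimensions are placeholders of our own, not OpenAI's architecture or data.

```python
# A minimal sketch (our own illustration) of training a reward model from
# pairwise human preferences. A single linear layer over made-up features
# stands in for the scoring network; the objective pushes the preferred
# output's score above the rejected one's.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

feature_dim = 8
reward_model = nn.Linear(feature_dim, 1)  # placeholder scoring network
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

# Placeholder features for (prompt, preferred output) and (prompt, rejected output) pairs.
preferred = torch.randn(16, feature_dim)
rejected = torch.randn(16, feature_dim)

for _ in range(100):
    r_preferred = reward_model(preferred).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Pairwise (Bradley-Terry-style) loss: -log sigmoid(score_preferred - score_rejected).
    loss = -F.logsigmoid(r_preferred - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained scores then stand in for human judgments during later fine-tuning.
```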
See Shumailov et al. 2023 for the argument that LLMs are so prone to privileging probable outputs and downweighting improbable alternatives that it becomes impossible to train functional models on the outputs of previous LLMs. As one coauthor of that study puts it, “We're about to fill the Internet with blah” (R. Anderson 2023). See Stokel-Walker 2023 for evidence that the multitudinous ChatGPT outputs already “littered across the web” have begun to affect the training of new models.
Although this new dimension of chatbot development is still at an early stage, it is not unique to OpenAI or its products. As Cade Metz (2024) reports, both Apple and Google are equipping their existing voice assistants with LLM-powered chat features. Moreover, in developing a more conversational version of Siri, Apple plans to partner with OpenAI, a deal that, according to the Information (Efrati and Ma 2024), could be worth billions and could overturn Apple's longstanding partnership with Google. On OpenAI's pausing of Sky in the wake of a public letter from Johansson, see, e.g., Spangler 2024. For commentary on the social and cultural implications of provoking this fantasy variation on the ELIZA effect, see Chayka 2024 and Wilkinson 2024. For a Google DeepMind article that elaborates various harms and ethical concerns connected to the deployment of anthropomorphized AI assistants, see Gabriel et al. 2024.
See Council 2023 for an eye-opening critique of how Signal, the nonprofit encrypted messaging application, delivers the same basic services as Meta and Apple with only fifty full-time employees for the simple reason that the tens of thousands of engineers, designers, salespeople, and product managers whom the big tech counterparts hire (and fire) are employed to devise data accumulation schemes and to seek out “cunning ways to grab every possible second of our attention, and wring every possible cent out of advertisers.”