Abstract
The salient concern, today, is not whether copyright law will “allow robots to learn.” The pressing question is whether the exploitation of the data ecosystem that has made generative AI possible can be made socially sustainable. Just as the human right to water is only possible if reasonable use and reciprocity constraints are imposed on the economic exploitation of rivers, so is the fundamental right to access culture, learn, and build upon it. Restoring this right to its proper place within the balancing act that IP law is supposed to enable demands fundamental legal reform. This article explores the merits of reconstructing copyright as a permitted privilege (rather than a property right). It also highlights the extent to which, for such reform to bear fruit and contribute to a socially sustainable data ecosystem, it needs to be supported by bottom-up participatory infrastructure.
Just like a river, data can be a source of wealth for some, a source of vulnerability for others, and a public resource for many communities. Where a river can be diverted for milling or irrigation purposes, it can also give rise to navigation and fishing rights, antipollution prohibitions and flooding risks. By the time water-powered technology enabled the first industrial revolution, legal frameworks to juggle those coexistent privileges and responsibilities had evolved in a piecemeal fashion. In many countries, the riparian doctrine that ties rights to use the water to ownership of river-adjacent lands (subject to other owners’ correlative rights) has managed to endure well into the twenty-first century. As the deleterious effects of water scarcity and pollution are increasingly felt around the globe, reforms of water law are finally underway. These reforms are converging away from the riparian doctrine (Koonan 2022; Hedden-Nicely 2022; Macpherson 2019; Cullet 2022) to tackle allocation, quality control, and infrastructure provision through a permissions-centered rather than property-centered system.
Generative AI—a term that refers to any type of artificial intelligence that can be used to create new text, images, video, audio, or code—is arguably accelerating what is sometimes referred to as “the fourth industrial revolution” (Schwab 2017). Instead of being powered by water, this revolution is powered by data. Without it, there is no life as we know it. Unlike water, data is an intangible, nonrival good. In principle, data can be enjoyed at time t by user x without preventing simultaneous consumption by others. This hasn't stopped the corporations that put in place data-collecting infrastructure from hoarding data: artificial scarcity increases the commercial value of data-dependent predictions.
From contracts to trade secrets via database rights, the legal tools backing up this hoarding strategy vary depending on the kind of data at stake and the types of rights pertaining to it. Besides data about our environment, today's machine learning technologies also feed on the data we generate as we go about our daily lives. While much of this data is generated passively, some is the result of active or creative endeavors on our part. Aside from the IP rights that protect the latter, the former, more or less passively generated data can also give rise to rights. In many jurisdictions, personal data rights apply when and to the extent that it is reasonably possible to find out to whom the generated information relates.
Introduced as a constraint on the contractual freedom that would otherwise allow for the unfettered exploitation of data for economic or political gains, these personal data rights are a crucial yet underused tool in any bid to counter the power imbalances that stem from (and allow) data collection. These rights reflect the specific nature of the vulnerabilities at stake: the data we leak on a daily basis has become something by reference to which we are continuously nudged and judged (Hildebrandt 2018). Articulated in different terms across different jurisdictions, the values underpinning these rights partake in a commitment to human dignity and moral equality. Fundamental as it is, that commitment is only one of the determinants of the social sustainability of the data ecosystem that underpins generative AI.
This article draws attention to the inadequacy of the legal tools we have at our disposal to safeguard the social sustainability of this data ecosystem. While work is now underway to promote the development of AI tools that could be “compatible with sustaining environmental resources for current and future generations” (van Wynsberghe 2021), the social sustainability of the data ecosystem that powers these tools is seldom considered as a whole. Can the data needs inherent in developing generative AI be met within ecosystem limits? The many artists and novelists whose works have been fed to generative AI models are not the only imperiled ecosystem contributors (even if they are currently the most visible). To allow sustainability considerations to concentrate only on the IP-protected works ingested by generative AI algorithms is to miss a wider set of challenges that defy traditional, rights-based categorizations of data—just as the sustainability of the ecosystem that surrounds freshwater is poorly grasped from a strict rights-based perspective.
Will data fare better than water? The inadequacy of the legal tools we had at our disposal to manage water (as a resource that powered the first industrial revolution) has led to large-scale water scarcity and pollution. The water law reforms that are currently underway around the world have taken centuries to emerge. Given generative AI's pace of progress, we don't have that kind of time to address the shortfalls of our data governance framework. At stake is not only the irreversible pollution of content found online. The equity and sustainability of the ecosystem that makes generative AI possible are under threat. Some of the mechanisms underlying this threat are qualitatively similar to those affecting freshwater rivers:
(1) Disregarded reciprocity expectations: Generative AI tools would not have been possible without access to open, high-quality content made available under Creative Commons and open-source licenses. Yet few of those tools respect the reciprocity expectations without which the Creative Commons and open-source movements cease to be sustainable.
(2) Disregarded ecosystem cultivation obligations: These obligations are best understood from an ecosystem-wide perspective and are unlikely to be met through top-down legal reform alone. Just as water law reform is gradually taking on board the need to empower a variety of river-dependent communities, the sustainability of the data ecosystem that enables generative AI presupposes robust bottom-up empowerment infrastructure. Aside from facilitating the collective exercise of personal data and IP rights, this infrastructure can also play a critical role in creatively addressing some of the manipulation, homogenization, mis- and disinformation risks inherent in large language models.
(3) Disregarded nonpollution obligations: the absence of internationally coordinated standards to facilitate the systematic identification of AI-generated outputs opens the door for the rapid colonization of content found online. This not only entails significant mis- and disinformation hazards; as Shumailov et al. 2023 found, it is also likely to affect the quality and reliability of large language models themselves (a toy illustration of this degradation mechanism follows below).1
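The degradation mechanism behind this last point can be conveyed with a deliberately simple toy simulation (my own sketch, not Shumailov et al.'s experimental setup). A “model” here is just a normal distribution fitted to data, and each generation is trained solely on samples produced by the previous generation's model:

```python
import random
import statistics

# Toy sketch of recursive training on model outputs (not Shumailov et al.'s
# setup): each generation fits a normal distribution to its training data,
# and the next generation is trained only on samples drawn from that fit.
# Because finite samples under-represent the tails, rare content is
# progressively lost and the estimated spread tends to collapse.
random.seed(1)
data = [random.gauss(0, 1) for _ in range(50)]  # generation 0: "human" data
for generation in range(1001):
    mu, sigma = statistics.fmean(data), statistics.stdev(data)
    if generation % 200 == 0:
        print(f"generation {generation:4d}: mean = {mu:+.3f}, std = {sigma:.3f}")
    # the next generation's training data is entirely model-generated
    data = [random.gauss(mu, sigma) for _ in range(50)]
```

Real language models are vastly more complex, but the structural point carries over: once model outputs dominate the training pool, whatever diversity is not faithfully reproduced at each step is lost for good.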
Given the structural similarities in the challenges at stake—whether applied to data or water, a property rights framework fails to account adequately for environmental and social externalities—much can be learned from water law reforms around the world (Hendry 2014). Taking a leaf from the latter, this article outlines the tools we have at our disposal to address points 1 and 2 above, as well as the dire lack of such tools when it comes to point 3. Considering the IP focus inherent in point 1, I emphasize below the underappreciated salience of the human right to access culture, learn, and build upon it. This right was meant to be balanced with the need for incentives to create works by drawing a line between access and reproduction.
Today this balancing act is broken. The distinction between “reading” and “copying” has all but collapsed in an age of reading bots, whose copying is the necessary by-product of their alleged capacity to read. At the same time, the characterization of copyright as a property right leads to an expansion logic that has a poor track record when it comes to incorporating public interest considerations and, crucially, honoring the human right to access culture. My essay thus begins by considering calls for the reconstruction of copyright as a permitted privilege granted by statute rather than a property right. This would not only fall in line with the worldwide trend for water law reforms to favor a move to a “permitting system” (Salman and Bradlow 2006). It would also open the way for the strengthening of copyright protections for vulnerable content producers (such as individual artists) while at the same time preventing some ecosystem-threatening abuses of so-called fair or public use exceptions.
The next section of the essay explores the potential of collective, bottom-up data empowerment infrastructure (as in point 2 above), again taking a leaf from water law reforms. The robust participation mechanisms typically accompanying permission-based reforms not only aim for better representation of diverse stakeholders; they are also key to incentivizing innovation and flexibility in the management of water or, in this case, data. Aside from providing much-needed participation forums, this infrastructure could empower those who want to share their personal data and/or creative works to contribute to common goods (from advancing research to better use of natural resources, etc.) in a way that gives them longer-term agency and preserves any reciprocity expectations (point 1 above). Such infrastructure could also be key to steering the (re)development of copyright in a way that is better able to address the precarious circumstances of many original content creators. Finally, bottom-up advocacy is crucial to counterbalancing the dominance of corporate voices when it comes to upcoming regulatory efforts: among these, the data pollution threat (point 3) requires urgent, internationally coordinated action.
Unspoken For? The Human Right to Access Culture
Just as the fundamental human right to water is only possible if reasonable use and reciprocity constraints are imposed on the economic exploitation of rivers, so is the fundamental right to access culture, learn, and build upon it.2 As IP lawyers and regulators scramble to come to terms with the challenges that stem from the rapid commercialization of generative AI, relatively few speak on behalf of the human right to access, learn from, and build upon culture (Plomer 2013).
From novels to pictures, songs, dances, lesson plans, and computer code, the IP rights that pertain to creative works are structured around a distinction between access and reproduction. This distinction was meant to balance the public's right to access, learn from, and build upon such works with the incentive to create and disseminate works in the first place (see below). In a world where most works are now disseminated through digital media, however, “making digital reproductions is an unavoidable incident of reading” (Litman 1996: 37). The de facto collapse of the reading versus copying distinction has left us with a system that manages to fail many original content producers while at the same time being leveraged by lawyer-rich entities in a way that limits not only reproduction but access too.
Since generative AI tools like OpenAI's GPT-4 became possible through scraping data from the internet—often including copyrighted material—the producers of these systems regard the erection of IP “fences” as a threat to their business model. European and British regulators are alert to it: so-called text and data mining exceptions are at the heart of intense debates, outlined below. In the United States, these debates are taking place in court hearings mostly structured around fair use. By allowing certain socially beneficial uses of copyrighted material that do not significantly harm the copyright holder's interests, the fair use doctrine has traditionally been adduced to attenuate copyright restrictions in light of the human right to access and build upon culture.
That the doctrine of fair use now appears to play a central role as a defense strategy in the upcoming string of lawsuits against various generative AI tools, however, suggests that the right to access culture may not weigh as much as it once did in the balancing act inherent in fair use. When the noncommercial use of such tools is conditional on a fee, it is indeed far from clear how such tools may be said to preserve or enhance the human right to access and build upon culture.
In light of the above, I consider below calls to re-delineate copyright as a privileged and conditional permission to exploit given creative works. This permission (rather than property right) would be granted by statute in a bid to restore the fundamental right to access culture to its rightful place in the balancing act mentioned above (in a way that is not dissimilar to the move to a permissions system in water law).
Copying versus Reading
Copyright law was born in a world of books and other predigital technologies.3 In that world, two concurrent imperatives molded IP law. These imperatives have not lost their salience: as we have seen, creators need legal protections if they are to retain incentives to produce further work, while the public needs to be able to access and enjoy creative works. IP law balanced these imperatives by drawing on a distinction—at the time fairly clear—between copying and reading: reading a copyrighted work, such as a book, does not infringe on the rights of the copyright holder, as long as the reader does not make copies of the work.
Of course, readers did not have to wait for the advent of digital technologies to become authors in their own right: this is why IP law evolved various kinds of originality tests. If the qualitative transformation between the original work and that produced by the reader is deemed to pass a certain threshold, the reader is not guilty of having made a copy and is deemed an author instead.
Given the contextual and value-loaded judgments through which originality is assessed, the author/reader distinction has never been straightforward. Today's increasingly interactive, sometimes crowdsourced way of producing original content has only added to this complexity.4 Yet it is the extent to which most reading today is done by machines whose reading (a debatable but by now conventional terminology) requires the production of temporary copies that arguably consigns IP's author/reader distinction to the ranks of outdated legal constructs.
Consider the following example: when a search engine downloads the HTML source code of a web page and stores relevant information—such as text content and metadata—into a database of web pages, are we faced with copyright infringement because of this “copy”? Or is this mere reading? (Murray-Rust 2012). The copying/reading distinction is similarly unhelpful when one considers the growing number of computational systems trained on mined data (whether it is scraped off the internet or otherwise). Since generative AI systems no longer merely claim to read what they have copied but also mimic it, regulators around the globe are scrambling to address the considerable uncertainty currently characterizing copyright law. Often driven by the short-term imperatives of international competition,5 many of those regulatory interventions risk losing sight of what copyright as a balancing act is about: it is not just a case of establishing whether uses of copyrighted material may be deemed “beneficial” (a key fair use consideration). These uses also have to be deemed to support or enhance the human right to access and build upon culture, as an inherent counterweight to copyright restrictions. The next section argues that neither the current EU/UK blanket regulatory approaches (through text and data mining exceptions) nor the United States’ individual litigation path are in a position to do justice to the salience of this human right to access culture.
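To see what is mechanically at stake in the search engine example above, consider a minimal sketch of robotic “reading” for indexing purposes (Python standard library only; the URL and the fields retained are illustrative). The local copy is not an optional extra: it is the first step.

```python
from urllib.request import urlopen
from html.parser import HTMLParser

# Minimal indexing sketch: "reading" a page means downloading it first,
# i.e., making a local copy of the HTML, before extracting the text and
# metadata that end up in the search engine's database.
class PageReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta, self.text = {}, []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

url = "https://example.com/"  # illustrative URL
html = urlopen(url).read().decode("utf-8", errors="ignore")  # the unavoidable copy
reader = PageReader()
reader.feed(html)
index_entry = {"url": url, "metadata": reader.meta, "content": " ".join(reader.text)}
print(index_entry["metadata"], index_entry["content"][:80])
```

Whether that unavoidable copy amounts to infringement is precisely the question the copying/reading distinction can no longer answer on its own.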
The Fair Use and Text and Data Mining Battlegrounds
To address the uncertainty that stems from the collapse of the copying/reading distinction while mitigating barriers to innovation, article 3 of the EU Directive on Copyright in the Digital Single Market exempts what would otherwise be copyright-infringing copies made by research organizations and cultural heritage institutions in the context of text and data mining for the purposes of scientific research.6 Article 4 of the directive provides for the same exemption with a twist, in that it can be availed of by any type of beneficiary for any type of use. But there's a catch. Copyright holders can opt out of article 4's exemption (they can, for instance, add some exclusionary metadata that shields their content from text and data mining), whereas they can't opt out of the exemption provided by article 3.7 Some worry about the extent to which this opt-out possibility puts the EU AI sector at a strategic disadvantage,8 given the costs associated with the need to negotiate licenses for vast amounts of training data (Senftleben et al. 2022).9
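What an article 4 opt-out looks like in practice is worth spelling out. The sketch below is illustrative only: it checks a page for one machine-readable reservation signal before mining. The meta tag name follows one draft convention (the W3C TDM Reservation Protocol); actual reservations may equally be expressed through HTTP headers, site-wide files, or contractual terms, all of which a compliant miner would need to honor.

```python
from urllib.request import urlopen

def may_text_and_data_mine(url: str) -> bool:
    """Illustrative check for a machine-readable reservation of rights.

    The "tdm-reservation" meta tag follows one draft convention (the W3C
    TDM Reservation Protocol); this crude string match stands in for proper
    HTML parsing and for the other signalling channels (HTTP headers,
    site-wide files, contractual terms) that would also need to be honored.
    """
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    rights_reserved = 'name="tdm-reservation" content="1"' in html
    return not rights_reserved

print(may_text_and_data_mine("https://example.com/"))  # illustrative URL
```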
A similar competitiveness concern recently led to the UK Intellectual Property Office's consultation on the proposed introduction of a new copyright exception for text and data mining for commercial purposes (retaining a “lawful access” requirement).10 The proposal has since been put on hold. Among the concerns raised, the Publishers Content Forum notably emphasized that “without the ability to license and receive payment for the use of their data and content, certain businesses will have no choice but to exit the UK market or apply paywalls where access to content is currently free.”11
In the United States, there is currently no specific regulatory provision pertaining to text and data mining, which leaves the courts to preside over what is an ongoing, live battleground. In this context, robotic readers—search engines are one common example—have, so far, been accommodated via the concepts of “intermediate copying for non-expressive uses” (Sag 2009, 2018) or “bulk non-expressive copying” as a case of fair use.12 The idea is that those machine-generated copies do not infringe copyright because they are merely an intermediary step in a process that does not aim to produce any new “expressive” work.13 Notice the weight carried by the term expressive: the anthropocentric logic that initially exonerated the perforated rolls used by nineteenth-century player pianos (Borghi 2019) is still at work. These rolls were deemed to fall outside the scope of copyright since they were considered “part of a machine” rather than being “addressed to the eye.”14
It took just over a decade for the nineteenth-century perspective that the perforated roll is only read by a piano machine to be reversed, since the machine in question did produce sounds meant for human ears, after all. Today, there is no telling how long it will be before the hitherto successful “intermediate copying for non-expressive uses” fair use strategy finds its limits, since the algorithms behind tools such as ChatGPT arguably end up expressing content to human eyes too. Techniques designed to enable “guarantees on the lack of similarity . . . between the output of a generative model and any potentially copyrighted data in its training set” (Barak 2023; referring to Vyas, Kakade, and Barak 2023), however, have the potential to circumvent the aforementioned legal uncertainty.15
This uncertainty is tied to the fact that American courts continue to be the main arena for the clash of forces vying for influence over the future of copyright law—as well as other generative AI-relevant issues, such as what constitutes “authorized access.”16 Because fair use is an open-ended test that requires a contextual, value-loaded analysis for each case, Lawrence Lessig (2003: 145) dubbed it “the right to hire a lawyer.” Google LLC v. Oracle America is a case in point.17 After over ten years of expensive court battles, the Supreme Court refrained from deciding whether Java's API code should be deemed copyrightable.18 The court chose to concentrate instead on whether Google's use of the Java API was fair based on four key criteria: the nature of the original works,19 whether the use of copyrighted works was “transformative,”20 the substantiality of the portion of the Java API code used,21 and whether Google's use could be deemed detrimental to the potential market for the Java API.22 Amid all the technical deliberations, the court came closest to a concern for the human right to access culture and build upon it when it referred to the fact that “Google's basic objective was not simply to make the Java programming language usable on its Android systems. It was to permit programmers to make use of their knowledge and experience using the Sun Java API when they wrote new programs for smartphones.”23
The above line of argument is unlikely to be of much use in the context of the string of legal actions currently being launched against a variety of generative AI tools. Among these, the class action lawsuit (ostensibly instigated in the name of all open-source coders) against Microsoft, GitHub, and OpenAI is of particular interest for our purposes since it is tied to issues of sustainability.24 This class action raises the issue of noncompliance with the (copied) original software's license terms. If Copilot is regurgitating sections of licensed code without due credit, it will be in breach of those licenses that require attribution as one of the conditions under which the code can be used, modified, and distributed.25 While Copilot now appears to be experimenting with code citation and ways of filtering its output—or flagging potential license issues—the root problems, as we will see, go much deeper.26 They touch upon the very sustainability of the data ecosystem that has made generative AI possible. To understand the way in which these problems are inherently tied to social sustainability issues (therefore demanding democratic debate, rather than being left to the courts), a quick detour via the significance—and limits—of the open-source movement is necessary.
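Before that detour, it is worth being concrete about what “filtering” or “flagging” might involve. The sketch below is hypothetical (it is not GitHub's actual mechanism): it fingerprints whitespace-normalized chunks of generated code against a corpus of known attribution-requiring snippets, so that verbatim regurgitation can at least be flagged together with the attribution the license demands.

```python
import hashlib

def fingerprint(code: str) -> str:
    """Hash a whitespace-normalized snippet so trivial formatting changes still match."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical corpus: fingerprints of licensed snippets mapped to the
# attribution their licenses require (the entries here are invented).
licensed_corpus = {
    fingerprint("def quicksort(xs): ..."): "example-project, MIT License, (c) Jane Doe",
}

def attribution_notice(generated_code: str):
    """Return the required attribution if the output matches known licensed code."""
    return licensed_corpus.get(fingerprint(generated_code))

notice = attribution_notice("def quicksort(xs):   ...")
if notice:
    print(f"Generated output matches licensed code; attribution required: {notice}")
```

Near-verbatim copies would slip through an exact-match scheme of this kind, which is one reason any real filter would need fuzzier matching; and even a perfect filter would leave the deeper sustainability questions untouched.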
“Creative Commons,” Property Rights, and Sustainability Reforms
In the vast majority of cases, to choose to license one's creative works under the Creative Commons regime (or an open-source license) is a value-loaded choice.27 At the heart of it is a commitment to fostering open collaboration and widening the accessibility of knowledge. As such, the Creative Commons project is often seen as an endeavor to counter the progressive shrinking of the public domain and to uphold the moral aspirations that underlie the human right to access culture. Several scholars, however, point to the extent to which the Creative Commons’ reliance upon the assertion of property rights over content may end up reinforcing the very phenomenon it is meant to oppose:
Claiming property rights in creative works is therefore communicating a message that information is proprietary, that it always has an owner. It strengthens the perception of informational works as commodities which are subject to exclusive rights. It reinforces the perception that a license is always necessary, and that sharing is prohibited unless authorized. . . . The same rules that would make Creative Commons licenses enforceable would equally make enforceable corporate licensing practices which override user's privileges under copyright law. (Elkin-Koren 2001: 417)
If the progressive shrinking of the public domain is largely facilitated by a property rights discourse that enables private parties to define the conditions underlying public access to information (this is sometimes referred to as “private ordering”28), any reliance on this property rights framework, however well intentioned, risks backfiring.29 Is there an alternative?
Because “the dominant thinking in terms of goods and ownership favors a ‘property logic’ that is capable of concealing the negative effects of IP rights for individual freedoms and the public interest,” Alexander Peukert (2021: 169) argues it is time to abandon “the fiction of immaterial commodities” that are supposed to be the object of intellectual property rights. As soon as one focuses on the fact that IP rights are all about specifying whether third parties need the prior authorization of the rights holder for certain actions, the advantages of a characterization of such rights in terms of permitted privileges rather than property rights become apparent. The privilege in question would be that of being granted the permission to exploit/restrict reproduction of a given piece of creative work, all under conditions specified by statute.
From a historical perspective, such a reconceptualization of IP rights as privileges rather than property rights would resurrect aspects of the legal landscape as it existed in the fifteenth and sixteenth centuries:
The concept of copyright was unknown in the manuscript era and was slow to develop even when printing with movable type had revolutionized the rate at which copies of a book could be produced. Once a book was published, it passed into the public domain. To seek to protect it, even for a short time, from unrestricted reprinting was to ask for an exception to be made in its favor. However, one concept was quite familiar and was to prove useful to authors and publishers asking for such protection. This was the right of a ruler to grant a “privilege” or commercial monopoly, whether permanently or for a fixed period of time, within his jurisdiction, to the inventor or initiator of a new process, a new product or a new source of supply. . . . The justification being to secure a fair return on the enterprise, ingenuity and financial outlay expended to perfect the article and put it on the market. (Armstrong 1990)
Aside from the fact that today such copyright privileges would be granted by statute (rather than “a ruler”), there is no apparent reason why such rights should not be called “privilege rights” in the twenty-first century. “Their character is thus aptly described: they privilege a person to do something that third parties are not allowed to do, even though the nonowners factually could do it without violating the law. The problem of such a right and the requirements for its exceptional justification are emphasized by its designation as a privilege” (Peukert 2021: 149).
From a pragmatic perspective, switching to a permitted privilege rather than property right conceptualization of copyright has several advantages.30 The most important one is to restore the public's right to access culture to its rightful place when it comes to the balancing act IP law was supposed to enable. Whereas the holder of a property right may see the expansion of the scope of her exclusive rights as a natural quest (constrained only by the market forces that delineate the scope of others’ rights), to seek to expand the scope of a privilege would be a politically loaded endeavor that is dependent upon regulatory change. This would therefore demand engagement with—and democratic debate about—the justifications for the types of protections needed for different kinds of content producers.
The contrast between the above proposal and IP law's direction of travel has only been made starker by recent generative AI-related developments. The international rush to clear any regulatory roadblocks that may inhibit domestic generative AI development (with its perceived macroeconomic competitive advantages)31 is such that the views of the general public—and thereby the fundamental right to access culture—are often introduced as an afterthought at best. In this respect, Jessica Litman's (1996: 23) scathing remarks—made in the US context—haven't lost their pertinence today:
We have never had a mechanism for members of the general public to exert influence on the drafting process to ensure that the statute does not unduly burden private, non-commercial, consumptive use of copyrighted works. The design of the drafting process (in which players with major economic stakes in the copyright sphere are typically invited to sit down and work out their differences before involving members of Congress in any new legislation) excludes ordinary citizens from the negotiating table.
Reinvigorated democratic debate on the rationale underlying copyright protections is crucial if we are to successfully address the need for a more granular approach to copyright. While this need predates the advent of generative AI, the latter tools might be the final nail in the coffin of a legal regime that is now so ill-fitted to our data-powered digital lives that it undermines their long-term sustainability.
In what follows, I consider the extent to which a switch to copyright as permitted privilege needs to go hand in hand with bottom-up empowerment infrastructure if it is to contribute to a socially sustainable data ecosystem.
Vanished or Silenced? The “Canalization” of the Public Domain and the Need for Bottom-Up Data Empowerment
In a land dependent on concomitant navigation, fishing, and irrigation to sustain its population, a river will be at the center of efforts to balance competing obligations, privileges, and responsibilities. Given the likely changes in that community's circumstances, these efforts will be constantly renewed from the bottom up. The success of the ongoing (re)articulation of the norms governing river-related practices will be tied to several factors. Two of them are of particular relevance to data rivers.
First, as we have seen, is the extent to which fundamental rights—such as the right to water, the right to access culture, or various types of privacy rights—are respected. Second is the extent to which river-related practices are able to keep constructively integrating disagreement among river-dependent communities. To understand how this bears on data rivers, it's helpful to reconsider both the merits and limits of the data-as-a-river analogy.
As noted earlier, data, unlike fresh water, is a nonrival, intangible good. In principle, data can be enjoyed at time t by user x without preventing simultaneous consumption by others. However, the “first possession” logic enabled by the property law framework underlying IP erects “hard structures” (akin to canals or, worse, dams) that either remove access to data or make such access conditional upon payment, thereby leading to artificial scarcity. This phenomenon reinforces the merits of the data-as-a-river analogy. Once a canal or dam has been built, the integration of tensions between the aspirations of different river-dependent communities evaporates as a concern. The communities whose aspirations diverge are either submerged or deprived of their livelihoods.
The situation today is akin to a once river-rich land that has been frantically built up to power digital mills. While pockets of public domain/open-flowing data rivers remain, the dominance of the property rights framework is such that we find ourselves at a loss: “What is the public domain? We know ‘of’ it, but do we know exactly what it is made of or where to find it in the law. . . . Yet we tend to agree that a healthy public domain is necessary. We also tend to agree that the public domain is the ‘other side of the coin’ that holds the copyright framework together. So why is the public domain underrepresented” (Pavis 2019). Today, a growing number of legal scholars (Benabou and Dusollier 2007; Greenleaf and Lindsay 2018) are intent on breaking what has been referred to as “the silence surrounding the public domain” that has led to “every piece of knowledge [being] called intellectual property. . . . Thus, it has become impossible to articulate the non-protection of knowledge” (Peukert 2016). To address the root causes of this silence, however, will take more than legal scholarship.
The significance of the class action lawsuit against Microsoft, GitHub, and OpenAI that was mentioned earlier cannot be grasped if one only focuses on the (non)compliance with the original software's license terms.32 At stake is a more fundamental question: Can the reciprocity expectations at the root of the open-access/open-source movement be given due weight? For many of the creators who have chosen to share their work, the use of various “open” licenses is currently the only way they may contribute to a political and moral ideal that is otherwise at odds with IP's direction of travel.
The moral and political aspirations that underlie data sharing in such cases do not mesh well with our current data governance frameworks. Mostly structured around an economically motivated understanding of data sharing, our legal frameworks currently only distance themselves from this transactional model if and when they seek to address the vulnerabilities that stem from the sharing of personal data (Boyd 2008). The need to impose constraints on the unfettered exploitation of personal data became apparent in light of the harms that stem from such exploitation. Never before has “the self we aspire to be” been constrained to such an extent by our past (Hildebrandt 2018): not just in instrumental ways, as recognized by the right to be forgotten, but in a subtly nefarious manner. In an attempt to address the power imbalances between data-collecting entities and “data subjects,” a growing number of jurisdictions grant so-called personal data rights.
Important as they are, these data rights suffer from significant limitations (Nissenbaum 2010). In addition to enduring difficulties when it comes to ease of exercise (Ausloos and Dewitte 2018), the way these rights are structured remains focused on individuals in a way that underplays the extent to which personal data is relational (Cohen 2012).33 Most problematically, the consent requirement at the heart of these rights too often consists in a make-believe process, requiring data subjects to click here and there with no credible connection to any sort of empowerment.
Bottom-up data trusts (Delacroix and Lawrence 2019) have been put forward in an endeavor to address the above limitations. They allow groups of individuals to pool the rights they have over their data in a bid to regain agency over it. This data thereby becomes a lever for social and political change—thus promoting more-than-transactional understandings of data sharing.34 Moreover, data trusts’ bottom-up design opens the door to the development of participation habits that are far removed from the passivity encouraged by current, top-down approaches to data governance.
Today, there is a variety of data stewardship mechanisms (Ada Lovelace Institute and UK AI Council 2021) available to build the bottom-up infrastructure (Bühler et al. 2023) needed to facilitate a continuous, river-like rebalancing of data privileges, obligations, and responsibilities. The fostering of diverse perspectives through bottom-up, participatory data institutions can lend renewed vigor to a range of causes, from supporting health research to better use of natural resources. Reconstructing IP law in a way that rebalances the data ecosystem that makes generative AI possible is among those causes. This entails giving due weight to the fundamental right to access and build upon culture, in a way that is both individual and relational. It also entails addressing the precarious circumstances of many content creators through better protections and enhanced mechanisms for collective action.
Delineating IP Rights as Privileges: Unworkable without Ongoing, Bottom-Up Articulation
To delineate IP rights as permitted privileges rather than property rights will involve conversations that weigh the public's rights against the needs and vulnerabilities of a variety of original content creators. This cannot be achieved through top-down regulatory interventions alone. In a system that often short-changes those least able to lawyer up, a switch from propertarian frameworks to privileges will deliver on its potential only if it can rely on bottom-up infrastructure for those privileges’ continued rearticulation. Of the various forms this infrastructure can take, data trusts may come to play a salient role in several ways. Data trusts may enable groups of content creators to negotiate the terms and implementation of their permitted privileges and leverage those privileges to shape the ecosystem they are part of (1). Data trusts may also empower the public to have a louder voice when it comes to the delineation and valorization of the ecosystem underlying our data rivers (2).
(1) While the robustness of the safeguards built into the trust as a legal instrument makes it particularly well suited to the vulnerabilities involved in the sharing of personal data, the fuzzy border between personal and nonpersonal data (Finck and Pallas 2020) means that there are many cases where a data trust may be formed to handle the rights related to “hybrid” data. Since any positive rights can be placed in a trust, both personal data rights and IP rights could constitute the object of the data trusts mentioned above. One may, for instance, consider a data trust that represents a group of artists who task a data trustee with the exercise of both their personal data rights (Giannopoulou et al. 2022) and IP privileges. The aim could be to collectively delineate the conditions under which their personal, hybrid data or creative works might be shared, sometimes to further common goods.35 For example, sharing the letters, drafts, and sketches documenting the creative process might lead to a better understanding of creativity-supporting conditions. The aim could also be to negotiate the terms of their IP privileges in a way that does not discriminate between commercial and noncommercial uses, in a context where copyright has otherwise been recast “as an exclusive right of commercial exploitation” by default, as per Litman 1996: 23. Revisiting that default recasting is precisely the kind of advocacy work that bottom-up participatory infrastructure could make possible.
(2) As noted above, there is currently no reliable mechanism for members of the public to exert influence on the (re)balancing of the rights and responsibilities that shape our data rivers (in a striking parallel with the situation of many natural rivers).36 Far from being restricted to creators of copyright-protected works, bottom-up empowerment infrastructure such as the data trusts mentioned above can be of relevance to anybody interested in contributing to, or valorizing, the ecosystem underlying our data rivers. Sometimes this contribution will be indirect. When the members of a data trust task their data trustee with monitoring data sharing agreements aimed at contributing their breathing or jogging patterns in real time, say, this data may not initially enter the public domain. But the data trustee may be tasked with making sure that any research published on the basis of this data is made available and reproducible without restrictions. At other times the contribution will be direct yet require conditional access arrangements to avoid some undesirable side effects: the sharing of data pertaining to endangered animals may need to be sheltered from poachers’ eyes, for instance.
Conclusion
In an influential article, Mark Lemley and Bryan Casey (2021) worried about the extent to which “copyright law [would] allow robots to learn.” This concern is misplaced. The salient question today is whether the fragile data ecosystem that makes generative AI possible can be rebalanced through timely intervention. As I have shown, the three main threats to this ecosystem are comparable in kind to the threats currently affecting rivers across the globe.
First among these, I have argued, are disregarded reciprocity expectations: just as the fundamental human right to water is only possible if reasonable use and reciprocity constraints are imposed on the economic exploitation of rivers, so is the fundamental right to access culture, learn, and build upon it only possible under such constraints. It is that right—and the moral aspirations underlying it—that has led millions to share their creative works under open licenses. Generative AI tools would not have been possible without access to that rich, high-quality content. Yet few of those tools respect the reciprocity expectations without which the Creative Commons and open-source movements cease to be sustainable.
A second concern relates to ecosystem cultivation obligations: just as water law reform is now taking on board the need to empower river-dependent communities, so a robust empowerment infrastructure is necessary to revitalize and sustain our data ecosystems. Such an infrastructure can provide the most vulnerable contributors with enhanced mechanisms for collective representation; these include artists and novelists as well as the many who aspire to share data to facilitate a range of common goods. This bottom-up infrastructure is also vital to fostering renewed democratic debate about the kind of protections needed to incentivize the ongoing production of creative works. Answers to these questions are unlikely to be uniform across the board: they will need to take into account the different circumstances of various types of content creators. They will also need to be adaptable.
Finally, communities must collaborate to prevent the disregard of nonpollution obligations: the absence of internationally coordinated standards to facilitate the systematic identification of AI-generated (or human-generated) outputs opens the door for the rapid colonization of content found online. Aside from the resulting homogenization, mis- and disinformation hazards, this type of pollution is also a threat to the long-term quality of generative AI tools themselves.
For data to fare better than freshwater on any of these fronts will take brave and expeditious action. It has taken centuries for water law reforms to move away from the riparian doctrine in favor of a permission-based system that incentivizes innovative solutions to sustainable water management. Balancing the respective needs of a variety of stakeholders while preserving the underlying ecosystem cannot be achieved through top-down intervention alone (and even less so within court forums). This article argues that for the reconstruction of copyright as a permitted privilege—rather than a property right—to bear fruit and contribute to a socially sustainable data ecosystem, it will need to be supported by robust participatory infrastructure. Only then do we stand a chance of leveraging generative AI tools in the service of the many futures we have yet to conjure up.
Acknowledgments
I am grateful to Boaz Barak, Jacob Bassiri, Beatriz Botero, Neil Lawrence, Vinay Narayan, Eli Papa, Angie Raymond, Chen Zhu, an anonymous reviewer, and the editor of this journal, Lauren Goodlad, for their helpful comments. I also benefited from excellent feedback from the Bellairs Invitational Workshop on Time, Input, and Action Abstraction in Reinforcement Learning and a presentation at the ELLIS-LAION Workshop on Foundation Models (University of Tübingen). The research leading to this work was funded by a research fellowship from Omidyar Networks.
Notes
According to Shumailov et al. 2023, once “LLMs contribute much of the language found online,” it will be impossible to train diverse models on this content (see also Lauren M. E. Goodlad and Matthew Stone's introduction to this special issue).
The Universal Declaration of Human Rights, adopted by the United Nations General Assembly in 1948, recognizes the right to access culture as a fundamental human right. Article 27 of the Universal Declaration states that “everyone has the right freely to participate in the cultural life of the community, to enjoy the arts and to share in scientific advancement and its benefits.” This right has been further elaborated and defined in subsequent international human rights instruments, such as the International Covenant on Economic, Social, and Cultural Rights, which recognizes the right of everyone to take part in cultural life and to enjoy the benefits of scientific progress and its applications.
“In a world of books and other pre-digital technologies . . . ordinary acts of reading did not result in any new copies. . . . The boundary between authors and readers was clear and simple: Authors made copies regulated by the copyright system, while readers did not make copies and existed outside its formal bounds” (Grimmelmann 2016).
Examples vary, from a platform like Wikihow that allows users to write and edit content on a variety of topics, to open-source software projects (the web browser Mozilla, for instance, relies on contributions from a community of programmers).
In 2023 Japan announced a policy that would temporarily exempt companies from copyright infringement penalties if copyrighted works were found to be used to train their large language models, while other countries (from Singapore to Israel) are vying to introduce permissive regimes when it comes to the use of copyrighted data to train these tools.
Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on Copyright and Related Rights in the Digital Single Market and amending Directives 96/9/EC and 2001/29/EC [2019] OJ L130/92 (CDSM). For a general, critical analysis see Dusollier 2020. Article 3 of the CDSM also exempts acts of extraction from databases that would otherwise infringe “sui generis” database rights (with the same provisos). Article 2(2) of the CDSM Directive defines “text and data mining” as “any automated analytical technique aimed at analyzing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations.”
Article 2(3) defines a “cultural heritage institution” as “a publicly accessible library or museum, an archive or a film or audio heritage institution.” According to Article 2(1), a “research organisation” is either a not-for-profit entity or an entity tasked by a Member State with a public service research mission. Note that the directive addresses the question of public-private research collaborations in recital 11 (they are not excluded from benefiting from the article 3 exception).
Article 4(3) applies only on condition that right holders have not expressly reserved their rights “in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” According to Recital 18, “It should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service. . . . In other cases, it can be appropriate to reserve the rights by other means, such as contractual agreements or a unilateral declaration.”
Article 53(1)c of the EU AI Act requires “providers of general-purpose AI models” to “put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790.” Recital 105 of the EU AI Act also states that “where the rights to opt out has been expressly reserved in an appropriate manner, providers of general-purpose AI models need to obtain an authorisation from rightsholders if they want to carry out text and data mining over such works” (https://artificialintelligenceact.eu/recital/105/; accessed May 27, 2024).
“This stratification of rules enacted in different stages of the process of EU copyright harmonization has the combined effect of absorbing a great deal of previously unprotected knowledge, such as mere facts and data, into low-original (or non-original in the case of the SGDR) works protected against most forms of indirect, incidental and transient reproductions” (Margoni and Kretschmer 2022: 687).
When it was introduced in 2014, the United Kingdom's copyright exception was innovative; since then other jurisdictions have introduced more permissive exceptions. The UK's current exception is to be found in s.29A of the Copyright, Designs and Patents Act 1988 (CDPA): it targets noncommercial research of copyright works to which a person already has lawful access. This exception cannot be overridden by contractual terms. One challenge is that it can be unclear at what point “noncommercial” turns into “commercial” use.
Angela Mills Wade to Kwasi Kwarteng, July 25, 2022, https://www.publishers.org.uk/wp-content/uploads/2022/08/PCF-TDM-Letter-as-of-080822.pdf. As things stand, several UK-based publishers participate in the Copyright Clearance Center, which is meant to allow prospective users of content for commercial purposes to obtain XML-formatted content on demand. See Copyright Clearance Center, https://copyright.com (accessed May 27, 2024).
Sega Enters. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992). See also Jockers, Sag, and Schultz 2013, referring to A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630, 645 (4th Cir. 2009); Perfect 10, Inc. v. Amazon.com, Inc., 508 F.3d 1146, 1168 (9th Cir. 2007); Sony Computer Entm't, Inc. v. Connectix Corp., 203 F.3d 596, 609 (9th Cir. 2000); Sega Enters. Ltd. v. Accolade, Inc., 977 F.2d 1510, 1527–28 (9th Cir. 1992).
“The story told above about transformative fair use is the story of how the courts used fair use to shield robotic reading from liability that would otherwise attach. Exempting robots entirely would have led to the White-Smith problem: uses indisputably intended for human eyes would escape scrutiny. The combination of broad infringement and broad fair use draws the line instead between robot-only and robot-plus-human uses” (Grimmelmann 2016: 670).
White-Smith Music Publ'g Co. v. Apollo Co., 209 U.S. 1, 12 (1908). This anthropocentric framework prevailed until the Copyright Act of 1909 acknowledged “any form of record in which the thought of an author may be recorded and from which it may be read or reproduced.” Copyright Act of 1909, Pub. L. No. 60–349, § I(e), 35 Stat. 1075, 1075–76.
Peter Henderson et al. (2023) argue that legal frameworks could “more explicitly consider safe harbors” when “strong” technical tools such as these are used.
In a recent court battle focused on “authorized access,” the Ninth Circuit found that hiQ's scraping of public LinkedIn member profile data was lawful under the Computer Fraud and Abuse Act, even after its access was revoked by LinkedIn. HiQ Labs, Inc. v. LinkedIn Corp., No. 17–16783 (9th Cir. Apr. 18, 2022); see also Ryanair DAC v. Booking Holdings Inc., No. 20–01191 (D. Del. Oct. 24, 2022). While hiQ prevailed on the Computer Fraud and Abuse Act's “unauthorized access” angle, a later ruling held that it had breached LinkedIn's user agreement (notably due to its creation of fake accounts).
Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183 (2021) (henceforth Google).
“We assume, for argument's sake, that the material was copyrightable”: Google, 141 S. Ct. 1183 at 1190. For a powerful critique outlining why “the Supreme Court would have been justified in overturning the Federal Circuit's ruling on copyrightability grounds,” see Samuelson and Crump 2021.
The API's declaring code was deemed more functional than creative in nature and was, “if copyrightable at all, further than are most computer programs (such as the implementing code) from the core of copyright”: Google, 141 S. Ct. 1183 at 1202.
The court held that Google's use of the API code to create a new platform that would run on Android smartphones (instead of desktops) was “highly creative and innovative”: Google, 141 S. Ct. 1183 at 1204.
The Court rejected the Federal Circuit view that Google took more of the Java code than needed. Google copied 11,500 lines of declaring code, or 0.4 percent of the entire software code. This was deemed reasonable: Google, 141 S. Ct. 1183 at 1205.
The Court judged that “the uncertain nature of Sun's ability to compete in Android's marketplace, the sources of its lost revenue, and the risk of creativity-related harms to the public” all weighed in favor of fair use: Google, 141 S. Ct. 1183 at 1208. See, however, the skepticism expressed by Justice Thomas on the latter front (his dissenting opinion points out that the API version at issue was only used in 7.7 percent of Android devices): Google, 141 S. Ct. at 1215 (Thomas, J., dissenting).
Google, 141 S. Ct. 1183 at 1205.
DOE 1 et al v. GitHub, Inc. et al (N.D. Cal. 2022).
Such as the Creative Commons Attribution CC-BY license. In contrast, the MIT license does not require attribution per se: it grants permission to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software without restriction, provided that the copyright notice and permission notice appear in all copies. On attribution as a moral right, see Zhu 2014.
A note of caution in relation to the automated “filtering” strategy: copyright enforcement algorithms—models trained to sift out copyright-protected content—are increasingly seen as a necessity, given the vast scale of “copyright policing” challenges encountered by any tool built on the basis of data scraped from the internet. The logic behind this necessity argument is blind to the extent to which any automated, algorithmic translation of fair use will transform it beyond recognition: instead of a dynamic, contextual, and value-loaded test, fair use then becomes a set of deterministic, “if that, then . . . ” lines of code. Aside from the unavoidable clumsiness and oversimplification, such algorithmic translation would also defeat the very point of the test. Dan L. Burk (2019: 306) puts the point eloquently:
“Failure to incorporate fair use into copyright enforcement algorithms likely means the de facto loss of the fair use exception, making it available only as a rarefied defense to the few litigants who can afford to persevere until favorable judicial review. However, the alternative of attempting to incorporate fair use into enforcement algorithms threatens to degrade the exception into an unrecognizable form. Worse yet, social internalization of a bowdlerized version of fair use deployed in algorithmic format is likely to become the new legal and social norm.”
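To make the worry concrete, here is a deliberately crude caricature (the factors are the four statutory ones, but the thresholds, weights, and cut-off are invented for illustration) of what an “if that, then . . .” translation of fair use looks like once the contextual, value-loaded judgment Burk describes has been stripped out:

```python
def fair_use_filter(is_commercial: bool, is_transformative: bool,
                    work_is_factual: bool, fraction_copied: float,
                    market_harm_score: float) -> bool:
    """Caricature of fair use as deterministic rules; every threshold is invented."""
    score = 0
    score += 0 if is_commercial else 1            # factor 1: purpose and character
    score += 1 if is_transformative else 0        #   (transformativeness folded in)
    score += 1 if work_is_factual else 0          # factor 2: nature of the work
    score += 1 if fraction_copied < 0.1 else 0    # factor 3: amount and substantiality
    score += 1 if market_harm_score < 0.5 else 0  # factor 4: effect on the market
    return score >= 3                             # an arbitrary cut-off stands in for judgment

# Anything the rule rejects is simply blocked upstream, leaving fair use
# available only to those able to litigate after the fact.
print(fair_use_filter(is_commercial=True, is_transformative=True,
                      work_is_factual=True, fraction_copied=0.004,
                      market_harm_score=0.3))
```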
A variety of open-source licenses are available: they tend to give users the right to reproduce, distribute, and modify the software, while the licensor's obligation to provide the source code is matched by the user's obligation to distribute any modifications under the same licensing regime. The latter obligation is at the heart of the so-called viral effect of free software (Nadan 2001).
“In recent years, contract has been considered by many to be gradually transforming copyright into an expansive monopoly over cultural and informational goods. By contract, copyright owners sometimes impose provisions that purport to override copyright, such as clauses that prohibit or regulate the exercise of copyright exceptions or fair use as to the work or limit the first-sale doctrine. Contracts can also serve to bind users to not reproduce or disseminate some content that might not be protected by copyright” (Dusollier 2006).
“It is ironic: who would have thought that copyright industries could gain something from the copyleft movement?” (Dusollier 2006).
From a theoretical perspective, a re-delineation of copyright that switches to a privilege that entitles its holder to exclude others from reproducing or creating derivative works, rather than a right to do so, is provocative. If Wesley Newcomb Hohfeld (1923) were to characterize the bundle of legal relationships at the heart of copyright, he would likely characterize the most fundamental one as a right to exclude others from reproducing, etc., with a corresponding privilege to use the copyrighted work oneself, a power to transfer or authorize its conditional distribution, etc. To switch from a right to a privilege to exclude is counterintuitive until one gives due primacy to the public's fundamental right to access and build upon culture.
See note 5.
DOE 1 et al v. GitHub, Inc. et al (N.D. Cal. 2022), outlined above. This is one among many recently launched class actions targeting a variety of generative AI tools.
When an individual consents to sharing pictures, location, or activity patterns, this will often reveal things about others, too. Even when it doesn't, the social and communal impact of such sharing decisions cannot be overestimated.
Such collective pooling of resources in a bid to acquire a political voice is not unprecedented: when, in nineteenth-century England, the right to vote was conditional upon land ownership, land societies were formed (a strip of land was acquired and split into parcels, ownership of which gave rise to the right to vote).
Some see article 4 of the EU Directive on Copyright in the Digital Single Market (discussed above), which allows copyright holders to oppose AI training for commercial purposes (this opposition is meant to be expressed through metadata/open protocols conveying their terms), as presenting “a huge opportunity for creators to build new collective structures” (Keller 2023). See also Strowel 2023. Along this line, see, for instance, the “spawning AI” initiative, which seeks to empower individual artists to opt out (or in) of various training data sets: the initiative maintains a registry of “non-consenting data” against which developers may check and validate their data sets via Spawning's own API. See Spawning, https://spawning.ai (accessed September 30, 2023).
Campaigns to raise awareness and reempower river-dependent communities are, however, starting to gain momentum (O'Donnell 2020).