Digital parasitism: The use of pirated books in AI training datasets

DOI: 10.1177/13548565261464938 ISSN: 1354-8565

Terje Colbjørnsen, Tsehaye Haidemariam, Michael Preminger

This article investigates the intersection of generative Artificial Intelligence (AI) and the book industry, focusing on the use of copyrighted literary works for training large language models (LLMs). As LLMs require enormous volumes of content for training, published books and other media have become valuable sources of data. Many training datasets appear to have been assembled without authorization or transparency, raising acute legal, ethical and cultural questions. In this article, we provide a detailed empirical analysis of the Scandinavian content within the Books3 dataset, used to train multiple major AI systems, characterizing it by publication year, publisher, authorship, genre, and language. As comparatively small language markets situated outside of the main hubs for AI development, the Scandinavian countries provide a relevant context for exploring the local implications of global AI deployment. The findings are discussed through a theoretical lens drawing on Michel Serres’s concept of the “parasite.” We argue that the inclusion of Scandinavian books in Books3 is part of a multidimensional ecology of parasitic exchange. This framework interprets the relationship between AI developers, shadow libraries, and publishers as a cascade of information appropriation, signifying a reconfiguration of cultural production and ownership in the age of generative AI.
