The Evolving Definition of Open Source AI: Challenges and Debates

The precise criteria for classifying a large language model as open source remain a subject of considerable discussion. This ongoing dialogue underscores the complexity of applying traditional open-source principles to the rapidly evolving field of artificial intelligence, particularly where the transparency and accessibility of training data are concerned.

In October of last year, the Open Source Initiative (OSI) put forth its criteria for what would qualify an artificial intelligence model as open source. Stefano Maffulli, the executive director of the OSI, stated at the time that this definition was intended to initiate a broader conversation. This objective has certainly been met. While developers have largely adopted a pragmatic stance regarding open-weight models and their associated licenses, many felt the OSI's definition fell short, advocating for a more assertive interpretation. This sentiment is especially pronounced concerning the data employed in model training; the OSI's framework mandates a detailed description of this data but does not require its direct availability.

During the Open Source Summit in Amsterdam, Maffulli elaborated on the current status of this discussion. He highlighted that the conversation has not only begun but that the definition itself has become a strategic instrument for the OSI to engage with policymakers, including the European Commission. This engagement is particularly crucial given that the EU AI Act is slated for full implementation in August 2026.

Maffulli remarked, "It has served as an invaluable instrument for us in our discussions with the European Commission—and to some degree in the United States and Washington—to influence the interpretation of the AI Act and the EU's directives for general-purpose AI." He further explained that the AI Act aims to streamline processes and grant preferential access to open-source developers and academic researchers. These directives, which constitute the European Commission's interpretation of the AI Act, delineate the responsibilities of providers of "general-purpose AI models," a category that encompasses virtually all large language models. The Act and its guidelines explicitly provide exemptions for open-source AI models. Maffulli pointed out that these provisions align with all the core tenets embedded within the OSI's definition of open-source AI.

He emphasized, "Essentially, to overcome these barriers, transparency is paramount. Therefore, it is essential to be very explicit and precise about the constituents of the training dataset." He further underscored that politicians recognize the inherent challenges in making entire training datasets fully accessible. "They fully grasp the fundamental issue. One does not possess the copyright or ownership of the data being distributed. Consequently, they are well aware of the considerations that led to the revision of the Copyright Act, which introduced exceptions for text and data mining. These text and data mining exceptions explicitly permit the aggregation of data, the comprehensive scanning and crawling of the web, and unrestricted use of such data. However, once the analysis is complete, the data must be discarded. It is not intended for permanent retention. This particular point resonates strongly and proves effective."

Maffulli noted that a significant portion of the work with the broader open-source community has revolved around clarifying the open-source AI definition. He cited popular models like Qwen, which may be open-weight and operate under a permissive, OSI-approved license. However, a developer using Qwen would typically lack the tools, source code, and comprehensive data required to independently replicate the development process undertaken by the Qwen team. Maffulli acknowledged that the current OSI definition establishes a rigorous benchmark, and as a result, very few existing models currently meet its full requirements. The Open Source Initiative has historically avoided being overly prescriptive; it does not function as a standards body empowered to levy penalties. While there are certainly critics who are quick to find fault, the definition of open source, in general, emerged from practical application and the experiences of practitioners. Maffulli believes the open-source AI definition will evolve along a similar trajectory, adapting as technology, practices, and the legal landscape change, the last being a dimension that was not a primary concern two decades ago but is now indispensable.

A particular area of current interest for Maffulli is the composition of datasets utilized for training new models. He observes that many organizations are now striving to construct datasets that are more resilient against legal challenges, though he cautions against using terms like "safe from the copyright perspective," acknowledging that true safety in this regard is elusive and a continuous learning process. Numerous companies are currently facing substantial difficulties in assembling extensive datasets from the public internet, which he described as "diminishing." Common Crawl, recognized as the largest repository of web crawl data, is encountering obstacles in expanding its dataset. This challenge is partly due to the increasing proliferation of AI-generated content on the web, which dilutes data quality, and also because many prominent websites and publishers are requesting the removal of their data.

This situation highlights an increasingly pressing issue in the relationship between AI model developers and online publishers. The efficacy of these models relies heavily on high-quality data, much of which originates from news organizations or large platforms such as Reddit and Stack Overflow. These sites, in turn, depend on search engines like Google to direct readers to their content, thereby enabling monetization and the continued production of valuable material. However, the emergence of large language models as viable alternatives to traditional search engines is rapidly disrupting this symbiotic relationship, as users rarely click through the citations an LLM provides. Maffulli's perspective on this matter may not find favor with publishers. He asserted, "If we aspire to foster public AI, we must safeguard the public web. We need to protect texts and remove them from the purview of publishers. I believe there is no alternative but to employ a strategy akin to the Google Books precedent. Publishers should not have the ultimate say. The principle that proved effective with Google Books must also apply to AI training, in exchange for a public AI. If one wishes to strike a clandestine agreement with OpenAI, so be it. However, if I represent another AI entity – for instance, the Allen Institute for AI – and my aim is to develop public AI, then, frankly, it seems equitable to some extent." He contends that the current relationship between AI firms, publishers, and initiatives like Common Crawl is imbalanced. Nevertheless, as of now, neither the necessary legal framework (because, he suggests, copyright is ill-equipped to handle such scale) nor the technical infrastructure exists to rectify this disparity. Furthermore, it could be argued that for publishers, mere access to public AI models trained on their data may not provide sufficient incentive to make their data openly accessible.

"We have a considerable amount of work ahead if we genuinely wish to establish public datasets – datasets that are shareable, expandable, and capable of supporting the development of large, GPT-style language models. We must address these challenges. We need to discuss governance. Our current mechanisms for proving ownership are inadequate."
