The nonprofit Software Heritage has launched a new initiative that aims to provide the world’s largest repository of ethically sourced code for training AI. The large language models (LLMs) that underlie chatbots and coding assistants are trained on vast reams of data scraped from the Internet. But AI developers rarely disclose what their training datasets contain, which makes it hard to reproduce results, to determine whether models were trained on data from benchmark tests, and for developers to control whether their code is used to train AI.
Software Heritage’s broader goal is to create a freely accessible archive of the world’s digital heritage. The new project, called CodeCommons, will provide access to those willing to sign up to ethical principles aimed at boosting transparency and accountability in AI training. The group has secured €5 million from the French government over the next two years to build the supporting technology. Software Heritage first published ethical principles for AI developers keen to use its archive in October 2023.