Microsoft Embarks On Research To Trace AI Training Data
Microsoft is embarking on a research project aimed at estimating the influence of specific training examples on the outputs of generative AI models, including text, images, and other media.
This initiative was revealed in a job listing from December, recently recirculated on LinkedIn.
Focus on Training-Time Provenance
The job listing, which seeks a research intern, outlines the project’s goal to demonstrate that AI models can be trained in a way that allows the impact of particular data — such as photos and books — on their outputs to be “efficiently and usefully estimated.”
The listing notes, “Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this.
“[One is,] incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want in the future, assuming the future will surprise us fundamentally.”
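The listing doesn't say how Microsoft intends to estimate influence, but one published approach to the general problem is TracIn-style influence estimation, which scores a training example by the dot product of its loss gradient with a test example's loss gradient. The sketch below is purely illustrative — a toy linear model in NumPy, not Microsoft's method — to show what "estimating the influence of particular data" can mean in practice.

```python
# Hypothetical illustration (not Microsoft's method): TracIn-style
# first-order influence on a toy linear regression model.
import numpy as np

def grad_linear_mse(w, x, y):
    """Gradient of the squared error 0.5 * (w·x - y)^2 w.r.t. weights w."""
    return (w @ x - y) * x

def influence(w, train_example, test_example):
    """Score a training example by grad(train) · grad(test).
    A larger positive value means a gradient step on that training
    example would reduce the test example's loss more."""
    g_train = grad_linear_mse(w, *train_example)
    g_test = grad_linear_mse(w, *test_example)
    return float(g_train @ g_test)

w = np.array([0.5, -0.2])
train_a = (np.array([1.0, 0.0]), 1.0)   # shares features with the test point
train_b = (np.array([0.0, 1.0]), -1.0)  # orthogonal features
test = (np.array([2.0, 0.0]), 2.0)

print(influence(w, train_a, test))  # nonzero: this example influences the test output
print(influence(w, train_b, test))  # zero: orthogonal gradients, no influence
```

Real provenance systems would need to make such estimates tractable at the scale of modern generative models, which is presumably what "efficiently and usefully estimated" refers to.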
Legal Challenges in AI Development
Generative AI tools for text, code, images, videos, and music are at the center of numerous intellectual property lawsuits.
Many AI companies train their models on vast datasets sourced from public websites, some of which include copyrighted material.
While these companies often argue that fair use doctrine protects their data-scraping practices, creatives — including artists, programmers, and authors — largely disagree.
Microsoft itself is facing legal challenges, including a lawsuit from The New York Times, which accused Microsoft and OpenAI of infringing on its copyright by training models on millions of its articles.
Additionally, software developers have sued Microsoft, claiming its GitHub Copilot AI coding assistant was unlawfully trained on their protected works.
Existing Efforts in Data Compensation
Several companies are already exploring ways to compensate data contributors. AI model developer Bria, for instance, claims to “programmatically” compensate data owners based on their “overall influence.”
Adobe and Shutterstock also provide payouts to dataset contributors, though the exact amounts remain unclear.
However, most large AI labs have yet to establish individual contributor payout programs, opting instead for licensing agreements with publishers, platforms, and data brokers.
Some labs offer copyright holders the option to “opt out” of training, but these processes are often cumbersome and only apply to future models.
Potential Implications and Criticism
Microsoft’s project may ultimately serve only as a proof of concept, much like the tool OpenAI announced that would let creators specify how their works are included in training data.
Despite being announced nearly a year ago, OpenAI’s tool has yet to materialise. Critics suggest Microsoft’s initiative could be an attempt to “ethics wash” or preempt regulatory and legal challenges that could disrupt its AI business.
Fair Use and Industry Stances
The project comes amid broader debates over fair use in AI development.
Several leading AI labs, including Google and OpenAI, have advocated for weaker copyright protections to facilitate model training.
OpenAI has explicitly called for the U.S. government to codify fair use for AI training, arguing that such measures would reduce restrictions on developers.