Topstore

Apple's AI Training Data Sources Spark Controversy

Apple has long maintained that the data used to train Apple Intelligence was legally obtained and paid for, but reports indicate that one of its data suppliers is accused of using YouTube video subtitles without permission.

All generative AI relies on large datasets, which are used to train Large Language Models (LLMs). The legitimacy of these data sources frequently sparks controversy, which is why Apple has emphasized that its data sources are ethical, saying it has paid millions to publishers and licensed images from photo library firms.


However, according to Wired, one data provider whose work Apple used appears to have been less scrupulous about its sources. EleutherAI reportedly created a dataset called the Pile, which Apple used for LLM training.


One part of the Pile, known as YouTube Subtitles, consists of subtitles downloaded from YouTube videos without permission. This potentially violates YouTube's terms of service and may also raise copyright issues.


In addition to Apple, other companies such as Anthropic have used the Pile dataset. Jennifer Martinez, a spokesperson for Anthropic, stated that there is a difference between using YouTube subtitles and using the videos themselves. She pointed out that YouTube's terms cover direct use of its platform, which is distinct from the use of the Pile dataset.


Salesforce also confirmed the use of the Pile dataset in its AI model for academic and research purposes. Salesforce's vice president of AI research stressed that the Pile dataset is "publicly available." However, developers at Salesforce found that the Pile dataset includes profanity and biases against gender and certain religious groups.


So far, only Salesforce and Anthropic have commented on their use of the Pile dataset. Apple, Nvidia, Bloomberg, and Databricks are also known to have used it, but they have not responded to requests for comment.


Proof News's investigation revealed that the Pile dataset includes subtitles from 173,536 YouTube videos from over 48,000 channels, including seven videos by Marques Brownlee (MKBHD) and 337 from PewDiePie. Proof News also developed an online tool to help YouTubers check if their work has been used.


YouTube subtitles are not the only material said to have been gathered without permission. Wikipedia content and European Parliament documents are reportedly included in the Pile as well. So are thousands of Enron staff emails, which academics have previously used for statistical analysis.


Despite Apple's emphasis on training its generative AI legally and ethically, these reports suggest that Apple Intelligence may have been trained on YouTube subtitles used without authorization.


The controversy highlights the complexity and legal risk surrounding AI training data sources, and it may prompt further discussion and regulation of data usage within the industry.