We have previously come across numerous reports claiming that OpenAI used YouTube content to train its text-to-video model Sora. Now it’s reported that companies like Apple, Nvidia, Anthropic, and more are also using the ‘publicly available data’ generated by users to train their AI models. Apparently, Apple used tens of thousands of YouTube videos with subtitles to train Apple Intelligence, which is against the platform’s content policy.

The news comes from an investigation by Proof News that was co-published with Wired.
According to the investigation, Apple and the other companies were using a dataset called YouTube Subtitles that included transcripts of 173,536 YouTube videos from over 48,000 channels. Videos in the dataset span from educational channels such as Khan Academy and MIT to news sites including The Wall Street Journal, to some of the platform’s top creators like MrBeast and Marques Brownlee.
According to Marques Brownlee, Apple technically avoids “fault” because it sourced its data from companies that used the YouTube transcripts rather than collecting the data directly. Nonetheless, the transcripts, in which the creators invested their time and money, still contribute to the AI models. Brownlee concluded by saying that this is going to be an evolving problem for a long time.
Proof News also created a tool that lets creators search for their content in the dataset. The YouTube Subtitles dataset does not include imagery from videos, but it does include some subtitles translated into other languages. The dataset was reportedly created by a non-profit research lab called EleutherAI, which focuses on promoting open science norms.
None of the companies mentioned above immediately commented on the matter. YouTube chief executive Neal Mohan has already made it clear in an interview that companies using YouTube videos to train their AI models is a “clear violation” of the platform’s policies.






