Apple, NVIDIA And Anthropic Reportedly Used YouTube Transcripts Without Permission To Train AI Models


Some of the world’s largest technology companies trained AI models using a dataset that included unauthorized transcripts of more than 173,000 YouTube videos, according to new research from Proof News. The dataset, created by nonprofit EleutherAI, contained transcripts of YouTube videos from over 48,000 channels and was used by companies like Apple, NVIDIA, and Anthropic. The findings of the investigation highlight an uncomfortable truth about AI: the technology is built on data siphoned from creators without their consent or compensation.

The dataset does not include any YouTube videos or images, but it does include video transcripts from some of the platform’s biggest creators, such as Marques Brownlee and MrBeast, as well as major news publishers, such as The New York Times, the BBC, and ABC News. Subtitles from Engadget videos are also part of the dataset.

“Apple sources data for its AI from multiple companies,” Brownlee wrote in a post on X. “One of them scraped a ton of data and transcripts from YouTube videos, including mine,” he added. “This is going to be a long-term, evolving problem.”

Apple sources data for its AI from multiple companies

One of them scraped a ton of data and transcripts from YouTube videos, including mine.

Apple technically avoids the “flaw” because it doesn’t scrape.

But this will be an evolving issue for a long time. https://t.co/U93riaeSlY

— Marques Brownlee (@MKBHD) July 16, 2024

A Google spokesperson pointed Engadget to previous comments by YouTube CEO Neal Mohan, who said that companies that use YouTube data to train AI models violate the platform’s terms of service, which remain in effect. Apple, NVIDIA, Anthropic, and EleutherAI did not respond to requests for comment from Engadget.

Until now, AI companies have not been transparent about the data they use to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the origins of the training data for Apple Intelligence, the company’s proprietary generative AI that will be included in millions of Apple devices this year.

In particular, YouTube, the world’s largest video repository, is a treasure trove of audio, video and images, as well as transcripts, making it an attractive dataset for training AI models. Earlier this year, OpenAI’s Chief Technology Officer Mira Murati avoided questions from The Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video generation tool. “I won’t go into the specifics of the data that was used, but it was publicly available or licensed data,” Murati said at the time. Google CEO Sundar Pichai has also said that any company that uses YouTube data to train AI models would be violating the platform’s terms of use.

If you want to check whether the dataset includes subtitles from your own YouTube videos or your favorite channels, try Proof News’ search tool.

Update, July 16, 2024 3:17 PM PST: This story has been updated to add a statement from Google.