Acquiring enough data to train their artificial intelligence (AI) models has become a priority for major technology companies such as Google, OpenAI, and Meta. As AI technology has developed, the need for vast amounts of high-quality data has grown, leading these businesses to pursue unusual and occasionally contentious methods of data collection, according to a recent New York Times investigation.
Findings of the Report
OpenAI trained its large language model GPT-4 on more than a million hours of YouTube videos, according to a report by The New York Times. The report says OpenAI used a speech recognition program called Whisper to transcribe the audio from those videos, producing new conversational text that served as training data for GPT-4. Because YouTube, which is owned by Google, forbids the use of its videos for independent apps, this practice raised questions about compliance with YouTube's rules.
Neal Mohan, the CEO of YouTube, stated in a WSJ interview that he did not know whether OpenAI had trained its new video tool on any YouTube data. If OpenAI had trained the new model on YouTube footage, he said, that would be a problem.
Google and Meta have also been found to rely on contentious data sources. The article alleges that Google transcribed YouTube videos for AI training, possibly in violation of copyright law, and altered its terms of service in order to access more user-generated content. Meta discussed purchasing the publisher Simon & Schuster to gain access to a sizable book library, and considered using copyrighted material from the internet despite the ethical and legal ramifications.
Data Volume and AI Performance
Training AI models on larger amounts of data considerably improves their performance, especially when it comes to producing text, images, sounds, and videos that resemble human-made content. Because demand for high-quality data is so great, some experts believe that digital businesses may run out of usable internet data by 2026.
In response, OpenAI said that each of its AI models is trained on its own dataset in order to stay competitive in research. Google acknowledged using some YouTube content to train AI models under agreements with content creators, and stated that it uses data from its office apps only for experimental purposes. Meta highlighted its efforts to incorporate AI into its services, which draw on billions of publicly posted photos and videos.