In a revealing Wall Street Journal interview, OpenAI CTO Mira Murati was tight-lipped about the specific data sources used to train Sora, the organization's advanced AI video generator. Amidst growing scrutiny over AI training practices, Murati's reluctance to detail the origins of the data highlights the ongoing debate surrounding copyright and ethical AI development.
OpenAI CTO Evades Detailed Queries on Sora's Training Data Amid Copyright Concerns
AI companies' reluctance to disclose their training data has carried over to OpenAI's Sora, the company's upcoming text-to-video generative AI, which has demonstrated the ability to create lifelike videos.
In a video interview with the Wall Street Journal, OpenAI's current CTO and former CEO Mira Murati (she was CEO for two days when Sam Altman was temporarily removed) discussed the company's new technology. The interview was intended to showcase the benefits of Sora and build hype for the upcoming technology. It does that, but Joanna Stern of the WSJ did more than throw softballs; she also asked some difficult questions.
Before the interview, Stern provided OpenAI with some new text descriptions to be used to create videos for the conversation. Then, in a three-minute segment, she questions Murati about Sora's training set.
"Every time I watch a Sora clip, I wonder what videos this AI model learned from," Stern says. "Did the model see any clips of Ferdinand to know what a bull in a china shop should look like? Was it a fan of SpongeBob?"
While she asks these questions, clips from the animated film Ferdinand and the children's television show SpongeBob SquarePants appear side by side with Sora's work, making it difficult not to notice the similarities. The next question was, naturally, what data was used to train Sora?
"We used publicly available data and licensed data," Murati responds.
"So videos on YouTube?" Stern asked. "Videos from Facebook, Instagram? What about Shutterstock? I know you guys have a deal with them."
"I'm actually not sure about that. If they were publicly available, publicly available to use, there might be that data, but I'm not sure. I'm not confident about it," Murati said.
"I'm just not going to go into the details of the data that was used, but it was publicly available or licensed data."
Murati confirmed to Stern after the interview that the licensed data includes Shutterstock content, but her refusal to discuss the topic on camera is telling.
Ethical Quandaries: AI's Content Creation Sparks Copyright Controversy and Artist Concerns
PetaPixel reports that as impressive as generative AI is, the debate over how these companies create visual content, and the likelihood that doing so violates artists' copyrights, remains constant. There have been reports that the people behind AI image generators deliberately target specific artists in their training data under the guise of the work being "publicly available." Even when this is not the case, the ease with which photographers can recreate their own photos, and the fact that iconic images are just as simple to recreate, tells the story.
These AI systems have likely seen and been trained on those copyrighted images, which explains why they can so easily produce their own versions. However, speculation isn't necessary. Midjourney's founder admitted that its AI used a "hundred million" images as a training set without permission. OpenAI admitted that it is "impossible" to train AI without relying on copyrighted content.
That said, Murati is likely aware that using stolen content to train its AI is not something OpenAI wants to keep admitting, which is why she declines to answer Stern's question. Her refusal, however, makes it easy to argue that these companies care little about human artists' rights, and it demonstrates how far they will go to further their interests, regardless of the cost.
Photo: Levart_Photographer/Unsplash





