
In April 2025, researchers dropped a bombshell: an analysis suggested that OpenAI may have trained its AI models using copyrighted O’Reilly programming books—books typically hidden behind a paywall. The allegation, first reported by TechCrunch, ignited fresh debate around AI, copyright, and digital fairness. If true, it raises ethical and legal questions about how far companies can go in sourcing data for machine learning—and whether the ends justify the means when advancing technology.
So, did OpenAI actually scrape proprietary books to train ChatGPT? And if it did, is that okay?
Let’s unpack what’s really going on.
🤖 What Does It Mean to “Train AI” on Copyrighted Data?
Training a generative AI model like GPT-4 means exposing it to vast amounts of text (books, websites, code, and forums) so it can learn statistical patterns in language: which words, phrases, and ideas tend to follow which. By internalizing those patterns, the model becomes capable of tasks like answering questions, writing essays, and even generating code.
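To make that concrete, here is a minimal sketch of what "training on text" actually means: a toy next-token predictor in PyTorch. The tiny corpus, model size, and hyperparameters here are purely illustrative, not anything resembling OpenAI's actual pipeline; real systems use transformer architectures, billions of parameters, and trillions of tokens, but the underlying principle (adjust weights until the model predicts the next token well) is the same.

```python
# Toy next-token language model: a minimal sketch of "training on text".
# Illustrative only -- real models are transformers trained at vastly
# larger scale, but the objective (predict the next token) is the same.
import torch
import torch.nn as nn
import torch.nn.functional as F

corpus = "the model learns which token tends to follow which".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}

# Training pairs: each word is used to predict the word that follows it.
xs = torch.tensor([stoi[w] for w in corpus[:-1]])
ys = torch.tensor([stoi[w] for w in corpus[1:]])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # token -> vector
        self.head = nn.Linear(dim, vocab_size)      # vector -> next-token scores

    def forward(self, x):
        return self.head(self.embed(x))

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.1)

for step in range(200):
    logits = model(xs)
    loss = F.cross_entropy(logits, ys)  # penalize bad next-token guesses
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, the model assigns high probability to the continuations
# it saw -- the statistical "pattern learning" described above.
```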
Companies like OpenAI argue that much of this information is publicly accessible on the internet, claiming legal allowances under the doctrine of “fair use.” But here’s where it gets tricky: copyrighted materials, especially ones behind paywalls, are not “public” in the conventional sense.
The Library Copyright Alliance supports the notion that using copyrighted text for AI training qualifies as fair use, particularly when the purpose is transformative and doesn’t substitute for the original work. Yet the waters are far from clear.
📚 The Case of the O’Reilly Books
According to the TechCrunch report, researchers at the AI Disclosures Project ran a membership-inference test on OpenAI's models and found that GPT-4o could reliably recognize verbatim passages from paywalled O'Reilly titles (books users would normally have to pay to access), distinguishing them from close paraphrases. That result suggests either exceptional coincidence or that the model had been exposed to those texts during training.
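The logic of such a test is simple: quiz the model with a verbatim excerpt hidden among paraphrases, and check whether it picks out the original far more often than chance would predict. Below is a toy sketch of that membership-inference idea; `query_model` is a hypothetical stand-in for a real model API (stubbed here with a random guess), and the scoring is deliberately simplified compared with the published methodology.

```python
# Toy sketch of a membership-inference quiz.
# query_model() is a hypothetical placeholder -- in a real study, an
# actual LLM is asked to identify the verbatim passage among paraphrases.
import random

def query_model(options: list[str]) -> int:
    """Hypothetical model call: return the index the model believes is
    the verbatim passage. Stubbed with a random guess for illustration."""
    return random.randrange(len(options))

def membership_quiz(verbatim: str, paraphrases: list[str], trials: int = 100) -> float:
    """Run repeated multiple-choice rounds; return the model's accuracy
    at spotting the verbatim excerpt among shuffled paraphrases."""
    hits = 0
    for _ in range(trials):
        options = [verbatim] + paraphrases
        random.shuffle(options)
        answer = options.index(verbatim)
        if query_model(options) == answer:
            hits += 1
    return hits / trials

excerpt = "An excerpt from a paywalled chapter goes here."
rewrites = ["Paraphrase one.", "Paraphrase two.", "Paraphrase three."]

accuracy = membership_quiz(excerpt, rewrites)
chance = 1 / (1 + len(rewrites))  # 25% with four options
print(f"accuracy={accuracy:.0%} vs chance={chance:.0%}")
# Accuracy well above chance, repeated across many excerpts, suggests
# the text was seen during training.
```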
What makes this problematic? It means OpenAI may be building sophisticated commercial tools on paywalled material without paying for access or asking permission. Programming guides like O'Reilly's are bought by students and professionals who pay for that knowledge; if the same content becomes freely accessible via an AI chatbot, it potentially undermines the original authors' rights and revenue models.
These revelations were echoed by commentators in Asimov’s Addendum, which called the situation “a microcosm of generative AI’s ethical blind spots.”
🌍 How Europe (and Others) Are Pushing Back
Legally, the U.S. and EU are diverging. While U.S. law leaves room for interpretation on fair use, the EU's regulatory stance is tightening. Under the EU AI Act, providers of general-purpose models must publish a sufficiently detailed summary of their training data and comply with EU copyright law, including rights holders' opt-outs from text and data mining.
That means what might be permissible under U.S. law could fall foul of EU rules as these obligations phase in from 2025. Transparency isn't just a courtesy; it's the law.
Countries like Singapore and Canada are also evaluating how existing copyright laws need updating to address generative models. As the RAND Corporation points out, there’s a global rush to harmonize intellectual property protections with emerging tech before the policy lag becomes unmanageable.
🧠 Ethical Use vs. Technological Growth
This isn’t just a legal issue—it’s a moral one. If AI models rely on high-quality data to improve, but that data comes from content producers who didn’t consent to its use, are we effectively building progress on exploitation?
You wouldn’t want your personal writings or behind-the-paywall blog ending up in someone else’s product, would you?
Claude.ai’s deep-dive on copyright raises similar concerns, suggesting that the tech industry needs to stop treating scraped content as free-for-all fuel. Creators put effort into making valuable material; profiting from it without acknowledgment or compensation is, at best, ethically murky.
💼 So, What Happens Next?
While OpenAI and others defend their practices, the tide is shifting. As tech developers face growing pressure from courts, regulators, and user communities, clearer boundaries are likely to emerge. Whether through opt-out registries, licensing frameworks, or stricter disclosure laws, the future of AI will need to reconcile innovation with respect for intellectual property.
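Some of those mechanisms already exist in embryonic form. OpenAI, for instance, documents a `GPTBot` web crawler that site owners can block via `robots.txt`; a minimal opt-out looks like the snippet below. Note that this only governs future crawling, not text already collected.

```
# robots.txt at the site root: ask OpenAI's crawler to skip this site
User-agent: GPTBot
Disallow: /
```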
There’s no question that generative AI is transformative. But transformation without accountability isn’t progress—it’s piracy with better grammar.
As the legal dust settles, one thing’s clear: AI can’t keep growing at warp speed without first answering a fundamental question—whose work are we building it on?
That’s a question worth asking before ChatGPT writes your next code snippet or research paper. Or maybe it already has.
📝 Conclusion
So here’s the real dilemma: if today’s most powerful AI systems are built on foundations they had no right to use, are we witnessing the rise of a revolutionary technology—or the most sophisticated act of digital trespassing in modern history? The boundary between innovation and infringement is blurring fast, and the tools we’re celebrating may also be quietly rewriting the rules of ownership, labor, and trust.
As AI becomes more embedded in our daily lives—writing our code, summarizing our reports, shaping our thoughts—we have to ask: who really owns the intelligence behind the interface? And if we don’t draw clear lines now, will we one day find ourselves in a world where creative work holds value only until a machine can replicate it? This isn’t just about copyright. It’s about the kind of future we’re authoring—word by borrowed word.