
In India, there has been significant discourse lately surrounding copyright concerns in the development of Generative AI models, the most recent contribution being the MEITY subcommittee’s Report on AI Governance in India, which declares that storing and copying works to create datasets for training foundation models constitutes infringement and, moreover, is not protected under Section 52(1)(a)(i) of the Copyright Act.
While I have written extensively about these issues elsewhere, this piece focuses on what I believe is a fundamental misdirection in this debate on both sides, whether it’s those claiming that training-purpose usage is infringement or those arguing that it constitutes “fair use.” Let us not even touch fair use. Training models using copyrighted works (including storing or making copies of them for training a model) is not an infringement of any exclusive right provided under Section 14, period.
The MEITY Sub-Committee’s broad conclusion that models infringe copyright holders’ exclusive rights simply by storing and making training copies of publicly available copyrighted works is deeply problematic. This stance, if accepted, would fundamentally overturn our understanding of copyright law. Here’s why:
Consider the implications of this statement. If the mere act of making and storing a copy constitutes copyright infringement, wouldn’t you be liable for printing or saving an article from my blog to read later? Could I legitimately sue you for that? If you showed it to someone else or uploaded it on a public drive, then maybe, but otherwise, could I?
The essence of copyright, whether it is reproduction, distribution, performance, or other rights, lies in the exclusive ability to express one’s original expression, which translates into an ability, or a right, to stop someone else from expressing it. It is crucial to understand that to express is fundamentally a relative concept involving two human beings: the human “expressor” and the human “consumer” of that expression. Copyright claims in respect of publicly available works are available under law only if one has substituted the position of the expressor (by becoming the expressor of someone else’s original expression), not if one is a mere consumer of the expression. This relation does not exist in AI training, which merely involves the model consuming the expression of the original creation to learn and train itself.
What’s missing from the current debate is a crucial understanding: copyright protects against unauthorized sharing of my work with others, potentially depriving me of credit or economic compensation that I could have gotten by sharing it with them myself. In simpler terms, while I cannot express your original expression without your permission, I can certainly consume your publicly available original expression without it (paywall circumvention being a possible, though debatable, exception that isn’t part of the current debate). The law focuses on the unauthorized expression of original publicly available content, not its unauthorized consumption, as making content public already waives that claim.
This is why I struggle to understand how storing or copying for purposes that don’t involve sharing or expressing the original expression, or a substantial part thereof, with third parties (what academics often call non-expressive, consumptive copying) could be considered infringement at all. This question needs to be addressed before we even enter the fair use debate, which only becomes relevant after prima facie infringement has been established. If such copying were illegal, simply printing publicly available web pages for one’s own learning or consumption would constitute copyright infringement. If I store content for learning, which I might then use to produce a potentially competing article, is that infringement? By this logic, academia (a commercial enterprise), which more often than not requires storing and printing publicly available articles to learn the ideas embedded within them, would amount to an enterprise built on copyright infringement. Fortunately (and thank god for that!), that is not the case.
Developers of models aren’t exposing any humans to the expression of the inputted works—they’re creating an alternate expression. If this alternate expression substantially resembles the original expression used for learning, that will indeed constitute infringement, but that’s fundamentally different from claiming that storing and copying for model training purposes is inherently infringing.
In short: (i) no, copyright is not the answer to your existential crises, and (ii) this is a “scope of rights” issue, one that arises before, and independently of, any backend defense of fair use.
The sooner we understand this and get over copyright, the sooner we will look for other arenas that actually resolve the existential concerns.
I welcome your thoughts on this perspective.
