Indian Copyright Law and Generative AI: Part 2- Transformative and Extractive Use

Co-Authored with Sneha Jain

Having first considered the question of whether storing copyrightable works for training purposes is reproduction that amounts to copyright infringement under Section 51 of the Indian Copyright Act, 1957, in this second post of this series we will specifically be looking at transformative and extractive uses, applicability of exceptions and limitations under Indian Copyright law, as well as implications of Anti-Circumvention laws.

Transformative Use

India does not recognize the transformative use exception to copyright infringement within the parameters of Section 52 of the Copyright Act. However, the Division Bench of the Delhi High Court in University of Cambridge v. BD Bhandari [2011 SCC OnLine Del 3216][i], has held use of a work for purposes of making a guidebook to be a substantially different purpose from the purpose for which the original work of the Plaintiff was made. The Court recognised this purpose to be a transformative purpose, which did not impinge upon the expressive purpose for which the Plaintiff had an exclusive reproduction right. The reproduction right, or its scope, was thus, arguably restricted by the Court to the expressive purpose for which the original work was curated.

Can a similar analogy be extended to use for training genAI models, where genAI developers argue that not even a single human being is exposed to the expressive content of the work? Not even the Large Language Model (LLM) reads or experiences the work in its expressive sense, and storage of a single copy merely enables the foundational model to discern, among other things, the “structure, syntax, and semantics of language,” including “grammar, sentence construction, and how words and phrases are related to each other” in order to facilitate the generation of “coherent and contextually appropriate output”[ii].

Unlike in the United States, where the statute itself holds two ideas in tension (the Copyright Act, 1976 treats transformed forms of works as protectable derivatives, while also exempting fair transformative use from infringement), the Indian statute is not clear on whether use of a work for an expressively different purpose, or indeed for a non-expressive purpose, falls within the domain of the creator’s market. The Division Bench of the Delhi High Court in University of Cambridge (supra) recognised that if the use of the work is of a “transformative character”, i.e., the purpose served by the use is different from the purpose for which the work was made, it is a limitation to copyright protection or its subject matter. The Court also held guide books to be a transformed work, not amounting to reproduction of the original. The Division Bench of the Calcutta High Court in Barbara Taylor Bradford v. Sahara Media Entertainment [2004 ILR (1) Cal 15] has also recognised that a work which is taken, and then used for producing a subsequent work that is so changed and muted as to make it transformed, and a different work altogether, would not generate an actionable claim for the owner.

This line of decisions presents an important question. Is use for the purposes of training, to enable the Gen AI model to produce accurate responses to user queries, a part of the expressive purpose for which the work was originally created? Or is it a transformed purpose that is beyond the circumscribed domain of exclusionary rights granted to the copyright owner? Is use for training purposes, when the work is primarily expressive, and meant to be expressively consumed as against used for non-expressive training, infringing? This would require an analysis of what really comprises the subject matter of protection for the owner- their primary and secondary markets – and how much of it is linked directly with the purpose for which the work was created- expressive purpose or training purpose? In other words, does use of a copyrighted work for a non-expressive/ non consumptive purpose amount to copyright infringement, or is it a distinct and transformative purpose outside copyright’s boundaries/scope of protection?

Extractive Use

A distinct question here deals with use and copying of even protected material for arguably extracting unprotectable elements, that would otherwise not be possible to be extracted. The affirmative essence of such use is to extract unprotectable elements from copyrighted works, elements which are not a subject matter of copyright protection.

In Akuate Internet Services Pvt. Ltd. v. Star India Pvt. Ltd [2013 SCC OnLine Del 3344][iii], the Division Bench of the Delhi High Court has recognised that copyright’s balance is maintained by ensuring that information, facts and knowledge embedded within expression cannot be monopolized using Copyright law. The Court has further held that protection cannot be extended to information and facts embedded in protectable works, even under the premise of unfair competition. Extending the same would inevitably restrict the ability to extract and disseminate information which is a critical component of Article 19(1)(a) of the Constitution of India. Thus, Indian Copyright jurisprudence clearly recognizes that information embedded within expression is not protectable and no monopoly can be extended in respect thereof. The said rationale of balancing copyright protection with access to unprotected information for the purposes of furthering expressive and speech values has also been recognised by the Division Bench of the Delhi High Court in Wiley Eastern Ltd. v. Indian Institute of Management [61(1996)DLT 281].

This is furthered by the idea-expression dichotomy under Copyright law that is widely accepted in Indian Copyright jurisprudence. Useful information contained in any expressive work is not protected. It is only the form in which the said information is contained/presented that is a protectable expression for purposes of Copyright law. This is in line with the fundamental purpose of Copyright law, which is to reward and incentivize/enable production of creative expressive forms that disseminate useful information. This, as Prof. Molly S. Van Houweling recognizes, is not because information and facts are not valuable enough to justify copyright but rather because they are so valuable that they belong to the public domain for everyone to be able to access.[iv]

For instance, in the case of a poem that expresses conceptions of thoughts, copyright in the poem gives no monopoly in the ideas or conceptions of facts expressed by the said words, but merely to the arrangement of the words used to express those thoughts. Others have a right to discern that information and exploit the information within, provided they do not substantially reproduce/adapt/communicate to the public, the concrete form in which the ideas have been arranged or put into shape.

The basic rationale for protecting uses of copyrighted expressions which are not reproductive of the expression or expressing form but are merely to extract the ideas or the unprotectable elements embedded within, flows from this idea expression dichotomy. For extraction however, it is arguably necessary and could be essential to access the whole copyrighted expression, and even store it, without exposing it in its expressive form to a single human being- which is exactly what GenAI systems often do. Without such access to the complete work, extraction of embedded information becomes impossible, inevitably extending copyright protection to such unprotectable elements. That, of course, is not a desired outcome of copyright policy. In other words, copyright does not give the “right to control access” to extract unprotectable elements (Anti-circumvention provisions do- which are dealt with below). It merely gives the right to exclude reproduction/adaptation/communication of the expressive form of the work (No wonder, Section 14 of the Copyright Act does not include “right to control access” within its sub-provisions).

Even well recognised doctrinal principles like the merger and scènes à faire doctrines in Copyright law provide scope for extractive uses of seemingly expressive elements. These doctrines recognize that unprotectable ideas, facts, stock characters, incidents, images and themes sometimes do not lend themselves to a wide variety of expressions. Thus, these doctrines prohibit protection of seemingly expressive elements that represent only a few limited ways of expressing certain ideas. Without being able to extract and use these seemingly expressive elements, which have merged inseparably with the unprotectable limited ways of expressing ideas, the purpose of the idea-expression and merger doctrines is rendered illusory.

The analysis may, thus, focus on the nature of the expression used, and the purpose of storing that seemingly expressive expression, i.e., one merged into an idea: is it to extract informational content out of it, or to expressively reproduce it? Often, it will turn out that without accessing, copying and using the entire expressive form that is protected, extracting unprotectable ideas out of such expressions would be impossible.

Codified Exceptions and Limitations:

Under Section 52 of the Indian Copyright Act, fair dealing for the purposes of private or personal use, including research, is permissible. An important question that Courts will have to grapple with, as they deal with extension of legal personality to Artificial Intelligence Technologies (separate article soon!), is whether use by AI systems for training and for its models to learn would be private or personal use, given that it does not expose the expression to a single human being apart from the AI system. Moreover, whether private use by a corporate entity like OpenAI for its own learning and development (for its models), even if that learning leads to a competitive product, is permissible or not will also have to be examined. Would the defense of private or personal use under Section 52(1)(a)(i) of the Copyright Act only extend to humans or also to corporates or juristic personalities?

On the side of research use, it is arguable that use for the purposes of extracting information embedded in expressions, without exposing a single individual to the expression, could amount to research use that is protectable under Section 52(1)(a)(i) of the Copyright Act. Importantly, the explanation to Section 52(1)(a) also provides that storage for fair dealing for a private or personal use, including research, is not infringing.

These questions at the back end, however, will only arise if Courts, in the first place, deem such storage and use for training purposes, to be a part of subject matter of protection under Section 14 of the Copyright Act.

Anti-Circumvention and the Training stage (Para-copyright right to “control access”)

Anti-circumvention provisions under Copyright laws essentially prevent unauthorized access to copyrighted works that are safeguarded in the digital realm using measures such as, inter alia, paywalls. In the United States, the New York Times, in its complaint against OpenAI, has alleged that OpenAI trained its model by circumventing paywalls and accessing, without authorization, its copyright-protected articles that sit behind technological protection tools meant to prevent such access. The allegation is tantamount to one of circumventing, without authorization, a security measure put in place to prevent access, for purposes of training the model. Would a similar act be actionable under Indian Copyright law?

Section 65A(1) of the Copyright Act provides that circumvention of a technological protection measure is forbidden under the Indian Copyright law. It is the only provision that controls “access” to copyrighted digital works and is a para-copyright measure to ensure that even unauthorized access is actionable. However, importantly, Section 65A(2) specifically prescribes that technological protection measures can be circumvented if it is for purposes that are legal, or not expressly prohibited by the Act. This provision was specifically inserted keeping in mind the importance of access for permitted purposes. The Standing Committee that was constituted for the 2010 Copyright Amendment Bill, which translated into the Copyright Amendment Act 2012, specifically argued that without a provision that allows circumvention of technological protection measures for permissible purposes under the Act, access to works for permissible purposes would be impossible and exceptions and limitations to the Copyright Act would be rendered redundant – “In the absence of the owner of the works providing key to enjoy fair use, the only option was to circumvent the technology to enjoy fair use of works.”[v]

Thus, if Courts find use for training purposes transformative, extractive or outside the subject matter of protection, or for that matter, permitted under Section 52 of the Copyright Act, circumventing technological protection measures to enable extraction would be permissible under the Copyright Act.

Section 65A (2) however comes with a condition, i.e., every person facilitating the circumvention of a technological protection measure (“hacker”) has to maintain a complete record of the name, address, and all relevant particulars of the person (“fair dealer/user”), as well as the purpose for which he has been facilitated. So long as this is maintained by the hacker, Section 65A (2) allows circumvention of technological protection measures. Importantly, this also ensures keeping a record of every protected work that is accessed for training purposes, for the purposes of technologically facilitating attribution, which is a desirable goal of copyright policy.
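To make the record-keeping condition concrete, here is a minimal sketch in Python of what such a register could look like. It is only an illustration under stated assumptions: the class names, field names and the work identifier are hypothetical and are not prescribed by Section 65A(2), which speaks only of the name, address, relevant particulars and purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CircumventionRecord:
    """One entry kept by the person facilitating circumvention.

    The first four fields mirror the particulars described in the text;
    work_identifier and recorded_at are illustrative assumptions.
    """
    person_name: str
    person_address: str
    other_particulars: str
    purpose: str
    work_identifier: str  # hypothetical: which protected work was accessed
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class CircumventionRegister:
    """Hypothetical in-memory register of facilitated circumventions."""

    def __init__(self):
        self._records = []

    def log(self, record):
        # Refuse entries that leave out the name, address or purpose.
        required = ("person_name", "person_address", "purpose")
        missing = [f for f in required if not getattr(record, f).strip()]
        if missing:
            raise ValueError("incomplete record, missing: " + ", ".join(missing))
        self._records.append(record)

    def works_accessed(self):
        # Sorted, de-duplicated list of works, e.g. to support attribution.
        return sorted({r.work_identifier for r in self._records})


register = CircumventionRegister()
register.log(CircumventionRecord(
    "Example Labs", "New Delhi", "model developer",
    "training (extraction of meta-information)", "article-002"))
register.log(CircumventionRecord(
    "Example Labs", "New Delhi", "model developer",
    "training (extraction of meta-information)", "article-001"))
accessed = register.works_accessed()
```

A register of this kind would, in principle, let the list of works accessed for training be produced on demand, which is the attribution benefit noted above.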

In the next part of this series, we will move from the training stage to the output stage, to analyze whether outputs produced by GenAI systems would violate the owner’s reproduction or adaptation/derivative rights.


[i] Special Leave Petition before the Supreme Court bearing – SLP(C) No. 029951 / 2011, dismissed vide order dated 27th January 2016

[ii] Understanding Generative AI and its relationship to Copyright, Written Testimony of Christopher Callison-Burch before the U.S. House of Representatives Judiciary Committee Subcommittee on Courts, Intellectual Property, and the Internet Hearing on Artificial Intelligence and Intellectual Property: Part I– Interoperability of AI and Copyright Law, available at <https://docs.house.gov/meetings/JU/JU03/20230517/115951/HHRG-118-JU03-Wstate-Callison-BurchC-20230517.pdf>

[iii] SLP(C) No. 029629 / 2013 pending before the Supreme Court.

[iv] Molly S. Van Houweling, The Freedom to Extract in Copyright Law, (unpublished draft on file with the author)

[v] Standing Committee Report on the Copyright Amendment Bill 2010, available at https://prsindia.org/billtrack/the-copyright-amendment-bill-2010#:~:text=The%20Bill%20allows%20for%20the,for%20use%20by%20such%20persons.

Indian Copyright Law and Generative AI- Part 1 -Mere Storage as infringing?

Co-Authored with Sneha Jain

The scope of copyright liability of Generative AI (‘genAI’) models is a hot topic globally. Copyright issues that stem from genAI technology can be categorized under four heads. All the litigations in the United States fall under one of these four heads:

  • Allegation of copyright Infringement due to copying/storage of copyrighted works as data sets for the purpose of training models;
  • Allegation of copyright Infringement due to substantial similarity of the output produced, as well as the output produced being based on the inputted copyrighted work;
  • Allegation of copyright Infringement due to lack of attribution or lack of disclosure/tampering with Rights Management Information;
  • Whether genAI models can be “authors” for the purposes of copyright law.

Most defense briefs in the various litigations filed in the US till now, have relied upon the transformative fair use defense to avoid copyright liability. Relying on the idea-expression dichotomy, these briefs have argued that genAI models have not copied any protectable copyright “expression” but only copied unprotectable ideas.

While the contours of the idea-expression dichotomy, the merger doctrine, as well as protectable subject matter, as applied in India, remain largely similar to the US copyright jurisprudence, the transformative fair use defense, as it has developed in the US, is not statutorily available under Indian law (though it arguably is available under judge-made law). A question then arises – how will such litigations fare under Indian copyright law? Will genAI tool providers like ChatGPT, Sora, SDXL Turbo, Google’s Music LM etc., face incremental risk under Indian law, even if they succeed in their transformative fair use defense under US law?

Through this series of articles, we will be exploring the peculiarities of Indian copyright law that may pose incremental risk to genAI tool developers, as well as models. Before we dive into legal issues, it is crucial to understand how a genAI tool is created and works.

How genAI tools are developed:

Visualize how a child learns reading and writing – by copying, imitating and repeated tracing of the alphabet (ABCs), followed by simple words, sentences and so on. Similarly, visualize how a child learns to speak – by listening to and repeating sounds and words spoken by a parent, teacher or other caregiver. Having learnt how to read, write and speak, the same child, being exposed to a wide spectrum of social, cultural and informational content and experiences, is not only intrinsically shaped by such content and experiences but also shapes the cultural realm through her contributions. It is this exact process of being shaped by, and at the same time shaping back, the cultural realm that genAI is mimicking through its algorithms that read the vast data sets of content and information available in digital form (‘training sets’) and extract ‘knowledge’ from the training sets. The ‘knowledge’ is nothing but the meta-information embedded within the training sets. This knowledge extraction happens by, firstly, breaking and categorizing the data into fundamental ‘tokens’; secondly, identifying statistical patterns from the placement of such tokens to learn the relevance and context of each word in a sentence; and thirdly, applying the knowledge to predict answers based on the statistical patterns learnt. Thus, what Gen AI systems most likely tend to do is “produce a ‘reasonable continuation’ of whatever text it’s got so far”. Such a system essentially mimics the process of learning and knowledge sharing adopted by a human mind, by converting words into numbers (tokens) and finding massive statistical patterns for learning through the numbers. In other words, creators of genAI tools/models are attempting to create a human brain through computers, as opposed to through natural conception or IVF or test tube baby brains.
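For readers who find code easier to follow than analogy, the three steps described above can be sketched in a few lines of Python. This is a deliberately toy illustration, not how any actual genAI model is built: real systems use sub-word tokens and neural networks with billions of parameters, whereas this sketch simply counts which word follows which. The function names and the three-sentence corpus are made up for the example.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Step 1: break the text into tokens (here, lower-cased words)."""
    return text.lower().split()


def learn_patterns(corpus):
    """Step 2: count how often each token is followed by each other token."""
    follows = defaultdict(Counter)
    for document in corpus:
        tokens = tokenize(document)
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Step 3: return the continuation seen most often during training."""
    candidates = follows.get(token)
    if not candidates:
        return None  # token never seen with a continuation
    return candidates.most_common(1)[0][0]


# Hypothetical three-sentence "training set".
corpus = [
    "the court held that the use was fair",
    "the court held that the work was original",
    "the court dismissed the appeal",
]
model = learn_patterns(corpus)
next_token = predict_next(model, "court")  # "held" follows "court" twice
```

Note what the trained `model` contains: counts of word adjacency (the “meta-information”), not the sentences themselves. Whether the same can be said of large models trained on entire works is, of course, part of what is contested in the litigations.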

Whether storage by genAI systems is copyright infringement?

The current stage of training genAI models involves making copies (fixing) of data sets, which include copyright protectable works, and storing them for varied periods. Storage of data sets for the purposes of training can happen in three distinct ways:

  • Storage throughout the subsistence and use of the models.
  • Storage until the data is extracted and absorbed.
  • No storage, and use of Federated or Collaborative learning, where data sets are not stored on a centralized cloud server. The training happens through data on decentralized servers, i.e., without storing data on any particular server.

It is important to note that irrespective of the fact of there being a copy of the work, which is then stored, the same is solely used by the model developers for extracting the meta-information contained within the expression of the content, through the model, and is not exposed to any human. Copying and Storing are two different acts or uses of a copyrighted work. For training genAI models, though the model does read the content per se to tokenize it for the purpose of weighing the model and parameters, to gauge the logic of the next possible sequence, it is however not reading or enjoying a copyrighted work in the context in which a copyrighted work is meant to be seen or heard or enjoyed. For instance, a musician does not produce a song for the primary purpose of it being used for training. The primary purpose of the same is entertainment.

Under the Indian Copyright Act, the exclusive right of reproduction is conferred on owners of literary, dramatic, musical and artistic works, sound recordings and cinematographic films, as well as on the owners of performers’ rights and broadcast reproduction rights. While the contours of the right may be different for each of these, the common thread is that reproduction and storage mostly go hand in hand. The Copyright Act distinctly provides an exclusive right to the copyright owner of a literary work, dramatic work or musical work, under Section 14(a)(i), to reproduce the work in any material form, including the storing of it in any medium by electronic means. It also provides an exclusive right to the copyright owner of an artistic work under Section 14(c)(i) to reproduce the work, including storing it in any material form. In the context of cinematographic films and sound recordings, Sections 14(d)(i) and 14(e)(i) distinctly provide an exclusive right to copyright owners – to make a copy of the film/sound recording, including storing of it in any medium. Neither “reproduction” nor “copy” is defined in the Act. However, the definition of an “infringing copy” under Section 2(m) of the Act clearly differentiates the concepts of “reproduction” and “making a copy”, as applicable to different sets of works. Arguably, this is to eradicate any associated physicalism with literary, artistic, dramatic or musical works – i.e., underlying works – and to showcase how reproduction of their forms of expression is relevant, and not the mere act of making copies which may not be for the purpose of reproducing the expression. The dictionary meaning of reproduction is to create or bring into existence again, and of copy is to imitate or transcribe. In MRF v. Metro Tyres, the Delhi High Court has also read the meaning of copy to be expansive to include imitation of the substance copied, and not merely a physical copy.

Reproduction includes the act of storing the expression of the work in any medium by electronic means. This deeming fiction of including “storage” within the meaning of reproduction was brought in by the 1994 Amendment to the Copyright Act to comply with TRIPS which extended protection to broadcasters and producers of phonograms. The Parliamentary Standing Committee Report in 2010, clarified that storage was to be held to be infringing specifically qua Internet Service Providers, who would unauthorizedly store content to provide exposure to the same for impermissible purposes.

The reproduction right protects recompense in the primary market for the owner of the work. It is to protect the owner of copyright from losing out economic returns by substitution in its primary market, by the act of copying the expression of the work, or unauthorizedly exposing the expressive originality of the work. This right is limited by various doctrines that have been developed by courts. For instance, courts do not extend the primary market of the work to ideas embedded within the expression. The idea-expression dichotomy clearly recognizes that protection is only limited to the expressive form, and the right only extends to preventing unauthorized reproduction of the expressive form of the work. This dichotomy has even been recognised in Article 9.2 of the TRIPS Agreement, which also explains that protection extends only to the original way in which the information or idea is expressed, and not to the information or idea embedded in the work. The Supreme Court in R.G. Anand v. Deluxe Films has also recognised, while providing helpful guidance on the meaning of what constitutes a “copy” under the Act, that the fundamental fact to be determined for violation of copyright is whether the manner, arrangement, situation to situation, scene to scene with minor changes or super additions have been adopted, as against the mere idea or information embedded. Even in Barbara Taylor Bradford v. Sahara Media, the Division Bench of the Calcutta High Court has recognised that ideas embedded within works are not protected, and only if the expression is appropriated would it form subject matter of copyright protection. The rationale of the same stems from the principle that copyright does not give an exclusive right over the information, experiences or facts embedded, but only over the concrete form in which these ideas are developed.
Thus, unless reproduction, including storage is for the purposes of exploiting or substituting the market of the copyright owner in this concrete form, it would not be copyright’s concern. This principle espouses the complex compromise that copyright engages in with the freedom of speech, where access to using speech is restricted only to the extent of reproduction of its concrete form- in order to incentivize and acknowledge the creator of the concrete form of the speech, but not to the idea or information embedded within the speech. The Division Bench of the Delhi High Court in Wiley Eastern v. IIM has also recognised that copyright consciously restricts its application to ensure it does not override concerns of Article 19(1)(a) of the Constitution of India.

The merger doctrine, also recognised in India, further limits protection in those cases where the ideas expressed can only be expressed in a limited number of ways, are functional, or core to the genre of expression. Here as well, protection is limited to the concrete expressive form of the work and does not extend to, in any way, monopolize the idea embedded. Moreover, the de minimis rule further limits protection to the extent that trivial parts of the work being used, which do not form a substantial part of the expressive form of the expression, are not protected.

The focus of the reproduction right, as can be seen from these limiting doctrines, is on unauthorized exposure to, or consumption of, the expressive forms of the work, as against use to extract ideas or the meta-information embedded in the works. In fact, these doctrines make sure that copyright does not stifle the flow of ideas, while still protecting the expressive form in which these ideas are embedded, in order to provide economic incentives for people to clothe these ideas in different original expressions.

The question is whether copying or storing, which is completely non-expressive or non-consumptive, that is – copying that does not involve appropriating the expression of the said work or exposing the expression to any human being, but rather is only for the purpose of extracting meta-information for weighing models and parameters, and training the genAI model, is an act of infringement? Would extraction of ideas constitute an existing market?

A few examples which scholars quote are – can reproduction of a book for use as a doorknob (a purpose for which the book hasn’t been written or published) be infringement, merely because a copy of the physical book was made? Can storage of student papers on a plagiarism software to decode whether the student plagiarized its paper with other papers available on the internet, be infringing use/copy/storage? Can a web-crawling software that makes cached copies of works on the internet, in order to enable search engines to respond to queries of search by matching queries with cached data, be infringing use/copy/storage that is a part of the reproduction right? Can use of a book for following the procedures provided therein be infringement of the reproduction right? Can use of books for allowing search of the said books by search engines, amount to infringement of the reproduction right? These are questions that Courts will have to grapple with in the coming times.

A purposive interpretation of the meaning of “Reproduction, including Storing of works” within Section 14 of the Copyright Act, would probably exclude an exclusive right over storage that is not for the purpose of expressive reproduction and is only for the purpose of extracting meta-information, in any case protected by the limiting idea-expression dichotomy in copyright law. The physical fact of storage or copying would be irrelevant to such an analysis- as long as the form of expression, i.e., the protected element in the work, is not being exposed to anyone.

To the contrary, however, a literal construction of the said provision would probably lead to a conclusion that extends the primary market of the copyright owner even to the mere storage/ or copying of the work, irrespective of whether the same is for a reproductive purpose (in an expressive context) or not.

Which way the courts will go is yet to be seen!

Transient Storage

Even if storage is considered to be infringing under Section 14 read with Section 51 of the Copyright Act, Section 52(1)(b) and (c) specifically provide for exemption of transient or incidental storage of a work purely in the technical process of electronic transmission, and transient or incidental storage for the purpose of providing electronic links or access where the same is not expressly prohibited or infringing.

Courts will have to grapple with the question as to (a) whether storage of training data sets for the training period can be considered transient; and (b) whether storage of training data sets would amount to being incidental to providing access to the genAI model to extract meta-information.

The concept of “transient and incidental storage” was somewhat clarified by the Delhi High Court in MySpace Inc. v. Super Cassettes Industries Ltd. In MySpace, the Court was dealing with the question of whether MySpace could be obligated to monitor and review its platform in order to report any infringing content of Super Cassettes. The Court, while analyzing the purpose of the transient or incidental storage exception, held transient to mean temporary, and incidental to mean subordinate to something of greater importance. This was deemed to include “cached data”, or other data generated automatically to improve performance of the core permissible function. Moreover, the text of the Copyright (Amendment) Bill which introduced Section 52(1)(c) shows that storage is permissible when exposure as a result of storage is permissible and non-infringing.

Thus, it is arguable that storage for the sole purpose and functionality of training, which arguably is a transformative and permissible purpose, would be incidental storage that is permissible under the said section. However, Courts are yet to clarify this.

On the aspect of temporary storage, legality would depend on how long the storage is for. If the data set automatically is removed once the meta-information used for training is extracted, it is arguable that storage would be transient and temporary, all the more due to the fact that not even one human is exposed to the stored copy. However, Courts would have to render more clarity on this aspect.

In any case, the next part of this series will delve deeper into use for extractive purposes and whether any of the defenses under Section 52, including fair dealing for private or personal use, and the question of use of illegal copies as against lawfully acquired copies, would extend to “use” at the training stage of Gen AI models, whether by the AI or by the facilitator, i.e., the company building the AI.