In brief: Our position on the lawfulness of using web-scraped data to train generative AI largely remains the same. See the original call for evidence for the full analysis.

The consultation enabled us to refine it in the following ways:

  • Legitimate interests remains the sole available lawful basis for training generative AI models on web-scraped personal data, based on current practices. However, developers can rely on it only where they can ensure they pass the three-part test, including the necessity test. We received consultation responses suggesting that alternative data collection methods may be feasible. We therefore expect controllers who develop generative AI to evidence why other available methods of data collection are not suitable.
  • The three-part test also includes the balancing test. Web scraping for generative AI training is a high-risk, invisible processing activity. Where inadequate transparency measures leave people unable to exercise their rights, generative AI developers are likely to struggle to pass the balancing test.

Respondents

In January 2024, we published the first chapter of our consultation series. This chapter set out our policy position on the lawful basis for processing web-scraped data to train generative AI models.  

We received 77 responses from organisations and 16 responses from members of the public. 31 responses came via our survey, with a further 47 received directly via email. The sectors most represented were: 

  • creative industries (18);
  • law firms (nine);
  • the technology sector (eight); and
  • trade or membership bodies (eight).

Of the survey responses, 19 respondents (61%) agreed with our initial analysis.

Original call for evidence

In our original call for evidence, [20] we set out our positions on the lawful basis for using web-scraped data to train generative AI models. For the full analysis, we advise consulting the original call for evidence; our key positions were as follows.

Firstly, we determined that five of the six available lawful bases (consent, contract, legal obligation, vital interests and public task) are unlikely to apply in this context. This means that, in practice, legitimate interests is the only lawful basis developers could explore. However, they still need to pass the three-part test [21] to demonstrate that this lawful basis is valid.

Secondly, our initial view was that most generative AI training would likely require the volume and kind of data obtained through web scraping, but we welcomed views on this. If organisations can reasonably achieve their purpose without the high-risk, [22] invisible processing [23] involved in web scraping, then they wouldn’t pass the necessity part of the legitimate interests test.

Thirdly, we explained that using web-scraped data to train generative AI is essentially a combination of two high-risk processing activities. This is because it involves innovative technology and constitutes invisible processing. It therefore meets two of the triggers on the ICO’s list of high-risk processing activities. [24]

Finally, we set out various considerations that generative AI developers could explore to potentially help them pass the third part of the legitimate interests test in this context.   

Key points from the responses

When it came to identifying a valid legitimate interest (‘the purpose test’) in processing web-scraped data to train generative AI, the following points arose in the responses:  

  • Respondents regularly cited commercial or societal interests, or both, as a legitimate interest. This was particularly the case for AI developers within the technology sector and some law firms. Several respondents argued that all generative AI is innovative by default and that any innovation is inherently beneficial for society.
  • Some respondents challenged societal interests as a valid legitimate interest. This included representatives from civil society and the creative industries, who highlighted the detrimental impacts generative AI may have on people. These impacts included lack of transparency and the inability to exercise information rights.
  • Many respondents from the creative industries argued that web-scraped content would constitute unauthorised and unlawful use of copyright works (specifically under Chapter II of the Copyright, Designs and Patents Act 1988). [25] They added that the processing would therefore also lack a lawful basis under the lawfulness principle of data protection.

In terms of the need to use web-scraped data for training a generative AI model (the ‘necessity test’), the following points arose in the responses:

  • Generative AI developers and the wider technology sector stated that, because of the quantity of data required, training generative AI models cannot happen without the use of web-scraped data. They also argued that large datasets with a wide variety of data help ensure the effective performance of models and avoid biases or inaccuracies.
  • On the other hand, many respondents, especially from the creative industries, argued that there were alternative ways to collect data to train generative AI models, such as licensing datasets directly from publishers. Therefore, they argued that using web-scraped data couldn’t meet the necessity test.  
  • A variety of responses mentioned synthetic data. [26] However, the extent to which developers can rely on it to provide large amounts of training data remains an open question.

In terms of the impact of the processing on people’s interests (the ‘balancing test’), the following points arose in the responses:

  • Many respondents, especially from civil society and the creative industries, argued that in most cases people’s rights would override the legitimate interests of generative AI developers. This is because of the loss of control over personal data that invisible processing involves, and because people are unable to understand the impact of that processing on them. This was also raised in both roundtables with these sectors.
  • On the other hand, AI developers within the technology sector, among others, argued that the societal benefits of generative AI are a significant factor in passing the balancing test. They cited innovation and potential beneficial uses. 
  • On safeguards and mitigations that may help in passing the balancing test, all respondents recognised the key role of transparency. In particular, they mentioned making clear the extent to which personal data is processed, where it came from and how it is processed, and communicating this clearly to people. One civil society organisation advocated for an AI registry that could perform this function.
  • They also suggested using licences and terms of use as effective safeguards for generative AI developers, to ensure that “open-access” models used by downstream deployers comply with data protection law.

Respondents commented on the processing of special category data and compliance with article 9 of the UK GDPR:

  • Civil society respondents, and others, identified the processing of special category data [27] in the training data as a significant concern. This was because of the greater risks to people’s rights and freedoms. They argued it was difficult to see how developers could ever meet an article 9 condition [28] when training generative AI.

Our response

Why legitimate interests is the only available lawful basis

Numerous respondents questioned why we determined that legitimate interests was the only valid lawful basis for using web-scraped personal data to train generative AI. For example, some suggested public task as a lawful basis that the public sector could use. The creative industries often raised consent as an option. To provide clarity, our reasoning in the specific context of web scraping to train generative AI is as follows:

  • Consent: This is unlikely to apply here because the organisation training the generative AI model has no direct relationship with the person whose data is scraped. In addition, when people first provided their personal data, they could not have anticipated that another organisation would later use it for this purpose. People are also unlikely to be able to revoke their consent if removing their data requires model re-training, which is currently an extremely cost- and time-intensive process. [29]
  • Performance of contract: This is not available because the person (whose personal data is in the training data) does not have a contract with the controller undertaking the web scraping. [30]
  • Complying with a legal obligation: The organisations training generative AI are under no legal obligation to collect data via web scraping.
  • Protecting the vital interests of a person: Training generative AI does not protect a person’s life.
  • Public task: Controllers can only rely on this when the activity is set out in law or is part of their official role. This is not the case with commercial generative AI developers and would be highly unlikely to apply to public sector developers. 

When does data protection law apply to creative content?

It is important to clarify that data protection law only applies to creative content if it constitutes personal data. [31] Whether the content is personal data depends on whether any person is identifiable from it. This needs to be evaluated on a case-by-case basis, depending on the availability of information [32] and tools [33] to identify the person.

The ‘purpose test’

When articulating a legitimate interest in the ‘purpose test’, our view remains that it is important for controllers to set out a specific and clear interest, even for models that can be used for various downstream purposes. This can help controllers pass the third part of the legitimate interests test, known as the ‘balancing test’. Organisations can rely on interests that are generic, trivial or controversial; however, such interests are less likely to pass the balancing test or to override someone’s right to object. [34]

Controllers should ensure that the specified purposes make it possible to meaningfully assess whether the processing is necessary to achieve them. Even when the processing is necessary, controllers need to ensure that their interest is not overridden by the interests or fundamental rights and freedoms of the person whose data is processed. Generative AI developers should not assume that general societal interests will be sufficient to rely on as a legitimate interest when considering their lawful basis for web scraping. As we said in the original call for evidence, developers should evidence the likely benefits rather than assume them. [35] Just because certain generative AI developments are innovative does not mean they are automatically beneficial or will carry enough weight to pass the balancing test. [36]

To demonstrate that the chosen approach for achieving a legitimate interest is reasonable, controllers should properly define all of their purposes and justify the use of each type of data collected.

The ‘necessity test’

The consultation responses also clearly showed that the necessity of using web-scraped personal data to train generative AI is not a settled issue. The creative industries in particular challenged our initial position that web scraping is necessary. We received evidence that other methods of data collection exist, for example where publishers collect personal data directly from people and license this data in a transparent manner. As a result, we encourage developers to seek out other sources of data where possible. Where controllers are seeking to evidence that web scraping is necessary, they should explain why they are unable to use a different source of data.

We will engage further with developers, tech companies, academic researchers and NGOs on the necessity of web scraping for the purpose of training generative AI.

The ‘balancing test’

As part of this consultation, we asked for evidence on the potential technical and organisational safeguards organisations could deploy to mitigate any identified risks. We thought this was especially relevant when considering the legitimate interests balancing test. 

Some respondents, particularly from industry, argued that using licences and terms of use for “open-access” models provided a safeguard. They said this helped to mitigate the risks that people could otherwise be exposed to from downstream deployment of generative AI models. However, where developers want to rely on licences and terms of use to mitigate risks posed by downstream deployment, they will need to demonstrate that these arrangements contain data protection requirements. They also need to ensure these requirements are met, so that the safeguard is effective in practice.

We also received many suggestions for technical safeguards that could be deployed. However, on the whole, apart from theoretical or proof-of-concept suggestions, the consultation responses did not provide us with sufficient verifiable evidence to properly assess the efficacy of these safeguards in practice.

Additionally, we are aware that many controllers are not meeting their basic transparency obligations under article 14 when relying on web-scraped data to develop generative AI models. [37] Controllers must express their purposes in a way that allows people to better understand why an organisation is using their data and, in line with our guidance, what happens with the data in question. They must also take into account the modality requirements of article 12, articulating those purposes concisely, transparently, intelligibly and in clear and plain language.

We therefore encourage developers to consider how new and innovative transparency mechanisms and safeguards could give people a better understanding of the data processing and put them in a stronger position to exercise their information rights. This is an area we will continue to monitor, and we strongly encourage generative AI developers to engage with us further on this.

Further, where generative AI model developers are using personal data, they should assess the financial impact on people in the balancing test. For example, a fashion model could lose their income if a generative AI model uses their personal data to create a digital version of them (such as an avatar) to replace them in a fashion show.

Special category data

We did not focus on special category data in our initial consultation. However, many respondents raised the processing of this data for generative AI training as a salient issue. We are currently scrutinising the use of special category data by generative AI developers, based on our existing positions. [38]

Finally, as with non-special category personal data, whether a controller intends to process special category data is irrelevant in determining whether that data falls within article 9. The only exception is where an article 9 category could be inferred from the data but there is no intention to make that inference. [39]


[20] Generative AI first call for evidence: The lawful basis for web scraping to train generative AI models

[21] Legitimate interests

[22] Examples of processing ‘likely to result in high risk’

[23] See glossary.

[24] Examples of processing ‘likely to result in high risk’

[25] Copyright, Designs and Patents Act 1988

[26] See glossary and our related guidance: Synthetic data

[27] What is special category data?

[28] An article 9 condition is necessary in addition to the article 6 lawful basis when someone processes special category data. See: What are the rules on special category data?

[29] In the creative industries context, consent is not likely to be valid for generative AI training as it is not practically revocable. In addition, in this professional context, it can be challenging for consent to be freely given. Any power imbalance between the data subject (the creator) and the controller may render consent unworkable during deployment too. It should also be noted that the concept of ‘consent’ in data protection is distinct from the concept of consent or ‘permission’ in other regimes such as copyright.

[30] For the creative industries, the contract lawful basis is very unlikely to apply, as it is unlikely that an organisation is under a contractual obligation to use a creator’s content to train its generative AI.

[31] See the ICO’s guidance on personal data: What is personal information: a guide

[32] Such as metadata.

[33] For example, facial recognition search engines.

[34] What is the ‘legitimate interests’ basis?

[35] Generative AI first call for evidence: The lawful basis for web scraping to train generative AI models

[36] For example, generative AI can be used to create harmful deepfakes, or can leak personal information in certain contexts.

[37] Right to be informed

[38] Special category data

[39] What is special category data?