This consultation sets out the ICO’s emerging thinking on generative AI development and use. It should not be interpreted as an indication that any particular form of data processing discussed below is legally compliant.
This post is part of the ICO’s consultation series on generative AI and data protection.
This fifth and final call focuses on the allocation of accountability for data protection compliance across the generative AI supply chain. It addresses the recommendation for ICO guidance on the allocation of accountability in AI as a Service (AIaaS) 2 contexts made in Sir Patrick Vallance’s Pro-innovation Regulation of Technologies Review. 1
The allocation of accountability is complicated not only by the different ways in which generative AI models, applications and services are developed, used and disseminated, but also by the different levels of control and accountability that participating organisations may have. We are interested in additional evidence on how this works in practice. In the meantime, we provide a summary of our current analysis, the policy positions we want to consult on and some examples which show how this analysis could be applied in practice.
You can respond to this call for evidence using the survey, or by emailing us at [email protected].
The scope of this call
This call focuses on the allocation of roles and responsibilities in the generative AI supply chain. It does not go into the specific obligations in detail.
It also provides some indicative scenarios of processing activities. Generative AI is a fast-moving space and different processing activities and actors may be added into the supply chain. The list of scenarios is non-exhaustive. We seek evidence on additional processing activities and actors not included in this call, alongside the relevant allocation of accountability roles.
The background
Data protection, AI and accountability
Accountability is a principle of data protection law. There are two key elements. First, organisations are responsible for complying with the UK GDPR. Second, organisations must be able to demonstrate their compliance.
Demonstrating compliance with this principle hinges on the accurate allocation of responsibility between three roles an organisation can play when processing personal data. These roles are:
- A controller – controllers are the main decision makers. They exercise overall control over the purposes and means of the processing of personal data;
- A joint controller – if two or more controllers jointly determine the purposes and means of the processing of the personal data, they are joint controllers; or
- A processor – processors act on behalf of, and only on the instructions of, the relevant controller.
Which role an organisation plays will be determined by:
- the specific processing of personal data taking place;
- the circumstances in which this happens; and
- who has genuine, real-life influence and control over the purposes and means of the processing.
Whether an organisation is a controller, joint controller or processor is not necessarily determined by a contract. In generative AI, the roles of ‘developers’ and ‘deployers’ don’t always neatly map onto the concepts of controllers and processors. Roles and responsibilities under data protection law are also not influenced by other legal regimes such as intellectual property or competition law.
Organisations should consult the ‘in more detail’ box below for our existing guidance on controllership. We briefly touch on some key concepts below.
What makes an organisation a ‘controller’?
A controller decides the purpose (the ‘why’) and the means (the ‘how’) of the processing of personal data. An organisation that determines the purpose and means will always be either a controller or a joint controller.
This involves overarching decisions about the processing such as what types of personal data to collect or what that data will be used for.
What makes an organisation a processor?
Processors only process data on behalf of a controller, according to the controller’s documented instructions. Processors cannot make any overarching decisions about the processing. But, within the terms of their arrangement with the controller, they can make more day-to-day operational decisions – for example, about the specific hardware used, the computational resources required or the cybersecurity arrangements.
When does joint controllership come in?
If two or more controllers jointly determine the purposes and means of the processing of the same personal data, they are joint controllers. Joint controllership does not necessarily mean each controller has equal responsibility, or that each is responsible for all processing. But they must clearly set out their respective roles and responsibilities for each processing activity by means of an arrangement between them.
In more detail – ICO guidance on AI accountability, controllers and processors
We understand that allocating accountability in complex supply chains such as generative AI may be more challenging than in simpler contexts. Therefore, we seek evidence on the criteria organisations use to identify their role as controller/processor/joint controller and how they separate the different processing activities when assessing their role(s).
Our analysis
Generative AI’s lifecycle and supply chain
Personal data processing in generative AI can involve distinct entities and can be undertaken for different purposes. In this context, it is useful to distinguish between the AI lifecycle and the AI supply chain. 3
The AI lifecycle refers to the progression of an AI model through several distinct stages (eg model pre-training, fine-tuning, deployment). It includes the processing operations necessary to create and maintain the model.
The AI supply chain is a network of processing activities that happen in a sequential or iterative manner and can be undertaken by different entities for a variety of different purposes. It includes not just the AI lifecycle but a range of other activities such as problem-solving, model improvement or the creation of new applications and services built on top of those models. Some indicative processing activities may include:
- Sending or receiving personal data included in queries to and from third parties via plug-ins (eg a travel booking website) to produce inferences;
- Retrieving additional information that may contain personal data from sources such as the web to improve the statistical accuracy of an output, via a technique called Retrieval Augmented Generation (RAG) 4 (see the sketch after this list); and
- Interaction with third-party AI ‘agentic’ systems 5 that are capable of planning and acting autonomously to execute goals using a variety of sub-systems, models (such as LLMs) and tools.
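To illustrate where personal data can flow in the second activity above, the minimal sketch below shows RAG in Python. The embed() and generate() helpers are hypothetical stand-ins for whichever embedding and generative models an organisation actually uses; the point is simply that retrieved material, which may contain personal data, is folded into the prompt sent to the model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy stand-in for a real embedding model: a normalised letter-frequency vector.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            vec[ord(ch) - ord("a")] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def generate(prompt: str) -> str:
    # Hypothetical stand-in for the call to the underlying generative model.
    return f"[model output grounded in a prompt of {len(prompt)} characters]"

def rag_answer(query: str, documents: list[str], top_k: int = 2) -> str:
    # 1. Retrieve: rank the stored documents by similarity to the query.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: float(np.dot(q, embed(d))), reverse=True)
    context = "\n".join(ranked[:top_k])

    # 2. Augment: the retrieved text (which may contain personal data) is
    #    inserted into the prompt that will be sent to the model.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    # 3. Generate: the model produces an output informed by the retrieved text.
    return generate(prompt)

print(rag_answer("Who chairs the committee?",
                 ["Minutes: the committee is chaired by J. Smith.",
                  "Unrelated policy note."]))
```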
In generative AI supply chains, we see increasing interdependencies between various entities making decisions about purposes. These entities should carefully consider the nature of the processing activities and the level of control and influence they can exercise, to ensure they have correctly identified whether they are a controller, joint controller or processor, and therefore the responsibilities that they, and other organisations in the supply chain, have. They should also document these outcomes (eg in a record of processing activities). We seek evidence on the influence and control organisations have over determining the means and purposes of distinct processing activities. We are also interested in how they document this, including the use of logs of decision-making, maps and records of processing activities (ROPA), 6 DPIAs, data flow mapping, and other methods.
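As an illustration of the kind of documentation described above, the sketch below shows one possible way of logging a controllership assessment for each processing activity. The field names and example values are hypothetical and do not represent a prescribed ROPA format.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingActivityRecord:
    activity: str                      # the specific processing activity
    purpose: str                       # why the personal data is processed
    role: str                          # "controller", "joint controller" or "processor"
    justification: str                 # who determines the purposes and means, and how
    other_parties: list[str] = field(default_factory=list)

records = [
    ProcessingActivityRecord(
        activity="Fine-tuning a third-party base model with our customer support data",
        purpose="Adapt the model to summarise customer support tickets",
        role="controller",
        justification=(
            "We decide why and how the data is used; the developer processes it "
            "only on our documented instructions"
        ),
        other_parties=["Model developer (processor)"],
    ),
]
```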
Controlling and influencing overarching decisions in generative AI deployment
Organisations developing a base generative AI model to provide as a product or service will be controllers for much of the development-related processing, where they have influence and control over the purposes and means. Where they decide to deploy their model themselves (eg in a consumer-facing application such as a chatbot), they will also be controllers for much of the processing involved at this stage.
However, in other contexts the level of influence and control over purposes and means requires additional consideration. For example, when an organisation decides to deploy a model that a developer has made available, there may be three possibilities:
- the developer and deployer may be joint controllers, if they are both involved in the decisions regarding how and why each processing operation is designed and executed;
- the developer may be a processor, if it is acting on the instructions of a third party who is able to influence both the purposes and means of the processing. In practice, this may be possible where a developer provides sufficient information to the deployer about the processing so the deployer can verify compliance with data protection legislation; or
- the developer may play no role in the processing, because it has merely produced an application, product or service that deployers decide to use in their processing activities. For example, the developer may not have trained the model using personal data. The deployers may be controllers or processors for the processing of personal data by the model, depending on the circumstances.
With this in mind, we seek tangible evidence on how organisations have undertaken, or have instructed other entities to undertake, fine-tuning of a generative AI model. We are interested in the data used for the fine-tuning and the allocation of controllership for that processing, along with whether it was possible to evaluate if the fine-tuning changed the behaviour of the model.
What are the overarching decisions in generative AI?
Our AI and data protection guidance already sets out which overarching decisions would make someone a controller. 7
Overarching decisions are those taken about the nature, scope, context and purpose of the processing. These include decisions about:
- the type of data (eg media, video, etc);
- the categories of the data (eg names, social media posts, etc); and
- the sources of the data (eg Wikipedia, specific social media platforms).
When an organisation deploys a generative AI model that a developer has created, control and influence over the context and purpose of the processing may be straightforward. For example, a law firm decides to use a large language model (LLM) to summarise legal documents. In this instance, the law firm would have decided the purpose of using the LLM to solve a business problem and the context in which that takes place, which is internal use to process documents that may include personal data of clients and third parties.
However, in some scenarios it may be less clear whether the organisation exercises control and influence over the nature and scope of the processing. This is particularly the case for many of the so-called ‘closed-access’ models we currently see in the market, because the way generative AI models are developed and distributed in that context may not provide downstream deployers with all the information they need to exercise the necessary influence and control over key decisions.
1. Decisions about the scope of the processing in generative AI
The scope of the processing 8 covers decisions on the nature of the data, its sensitivity, the extent of the processing, the duration and more.
Choosing the training data embedded in the model
In many cases, developers will choose the types, categories and sources of training data for the base generative AI model. These choices will need to abide by all relevant data protection requirements (eg lawful basis, 9 special category data conditions, 10 etc). That personal and non-personal data will be embedded in the model architecture to determine its capabilities. Third parties with no practical sight or understanding of the initial training data are not likely to exercise control over this crucial decision. In this scenario, developers will be controllers for the initial collection and curation of the training data.
Model distribution: The ‘open’ and ‘closed-access’ spectrum
Often the term ‘open-source’ is used to imply that third parties wanting to use a publicly released model have access, control and influence over everything they need to fundamentally modify the model. Nevertheless, what we see in practice indicates that ‘closed’ and ‘open’ may be a false dichotomy. This is because different levels of model access, control and influence exist across a spectrum, with organisations releasing different assets and under a variety of conditions and agreements. 11 We believe ‘open-access’ is a more accurate way of describing openness and adopt this term to describe situations where anyone can get access to assets such as model weights, activations, gradients, 12 or any other element that enables the fundamental modification of a base model.
Developers often decide to distribute generative AI models adopting a variation of ‘open-access’ or ‘closed-access’ approaches. This decision has a knock-on effect on any downstream risks deployers of these models may face, such as the model leaking parts of its original training data to the deployer’s customers. 13
What does ‘open-access’ mean for controllership?
Third parties who adopt and modify models at the most ‘open’ end of the spectrum, using their own computing resources, will likely be defining the purposes independently of any agreement with the initial developer. In that case, these third parties may be seen as distinct controllers, separate from the initial controller who developed the system.
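As a concrete illustration, the sketch below assumes the Hugging Face transformers and PyTorch libraries and a hypothetical publicly released checkpoint (“some-org/open-base-model”). Nothing in it involves the original developer: the third party chooses its own data and objective and runs the training step on its own infrastructure, which is why it is likely to be a distinct controller.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/open-base-model"  # hypothetical open-access checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The third party's own training text, chosen for its own purpose and
# possibly containing personal data it is responsible for.
batch = tokenizer(
    ["Example text drawn from the deploying organisation's own records."],
    return_tensors="pt",
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# A single fine-tuning step: with access to the weights, the third party can
# fundamentally modify the model without any involvement from the developer.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()

model.save_pretrained("./locally-modified-model")
```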
As mentioned in the first 14 and second 15 calls for evidence, generative AI developers releasing ‘open-access’ models should consider the development-related data protection implications of not having control over the downstream uses of their model. These may include passing the legitimate interests balancing test or complying with purpose limitation for the training stage. 16
They should also consider that assets such as models, code or training data could include personal data. 17 Where they do, the publication of these assets will be a processing activity – for example, disclosure by dissemination or otherwise making available – which must be fair, lawful and transparent for the data subjects involved.
There are a number of data protection challenges associated with the ‘open’ release of assets. For example, it:
- is irreversible – once it’s out there, it cannot be brought back;
- may impact individual rights – for example, whether individuals retain control over the use of the data once the asset has been published and whether they are informed whenever that data is subsequently processed; and
- may introduce security risks, such as the potential for downstream third parties to remove security measures.
Organisations that intend to publish these assets need to consider their data protection compliance obligations or otherwise erase or anonymise the personal data before publication. If erasure or effective anonymisation is not possible, the risks should be mitigated by adopting a ‘structured access’ approach where developers put in place technical and organisational measures as appropriate to the type of release (eg via an API or for deployment on local hardware).
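The sketch below illustrates one possible ‘structured access’ arrangement, assuming the FastAPI library. The generate() and redact_personal_data() helpers are hypothetical placeholders; the idea is that the developer serves model outputs through an authenticated endpoint, with its own filtering and usage terms attached, rather than releasing the underlying weights or training data.

```python
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

# Keys issued only to parties who have accepted the developer's usage terms
# (illustrative value only).
APPROVED_KEYS = {"key-issued-under-usage-agreement"}

def generate(prompt: str) -> str:
    # Hypothetical stand-in for the hosted model.
    return f"[model output for: {prompt}]"

def redact_personal_data(text: str) -> str:
    # Placeholder for whatever output filtering the developer applies.
    return text

@app.post("/generate")
def serve(prompt: str, api_key: str | None = Header(default=None)) -> dict:
    # Requests without an approved key never reach the model.
    if api_key not in APPROVED_KEYS:
        raise HTTPException(status_code=403, detail="Access not authorised")
    return {"output": redact_personal_data(generate(prompt))}
```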
We are interested in evidence on what specific elements (eg training data, weights, etc) organisations releasing ‘open-access’ models make accessible, to whom, under what conditions and following what kind of risk mitigation measures. We are also interested in how they ensure this release is fair, lawful and transparent.
2. Decisions about the nature of the processing in generative AI
The nature of the processing covers decisions about how organisations collect, store and use data as well as security measures, novel approaches to processing and more. We look into generative AI-specific decisions below.
Model architecture: deep neural networks
Developers choose the type of model architecture to use for training (eg LLMs tend to use a deep neural network architecture called the ‘Transformer’). Training data influences the behaviour of deep neural networks more strongly than it does simpler architectures.
However, in many cases third parties deploying the model are unlikely to have control and influence over the initial training data and, by implication, the core behaviour of the model, including its statistical accuracy or any biased results. We note that research is ongoing on how fine-tuning can change model behaviour or be used to mitigate risks. 18
Key model parameters
After choosing the model architecture, generative AI developers put in place key building blocks for the model itself, such as hyperparameters. 19 These define and influence the model’s behaviour. For example, hyperparameters control how the learning process takes place – and, crucially, influence how the model will perform.
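For illustration, the sketch below lists the sort of hyperparameters a developer fixes before training begins. The names and values are purely indicative, but a deployer of a ‘closed-access’ model typically cannot see or change any of them.

```python
# Purely indicative values; not a recommended configuration.
hyperparameters = {
    # Architecture choices: set the model's size and capacity.
    "num_layers": 24,
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "context_length": 4096,
    # Optimisation choices: control how the learning process takes place.
    "learning_rate": 3e-4,
    "batch_size": 1024,
    "training_steps": 500_000,
    "dropout": 0.1,
}
```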
Security measures and risk mitigations
Apart from the widely reported risks of data leakage through memorisation, 20 models such as LLMs often demonstrate ‘emergent’ behaviour. This is behaviour that was not planned by the developers during training. Research shows that such risks are difficult to predict or mitigate in advance, 21 and developers may be unable to provide guarantees about a model’s behaviour to third parties deploying the system. 22
Joint controllership can close accountability gaps
Our understanding is that the way in which many developers currently offer third parties access to their models (in particular ‘closed-access’ models) means it can be challenging for any third parties who lack the necessary expertise, agency or resources to understand, control and influence the processing. This, in combination with the above analysis, surfaces two issues:
- substantial overarching decisions about the means of the processing at the development stage can in practice, to some degree, pre-determine the means of processing during deployment – deployers are unlikely to have sufficient influence or control if they are unable to change or understand the decisions behind the processing; and
- deployment risks may not be effectively identified or managed by third parties who are often defined as the sole controller for that part of the processing. 23
We see that organisations that seek to deploy generative AI for their own purposes as controllers may in practice face constraints (for example, a lack of in-house expertise or of access to the necessary information) on their ability to influence overarching decisions. In this case, they should consider:
- removing these constraints by requesting information from the developer that enables them to make informed decisions about the processing and compliance; or
- recognising that they lack control over at least some of the processing and identifying the party who does have control for that processing activity (i.e. the developer), who could be a controller or joint controller, to ensure clear accountability.
In practice, the relationship between developers and third-party deployers in the context of generative AI will mean there are often shared objectives and influence from both parties for the processing, which means it is likely to be a joint controllership instead of a processor-controller arrangement. Industry-funded research already appears to acknowledge the potential for developers and deployers to occupy shared roles in the processing activities. 24
Developers may be joint controllers for some aspects of deployment and processors for others. Determining the different processing activities with clarity will help all entities demarcate which processing they are controllers, joint controllers or processors for, and justify why. Different processing activities should not be lumped together when they serve different objectives or carry distinct data protection risks. For example, search engines built on top of a variety of algorithmic systems or, more recently, LLMs can have different capabilities, functions and risks from ‘traditional’ search engines that mainly use ranking systems. Distinct DPIAs may help demarcate the boundaries between them.
Generative AI accountability scenarios
What does this mean in practice?
We provide some examples below that translate the above analysis into practice. Where we refer to ‘processing activities’, it should be assumed that personal data is used in these activities.
Scenario 1: A ‘closed-access’ model provided as AIaaS
Entity A aims to develop a base model, ‘Core’, that they can then monetise by providing it to third parties (including entity B) or embedding it in their own business-to-consumer (B2C) services. This includes the following processing activities and allocation of roles/responsibilities:
| Processing activity | Role/Responsibility | Why? |
|---|---|---|
| 1. A collects training and fine-tuning data (from the internet, third parties, etc) | A is a controller | A solely decides purposes and means |
| 2. A uses the training data to develop the base model Core | A is a controller | A solely decides purposes and means |
| 3. A fine-tunes Core with its own data to create Core-1 | A is a controller | A solely decides purposes and means |
| 4. A develops its own B2C application on top of Core-1 | A is a controller | A solely decides purposes and means |
| 5. B sends its own customer data to entity A for fine-tuning Core to create Core-2, which B will use for its own purposes | B is a controller. A is most likely a processor, provided it is acting on behalf of B and B is able to verify that A provides sufficient guarantees of its compliance | B decides the purposes and means, as it is likely exercising full control and influence over how the data is processed and why |
| 6. A uses B’s data to fine-tune Core to create Core-2. Depending on B’s control and influence over the fine-tuning process, there are two options (see right) | 6.1 B is a controller and A is a processor | A processes B’s data on behalf of B and under B’s instructions, while B has meaningful influence and control over the overarching decisions during fine-tuning |
| | 6.2 A and B are joint controllers | A and B jointly determine the purposes and means. They may have different levels of influence (eg B may not have meaningful influence and control over the overarching decisions during fine-tuning) but the purposes and means are inextricably linked in terms of the outcome |
| 7. A deploys Core-2 on its own cloud on behalf of B to create inferences for B | 7.1 B is a controller and A is a processor | A processes B’s data on behalf of B and under B’s instructions, while B has meaningful influence and control over the overarching decisions during the deployment of the model to make inferences |
| | 7.2 A and B are joint controllers | A and B jointly determine the purposes and means. They may have different levels of influence, but the purposes and means are inextricably linked in terms of the outcome |
| 8. A processes data that data subjects input into Core-2 for later rounds of training Core | If 6.1 applies, A cannot process the data to further train Core. It can only do that if B agrees to the transfer as separate controllers | Later training is a different purpose and outside the original instructions of B |
| | If 6.2 applies, A is a separate controller | A solely decides the purposes and means for this processing, which requires a data-sharing arrangement between A and B and the processing to conform with the ICO’s Data Sharing code 25 |
Scenario 2: A model built on top of another ‘off-the-shelf’ model
Entity A provides an ‘off-the-shelf’ copy of base model Core to entity B. B wants to build a new public-facing application LookUp, combining Core with other models and processing activities (eg RAG, ranking algorithms, third-party services, etc).
| Processing activity | Role/Responsibility | Why? |
|---|---|---|
| 1. A uses its own data to train a base model Core | A is a controller | A solely decides purposes and means |
| 2. A sells an ‘off-the-shelf’ copy of Core to entity B | A could be a controller | A needs to examine any risks of reidentification inherent in the model to decide if it contains personal data |
| 3. Entity B builds LookUp, combining Core with other AI systems | B is a controller | B solely decides purposes and means |
| 4. Entity C provides consultancy services to B by accessing the query data fed into LookUp and undertaking sentiment analysis | 4.1 C is a processor and B is a controller | C processes B’s data on behalf of B and under B’s instructions, while B has meaningful influence and control over the purposes and means |
| | 4.2 C and B are joint controllers | If B’s instructions to C are high-level and, as a result, C uses a lot of its own judgement, then C could be a joint controller |
| 5. B uses the query data to train its own LLM | B is a controller | B solely decides purposes and means |
We understand that generative AI models – including assets such as weights, gradients or training data – are sometimes distributed through third-party model intermediaries 26 as ‘open-access’. We are interested in evidence on how organisations that run or use these platforms identify their accountability as controllers, processors or joint controllers.
Conclusion
The allocation of controller, joint controller or processor roles must reflect the actual levels of control and influence for each different processing activity taking place. Organisations must have the appropriate expertise, resources and agency to undertake the processing in a way that ensures the protection of people’s rights and freedoms.
The ICO understands that many players in the market have sought to frame their processing relationships as one of controller and processor, where in fact joint controllership may more accurately reflect the parties’ respective roles for particular processing activities. While we consult on these positions, we urge generative AI developers to examine joint controllership when considering their relationship with third parties that deploy their models. Joint controllership can be a useful approach for all parties (including data subjects) as it clarifies accountability and can mitigate compliance and reputational risks that could undermine trust in generative AI.
1 Pro-innovation Regulation of Technologies Review: Digital Technologies
2 AIaaS refers to out-of-box AI services provided by AI developing companies to potential third parties, utilising cloud functionalities
3 For examples of AI supply chain structures see: Expert explainer: Allocating accountability in AI supply chains
4 What is retrieval-augmented generation, and what does it do for generative AI? - The GitHub Blog
5 The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey | by Sandi Besen
6 Records of processing and lawful basis
7 See section ‘How may these issues apply in AI’: What are the accountability and governance implications of AI?
8 See How do we do a DPIA? for more information on the nature and scope of the processing.
10 What is special category data?
11 Rethinking open source generative AI: open-washing and the EU AI Act
12 Gradients can help you understand the model’s training process.
13 The Growing Threat of Data Leakage in Generative AI Apps
14 Generative AI first call for evidence: The lawful basis for web scraping to train generative AI models
15 Generative AI second call for evidence: Purpose limitation in the generative AI lifecycle
16 See the ’Generative AI models provided to third parties’ section in the first call for evidence: Generative AI first call for evidence: The lawful basis for web scraping to train generative AI models; See ’Defining a purpose’ section in the second call for evidence: Generative AI second call for evidence: Purpose limitation in the generative AI lifecycle
17 As we mentioned in the generative AI call for evidence on individual rights and generative AI, models trained on personal data are not just abstract code but carry imprints of the personal data they have been trained on. For example, Large Language Models (LLMs) may encode personal data from the training data in the form of embeddings. Embeddings can reflect value-laden dynamics between attributes (eg education, sex, etc) and an individual. They capture similarity or analogy, which can sometimes lead the model to eventually output information that reflects this analogy rather than observable reality. See the example of an Oxford academic being referenced as teaching at Cambridge: An AI Is Inventing Fake Quotes by Real People and Publishing Them Online
18 See Foundational Challenges in Assuring Alignment and Safety of Large Language Models, pages 46-47.
19 See ICO’s definition: Glossary
20 See Beyond the Safeguards: Exploring the Security Risks of ChatGPT; You Are What You Write: Preserving Privacy in the Era of Large Language Models; Teach LLMs to Phish: Stealing Private Information from Language Models
21 See ’Groups of LLM-Agents May Show Emergent Functionality’
22 See Practices for Governing Agentic AI Systems
23 Useful to note that Recital 78 asks developers of applications based on the processing of personal data to build their products in ways that enable deployers to comply with data protection and mitigate relevant risks.
24 See Practices for Governing Agentic AI Systems, p5-6.
25 Data sharing: a code of practice
26 See Moderating Model Marketplaces: Platform Governance Puzzles for AI Intermediaries by Robert Gorwa, Michael Veale