The ICO exists to empower you through information.

Issue 1: Regulatory certainty. When is genomic data personal data?

Under the UK GDPR, there is no explicit definition of genomic information as either a specific form of personal information or special category data. This means that, regardless of the purpose of processing, organisations should treat personally identifiable genomic information as personal information. Beyond this, organisations need to carefully consider:

  • when and why genomic information may be special category data; and
  • what risks large-scale classificatory uses of even non-special category personal information may pose.

Some key challenges are listed below.

Identifying personal and special category personal information

Genomic information isn’t always counted as personal information. A large part of the genome (roughly 98%) is non-coding DNA that is shared with every human being and, in isolation, may not be easily linked to a known person. 24 In other words, it is not expressed as a gene. The point at which genomic information becomes personal information is therefore hard to define. Context will matter hugely as developing research and analytical tools allow a greater understanding of how this genomic information, once dismissed as ‘junk DNA’, connects to how specific genes and phenotypes are expressed. For more information, organisations should consider our guidance on identifiers and related factors.

We have made assumptions here when defining genomic information in terms of both personal and special category information. However, further clarity is required as to how and when genetic and genomic information overlap for the purposes of data protection regulation. Not all genomic information is genetic, but certain provisions of the UK GDPR provide an indication of how organisations can consider such material under the legislation.

Where genomic information can be clearly defined as personal information, it is almost certainly special category information under Article 9 as it will contain some portion of genetic information which is already defined as special category data under the UK GDPR. Given this, the usual protections for special category data will apply.

There is also a risk that using the term ‘genomics’ to include associated areas of related inferences, such as phenotypes, may cause uncertainty as to what information may or may not be special category data. Phenotypes play a critical role in the development and use of genomic research. While phenotypic inferences made for medical purposes, such as identifying pre-cancerous growths, will be special category data under Article 9(1) of the UK GDPR, other commercial uses may not be. While these instances are currently rare, the pace of research is likely to change this. It may become possible to use phenotypes for educational or insurance purposes, as explored above. If phenotypes are considered distinct from genetic or genomic definitions, then it is unlikely that they would count as biometric information either, as explored in our previous report on emerging biometrics.

Even if there is no direct correlation between genomic information and special category data, the uses of genomic information for medical or identification purposes are nevertheless likely to involve special category genomic or health information under Article 9(1). Organisations will therefore need a lawful basis to use this information under Article 6 and an additional condition for processing special category data under Article 9(2). Organisations must identify the most appropriate basis for processing. Consent may be an appropriate lawful basis and an appropriate special category condition, as long as organisations can meet the necessary requirements for valid consent.

Conditions for processing special category data

If organisations use genomic information, they should consider whether consent is the most appropriate lawful basis. While medical consent remains a distinct and important issue, explicit consent for using personal information is only one of a variety of appropriate special category conditions under the UK GDPR. It is not inherently ‘better’ than other conditions.

Any wider automatic reliance on consent around the use of genomic information for consumer purposes could also cause confusion and may prove inappropriate under the UK GDPR. Given the wider dialogue around, and calls for, the use of consent, people may assume they automatically have the right to withdraw consent even when organisations have not used consent as a basis. In fact, organisations may use other appropriate lawful bases. They need to be transparent about which basis they are using and about people’s rights as a result. This transparency may prove more effective in helping people to truly understand how organisations are using their information and what their rights are.

Organisations should also be aware that the conditions for processing special category data are likely to be quite limited in a purely commercial scenario. Most of them concern public rather than private interests. There is a high threshold for processing special category data in most circumstances, and where there is no real public interest engaged, organisations will likely have to get explicit consent.

Issue 2: Third-party data, historic data, future data and genomic data

Genomic information poses significant challenges around the appropriate handling of third-party data. Familial information is fundamental to genomic information, and any disclosure will inherently involve additional people. Organisations can consider pseudonymisation techniques, but these bring their own challenges, as discussed below. We provide guidance as to how and when to appropriately share such third-party information.

This is a particular issue when considering appropriate disclosures in response to a subject access request. Use of genomic information or inferences derived from this may risk disclosing special category data linked to family members via genomic analysis. This has already been the case with genetic tests for inherited neurodegenerative conditions. 25 In such situations, inappropriate disclosure or withholding of personal information may present high risks of harm to people.

It also opens up the potential for third-party claims to information that organisations may assume relates to only one person. Given the nature of genomic information, it is likely that organisations will need to closely consider when and how both raw information and complex inferences may relate to and identify (or be capable of identifying) associated third parties. They will also need to consider when they should share this information. Organisations can use our guidance on how and when confidentiality and consent apply to help with this.

There is also an increased risk that direct-to-consumer services, like genomic counselling, may provide high-risk, high-impact information to people without the more traditional supporting structures of the health sector. Organisations using genomic information will need to pay careful attention to the context of their use. They will also have to put sufficient and appropriate security measures in place under Article 5(1)(f) and Article 32. Raw genomic information (i.e. the processed biological sample without further analysis) is currently unlikely to be easily identifiable by members of the public. However, this may rapidly change as access to AI processing increases, along with associated personal information. 26

Under the UK GDPR, information about a living person is personal data, while information about a deceased person is not. Genetic information has already blurred this distinction for historic information: genetic, and now genomic, information that was not previously linked to a living person (as defined under Article 4 of the UK GDPR) can now be linked through modern research techniques. Large-scale collection of genomic information heightens the risk further, given the information’s links to multiple living people, even when the person who provided the original sample and information has died.

Organisations retaining, gathering and using what they might assume to be historic information should take appropriate steps to consider whether they need to treat this information as personal data. This is likely to be highly contextual and will not relate to the biological samples themselves. Rather, processed information and associated inferences such as phenotypes may relate to living people and, given the context of other records, may count as pseudonymised rather than anonymised.

In contrast to historic information stands the challenge of potential future discoveries made with genomic information. Article 89 of the UK GDPR makes provision for information to be retained for longer periods and for broad purposes where it is used for research. However, organisations using personal information for research purposes must put safeguards in place and, in particular, follow the principle of data minimisation. Organisations will need to ensure that, if new potential means of using genomic information emerge, they pay close attention to the implications and requirements of a change in purpose of processing. They will also need to consider how this may fall within people’s expectations about the fair use of their information.

Issue 3: Inherent identifiability, anonymisation and security of genomic data

Genomic information is highly distinguishable. Linking a genome to a person requires only a few hundred of the millions of SNPs it contains. 27 In many instances of business or research, organisations may prefer to either anonymise or pseudonymise personal information, 28 to minimise potential risks and harms to people. While the large-scale nature of genomic information and its associated risks would seem an obvious fit for this, there are challenges to anonymising genomic information. The levels of encryption or data obfuscation introduced through methods like differential privacy may reduce the information’s value to genome-wide association studies (GWAS). 29

Approaches that may offer organisations some means of pseudonymisation include:

  • Data transformation is likely to remain limited in deployment. This is because genomic information is fundamentally open to re-identification through multiple pathways and correlations: it is high-dimensional information, where the number of dataset features is larger than the number of observations made. 30 Of the millions of SNPs within a genome, only a few hundred are required to identify someone. However, ongoing research in k-anonymity may offer future defences against re-identification.
  • Data obfuscation is achieved by adding noise to genomic information through techniques such as differential privacy (see the sketch after this list). This is likely to have a significant negative impact on effective genomic research.
  • Synthetic data offers another approach, through the AI-powered generation of large-scale datasets that do not involve people. However, complexities emerge as to when and whether the analysis and processing undertaken links to an identifiable person.
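
To illustrate the data obfuscation point, the following is a minimal Python sketch of the Laplace mechanism applied to aggregate allele counts. The counts, epsilon value and sensitivity here are hypothetical; a real deployment would need a careful sensitivity analysis and privacy budget. This is a sketch of the technique, not a recommended implementation.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_allele_counts(counts, epsilon, sensitivity=1.0):
    """Release aggregate allele counts under the Laplace mechanism.

    Assumes any one participant can change any single count by at most
    `sensitivity`, so noise drawn from Laplace(0, sensitivity / epsilon)
    gives epsilon-differential privacy for each released count.
    """
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=len(counts))
    # Counts cannot be negative, so clamp after adding noise.
    return np.clip(np.asarray(counts, dtype=float) + noise, 0.0, None)

# Hypothetical aggregate minor-allele counts at five SNP sites.
counts = [120, 87, 302, 15, 240]
print(noisy_allele_counts(counts, epsilon=1.0))
```

The trade-off described above is visible in the epsilon parameter: stronger privacy (a smaller epsilon) means larger noise, which in turn degrades the statistical power of downstream analyses such as GWAS.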

For more information on privacy-enhancing technologies and data protection, please see our guide to what these are and how you can use them to meet your UK GDPR obligations and expectations.

Organisations may find it difficult to achieve effective pseudonymisation as technology rapidly increases in power and accessibility. In the longer term, this trend would allow smaller organisations, and even individuals, to conduct what were once challenging and lengthy analyses. Given this, organisations must consider other forms of appropriate security.

This may include data aggregation achieved by implementing a trusted research environment (TRE) that can limit direct access to genomic information and associated health information. Researchers may submit queries without direct access to the underlying data. However, this may pose a challenge to forming open and agile research environments.
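
As a rough illustration of the query model described above, the sketch below shows an aggregate-only query interface with a minimum cohort-size threshold, a common small-cell disclosure control. The dataset, threshold and function names are hypothetical and do not reflect any particular TRE.

```python
import random
from dataclasses import dataclass
from statistics import mean

MIN_COHORT_SIZE = 10  # hypothetical disclosure-control threshold

@dataclass
class Record:
    variant_present: bool
    risk_score: float

# Illustrative synthetic records held inside the environment;
# researchers never receive row-level data.
random.seed(0)
_DATASET = [Record(random.random() < 0.3, random.random()) for _ in range(200)]

def mean_risk_for_variant(present: bool) -> float:
    """Answer an aggregate query without exposing individual records."""
    cohort = [r.risk_score for r in _DATASET if r.variant_present == present]
    if len(cohort) < MIN_COHORT_SIZE:
        # Refuse small cohorts to limit re-identification risk.
        raise PermissionError("Cohort too small to release an aggregate result.")
    return mean(cohort)

print(round(mean_risk_for_variant(True), 3))
```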

Issue 4: Fairness, accuracy and opinion – epigenetics and polygenic risk scoring

Article 5(1)(d) of the UK GDPR states that personal information must be ‘accurate’ and rectified promptly where this is not the case. While the UK GDPR does not provide a definition of accuracy, the Data Protection Act 2018 says that ‘inaccurate’ means ‘incorrect or misleading as to any matter of fact’. With genomic information, the extraction of the DNA itself may introduce inaccuracies. Alternatively, the loss of metadata or the introduction of ambiguous or uncertain labelling of aspects of a genome may also cause inaccuracies. 31

Inferences derived from genomic information can also change due to epigenetic modifications. This happens where the instructions accessed and read from DNA can be altered temporarily or even permanently by a specific environmental factor such as stress or diet. 32 Even at the most basic level, we expect organisations to review their genomic information for accuracy, keep it up to date and label it as historic where required.

The probabilistic and contextual nature of second order information, such as polygenic risk scores, means that its accuracy is more likely to be open to debate. Estimating the gap between phenotypic information and genetic information that exists in many areas of study will remain highly challenging. Personal information may be produced that is probabilistic rather than absolute, based upon combined genomic data and potentially limited phenotype information.
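
To make the probabilistic nature concrete: a polygenic risk score is conventionally computed as a weighted sum of genotype dosages. A minimal formulation (using standard notation from the genomics literature, not taken from this report) is:

```latex
\mathrm{PRS}_j = \sum_{i=1}^{M} \hat{\beta}_i \, x_{ij}
```

Here x_ij (taking the value 0, 1 or 2) counts the copies of the risk allele at SNP i carried by person j, and β̂_i is the effect size estimated for that SNP in a GWAS. Because each β̂_i is a statistical estimate drawn from a particular study population, the resulting score is an uncertain prediction rather than a measured fact, which is precisely why its accuracy is open to debate.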

The use of increasingly sophisticated AI models to estimate physical and behavioural aspects of a person brings its own risks around accuracy and transparency. 33 However, being based on statistical analysis rather than subjective opinion may limit the risk around personal information and accuracy under the UK GDPR. Nonetheless, AI systems must still be sufficiently statistically accurate for their purposes, both to comply with the fairness principle and where organisations are using the models for automated decision-making. 34

In the scenarios above, covering sectors ranging from healthcare to SEND provision to criminal charges, high-impact decisions may be taken by overlaying human opinion on the second order information produced by this type of processing. This in turn produces another set of personal (third order) information. There is a risk that the line becomes blurred between factual information (open to challenge as inaccurate) and opinion (which a person can have noted as challenged, but which remains a fundamentally accurate record of an opinion held). This blurring impacts on the ability to assess accuracy.

Our guidance on accuracy sets out measures to ensure that information is accurate or can be updated promptly and appropriately. It also provides details on the use of opinion to inform decisions and how to record challenges. Organisations will need to ensure the adequacy and accuracy of underlying information and to communicate how they reached their decisions in a transparent and robust manner.

It is also likely to become increasingly difficult for people to understand when organisations hold inaccurate personal information or have made inaccurate inferences. This is because genomic information itself is highly complex, and even inferences are likely to require significant technical knowledge to interpret. Combined with the challenges of ‘black box’ style algorithmic processing, the challenges to transparency may be significant. Organisations should follow our guidance on the lawfulness, fairness and transparency principle to ensure they meet our expectations. Furthermore, if the algorithmic processing amounts to automated decision-making within the meaning of Article 22, there are enhanced requirements relating to transparency.

Issue 5: Genomic determinism and discrimination

As increasingly large data sets are derived and analysed, new forms of discrimination may emerge. Without robust and independent verification of these models, there is a risk that these approaches will be rooted in systemic bias, providing inaccurate and discriminatory information about people and communities. In many instances, this information may then feed into automated systems, raising further questions over Article 22 processing and transparency.

There are also concerns about inappropriate discrimination arising from the current reliance on genomic data sets based on ancestrally European sources. 35 This focus is likely to generate inaccurate information and inferences about other communities and genetic ancestries. As a result, combining broader healthcare records with genomic information and phenotypes to develop outcomes for predictive treatments may reflect and enhance embedded bias and existing discrimination. 36

Active, rather than systemic, discrimination may also emerge. This may see specific traits, characteristics and information come to be seen as undesirable by organisations or groups, without being considered a protected characteristic. Alternatively, it may feed upon the perceived ‘accuracy’ of polygenic risk scoring, in which probabilistic tendencies become viewed as guaranteed outcomes. People may experience unfair treatment in the workplace or in the services they are offered based on previously unrecognised characteristics or existing physical or mental conditions. The UK GDPR already sets out requirements that may mitigate these issues. These include (but are not limited to):

  • the fairness principle;
  • protections for special category data;
  • requirements for data protection by design and default; and
  • protections for automated decision-making and profiling.

In the face of the above risks, organisations should consider our guidance on addressing fairness, bias and discrimination.

In non-medical contexts, genomic information may not be classified as special category data, reducing the legal safeguards and restrictions around its processing. This may result in organisations failing to implement best-practice technical security, leaving genomic information and its associated inferences exposed to loss or theft.

Issue 6: Data minimisation, purpose limitation and genomic information

Genomic information may pose particular challenges around data minimisation and purpose limitation, particularly in direct-to-consumer services where providers hold significant sections of genomes (and, in the future, potentially whole genomes) with the intention of making future findings accessible to consumers. Organisations will need to consider carefully whether processing entire genomes is necessary for their purposes, both in the initial analysis of raw information and in the longer term.

Multiomic analysis combines large omics information sets (which describe the entire biological processes of a biological system) to generate new insights and inferences. This approach highlights how readily the purpose of processing can shift, and organisations will again need to consider what information they need for a specific purpose. Fundamentally, this reinforces the significant challenges of gathering and using personal information in a rapidly moving area.

Issue 7: AI and genomics

A constant theme throughout the emerging uses of genomic information has been the algorithmic processing of large-scale, complex information. There is significant discussion of the implementation of AI in genomics, for both current and future research, which carries significant data protection implications. 37 Key drivers of the increased demand for algorithmic processing of genomic information include:

  • the size and complexity of genomic datasets;
  • the need for rapid specialist insights and inferences derived from the complex data in an intelligible format; and
  • correlation between health records and genomic information to interpret phenotypes.

Varying dataset formats, widely differing means of gathering information without universally agreed practices and standards outside the medical and research sectors, and an ancestrally Eurocentric focus in genomic data all risk embedding fundamental discrimination, as noted above. This risk may only be heightened by the use of multi-purpose models not initially trained for the purpose of analysing genomic data.

The use of polygenic risk scores to make decisions about individuals may amount to automated individual decision-making within the meaning of Article 22, depending on the level of human involvement and the extent to which these scores are determinative of outcomes for individuals. 38 Our guidance about automated decision-making and profiling sets out that a decision with a ‘similarly significant’ effect is one that has an equivalent impact on a person’s circumstances, behaviour or choices. The scenarios set out above, for both preventative healthcare and SEND provision, highlight potential areas in which rapid and potentially automated decisions may be made with significant consequences based upon probabilistic predictions.

Where automated processing does amount to solely automated individual decision-making or profiling within the meaning of Article 22, this may present a significant challenge to an organisation’s activities. Organisations may only carry out such processing when they meet one of the conditions in Article 22(2) (where this is necessary for a contract, is required or authorised by domestic law, or where someone gives their explicit consent). They can also only carry out automated individual decision-making based on special category personal information in very limited circumstances. People who are subject to such decisions have rights under the UK GDPR to obtain meaningful human intervention in the decision-making. They must have the opportunity to express their point of view and challenge decisions. Organisations must consider what appropriate intervention may look like for each situation.

There are also increased transparency requirements for organisations undertaking individual automated decision-making. They must provide people with meaningful information about the logic involved, as well as the significance and envisaged consequences (Articles 13(2)(f) and 14(2)(g)).


24 The human genome is, at long last, complete

25 Huntington's disease: Woman with gene fails in bid to sue NHS

26 The GDPR and genomic data

27 Classifying single nucleotide polymorphisms in humans

28 Noting that “pseudonymised personal information” is still personal data within the meaning of UK GDPR.

29 Sociotechnical safeguards for genomic data privacy

30 Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 11, 499–511 (2010)

31 Caudai, Claudia, Antonella Galizia, Filippo Geraci, Loredana Le Pera, Veronica Morea, Emanuele Salerno, Allegra Via, and Teresa Colombo. 2021. “AI Applications in Functional Genomics.” Computational and Structural Biotechnology Journal 19: 5762–90.

32 Environmental exposures influence multigenerational epigenetic transmission

33 Dias, Raquel, and Ali Torkamani. 2019. “Artificial Intelligence in Clinical and Genomic Diagnostics.” Genome Medicine 11 (1): 70.

34 See UK GDPR Recital 71 for further details.

35 Kessler, M., Yerges-Armstrong, L., Taub, M. et al. Challenges and disparities in the application of personalized genomic medicine to populations with African ancestry. Nat Commun 7, 12521 (2016).

36 Chen, Irene Y., Peter Szolovits, and Marzyeh Ghassemi. 2019. “Can AI Help Reduce Disparities in General Medical and Mental Health Care?” AMA Journal of Ethics 21 (2): E167-179.

37 DNA.I. - Ada Lovelace Institute

38 Article 22 regulates the circumstances in which solely automated decisions with legal effects or similarly significant effects on individuals may be taken.