

Anonymisation: the process of rendering data in such a way that the people the data relates to are not, or are no longer, identifiable.

Anonymous information: information that does not relate to an identified or identifiable living person, or personal data rendered anonymous in such a way that the person is not, or is no longer, identifiable. Anonymous information is not subject to the UK GDPR.

Aggregated data: statistical data about several people that has been combined to show general trends or values without identifying people within the data.

Asymmetric encryption: a form of encryption that uses different keys for encryption and decryption.

Brute-force attack: an attack that systematically iterates through all possible combinations of inputs until the correct one is found. In the context of hashing, the goal is to determine the original data that generated a given hash.
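
For illustration only, the sketch below brute-forces a hashed four-digit PIN; the PIN value and the small input space are assumptions made to keep the example short:

```python
import hashlib

# Suppose an attacker has obtained the SHA-256 hash of a four-digit PIN.
target_hash = hashlib.sha256(b"4831").hexdigest()

def brute_force_pin(target):
    """Try every possible four-digit PIN until one produces the target hash."""
    for candidate in range(10000):
        pin = f"{candidate:04d}".encode()
        if hashlib.sha256(pin).hexdigest() == target:
            return pin.decode()
    return None

print(brute_force_pin(target_hash))  # -> "4831"
```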

Background knowledge attack: a specific form of attack where an adversary possesses prior knowledge or additional information about the target person they intend to re-identify.

Data release: any process of data dissemination where the data controller no longer directly controls who has access to the data. This ranges from general licensing arrangements (such as end user licensing where access is available to certain classes of people for certain purposes), through to fully open data where access is unrestricted.

Data utility: the value of a given data release as an analytical resource. The key issue is whether, and how well, the data represents whatever it is supposed to represent. Anonymisation methods can have an adverse effect on data utility. Ideally, the goal of any anonymisation process should be to maximise data utility whilst minimising the risk of identification.

Dataset: any collection of data about a defined set of entities. This usually refers to data where data units are distinguishable (ie not summary statistics).

De-identification: personal data that has been processed in such a way that it can no longer be attributed, without more information, to a specific data subject (see section 171 of the DPA 2018).

De-identified: data that has been subject to de-identification. It is considered equivalent to pseudonymised data under UK GDPR.

Differential privacy: a mathematical framework that quantifies the privacy loss resulting from the inclusion of a person's data in a dataset. It ensures that the effect of any single person's record on the output of an analysis is limited.
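
One common mechanism within this framework is the Laplace mechanism. The sketch below is a minimal illustration only; the epsilon value and the counting query are assumptions, not a recommended configuration:

```python
import random

def dp_count(values, epsilon=0.5):
    """Return a differentially private count using the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one person changes
    the true count by at most 1), so noise is drawn from Laplace(0, 1/epsilon).
    """
    true_count = len(values)
    # The difference of two independent exponential samples with the same rate
    # is Laplace-distributed.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

print(dp_count(["alice", "bob", "carol"]))  # true count 3, plus random noise
```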

Direct identifier: any data item that, on its own, could uniquely identify a person. Examples include a person’s name, address and unique reference numbers (eg their social security number or National Health Service number).

Disclosure control methods: methods for reducing identification risk, usually based on restricting the amount of, or modifying, the data released.

Disclosure risk: the probability that a motivated intruder identifies or reveals new information, or both, about at least one person in disseminated data. Because anonymisation is difficult and has to be balanced against data utility, the risk that a disclosure will happen will never be zero. In other words, there will be a remote risk of identification present in all useful anonymised data.

Disclosure: the act of making data available to one or more third parties.

Encryption: a mathematical function that encodes data in such a way that only authorised users can access it.

Generalisation: a set of techniques that modifies the scale of data by grouping people, making identification more difficult. It involves aggregating data to a higher level of abstraction, such as age groups or geographic regions.
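
For example (the band boundaries below are illustrative assumptions), exact ages might be generalised into coarser age bands:

```python
def generalise_age(age):
    """Replace an exact age with a coarser age band."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"

ages = [23, 51, 67, 12]
print([generalise_age(a) for a in ages])  # ['18-39', '40-64', '65+', '0-17']
```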

Hashing: a process using a one-way mathematical function that transforms input data into a fixed-length output known as a hash. It ensures data integrity and confidentiality by making the data unintelligible. Unlike encryption, hashing is irreversible without access to additional information (eg the original identifiers and other information used to generate the hash).
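
A minimal sketch using Python's standard library; the identifier value is fabricated for illustration:

```python
import hashlib

nhs_number = "9434765919"  # example value only, not a real record
digest = hashlib.sha256(nhs_number.encode()).hexdigest()

# The digest is a fixed-length output that cannot be reversed directly,
# although low-entropy inputs remain vulnerable (see 'Brute-force attack',
# 'Salting' and 'Pepper').
print(digest)
```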

Homogeneity attack: in k-anonymisation, a homogeneity attack refers to a vulnerability where an adversary exploits the lack of diversity in the sensitive values shared by records with the same indirect identifiers (such as age, gender and postcode). Even though no single record can be distinguished within such a group, the adversary learns the sensitive value for everyone in it.
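
The sketch below illustrates the problem; the column names and values are fabricated assumptions. A group that is 3-anonymous on its indirect identifiers still discloses the diagnosis when every record in the group shares the same sensitive value:

```python
from collections import defaultdict

# (age band, postcode district) are the indirect identifiers;
# the diagnosis is the sensitive attribute.
records = [
    ("40-49", "LS1", "heart disease"),
    ("40-49", "LS1", "heart disease"),
    ("40-49", "LS1", "heart disease"),
    ("30-39", "LS2", "asthma"),
    ("30-39", "LS2", "diabetes"),
    ("30-39", "LS2", "heart disease"),
]

groups = defaultdict(set)
for age_band, postcode, diagnosis in records:
    groups[(age_band, postcode)].add(diagnosis)

for key, diagnoses in groups.items():
    if len(diagnoses) == 1:
        # No diversity: anyone known to fall in this group has this diagnosis.
        print(f"Homogeneous group {key}: diagnosis disclosed as {next(iter(diagnoses))}")
```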

Identifiability: refers to the question of whether one person can be distinguished from other people.

Identifiable person: a living person who can be identified via singling out or linking with other data.

Identified person: a person (natural person) identified via singling out or linking with other data.

Inferences: the potential to infer, guess or predict details about someone. In other words, using information from various sources to deduce something about a person.  

Indirect identifiers: any piece of information (or combination of pieces of information) that does not identify a person on its own but can be used to identify them when combined with other information. Also sometimes known as quasi-identifiers.

K-anonymity: a privacy concept where each record in a dataset is indistinguishable from at least k-1 other records. It ensures that no person can be singled out based on the available information.
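
A simple way to check this property (the choice of quasi-identifier columns below is an assumption) is to group records by their indirect identifiers and confirm that every group contains at least k records:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

records = [
    {"age_band": "18-39", "postcode": "LS1", "income": 31000},
    {"age_band": "18-39", "postcode": "LS1", "income": 27000},
    {"age_band": "40-64", "postcode": "LS2", "income": 45000},
]
print(is_k_anonymous(records, ["age_band", "postcode"], k=2))  # False
```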

Key variable: a variable common to two (or more) datasets, which may therefore be used for record linkage between them. More generally, in scenario analysis, a variable likely to be accessible to a motivated intruder.

Limited access: releasing data within a closed community (ie where a finite number of researchers or institutions have access to the data and where its further disclosure is prohibited).

Linkability: the concept of combining multiple records about the same person or group of people. These records may be in a single system or across different systems.

Masking: replacing sensitive data with fictional or scrambled values while preserving the data's format. Common examples include replacing names with pseudonyms or masking credit card numbers.
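
A minimal sketch of format-preserving masking; the card number is fabricated for illustration:

```python
def mask_card_number(card_number):
    """Keep only the last four digits, preserving the original format."""
    return "".join(
        c if i >= len(card_number) - 4 or not c.isdigit() else "*"
        for i, c in enumerate(card_number)
    )

print(mask_card_number("4929 1234 5678 9012"))  # "**** **** **** 9012"
```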

Motivated intruder: someone who wishes to identify a person from the anonymous information that is derived from their personal data. Motivated intruders are sometimes referred to as attackers, snoopers or adversaries.

Motivated intruder test: a test which considers all the practical steps and all the means that are reasonably likely to be used by someone who is motivated to identify the people whose personal data the anonymous information is derived from. The test is used to assess the identifiability risk of (apparently) anonymous information.

Noise addition: introducing random noise to numerical data to prevent precise identification. For example, adding a small random value to ages or income levels.
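
For instance (the size of the noise below is an assumption chosen purely for illustration):

```python
import random

ages = [34, 52, 29, 61]
# Perturb each age by a small random offset of up to two years.
noisy_ages = [age + random.randint(-2, 2) for age in ages]
print(noisy_ages)
```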

Open data: data that can be freely used, re-used and redistributed by anyone, subject only, at most, to requirements to attribute and share alike.

Plaintext: in cryptography, plaintext refers to information that has not been encrypted (or has been decrypted) and is therefore readable.

Pepper: a secret value added to the input (eg a password) during the hashing process. Unlike a ‘salt’, which is stored with the hashed passwords in a database, a pepper is stored separately, often in a different medium. This enhances security: even if an attacker gains access to the hashed passwords and salts, they would still need the pepper to crack the hashes.
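
One common construction uses an HMAC keyed with the pepper. The sketch below is illustrative only; the environment-variable name, fallback value and helper name are assumptions:

```python
import hashlib
import hmac
import os

# The pepper is held outside the database, eg in a secrets manager or an
# environment variable on the application server.
pepper = os.environ.get("PASSWORD_PEPPER", "example-pepper-for-illustration").encode()

def peppered_hash(password, salt):
    """Hash a password with a per-user salt and an application-wide pepper."""
    return hmac.new(pepper, salt + password.encode(), hashlib.sha256).hexdigest()

salt = os.urandom(16)  # the salt is stored alongside the resulting hash
stored = peppered_hash("correct horse battery staple", salt)
print(stored)
```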

Permutation: swapping or shuffling records in the data by switching values of variables across pairs of records. This approach aims to introduce uncertainty as to whether records correspond to real data elements and increases the difficulty of identifying people by linking together different information relating to them.
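
A minimal sketch (the column names and values are assumptions) that shuffles one variable independently of the rest of each record:

```python
import random

records = [
    {"id": 1, "age_band": "18-39", "income": 31000},
    {"id": 2, "age_band": "40-64", "income": 45000},
    {"id": 3, "age_band": "65+",   "income": 22000},
]

# Swap the income values across records while leaving the other fields alone,
# so no record is guaranteed to carry its original value.
incomes = [r["income"] for r in records]
random.shuffle(incomes)
for record, income in zip(records, incomes):
    record["income"] = income

print(records)
```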

Pseudonymisation: a term defined in UK GDPR as the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is subject to technical and organisational measures to keep it separate.  

Pseudonymisation secret: the additional information used to re-identify people from pseudonymised data. This could be an encryption key, the salts and peppers used for hashing, or a mapping table.

Pseudonymous data: data that can no longer be attributed to a person without using additional information.

Publishing: the act of making data publicly available.

Qualitative data: data gathered and analysed in a non-numeric form, such as interview transcripts, field notes, video and audio recordings, still images, and documents such as reports, meeting minutes and emails.

Randomisation: this technique involves randomly altering the content of data within a set range. By introducing noise or randomness, it makes identification more challenging. Randomisation can be applied to attributes like dates, ages, or geographic coordinates.

Reasonably likely: the key test for identifiability. It is about whether there are any means that are "reasonably likely" to be used by the organisation holding the information, or another person, to identify someone, directly or indirectly.

Record linkage: a process that combines records about the same population units in different datasets to produce a single dataset.

Re-identification: the act of a person knowingly or recklessly re-identifying information that is de-identified personal data without the consent of the controller responsible for de-identifying the personal data.

Salting: adding random data (a ‘salt’) to sensitive information before hashing it. Salting enhances security by preventing attackers from easily matching the hash to the original data, for example by running likely inputs through the same hashing algorithm.
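
A minimal sketch using a random per-record salt; the identifier value is fabricated for illustration:

```python
import hashlib
import os

identifier = "jane.doe@example.com"  # fabricated example value
salt = os.urandom(16)                # random salt, stored with the hash

salted_hash = hashlib.sha256(salt + identifier.encode()).hexdigest()
print(salt.hex(), salted_hash)

# Without knowing the salt, an attacker cannot precompute hashes of likely
# inputs and match them against this value.
```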

Secure multi-party computation (SMPC): a protocol (a set of rules for transmitting information between computers) that allows at least two different parties to jointly process their combined information, without any party needing to share all of its information with each of the other parties.

Singling out: the process of distinguishing data relating to one person from data relating to other people in order to treat that one person differently. 

Statistical data: information which is held in the form of numerical data, nominal data (eg gender, ethnicity, region), ordinal data (age group, qualification level), interval data (month of birth) or ratio data (age in months).

Suppression: a disclosure control process where parts of the data are made unavailable to the user. The term is usually used to describe approaches like cell suppression, the removal of outliers and local suppression of particular values within microdata records.

Symmetric encryption: a form of encryption that uses the same key for encryption and decryption.
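
A brief sketch using the Fernet recipe from the third-party `cryptography` package; this is one possible illustration, not the only way to apply symmetric encryption, and the plaintext is fabricated:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # the single shared secret key
f = Fernet(key)

ciphertext = f.encrypt(b"patient notes: example plaintext")
plaintext = f.decrypt(ciphertext)

print(plaintext)  # b'patient notes: example plaintext'
```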

Synthetic data: data that have been generated from one or more models of the original data. This may or may not be anonymous.

Tabular data: aggregate information on entities presented in tables.

Tokenisation: the process of replacing sensitive data with unique tokens or identifiers. Tokenised data can be used as a pseudonymisation technique for analysis without revealing the original values.
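
A minimal sketch; in practice the mapping table (the pseudonymisation secret) would be held securely and separately from the tokenised data, and the record values below are fabricated:

```python
import secrets

token_map = {}  # mapping table: token -> original value (kept separately)

def tokenise(value):
    """Replace a sensitive value with a random token, recording the mapping."""
    token = secrets.token_hex(8)
    token_map[token] = value
    return token

record = {"name": "Jane Doe", "diagnosis": "asthma"}
record["name"] = tokenise(record["name"])
print(record)  # the name is replaced by an opaque token
```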

Trusted research environment (TRE): a secure environment that a researcher can enter to perform analysis on data, subject to strict access and output controls. TREs are also commonly known as secure data environments (SDEs) or data safe havens.

Trusted third party (TTP): an independent entity used by two or more parties to hold data used for a collaborative project (eg if the parties don’t have the expertise to store it securely or if they want to increase the protections for the data by avoiding sending whole datasets to each other).

‘Whose hands?’: a way to think about the status of information in the ‘hands’ of other people when disclosing it to them. For example, depending on the circumstances, information may be personal data ‘in your hands’ but not in someone else’s.