Case study: pseudonymising employee data for recruitment analytics
Developed in collaboration with Anonos
Context
Rangreen operates in the UK, EU, and US with approximately 25,000 employees. It uses an internally developed applicant tracking system (ATS) database to track job applications over the last two years, holding approximately 100,000 records.
The ATS data is encrypted and uploaded to a third-party cloud platform for storage.
The information in the ATS is gathered from candidates who applied for a role at Rangreen. The candidate privacy notice explains the ATS data processing purpose and the retention period for successful and unsuccessful candidates.
Each applicant record contains:
- several types of direct identifiers (name, email, ID number, address);
- demographic information (age, education and work history);
- details about the recruiting process (role applied for, application source, interview and assessment scores, and process metrics); and
- several outcome measures (offered a job, acceptance, and tenure).
Prior to making any use of the ATS data for analytic purposes, the records are pseudonymised. This process, which is explained below, results in information that:
- cannot be attributed to any specific candidate without accessing additional data which Rangreen holds in a separate system; and
- is not accessible to the data analysis teams.
Information relating to candidates who were unsuccessful is anonymised six months after the recruitment outcome, while data about current employees is kept in pseudonymised form for the two-year period. All data from the ATS is deleted after two years, whether anonymised or pseudonymised.
Objective
Rangreen processes candidates’ personal data, including those who become employees. This helps them to understand the characteristics of those candidates who are most likely to accept offers of employment and remain with the organisation for a substantial period. For example, Rangreen would like to understand whether there are factors that make employees likely to resign in the first two years, so they can provide additional training and career opportunities and boost retention.
Technical measures
The company has identified pseudonymisation as the appropriate technique, as it allows them to apply protection to the data while still preserving all the utility they need for the desired processing, including building predictive models using machine learning. Using pseudonymisation will also help them demonstrate, under the accountability principle, that they are practising data protection by design and by default.
They use a software application to perform pseudonymisation by connecting to the database to retrieve and transform the data, creating two outputs:
- a pseudonymised dataset which does not contain any information which can be attributed to a specific person; and
- the ‘additional information’, held by Rangreen, which is used to re-identify people.
The following table shows a sample cleartext candidate record from the ATS, the intermediary pseudonymisation techniques used (data suppression or generalisation), and the output of the cryptographic hashing technique, which is then stored in a database.
The Row R-DDID value allows Rangreen to link the pseudonymised ATS records with the corresponding original cleartext record, which is kept separately and not accessible to the data analysis teams at Rangreen. For data about unsuccessful candidates, the underlying cleartext data is deleted after six months so the Row R-DDID can no longer be used for relinking.
Field name | Cleartext Record | Pre-Pseudonymisation Transformation | Pseudonymised Transformed Value |
---|---|---|---|
Row R-DDID | None | Random Value | R-6asd54fa+sdf16as5d1fa6d51fa6df516as5d1fa6sd51fa6sd50 |
ID | 51213984 | Random | ZaXutakNdAPIIC-4MHSAC6Sg62Krj_5AUua1NSRsdiZmiYb3LOiw |
Name and Surname | Debra Hines | Omit | N/A |
Email | [email protected] | Omit | N/A |
Gender | female | No transformation | WODrkiUAsA3_FFae07YMTkW4YWHqGWU |
Age | 30 | 10 Year Binning (age range) | dTKI00A41W3CIF_aEUcOsYTOEFR91yo |
City | London | Omit | N/A |
Country | UK | No transformation | OmwLfGb8Zo1PsD1cfdGAnT7dLKVF |
Highest degree | Masters | No transformation | tRDmKY_TPviRqRBDFoSm_hwVLMov |
College | University of Cambridge | No transformation | TdreYysrswINKKAikpBOQzTrXb1HF27f_ezUFg |
Major | Statistics | No transformation | m2pRsJ2BT26QhVa602rXhRKEAhWWKsmafYyiBRCWCeyxilk |
Job Title | IT Professional | No transformation | NTTaT2h-UGKXOyCKX2ncCEDFAIrJa/5k5kXEPOhsOw |
Tenure | 2.9 | No transformation | 2.9 |
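The per-record transformation shown in the table can be sketched as follows. This is a minimal illustration, not Rangreen's actual implementation: the field names mirror the table, the HMAC key stands in for the separately held "additional information", and the ENISA salt handling is simplified to a single secret key.

```python
import base64
import hashlib
import hmac
import secrets

# Secret HMAC key; in practice this is part of the "additional information"
# held separately from the pseudonymised dataset. (Illustrative only.)
HMAC_KEY = secrets.token_bytes(32)


def hmac_pseudonym(value: str) -> str:
    """Keyed hash of a field value (HMAC-SHA256, URL-safe base64)."""
    digest = hmac.new(HMAC_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def bin_age(age: int, width: int = 10) -> str:
    """Generalise a precise age into a 10-year range, e.g. 30 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymise(record: dict) -> tuple[dict, dict]:
    """Return (pseudonymised record, lookup-table entry keyed by Row R-DDID)."""
    row_ddid = "R-" + secrets.token_urlsafe(24)  # random row identifier
    out = {"Row R-DDID": row_ddid}
    for field, value in record.items():
        if field in ("Name", "Email", "City"):
            continue  # direct identifiers are omitted (suppression)
        if field == "Age":
            out[field] = hmac_pseudonym(bin_age(int(value)))  # generalise, then hash
        elif field == "Tenure":
            out[field] = value  # numeric outcome kept in clear for modelling
        else:
            out[field] = hmac_pseudonym(str(value))  # indirect identifiers hashed
    # The cleartext lookup entry is held separately and deleted on schedule.
    return out, {row_ddid: dict(record)}
```

Because the hash is keyed and deterministic, equal cleartext values map to equal pseudonyms, which is what lets the machine learning systems analyse the hashed values directly.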
The following is a description of the technical measures used in the above use case:
- Omitting direct identifiers.
- Using generalisation to turn precise ages into ranges.
- Replacing all human-readable indirect identifiers with hashed values, as the machine learning systems used by Rangreen can carry out the analysis on the hashed values without needing to access the underlying data. Rangreen uses a hash-based message authentication code (HMAC) to do this, following the hashing-with-key-and-salt procedure set out by ENISA, and applies k-anonymity scoring to quasi-identifiers (ie combinations of indirect identifiers that are highly identifying in combination) in order to defeat singling out attacks.
- Rangreen carries out research to identify the appropriate value to assign for k and establishes that this continues to be an area of ongoing research internationally. While there are no commonly accepted standards, Rangreen learns that a value of 5 is not uncommon. They then test this value by trying to re-identify the pseudonymised dataset without accessing the lookup table, and establish that they cannot do this with k=5 but can re-identify some individuals when k=3. They therefore set the value of k to 5.
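The k-anonymity scoring described above can be sketched as a group-size check over the quasi-identifier columns: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The field names below are illustrative, not taken from Rangreen's schema.

```python
from collections import Counter


def min_group_size(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Smallest number of records sharing one combination of
    quasi-identifier values. The dataset is k-anonymous for any
    k less than or equal to this value."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def satisfies_k(records: list[dict], quasi_identifiers: list[str], k: int = 5) -> bool:
    """True if no quasi-identifier combination singles out fewer than k people."""
    return min_group_size(records, quasi_identifiers) >= k
```

In Rangreen's procedure, a failing check (min group size below the chosen k) would prompt further generalisation, for example widening the age bins, before the data is released to the analysis teams.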
Organisational measures
The pseudonymisation application that Rangreen uses implements several organisational controls including:
- Separating responsibilities via group-based permissions that restrict which data sources and protection configuration different users have access to. This means that not all teams within Rangreen have access to the data, and access is granted only to select people who require the data for their authorised processing.
- Segregating duties via role-based permissions that restrict the ability to configure protections, approve the protections, transform data to pseudonymised form, or reverse the protection to different people, who are limited in number.
- Log files that allow for auditability of user actions in the application.
How do the technical and organisational measures achieve the objective?
Rangreen uses a combination of techniques that provide effective protection against re-identification for parties without access to the additional information.
However, because of the way the protections are applied (eg pseudonymising categorical fields with a limited, and usually fixed, number of possible values), there is no impact on the accuracy of results relative to processing the cleartext.
The organisational measures applied (separation of responsibility, segregation of duties and log files) effectively reduce the likelihood of accidental or intentional misuse of data by requiring multiple people to accomplish both protection and reversal, and logging actions and approvals.
Because of this, the data could be used for internal analysis without making any direct inferences about individual employees. This mitigates the risk of harm for those employees.
As the identity of the employees is not relevant to this analysis, Rangreen had initially considered anonymisation. However, Rangreen decided that it was not feasible to anonymise as the methods available to them would have removed too much information from the dataset and reduced its value for the intended processing.
The cleartext ATS data contains directly identifiable personal data. The use of personal data in this form for analytics poses a risk to Rangreen’s applicants and employees. Though this data is held on employee records accessible to the HR team, its disclosure to Rangreen’s analytics team would give unnecessary access to employee details to people and teams with no HR responsibilities.
Pseudonymising the data means that it is not possible to re-identify specific people within the pseudonymised data without access to the additional information held separately by Rangreen. Therefore, the risk to people is significantly reduced.
Risk and mitigation
Rangreen then identifies and assesses risk using an empirical statistical framework that measures any residual risk of identification via a three-step procedure:
- performing privacy attacks against the dataset under evaluation;
- measuring the success of such attacks; and
- quantifying any residual privacy risk.
This statistical framework evaluates the resilience of the protected data output against the different types of privacy risks represented by attack-based evaluations for singling out, linkability, and inference risks. These are the three key indicators for determining whether information is personal data or not. If the residual risk is not sufficiently remote, the applied technical protections are adjusted and risk measurement repeated as necessary.
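A simple proxy for one of these indicators, singling out, is the share of records that are unique on their quasi-identifier values: any such record can be isolated by an attacker without the additional information. This is a minimal sketch of that one metric, not the full attack-based framework the case study describes; the field names are illustrative.

```python
from collections import Counter


def singling_out_risk(records: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of records that are unique on the quasi-identifiers,
    ie vulnerable to a singling out attack. 0.0 means no record can
    be isolated this way; 1.0 means every record can."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(records) if records else 0.0
```

Under the iterative procedure described above, a residual risk that is not sufficiently remote would trigger an adjustment of the technical protections (eg coarser generalisation) and a repeat of the measurement.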
The technical and organisational controls are in the first instance implemented as risk mitigation measures themselves, based on the effectiveness achieved against the use case requirements and the assessed risks. Rangreen concludes that no additional mitigation measures are required.