Case study: pseudonymising employee data for recruitment analytics
Developed in collaboration with Anonos
Context
Rangreen operates in the UK, EU, and US with approximately 25,000 employees. It uses an internally developed applicant tracking system (ATS) database to track job applications over the last two years, holding approximately 100,000 records.
The ATS data is encrypted and uploaded to a third-party cloud platform for storage.
The information in the ATS is gathered from candidates who applied for a role at Rangreen. The candidate privacy notice explains the ATS data processing purpose and the retention period for successful and unsuccessful candidates.
Each applicant record contains:
- several types of direct identifiers (name, email, ID number, address);
- demographic information (age, education and work history);
- details about the recruiting process (role applied for, application source, interview and assessment scores, and process metrics); and
- several outcome measures (offered a job, acceptance, and tenure).
Prior to making any use of the ATS data for analytic purposes, the records are pseudonymised. This process, which is explained below, results in information that:
- cannot be attributed to any specific candidate without accessing additional data which Rangreen holds in a separate system; and
- is not accessible to the data analysis teams.
Information relating to candidates who were unsuccessful is anonymised six months after the recruitment outcome, while data about current employees is kept in pseudonymised form for the two-year period. All data from the ATS is deleted after two years, whether anonymised or pseudonymised.
Objective
Rangreen processes candidates’ personal data, including those who become employees. This helps them to understand the characteristics of those candidates who are most likely to accept offers of employment and remain with the organisation for a substantial period. For example, Rangreen would like to understand whether there are factors that make employees likely to resign in the first two years, so they can provide additional training and career opportunities and boost retention.
Technical measures
The company has identified pseudonymisation as the appropriate technique, as it allows them to apply protection to the data while still preserving all the utility they need for the desired processing, including building predictive models using machine learning. Using pseudonymisation will also help them demonstrate, under the accountability principle, that they are practising data protection by design and by default.
They use a software application to perform pseudonymisation by connecting to the database to retrieve and transform the data, creating two outputs:
- a pseudonymised dataset which does not contain any information which can be attributed to a specific person; and
- the ‘additional information’, held by Rangreen, which is used to re-identify people.
The following table shows a sample cleartext candidate record from the ATS, the intermediary pseudonymisation techniques used (data suppression or generalisation), and the output of the cryptographic hashing technique, which is then stored in a database.
The Row R-DDID value allows Rangreen to link the pseudonymised ATS records with the corresponding original cleartext record, which is kept separately and not accessible to the data analysis teams at Rangreen. For data about unsuccessful candidates, the underlying cleartext data is deleted after six months so the Row R-DDID can no longer be used for relinking.
Field name | Cleartext Record | Pre-Pseudonymisation Transformation | Pseudonymised Transformed Value |
---|---|---|---|
Row R-DDID | None | Random Value | R-6asd54fa+sdf16as5d1fa6d51fa6df516as5d1fa6sd51fa6sd50 |
ID | 51213984 | Random | ZaXutakNdAPIIC-4MHSAC6Sg62Krj_5AUua1NSRsdiZmiYb3LOiw |
Name and Surname | Debra Hines | Omit | N/A |
Email | [email protected] | Omit | N/A |
Gender | female | No transformation | WODrkiUAsA3_FFae07YMTkW4YWHqGWU |
Age | 30 | 10 Year Binning (age range) | dTKI00A41W3CIF_aEUcOsYTOEFR91yo |
City | London | Omit | N/A |
Country | UK | No transformation | OmwLfGb8Zo1PsD1cfdGAnT7dLKVF |
Highest degree | Masters | No transformation | tRDmKY_TPviRqRBDFoSm_hwVLMov |
College | University of Cambridge | No transformation | TdreYysrswINKKAikpBOQzTrXb1HF27f_ezUFg |
Major | Statistics | No transformation | m2pRsJ2BT26QhVa602rXhRKEAhWWKsmafYyiBRCWCeyxilk |
Job Title | IT Professional | No transformation | NTTaT2h-UGKXOyCKX2ncCEDFAIrJa/5k5kXEPOhsOw |
Tenure | 2.9 | No transformation | 2.9 |
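The per-record transformation shown in the table can be sketched as follows. This is a minimal illustration, not Rangreen's actual implementation: the field names mirror the table, the HMAC key stands in for the separately held "additional information", and the ENISA salt handling is simplified to a single secret key.

```python
import base64
import hashlib
import hmac
import secrets

# Secret HMAC key; in practice this is part of the "additional information"
# held separately from the pseudonymised dataset. (Illustrative only.)
HMAC_KEY = secrets.token_bytes(32)


def hmac_pseudonym(value: str) -> str:
    """Keyed hash of a field value (HMAC-SHA256, URL-safe base64)."""
    digest = hmac.new(HMAC_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def bin_age(age: int, width: int = 10) -> str:
    """Generalise a precise age into a 10-year range, e.g. 30 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymise(record: dict) -> tuple[dict, dict]:
    """Return (pseudonymised record, lookup-table entry keyed by Row R-DDID)."""
    row_ddid = "R-" + secrets.token_urlsafe(24)  # random row identifier
    out = {"Row R-DDID": row_ddid}
    for field, value in record.items():
        if field in ("Name", "Email", "City"):
            continue  # direct identifiers are omitted (suppression)
        if field == "Age":
            out[field] = hmac_pseudonym(bin_age(int(value)))  # generalise, then hash
        elif field == "Tenure":
            out[field] = value  # numeric outcome kept in clear for modelling
        else:
            out[field] = hmac_pseudonym(str(value))  # indirect identifiers hashed
    # The cleartext lookup entry is held separately and deleted on schedule.
    return out, {row_ddid: dict(record)}
```

Because the hash is keyed and deterministic, equal cleartext values map to equal pseudonyms, which is what lets the machine learning systems analyse the hashed values directly.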
The following is a description of the technical measures used in the above use case:
- Omitting direct identifiers.
- Using generalisation to turn precise ages into ranges.
- Replacing all human-readable indirect identifiers with hashed values, as the machine learning systems used by Rangreen can carry out the analysis on the hashed values without needing to access the underlying data. Rangreen uses a hash-based message authentication code (HMAC) to do this, following the hashing-with-key-and-salt procedure set out by ENISA, and applies k-anonymity scoring to quasi-identifiers (ie combinations of indirect identifiers that are highly identifying in combination) in order to defeat singling out attacks.
- Rangreen carries out research to identify the appropriate value to assign for k and establishes that this continues to be an area of ongoing research internationally. While there are no commonly accepted standards, Rangreen learns that a value of 5 is not uncommon. They then test this value by trying to re-identify the pseudonymised dataset without accessing the lookup table, and establish that they cannot do this with k=5 but can re-identify some individuals when k=3. They therefore set the value of k to 5.
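The k-anonymity scoring described above can be sketched as a group-size check over the quasi-identifier columns: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The field names below are illustrative, not taken from Rangreen's schema.

```python
from collections import Counter


def min_group_size(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Smallest number of records sharing one combination of
    quasi-identifier values. The dataset is k-anonymous for any
    k less than or equal to this value."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def satisfies_k(records: list[dict], quasi_identifiers: list[str], k: int = 5) -> bool:
    """True if no quasi-identifier combination singles out fewer than k people."""
    return min_group_size(records, quasi_identifiers) >= k
```

In Rangreen's procedure, a failing check (min group size below the chosen k) would prompt further generalisation, for example widening the age bins, before the data is released to the analysis teams.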
Organisational measures
The pseudonymisation application that Rangreen uses implements several organisational controls including:
- Separating responsibilities via group-based permissions that restrict which data sources and protection configuration different users have access to. This means that not all teams within Rangreen have access to the data, and access is granted only to select people who require the data for their authorised processing.
- Segregating duties via role-based permissions that restrict the ability to configure protections, approve the protections, transform data to pseudonymised form, or reverse the protection to different people, who are limited in number.
- Log files that allow for auditability of user actions in the application.
How do the technical and organisational measures achieve the objective?
Rangreen uses a combination of techniques that provide effective protection against re-identification for parties without access to the additional information.
However, because of the way the protections are applied (eg pseudonymising categorical fields with a limited, and usually fixed, number of possible values), there is no impact on the accuracy of results relative to processing the cleartext.
The organisational measures applied (separation of responsibility, segregation of duties and log files) effectively reduce the likelihood of accidental or intentional misuse of data by requiring multiple people to accomplish both protection and reversal, and logging actions and approvals.
Because of this, the data could be used for internal analysis without making any direct inferences about individual employees. This mitigates the risk of harm for those employees.
As the identity of the employees is not relevant to this analysis, Rangreen had initially considered anonymisation. However, Rangreen decided that it was not feasible to anonymise as the methods available to them would have removed too much information from the dataset and reduced its value for the intended processing.
The cleartext ATS data contains directly identifiable personal data. The use of personal data in this form for analytics poses a risk to Rangreen’s applicants and employees. Though this data is held on employee records accessible to the HR team, its disclosure to Rangreen’s analytics team would give unnecessary access to employee details to people and teams with no HR responsibilities.
Pseudonymising the data means that it is not possible to re-identify specific people within the pseudonymised data without access to the additional information held separately by Rangreen. Therefore, the risk to people is significantly reduced.
Risk and mitigation
Rangreen then identifies and assesses risk using an empirical statistical framework that measures any residual risk of identification via a three-step procedure:
- performing privacy attacks against the dataset under evaluation;
- measuring the success of such attacks; and
- quantifying any residual privacy risk.
This statistical framework evaluates the resilience of the protected data output against the different types of privacy risks represented by attack-based evaluations for singling out, linkability, and inference risks. These are the three key indicators for determining whether information is personal data or not. If the residual risk is not sufficiently remote, the applied technical protections are adjusted and risk measurement repeated as necessary.
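A simple proxy for one of these indicators, singling out, is the share of records that are unique on their quasi-identifier values: any such record can be isolated by an attacker without the additional information. This is a minimal sketch of that one metric, not the full attack-based framework the case study describes; the field names are illustrative.

```python
from collections import Counter


def singling_out_risk(records: list[dict], quasi_identifiers: list[str]) -> float:
    """Fraction of records that are unique on the quasi-identifiers,
    ie vulnerable to a singling out attack. 0.0 means no record can
    be isolated this way; 1.0 means every record can."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(records) if records else 0.0
```

Under the iterative procedure described above, a residual risk that is not sufficiently remote would trigger an adjustment of the technical protections (eg coarser generalisation) and a repeat of the measurement.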
The technical and organisational controls are in the first instance implemented as risk mitigation measures themselves, based on the effectiveness achieved against the use case requirements and the assessed risks. Rangreen concludes that no additional mitigation measures are required.