Developed in collaboration with Hazy and Nationwide Building Society.
At a glance
In this case study, RetailBank, a high street bank, intends to use a third-party solution developed by CareDetect to predict which individuals are vulnerable based on their income and spending patterns, in order to provide appropriate support to them. RetailBank decides that it does not want to use data from real transactions to test the effectiveness of the CareDetect product. Using real customer transaction data (much of which is personal data) for testing introduces security risks and does not comply with the data minimisation principle. Instead, RetailBank decides to use synthetic data to test the effectiveness of the solution. This approach minimises the amount of personal data processed and reduces the risk of customer data being accessed inappropriately, while allowing RetailBank to effectively test the CareDetect product prior to procuring it.
Context
There are two parties involved in the generation of the synthetic data: RetailBank and an external synthetic data solutions provider, SynthGen, which will assist RetailBank in generating the synthetic data. The synthetic data solution provided by SynthGen is deployed on RetailBank’s premises, to ensure that the original transaction data remains within RetailBank’s controlled infrastructure. Once the synthetic data is generated, it is shared with CareDetect, whose solution for predicting vulnerable individuals is being tested by RetailBank.
The dataset to be synthesised consists of several tables containing fields found in an Open Banking transaction feed, which include:
- customer personal data comprising various direct and indirect identifiers,
- account details,
- transaction amounts and locations,
- merchant names and categories (the type of services each merchant provides).
The dataset covers over 50,000 customers with approximately 100,000 accounts (which are made up of both current accounts and credit cards) spanning around 50 million transactions over a period of 18 months.
Objective
SynthGen develops a solution using data synthesis algorithms that preserve the common behavioural signals associated with vulnerable individuals that are present in the original transaction data. These signals include patterns of low income, heavy loans and betting behaviour. The processing requires analysis of average transactions per month, including handling outlier cases where some accounts have a large number of transactions and others have very few.
The synthetic data also needs to maintain realistic behaviours across different customer age groups and preserve underrepresented groups in the data. To do this, SynthGen oversamples some regions of the data by generating slightly more data in the regions where the typical outliers or underrepresented data reside. They also create a rule that labels records matching the vulnerability criteria defined by the FCA, RetailBank and betting companies. The labelling process ensures that synthetic vulnerable individuals match these rules and expected behaviours, and that CareDetect’s solution would correctly detect the labelled synthetic vulnerable individuals.
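The labelling rule and the oversampling step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `CustomerSummary` fields, the thresholds in `is_vulnerable` and the `oversample` helper are hypothetical and are not the criteria or method used in the case study. In particular, simple resampling with replacement stands in here for generating extra synthetic records in the underrepresented regions.

```python
import random
from dataclasses import dataclass


@dataclass
class CustomerSummary:
    # Hypothetical per-customer monthly aggregates; field names are illustrative.
    monthly_income: float
    monthly_loan_repayments: float
    monthly_betting_spend: float


def is_vulnerable(c, low_income=1000.0, loan_ratio=0.4, betting_ratio=0.2):
    """Label a customer as vulnerable if any illustrative rule fires.

    The thresholds are placeholders, not the criteria used in the case study.
    """
    if c.monthly_income <= 0:
        return True
    return (
        c.monthly_income < low_income
        or c.monthly_loan_repayments / c.monthly_income > loan_ratio
        or c.monthly_betting_spend / c.monthly_income > betting_ratio
    )


def oversample(records, is_rare, factor=2, seed=0):
    """Return the records plus extra resampled copies of the rare ones.

    With factor=2, the rare group appears twice as often as before.
    """
    rng = random.Random(seed)
    rare = [r for r in records if is_rare(r)]
    extra = [rng.choice(rare) for _ in range((factor - 1) * len(rare))] if rare else []
    return list(records) + extra
```

Applying `is_vulnerable` to both the original and the synthetic records gives a simple consistency check: the proportion of labelled individuals should be comparable before any deliberate oversampling is applied.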
Technical measures
SynthGen combines two techniques – generative machine learning models to create the synthetic data and differential privacy to reduce the identifiability of the data. RetailBank and SynthGen also collaborate to measure the quality of the synthetic data, by comparing it to the real customer data.
The diagram below shows an overview of the processing. A generative model takes the real customer transaction data as input and, during the training phase, updates its internal parameters to learn the probability distribution of the data. Once the model is trained, the real customer transaction data is deleted, and the model is transferred out of the production environment to the internal RetailBank team that is evaluating the CareDetect product for identifying vulnerable customers. The internal team uses the model to generate the synthetic data, and then uses the synthetic data to test the different solutions. During the generation phase, RetailBank can trigger retraining of the models, meaning it can update or add training data when it becomes available (without any additional configuration work). The trained parameters (i.e. the set of internal variables the model has learned in order to capture the underlying distribution and correlations in the production data) are sampled to generate new synthetic data. Three different models were used for the generation process:
- Two Bayesian network models were used for the customers and accounts tables, as they are best suited to modelling static features and attributes which do not change over time.
- An autoregressive model was used for the transactions table, as this is best suited to sampling time series data (features changing through time).
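To make the idea of autoregressive sampling concrete, the sketch below fits about the simplest autoregressive model possible: a first-order model in which each merchant category depends only on the previous one. This is an assumption for illustration only. The case study does not describe the architecture of the model actually used, and the function names and toy categories are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_first_order(sequences):
    """Count category-to-category transitions; "<start>" marks the sequence start."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = "<start>"
        for item in seq:
            counts[prev][item] += 1
            prev = item
    return counts


def sample_sequence(counts, length, seed=0):
    """Sample one sequence, each step conditioned on the previous one."""
    rng = random.Random(seed)
    prev, out = "<start>", []
    for _ in range(length):
        options = counts.get(prev)
        if not options:  # no observed continuation: stop early
            break
        items = list(options.keys())
        weights = list(options.values())
        prev = rng.choices(items, weights=weights, k=1)[0]
        out.append(prev)
    return out


# Toy usage with made-up category histories.
history = [
    ["groceries", "betting", "betting"],
    ["groceries", "transport", "groceries"],
]
model = fit_first_order(history)
synthetic = sample_sequence(model, length=5, seed=1)
```

The transition counts play the role of the trained parameters: once they are fitted, new sequences can be sampled without access to the original histories.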
SynthGen uses various metrics to make sure that key characteristics and patterns are present in the synthetic data. Each measure calculates and compares properties of the real and synthetic datasets. These measures include:
- comparison of probability distributions over key attributes, such as transaction amounts and counts as well as initial, running, and final balances.
- comparison of co-dependencies (i.e. how two columns correlate with each other) between pairs of attributes.
- quality of classification tasks, such as classifying behavioural patterns into life events.
- importance given to various attributes (for example, age group, account balance and transaction amount) when using ML to predict other attributes, such as merchant category (the type of goods or services the merchant provides).
- autocorrelation of transaction time series (i.e. how an entire time series correlates with itself as it is progressively shifted in time, which measures how a variable's past values relate to its current values).
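Two of these measures can be sketched in a few lines. The functions below are illustrative assumptions rather than the metrics SynthGen actually implemented: `total_variation_distance` is one common way of comparing two empirical distributions over a categorical (or binned) attribute, and `autocorrelation` is the standard sample autocorrelation at a given lag.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """Half the L1 distance between two empirical categorical distributions.

    0.0 means identical distributions; 1.0 means no overlap.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var
```

In use, each function would be evaluated on the real data and on the synthetic data (or on both together, for the distance), and a large gap between the two would indicate that the synthetic data has not preserved that property.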
Organisational measures
RetailBank follows strict technical and organisational measures to safeguard the integrity and confidentiality of data. These include:
- governance and security checks to ensure that no data is transferred outside the secure production environment of RetailBank.
- ensuring the synthetic data solution provided by SynthGen is deployed on RetailBank’s premises, so that the original transaction data remains within RetailBank’s controlled infrastructure.
- a standardised contractual agreement with CareDetect that strictly prohibits any re-identification of real customers.
How do the technical and organisational measures achieve the objective?
Using synthetic data, RetailBank is able to significantly reduce the time needed to provision data for the purposes of evaluating the effectiveness of the CareDetect product used to predict vulnerable customers.
Using generative models to generate synthetic data minimises the risk of re-identification as there is no one-to-one mapping from a real to a synthetic customer.
Risk and mitigation
In some cases, generative models may generate personal data by overfitting, i.e. inadvertently memorising specific records and replicating them, either exactly or approximately, in the synthetic data. Several technical measures were used to mitigate the risk of individuals’ data being generated by overfitting:
- generative models were trained while satisfying differential privacy, which involves introducing carefully calibrated noise into the parameter updates.
- RetailBank trained several models on the data using different levels of epsilon to determine the optimal balance of identifiability and utility. An epsilon of 5 was chosen as this allowed their purposes to be fulfilled while minimising the identifiability of the synthetic data.
- real account/transaction/merchant names/merchant IDs were discarded and replaced with new synthetic IDs generated in the same format, to mitigate the risk of linkability.
- rare behaviours which are more susceptible to re-identification (e.g. accounts with too few or too many transactions) were modelled with less precision compared to more common behaviours (e.g. customers with typical transaction patterns), to minimise the risk of singling out and linkability.
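As an illustration of what introducing calibrated noise into the parameter updates can look like, the sketch below follows the widely used clip-and-add-Gaussian-noise pattern from differentially private gradient descent. This is an assumption: the case study does not say which mechanism was used, and the names `clip`, `noisy_mean_gradient`, `max_norm` and `noise_multiplier` are hypothetical. Translating a noise level into a specific epsilon (such as 5) requires a privacy accountant, which is not shown.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any single record can influence the update;
    the noise standard deviation is noise_multiplier * max_norm.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

A larger `noise_multiplier` corresponds to a smaller epsilon (stronger privacy) at the cost of noisier updates, which is the identifiability and utility trade-off described above.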