How do we ensure the accuracy of personal information in our profiling tools?

In detail

The accuracy principle in data protection law means you must:

  • take all reasonable steps to ensure the personal information you use and generate in your profiling tools is not incorrect or misleading about any matter of fact;
  • keep the personal information up-to-date, if necessary; and
  • consider any challenges from users about the accuracy of the information you process in your profiling tools.

This is particularly important when you use the personal information generated by your profiling tools to decide whether to take moderation action against users and, if so, what type of action to take.

For example, if you use a profiling tool to determine that a user’s behaviour breaches your terms of service and make a record of this on their account, you must: 

  • ensure that this record is accurate;
  • keep it up-to-date, where necessary; and
  • take reasonable steps to rectify or erase the record without undue delay, if a user shows they have not breached your terms of service. 

You must put processes in place so that users can challenge the accuracy of any information generated by, or used in, your profiling tools.

You could provide users with a dashboard where they can view, manage and update the information you hold about them. 

In many cases, the outputs of profiling tools are not meant to be treated as factual information about a user. Instead, they are predictions about the likelihood of them exhibiting certain behaviours. This can be further complicated in situations where the behaviour you might be trying to predict or classify is subjective.

To avoid these outputs being misinterpreted as factual, you should ensure that your records indicate that the outputs are statistically informed guesses, rather than facts. 

What about statistical accuracy?

Many profiling tools involve AI and automation. This means you must consider what’s known as ‘statistical accuracy’.

It’s important to understand that this isn’t quite the same as ‘accuracy’ in data protection law. The accuracy principle is about ensuring that the personal information you process is accurate, and where necessary, kept up-to-date (see earlier section). 

Statistical accuracy is about how often an AI system guesses the correct answer, measured against correctly-labelled test data.
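
As a minimal illustration (not drawn from this guidance), statistical accuracy can be measured by comparing a system’s outputs against correctly-labelled test data. The labels and predictions in this Python sketch are hypothetical:

```python
# Minimal sketch: measuring statistical accuracy against correctly-labelled
# test data. The labels and predictions below are hypothetical.

true_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # ground truth: 1 = bot, 0 = genuine
predictions = [1, 0, 0, 1, 0, 1, 1, 0]   # the profiling tool's guesses

correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)
print(f"Statistical accuracy: {accuracy:.0%}")  # 6 of 8 correct -> 75%
```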

You must consider statistical accuracy as part of your compliance with the fairness principle. (See the section on How do we ensure our use of profiling tools is fair? for more information about the fairness principle.)

If you are using profiling tools that make predictions or inferences about people, you must ensure that they are sufficiently statistically accurate for your purposes. This does not mean that they need to be 100% statistically accurate, but you should consider the possibility of incorrect assessments of users and the impact this might have on them. 

There are different measures of statistical accuracy that can reflect the balance your system strikes between false positive and false negative results (see the box below for the definitions of ‘false positive’ and ‘false negative’). These measures, illustrated in the sketch after this list, include:

  • precision, the percentage of cases identified as positive that are in fact positive. For example, if your profiling tool classes 10 accounts as bots and nine of them really are bots, then its precision is 90%; or
  • recall, the percentage of all cases that are in fact positive that are identified as such. For example, if 10 out of 100 accounts are bots but your profiling tool only identifies seven of them, then its recall is 70%.
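
To make the two measures concrete, here is a minimal Python sketch (not part of the guidance itself) that reproduces the figures from the worked examples above; all counts are hypothetical:

```python
# Precision: of the accounts flagged as bots, how many really are bots?
flagged_as_bots = 10    # accounts the tool classed as bots
true_positives = 9      # of those, accounts that really are bots
precision = true_positives / flagged_as_bots
print(f"Precision: {precision:.0%}")    # 90%

# Recall: of all the actual bot accounts, how many did the tool find?
actual_bots = 10        # bot accounts among the 100 accounts overall
found = 7               # bots the tool correctly identified
recall = found / actual_bots
print(f"Recall: {recall:.0%}")          # 70%
```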

There are trade-offs between precision and recall. For example, if you place more importance on finding as many bot accounts as possible (maximising recall), this may come at the cost of some false positives (lowering precision). 
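
One way to see this trade-off is to vary the decision threshold a tool applies to its ‘bot score’. The scores, labels and thresholds in this Python sketch are hypothetical; lowering the threshold catches more bots (higher recall) but flags more genuine accounts (lower precision):

```python
# Minimal sketch: how a decision threshold trades precision against recall.
# The bot scores and labels are hypothetical; 1 = bot, 0 = genuine account.

scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    1,    0,    0,    0,    0]

def precision_recall(threshold):
    """Flag every account scoring at or above the threshold, then score the result."""
    flagged = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(flagged)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / sum(labels)
    return precision, recall

# A strict threshold favours precision; a lenient one favours recall.
for threshold in (0.80, 0.40):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold:.2f}: precision {p:.0%}, recall {r:.0%}")
# threshold 0.80: precision 100%, recall 60%
# threshold 0.40: precision 71%, recall 100%
```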

You should prioritise your balance of precision and recall based on the severity and nature of the risks to users. You should consider:

  • the context you deploy your profiling tool in;
  • the severity of the harm your tool aims to tackle;
  • the outcomes of your systems (including what moderation actions you apply); and
  • the consequences for users. 

(You can find more information about accuracy and statistical accuracy in our guidance on AI and data protection.)

Definitions

A false positive occurs when your system incorrectly identifies a user as exhibiting a certain behaviour or characteristic (eg classifying a user as a bot, when they are actually genuine). Systems optimised for precision are likely to give rise to fewer false positives.   

A false negative occurs when your system fails to identify a user who is exhibiting the behaviour or characteristic you want to detect (eg not classifying a user as a bot, when in fact they are a bot account). Systems optimised for recall are likely to give rise to fewer false negatives.
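
As a final hypothetical illustration, both failure modes can be counted directly by comparing a tool’s classifications with the ground truth:

```python
# Minimal sketch: counting false positives and false negatives.
# Hypothetical labels and predictions; 1 = bot, 0 = genuine.

labels      = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 0]

false_positives = sum(p == 1 and l == 0 for p, l in zip(predictions, labels))
false_negatives = sum(p == 0 and l == 1 for p, l in zip(predictions, labels))
print(f"False positives: {false_positives}, false negatives: {false_negatives}")
```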