Why facial recognition algorithms can't be perfectly fair
In June 2020, a facial-recognition algorithm led to the wrongful arrest of Robert Williams, an African-American man, for a crime he did not commit. After a shoplifting incident in a pricey area of Detroit, Michigan, his driver’s license photo was wrongly matched with a blurry video of the perpetrator. Police released him after several hours and apologized, but the episode raises serious questions about the accuracy of facial-recognition algorithms.
The troubling aspect of the story is that facial-recognition algorithms have been shown to be less accurate for Black faces than for white ones. But why do these algorithms make more mistakes for Black people than for white people, and what can be done about it?
To err is human… and algorithmic
Like any prediction algorithm, facial-recognition algorithms make probabilistic predictions based on incomplete data – a blurry photo, for example. Such predictions are never completely error-free, nor can they be. Since errors always exist, the questions are what level of error is acceptable, which kinds of errors should be prioritized, and whether every population group needs a strictly identical error rate.
Facial-recognition algorithms produce two kinds of errors: false positives and false negatives. The first occur when the algorithm thinks there’s a positive match between two facial images, but in fact there is no match (this was the case for Robert Williams). The second take place when the algorithm says there’s no match, but in fact there should be one.
The consequences of these two errors differ depending on the situation. For example, if the police use a facial-recognition algorithm in their efforts to locate a fugitive, a false positive can lead to the wrongful arrest of an innocent person. Alternatively, when border-control authorities use facial recognition to determine whether a person matches the passport he or she carries, a false positive will lead to an impostor crossing the border with a stolen passport. Each case requires a determination of the cost of different kinds of errors, and a decision on which kind of errors to prioritize. For example, if police are tracking potentially violent suspects, they may want to reduce the number of false negatives so the suspects can’t slip through, but this would drive up the number of false positives – in other words, people falsely arrested.
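The trade-off above can be sketched in a few lines of code. This is a minimal illustration, not a real matcher: the similarity scores, labels and thresholds below are all made up.

```python
# A face matcher outputs a similarity score; a decision threshold turns
# the score into a yes/no "match" answer. Lowering the threshold misses
# fewer true matches (false negatives) but flags more innocent people
# (false positives). All scores and labels here are hypothetical.

pairs = [  # (similarity score, truly the same person?)
    (0.95, True), (0.85, True), (0.60, True),
    (0.70, False), (0.55, False), (0.30, False),
]

for threshold in (0.90, 0.65, 0.50):
    fp = sum(1 for score, same in pairs if score >= threshold and not same)
    fn = sum(1 for score, same in pairs if score < threshold and same)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Running the sketch shows false positives rising as false negatives fall: with this toy data no threshold drives both to zero.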
Race and technology
Racism can arise when there is a higher rate of error, either false negative or false positive, for a subset of a population – for example, Blacks in the United States. These differential error rates are not programmed into the algorithm – if they were, it would be manifestly illegal. Instead, they slip in during the design and “training” process. Most developers send their algorithms to the US National Institute of Standards and Technology (NIST) to be tested for differential error rates across different parts of the population. NIST uses a large US government database of passport and visa photos, and tests each algorithm’s error rates across different nationalities. NIST publishes the results, which show huge error-rate variations for certain nationalities.
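A differential-error-rate check of the kind described above can be sketched as follows. The trial records and group names are entirely hypothetical; a real evaluation would use millions of comparisons.

```python
from collections import defaultdict

# (group, algorithm said "match", truly the same person) – made-up records.
trials = [
    ("group_a", True, False), ("group_a", False, False), ("group_a", True, True),
    ("group_b", True, False), ("group_b", True, False), ("group_b", False, False),
]

non_match_trials = defaultdict(int)  # comparisons where the truth is "no match"
false_matches = defaultdict(int)     # of those, how many the algorithm flagged

for group, said_match, same in trials:
    if not same:
        non_match_trials[group] += 1
        if said_match:
            false_matches[group] += 1

# Compare each group's false-positive rate – differing rates are exactly
# the differential performance a test like NIST's is looking for.
for group in sorted(non_match_trials):
    rate = false_matches[group] / non_match_trials[group]
    print(f"{group}: false-positive rate = {rate:.0%}")
```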
This kind of differential performance can be due to inadequate training data or an intrinsic limitation of the learning algorithm itself. If the training data contains a million examples of white males, but only two examples of black females, the learning algorithm will have difficulty distinguishing the faces of black females. The way to correct this is either to have training data that is representative of the entire population (which is nearly impossible), or to give different weights to the data in the training set to simulate the proportions that would exist in a data set covering the whole population.
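The reweighting idea can be illustrated with a short sketch. The group labels and counts are invented, and the formula shown (each group contributes equally in total) is one common weighting scheme, not the only one.

```python
from collections import Counter

# A toy, imbalanced training set – the labels and proportions are made up.
groups = ["white_male"] * 6 + ["black_female"] * 2

counts = Counter(groups)
n_examples, n_groups = len(groups), len(counts)

# Weight each example so that every group contributes equally overall:
# under-represented groups get proportionally larger weights.
weights = [n_examples / (n_groups * counts[g]) for g in groups]

print(f"majority weight = {weights[0]:.2f}, minority weight = {weights[-1]:.2f}")
```

With six examples of one group and two of the other, the minority examples are weighted three times more heavily, so both groups pull equally on the learning algorithm.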
Inadequate training data is not the sole cause of differential performance. Some algorithms have intrinsic difficulties extracting unique features from certain kinds of faces. For example, infants’ faces tend to look alike and thus are notoriously hard to distinguish from each other. Some algorithms cope better than others when given only a few training examples, but if these fixes don’t work, it may be possible to impose a “fairness constraint”, a rule that forces the algorithm to equalize performance among different population groups.
Unfortunately, this can have the effect of bringing down the level of performance for other groups, potentially to an unacceptable level. If we impose a fairness constraint, we also need to identify which population groups should be covered. Should a facial-recognition algorithm treat every possible skin colour and ethnic origin alike, including relatively small population groups? You can break down the population into an almost unlimited number of subgroups.
And what level of difference in performance can be tolerated between groups – do they have to be identical, or can we tolerate a certain percentage differential? And what is the effect of fairness constraints on algorithmic performance? Indeed, a perfectly non-discriminatory facial recognition algorithm may be perfectly useless.
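To make the idea concrete, here is one minimal way a fairness constraint on false positives could be implemented – a sketch under the assumption that each group gets its own decision threshold, with entirely hypothetical impostor scores.

```python
def threshold_for_fpr(impostor_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of impostor
    (truly non-matching) comparisons still score as a match."""
    scores = sorted(impostor_scores, reverse=True)
    k = int(target_fpr * len(scores))  # false matches we can tolerate
    if k == 0:
        return scores[0] + 1e-9        # just above every impostor score
    return scores[k - 1]               # k-th highest impostor score

# Hypothetical impostor scores for two hypothetical groups.
groups = {
    "group_a": [0.90, 0.70, 0.50, 0.30],
    "group_b": [0.80, 0.75, 0.60, 0.20],
}

# Equalize: every group is held to the same 25% false-positive rate.
for name, scores in groups.items():
    t = threshold_for_fpr(scores, target_fpr=0.25)
    print(f"{name}: threshold = {t:.2f}")
```

Note the cost: if one group’s impostor scores run higher, its threshold must rise, and with it that group’s false-negative rate – the performance trade-off described above.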
As a society, we make trade-offs like this every day. For algorithms, these trade-offs must be explicit: “less than perfect” fairness becomes an explicit design choice.
Another uncomfortable trade-off is whether we allow data on ethnicity or skin colour to be collected and used to help make algorithms less discriminatory. Europe generally prohibits the collection of data on ethnicity, and for good reason. Databases on ethnicity helped Nazis and cooperating governments locate and murder 6 million Jews in the 1940s. Yet data on ethnicity or skin colour can help make algorithms less racist. The data can help test algorithms for differential treatment, by permitting a test on only black- or brown-skinned individuals. Also, a “racially aware” algorithm can learn to compensate for discrimination by creating separate models for different population groups, for example a “black” model and a “non-black” model. But this runs against an important principle espoused in France and in other countries that rules should be colour blind.
If perfect fairness is impossible, should facial recognition technology be prohibited? Certain cities in the United States have imposed a moratorium on police use of facial recognition until issues of reliability and discrimination can be sorted out. The state of Washington has enacted a law to require testing and strict regulatory oversight for police use of facial recognition.
The law requires a study of the system’s differential impact on different subgroups of the population, and an obligation to introduce mitigation measures to correct performance differentials. Regulation, not prohibition, is the right approach, but regulation will require us to make a series of explicit choices that we’re not used to making, including the key question of how fair is “fair enough”.