As machine learning methods come to be more widely used,
there is a great deal of hand-wringing about whether they produce fair results. For example, ProPublica reported that a widely used program intended to assess the likelihood of criminal recidivism, that is, whether a person in custody would be likely to commit an additional crime, tended to over-estimate the probability that a black person would commit an additional crime and under-estimate the probability that a white person would. Amazon was said to
have abandoned a machine
learning system that evaluated resumes for potential hires, because that
program under-estimated the likely success of women and therefore recommended
against hiring them.
I don’t want to deny that these processes are biased, but I
do want to try to understand why they are biased and what we can do about it. The bias is not an inherent property of machine learning, and we would not find its source by investigating the algorithms that go into these systems.
The usual explanation is that the systems are trained on the
“wrong” data and merely perpetuate the biases of the past. If they were trained on unbiased
data, the explanation goes, they would achieve less biased results. Bias in the training data surely plays a
role, but I don’t think that it is the primary explanation for the bias.
Instead, it appears that the bias comes substantially from how
we approach the notion of fairness itself.
We assess fairness as if it were some property that should emerge
automatically, rather than a process that must be designed in.
What do we mean by fairness?
In the ProPublica analysis
of recidivism, the unfairness derived largely from the fact that when errors
are made, they tend to be in one direction for black defendants and in the other
direction for white defendants. This
bias means that black defendants are denied bail when they really do not
present a risk, and white defendants are given bail when they really should
remain in custody. That bias seems to be
inherently unfair, but the race of the defendant is not even considered explicitly
by the program that makes this prediction.
In the case of programs like the Amazon hiring recommendation
system, fairness would seem to imply that women and men with similar histories
be recommended for hiring at similar rates.
But again, the gender of the applicant is not among the factors
considered explicitly by the hiring system.
Race and gender are protected factors under US law (e.g., Title VII of the Civil
Rights Act of 1964). The law states
that “It shall be an unlawful employment practice for an employer … to
discriminate against any individual with respect to his compensation, terms,
conditions, or privileges of employment, because of such individual’s race,
color, religion, sex, or national origin.”
Although the recidivism system does not include race
explicitly in its assessment, it does include such factors as whether the
defendant has any family
members who have ever been arrested, whether they have financial resources,
etc. As I understand it, practically every
black person who might come before the court is likely to have at least one
family member who has been arrested, but that is less often true for whites. Black people are more likely than whites to
be arrested,
and once arrested, they are more likely than whites to be convicted and
incarcerated. Relative to their
proportion in the population, they are substantially
over-represented in the US prison system compared to whites. These correlations may be the result of other
biases, such as racism in the US, but they are not likely to be the result of any
intentional bias being inserted into the recidivism machine learning system. Black defendants are substantially more likely to be evaluated by the recidivism system, and more likely to be included in its training set, because of these same factors. I don’t believe that anyone set out to make
any of these systems biased.
The resumes written by men and women are often
different. Women tend to have more
interruptions in their work history; they tend to be less assertive about
seeking promotions; they use different language than men to talk about their
accomplishments. These tendencies, associated with gender, are available to the system, even without any desire to
impose a bias on the results. Men are
more likely to be considered for technical jobs at Amazon because they are more
likely to apply for them. Male resumes
are also more likely to be used in the training set, because historically, men
have filled a large majority of the technical jobs at Amazon.
One reason to be skeptical that imbalances in the training
set are sufficient to explain the bias of these systems is that machine
learning systems do not always learn what their designers think that they will
learn. Machine learning works by
adjusting internal parameters (for example the weights of a neural network) to
best realize a “mapping” from the inputs on which it is trained to the goal states it is given. If the system is
trained to recognize cat photos versus photos of other things, it will adjust
its internal parameters to most accurately achieve that result. The system is shown a lot of labeled pictures,
some of which contain cats, and some of which do not. Modern machine learning systems are quite
capable of learning distinctions like this, but there is no guarantee that they
learn the same features that a person would learn.
For example, even given many thousands of training examples
to classify photographs, a deep neural network system can still be “duped” into
classifying a photo
of a panda as a photo of a gibbon, even though both photos look to the human eye very much like a panda and not at all like a gibbon. All it took to cause this system to
misclassify the photo was to add a certain amount of apparently random visual
noise to the photograph. The misclassification when noise was added implies that the system had learned features, in this case patterns of individual pixels, that were disrupted by the noise, and not the features that a human would use.
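This pixel-level sensitivity can be sketched with a toy linear classifier standing in for the deep network. Everything here is invented for illustration (the dimension, the weights, the "panda"/"gibbon" labels); the point is only that moving every pixel by a mere 1% of the intensity range, in the direction the model's weights point, flips a confident classification:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                          # number of "pixels"

# A toy linear classifier standing in for the deep network:
# score = w . x, "gibbon" if the score is positive, otherwise "panda".
w = rng.normal(size=d)

# A synthetic "image" projected so the model scores it as a confident panda.
x = rng.uniform(0.0, 1.0, size=d)
x -= ((w @ x + 5.0) / (w @ w)) * w  # now w @ x == -5

# Adversarial perturbation: move every pixel by just 1% of the intensity
# range, in the direction that raises the score (the sign of the weight).
eps = 0.01
x_adv = x + eps * np.sign(w)

print(round(float(w @ x)))          # -5    ("panda")
print(float(w @ x_adv) > 0)         # True  (now "gibbon")
print(bool(np.allclose(np.abs(x_adv - x), eps)))  # True: each pixel barely moved
```

In a real deep network the perturbation follows the gradient of the loss rather than the raw weights (as in the fast-gradient-sign method), but the arithmetic is the same: thousands of tiny, individually invisible changes add up to a large change in the score.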
The recidivism and hiring systems, similarly, can learn to
make quite accurate predictions without having to consider the same factors
that a human might. People find some
features more important than others when classifying pictures. Computers are free to choose whatever
features will allow correct performance, whether a human would find them
important or not.
In many cases, the features that a system identifies are also
applicable to other examples that it has not seen, but there is often a decrease
in accuracy when a well-trained machine learning system is actually deployed by
a business and applied to items (e.g., resumes) that were not drawn from the
same group as the training set. The
bigger point is that for machine learning systems, the details can be more
important than the overall gist and the details may be associated with the
unfairness.
Simpson’s paradox and unfairness
A phenomenon related to this bias is called Simpson’s paradox, and
one of the most commonly cited examples of this so-called paradox concerns the appearance
of bias in the acceptance rate of men versus women to the University of
California graduate school.
The admission figures for the Berkeley campus for 1973
showed that 8442 men applied, of which 44% were accepted, and 4321 women applied,
of which only 35% were accepted. The
difference between 44% and 35% acceptance is substantial and could be a
violation of Title VII.
The difference in proportions would seem to indicate that the
admission process was unfairly biased toward men. But when the departments were considered
individually, the results looked much different. Graduate admission decisions are made by the
individual departments, such as English, or Psychology. The graduate school may administer the
process, but it plays no role in deciding who gets in. On deeper analysis it was found (Bickel, Hammel, & O’Connell, 1975) that six of the 85 departments showed a small bias toward admitting women and only four showed a small bias toward admitting men.
Although the acceptance rate for women was substantially lower than for men,
individual departments were slightly more likely to favor women than men. This is the apparent paradox: the departments were not biased against women, but the graduate school as a whole seemed to be.
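The reversal is easy to reproduce with invented numbers (these are not the actual Berkeley figures): two departments, each admitting women at a slightly higher rate than men, still produce a pooled rate that favors men when the two genders apply to the departments in different proportions:

```python
# Invented numbers (not the actual Berkeley figures): each department admits
# women at a slightly HIGHER rate than men.
admissions = {
    #                 (applied, admitted)
    "dept_A": {"men": (800, 480), "women": (100, 65)},   # 60.0% vs 65.0%
    "dept_B": {"men": (200, 20),  "women": (700, 85)},   # 10.0% vs 12.1%
}

for dept, groups in admissions.items():
    for sex, (applied, admitted) in groups.items():
        print(f"{dept} {sex}: {admitted / applied:.1%}")

# Pooled over departments, the direction reverses, because most women applied
# to the department that is hard for everyone to get into.
for sex in ("men", "women"):
    applied = sum(g[sex][0] for g in admissions.values())
    admitted = sum(g[sex][1] for g in admissions.values())
    print(f"overall {sex}: {admitted / applied:.1%}")   # men 50.0%, women 18.8%
```

Most men applied to dept_A (60% admission), most women to dept_B (roughly 12% admission), so the pooled rates diverge even though every department favors women.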
Rather, according to Bickel and associates, the apparent
mismatch derived from the fact that women applied to different departments on
average than the men did. Women were
more likely to apply to departments that had more competition for their
available slots and men were more likely to apply to departments that had relatively
more slots per applicant. In those days, the “hard” sciences attracted more
male applicants than female, but they were also better supported with teaching
assistantships and so on than the humanities departments that women were more
likely to apply to. Men applied on average to departments with high rates of
admission and women tended to apply to departments with low rates. The bias in admissions was apparently not
caused by the graduate school, but by the prior histories of the women, which
biased them away from the hard sciences and toward the humanities.
A lot has been written about Simpson’s paradox and
even whether it is a paradox at all. The
Berkeley admissions study as well as the gender bias and recidivism bias can
all be explained by the correlation between a factor of interest (gender or
race) and some other variable. Graduate admission rates were correlated with patterns of department selection; gender in resume analysis is correlated with such factors as work history and the language used to describe work; and race is correlated with the factors that recidivism predictors rely on. Although
these examples all show large discrepancies in the size of the two groups of
interest (many more men applied to graduate school, many more of the defendants
being considered were black rather than white, and many more of the Amazon
applicants were men), these differences will not disappear if all we do is add
training examples.
These systems are considered unfair, presumably because we
do not think that gender or race should play a causal role in whether people
are admitted, hired, or denied bail (e.g., Title VII). Yet, gender and race are apparently
correlated with factors that do affect these decisions. Statisticians call these correlated variables
confounding variables. The way to remove
them from the prediction is to treat them separately (hold them fixed). If the ability to predict recidivism is still
accurate when considering just blacks or just whites, then it may have some
value. If hiring evaluations are made for
men and women separately, then there can be no unintentional gender bias. Differences between men and women, then, cannot explain or cause the bias, because gender is held constant for any prediction within a group. One woman does not differ from other women in gender-related characteristics, and so those characteristics cannot contribute to a hiring bias toward men.
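A minimal sketch of this "hold it fixed" idea is stratified evaluation: instead of one pooled score, the predictor is scored separately within each group, so the group variable cannot vary inside any comparison. The records below are hypothetical (group label, predicted outcome, actual outcome):

```python
# Hypothetical records: (group, predicted outcome, actual outcome).
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 0, 1),
]

def accuracy_by_group(records):
    """Score the predictor separately within each group (group held fixed)."""
    totals, correct = {}, {}
    for group, predicted, actual in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (predicted == actual)
    return {g: correct[g] / totals[g] for g in totals}

pooled = sum(p == a for _, p, a in records) / len(records)
print(pooled)                      # 0.5
print(accuracy_by_group(records))  # {'A': 0.75, 'B': 0.25}
```

The pooled accuracy of 0.5 hides the fact that the predictor works well for group A and badly for group B; stratifying makes the disparity visible.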
We detect unfairness by ignoring a characteristic, for example,
race or gender, during the training process and then examining it during a
subsequent evaluation process. In
machine learning, that is often a recipe for disaster. Ignoring a feature during training means that the feature is uncontrolled in the result, and it would be surprising if the computer were able to produce fair results.
Hiring managers may or may not be able to ignore gender. The evidence is pretty clear that they cannot
really do it, but US law requires that they do. In an attempt to make these programs consistent
with laws like Title VII, their designers have explicitly avoided including
gender or race among the factors that are considered. In reality, however, gender and race are
still functionally present in the factors that correlate with them. Putting a man’s name on a woman’s resume does not make it a male resume, but asking how many of a defendant’s siblings have been arrested does provide information about the person’s race. The system can learn these associations. But what really causes the bias,
I think, is that these factors are not included as part of the system’s goals.
If fairness is really a goal of our machine
learning system, then it should be included as a criterion by which the success
of the system is judged. Program designers
leave these factors out of the evaluation because they mistakenly (in my
opinion) believe that the law requires them to leave them out, but machines are
unlikely to learn about them unless they are included. I am not a lawyer, but I believe that the law
concerns the outcome of the process, not the means by which that outcome is
achieved. If these factors are left out
of the training evaluation, then any resemblance of a machine learning process
to a fair one is entirely coincidental.
By explicitly evaluating for fairness, fairness can be achieved. That is
what I think is missing from these processes.
The goals of machine learning need not be limited to just the
accuracy of a judgment. Other criteria, including fairness, can be part of the goal for which the machine learning process is being optimized. The same
kind of approach of explicitly treating factors that must be treated fairly can
be used in other areas where fairness is a concern, including mapping of voting
districts (gerrymandering), college admissions, and grant allocations. Fairness can be achieved by discriminating
among the factors that we use to assess fairness and including these factors directly
and explicitly in our models. By
discriminating we are much more likely to achieve fairness than by leaving
these factors to chance in a world where factors are not actually independent
of one another.