Sunday, April 7, 2019
The sum of three cubes problem and why it is interesting for artificial intelligence
The holy grail of artificial intelligence is the quest to develop artificial general intelligence. Considerable progress has been made on specific forms of intelligence, and computers can now perform many tasks at superhuman levels. But extending these specific capabilities to more general intelligence has so far proven elusive.
One of the most important barriers to extending computational intelligence from specific to general capabilities is an inadequate understanding of the different kinds of problems that a general intelligence system would have to solve. Research has focused on a narrow range of problems with the apparent expectation that solving them will eventually lead to general problem solving. These are problems that can be solved by parameter optimization: the learning algorithm adjusts the values of its model parameters to gradually approximate the desired output of the system. But there are problems that cannot be solved using these methods.
The sum of three cubes problem is one of them. Conceptually, it is not very complicated, and it could in principle be solved by brute force, that is, by trying numbers until a solution is found. Still, it resisted solution for some numbers despite more than half a century of effort.
In general form, the three cubes problem is this: for an integer k, find three integers x, y, and z such that k = x³ + y³ + z³. The integers may be positive or negative, which is what makes the search space unbounded. For example, the integer 29 can be expressed as 29 = 3³ + 1³ + 1³ (29 = 27 + 1 + 1). It is easy to determine that some numbers cannot be represented as the sum of three cubes (no number that leaves a remainder of 4 or 5 when divided by 9 can be, and 32 is one such number), but until just recently, no one knew whether the integer 33 could be. Is there some set of three integers that satisfies the equation 33 = x³ + y³ + z³? In fact, until recently, there were only two numbers below 100 for which a solution was unknown: 33 and 42. All of the others were either known to be impossible or the three integers were known.
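To make the structure of the search concrete, here is a minimal brute-force sketch in Python (my own illustration, not Booker’s method). It checks every pair (x, y) within a small bound and tests whether the remainder k − x³ − y³ is itself a perfect cube. It finds a representation of 29 immediately, but the known solutions for 33 and 42 involve integers with sixteen or more digits, hopelessly far beyond any bound a naive search could cover.

```python
def integer_cube_root(n):
    """Return c such that c**3 == n, or None if n is not a perfect cube."""
    if n < 0:
        r = integer_cube_root(-n)
        return -r if r is not None else None
    r = round(n ** (1 / 3))
    for c in (r - 1, r, r + 1):  # float cube roots are inexact; check neighbors
        if c ** 3 == n:
            return c
    return None


def sum_of_three_cubes(k, bound=100):
    """Search all pairs (x, y) in [-bound, bound] for a matching integer z."""
    for x in range(-bound, bound + 1):
        for y in range(x, bound + 1):  # y >= x avoids duplicate pairs
            z = integer_cube_root(k - x ** 3 - y ** 3)
            if z is not None:
                return (x, y, z)
    return None


print(sum_of_three_cubes(29))  # finds a solution such as (1, 1, 3)
print(sum_of_three_cubes(33))  # None: the solution lies far outside this bound
```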
There is no known optimization method for finding three numbers that, when cubed, sum to 33 or 42, and no known method for gradually approximating a solution. Once the correct three integers have been found, it is easy to verify that they are, in fact, correct, but there are no partially correct solutions that could be refined, only solutions that are correct or incorrect. The best one can do is to guess at likely numbers. Andrew Booker, at the University of Bristol, was recently able to solve the problem for k = 33 by somewhat improving the methods used to guess potential solutions. His method reduced the number of integers
that needed to be searched by an estimated 20%, but even after this improvement,
his solution consumed 23 core-years of processing time. That is a substantial amount of effort for a
fairly trivial problem. According
to Booker, “I don’t think [finding solutions to the sum of three cubes
problems] are sufficiently interesting research goals in their own right to
justify large amounts of money to arbitrarily hog a supercomputer.”
Why this problem is interesting for artificial intelligence
The sum of three cubes problem resisted solution for over half a century. It is very easy to describe but difficult, or at least tedious, to solve. Understanding the difficulty posed by this kind of problem, and how it was eventually addressed, is, I think, important for understanding why general intelligence is hard and what can be done about it.
Current versions of machine learning can all be described in
terms of three sets of numbers. One set
of numbers maps properties of the physical world to numbers that can be used by
a computer. One maps the output of the
computer to properties of the physical world.
The third set of numbers represents the model that maps inputs to outputs. Machine learning consists of adjusting this
set of model numbers (using some optimization algorithm) to better approximate the
desired relation between inputs and outputs.
This kind of framework can learn to recognize speech, to compose novel music, and to play chess, go, or Jeopardy. In fact, some version of this approach can solve any problem that can be represented in this way.
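As a toy illustration of this framework, here is a sketch in Python in which the “third set of numbers” (a slope w and an intercept b, invented for the example) is adjusted by gradient descent. The essential property is that every small adjustment moves the output measurably closer to the target, which is exactly the property the sum of three cubes problem lacks.

```python
import random

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # numeric encoding of the inputs
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # desired outputs (here, y = 2x + 1)

w, b = random.random(), random.random()  # the model: two adjustable numbers
lr = 0.01                                # learning rate

for step in range(5000):
    for x, y in zip(xs, ys):
        err = (w * x + b) - y
        # Gradient descent: nudge each parameter so the next prediction
        # is slightly closer to the desired output.
        w -= lr * err * x
        b -= lr * err

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```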
But it is still the case that the success of these systems relies heavily on the ability of human designers to construct these three sets of numbers and to select the optimization algorithms that adjust the model. The sum of three cubes problem is not amenable to an optimization approach because there is no way to determine which changes to x, y, and z will bring a candidate closer to the desired solution. There is no way to define “closer.”
In 1965, I. J. Good raised the possibility of an
ultraintelligent computer system that would surpass
human intelligence:
“Let an ultraintelligent machine be defined as a machine
that can far surpass all the intellectual activities of any man however clever.
Since the design of machines is one of these intellectual activities, an
ultraintelligent machine could design even better machines; there would then
unquestionably be an 'intelligence explosion,' and the intelligence of man
would be left far behind. Thus the first ultraintelligent machine is the last
invention that man need ever make, provided that the machine is docile enough
to tell us how to keep it under control.”
Presumably, solving the sum of three cubes problem would
also be among the intellectual activities that such a machine would address,
since it continues to be a problem addressed by humans. This problem is conceptually much simpler
than designing intelligence programs, but it may be even less tractable.
Booker’s improved algorithm was not discovered
automatically. There is no algorithm
that we know of that can produce new algorithms like the one he produced. It took humans over 64 years to come up with
one even this good, despite fairly widespread interest in the problem. We do not know how Booker came up with the
insight leading to this new algorithm, nor do we know how to go about designing
a method that could do so predictably. General
intelligence will require computers to be able to generate new problem
representations and new algorithms to solve new problems, but we have little
idea of how to get there.
Even this new method faced a huge combinatoric challenge. There are simply too many combinations of three numbers that could be the solution to the problem. No matter how intelligent a system is, no amount of intelligence may ultimately be able to eliminate this combinatoric problem. If even the problem of finding three numbers can be combinatorially challenging, what will a general intelligence system face when trying to solve problems with even more variables? The time required to test a large number of integers and their cubes may be reduced, but it cannot be eliminated.
To this point, no one has come up with a computer system
that can design its own models. Deep
learning systems that are said to come up with their own representations actually
still work by adjusting parameters in a prestructured model. The transformations that occur within the
model (moving from one layer to the next) are determined by the architecture of
those layers. For example, a linear autoencoder layer does not learn an arbitrary representation of the data; it “learns” to perform principal component analysis, a well-known statistical technique. So far, someone still has to come up with the design of the network and the optimization methods used to solve the problems.
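Here is a short numpy sketch of that claim. Both PCA and the optimal linear autoencoder reduce to the top-k singular vectors of the centered data matrix (by the Eckart–Young theorem), so the two reconstructions coincide exactly. The data are random numbers invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                # center the data

k = 2                                  # size of the bottleneck layer
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# PCA: project onto the top-k principal directions and reconstruct.
components = Vt[:k]
X_pca = X @ components.T @ components

# The optimal linear autoencoder minimizes ||X - X W_e W_d||^2; its best
# rank-k reconstruction spans the same top-k subspace, so it is identical.
X_auto = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

print(np.allclose(X_pca, X_auto))      # True
```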
The sum of three cubes problem could be solved by simple brute
force if we were to allocate sufficient resources to its solution. With other kinds of problems even the space
in which to apply the brute-force search may be obscure. Some insight problems, for example, are
difficult until the solver finds the right representation, at which point they
are typically easy to solve. Like the
sum of three cubes problem, insight problems do not admit of partial solutions
that can be selected through optimization.
The key is to think of the problem in the right way; solving these problems requires a switch in how they are represented.
Here’s an insight problem whose solution may be familiar: Curt and Goldie are lying dead on a wet rug
in a locked room. The room is in an old
house near some railroad tracks. How did
they die?
Once you come up with the right model for this situation,
solving it is trivial, but the difficult part is often coming up with the right
representation. There are many other insight problems, but so far as I am aware, they have scarcely been studied by computer scientists. Yet the problem of coming up with good representations has been the very mechanism of progress in artificial intelligence. So far it has been done slowly and painstakingly by people.
There are many other problems that a system will have to address if it is ever to achieve general intelligence, let alone superintelligence. We may someday be able to create artificial general intelligence systems that can address these problems, but doing so will require a different computational approach than any we have available today.
Monday, February 18, 2019
The Singularity Called: Don't Wait Up
Dylan Azulay at emerj
has just published
another in a series of surveys that have been conducted over the last several
years by different groups about when the technological singularity is likely to
happen. The singularity is the idea that
computers will get so smart that their intelligence will grow explosively.
The notion of a technological singularity was initially proposed
by Vernor Vinge in
1993, expanding on some ideas from I. J. Good and John Von Neumann.
Good wrote:
“Let an ultraintelligent machine be defined as a machine
that can far surpass all the intellectual activities of any man however clever.
Since the design of machines is one of these intellectual activities, an
ultraintelligent machine could design even better machines; there would then
unquestionably be an "intelligence explosion," and the intelligence
of man would be left far behind. Thus the first ultraintelligent machine is the
last invention that man need ever make.”
Good, I. J. (1965). Speculations concerning the first ultraintelligent machine. In F. L. Alt & M. Rubinoff (Eds.), Advances in Computers (Vol. 6, pp. 31–88). Academic Press.
According to Vinge: “It's fair to call this event [the explosion
in machine intelligence] a singularity (‘the Singularity’ for the purposes of
this piece). It is a point where our old models must be discarded and a new
reality rules, a point that will loom vaster and vaster over human affairs
until the notion becomes a commonplace.”
The notion of the singularity combines the idea of artificial general intelligence with the idea that such a general intelligence will be able to grow at an exponential rate. General intelligence is a difficult enough problem, but it is solvable, I think. Contrary to the speculations of Good, Vinge, Bostrom, and others, however, it will not result in an intelligence explosion.
To understand why there will be no explosion, we can start
with the 18th Century philosophical conflict between Rationalism
and Empiricism. Simplifying
somewhat, the rationalist approach assumes that the way to understanding, that
is intelligence, lies principally in thinking about the world. The empiricist approach says that understanding
comes from the apprehension of facts gained through experience with the world. In order for there to be a singularity explosion, the rationalist position has to be completely correct, and the empiricist position has to be completely wrong, at least so far as computational intelligence is concerned. If all it took to achieve
explosive growth in intelligence was to think about it, then the singularity
would be possible, but it would leave a system lost in thought.
If understanding depends on gleaning facts from experience,
then a singularity is not possible because the rate at which facts become
available is not changed by increases in computational capacity. In reality, neither pure Rationalism nor pure Empiricism is sufficient, but if we view intelligence as including the ability to solve physical-world problems, not just virtual ones, then a singularity of the sort Vinge discussed is simply not possible.
Computers may, indeed, increase their intelligence over time, but well-designed machines and being good at designing them are not sufficient to cause an explosive expansion of intelligence.
Imagine, for example, that we could double computing capacity every few (pick one) months, days, or years. As time goes by, the growth curve becomes indistinguishable from vertical, and an explosion in computing capacity can be said to have occurred. If all the computer had to do
was to process symbols or mathematical values, then we might achieve a
technological singularity. The computer
would think faster and faster and faster and be able to process more
propositions more quickly. Intelligence,
in other words, would consist entirely of the formal problem of manipulating
symbols or mathematical objects. A
computer under these conditions could become superintelligent even if the entire universe around it somehow disappeared, because on that view it is the symbols that are important, not the world. But the world is important.
The board game go is conceptually very simple, but because
of the number of possible moves, winning the game is challenging. Go is a formal problem, meaning that one
could play go without actually using stones or a game board, just by
representing those parts symbolically or mathematically. It is the form of the problem, not its
instantiation in stones and boards that is important.
In fact, when AlphaGo played Lee
Sedol, its developers did not even bother to have the computer actually
place any stones on the board. Instead, the computer communicated its moves to
a person who placed the stones and recorded the opponent’s responses. It could have played just as well without a
person placing the stones because all it really did was manipulate symbols for
those stones and the board. The physical
properties of the stones and board played no role and contributed nothing to
its ability to play. The go board and stones were merely a convenience for the humans; they played no role in the operation of the computer.
AlphaGo was trained
in part by having two versions of the game play symbolically against one
another. With more computing power, it could play faster and thus, theoretically, learn faster. Learning to play go is the
perfect rationalist situation. Improvement
can be had just by thinking about it. No experience with a physical world is
needed. With enough computer power, its
ability to play go might be seen to “explode.”
But playing go is not a good model for general
intelligence. After playing these
virtual games, it knew more because of
its ability to think about the game, but intelligence in the world requires
different capabilities beyond those required to play go. Go is a formal, perfect information problem. The two players may find it challenging to
guess what the future state of the game will be following a succession of moves,
but there is no uncertainty about the current state of the game. The positions of the stones on the playing grid
are perfectly known by each player. The
available moves at any point in time are perfectly known and the consequences
of each move, at least the immediate consequences of that move are also perfectly
known. Learning to play consists entirely of learning to predict the future consequences of each potential move.
Self-driving
vehicles, in contrast, do not address a purely formal problem. Instead, their sensors provide incomplete, faulty information about the state of the vehicle and its surroundings. Although some progress can be made by learning to drive a simulated vehicle, there is no substitute for the feedback of driving a physical vehicle in a physical world. Learning to drive is not a purely rationalist process; rather, it depends strongly on the system’s empirical experience with its environment.
At least some of the problems faced by an artificial general intelligence system will be of this empiricist type. But a self-driving vehicle that computed twice as fast would not learn at twice the rate, because its learning depends on feedback from the world, and the world does not increase its speed of providing feedback, no matter how fast the computer is. This is one of the main reasons why there will be no intelligence explosion.
The world, not the computer, ultimately controls how fast it can
learn.
Most driving is mundane.
Nothing novel happens during most of the miles driven so there is nothing
new for the computer to learn. Unexpected
events (why simulation is not enough) occur with a frequency that is entirely
unrelated to the speed or capacity of the computer. There will be no explosion in the
capabilities of self-driving vehicles.
They may displace truck and taxi drivers, but they will not take over
the world, and they will not do it explosively.
There are other reasons why the singularity will be a
no-show. Here is just one of them. Expanding machine intelligence will surely
require some form of machine learning. At
its most basic, machine learning is simply a method of modifying the values of certain
parameters to find an optimal set of values that solve a problem. AlphaGo was capable of learning to play go
because the DeepMind team structured the computational problem in an important
new way. Self-driving cars became
possible because the teams competing in the second
DARPA grand challenge figured out a new way to represent the problem of
driving. Computers are great at finding
optimal parameter values, but so far, they have no capability at all for figuring
out how to structure problem representations so that they can be solved by
finding those parameter values.
Good assumed that “the design of machines is one of these
intellectual activities” just like those used to play go or drive, but he was
wrong. Structuring a problem so that a
computer can find its solution is a different kind of problem that cannot be
reduced to parameter value adjustment, at least
not in a timely way. Until we can
come up with appropriate methods to design solutions, artificial general
intelligence will not be possible. Albert
Einstein was not known as brilliant for his ability to solve well-posed problems,
rather he was renowned for his ability to design new approaches to solving certain
physics problems—new theories. Today’s computers are great at solving problems that someone has structured into equations, but none is yet able to create new structures. General intelligence requires this ability, and it may be achievable, but as long as general intelligence depends on empirical feedback, the chances of a technological singularity are nil.
Monday, October 22, 2018
Discriminate for fairness
As machine learning methods come to be more widely used,
there is a great deal of hand-wringing about whether they produce fair results. For example, ProPublica reported that a widely used program intended to assess the
likelihood of criminal recidivism, that is whether a person in custody would be
likely to commit an additional crime, tended to over-estimate the probability
that a black person would commit an additional crime and under-estimate whether
a white person would. Amazon was said to
have abandoned a machine
learning system that evaluated resumes for potential hires, because that
program under-estimated the likely success of women and therefore recommended against hiring them.
I don’t want to deny that these processes are biased, but I
do want to try to understand why they are biased and what we can do about it. The bias is not an inherent property of the
machine learning algorithms, and we would not find its source by investigating
the algorithms that go into them.
The usual explanation is that the systems are trained on the
“wrong” data and merely perpetuate the biases of the past. If they were trained on unbiased
data, the explanation goes, they would achieve less biased results. Bias in the training data surely plays a
role, but I don’t think that it is the primary explanation for the bias.
Instead, it appears that the bias comes substantially from how
we approach the notion of fairness itself.
We assess fairness as if it were some property that should emerge
automatically, rather than a process that must be designed in.
What do we mean by fairness?
In the ProPublica analysis of recidivism, the unfairness derived largely from the fact that when errors
are made, they tend to be in one direction for black defendants and in the other
direction for white defendants. This
bias means that black defendants are denied bail when they really do not
present a risk, and white defendants are given bail when they really should
remain in custody. That bias seems to be
inherently unfair, but the race of the defendant is not even considered explicitly
by the program that makes this prediction.
In the case of programs like the Amazon hiring recommendation
system, fairness would seem to imply that women and men with similar histories
be recommended for hiring at similar rates.
But again, the gender of the applicant is not among the factors
considered explicitly by the hiring system.
Race and gender are protected factors under US law (e.g., Title VII of the Civil
Rights Act of 1964). The law states
that “It shall be an unlawful employment practice for an employer … to
discriminate against any individual with respect to his compensation, terms,
conditions, or privileges of employment, because of such individual’s race,
color, religion, sex, or national origin.”
Although the recidivism system does not include race
explicitly in its assessment, it does include such factors as whether the
defendant has any family
members who have ever been arrested, whether they have financial resources,
etc. As I understand it, practically every
black person who might come before the court is likely to have at least one
family member who has been arrested, but that is less often true for whites. Black people are more likely than whites to
be arrested,
and once arrested, they are more likely than whites to be convicted and
incarcerated. Relative to their
proportion in the population, they are substantially
over-represented in the US prison system compared to whites. These correlations may be the result of other
biases, such as racism in the US, but they are not likely to be the result of any
intentional bias being inserted into the recidivism machine learning system. Black defendants are substantially more likely to be evaluated by the recidivism system and were more likely to be included in its training set because of these same factors. I don’t believe that anyone set out to make any of these systems biased.
The resumes written by men and women are often
different. Women tend to have more
interruptions in their work history; they tend to be less assertive about
seeking promotions; they use different language than men to talk about their
accomplishments. These tendencies, associated with gender, are available to the system, even without any desire to impose a bias on the results. Men are
more likely to be considered for technical jobs at Amazon because they are more
likely to apply for them. Male resumes
are also more likely to be used in the training set, because historically, men
have filled a large majority of the technical jobs at Amazon.
One reason to be skeptical that imbalances in the training
set are sufficient to explain the bias of these systems is that machine
learning systems do not always learn what their designers think that they will
learn. Machine learning works by
adjusting internal parameters (for example the weights of a neural network) to
best realize a “mapping” from the inputs on which it is trained to the goal states that are set for it. If the system is
trained to recognize cat photos versus photos of other things, it will adjust
its internal parameters to most accurately achieve that result. The system is shown a lot of labeled pictures,
some of which contain cats, and some of which do not. Modern machine learning systems are quite
capable of learning distinctions like this, but there is no guarantee that they
learn the same features that a person would learn.
For example, even given many thousands of training examples for classifying photographs, a deep neural network can still be “duped” into classifying a photo of a panda as a photo of a gibbon, even though both photos look to the human eye very much like a panda and not at all like a gibbon. All it took to cause this system to misclassify the photo was to add a small amount of apparently random visual noise to the photograph. The misclassification of the picture when noise was added implies that the system learned features, in this case pixels, that were disrupted by the noise, not the features that a human would use.
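The attack described here is the “fast gradient sign method” from the paper by Goodfellow and colleagues that introduced the panda example. Below is a minimal PyTorch sketch of the idea using an untrained toy network (the architecture and sizes are illustrative, not the actual classifier): each pixel is nudged slightly in exactly the direction that increases the model’s loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# An untrained stand-in for an image classifier (illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

image = torch.rand(1, 3, 32, 32, requires_grad=True)
label = torch.tensor([4])  # the "correct" class for this toy example

loss = nn.functional.cross_entropy(model(image), label)
loss.backward()

# Nudge every pixel a tiny amount in the direction that increases the
# loss. To a person the change looks like faint noise, but it is
# precisely aligned with the features the model actually relies on.
epsilon = 0.05
adversarial = (image + epsilon * image.grad.sign()).detach()

with torch.no_grad():
    before = nn.functional.cross_entropy(model(image), label).item()
    after = nn.functional.cross_entropy(model(adversarial), label).item()
print(before, after)  # the loss on the perturbed image is reliably higher
```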
The recidivism and hiring systems, similarly, can learn to
make quite accurate predictions without having to consider the same factors
that a human might. People find some
features more important than others when classifying pictures. Computers are free to choose whatever
features will allow correct performance, whether a human would find them
important or not.
In many cases, the features that such a system identifies are also applicable to other examples that it has not seen, but there is often a decrease in accuracy when a well-trained machine learning system is actually deployed by a business and applied to items (e.g., resumes) that were not drawn from the same group as the training set. The
bigger point is that for machine learning systems, the details can be more
important than the overall gist and the details may be associated with the
unfairness.
Simpson’s paradox and unfairness
A phenomenon related to this bias is called Simpson’s paradox, and
one of the most commonly cited examples of this so-called paradox concerns the appearance
of bias in the acceptance rate of men versus women to the University of
California graduate school.
The admission figures for the Berkeley campus for 1973
showed that 8442 men applied, of which 44% were accepted, and 4321 women applied,
of which only 35% were accepted. The
difference between 44% and 35% acceptance is substantial and could be a
violation of Title VII.
The difference in proportions would seem to indicate that the
admission process was unfairly biased toward men. But when the departments were considered
individually, the results looked much different. Graduate admission decisions are made by the
individual departments, such as English, or Psychology. The graduate school may administer the
process, but it plays no role in deciding who gets in. On deeper analysis it was found (Bickel, Hammel, & O’Connell, 1975) that six of the 85 departments showed a small bias toward admitting women and only four showed a small bias toward admitting men. Although the overall acceptance rate for women was substantially lower than for men, individual departments were slightly more likely to favor women than men. This is the apparent paradox: the departments are not biased against women, but the overall performance of the graduate school seems to be.
Rather, according to Bickel and associates, the apparent
mismatch derived from the fact that women applied to different departments on
average than the men did. Women were
more likely to apply to departments that had more competition for their
available slots and men were more likely to apply to departments that had relatively
more slots per applicant. In those days, the “hard” sciences attracted more
male applicants than female, but they were also better supported with teaching
assistantships and so on than the humanities departments that women were more
likely to apply to. Men applied on average to departments with high rates of
admission and women tended to apply to departments with low rates. The bias in admissions was apparently not
caused by the graduate school, but by the prior histories of the women, which
biased them away from the hard sciences and toward the humanities.
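The arithmetic behind the paradox is easy to reproduce. Here is a small Python illustration using invented numbers (not the actual Berkeley figures): each department admits women at a higher rate than men, yet the aggregate rate favors men, because women mostly applied to the more competitive department.

```python
# department: (men applied, men admitted, women applied, women admitted)
# Invented numbers, chosen so that each department favors women.
departments = {
    "easy": (800, 480, 100, 70),   # men 60%, women 70% admitted
    "hard": (200, 20, 900, 135),   # men 10%, women 15% admitted
}

totals = [0, 0, 0, 0]
for name, counts in departments.items():
    ma, md, wa, wd = counts
    print(f"{name}: men {md / ma:.0%}, women {wd / wa:.0%}")
    totals = [t + c for t, c in zip(totals, counts)]

ma, md, wa, wd = totals
# Aggregation reverses the pattern: men 50%, women about 20% overall,
# because women mostly applied to the more competitive department.
print(f"overall: men {md / ma:.0%}, women {wd / wa:.0%}")
```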
A lot has been written about Simpson’s paradox and
even whether it is a paradox at all. The
Berkeley admissions study as well as the gender bias and recidivism bias can
all be explained by the correlation between a factor of interest (gender or
race) and some other variable. Graduate
applications were correlated with patterns of department selection, gender bias
in resume analysis is correlated with such factors as work history, language
used to describe work, and so on. Recidivism
predictors are correlated with race. Although these examples all show large discrepancies in the sizes of the two groups of interest (many more men applied to graduate school, many more of the defendants being considered were black rather than white, and many more of the Amazon applicants were men), these differences will not disappear if all we do is add training examples.
These systems are considered unfair, presumably because we
do not think that gender or race should play a causal role in whether people
are admitted, hired, or denied bail (e.g., Title VII). Yet, gender and race are apparently
correlated with factors that do affect these decisions. Statisticians call these correlated variables
confounding variables. The way to remove
them from the prediction is to treat them separately (hold them fixed). If the ability to predict recidivism is still
accurate when considering just blacks or just whites, then it may have some
value. If hiring evaluations are made for
men and women separately, then there can be no unintentional bias. Differences between men and women then cannot explain or cause the bias, because that factor is held constant for any prediction within a gender. Individual women do not differ from women in general in gender-related characteristics, and so those characteristics cannot contribute to a hiring bias toward men.
We detect unfairness by ignoring a characteristic, for example,
race or gender, during the training process and then examining it during a
subsequent evaluation process. In
machine learning, that is often a recipe for disaster. Ignoring a feature during training means that
that feature is uncontrolled in the result.
As a result, it would be surprising if the computer were able to produce
fair results.
Hiring managers may or may not be able to ignore gender. The evidence is pretty clear that they cannot
really do it, but the US law requires that they do. In an attempt to make these programs consistent
with laws like Title VII, their designers have explicitly avoided including
gender or race among the factors that are considered. In reality, however, gender and race are
still functionally present in the factors that correlate with them. Putting a man’s name on a woman’s resume does not make it a male resume, but including questions about the number of a defendant’s siblings who have been arrested does provide information about the person’s race. The system can learn about these protected characteristics indirectly. What really causes the bias, I think, is that these factors are not included as part of the system’s goals.
If fairness is really a goal of our machine
learning system, then it should be included as a criterion by which the success
of the system is judged. Program designers
leave these factors out of the evaluation because they mistakenly (in my
opinion) believe that the law requires them to leave them out, but machines are
unlikely to learn about them unless they are included. I am not a lawyer, but I believe that the law
concerns the outcome of the process, not the means by which that outcome is
achieved. If these factors are left out
of the training evaluation, then any resemblance of a machine learning process
to a fair one is entirely coincidental.
By explicitly evaluating for fairness, fairness can be achieved. That is
what I think is missing from these processes.
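As a sketch of what evaluating for fairness explicitly might look like, the following Python fragment (all data invented) audits a hiring model’s errors separately by gender, even though gender was never one of the model’s input features. Per-group error rates like these can then be compared, and a penalty based on their difference can be added to the objective that the learning process optimizes.

```python
from collections import defaultdict

# (model recommended hiring, candidate actually succeeded, gender)
outcomes = [
    (1, 1, "F"), (0, 1, "F"), (0, 0, "F"), (0, 1, "F"),
    (1, 1, "M"), (1, 0, "M"), (1, 1, "M"), (0, 0, "M"),
]

stats = defaultdict(lambda: {"n": 0, "missed": 0, "false_alarm": 0})
for recommended, succeeded, gender in outcomes:
    group = stats[gender]
    group["n"] += 1
    if succeeded and not recommended:
        group["missed"] += 1        # qualified candidate rejected
    elif recommended and not succeeded:
        group["false_alarm"] += 1   # unqualified candidate recommended

for gender, g in sorted(stats.items()):
    print(f"{gender}: miss rate {g['missed'] / g['n']:.0%}, "
          f"false-alarm rate {g['false_alarm'] / g['n']:.0%}")
# F: miss rate 50%, false-alarm rate 0%
# M: miss rate 0%, false-alarm rate 25%
# Errors running in opposite directions for the two groups is the same
# asymmetry ProPublica reported; it is only visible when the protected
# attribute is examined explicitly.
```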
The goals of machine learning need not be limited to just the
accuracy of a judgment. Other criteria, including fairness, can be part of the objective for which the machine learning process is being optimized. The same
kind of approach of explicitly treating factors that must be treated fairly can
be used in other areas where fairness is a concern, including mapping of voting
districts (gerrymandering), college admissions, and grant allocations. Fairness can be achieved by discriminating
among the factors that we use to assess fairness and including these factors directly
and explicitly in our models. By
discriminating we are much more likely to achieve fairness than by leaving
these factors to chance in a world where factors are not actually independent
of one another.
Monday, June 18, 2018
Welcome to dead salmon data
The name for this blog was inspired by a paper by Craig M.
Bennett, Abigail A. Baird, Michael B. Miller, and George L. Wolford. They
studied how a dead salmon placed in a functional magnetic resonance imaging (fMRI)
machine “responded” to pictures of people in emotional situations. The salmon was not entirely forthcoming with
its judgments, but Bennett and his colleagues did record some of the dead salmon’s
“brain activity.”
fMRI images are organized into tiny cubes, called “voxels.” In flat, 2-D images, each tiny
region in the image is represented by a pixel.
With the 3-D images derived using fMRI, each small 3-D region is called a
voxel. Just as the signal in a pixel represents
how much light there was at that small region, the fMRI voxel represents how much biological
activity there was in the corresponding 3-D region. In living brains, that
activity usually represents the level of neural activity in the region. Of the available voxels, Bennett and his
colleagues found significant activity in 16 voxels, about 0.01% of the total
volume.
Based on these findings, and using the statistical analyses that were commonly used at the time, the authors might have concluded that they had identified the part of the fish’s brain responsible for making judgments about the emotional state of human beings depicted in photographs. Before Bennett’s study was available, about 25–30% of fMRI studies published in journals used the same simple techniques, as did about 80% of the papers presented at a neuroscience conference.
Obviously, there is something wrong here. The idea of a dead fish responding to human
emotion does not pass the “smell test.” It
would be remarkable enough for a live fish to identify human emotion, but it is
utterly beyond reason to think that a dead one could, or that the authors were
measuring brain activity in a dead fish. On top of that, we have no evidence
that the fish actually engaged in the task it was assigned, as it never
revealed its decisions about the emotional valence of the pictures it “observed.”
The measures of brain activity obtained using functional magnetic resonance imaging, like most measures of just about any phenomenon, include some random variation in the measured quantity. Standard statistical measures were designed to take this random variability into account. Two measures may be numerically different without being reliably different from one another when both include some level of randomness. Statisticians developed the notion of statistical significance to indicate that an observed numerical difference is unlikely to have occurred by chance.
But these assessments of statistical significance also
involve some degree of random variability. Statisticians recognize that there
are two kinds of errors that can occur. Type I errors occur when we find a difference where there is no actual difference (a “false positive”); Type II errors occur when we fail to find a difference when there actually is one (a “false negative”).
fMRI studies compare the measured activity in each voxel during the task with its measured level of activity during a control interval. The more brain activity there is, the stronger
the signal measured by the fMRI machine. If we were to compare the activity
levels of 130,000 voxels under these two conditions, some of them would show
high levels of activity just by chance because our measure is imperfect. We would expect some percentage of these
comparisons to result in false positive errors.
The specific percentage we would expect is determined by our standard of
significance, often called the “p value.” It was pretty common in fMRI studies to
set this value to 0.001, meaning that we would expect about 130 of the 130,000
voxels to show what appears to be a real difference, when there really is nothing
but random measurement error. Multiple comparisons raise opportunities for multiple
false positive errors.
There are statistical techniques for dealing with large
numbers of comparisons. In summary, these
methods modify the significance standard so that differences relative to
control have to be greater before one accepts that they are due to anything but
chance. When those adjustments are
applied to the dead salmon fMRI data, it turns out that there are no
significant areas of brain activity in the fish, as we really should expect. When the same techniques were applied by
Bennett to 11 published fMRI studies, three of them had no significant results
to report at all. If these corrections
for multiple comparisons are not properly applied, researchers can come to
false conclusions about the validity of their data, which can lead to wasted
research effort and dead ends.
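A quick simulation (a sketch of my own, using numpy and scipy rather than any particular fMRI package) shows both the problem and the repair: 130,000 “voxels” of pure noise are each tested against zero, roughly 130 of them pass p < 0.001 by chance alone, and essentially none survive a Bonferroni adjustment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_voxels, n_scans = 130_000, 20

noise = rng.normal(size=(n_voxels, n_scans))   # no real signal anywhere
t, p = stats.ttest_1samp(noise, 0.0, axis=1)   # one t-test per "voxel"

alpha = 0.001
print("uncorrected hits:", np.sum(p < alpha))            # ~130 false positives
print("Bonferroni hits:", np.sum(p < alpha / n_voxels))  # almost surely 0
```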
The importance of these findings extends far beyond one dead
salmon, far beyond fMRI. Spurious
correlations are common throughout big-data data science. For example, annual US spending on space, science, and technology correlates over the years 1999 to 2009 with the annual incidence of suicides by hanging, strangulation, and suffocation (http://www.tylervigen.com/spurious-correlations). If there is some kind of causal relationship here, it completely escapes me. If you have enough data and look hard enough, you can find any number of correlations, but that is no guarantee that those correlations are at all meaningful.
People have a way of finding patterns, even when there are none. Sometimes the statistics help.
Traditional statistics were developed for a time when the expensive
part of research was collecting data.
Now data are plentiful, but extracting meaning from those data is still
a challenge. Simple methods are often
the best, but they can also be too simple. In later posts we will consider
other potential hazards of data science and what we can do to avoid them. Unless we are careful with our data and our
analyses, it is easy to get seduced into believing that we understand something
when really we do not. On the other hand, it is also easy to be seduced by approaches and methods that are popular but may not be the best fit for the questions at hand.
In future posts, I hope to consider many of the ways in which data science, machine learning, and artificial intelligence can go wrong, but also some of the ways they can go right.
Data science is of growing importance to our businesses and even to our
culture, but it is also easily misunderstood. In the same way that we see animals
and faces in clouds, we may attribute to data science, particularly to
artificial intelligence, properties that are not really there. I hope to address some of these tendencies and
make data science stronger and more useful as a result.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2009). Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for proper multiple comparisons correction. Poster presented at the 15th Annual Meeting of the Organization for Human Brain Mapping, San Francisco, CA.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for multiple comparisons correction. Journal of Serendipitous and Unexpected Results.