The name for this blog was inspired by a paper by Craig M.
Bennett, Abigail A. Baird, Michael B. Miller, and George L. Wolford. They
studied how a dead salmon placed in a functional magnetic resonance imaging (fMRI)
machine “responded” to pictures of people in emotional situations. The salmon was not entirely forthcoming with
its judgments, but Bennett and his colleagues did record some of the dead salmon’s
“brain activity.”
fMRI images are organized into tiny cubes, called “voxels.” In flat, 2-D images, each tiny
region in the image is represented by a pixel.
With the 3-D images derived using fMRI, each small 3-D region is called a
voxel. Just as the signal in a pixel represents
how much light there was at that small region, the fMRI voxel represents how much biological
activity there was in the corresponding 3-D region. In living brains, that
activity usually represents the level of neural activity in the region. Of the available voxels, Bennett and his
colleagues found significant activity in 16 voxels, about 0.01% of the total
volume.
Based on these findings and using statistical analyses that
were commonly used at the time, the authors might have concluded that they had
identified the part of the fish’s brain that was responsible for making
judgments about the emotional state of human beings depicted in
photographs. Before Bennett’s study was available, about 25–30% of fMRI studies published
in journals used the same simple techniques, as did about 80% of the papers
presented at a neuroscience conference.
Obviously, there is something wrong here. The idea of a dead fish responding to human
emotion does not pass the “smell test.” It
would be remarkable enough for a live fish to identify human emotion, but it is
utterly beyond reason to think that a dead one could, or that the authors were
measuring brain activity in a dead fish. On top of that, we have no evidence
that the fish actually engaged in the task it was assigned, as it never
revealed its decisions about the emotional valence of the pictures it “observed.”
The measures of brain activity obtained using functional magnetic
resonance imaging, like measures of just about any phenomenon, include
some random variation. Standard statistical
measures were designed to take this random variability into account. Two measures may be numerically different
without being reliably different from one another when both include
some level of randomness. Statisticians developed the notion of statistical
significance to indicate that an observed numerical difference is unlikely to
have occurred by chance.
But these assessments of statistical significance also
involve some degree of random variability. Statisticians recognize that there
are two kinds of errors that can occur.
Type I errors occur when we find a difference where there is no actual
difference (a “false positive”). Type II errors occur when we fail to find a
difference when there actually is one (a “false negative”).
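To make the first error type concrete, here is a minimal simulation using only Python’s standard library (the sample sizes, seed, and the simple z-test are illustrative choices of mine, not anything from Bennett’s analysis). Both samples are drawn from the same distribution, so any “significant” difference the test reports is a Type I error by construction:

```python
import random
import statistics

random.seed(0)

ALPHA = 0.05   # significance threshold
Z_CRIT = 1.96  # two-sided critical value for alpha = 0.05
N = 30         # observations per sample
TRIALS = 2000

false_positives = 0
for _ in range(TRIALS):
    # Both samples come from the SAME distribution, so there is no real
    # difference to find; any rejection is a false positive.
    a = [random.gauss(0, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]
    se = (statistics.variance(a) / N + statistics.variance(b) / N) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(z) > Z_CRIT:
        false_positives += 1

rate = false_positives / TRIALS
print(f"False positive rate: {rate:.3f}")
```

With a 0.05 threshold, roughly 5% of the trials come out “significant” even though nothing real is there, which is exactly what the threshold promises.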
fMRI studies compare the measured activity in each voxel during
the task with its measured level of activity during a control interval. The more brain activity there is, the stronger
the signal measured by the fMRI machine. If we were to compare the activity
levels of 130,000 voxels under these two conditions, some of them would show
high levels of activity just by chance because our measure is imperfect. We would expect some percentage of these
comparisons to result in false positive errors.
The specific percentage we would expect is determined by our significance
threshold, the largest “p value” we will accept as evidence of a real difference. It was pretty common in fMRI studies to
set this threshold to 0.001, meaning that we would expect about 130 of the 130,000
voxels to show what appears to be a real difference when there is really nothing
but random measurement error. Multiple comparisons create opportunities for multiple
false positive errors.
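The arithmetic behind that 130 can be checked with a quick simulation, a sketch that assumes every voxel is independent and truly inactive, in which case each voxel’s p value is uniformly distributed on [0, 1]:

```python
import random

random.seed(1)

N_VOXELS = 130_000
ALPHA = 0.001  # per-voxel significance threshold

# Under the null hypothesis (no real activity anywhere), each voxel's
# p-value is uniform on [0, 1].
p_values = [random.random() for _ in range(N_VOXELS)]

false_positives = sum(p < ALPHA for p in p_values)
print(f"Expected: {N_VOXELS * ALPHA:.0f}, observed: {false_positives}")
```

The observed count lands close to 130 every time: enough “active” voxels to build a story around, with no signal anywhere in the data.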
There are statistical techniques for dealing with large
numbers of comparisons. In essence, these
methods tighten the significance standard so that differences relative to
control have to be greater before one accepts that they are due to anything but
chance. When those adjustments are
applied to the dead salmon fMRI data, it turns out that there are no
significant areas of brain activity in the fish, as we really should expect. When the same techniques were applied by
Bennett to 11 published fMRI studies, three of them had no significant results
to report at all. If these corrections
for multiple comparisons are not properly applied, researchers can come to
false conclusions about the validity of their data, which can lead to wasted
research effort and dead ends.
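The simplest such adjustment is the Bonferroni correction, which divides the significance threshold by the number of comparisons. (Bennett and his colleagues discuss more sophisticated family-wise error and false discovery rate corrections; the sketch below, with made-up p values, only illustrates the idea.)

```python
# Hypothetical per-voxel p-values, all below the uncorrected 0.001 cutoff.
p_values = [0.0002, 0.0005, 0.0007, 0.0009]

ALPHA = 0.001
N_COMPARISONS = 130_000

# Bonferroni: divide the threshold by the number of comparisons, so the
# chance of even ONE false positive across ALL voxels stays near ALPHA.
threshold = ALPHA / N_COMPARISONS  # about 7.7e-9

survivors = [p for p in p_values if p < threshold]
print(f"significant before correction: {len(p_values)}, after: {len(survivors)}")
# prints "significant before correction: 4, after: 0"
```

Just as with the salmon, results that looked impressive under the per-voxel threshold vanish once the number of comparisons is taken into account.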
The importance of these findings extends far beyond one dead
salmon, far beyond fMRI. Spurious
correlations are common throughout big-data data science. For example, annual US spending on space,
science, and technology correlates, over the years 1999 to 2009, with the annual
incidence of suicides by hanging, strangulation, and suffocation
(http://www.tylervigen.com/spurious-correlations). If there is some kind of causal relationship
here, it completely escapes me. If you have enough data and look hard enough
you can find any number of correlations, but that is no guarantee that those
correlations are at all meaningful.
People have a way of finding patterns, even when there are none. Sometimes the statistics help.
Traditional statistics were developed for a time when the expensive
part of research was collecting data.
Now data are plentiful, but extracting meaning from those data is still
a challenge. Simple methods are often
the best, but they can also be too simple. In later posts we will consider
other potential hazards of data science and what we can do to avoid them. Unless we are careful with our data and our
analyses, it is easy to get seduced into believing that we understand something
when really we do not. On the other
hand, it is also easy to get seduced by approaches and methods that are
popular, but may not be the best approach for the questions at hand.
In future posts, I hope to consider many of the ways in
which data science, machine learning, and artificial intelligence can go wrong,
but also some of the ways they can go right.
Data science is of growing importance to our businesses and even to our
culture, but it is also easily misunderstood. In the same way that we see animals
and faces in clouds, we may attribute to data science, particularly to
artificial intelligence, properties that are not really there. I hope to address some of these tendencies and
make data science stronger and more useful as a result.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2009). Neural correlates of
interspecies perspective taking in the post-mortem Atlantic salmon: An argument for proper
multiple comparisons correction. Presented at the 15th Annual Meeting of the Organization
for Human Brain Mapping, San Francisco, CA.
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2010). Neural correlates of
interspecies perspective taking in the post-mortem Atlantic salmon: An argument for
multiple comparisons correction. Journal of Serendipitous and Unexpected Results.