Statistics and Ecologists Today: More from the “Emperor Has No Clothes Chronicles”
In my opinion, the question most often asked
by ecology practitioners today is “what statistical method should I use in
my study?” Why? Because whether you
are just beginning in ecology, e.g., a first-year graduate
student preparing your proposal, or you are further on and thinking about
publishing in a peer-reviewed journal, the pressure is on to understand and
decide upon your statistical approach. It
is a modern paradigm that statistics are fundamental in ecology, i.e., most
likely your supervisor and your journal editors/reviewers will demand
statistical analyses be included in your publication. The question of the method to use is both good
and bad. One of the paradigm’s great
outcomes is the training of ecologists to design effective and meaningful
studies (see also Prof. Krebs’s thoughts: https://www.zoology.ubc.ca/~krebs/ecological_rants/on-defining-a-statistical-population/#comments). However, one massive failure is the mushrooming
of complex statistical approaches and easy software packages that are neither as
effective nor as meaningful as some ecologists wish them to be.
The Covid-19 lock-down let me catch up on
some statistics papers I’d tucked into my “to-read” folder. My younger colleagues are very bright and doing
very complex statistical analyses that aren’t easy to understand. I’m also reviewing
work using these techniques for journals and funding agencies, so I wanted to
invest time exploring, and hopefully learning more about, these emerging
statistical approaches and applications.
There are many excellent papers describing methods and
applications that are clearly written by intelligent people who have spent time
thinking about statistical approaches in ecology, and more generally biology. I’ve been reading about AAN, AIC, Bayes, CV-R2,
GAM, GLMM, LLM, PLS, RDA, and RF among others.
My first conclusion is that there is a
direct, positive correlation between the abundance and complexity of data arising
from emerging sampling tools, e.g., remotely sensed data in my world, and the
abundance and complexity of statistical analyses, e.g., Lortie et al. (2020). I posit that the correlation began about the
time that SAS and its 2000-page manuals hit our desks in the 1980s (SAS 1989). Here is a great quote that summarizes
statistics in biology more broadly today: “The suite of statistical tools available to
biologists and the complexity of biological data analyses have grown in tandem…The
availability of novel and sophisticated statistical techniques means we are
better equipped than ever to extract signal from noisy biological data… [statistical]
models are powerful yet complex tools.” (Harrison et al. 2018). The quote is true regarding today’s much larger
and more complex data sets and the complex statistical analyses, analytical
approaches, and packages; however, the phrase “noisy biological data” glosses
over the fact that it is the fundamental nature of biology to be messy and stay
messy. My second conclusion is that if
you want to explain the fiery heat in your spicy chili, then trying to count and
find patterns among the chili molecules isn’t a great investment of your time:
“Black holes are simpler…But even if those equations could be solved for
immense aggregates of atoms, they wouldn’t offer the enlightenment that
scientists seek.” (https://aeon.co/ideas/black-holes-are-simpler-than-forests-and-science-has-its-limits).
It is the inherent nature of living things
to be and stay messy. All biological
systems must have continuous variability and opportunities, random or not, to
break moulds; otherwise selection for survival in a given environment can’t
occur and life and lineages end. This is
evolution and, broadly, natural selection with some mutations thrown in along
the way. It is this dynamic variability
of living systems that jams up biology as it tries to fit into classical,
physics-based definitions of the natural world (e.g., Egler 1986; Pigliucci
2002). Biology has one law for certain (for
now at least): there will be lots of flux
and variability and occasional, often unpredictable, mutations. Stability is a major discussion point in
ecology, but it is a temporal illusion because if you wait a few or 1,000,000
years, change will happen.
The rise of numeracy in biology is a great
thing and there is no arguing that numbers are important and useful, especially
in ecology. The problem is that
mathematics is bounded and, as a consequence, it doesn’t always get along with the
especially “noisy” data of ecology. Counting,
measuring, and summarizing are cornerstones of ecology. Correlations and relations between measured
factors and comparing groups are all very useful for developing an understanding
of ecological systems. The clash comes
when ecologists, seeing their complex data, become enthralled by the complex number-busters
of mathematics, e.g., statistics.
The clash occurs because 99% of
mathematicians don’t understand that 99% of the rest of the world doesn’t get
math. Then add to this the rise of the
machines. Today there isn’t much in the
way of statistical analyses that anyone with a few moments of ‘online help’
can’t do, e.g., the rise of “R” (Lortie et al. 2020; and a useful
overview is
https://blog.eduonix.com/software-development/rise-r-programming-language-usefulness-data-science/). The story-line has been as follows: there is very
beautiful mathematics (more on this later), it is translated through a machine with
the virtual pressing of a button and with instant results, there is a growing throng
of intelligent “applied statistics” crusaders, and voilà, we have the perfect
recipe for disaster. The crusaders are
smart people, know there is a math issue, and sincerely also advise ecologists
to ‘consult a statistician’. But this is
one of the disconnects: normal people (the 99%) don’t understand that mathematicians
can’t conceive of a system not bounded by equations, and they are very happy
that anyone is interested in mathematics and will dive into any math problem
presented.
Taking a few steps back, I’m a math nerd
and have been since I was 8 years old and calculating the least expensive set
of groceries on a hand-held, pocket counter while wheeling the cart through the
store. I went to university to study
mathematics, did two years in Canada’s elite mathematics programme, realized
that math had other uses (leading to my mostly ecology career), and 30 years
later I now teach statistics to undergraduate and graduate students in the
environmental sciences. I have been in the
community of mathematicians and, while I am generalizing about them for
literary purposes (apologies to my math friends), I live at the mathematics-ecology
nexus and I have happily added to the mathematics-ecology mash-up during my
career, especially early on.
Math is cool even if 99% of us don’t get
it. Watching your hard work and collected
data become a statistically significant regression or show statistically
significant differences supporting your original hypothesis is a powerful
moment, especially early in your career.
Turning your very large and complex, interconnected data set into a
2-dimensional principal component space is amazing. These analyses, among many math applications,
have moved ecology far beyond counting and measuring, and the math can be very
informative for advancing our understanding of complex systems. Math gives ecologists many useful tools. But, and there is a big but, mathematics
has rules and ecological systems flout every one of those rules. Ecology, i.e., the study of living things and
their environments, is inherently variable across space and time, within and among
individuals, families, groups, populations, species, communities, ecosystems,
and at many more levels we can’t yet comprehend. The biological and environmental information
collected today will vary later today, be different tomorrow, and so on. Natural systems want to change and have to
change, but math needs stability and it has boundaries.
I describe our current situation as statistics
running amok over ecology at ecology’s own beckoning. The math is beautiful and the people applying
it and creating software to perform the complex computations are more
intelligent than I, but the ever-growing bird’s nests of statistical analyses
are mostly unnecessary, as other people more intelligent than I have pointed out
(see, for example, Murtaugh 2007 and Amrhein et al. 2019). Imagine I need to get from my home on the
east coast of Canada to the west coast some 5,400 km away. I used to drive a 1976 Chevy Nova which was
the most standard car design on the road for a couple of decades. I could successfully achieve my goal by
driving that car across Canada with some simple assumptions that hold true: I
have a paper map and there is gas and a mechanic who can fix any problem in
every town. Alternatively, I could drive
a somewhat hypothetical, but close-to-reality, automobile of today that is self-driving,
GPS-linked and GPS-controlled, electrically powered, and so on. My assumptions are that self-driving is
possible on all roads, none of my computer systems fail, the GPS satellites
are detectable, Siri isn’t leading me astray, the computers that run the remote
things don’t fail, and so on. I also assume
I can get these things fixed, but anyone who drives a lesser-imagined car today
knows you can’t get it fixed unless you are at a big city dealer with a service
computer plug-in for your car. My
analogy in statistical language: a simple t-test may not look pretty in today’s
psychedelic statistical landscape, but it achieves the same result.
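To make the Chevy Nova end of the analogy concrete, here is a minimal sketch of a two-sample t-test (Welch's version, which does not assume equal variances) in Python, using only the standard library. The function name welch_t and the plot counts are hypothetical, made up purely for illustration; they do not come from any study mentioned here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical counts per plot (assumed values, not real data).
control = [12.0, 15.0, 11.0, 14.0, 13.0]
treated = [18.0, 21.0, 17.0, 20.0, 19.0]

t_stat, dof = welch_t(control, treated)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With groups this clearly separated, a plot of the raw values would likely tell the same story before any test is run.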
I’m also sensing a negative impact on the advancement
of ecology because we get distracted creating and promoting more and more
complex statistical analyses and software.
It is an ever-deepening rabbit hole because our inherently complex ecological
systems are far beyond our current ability to comprehend, and creating more
complex statistical models and computational processes will never advance our
understanding of the original question in ecology. Trying to extract a signal we won’t recognize
from guaranteed-to-be-increasingly-complex and noisy data is a never-ending
do-loop.
My take home message is what I try to
instill in my young learners in the environmental sciences: “If your experiment
needs statistics, you ought to have done a better experiment.” The statement is most often attributed to E.
Rutherford, and I explain that he (or whoever) wasn’t slamming statistics but was appealing
for better experimental design. I follow
with Curry’s Corollary: “If you need statistics to tell something is
significant, then it is not significant.”
This is about the natural variability you will face, that there is no
magic bullet to overcome it, and that a clear question and good sampling design
are the foundation you need to find, or get close to, a solution, including which,
if any, statistical approach you choose to use in your studies. Statistics is just a tool, like many that we
use in the environmental sciences. It has
value, but it remains just one tool in your toolbox. I also emphasize ad nauseam, including
with journal editors, that a far better tool is a well-thought-out figure
showing an effect/no effect.
I encourage and teach the use of
mathematics, especially statistics, in all the environmental sciences. These disciplines should be very grateful
because statistics’ greatest gift has been teaching us to think about our
questions, sampling design, and interpretation and presentation of data (there
are many useful guides, e.g., Kass et al. 2016; Zuur and Ieno 2016).
A final note to readers. There aren’t many references herein on
purpose. If you feel the need for a
rebuttal, then you will already have many, very effective references at your
fingertips to slam into your response.
Or you will get my point, smile, and contemplate your next study design a
priori over a beverage of your choice.
Allen
References:
Amrhein V, Greenland S, McShane B. 2019. Scientists rise up against statistical significance. Nature 567(7748):305-307.
Egler FE. 1986. Physics envy in ecology. Bulletin of the Ecological Society of America 67:233-235.
Harrison XA, Donaldson L, Correa-Cano ME, Evans J, Fisher DN, Goodwin CED, Robinson BS, Hodgson DJ, Inger R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.
Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N. 2016. Ten simple rules for effective statistical practice. PLoS Comput Biol 12(6):e1004961.
Lortie CJ, Braun J, Filazzola A, Miguel F. 2020. A checklist for choosing between R packages in ecology and evolution. Ecol Evol 10:1098-1105.
Murtaugh PA. 2009. Performance of several variable‐selection methods applied to real ecological data. Ecology Letters 12:1061-1068.
Pigliucci M. 2002. Are ecology and evolutionary biology ‘soft’ sciences? Annales Zoologici Fennici 39:87-98.
SAS Institute Inc. 1989. SAS STAT user's guide, version 6. 4th ed. SAS Institute Inc., Cary, N.C.
Zuur AF, Ieno EN. 2016. A protocol for conducting and presenting results of regression‐type analyses. Methods Ecol Evol 7:636-645.