Monday, 24 August 2020

Statistics and Ecologists Today

Statistics and Ecologists Today: More from the “Emperor Has No Clothes Chronicles”

In my opinion, the question most often asked by ecology practitioners today is “what statistical method should I use in my study?”  Why? Because whether you are just beginning to learn about ecology, e.g., you are a first-year graduate student preparing your proposal, or you are further on and thinking about publishing in a peer-reviewed journal, the pressure is on to understand and decide upon your statistical approach.  It is a modern paradigm that statistics are fundamental in ecology, i.e., most likely your supervisor and your journal editors/reviewers will demand statistical analyses be included in your publication.  The question of the method to use is both good and bad.  One of the paradigm’s great outcomes is the training of ecologists to design effective and meaningful studies (see also Prof. Krebs’s thoughts - https://www.zoology.ubc.ca/~krebs/ecological_rants/on-defining-a-statistical-population/#comments).  However, one massive failure is the mushrooming of complex statistical approaches and easy software packages that are neither as effective nor as meaningful as some ecologists wish them to be.

The Covid-19 lock-down let me catch up on some statistics papers I’d tucked into my “to-read” folder.  My younger colleagues are very bright and doing very complex statistical analyses that aren’t easy to understand, and I’m reviewing work using these techniques for journals and funding agencies, so I wanted to invest time exploring and hopefully learning more about these emerging statistical approaches and applications.  There are many excellent papers describing methods and applications that are clearly written by intelligent people who have spent time thinking about statistical approaches in ecology, and more generally biology.  I’ve been reading about AAN, AIC, Bayes, CV-R2, GAM, GLMM, LLM, PLS, RDA, and RF among others.

My first conclusion is that there is a direct, positive correlation between the abundance and complexity of data arising from emerging sampling tools, e.g., remotely sensed data in my world, and the abundance and complexity of statistical analyses, e.g., Lortie et al. (2020).  I posit that the correlation began about the time that SAS and its 2000-page manuals hit our desks in the 1980s (SAS 1989).  Here is a great quote that summarizes statistics in biology more broadly today:  “The suite of statistical tools available to biologists and the complexity of biological data analyses have grown in tandem…The availability of novel and sophisticated statistical techniques means we are better equipped than ever to extract signal from noisy biological data… [statistical] models are powerful yet complex tools.” (Harrison et al. 2018).  The quote is true regarding today’s much larger and more complex data sets and the complex statistical analyses, analytical approaches, and packages; however, the phrase “noisy biological data” glosses over the fact that it is the fundamental nature of biology to be messy and stay messy.  My second conclusion is that if you want to explain the fiery heat in your spicy chili, then trying to count and find patterns among the chili molecules isn’t a great investment of your time - “Black holes are simpler…But even if those equations could be solved for immense aggregates of atoms, they wouldn’t offer the enlightenment that scientists seek.” (https://aeon.co/ideas/black-holes-are-simpler-than-forests-and-science-has-its-limits).

It is the inherent nature of living things to be and stay messy.  All biological systems must have continuous variability and opportunities, random or not, to break moulds, otherwise selection for survival in a given environment can’t occur and life and lineages end.  This is evolution and, broadly, natural selection with some mutations thrown in along the way.  It is this dynamic variability of living systems that jams up biology as it tries to fit into classical, physics-based definitions of the natural world (e.g., Egler 1986; Pigliucci 2002).  Biology has one law for certain (for now at least):  there will be lots of fluxing and variability and the occasional, and often unpredictable, mutations.  Stability is a major discussion point in ecology, but it is a temporal illusion because if you wait a few or 1,000,000 years, change will happen.

The rise of numeracy in biology is a great thing and there is no arguing that numbers are important and useful, especially in ecology.  The problem is that mathematics is bounded and, as a consequence, it doesn’t always get along with the especially “noisy” data of ecology.  Counting, measuring, and summarizing are cornerstones of ecology.  Correlations and relations between measured factors and comparing groups are all very useful for developing an understanding of ecological systems.  The clash comes when ecologists, seeing their complex data, become enthralled by the complex number-busters of mathematics, e.g., statistics.

The clash occurs because 99% of mathematicians don’t understand that 99% of the rest of the world doesn’t get math.  Then add to this the rise of the machines.  Today there isn’t much in the way of statistical analyses that anyone with a few moments of ‘online help’ can’t do, e.g., the rise of “R” (Lortie et al. 2020; and a useful overview is https://blog.eduonix.com/software-development/rise-r-programming-language-usefulness-data-science/).  The story-line has been as follows: there is very beautiful mathematics (more on this later), it is translated through a machine with the virtual pressing of a button and with instant results, there is a growing throng of intelligent “applied statistics” crusaders, and voilà, we have the perfect recipe for disaster.  The crusaders are smart people, know there is a math issue, and sincerely also advise ecologists to ‘consult a statistician’.  But this is one of the disconnects: normal people (the 99%) don’t understand that mathematicians can’t conceive of a system not bounded by equations, and they are very happy that anyone is interested in mathematics and will dive into any math problem presented.

Taking a few steps back, I’m a math nerd and have been since I was 8 years old and calculating the least expensive set of groceries on a hand-held, pocket counter while wheeling the cart through the store.  I went to university to study mathematics, did two years in Canada’s elite mathematics programme, realized that math had other uses (leading to my mostly ecology career), and 30 years later I now teach statistics to undergraduate and graduate students in the environmental sciences.  I have been in the community of mathematicians and, while I am generalizing about them for literary purposes (apologies to my math friends), I live at the mathematics-ecology nexus and I have happily added to the mathematics-ecology mash-up during my career, especially early on.

Math is cool even if 99% of us don’t get it.  Watching your hard work and collected data become a statistically significant regression or show statistically significant differences supporting your original hypothesis is a powerful moment, especially early in your career.  Turning your very large and complex, interconnected data set into a 2-dimensional principal component space is amazing.  These analyses, among many math applications, have moved ecology far beyond counting and measuring, and the math can be very informative for advancing our understanding of complex systems.  Math gives ecologists many useful tools.  But, and there is a big but, mathematics has rules and ecological systems flout every one of those rules.  Ecology, i.e., the study of living things and their environments, is inherently variable across space and time, within and among individuals, families, groups, populations, species, communities, ecosystems, and at many more levels we can’t yet comprehend.  The biological and environmental information collected today will vary later today, be different tomorrow, and so on.  Natural systems want to change and have to change, but math needs stability and it has boundaries.

I describe our current situation as statistics running amok over ecology at the beckoning of ecology.  The math is beautiful and the people applying it and creating software to perform the complex computations are more intelligent than I, but the ever-increasing bird’s nests of statistical analyses are mostly unnecessary, as others more intelligent than I have pointed out (see for example, Murtaugh 2009 and Amrhein et al. 2019).  Imagine I need to get from my home on the east coast of Canada to the west coast some 5,400 km away.  I used to drive a 1976 Chevy Nova, which was the most standard car design on the road for a couple of decades.  I could successfully achieve my goal by driving that car across Canada with some simple assumptions that hold true: I have a paper map and there is gas and a mechanic who can fix any problem in every town.  Alternatively, I could drive a somewhat hypothetical, but close-to-reality, automobile of today that is self-driving, GPS linked and controlled, electric fueling, and so on.  My assumptions are that self-driving is possible on all roads, none of my computer systems fail, the GPS satellites are detectable, Siri isn’t leading me astray, the computers that run the remote things don’t fail, and so on.  I also assume I can get these things fixed, but anyone who drives a lesser-imagined car today knows you can’t get it fixed unless you are at a big city dealer with a service computer plug-in for your car.  My analogy in statistical language: a simple t-test may not look pretty in today’s psychedelic statistical landscape, but it achieves the same result.

I’m also sensing a negative impact on the advancement of ecology because we get distracted creating and promoting more and more complex statistical analyses and software.  It is an ever-deepening rabbit hole because our inherently complex ecological systems are far beyond our current ability to comprehend, and creating more complex statistical models and computational processes will never advance our understanding of the original question in ecology.  Trying to extract a signal we won’t recognize from guaranteed-to-be-increasingly-complex and noisy data is a never-ending do-loop.

My take-home message is what I try to instill in my young learners in the environmental sciences: “If your experiment needs statistics, you ought to have done a better experiment.”  This statement is most often attributed to E. Rutherford, and I explain that he (or whoever said it) wasn’t slamming statistics but was appealing for better experimental design.  I follow with Curry’s Corollary: “If you need statistics to tell something is significant, then it is not significant”.  This is about the natural variability you will face, that there is no magic bullet to overcome it, and that a clear question and good sampling design are the foundation you need to find a solution or get close to one, including which, if any, statistical approach you choose to use in your studies.  Statistics is just a tool, like many that we use in the environmental sciences.  It has value, but it remains just one tool in your toolbox.  I also emphasize ad nauseam, including with journal editors, that a far better tool is a well-thought-out figure showing an effect/no effect.

I encourage and teach the use of mathematics, especially statistics, in all the environmental sciences.  These disciplines should be very grateful because statistics’ greatest gift has been teaching us to think about our questions, sampling design, and interpretation and presentation of data (there are many useful guides, e.g., Kass et al. 2016; Zuur and Ieno 2016).

A final note to readers.  There aren’t many references herein on purpose.  If you feel the need for a rebuttal, then you will already have many, very effective references at your fingertips to slam into your response.  Or you will get my point, smile, and contemplate your next study design a priori over a beverage of your choice.

Allen

References:

Amrhein V, Greenland S, McShane B.  2019.  Scientists rise up against statistical significance.  Nature 567(7748):305-307.

Egler FE.  1986.  Physics envy in ecology.  Bulletin of the Ecological Society of America 67:233-235.

Harrison XA, Donaldson L, Correa-Cano ME, Evans J, Fisher DN, Goodwin CED, Robinson BS, Hodgson DJ, Inger R. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.

Kass RE, Caffo BS, Davidian M, Meng X-L, Yu B, Reid N.  2016.  Ten simple rules for effective statistical practice.  PLoS Comput Biol 12(6):e1004961.

Lortie CJ, Braun J, Filazzola A, Miguel F.  2020.  A checklist for choosing between R packages in ecology and evolution.  Ecol Evol 10:1098-1105.

Murtaugh PA.  2009.  Performance of several variable‐selection methods applied to real ecological data.  Ecology Letters 12:1061-1068.

Pigliucci M.  2002.  Are ecology and evolutionary biology ‘soft’ sciences?  Annales Zoologici Fennici 39:87-98.

SAS Institute Inc.  1989.  SAS STAT user's guide, version 6.  4th ed. SAS Institute Inc., Cary, N.C.

Zuur AF, Ieno EN.  2016.  A protocol for conducting and presenting results of regression‐type analyses.  Methods Ecol Evol 7:636-645.
