Cover Page

Series Editor
Françoise Gaill

Statistics in Environmental Sciences

Valérie David

images

Preface
Statistics: Essential Tools to be Carefully Considered

“Don’t interpret what you don’t understand...
call for an understanding of the methods, their ins and outs,
before making sense of the conclusions”.

Nicolas Gauvrit, Statistiques, méfiez-vous ! (2014)

In the 19th Century, the British Prime Minister Benjamin Disraeli defined three kinds of lies: “lies, damned lies and statistics”, already underlining the controversy surrounding these tools. Within the scientific community, pro- and anti-statistics camps are in conflict, as shown by disciplines that are still reluctant to use them. Many media outlets use statistics to lull public opinion and put forward results that ultimately make no sense without some prior clarification.

More recently, in an article published in Le Monde on October 4, 2017, “Publier ou périr : la Science poussée à la faute” (“Publish or perish: science pushed to a fault”), journalists highlighted the misuse of statistics as one of the bad research practices fostered by the race to publish. These misuses stem from a poor understanding of how these tools work: biased sampling, simplistic experiments that limit reproducibility, results exaggerated with respect to the statistical population considered or to their significance, or an underestimation of the risks of error associated with statistical tests.

Despite these criticisms, it is clear that once these tools begin to be used within a scientific discipline, they quickly become essential. The use of statistics is the only way to generalize sample results to the population level, given sampling fluctuations and the inherent variability of “natural” objects. However, statistics can undoubtedly lead to biased results if they are not applied rigorously.

Statistics require a kind of “calibration”. It would not be appropriate for a biogeochemist to use an oxygen probe without first calibrating it according to environmental parameters such as temperature, or for a systems ecologist to identify species without using rigorous and expertly recognized determination keys. The “calibration” of statistical tools consists of verifying the conditions of application of the tests used to meet a specific objective (e.g. comparison of population averages, existence of trends, etc.). These tests are based on mathematical equations that rest on certain hypotheses. Failure to respect these hypotheses invalidates the application of the equations used in the test, because the mathematical properties on which it was designed no longer hold.

Thus, these tools are essential for an objective scientific approach, but their use requires particular rigor in their implementation and interpretation. The objective of this book is therefore to promote an understanding of the use of statistics by explaining the spirit behind their design, and to present the analyses most commonly used in environmental sciences, their principles, their advantages and disadvantages, their implementation via the R software and their interpretation in the field of environmental sciences.

Valérie David

May 2019

Introduction¹

I.1. What is the relevance of statistical analysis in environmental sciences?

I.1.1. The intrinsic variability of “natural” objects

“Artificial” objects from industrial manufacture, such as Bijou branded madeleines, are characterized by distinctive features, in accordance with precise specifications, in other words, weight, height, calorie intake, carbohydrate content, etc. This calibration is particularly monitored during the production chain and any deviation from the imposed standard leads to the removal of the object from sale. For example, the weighing of a “sample” of 1,000 madeleines (Bijou brand) shows only very small fluctuations in weight between 24.85 and 25.19 g (Figure I.1).

In contrast, “natural” objects, alive or not, are marked by strong interindividual variability for a given species, or between hydrological or pedological samples for, respectively, a theoretically homogeneous water body or soil. Thus, although classified by size according to weight, a sample of 1,000 size-5 oysters that correspond to the same age group shows a fairly broad distribution curve with an average of 37 g and a range of variation mainly fluctuating between 30 and 45 g (Figure I.1).

This high variability is inherent in “natural” objects and distinguishes them from “artificial” objects.

image

Figure I.1. Weight distribution for 1,000 industrially produced Bijou brand madeleines and 1,000 individual living “natural” objects, cupped oysters (size-5)

This intrinsic variability means that several elements/individuals, called replicates, must be considered in order to characterize a “natural” object. The greater the variability, the greater the number of elements that must be considered in order to properly capture the characteristics of the object under study.

I.1.2. Describing a natural population

Let us take an oyster population present in the Arcachon Bay (South-West of France) comprising a total of 30 individuals. It is simple to determine which of them are infested by the parasitic worm Polydora spp. (Figure I.2). If 12 oysters are parasitized, the actual parasitic prevalence (in other words, the proportion of infested oysters) is therefore 40% at the bay level.

However, it is completely utopian to consider an exhaustive analysis of the oyster population for a given environment. Indeed, the number of individuals is often much higher so taking a total census would be very time-consuming. In addition, the study of oyster parasitism involves dissection and therefore the euthanasia of hosts (like most studies conducted on living beings), and it is totally unacceptable to decimate a population under the pretext of scientific research. The analysis of a natural population, living or not (e.g. oysters, water bodies), will therefore require representative sampling: only a few elements will be sampled, but in such a way as to represent the composition and complexity of the entire population.

image

Figure I.2. Hypothetical example 1: all oysters in the Arcachon Bay were counted (N = 30 in total) and analyzed; it is possible to observe the true parasitic prevalence value of the Polydora worm, which is 40% here

Let us consider the prevalence obtained from a first sample of eight oysters taken in the Arcachon Bay. This is 25% (two out of eight infested oysters; sample 1, Figure I.3). This value is different from that obtained from the population as a whole (40%) due to sampling fluctuations (in other words, only eight oysters are considered out of 30). In addition, a second sample taken from this same population gives a prevalence of 50% (four out of eight infested oysters; sample 2, Figure I.3). This value is not only different from the value obtained at the population level, but also from that of sample 1. Thus, two samples giving different values can come from the same initial population.
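To make these sampling fluctuations concrete, here is a minimal sketch in R (the software used throughout this book), assuming a hypothetical vector of 30 oysters of which 12 are parasitized; the prevalences obtained naturally vary from one draw to the next.

# Hypothetical population of 30 oysters: 12 parasitized (TRUE), 18 not (FALSE)
population <- c(rep(TRUE, 12), rep(FALSE, 18))
mean(population)                  # true prevalence: 0.4 (40%)

set.seed(1)                       # only to make this illustration reproducible
sample1 <- sample(population, 8)  # first sample of 8 oysters (without replacement)
sample2 <- sample(population, 8)  # second sample of 8 oysters
mean(sample1)                     # observed prevalence of sample 1
mean(sample2)                     # observed prevalence of sample 2, often different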

image

Figure I.3. Comparison of two geographically distinct hypothetical oyster populations (Arcachon Bay and Marennes-Oléron Bay). The total number of oysters in each population is 30, but the actual prevalence is different, 40% and 20%, respectively. Three samples were taken, two from the Arcachon population and one from the Marennes-Oléron Bay. For a color version of this figure, see www.iste.co.uk/david/statistics.zip

Let us now compare the parasitic prevalence of two oyster populations between two renowned oyster farming bays, the Arcachon Bay and the Marennes-Oléron Bay, each composed of a total of 30 individuals (Figure I.3). The real prevalence is 40% for Arcachon and 20% for Marennes-Oléron. A sample taken from the population of the Marennes-Oléron Bay gives a prevalence of 25%, which is not only different from the actual prevalence of the system, but also identical to that of sample 1 of the Arcachon Bay (Figure I.3). Thus, two samples from two different populations can give identical results.

In conclusion, working from samples does not give exact values, which complicates the generalizations that can be made from these samples at the population level.

Although sampling is mandatory in environmental sciences because of the variability inherent in any “natural” object, the difference observed between two samples from two populations to be compared may be related to (1) a “real” difference between these two populations and (2) in part, to sampling fluctuations. The respective share of one or the other, the real effect versus the random effect, is impossible to quantify because the value of the variable considered at the level of the entire population is not accessible. The “statistical” tool is the only way to decide, by assessing the risk of being wrong when making the decision to consider these two populations as different, or not (Figure I.4).

image

Figure I.4. Comparison of two hypothetical oyster populations that are geographically separated (Arcachon Bay and Marennes-Oléron Bay). Only statistics make it possible to decide, by assessing the risk of being wrong when making the decision to consider these two populations as different, while properly generalizing the results obtained from the samples taken within each population

Thus, statistics refer to the set of scientific methods from which data are collected, organized, summarized, presented and analyzed, and which allow conclusions to be drawn and sound decisions to be made regarding the populations they represent.

I.2. The statistical mind: from the representative sample to the population

In general, the scientific approach consists of taking a sample (or samples), describing it and “extrapolating” the results obtained to the population(s) using a statistical approach. The sample(s) must therefore be representative of the population.

I.2.1. The representativeness of a sample

A representative sample must reflect, as closely as possible, the composition and complexity of the entire population. It is therefore a question of giving each element of the population an equal chance of being sampled. Only sampling carried out at random within the population, in other words, random sampling, allows this. However, “at random” does not mean “in any old way”: Box I.1 shows that the human mind is too sophisticated to handle chance properly.

While this may seem surprising, only a precise methodology can provide access to random sampling. Let us return to the population of oysters in the Arcachon Bay, this time with a more plausible number of individuals (N = 8,000). In theory, all elements of the population should be identified and numbered from 1 to 8,000 (Figure I.5). If the sample consists of eight elements, eight numbers should be drawn at random (by computer or using a random number table) and only the oysters corresponding to these numbers should be considered in the sample (Figure I.5).
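A minimal R sketch of this procedure, using the figures from the example above (the indices drawn are arbitrary and would differ on another run):

N <- 8000                   # hypothetical size of the oyster population
n <- 8                      # sample size
selected <- sample(1:N, n)  # draw 8 numbers at random, without replacement
selected                    # only the oysters bearing these numbers are collected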

image

Figure I.5. Precise procedure for a simple random sampling of eight oysters out of a population of 8,000. For a color version of this figure, see www.iste.co.uk/david/statistics.zip

This is simple random sampling; it is the only type to which the traditional statistical tests presented in this book can be applied as they stand. Other types exist, such as those that explicitly take into account a spatial or temporal variation factor known to affect the variability of the elements (see Chapter 3).

I.2.2. How to characterize the sample for a good extrapolation to the population

Some characteristics of the sample will allow us to estimate the properties of the population. Although the average is the first parameter that comes to mind when summarizing a sample, it is not enough.

In order to demonstrate this, let us put ourselves in the shoes of an Arcachon oyster farmer, who has just lost a large part of his production due to the development of a pathogen. He decides to go to the wild oyster reefs of the bay to ensure the supply of his stall at the Arcachon Christmas market. This will ensure he earns a revenue, which will certainly be less than with cultivated oysters (as the price per dozen is lower), but will allow him to limit the damage to his company from an economic perspective.

He is looking for size-5 oysters, weighing about 37 g (30–45 g), the size particularly affected by the pathogen in his own production. To make his work easier, he contacts the Comité régional de la conchyliculture (Regional Shellfish Farming Committee) (CRC) to get an idea of the average weights of wild oysters in different parts of the bay and thus target his sampling. The average weight of oysters has in fact recently been assessed at three sites in the bay: Cap Ferret (on average 37 g), Comprian (36 g) and Jacquets (45 g). He therefore rules out the Jacquets site, whose oysters are clearly too large, and finally decides to take the oysters from Cap Ferret because of the accessibility of the reefs. After hours of sorting, his hands bloody, he is only able to harvest about 30 size-5 oysters. He angrily returns to the CRC, who show him the graphic results of the survey carried out at these three sites. The average weight of oysters was indeed assessed on the basis of 100 oysters for each site and the averages are indeed accurate, but the oyster farmer kicks himself when he sees the distribution curves of the weights obtained per site (Figure I.6).

It is clear from the curves relating to Cap Ferret and Comprian that wild oysters have very similar average weights, but a very different weight structure. Most of the population has a weight close to the average weight of 36 g in Comprian and corresponds to size-5 oysters, whereas the population of Cap Ferret has two size groups, one smaller and the other larger than the average weight: few oysters in Cap Ferret actually have the size sought by the oyster farmer. Paradoxically, with a comparable sampling effort, the oyster farmer could have collected many more size-5 oysters at Jacquets, even though the overall average there is higher.

Thus, this example illustrates that describing a sample by its average alone is not sufficient. It also requires a parameter reflecting interindividual variability. In statistics, this variability is captured by so-called dispersion parameters such as variance, standard deviation, standard error, etc. Moreover, the average is not the only parameter describing the central value of a population, although it is the one most commonly used. Parameters such as the median or the mode may sometimes be more appropriate (see section 2.2.1).
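By way of illustration, the following R lines compute these central tendency and dispersion parameters for a small sample of hypothetical weights (values invented for the example):

weights <- c(31.2, 35.8, 44.1, 29.5, 37.0, 42.3, 36.4, 33.9)  # hypothetical weights (g)
mean(weights)                        # average (central tendency)
median(weights)                      # median, an alternative central value
var(weights)                         # variance (dispersion)
sd(weights)                          # standard deviation (dispersion)
sd(weights) / sqrt(length(weights))  # standard error of the mean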

image

Figure I.6. Distribution curve of the weights of wild oysters taken from Cap Ferret, Comprian and Jacquets by the CRC. The calculated averages are shown in gray. The gray box corresponds to the size-5 oysters

I.2.3. Effect of sample size

Contrary to popular belief, the size of a sample does not determine its representativeness: it is not the number of individuals considered that determines whether the sample reflects the complexity of the population, but the type of sampling. As described above, only random sampling can ensure this representativeness.

Let us look at the distribution of the oyster population at Comprian. This is in the form of a bell curve (Gaussian curve) centered on an average of 36 g with a high weight dispersion between 20 and 49 g (Figure I.7).

image

Figure I.7. Curve of the weight distribution of wild oysters in Comprian for the entire population (N = 8,000 oysters) and two samples of n = 8 oysters and n = 3,000 oysters. As the sample size increases, the confidence interval of the average calculated from the sample’s characteristic parameters decreases. For a color version of this figure, see www.iste.co.uk/david/statistics.zip

Two random samples of different sizes (sample 1 of eight oysters and sample 2 of 3,000 oysters) were taken from this population. The average value of each sample is likely to differ from the true population average due to sampling fluctuations, but we will see later that the dispersion parameters (in other words, the standard error) calculated from these samples allow us to estimate a confidence interval within which there is a 95% chance of finding the true average of the population (see section 2.3.1). The larger the sample, the smaller this confidence interval and therefore the better the estimate of the “true” average. For the small sample, the population average is likely to be between 33.5 and 38.5 g (a range of 5 g), while this range is almost halved for the large sample (2.8 g; Figure I.7). The estimate is therefore more accurate for a large sample.
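The sketch below illustrates this effect in R under assumed values: a Gaussian population of 8,000 weights centered on 36 g with a standard deviation of 6 g, chosen only to mimic Figure I.7, so the interval widths obtained will not exactly match those quoted above.

set.seed(1)                                   # for a reproducible illustration
population <- rnorm(8000, mean = 36, sd = 6)  # hypothetical population of weights (g)
ci_width <- function(x) diff(t.test(x)$conf.int)  # width of the 95% confidence interval of the mean
ci_width(sample(population, 8))               # small sample: wide interval, imprecise estimate
ci_width(sample(population, 3000))            # large sample: much narrower interval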

In environmental sciences, the choice of sample size (in other words, sampling effort) is the best compromise between feasibility and the power of the statistical test implemented. It is neither a matter of spending an unreasonable amount of time analyzing samples nor of decimating a population. However, the sample must be large enough to make the estimate as accurate as possible in order to conclude on the objective set. For example, if the aim is to compare two populations, the smaller the confidence intervals, the easier it will be to determine that they do not overlap (or only overlap a little) and, if so, that the populations are different. The ability of a statistical test to detect a difference between two sampled populations is known as the “power” of the test (see section 4.3.1).

I.3. The “statistical” tool in environmental research

I.3.1. Environmental research: from description to explanation

Scientific research is not limited to the study of a single parameter, in the sense that the researcher’s objective will not only be to describe the spatial and temporal variability of a parameter of interest (e.g. parasitic prevalence of oysters), but also to understand how other factors control its fluctuations (e.g. size, condition indicators, site contamination, parasite life cycle, etc.). Thus, many parameters related to the parameter of interest will be sampled simultaneously during a study. For example, the search for environmental factors influencing parasitic prevalence may begin by analyzing the joint evolution of prevalence and each candidate factor, using graphs and statistical analyses that consider the parameters in pairs, in order to determine whether the relationships are “real” (in other words, significant).

However, it is unlikely that only a single factor affects the parameter of interest. For example, the parasitic prevalence could be higher if the oysters are older (longer contact time with parasites) and/or if the oysters are physiologically weakened by the presence of a contaminant. A global approach that considers all the potentially explanatory parameters will therefore have to be used, and the human mind is too limited to grasp such relationships solely on the basis of the global data table.

An analysis of the parameters in pairs is not only time-consuming but also biased by the set of “hidden” variables and the covariation between explanatory factors. For example, a parameter pair approach would reveal a relationship between prevalence and condition indicators, between prevalence and contamination, between size and prevalence and a last one between contamination and condition indicators. A global statistical approach would make it possible to prioritize the effect of these factors by highlighting, for example, that the relationship with size is not due to prolonged contact with the parasite, but rather due to an indirect impact of the contaminant on the physiology of the oyster.

I.3.2. The place of statistics in the scientific process

The scientific approach is comparable in some respects to a criminal investigation. Let us take the example of the investigations conducted in a well-known television series (Figure I.8). The series always begins with a crime scene where the team picks up clues and interviews witnesses. Samples are sent to the laboratory for further investigation. Witnesses or suspects are questioned about their alibis at the police station. Finally, all the clues are combined into evidence in order to reach a conclusion about the murderer.

In comparison, any scientific approach begins with field sampling to measure the variables of interest. For example, in a study analyzing the effect of factors controlling the state of health of oysters in the Arcachon Bay, variables reflecting the physiological state of the species, as well as others characterizing its physico-chemical environment (temperature, salinity, metal contamination, etc.), would be recorded at different stations. These field variables are complemented by laboratory analyses (e.g. metal assays), with or without experimentation (e.g. to determine the specific impact of copper contamination on the survival of oysters). Lastly, all data are processed in order to meet the objectives set. The overall “field–laboratory–analysis” approach is therefore similar to a police investigation: the variables measured in the field or in the laboratory are the “clues” collected during a criminal investigation, and the processing of the data collected can be compared to the investigators’ reflection on the evidence.

image

Figure I.8. Comparison of the scientific approach with criminal investigations. For a color version of this figure, see www.iste.co.uk/david/statistics.zip

The fundamental difference, however, is that investigators must explore all possible leads, whereas scientists cannot access the entire population and will therefore use statistics to fill this gap. Sampling plans or experimental designs should make it possible to generalize the results to the target population and should take into account the knowledge already acquired in the literature on the subject under study (see Chapter 3). For example, the stations sampled in the field could be randomly selected from known oyster farming areas.

The scientific approach is therefore divided into three stages (Figure I.9):

  1) the field with a sampling strategy that is representative of the population in order to highlight the relationships between variables;
  2) the laboratory to analyze certain variables or experimentally approach certain causal relationships between variables;
  3) modeling to compare the relationships observed in the field with those obtained through experimentation and thus better understand these relationships.

image

Figure I.9. The synergy of research approaches

I.3.3. The importance of the conditions of application and the statistical strategy

Statistical tests are based on mathematical equations whose properties rest on certain assumptions. These assumptions must therefore be verified: these are the conditions of application. If the validity of a test is conditioned, for example, on the data following a known distribution law, such as the normal distribution (see Chapter 2), then failure to comply with this condition invalidates the application of the equations used in this test, because the mathematical properties on which it was designed no longer hold.
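As a simple illustration (the tools actually recommended for checking conditions of application are detailed in Chapter 5), a condition such as normality can be examined in R on hypothetical data as follows:

set.seed(1)
x <- rnorm(30, mean = 36, sd = 6)  # hypothetical oyster weights (g)
shapiro.test(x)                    # Shapiro-Wilk test: a large p-value gives no evidence against normality
qqnorm(x); qqline(x)               # graphical check against a normal distribution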

In addition, the scientific objective requires a statistical strategy on which the sampling plan or experimental design depends. Although the statistical tool only comes into play at the end of the scientific process, it must be considered beforehand in order to guide choices and numerical analyses: number of replicates, sampling strategy, experimental approaches. The traditional statistical tests developed in textbooks are designed for a simple random sampling strategy. Any other strategy requires an adjustment of the tests (see Chapter 3).

It is a mistake to think that statistics only come into play during the data processing phase. For this phase to be effective, certain choices must be made upstream, starting from the design of the sampling plan or the experimental design, whether in the type of approach, the choice of variables, the number of replicates, etc. It is indeed frustrating to realize, after the analysis, that the objective could have been better addressed “if...” (Figure I.10). All the stages of the scientific process are interconnected through the choice of the statistical tests that will ultimately be used. In theory, the type of test to be implemented should already be known when the study objective is defined.

image

Figure I.10. Successive/interconnected approaches through the “choices” made throughout the scientific process

I.4. Book structure

The statistics presented in this book are those most commonly used in the field of environmental sciences to compare parameters and to fit linear models explaining a variable Y by one or more explanatory variables, using parametric or non-parametric tests. They are described through concrete examples in environmental sciences and implemented using the R software, which is widely used in the scientific world. The book is organized as follows:

  – Chapter 1 introduces the use of the R software with some recommendations for its correct use, basic operations, data import, design of pivot tables and graphs;
  – Chapter 2 presents the fundamental concepts in statistics: the basic vocabulary, how to summarize a sample with the parameters of central tendency and dispersion, as well as the main probability laws on which the statistical tests are based;
  – Chapter 3 describes the construction of sampling plans and experimental designs based on the objectives and assumptions made prior to the study;
  – Chapter 4 introduces the main principles of statistical tests, including decision theory and the global approach adopted, using the example of the most traditional tests;
  – Chapter 5 proposes keys for choosing statistical tests according to the objective and available data, as well as statistical tools useful for testing the conditions under which these tests are applied;
  – Chapter 6 presents the tests for comparing parameters with bilateral or unilateral alternatives, such as two averages, two or more distributions or proportions, etc.;
  – Chapter 7 focuses on traditional and generalized linear models describing a quantitative Y variable by proposing equations based on several other explanatory variables, with their principles, conditions of application and different application examples according to the design considered;
  – Chapter 8 describes alternatives to these linear models in the case of non-compliant conditions, alternatives based on rank or permutation tests;
  – the Conclusion describes in particular how to present statistics in a scientific report or publication.
  1. This introduction is a simplified and summarized overview of Scherrer’s books (1984) and of online books such as Poinsot’s (2004).
  2. This experiment is based on Poinsot’s book (2004), Statistics for Statophobes, available at: https://perso.univ-rennes1.fr/denis.poinsot/statistiques_%20pour_statophobes/STATISTIQUES%20POUR%20STATOPHOBES.pdf.