University of Vienna
Journal of Statistics Education Volume 9, Number 1 (2001)
Copyright © 2001 by Erhard Reschenhofer, all rights reserved.
This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Model selection; Selecting the level of significance; Testing.
In statistics courses, students often find it difficult to understand the concept of a statistical test. An aggravating aspect of this problem is the seeming arbitrariness in the selection of the level of significance. In most hypothesis-testing exercises with a fixed level of significance, the students are just asked to choose the 5% level, and no explanation for this particular choice is given. This article tries to make this arbitrary choice more appealing by providing a nice geometric interpretation of approximate 5% hypothesis tests for means.
Usually, we want to know not only whether an observed deviation from the null hypothesis is statistically significant, but also whether it is of practical relevance. We can use the same geometrical approach that we use to illustrate hypothesis tests to distinguish qualitatively between small and large deviations.
The histograms of many datasets occurring in practice have the appearance of a bell. They are symmetric about their means and tail off rapidly as we move away from the means. A typical example is shown in Figure 1a, which summarizes the mean temperatures in May recorded from 1845 to 1978 in St. Louis. (This dataset will be described in more detail in Section 3.) Of course, histograms can also look quite different from the distribution in Figure 1a. They may be skewed, have thick tails, or exhibit more than one peak. In this paper, we are interested in the last case, particularly that of two peaks. Histograms with two peaks are called bimodal, and those with only one peak are called unimodal. An example of a bimodal histogram is shown in Figure 2b, which summarizes a dataset containing mean temperatures observed in July and in September. Here bimodality is due to the fact that the dataset is heterogeneous. It could easily be dissected into two more homogeneous parts by studying the July temperatures and the September temperatures separately.
Figure 1a. Histogram of the Mean St. Louis Temperature in May (1845-1978).
Figure 1b. Histogram of the Mean St. Louis Temperature in September (1845-1978).
Clearly, bimodality does not always occur when we have a mixture of two sets of observations with different means -- the means must be sufficiently different. Consider, for example, Figure 2a, which summarizes a dataset containing mean temperatures observed in May and September. In this case, the difference between the mean of the May temperatures and that of the September temperatures is too small to cause bimodality. If we want to assess the difference between the means of two datasets, we could examine the shape of the histogram of the combined dataset. Bimodality of this histogram could serve as a qualitative indicator for a big difference between the means. This idea will be explained in more detail in Section 3.
Figure 2a. Histogram of the Mean St. Louis Temperature in May and September (1845-1978).
Figure 2b. Histogram of the Mean St. Louis Temperature in July and September (1845-1978).
Figure 2c. Histogram of the Mean St. Louis Temperature in July and September (1845-1978).
(Choosing class intervals that are too small gives rise to spurious peaks!)
Figure 2d. Histogram of the Mean St. Louis Temperature in July and September (1845-1978).
(Choosing class intervals that are too wide conceals genuine bimodality!)
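Both pitfalls, and the emergence of a second peak as the means move apart, can be explored with a small simulation. The sketch below (with illustrative numbers, not the St. Louis data) bins the combination of two artificial normal samples with a given class width and counts the local peaks of the resulting histogram:

```python
import random

def modes_of_combined(mean_a, mean_b, sd=1.0, n=10_000, width=1.0, seed=0):
    """Histogram a combined sample with the given class width and count
    the local peaks (bins higher than both neighboring bins)."""
    rng = random.Random(seed)
    data = ([rng.gauss(mean_a, sd) for _ in range(n)]
            + [rng.gauss(mean_b, sd) for _ in range(n)])
    lo = int(min(data) // width)
    hi = int(max(data) // width)
    bins = [0] * (hi - lo + 1)
    for x in data:
        bins[int(x // width) - lo] += 1
    return sum(1 for i in range(1, len(bins) - 1)
               if bins[i] > bins[i - 1] and bins[i] >= bins[i + 1])

# Means one standard deviation apart: the combined histogram stays unimodal.
print(modes_of_combined(0.0, 1.0))   # → 1
# Means four standard deviations apart: two clear peaks emerge.
print(modes_of_combined(0.0, 4.0))   # → 2
```

Shrinking `width` far below the spread of the data reproduces the spurious peaks of Figure 2c; widening it beyond the distance between the means conceals the bimodality, as in Figure 2d.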
A different question, namely whether or not an observed difference between two means is statistically significant, is discussed in Section 4. To answer this question, we must examine the distributions of the sample means rather than the distributions of individual observations. Again we might check for bimodality. But this time we must examine the combination of the distributions of the sample means. It turns out that bimodality occurs whenever the null hypothesis of identical means is rejected by a hypothesis test at an approximate 5% level of significance. Hence this approach provides a nice geometric interpretation of tests for differences between means at the 5% level.
The question of how the 5% level of significance was chosen as a standard is examined in Section 2. Finally, Section 5 discusses the usefulness of fixed-level significance testing versus the mere reporting of p-values, describes class reaction to the bimodality principle, and gives suggestions for covering this material with students.
A crucial problem in statistics is to discriminate between two or more competing hypotheses or models. The first problem of this kind faced by a beginner is that of testing the null hypothesis that the mean of a normal distribution is equal to a specified value c. When the sample size is large, it is usually suggested that we reject this null hypothesis whenever the distance between c and the sample mean exceeds two standard deviations of the sample mean. A similar problem is that of testing the null hypothesis that the means of two normal distributions are identical. The latter null hypothesis is usually rejected whenever the distance between the two sample means, $\bar{x}$ and $\bar{y}$, exceeds two standard deviations of $\bar{x} - \bar{y}$. In each of the two cases, the stated rejection rule guarantees that the probability of rejecting a true null hypothesis is only 5% (approximately).
But how can the choice of the 5% level be justified? Cowles and Davis (1982) investigated the question of how the 5% level of significance was chosen as a standard. Examining early literature in probability and statistics, they found that Fisher (1925) was perhaps the first to formally mention the 5% level. In his book Statistical Methods for Research Workers, Fisher stated that deviations exceeding twice the standard deviation are regarded as significant. However, Cowles and Davis (1982) stressed that Fisher should not be credited with introducing the 5% level, because his choice of this level was not casual and arbitrary, but was influenced by previous scientific conventions. At the beginning of the 20th century, statements about statistical significance were still given in terms of the probable error, the nineteenth-century measure of the width of a distribution (see Porter 1986 and Stigler 1986). (The German astronomer Friedrich Wilhelm Bessel appears to have coined the term 'probable error' or 'der wahrscheinliche Fehler' by 1815 (see Walker 1929, p. 186). The term 'standard deviation' was introduced almost 80 years later by Karl Pearson (see Stigler 1986, p. 328).) Deviations exceeding three times the probable error were considered significant (see, e.g., Student 1908). The probable error is defined as the median deviation from the mean. If the mean coincides with the median, which is the case for symmetric distributions, the probable error is just half the interquartile range. Observing that the upper quartile of a standard normal distribution lies between 0.67 and 0.68, we note that the probable error roughly corresponds to 2/3 of a standard deviation. In the normal case, a deviation of three probable errors therefore corresponds to a deviation of about two standard deviations. Hence it seems that the 5% level has a longer history than is generally appreciated.
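The arithmetic behind this correspondence is easy to verify numerically. The following sketch uses the standard normal distribution from Python's statistics module:

```python
from statistics import NormalDist

z = NormalDist()   # standard normal distribution

# The probable error in units of the standard deviation:
# the upper quartile of the standard normal.
pe = z.inv_cdf(0.75)
print(round(pe, 4))                    # → 0.6745

# Three probable errors correspond to roughly two standard deviations ...
print(round(3 * pe, 2))                # → 2.02

# ... and the two-sided probability beyond two standard deviations is
# approximately 5%, the conventional level of significance.
print(round(2 * (1 - z.cdf(2)), 3))    # → 0.046
```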
We use a simple meteorological example to introduce the bimodality principle. The variable of interest is the monthly mean temperature in St. Louis, Missouri. Data are available for the period from January 1845 to December 1978 (see Marple 1987). In view of the extreme unreliability of long-term weather forecasts, these measurements may be considered as roughly independent observations. This dataset is considered as a sample of 134 years from the population of all years. For each month we therefore have a sample of size n = 134; we denote the May temperatures by M1,..., Mn and the September temperatures by S1,..., Sn.
Clearly, it depends on the circumstances whether or not a difference such as that between the mean May temperature and the mean September temperature is considered as important. For an average citizen of St. Louis it may be insignificant, whereas for the operator of a solar power station it may be very important. A purely formal approach for assessing the size of this difference is to combine both samples into a single sample and then produce a histogram for the combined sample. If the distance between the means is large enough, this histogram will exhibit two peaks, each of which corresponds to a peak in one of the two original histograms. In our case, the difference is too small. The histogram for the combined dataset has only one peak (see Figure 2a); hence it does not indicate an important difference. In contrast, if we compare the mean temperatures in July, J1,..., Jn, with those in September, S1,..., Sn, we find two peaks in the histogram of the combined dataset (see Figure 2b). The bimodality of the histogram of the combined sample may be considered an indication of an important location difference between the two datasets. Indeed, the first peak is close to the mean of the September measurements ($\bar{S}$ = 21.1) and the second peak is close to the mean of the July measurements ($\bar{J}$ = 26.3).
The above procedure for distinguishing between important and unimportant location differences is not completely objective because it contains a subjective component, namely the choice of the classes used for the construction of the histograms. Unfortunately, this choice strongly influences the appearance of the histogram. Choosing the width of the class intervals too small could give rise to spurious peaks (see Figure 2c). On the other hand, genuine bimodality could be concealed by choosing the width of the class intervals too large (see Figure 2d). An obvious way to get rid of this subjective component is to use another graphical tool for the description of the data instead of the histogram. The probability distribution of a continuous random variable like the air temperature is characterized by its probability density function. The probability that the random variable takes on a value in the interval from a to b is just the area under the graph of the probability density function between a and b. A histogram can be regarded as an estimate of the probability density function. Many continuous random variables occurring in practice have bell-shaped probability density functions. Figures 1a and 1b suggest that this might be true also for our random variables M and S. For both datasets, neither the Kolmogorov-Smirnov test nor the Anderson-Darling test detects any deviation from normality at the 10% level of significance. We may therefore assume that their probability density functions are of the normal type. Normal probability density functions are completely determined by two parameters, the mean and the standard deviation. Clearly, we do not know the means and the standard deviations of the random variables M and S, but we can use estimates instead. Estimates of the means and the standard deviations of M and S are obtained by calculating the sample means,

$\bar{M} = \frac{1}{n}\sum_{i=1}^{n} M_i \quad \text{and} \quad \bar{S} = \frac{1}{n}\sum_{i=1}^{n} S_i,$

and the sample standard deviations,

$s_M = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (M_i - \bar{M})^2} \quad \text{and} \quad s_S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (S_i - \bar{S})^2}.$

Substituting these estimates for the unknown parameters, we obtain estimates of the probability density functions of M and S, namely $f(x \mid \bar{M}, s_M)$ and $f(x \mid \bar{S}, s_S)$. (Note that we use $\bar{M}$ and $\bar{S}$ as the parameters of f rather than $\mu_M$ and $\mu_S$.) The graphs of these functions are shown in Figures 3a and 3b, respectively. (DERIVE 4 was used to generate the plots of the probability density functions.) The first graph summarizes the dataset M1,..., Mn, and the second one summarizes the dataset S1,..., Sn.
Figure 3a. Estimated Probability Density Function for the Mean Temperature in May.
Figure 3b. Estimated Probability Density Function for the Mean Temperature in September.
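The computation of such estimates, and the evaluation of the fitted density, can be sketched in a few lines of Python. The measurements below are hypothetical stand-ins, not the actual St. Louis data:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """f(x | mean, sd): normal density with the given parameters."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

# Hypothetical September-like temperatures (made-up numbers).
data = [20.3, 21.8, 22.0, 20.9, 21.4, 20.6, 21.1, 21.7]
n = len(data)

# Sample mean and sample standard deviation as parameter estimates.
mean = sum(data) / n
sd = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# The fitted density f(x | mean, sd) summarizes the dataset; its peak
# sits at the sample mean.
peak_height = normal_pdf(mean, mean, sd)
print(round(mean, 2), round(sd, 2))
```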
What we need next is a graphical summary of the combined dataset M1,..., Mn, S1,..., Sn. Unfortunately, since all normal probability density functions are unimodal, they are of no use for the description of the combined dataset. Instead, we try to construct a summary for the combined dataset by combining the summaries of the two original datasets. To see how this can be done, we again consider histograms. The histogram for the combined dataset can be constructed either directly from the data M1,..., Mn, S1,..., Sn or indirectly from the histograms of the original datasets. In the latter case, we must use the same classes for all histograms. For example, consider the class with endpoints 16 and 17.5 degrees Celsius. Twenty-four (17.9%) of the 134 measurements M1,..., Mn, 3 (2.2%) of the 134 measurements S1,..., Sn, and 27 (10.1%) of all 268 measurements fall in this class. The proportion of all measurements falling in this class is just the average of the other two proportions. This is a consequence of the fact that the original samples are of the same size. Each measurement Mi or Si carries weight 1/268 in the combined histogram, but weight 1/134 in the histogram of its own dataset; averaging the two original histograms class by class therefore reproduces the histogram of the combined dataset.
Combining the two probability density functions depicted in Figures 3a and 3b we obtain the function

$g(x) = \tfrac{1}{2} f(x \mid \bar{M}, s_M) + \tfrac{1}{2} f(x \mid \bar{S}, s_S).$
Figure 4a. Combination of the Estimated Probability Density Functions for May and September.
Figure 4b. Combination of the Estimated Probability Density Functions for July and September.
Figure 5. Averages of Two Normal Probability Density Functions with Equal Standard Deviations but Different Means.
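The behavior shown in Figure 5 can be reproduced numerically. The sketch below combines two fitted densities with the sample means reported above ($\bar{S}$ = 21.1, $\bar{J}$ = 26.3) and a made-up common standard deviation (the article's values are not reported in this excerpt), then counts the peaks of the mixture on a fine grid:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def mixture(x, m1, m2, sd):
    """Equal-weight combination of two fitted normal densities."""
    return 0.5 * normal_pdf(x, m1, sd) + 0.5 * normal_pdf(x, m2, sd)

def count_peaks(m1, m2, sd, lo=15.0, hi=33.0, step=0.01):
    """Count local maxima of the mixture on a fine grid."""
    xs = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    ys = [mixture(x, m1, m2, sd) for x in xs]
    return sum(1 for i in range(1, len(ys) - 1)
               if ys[i] > ys[i - 1] and ys[i] > ys[i + 1])

# Standard deviation 1.2 is a hypothetical placeholder.
print(count_peaks(21.1, 26.3, 1.2))   # difference 5.2 > 2 * 1.2: bimodal → 2
print(count_peaks(21.1, 22.0, 1.2))   # difference 0.9 < 2 * 1.2: unimodal → 1
```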
In statistics, the difference between two sample means is usually assessed in two ways. First, the size of the difference is judged by its practical importance. In most applications, this can easily be accomplished without sophisticated decision rules. Only if the investigator has absolutely no clue which differences should be considered important might he/she have recourse to a formal rule like the one based on a bimodality check. According to this rule, called the bimodality principle, a location difference between two (estimated) normal probability density functions is regarded as large (or important) if their mixture density is bimodal. In the previous section, we applied this principle to distinguish between small and large location differences.
The second interesting question regarding the difference between two sample means is whether it is large enough to indicate that the population means also differ. In our example, we could wish to determine whether the overall mean temperature in May differs significantly from that in September. This question may be answered by applying a 5% level hypothesis test. In the second part of this section, we will show how the bimodality principle can be used to illustrate this test. But first we consider the one-sample case.
Suppose we are given a sample x1,..., xn from a normal distribution with mean $\mu$ and standard deviation $\sigma$. We formulate a simple null hypothesis, H0, and an appropriate alternative hypothesis, HA:

$H_0: \mu = c \quad \text{versus} \quad H_A: \mu \neq c.$
The null hypothesis states that the mean is equal to a specified value c, and the alternative hypothesis states that the mean differs from this value. It is natural to test the null hypothesis by calculating the sample mean $\bar{x}$ and rejecting the null hypothesis whenever the discrepancy between $\bar{x}$ and c is too large. The significance of any discrepancy depends on the reliability of the sample mean. To assess the reliability of a sample mean, we may consider its sampling distribution. The sampling distribution of $\bar{x}$ is normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. An estimate is given by $s/\sqrt{n}$, where s is the sample standard deviation. Under the null hypothesis, $\bar{x}$ should be close to c, hence the two probability density functions $f(x \mid \bar{x}, s/\sqrt{n})$ and $f(x \mid c, s/\sqrt{n})$ should not differ too much (see Endnote). The null hypothesis could be rejected if their mixture density is bimodal. Recalling from Section 2 that the mixture density of two normal probability density functions is bimodal if the difference between the means exceeds two standard deviations, we note that in this case the bimodality principle rejects the null hypothesis if $|\bar{x} - c| > 2s/\sqrt{n}$. Thus the bimodality principle makes the same decision as the standard large sample significance test at the 5% level. (Actually a large sample t-test at the 5% level rejects the null hypothesis if $|\bar{x} - c| > 1.96\,s/\sqrt{n}$.)
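This one-sample rejection rule amounts to a few lines of code. The following sketch (with made-up data) implements the bimodality decision:

```python
from math import sqrt

def bimodality_test(sample, c):
    """Reject H0: mu = c when |xbar - c| > 2 * s / sqrt(n), the point at
    which the mixture of f(x | xbar, s/sqrt(n)) and f(x | c, s/sqrt(n))
    becomes bimodal (approximately the 5% level for large n)."""
    n = len(sample)
    xbar = sum(sample) / n
    s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return abs(xbar - c) > 2 * s / sqrt(n)

# Made-up sample with mean near 0: H0: mu = 0 survives, H0: mu = 1 is rejected.
sample = [0.2, -0.4, 0.1, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1, -0.1]
print(bimodality_test(sample, 0.0), bimodality_test(sample, 1.0))   # → False True
```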
We now return to our meteorological hypothesis testing problem, which involves the two samples M1,..., Mn and S1,..., Sn and the null hypothesis $H_0: \mu_M = \mu_S$ of identical population means. We summarize the sampling distributions of the two sample means by normal probability density functions with common standard deviation $\sqrt{s_M^2/n + s_S^2/n}$, the estimated standard deviation of the difference $\bar{M} - \bar{S}$. By the bimodality principle, the null hypothesis is rejected if the mixture of these two densities is bimodal, i.e., if

$|\bar{M} - \bar{S}| > 2\sqrt{s_M^2/n + s_S^2/n}.$

Rewriting this inequality as

$\frac{|\bar{M} - \bar{S}|}{\sqrt{s_M^2/n + s_S^2/n}} > 2,$
we notice immediately that our two-sample test based on a bimodality check agrees with the standard large sample test for comparing two means if the 5% level of significance is chosen for the latter test.
In our example, the distance between $\bar{M}$ and $\bar{S}$ exceeds two estimated standard deviations of their difference, and hence the hypothesis of identical means is rejected at the 5% level of significance. Correspondingly, the combination of the probability density functions $f(x \mid \bar{M}, \sqrt{s_M^2/n + s_S^2/n})$ and $f(x \mid \bar{S}, \sqrt{s_M^2/n + s_S^2/n})$ exhibits two peaks (see Figure 6).
Figure 6. Combination of the Probability Density Functions $f(x \mid \bar{M}, \sqrt{s_M^2/n + s_S^2/n})$ and $f(x \mid \bar{S}, \sqrt{s_M^2/n + s_S^2/n})$.
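For completeness, the two-sample decision rule can be sketched in the same way (again with made-up data):

```python
from math import sqrt

def two_sample_bimodality_test(xs, ys):
    """Reject H0 of equal means when |xbar - ybar| exceeds two estimated
    standard deviations of the difference of the sample means."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    sy2 = sum((y - ybar) ** 2 for y in ys) / (m - 1)
    return abs(xbar - ybar) > 2 * sqrt(sx2 / n + sy2 / m)

# Made-up samples: clearly separated means versus identical means.
a = [1.0, 2.0, 1.5, 2.5, 1.8, 2.2]
b = [5.1, 4.9, 5.5, 4.5, 5.2, 4.8]
print(two_sample_bimodality_test(a, b), two_sample_bimodality_test(a, a))   # → True False
```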
Today's statistical software calculates p-values automatically; hence the practice of fixed-level significance testing is no longer dictated by the availability of tables. Of course, stating whether a hypothesis is rejected or not at some level of significance is not as informative as giving the p-value itself. Reporting the actual p-value indeed makes it much easier for the reader of a report to judge the significance of a result. Nevertheless, there are still situations, e.g., in economic forecasting, where statisticians must decide for or against some hypothesis before they can carry on with their work. Ideally, if a statistician is going to make such a decision, he/she should take the consequences of his/her decision into account in choosing the level of significance. Unfortunately, this often cannot be accomplished in an objective and verifiable way. At best, it will only be possible to decide whether the 10% level is more appropriate than the 1% level, but certainly not whether the 4% level is more appropriate than the 6% level. Hence it still makes sense to have standards like the 1% level, the 5% level, or the 10% level. The mere existence of such standards already makes cheating more difficult. Clearly, if someone reports that he/she has rejected a hypothesis at the 6% level, the reader of the report will check suspiciously whether there are good reasons for using just this level of significance.
I have used the bimodality principle to illustrate 5%-level hypothesis tests in introductory statistics courses for science, education, and engineering students. However, I did not explain all the details and omitted the proof. I just showed the figures and spent approximately half an hour explaining them. Student reaction was mixed. Only a few students, particularly those who frequently asked questions, explicitly appreciated the explanation. The majority never questioned the use of the 5% level and therefore felt no need for an illustration. In my explanation I focused on the coincidence that, on the one hand, the critical value of a large sample t-test at the 5% level is approximately 2, and, on the other hand, the mixture density of two normal probability density functions is bimodal if the difference between their means exceeds two standard deviations. This material (including the proof and all details) could possibly be appropriate for an investigation involving extra effort outside of a typical class. This may be an honors project associated with a class or even a senior project.
I wish to thank the Editor, the Associate Editor, and the Referees for helpful comments. This paper was written at the Sultan Qaboos University, Oman.
Note that the sample standard deviation s is a reasonable estimate of $\sigma$ under both H0 and HA. Under H0, we could use a different estimate of $\sigma$ obtained by replacing $\bar{x}$ by c in the definition of the sample standard deviation. However, this would have the unpleasant consequence that the two normal probability density functions would have unequal standard deviations. In the case of a large discrepancy between the standard deviations, bimodality could occur even if there were no significant discrepancy between the means. In addition, the case of unequal standard deviations is technically demanding. Finally, it is quite common in situations where we must distinguish between different hypotheses (or models) to use the estimate of the nuisance parameter obtained under the weakest hypothesis (or with the largest model) also for the assessment of the stronger hypotheses (smaller models).
Theorem: The mixture density

$g(x) = \tfrac{1}{2} f(x \mid \mu_1, \sigma) + \tfrac{1}{2} f(x \mid \mu_2, \sigma)$

of two normal probability density functions with the same standard deviation, $\sigma$, but with different means, $\mu_1$ and $\mu_2$, respectively, is bimodal if and only if $|\mu_1 - \mu_2| > 2\sigma$.

Proof: Depending on the distance between $\mu_1$ and $\mu_2$, the mixture density
will have either a maximum at $x_0 = (\mu_1 + \mu_2)/2$ (the unimodal case) or a local minimum at $x_0$ (the bimodal case). Indeed, $x_0$ is a stationary point because

$g'(x_0) = -\tfrac{1}{2}\,\frac{x_0 - \mu_1}{\sigma^2}\, f(x_0 \mid \mu_1, \sigma) - \tfrac{1}{2}\,\frac{x_0 - \mu_2}{\sigma^2}\, f(x_0 \mid \mu_2, \sigma) = 0;$

the two terms cancel because $x_0 - \mu_1 = -(x_0 - \mu_2)$ and $f(x_0 \mid \mu_1, \sigma) = f(x_0 \mid \mu_2, \sigma)$.
Now we must check the second derivative to see whether a maximum or a minimum occurs. Differentiating once more and using again that the two components contribute equally at $x_0$, we obtain

$g''(x_0) = \left(\frac{(x_0 - \mu_1)^2}{\sigma^2} - 1\right) \frac{f(x_0 \mid \mu_1, \sigma)}{\sigma^2},$

which is positive if $(\mu_1 - \mu_2)^2 / (4\sigma^2) > 1$ or, equivalently, if $|\mu_1 - \mu_2| > 2\sigma$. Thus, a minimum occurs at $x_0$, and hence the mixture is bimodal, only if the distance between the two means exceeds two standard deviations.
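The sign change of $g''(x_0)$ at $|\mu_1 - \mu_2| = 2\sigma$ can also be confirmed numerically with a finite-difference approximation (a sketch, not part of the original proof):

```python
from math import exp, pi, sqrt

def mixture(x, mu1, mu2, sd):
    """g(x) = (1/2) f(x | mu1, sd) + (1/2) f(x | mu2, sd)."""
    c = 1 / (sd * sqrt(2 * pi))
    return 0.5 * c * (exp(-0.5 * ((x - mu1) / sd) ** 2)
                      + exp(-0.5 * ((x - mu2) / sd) ** 2))

def g2_at_midpoint(mu1, mu2, sd, h=1e-4):
    """Finite-difference approximation of g''(x0) at x0 = (mu1 + mu2)/2."""
    x0 = (mu1 + mu2) / 2
    return (mixture(x0 - h, mu1, mu2, sd) - 2 * mixture(x0, mu1, mu2, sd)
            + mixture(x0 + h, mu1, mu2, sd)) / h ** 2

# Means just under two standard deviations apart: maximum at x0 (unimodal).
print(g2_at_midpoint(0.0, 1.9, 1.0) < 0)   # → True
# Means just over two standard deviations apart: local minimum at x0 (bimodal).
print(g2_at_midpoint(0.0, 2.1, 1.0) > 0)   # → True
```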
Cowles, M., and Davis, C. (1982), "On the Origins of the .05 Level of Significance," American Psychologist, 37, 553-558.
Fisher, R. A. (1925), Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd.
Marple, S. L., Jr. (1987), Digital Spectral Analysis, Englewood Cliffs: Prentice Hall.
Porter, T. M. (1986), The Rise of Statistical Thinking 1820-1900, Princeton, NJ: Princeton University Press.
Robertson, C. A., and Fryer, J. G. (1969), "Some Descriptive Properties of Normal Mixtures," Skandinavisk Aktuarietidskrift, 69, 137-146.
Stigler, S. M. (1986), The History of Statistics, Cambridge, MA: The Belknap Press of Harvard University Press.
Student (W. S. Gosset) (1908), "The Probable Error of a Mean," Biometrika, 6, 1-25.
Walker, H. M. (1929), Studies in the History of Statistical Method, Baltimore: Williams & Wilkins. Reprinted 1975, New York: Arno Press.
Department of Statistics and Decision Support Systems
University of Vienna