Journal of Statistics Education Volume 14, Number 2 (2006), jse.amstat.org/v14n2/datasets.aerts.html
Copyright © 2006 by Ziv Shkedy, Marc Aerts, and Herman Callaert, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words:Normal mixture; Normal probability plot; Outlier; Skew-Normal distribution; Truncation.
The recommendations towards a significant pedagogical reform include emphasis on so-called “data-driven” mathematics education, and by extension, on data-driven statistics education. In the K-12 mathematics curriculum, statistics (referred to as “Data Analysis and Probability” in the NCTM 2000 Principles and Standards for School Mathematics) has a much more prominent role than ever before. The recent introduction of the AP (Advanced Placement) Examination in Statistics focused attention on the importance of statistics even more.
In “The Case for Undergraduate Statistics” Richard L. Scheaffer and Carl Lee write: “Many now realize that statistical thinking and statistical concepts are essential skills for all academic disciplines as well as for life-long learning. As a result of this thinking, spurred on by the influx of freshmen who have already seen basic statistics by the time they arrive at the college door, statistics is receiving more attention at two-year colleges, four-year colleges, and universities." Also here, the increased attention given to statistics goes hand in hand with an increased emphasis on “statistical thinking" and “data and concepts", as recommended by the ASA-MAA Joint Curriculum Committee.
The Guidelines for Assessment and Instruction in Statistics Education (GAISE) were endorsed by the ASA on May 17, 2005. The GAISE reports contain excellent recommendations for statistics education, as well in PreK-12 years as at the introductory college level.
This change in the teaching of statistics is not confined to the U.S. alone, but it takes place in many countries all over the world, and at different pace and intensity. In Flanders for example (the Flemish part of Belgium), statistics played only a minor role in mathematics education at school. A recent curriculum reform (compulsory for all schools) imposes a significant increase of statistics in grades eight through twelve.
At the university level, the recent introduction of a European-wide bachelor-master system has forced almost all institutions to restructure their curricula. Flemish universities have turned this burden into an opportunity for reshaping the statistics education, starting with an introductory course where data and concepts prevail over formal mathematical proofs.
In the light of these reforms, several real life projects are being developed, providing examples of what can be done in class. Not so long ago, the introduction of the Euro has had a major impact in the daily life of everyone (many people still have to “count back into the old currency” to make sure that they rightly value the Euro price!). It therefore was considered a good idea to look a bit closer at the “new” coin itself, and to consider some of its physical properties. After all, everybody in Belgium (and in the many countries of the European Union) has to use such coins now. Here, the choice was made to study the “one Euro” coin. This coin has one side that is the same everywhere (with the number 1 on it), but the other side has a representation that is country dependent. Therefore, the choice was made to restrict the study to the “Belgian 1 Euro” coin.
The paper is organized as follows. The data are introduced in Section 2 and a first analysis, using the Kolmogorov-Smirnov and Shapiro-Wilk goodness of fit tests, is presented in Section 3. A second analysis, based on normal mixture models, is discussed in Section 4. As it will contribute to a better understanding, the details of the data collection procedure are given in Section 5. In a third and final analysis, in Section 6, we do not assume that the distribution of the euro coins is symmetric. The family of skew normal distributions, a class of distributions which includes the normal ones (Azzalini 1985), offers an interesting alternative to model the weight of euro coins. A possible classroom use of the data is discussed in Section 7.
H0: Y is normally distributed, H1: Y is not normally distributed. | (1) |
The two-sided Kolmogorov-Smirnov (KS, see e.g. Section 6.5.2 in Shao 1999) test statistic equals 0.0234 resulting in a p-value of 0.0131, indicating that the null hypothesis can be rejected at a significance level of 5%. The Shapiro-Wilk (SW) statistic is equal to 0.975 (p-value < 0.0001). This latter test is designed specifically for testing normality and is therefore more powerful than the KS test. The SW statistic is also attractive because it has a simple, graphical interpretation as an approximate measure of the correlation in the normal probability plot in Figure 1 (see e.g. Stuart, Ord, and Kendall 1991).
Figure 1: Normal probability plot for the weight of 2000 euro coins.
It is interesting to examine the influence of the three outliers 7.201 g,7.656 g and 7.752 g on the analysis. Figure 2 shows the corresponding normal probability plot after excluding these outliers. Now, the KS test statistic equals 0.0246 (p-value=0.0071) and the SW statistic is 0.9974 (p-value=0.0022), indicating, once again, that the null hypothesis of normally distributed weights can be rejected. This shows that non-normality of the weight variable Y is not simply a matter of excluding outliers but that there are more fundamental features of the data that are not in line with a normal distribution.
Figure 2: Normal probability plot for the weight of 1997 Euro coins, excluding the three outliers 7.201 g,7.656 g and 7.752 g from the original dataset.
As the observed deviation from normality can be induced by heterogeneity in the mean and variance parameter of the normal distribution, an interesting option is to extend a single normal distribution to a hierarchical normal mixture model. The plausibility of this first extension is discussed in the next section.
where , are called the mixture components, are the mixture probabilities and are the parameters to be estimated. In what follows, we focus on a mixture of two normal populations with possibly different mean and variance parameters and, following the approach of Congdon (2003) we formulate the mixture model in terms of an hierarchical model using a latent indicator variable.
(2) |
The variable Gi can be seen as a latent classification random variable for which we assume a Bernoulli distribution
(3) |
In this stage we temporarily assume that the mixing probability is known. To complete the specification of the hierarchical model, the following (non informative) prior and hyperprior distributions are assumed
(4) |
Here, gamma(, ) denotes a gamma distribution with mean equal to / and variance /2. Hence, by choosing = = 0.0001, the prior (and hyperprior) distribution for the variance parameters is a gamma distribution with mean 1 and variance 10000, which reflects our uncertainty about the true value of the parameters (see, for example Gilks, et al. 1996). Three different values for were examined, = 0.95, 0.99, or 0.995. Table 1 presents the posterior means for the mean and variance parameters of the two normal mixture components. The upper panels of Figure 3 show the posterior means of Gi while the lower panels show the posterior medians for the three values of . Note that since Gi is a binary variable, the median and the mode of the posterior distribution are equal. Based on the two sets of plots the coins can be classified into the two components of the mixture. For = 0.995 (right panels), 4 coins are classified into the second component of the mixture. This group of coins does contain the three outliers. So the two component normal mixture is able to cope with the three outliers. Note that as the value of decreases the number of coins which are classified into the second component of the mixture increases (5 and 11 for equal to 0.99 and 0.95, respectively).
N2 | |||||
---|---|---|---|---|---|
0.995 | 7.521 | 0.001087 | 7.504 | 0.06362 | 4 |
0.99 | 7.521 | 0.001072 | 7.509 | 0.01987 | 5 |
0.95 | 7.521 | 0.001016 | 7.52 | 0.004859 | 11 |
Figure 3: Upper panels: posterior mean for Gi with = 0.95 (left), 0.99 (middle) and 0.995 (right). Lower panels: posterior medians for Gi with = 0.95 (left), 0.99 (middle) and 0.995 (right).
(5) |
The likelihood in Equation (2) and the hyperprior distributions in Equation (4) remain the same.
A density estimate for the posterior distribution of is shown in Figure 4 and Table 2 presents the posterior means for the mean and the variance parameter of both mixture components. The posterior mean of equals 0.9964. Figure 5 shows that, according to the posterior mean and mode of Gi, in this case only two coins are classified into the second component. These two coins correspond to the two extreme outliers (7.201 g and 7.752 g).
Figure 4: Density estimate for the posterior distribution of .
Parameter | Estimate |
---|---|
7.521 | |
7.431 | |
0.001093 | |
3.928 | |
posterior mean | 0.9964 |
N2 | 2 |
Figure 5: Left (right) panel: posterior mean (median) for Gi with = 0.9964.
The two component normal mixture approach is able to attribute the outlying values to a separate normal component. But since our first analysis showed that the normal distribution hypothesis could not be retained after excluding the outliers, this approach is still not giving us a satisfactory insight in which way the weight distribution deviates from a normal one. Of course many other distributions could be fit to these data. But to better understand the reason for this non-normality, we investigated the nature of the data and its collection in more detail.
But what happened in reality? From our side, we had asked the cooperation of the university administration (section finances) since they have good contacts with local banks. They contacted a bank and sent an employee to go and collect the coins. But when the employee asked for “Belgian only” coins, he found out that banks do not store Belgian euros separated from other euros. Fortunately, he was told that the “National Bank of Belgium” possesses “Belgian only” coins. So, he went to that place and came back with the 2000 coins.
We further asked our two assistants about the way in which they “physically” received those coins. Here came the big surprise. The “National Bank of Belgium”, which is the distributor of money to other banks, had given 8 packages of 250 brand new coins each. Our assistants explained further that they had worked systematically, package by package, and that the weights were entered into the spreadsheet in that order. This gives additional information, in the sense that the weight of each coin can also be classified according to the package to which it belongs.
Knowing that the 2000 coins came from 8 different packages made us perform a further analysis, as described in the next paragraph. Of course, this analysis now should be seen in the context of the peculiar way in which the data were collected, and any resulting conclusion should be handled with caution.
Figure 6: Normal probability plots by package (outliers excluded). Upper plots: packages 1 to 4, from left to right; lower plots: packages 5 to 8, from left to right.
Package | KS (p-value) | SW (p-value) | Mean | SD | N |
---|---|---|---|---|---|
1 | 0.038 (0.500) | 0.995 (0.6830) | 7.520 | 0.034 | 250 |
2 | 0.033 (0.500) | 0.991 (0.1219) | 7.523 | 0.035 | 250 |
3 | 0.078 (0.001) | 0.863 (<0.0001) | 7.510 | 0.037 | 250 |
4 | 0.045 (0.500) | 0.995 (0.6827) | 7.531 | 0.029 | 250 |
5 | 0.035 (0.500) | 0.991 (0.1290) | 7.531 | 0.030 | 250 |
6 | 0.055 (0.063) | 0.984 (0.0068) | 7.515 | 0.033 | 250 |
7 | 0.042 (0.500) | 0.990 (0.1120) | 7.523 | 0.033 | 250 |
8 | 0.070 (0.005) | 0.936 (<0.0001) | 7.517 | 0.036 | 250 |
3 | 0.047 (0.500) | 0.989 (0.0565) | 7.511 | 0.031 | 249 |
5 | 0.045 (0.500) | 0.993 (0.4204) | 7.531 | 0.029 | 249 |
8 | 0.053 (0.081) | 0.989 (0.0833) | 7.516 | 0.033 | 249 |
Some of the normal probability plots in Figure 6 reveal a clear pattern of departure from normality: package 3, 6 and 8 (U-shaped about the straight line, indicating some skewness to the right), and to some lesser extent, package 7 (U-shaped below the straight line, indicating some skewness to the left). This was the motivation to consider a non-symmetric extension of the normal distribution: the skew-normal distribution as introduced by Azzalini (1985).
(6) |
where z is a z-score, is the standard normal density function, is the standard normal distribution function and is the shape parameter. Note that for = 0, the skew-normal density reduces to the standard normal density. Figure 7 shows some graphs of the skew-normal density functions for several values of . Note that as goes to infinity the skew-normal distribution converges to the so called half-normal distribution.
Figure 7: Skew-normal densities for different values of the shape parameter .
For a random variable Z ~ SN(), it can be shown that E(Z) = b and Var(Z = 1 - (b)2, where and . Using a linear transformation , we get a random variable with mean , variance 2 and shape parameter . For more details on moments and cumulants, see Azzalini (1985).
The shape parameter was estimated for the data in each package, as shown in Table 4. In the sequel, we work with the reduced data, excluding the three outliers. Using maximum likelihood estimation, the shape parameter in package 6 is estimated to be 0.512, with estimated standard error 0.143. This indicates a right skewed distribution. Within this three-parameter family of distributions the normality hypothesis is equivalent to the hypothesis
H0: = 0. | (7) |
Using the Wald test
(8) |
the normality hypothesis (7) can be tested using a chi-square null distribution (see e.g. Section~6.4.2 in Shao 1999). For package 6 the value of W equals 12.85 which corresponds to a p-value of about 0.0003 leading us to reject normality. So, this confirms the previous analysis of this particular package. Moreover the fitted skew-normal distribution can now be used to model the package 6 data.
Package | ||
---|---|---|
1 | 0.161 | 0.176 |
2 | -0.142 | 0.136 |
3 | 0.225 | 0.145 |
4 | 0.000 | 0.025 |
5 | -0.072 | 0.175 |
6 | 0.512 | 0.143 |
7 | -0.189 | 0.164 |
8 | 0.390 | 0.147 |
Also for package 8 the estimated lambda parameter and its estimated standard error leads to a rejection of the normality hypothesis (7): = 0.39, = 0.147, and the value of W equals 7.04 with p-value 0.008. These are the only packages for which the normality hypothesis (7) can be rejected at the 5% significance level (which is in line with the results in the previous analysis). Also for other packages, like package 3 and 7 and even package 2, the magnitude of relative to its agrees with the p-values in Table 3.
Histograms of the weight of the euro coins by package, overlaid with the estimated skew normal density (solid line), are shown in Figure 8. The 95% confidence intervals for , by package, are shown in Figure 9. As expected, the confidence intervals for package 6 and 8 do not cover the value of zero.
Figure 8: Histogram, fitted skew normal density (solid line) and fitted normal density (dashed line), by package. Upper row: packages 1 to 4, from left to right; lower row: packages 5 to 8, from left to right.
Figure 9: Estimated shape parameter with 95% confidence intervals, by package.
A second aspect to be discussed in class is certainly as important as the exploration of the dataset. It is about the design of the study and about the many practical issues that occur “in real life”. A big lesson can be learned here. Even a well-designed study can produce data, which are collected in such a way that one has to be cautious about the conclusions of the further statistical analysis. Apart from the design, one has to look at the size and the complications of the data collection process. If one has to rely on collaborators for collecting the data, one should monitor also this process carefully.
Within the restrictions of what has been said above, one could continue and go for a further “tentative” analysis. Fortunately, we were able to identify which coin belonged to which batch, so that a study by batch was possible. This is carried out in Section 6, using normal probability plots as well as formal tests. Several of the plots seem to point in the direction of some skewness. In a more advanced course in statistics, one could take the opportunity here to study an interesting class of distributions, having a shape parameter , but reducing to the symmetrical normal when = 0. This class of skew-normal distributions, introduced by Azzalini (1985), is studied in Section 7 for fitting the data of the 8 packages of coins.
>y<-(euro$MASSA..G) >upy<- max(y) >loy<- min(y) qqmath( ~ y , data = euro,ylim=c(loy,upy), prepanel = prepanel.qqmathline, panel = function(x, y) { panel.grid() panel.qqmathline(y, distribution = qnorm) panel.qqmath(x, y)}, layout = c(1, 1), aspect = 1, xlab = "Unit Normal Quantile", ylab = "Weights (gr)")
For the KS two sided test we used
>ks.gof(y,alternative = "two.sided", distribution = "normal")
and for the SW test:
>shapiro.test(y)
model { for( i in 1 : Nsub ) { Yi[i]~dnorm(mu.y[T[i]],tau.i[T[i]]) #likelihood T[i]~dcat(P[]) #prior for G_i ti[i]<-T[i]-1 } for(k in 1:Nmix){ tau.i[k] ~ dgamma(0.0001, 0.0001) mu.y[k] ~ dnorm(0,tau.mu) sigma[k]<-1/tau.i[k] } tau.mu ~ dgamma(0.0001, 0.0001) P[1]~dunif(0,1) #prior for pi P[2]<-1-P[1] }
>sn.mle(,x1,plotit=T)
Congdon, P. (2003), Applied Bayesian Modelling, Chichester, England: Wiley.
GAISE Reports (2005). Electronic text at it.stlawu.edu/~rlock/gaise/
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman and Hall.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996), Markov Chain Monte Carlo in Practice, London: Chapman and Hall.
Scheaffer, R. L. and Lee, C. (2000), The Case for Undergraduate Statistics Electronic text at jse.amstat.org/meetings/jsm/2000/usei/case.html.
Shao, J. (1999), Mathematical Statistics, New York: Springer.
Stuart, A., Ord, J. K., and Kendall, M. (1991), Kendall's Advanced Theory of Statistics, Volume 2, London: Arnold.
Ziv Shkedy
Center for Statistics
Hasselt University
Diepenbeek
Belgium
ziv.shkedy@uhasselt.be
Marc Aerts
Center for Statistics
Hasselt University
Diepenbeek
Belgium
marc.aerts@uhasselt.be
Herman Callaert
Center for Statistics
Hasselt University
Diepenbeek
Belgium
herman.callaert@uhasselt.be
Volume 14 (2006) | Archive | Index | Data Archive | Information Service | Editorial Board | Guidelines for Authors | Guidelines for Data Contributors | Home Page | Contact JSE | ASA Publications