University of Portsmouth, U.K.
Journal of Statistics Education Volume 13, Number 3 (2005), ww2.amstat.org/publications/jse/v13n3/wood.html
Copyright © 2005 by Michael Wood, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Active learning; Approaches to statistical thinking; Bootstrapped confidence intervals; Computer simulation; Probability distributions; Resampling.
The underlying simulation model is obviously very general. It provides the user with a very powerful tool. Furthermore, it is a tool specified not by mathematical formulae but by a physical process, albeit one performed by a computer. This all has clear implications for both pedagogy and the nature of the statistical knowledge we expect students to learn. The aim of this article is to explore these implications.
There are, of course, many other simulation models that are useful in statistics. Simulation is widely recognised as a useful tool for teaching statistics (Mills 2002; Simon, Atkinson, and Shevokas 1976). Usually it is seen as an aid to learning standard methods. It can also, however, be seen as an alternative to some of these standard methods: if the student has a well-understood and practical simulation method available, is it necessary to learn the conventional method as well? I will start by describing the two bucket story and its applications, and then use this as an example of a simulation model to address these more general issues.
I will start by describing the story in fairly abstract terms, and then describe some specific applications. In a teaching context, it would probably be better to reverse this order.
The first bucket, Bucket 1, holds b1 balls representing some collection (a sample, a set of possible outcomes, or something else). Each ball has some information stamped on it representing a member of this collection. For example, the collection might be a sample of people and the information stamped on the ball representing each person might be the number of cars owned by the person in question. A random “draw” of n balls is now taken from Bucket 1, and a statistic (e.g. the mean) is calculated from the information on these n balls and the answer stamped on a ball that is then put in Bucket 2. This can be done in two ways: either with replacement (replacing each ball after drawing it), or without replacement (which means, of course, that n must be less than or equal to b1 or we will run out of balls). This is repeated b2 times so that there are b2 balls in Bucket 2. The contents of Bucket 2 can then be used to analyse the simulated distribution of the statistic (derive probabilities, percentiles, etc). The italicised words all represent variables in the sense that they can be varied between different applications of the model.
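As a minimal sketch, the story above might be expressed in Python along the following lines. The function name `two_bucket`, its parameter names, and the small `cars` list are assumptions made for illustration; the programs actually used in this article are Resample.exe and Resample.xls.

```python
import random
import statistics


def two_bucket(bucket1, n, statistic, b2, replace=True, rng=None):
    """Run the two bucket story once and return the contents of Bucket 2.

    bucket1   -- list of values stamped on the balls in Bucket 1
    n         -- number of balls taken in each draw
    statistic -- function applied to each draw (e.g. statistics.mean)
    b2        -- number of draws, i.e. number of balls ending up in Bucket 2
    replace   -- True to replace each ball after drawing it
    """
    rng = rng or random.Random()
    bucket2 = []
    for _ in range(b2):
        if replace:
            draw = rng.choices(bucket1, k=n)   # with replacement
        else:
            draw = rng.sample(bucket1, n)      # without replacement; needs n <= b1
        bucket2.append(statistic(draw))
    return bucket2


# Hypothetical Bucket 1: numbers of cars owned by six people.
cars = [0, 1, 2, 0, 3, 1]
bucket2 = two_bucket(cars, n=4, statistic=statistics.mean, b2=1000,
                     replace=False, rng=random.Random(1))
print(len(bucket2), min(bucket2), max(bucket2))
```

Passing the statistic as a function mirrors the point that the statistic is one of the variables of the story: the same loop serves for a mean, a median, or a sum.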
The programs I used to implement the two bucket story are Resample.exe and Resample.xls. The first of these can analyse the following statistics: mean, sum, standard deviation, variance, percentiles (including maximum and minimum), median, range, inter-quartile range. The second, a spreadsheet, allows the analysis of bivariate distributions, and so enables the analysis of statistics such as correlation and regression coefficients, and other functions of two variables.
Suppose, first, that the population is finite, say of size 48. The first step is to form a suitable guessed population by taking four copies of the sample. We don't know what the real population is like, but this seems a reasonable guess on the basis of the only information we have: the sample. This guessed population is represented by Bucket 1, which contains 48 balls representing the 48 members of this guessed population. (Bucket 1 contains, for example, 16 balls labelled 0, since there are four 0s in the original sample, and we are taking four copies of this.) Now we take random draws (“resamples” without replacement) of 12 balls from this guessed population in Bucket 1, find the mean of each resample, and stamp these means on separate balls, which we put in Bucket 2. Drawing the balls without replacement means simply that each ball is drawn and not replaced, which is, of course, equivalent to drawing a whole sample of 12 balls at once.
This models, very directly, the process of sampling from the real population, except, of course, that we only have the guessed population. When I did this, the first resample comprised 5, 0, 1, 0, 3, 5, 0, 1, 1, 0, 0, 2, which has a mean of 1.5 (the twelve values sum to 18). (Note that although all these numbers occur in the original sample, the resampling process means that the frequency with which they occur is not the same.) Repeating this whole process 10000 times gave a total of 10000 resample means in Bucket 2. Ninety five percent of these were between 0.75 and 2.33 (the 2.5 and 97.5 percentiles).
We can then use these results from Bucket 2 to estimate how much means of samples of 12 are likely to differ from the mean of the population from which they are drawn. In this case, as the guessed population is simply four copies of the sample, its mean is the same as the mean of the sample—1.5. The simulation results showed that the 2.5 and 97.5 percentiles of the distribution of the resample means were 0.75 and 2.33. The lower of these figures (0.75) corresponds to a resample mean that is 0.75 lower than the true mean of the guessed population (1.5), and the upper figure (2.33) is 0.83 above the true mean of the guessed population. These results suggest that there is a 95% probability that the errors in the means of these samples of 12 are less than about 0.8 — the qualifier “about” being necessary because of the slight difference between the errors above and below the true mean. This, in turn, suggests that we can be 95% confident that the true mean of the real population is in the range 1.5 (the mean of the original sample of data) plus or minus about 0.8 — i.e. 0.7 to 2.3. This is roughly the same as the percentiles of the resample means (in Bucket 2), so we can use these percentiles to define a confidence interval for the mean. This is called the bootstrap percentile interval for obvious reasons. The method, and the terminology, can easily be extended to populations of different sizes. Strictly, population sizes that are not integral multiples of the sample size create a problem, but in practice, taking the nearest integral multiple is likely to be good enough.
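The finite-population procedure described above can be sketched in Python as follows. The `sample` list is a hypothetical data set chosen only to agree with the details quoted in the text (twelve values, four 0s, a mean of 1.5); it should not be read as the article's actual data. The helper names and the nearest-rank percentile rule are likewise assumptions.

```python
import random

# Hypothetical sample of 12 car counts (four 0s, mean 1.5); an assumption,
# not the data analysed in the article.
sample = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 5]


def percentile(sorted_values, p):
    """Nearest-rank percentile (p between 0 and 100) of a sorted list."""
    k = round(p / 100 * (len(sorted_values) - 1))
    return sorted_values[k]


def finite_population_interval(sample, copies, b2, rng):
    """95% bootstrap percentile interval for the mean, finite population."""
    guessed_population = sample * copies            # Bucket 1
    n = len(sample)
    means = []                                      # Bucket 2
    for _ in range(b2):
        resample = rng.sample(guessed_population, n)   # without replacement
        means.append(sum(resample) / n)
    means.sort()
    return percentile(means, 2.5), percentile(means, 97.5)


low, high = finite_population_interval(sample, copies=4, b2=10000,
                                       rng=random.Random(0))
print(f"95% bootstrap percentile interval: {low:.2f} to {high:.2f}")
```

The exact limits depend on the random draws and on the percentile rule used, so a run of this sketch need not reproduce the figures quoted above.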
Infinite populations can be modelled by letting Bucket 1 represent the sample (b1 is the sample size), and then assuming that the distribution of the guessed population is identical to that of the sample. This means, for example, that as the value 1 occurs in 25% of the sample measurements, the same will be true of the guessed population. We then draw our sample of 12 balls from Bucket 1, replacing each ball after drawing it so that the distribution remains unchanged for when the next member of the sample is drawn — as it would in an infinite population. In this case the results are similar — the 95% interval derived from the results in Bucket 2 extends from 0.67 to 2.50. This interval is slightly wider than the interval based on the population of 48—for reasons that should be intuitively obvious if you follow through the two processes. (In the extreme, if the population and the sample are the same size, the width of the interval will be zero because there is only one sample that can be drawn.) It is also slightly more asymmetrical. Bootstrapping enables us to model the distinction between finite and infinite populations in a straightforward and transparent manner, whereas conventional methods are generally restricted to infinite populations.
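The comparison between the two processes can be followed through in a short Python sketch. As before, `sample` is a hypothetical data set consistent with the description in the text rather than the article's data, and `interval_width` is an illustrative helper, not part of Resample.exe or Resample.xls.

```python
import random

# Hypothetical sample (four 0s, three 1s, mean 1.5); an assumption.
sample = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 5]


def interval_width(bucket1, n, b2, replace, rng):
    """Width of the 95% percentile interval of the resample means."""
    means = []
    for _ in range(b2):
        if replace:
            draw = rng.choices(bucket1, k=n)   # models an infinite population
        else:
            draw = rng.sample(bucket1, n)      # models a finite population
        means.append(sum(draw) / n)
    means.sort()
    low = means[round(0.025 * (b2 - 1))]
    high = means[round(0.975 * (b2 - 1))]
    return high - low


rng = random.Random(0)
width_infinite = interval_width(sample, 12, 5000, replace=True, rng=rng)
width_48 = interval_width(sample * 4, 12, 5000, replace=False, rng=rng)
width_12 = interval_width(sample, 12, 5000, replace=False, rng=rng)
print(width_infinite, width_48, width_12)
```

The last case is the extreme mentioned above: drawing all 12 balls from a Bucket 1 of 12 without replacement always returns the whole sample, so every resample mean is identical and the width is zero.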
The idea of a guessed population plays a crucial role here. There are other ways of conceptualising the process: Simon (1992) refers to a “pseudo-universe,” and Efron and Tibshirani (1993, p. 87) and Lunneborg (2000) write of the population distribution in the “bootstrap world.” It would be possible to describe the resampling process in purely mechanical terms, but it is important to tell the story in terms of the “guessed population,” or something similar, to clarify the rationale behind the process in intuitive terms, and to assess the assumptions on which the method’s validity depends.
The argument above, in terms of guessed populations, does make a number of assumptions which may not be exactly satisfied in practice — e.g. the sample (or a number of copies of the sample) can be used to form a surrogate population which will give an accurate idea of sampling error, and the extent of sampling error is roughly the same in both directions (see Wood 2003 for more detail of the assumptions implicit in this argument). One of the strengths of this bootstrap method is that it has a relatively simple rationale, so problems and assumptions are relatively clear.
The same method can be used for any other statistic which is calculated from a random sample — e.g. a median, proportion exhibiting some characteristic, various correlation coefficients, regression coefficients, etc. The only difference is the statistic we calculate from each draw from Bucket 1.
There are, of course, more sophisticated bootstrap methods, which may be useful when the assumptions on which the percentile interval is based are unreasonable. For example the sample of 12 above is unlikely to give an accurate idea of rare extreme values — e.g. some people doubtless have 10 cars. If we were interested in these, it may make sense to use the sample to fit a suitable probability distribution, and then use this to generate a guessed population for Bucket 1. More elaborate methods of bootstrapping are discussed in the technical literature on bootstrapping (e.g. Davison and Hinkley 1997; Efron and Tibshirani 1993; Good 2001; Lunneborg 2000).
There are also non-technical explanations of bootstrapping aimed at general readers and beginners (Diaconis and Efron 1983; Gunter 1991, 1992a, 1992b; Simon 1992; Wood 2003), and a few articles on the use of bootstrapping and similar methods for teachers (Braun 1995; Butler, Rothery, and Roy 2003; Duckworth and Stephenson 2002; Ricketts and Berry 1994; Simon, et al. 1976). The Resampling Stats website at resample.com also has links to a range of articles and books on bootstrapping and related ideas.
Other possibilities are to simulate distributions that approximate to the Poisson distribution (by taking p small and n large in the binomial simulation), and the normal distribution (by taking n large in the binomial simulation), as described in Wood (2003). And the fact that the distributions in Bucket 2 are frequently normal provides a convincing illustration of the central limit theorem and the ubiquity of the normal distribution.
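A rough Python sketch of these approximations is given below. It assumes the binomial simulation takes the form of a Bucket 1 of 100 balls, of which a chosen number are labelled 1 and the rest 0, with each draw made with replacement and summed; the function name `binomial_bucket2` and all parameter values are illustrative assumptions.

```python
import random
import statistics


def binomial_bucket2(ones_per_100, n, b2, rng):
    """Bucket 2 for a binomial simulation with p = ones_per_100 / 100."""
    bucket1 = [1] * ones_per_100 + [0] * (100 - ones_per_100)
    # Each ball in Bucket 2 carries the number of 1s in a draw of n balls.
    return [sum(rng.choices(bucket1, k=n)) for _ in range(b2)]


rng = random.Random(0)

# n large, p moderate: Bucket 2 should look roughly normal,
# with mean near n*p = 50 and standard deviation near 5.
normal_like = binomial_bucket2(ones_per_100=50, n=100, b2=2000, rng=rng)
print(statistics.mean(normal_like), statistics.stdev(normal_like))

# p small, n large: Bucket 2 should look roughly Poisson,
# with mean near n*p = 3 and variance close to the mean.
poisson_like = binomial_bucket2(ones_per_100=1, n=300, b2=2000, rng=rng)
print(statistics.mean(poisson_like), statistics.variance(poisson_like))
```

Comparing the mean and variance is only a quick check; plotting a histogram of each Bucket 2 would give a more convincing picture of the shapes.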
The model can also deal with the hypergeometric distribution: for example, it can be used to simulate probabilities in the UK National Lottery (Wood 2003 and Example 1 in the Read this sheet in Resample.xls) by coding the six balls selected by a player as 1, and the 43 not selected as 0. Bucket 1 then contains six balls labelled 1 and 43 labelled 0. The lottery can then be simulated by drawing six balls without replacement, and the sum of the numbers (usually many 0’s and occasionally some 1’s) on the six balls drawn represents the number of numbers correctly forecast. Bucket 2 then represents the scores from each lottery ticket, and can be used to estimate the probabilities of the various prizes (although jackpot winners are so rare that a very large value of b2 is required—beyond the capacity of the two programs mentioned above).
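The lottery simulation can be sketched in Python as shown below, with the standard hypergeometric formula included as a check on the simulated proportions. The function names and the choice of 20000 tickets are assumptions for illustration.

```python
import random
from math import comb

# Bucket 1: the player's six numbers coded 1, the other 43 coded 0.
bucket1 = [1] * 6 + [0] * 43


def simulate_lottery(b2, rng):
    """Bucket 2: number of correctly forecast numbers in each of b2 draws."""
    return [sum(rng.sample(bucket1, 6)) for _ in range(b2)]


def exact_probability(k):
    """Hypergeometric probability of exactly k matches out of six."""
    return comb(6, k) * comb(43, 6 - k) / comb(49, 6)


rng = random.Random(0)
bucket2 = simulate_lottery(20000, rng)
for k in range(4):
    simulated = bucket2.count(k) / len(bucket2)
    print(k, round(simulated, 4), round(exact_probability(k), 4))
```

The jackpot probability is 1 in 13,983,816, which shows why a very large value of b2 would be needed before the simulation could say anything useful about it.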
This generality means that the two bucket story is potentially very powerful. Instead of having separate models for deriving confidence intervals for different statistics, and other models for the binomial, Poisson, normal and hypergeometric distributions, and for the birthday problem and for control lines in quality control charts, they are all brought together under one umbrella.
However, a physical story can be used without any appreciation of its meaning in just the same way that a symbolic argument or formula can be used blindly. For the rationale behind the story to be appreciated so that it can be used intelligently and adapted to new situations, it is necessary for users to think about what is going on, and here suitable terminology is likely to be helpful. In this article I have suggested the phrase “two bucket story” as a general label to avoid prejudging the interpretation of its components. In the particular application to the derivation of confidence intervals, ideas such as a “guessed population” help in the interpretation of the contents of the first bucket, and in clarifying the rationale behind the bootstrapping process. These terms should be taken as suggestions: like any language, the use of words is likely to evolve as they are used in different contexts.
There are, of course, many useful simulation approaches that do not fall under the umbrella of the two bucket story. One example is provided by approximate randomization tests (Noreen 1989; Wood 2003). These are randomization tests (Edgington 1995) which assess significance by “shuffling one variable … relative to another …” (Noreen 1989, page 9). This is a general simulation method that can often be used as a substitute for a number of traditional hypothesis tests: t test, one way analysis of variance, test of the hypothesis that a correlation is zero, Mann-Whitney test, etc. However, it does not fit the format of the two bucket story: it would need a third bucket so that two buckets can be reserved for the data, allowing them to be “shuffled” relative to each other. There is a spreadsheet implementation of this shuffling principle at Resamplenrh.xls.
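A minimal sketch of the shuffling principle, applied to the difference between two group means, is given below. It is one simple variant written for illustration, not a transcription of Resamplenrh.xls or of Noreen's procedures; the function name `shuffle_test` and the data are hypothetical.

```python
import random


def shuffle_test(group_a, group_b, shuffles, rng):
    """Approximate randomization test for a difference in means.

    Returns the proportion of shuffles in which the absolute difference
    between the group means is at least as large as the observed one.
    """
    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)                    # shuffle values relative to groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / shuffles


rng = random.Random(0)
# Hypothetical measurements for two clearly separated groups.
p_separated = shuffle_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15], 5000, rng)
# Two identical groups: every shuffle is at least as extreme as the data.
p_identical = shuffle_test([1, 2, 3], [1, 2, 3], 1000, rng)
print(p_separated, p_identical)
```

A small proportion suggests that a difference as large as the observed one would rarely arise if group membership were unrelated to the measurements.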
The rationale behind many simulation approaches is simple enough to be understood by users without an extensive background in statistics. This means that a “relational” understanding (Skemp 1976) of why the method works, as well as an “instrumental” understanding of how to do it, is a reasonable expectation for most students. This represents a substantial cultural shift. When deriving a conventional confidence interval for the mean, for example, non-mathematical students would not normally be expected to understand where the values of t or z come from: the explanation may be that they are found in tables, or that mathematicians, or computers, have calculated them. These values need to be taken on trust. This is not true of the bootstrap percentile interval. Here, it is possible for the non-mathematical student to follow the whole rationale: there are no gaps to be filled by the mysterious activities of tables, computers or mathematicians.
To put this in different, but more or less equivalent, terms, beginning students are much more likely to be able to take an “active” or a “constructivist” approach with simulation methods: the whole method becomes a story which the student can run through and make sense of. In the words of a participant in one study “... the resampling method makes one feel that we are physically doing it, or actually seeing it being physically done, without having to take any theoretical mathematics into consideration” (Ricketts and Berry 1994, page 43). Simon, et al. (1976) go even further:
“The Monte Carlo method is not explained by the instructor. Rather it is discovered by the students. With a bit of guidance the students invent, from scratch, the procedures for solution” (p. 734)
This is in strong contrast to many formula-based methods, where the story behind the formula or the tables may be too long and complicated for students to “construct” in their minds, and certainly too complicated for them to “discover” for themselves. The virtues of “active learning” are generally accepted by educationalists: see, for example, the review of constructivism in Mills (2002) and the British Higher Education Funding Councils’ journal entitled simply Active learning. Simulation approaches, such as the two bucket story, must be in line with this ethos. They provide a way of seeing a statistical analysis as a physical story, instead of an abstract mathematical model.
The generality of the two bucket story is another important advantage over traditional approaches. Instead of learning a method, and associated formulae, for a confidence interval for a mean, and another for a confidence interval for a proportion, and the theory of the normal, binomial and hypergeometric distributions, there is just a single method which applies across the board. Furthermore, this single method will cope with problems where there are no well-known methods. This makes the learner’s task far simpler, and gives the successful learner a tool that is far more powerful than a collection of formulae.
These simulation approaches offer the possibility that relatively inexperienced users will be able to see the importance of the assumptions on which conclusions are based (random samples, for example), and may be able to adapt methods to new circumstances. The days when statistics was a collection of hazily understood and often misused recipes may be replaced by an age when people have a collection of general approaches, like the two bucket story, which they can use, actively and intelligently, to devise ways of tackling problems of current concern.
Unfortunately, there are very few empirical studies of actual benefits. One such study (Simon, et al. 1976) reported the results of “three controlled experimental tests of the pedagogical efficiency of the Monte Carlo [simulation] method.” Most of the results favoured the simulation approach, although there were difficulties in comparing the simulation approach with conventional approaches. They claim that they are not suggesting the simulation approach as an alternative to conventional approaches, but as an “underpinning”. However, this study was performed in the 1970s in the early days of computer technology, so the practical advantage of the simulation approach would have been smaller. More recently, Ricketts and Berry (1994) looked at the experiences of a class using resampling approaches instead of mathematical theory. Unfortunately, the results leave a little to be desired from the statistical point of view, being confined to two enthusiastic comments from students, and a comment from the authors that “our experience suggests that it [the resampling approach] is highly acceptable to students with a range of mathematical abilities.” (p. 44) The ideal would obviously be another controlled trial like the experiments by Simon, et al. (1976) but with modern computing facilities: however these are difficult to organise, particularly as the aims of the new approach might have a different emphasis from the aims of the old approach, with the consequent problems of defining a suitable measure of performance.
From the wider perspective of the academic development of statistics, the difficulty with conventional approaches is that the circumstances in which they work are typically restricted: simulation approaches are widely acknowledged to be more general and more robust (Lunneborg 2000). The formulae may work in interesting special cases, but the best general approach may be the simulation. Sometimes the simulation approach may be the only option. This principle is by no means restricted to statistics: economists, weather forecasters, engineers and many other scientists all make widespread use of simulation. In the words of Stephen Wolfram's “new kind of science” ... “there can be no way to know all the consequences of these rules [which describe the universe], except in effect just to watch and see how they unfold” (Wolfram 2002, p. 846).
As an example of the first, consider the normal probability distribution. It is easy to simulate distributions which approximate the normal distribution, but the use of the mathematical formula – via tables or a computer – is likely to be far easier and more elegant. Although simulation is helpful as a teaching aid, there seems to me to be an extremely strong case for expecting even beginners to use tables of the normal distribution or an equivalent computer function. Understanding the mathematical rationale behind the formula (and so the tables and the computer functions) is obviously an unrealistic goal for most people, but understanding the results empirically as a bell shaped curve which conforms to many commonly experienced contexts is obviously realistic.
On the other hand, for the computation of confidence intervals, in my judgment, the simulation approach of bootstrapping should be regarded as a replacement for formula based methods in most circumstances. This is on the basis of an informal cost benefit analysis (Simon, et al. 1976, p. 738) balancing the costs of learning about the method (much greater for conventional methods because of the pre-requisite concepts which need mastering, and the variety of different methods for different statistics) with their likely benefits (potentially greater for bootstrapping because of the generality of the approach, and greater likelihood of the results being interpreted accurately). This, however, like the assertion in the previous paragraph, just reflects my judgment. These judgments could be checked by means of a controlled trial, although any such experiment would face many obvious difficulties (e.g. quantifying the costs and the benefits, ensuring the comparison is “fair” in terms of learning resources).
The second reason for preferring conventional methods is that they may, sometimes, yield more insight. For example, the standard formula for the confidence interval of a mean or a proportion shows how the width of the intervals depends on the square root of the sample size. This is an aspect of the deep structure of the statistical universe that cannot be unlocked by simulation. It can be illustrated, but not proved or explained. Simulation approaches are in a sense cheating: they amount to little more than some crude experiments with a model of the situation, and in some circumstances such experiments may provide less insight than the sort of theory provided by a mathematical model.
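For reference, the dependence on sample size can be read directly from the usual large-sample interval for a mean. The notation below (sample mean, sample standard deviation s, sample size n, and normal critical value z) is assumed here, since the article does not write the formula out.

```latex
\bar{x} \;\pm\; z_{\alpha/2}\,\frac{s}{\sqrt{n}},
\qquad
\text{width} \;=\; \frac{2\,z_{\alpha/2}\,s}{\sqrt{n}} \;\propto\; \frac{1}{\sqrt{n}}
```

Quadrupling the sample size therefore halves the width of the interval. The corresponding interval for a proportion has the same form, with s replaced by the square root of p(1 - p) for the sample proportion p.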
Despite this, there are clearly circumstances (discussed in the previous section) in which conventional methods have a useful role to play. A balance needs to be drawn between the two approaches. My feeling is that, in the U.K. at least, the virtues of simulation approaches are under-estimated. In particular, if all that is wanted is a transparent method of deriving answers to specific questions (and many students learning statistics are in this position), simulation approaches do seem adequate for a great many problems, and offer the promise of liberating statistics from the shackles of the symbolic arguments that many people find so difficult (Wood 2001).
Braun, W. J. (1995), “An illustration of bootstrapping using video lottery terminal data,” Journal of Statistics Education [Online], 3(2). ww2.amstat.org/publications/jse/v3n2/datasets.braun.html
Butler, A., Rothery, P., and Roy, D. (2003), “Minitab macros for resampling methods,” Teaching Statistics, 25(1), 22-25.
Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge: Cambridge University Press.
Diaconis, P., and Efron, B. (1983), “Computer intensive methods in statistics,” Scientific American, 248, 96-108.
Duckworth, W. M., and Stephenson, W. R. (2002), “Beyond traditional statistical methods,” The American Statistician, 56(3), 230-233.
Edgington, E. S. (1995), Randomization Tests, 3rd edition, New York: Dekker.
Efron, B., and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, New York: Chapman and Hall.
Garrett, L., and Nash, J. C. (2001), “Issues in teaching the comparison of variability to non-statistics students,” Journal of Statistics Education [Online], 9(2). ww2.amstat.org/publications/jse/v9n2/garrett.html
Good, P. I. (2001), Resampling Methods, 2nd edition, Boston: Birkhauser.
Gunter, B. (1991, December), “Bootstrapping: how to make something from almost nothing and get statistically valid answers — Part 1: Brave new world,” Quality Progress, 97-103.
Gunter, B. (1992a, February), “Bootstrapping: how to make something from almost nothing and get statistically valid answers — Part 2: the Confidence game,” Quality Progress, 83-86.
Gunter, B. (1992b, April), “Bootstrapping: how to make something from almost nothing and get statistically valid answers — Part 3: examples and enhancements,” Quality Progress, 119-122.
Lindley, D. V. (1985), Making Decisions, 2nd edition, London: Wiley.
Lunneborg, C. E. (2000), Data Analysis by Resampling: Concepts and Applications, Pacific Grove, CA, USA: Duxbury.
Mills, J. D. (2002), “Using computer simulation methods to teach statistics: a review of the literature,” Journal of Statistics Education [Online], 10(1). ww2.amstat.org/publications/jse/v10n1/mills.html
Noreen, E. W. (1989), Computer Intensive Methods for Testing Hypotheses, Chichester: Wiley.
Ricketts, C., and Berry, J. (1994), “Teaching statistics through resampling,” Teaching Statistics, 16(2), 41-44.
Simon, J. L. (1992), Resampling: The New Statistics, Arlington, VA: Resampling Stats, Inc.
Simon, J. L., Atkinson, D. T., and Shevokas, C. (1976), “Probability and statistics: experimental results of a radically different teaching method,” American Mathematical Monthly, 83(9), 733-739.
Skemp, R. R. (1976), “Relational understanding and instrumental understanding,” Mathematics Teaching, 77, 20-26.
Wolfram, S. (2002), A New Kind of Science, Champaign, IL: Wolfram Media.
Wood, M., Kaye, M., and Capon, N. (1999), “The use of resampling for estimating control chart limits,” Journal of the Operational Research Society, 50, 651-659.
Wood, M. (2001, May), “The case for crunchy methods in practical mathematics,” Philosophy of Mathematics Education Journal [Online], 14. www.ex.ac.uk/~PErnest/
Wood, M. (2003), Making Sense of Statistics: A Non-Mathematical Approach, Basingstoke: Palgrave.
Department of Strategy and Business Systems
University of Portsmouth