John C. Nash and Tony K. Quon
University of Ottawa
Journal of Statistics Education v.4, n.1 (1996)
Copyright (c) 1996 by John C. Nash and Tony K. Quon, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Spreadsheet software; Computing paradigm; Statistical functions; Audit files.
Spreadsheet software is widely used and now includes statistical functionality. This paper discusses the issues raised in teaching statistics with spreadsheet software. The principal concerns relate to aspects of the spreadsheet view of computation that make it difficult to keep track of what calculations have actually been carried out or to control the spreadsheet by means of a script. We also discuss a number of other advantages and deficiencies of spreadsheets for teaching statistics.
1 Recent spreadsheet software has begun to include quite extensive statistical functionality, thereby inviting its use for statistical analysis. Because spreadsheet software is particularly common in business use, there has been strong interest among students and some faculty members in using spreadsheets to teach statistics in business schools and elsewhere. Countering this pull towards spreadsheets are a number of conceptual, pedagogical, and technical issues that are the basis of this paper.
2 Our treatment of the support provided by spreadsheets for statistics and statistics education considers capabilities offered by the products, features or capabilities lacking, incorrect or misleading features, and treatment of missing values. It is worth emphasizing that it is dangerous to have tools that conduct incorrect analyses of data. For the professor, there is the added danger of having lesson preparation wasted, student confidence eroded, and the general confusion and loss of class time involved with attempting a correction.
3 Before we present the body of our argument, it is important to recognize that spreadsheet software offers a number of important conveniences to both teachers and students.
4 The claim for the convenience of the spreadsheet interface and its suitability for a large range of "real" applications must be tempered with the realization that
5 These two points have parallels with regard to most types of software. Our point here is simply to underline the fact that spreadsheet processors are not free from these difficulties simply because of their widespread usage.
6 The size of spreadsheet packages with statistical features is such that they require large amounts of memory (RAM) and disk space. (Excel 5.0 takes up over 13MB of disk space, considerably more than Excel 4.0 or Quattro Pro 5.0.) These constraints, together with their high prices (with the exception of Quattro Pro which is relatively inexpensive), make these packages unattractive as compulsory materials, particularly in environments where the available computers may be of varying capability. By contrast, several statistical packages exist in inexpensive student editions that do not require hardware upgrades.
7 In our experience, the argument of convenience and expediency in favor of teaching statistics with spreadsheets is most strongly made in connection with MBA courses. The case is made that since MBA students are likely to have spreadsheets on their office computers, they will be more inclined to perform statistical analyses later when the need arises. That is, learning to use spreadsheets for a data analysis course may have side benefits in terms of their use in other courses and upon students' return to the workplace. We believe this is a hypothesis that remains to be tested.
8 An argument is also made that widespread knowledge of spreadsheets makes them more readily accessible to students. However, the differences in functionality and use among spreadsheet packages are at least as great as those between spreadsheets and specialized statistical packages, especially since Minitab, Stata, Systat, and others now have menu-driven options and spreadsheet data entry and edit facilities. Unless teaching institutions are prepared to say there is only one "true religion" of spreadsheets, the argument for a particular spreadsheet software as the interface that students will meet in the real world is moot. In a world where software choices seem limitless, it is strange to argue that students should be introduced to only one, albeit general-purpose, software tool.
9 As one referee has noted, spreadsheets are particularly useful for data entry, editing, and manipulation prior to input to a standard statistics package for analysis. We agree with this observation and have often used spreadsheets in this way. (Excel 5.0 is particularly good for importing ASCII data, whether qualitative or quantitative, that are not separated by blanks or commas.) Unfortunately, the transfer of data from general spreadsheet software formats to statistical packages has, in our experience, been a task novice students find troublesome. A similar comment applies generally to their understanding of the difference between binary and text form storage of data, even within a single software package. The use of spreadsheet data entry by statistical packages goes some way toward easing this concern.
10 The presentation of material in an ordered and/or structured fashion is central to introductory lessons in any subject. That developers of statistical packages are aware of this need is illustrated by features such as the "slideshow" tools in version 5 of Data Desk. Spreadsheets lack such facilities explicitly, but one could consider use of presentation software such as Microsoft PowerPoint along with the object linking and embedding (OLE) facilities of Microsoft Windows to arrange the appropriate sequencing of displays. We believe that this offers an answer to the needs for lecture or overview presentations. However, there are important configuration and data security issues if one wants students to easily incorporate their own data or additional calculations. For more general audiences, we have so far been unable to link spreadsheets to World Wide Web (HTML) pages. By contrast, this is relatively easy to do with some statistical packages, for example, Lisp-Stat.
11 Teachers also want to prepare calculations in advance so they can be rerun easily with controlled and verifiable inputs (e.g., Minitab EXEC and JOURNAL, Stata DO and ADO files, Systat CMD files). Actions on data in a spreadsheet are recorded within the spreadsheet or as macro commands. Even the latter are commonly stored within the spreadsheet and are usually not accessible for external storage or editing; Excel is an exception. We claim, however, that the macro languages in current spreadsheets do not repay the heavy effort of preparing macros with enough functionality to make them a serious option for preparing pedagogical material.
12 Third parties could offer to do the work of preparing the macros. There are at least two products available that offer spreadsheet templates and macros for teaching statistics (see Siegel 1990; Berk and Carey 1995). These have the common failing that they cannot lead the student through the steps of an analysis, revealing the parts one by one; however, the tree structure of the menus does reflect the logic required to select the appropriate test.
13 When macros are used, it is important to recognize how computations are ordered. The spreadsheet model of computations presumes immediate action on data throughout the spreadsheet. Calculations are only sequential in the sense of the dependencies of cells one upon the other, unless special calculation ordering is specified (O'Leary 1989, p. 365). Changes in data are manifested "immediately" throughout the spreadsheet as a result of the cell formulas. From a pedagogical perspective, this immediacy can be a great advantage. We can examine the effect of outliers or misrecorded numbers on our statistical results, including graphs. Unfortunately, an important exception may be "analysis boxes," which do not get updated and are a potential source of misleading output. Immediate updating is not, however, the preserve of spreadsheets. Notable among statistical packages are Data Desk and Lisp-Stat.
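The dependency-ordered recalculation model described above can be sketched in a few lines of Python. The cell names and formulas here are hypothetical, but the propagation behaviour mirrors what a spreadsheet does on every edit: a change to any input cell is re-evaluated through all dependent cells in dependency order.

```python
# A minimal sketch of spreadsheet recalculation: cells hold values or
# formulas over other cells, and edits propagate through dependents.
# Cell names, formulas, and the fixed evaluation order are illustrative.
values = {"A1": 2.0, "A2": 3.0, "A3": 7.0}
formulas = {
    "B1": lambda c: c["A1"] + c["A2"] + c["A3"],  # sum of the data cells
    "B2": lambda c: c["B1"] / 3,                  # mean; depends on B1
}
order = ["B1", "B2"]  # topological (dependency) order

def recalc(cells):
    """Re-evaluate every formula cell in dependency order."""
    out = dict(cells)
    for name in order:
        out[name] = formulas[name](out)
    return out

sheet = recalc(values)
print(sheet["B2"])      # mean of 2, 3, 7 -> 4.0

values["A3"] = 19.0     # edit one observation...
sheet = recalc(values)  # ...and every dependent cell updates
print(sheet["B2"])      # -> 8.0
```

This immediacy is exactly what makes outlier experiments so convenient, and what "analysis boxes" break by standing outside the dependency graph.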
14 When calculations are presented or the professor must review or mark student results, it is important to be able to quickly discern what calculations have been carried out. Professors have just as much difficulty as students in realizing which spreadsheet cells have been altered and in what ways. The ability to log results (Minitab's OUTFILE and JOURNAL commands and Stata's LOG USING keep a record of the commands and outputs) does not exist, to our knowledge, in any common spreadsheet, making an audit of activity next to impossible. The selection of and operation on cells in a spreadsheet manifest themselves on the screen anywhere in the array of cells. This does not lend itself naturally to being logged sequentially. Moreover, unless the affected cells are somehow highlighted and such highlighting maintained during and after actions, it can be quite difficult for passive viewers such as students to observe what has happened and where.
15 A particular example concerns the presentation of a spreadsheet showing results of calculations for exponential smoothing on a forecasting examination. For example, Winters' exponential smoothing requires a column for the data, one for the smoother, one for the smoothed slope, and one for the smoothed seasonal factors. In addition there will be residuals and possibly squares of residuals. The alignment of these with respect to the time index (yet another column) can be critical to correct answers. The array of numbers in the output has proved very difficult for students to interpret quickly and effectively. We note that some statistical packages, including the Student Edition of Minitab for Windows, offer several such forecasting techniques in a form that is easy to understand, although for Winters' seasonal exponential smoothing the components of the computations are not presented. That is, spreadsheets present too much detail for effective learning, while statistical packages may present too little. A temporary compromise one of us (JN) has adopted is to have students learn how such methods work by building the appropriate spreadsheet in an assignment project, so that they are more directly and actively involved with the operations on the numbers, but then to use a statistical package (currently Minitab) for subsequent analyses of data.
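For readers who wish to see the arithmetic those columns carry out, the following Python sketch traces one common additive form of Winters' smoothing. The initialization rule and the parameter values are illustrative assumptions, not those of any particular package; each tuple returned corresponds to one "spreadsheet row."

```python
# Sketch of the column-by-column arithmetic behind additive Winters'
# seasonal smoothing: data, level, slope, seasonal factor, fitted value,
# and residual. Initialization (first-season mean, zero slope) is one
# simple convention among several.
def winters_additive(y, m, alpha=0.2, beta=0.1, gamma=0.1):
    level = sum(y[:m]) / m                   # initial level: first-season mean
    trend = 0.0                              # initial slope (assumed zero)
    seas = [y[t] - level for t in range(m)]  # initial seasonal factors
    rows = []                                # one row per smoothed period
    for t in range(m, len(y)):
        fitted = level + trend + seas[t - m]
        resid = y[t] - fitted
        new_level = alpha * (y[t] - seas[t - m]) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        seas.append(gamma * (y[t] - new_level) + (1 - gamma) * seas[t - m])
        level = new_level
        rows.append((t, y[t], level, trend, seas[-1], fitted, resid))
    return rows
```

Laying the same recurrences out as cell formulas, one column per component, is precisely the assignment project described above.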
16 Of particular concern for tracking activity are the statistical analysis tools that use dialog boxes. From the student's point of view, the use of analysis tools (Quattro Pro and Excel) in the form of dialog boxes is user-friendly. From the point of view of verification of results, we are convinced that dialog boxes are an invitation to disaster. Unlike @ functions, which follow the spreadsheet convention and update themselves whenever previously referenced data cells are altered, dialog boxes perform an analysis only at the time of the dialog. It is therefore very easy (and we ourselves have fallen into the trap) to perform an analysis, edit an observation, and then fail to rerun the analysis. While this is true of non-spreadsheet statistics software too, one normally expects to have to rerun analyses outside the spreadsheet world.
17 The inclusion of dialog boxes in spreadsheets is, we feel, a serious design failure since it mixes two very different ways of viewing computations -- the "all at once" spreadsheet view and the sequential step-by-step procedural approach. From the professor's perspective, checking whether the student has properly carried out statistical computations is only possible by fully duplicating the output because there is no complete trace of what was input into the procedure. We are not arguing that @ functions are easy to learn or teach, but that they provide a consistent approach to computations for spreadsheets. A dialog box could be used to develop the appropriate syntax for such functions, thereby preserving such consistency without imposing the learning cost and syntactic nuisance of conventional @ functions.
18 Beyond the immediate issue of tracing the computations that have been carried out, logging what has gone on during a computational session poses some difficult philosophical questions. As pointed out by Paul Velleman (in Goldstein 1993), such matters are complicated by the increasing popularity of graphical interfaces, of which spreadsheets can be considered one form.
19 Above we have considered some of the general features of spreadsheet software in relation to teaching statistics. We now turn to specifics. Clearly, comments about the features in software reflect the particular versions of products considered. In the present investigation, we looked at the following spreadsheet packages, all of which run under Microsoft Windows:
Lotus 1-2-3 for Windows 4.0 (June 11, 1993)
Borland Quattro Pro for Windows 5.0 (August 11, 1993)
Microsoft Excel version 4.0 (April 1, 1992)
Microsoft Excel version 5.0 (December 15, 1993)
We have included the date of the major executable file in each package to avoid confusion over software releases in case there are multiple releases within a labelled "version."
20 Overall, current spreadsheet statistical features correspond to topics covered in a more traditional introductory statistics course that does not emphasize exploratory data analysis. As such, they seem out of step with more modern approaches in that they do not allow quick and easy explorations of the data. For example, there is a constant need to copy or move data to create multiple columns or contiguous blocks of data and a continuing need to decide where to place the output. This is ironic, given that spreadsheets are frequently the tool of choice for "try it and see" explorations of business and administrative data.
21 Nevertheless, many statistical tools are available:
22 Spreadsheet software is generally considered to provide strong support for graphical analysis of data. In fact, we find that spreadsheets can be very useful in creating presentation graphs after the analysis has been completed. However, we need to distinguish presentation from analytical graphics. The latter need not be pretty, nor very high resolution, but they must allow the user to extract useful information. While character plots have fallen from fashion with the rise of high-resolution printers, they can be output as part of the log file so they do not get detached from the data and commands that created them. Despite fairly persistent haranguing, students continue to label files with such uninformative names as "graph1" or "plotb." We regard training in documenting computer analysis as an informal part of a statistics course. Unfortunately, the ubiquitous labelling of a variable as X, or less commonly Y, in most textbooks does not encourage students to use meaningful names.
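A character plot of the sort that can be written into a plain-text log file, so that it stays attached to the commands and data that produced it, can be sketched as follows. The helper name and layout are our own invention, not a feature of any package discussed here.

```python
# A simple character histogram suitable for inclusion in a text log file.
# Bins are half-open intervals [edges[i], edges[i+1]); a value equal to
# the final right endpoint falls outside all bins in this sketch.
def char_histogram(data, edges, width=40):
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    peak = max(counts) or 1  # avoid division by zero on empty data
    lines = []
    for i, c in enumerate(counts):
        bar = "*" * round(width * c / peak)
        lines.append(f"[{edges[i]:6g},{edges[i+1]:6g}) {c:4d} {bar}")
    return "\n".join(lines)

print(char_histogram([1, 2, 2, 3], [1, 2, 3, 4]))
```

Low resolution, certainly, but the output is self-documenting: the bin edges and counts travel with the picture.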
23 From the point of view of statistics, some desirable graphs are
Of the graphs above, only histograms with non-contiguous equal-width bars can be displayed easily, although one can specify unequal class widths in the frequency table. This can be misleading. To prepare a histogram in Quattro Pro, one must first create a frequency table summary (with the option of specifying bins using the left endpoints) and then apply the graphing tools to the resulting summary, a sequence of operations that is, in our opinion, too complicated for a commonly needed tool. Excel 5.0 is an improvement in that "Chart Output" is an option; however, the resulting histogram bars are labelled in the middle of each interval with the value of its right endpoint, where one would expect the midpoint. Again, this may lead to confusion and misinterpretation. (See Berk and Carey 1995 for macros that correctly draw histograms and normal probability plots. Since the histograms there are dynamically linked to the frequency table, one can adjust the bins to see the changes in the histogram. The only limitation is that no fewer than ten bins are allowed initially to facilitate the drawing of a normal frequency curve, which, incidentally, is not redrawn well when the bins are changed.)
24 A graphic that statisticians and their clients often desire is a line or curve with error bars (the mean plus and minus the standard deviation). This does not appear to be offered in any current spreadsheet.
25 Computational deficiencies of spreadsheets also exist; for example, there is only a limited selection of time-series smoothing methods. However, the professor, student, or other user can redress such deficiencies by means of templates (example spreadsheets) or macros. The spreadsheet developer could add computational features without large-scale interaction with existing design features of the software. Thus, to add a computational function (an @ function) adds program code that is more or less independent of the rest of the package. For example, a useful additional function well-suited to the spreadsheet environment would be one that calculated weighted means and variances. In contrast, adding graphics along with computations, such as we mention in the next paragraph, requires the menu structure, and hence the user interface, to be altered.
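As an illustration, the weighted mean and variance function proposed above might look like the following in Python. The use of frequency weights and the unbiased divisor sum(w) - 1 are assumptions on our part; several weighting conventions exist, and a production @ function would need to document its choice.

```python
# A candidate weighted mean and variance computation, treating w as
# frequency weights and using the unbiased divisor sum(w) - 1 (one
# convention among several; reliability weights would divide differently).
def weighted_mean_var(x, w):
    sw = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, x)) / sw
    var = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, x)) / (sw - 1)
    return mean, var
```

With all weights equal to one, the function reduces to the ordinary sample mean and variance, which makes it easy to check against existing @ functions.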
26 Some statistical computations have come to be almost automatically followed by graphical output. The most obvious case is regression analysis where residual plots of various types are a natural part of a proper treatment of data. As far as residual plots go, only Excel gives residual plots against each independent variable as an option in the regression module. (Excel also provides for scatterplots of the dependent variable and the predicted values against each independent variable and calculates the residuals and predicted values.) Quattro Pro only calculates the residuals and predicted values as an option, and Lotus calculates only the summary statistics and does not have automated residual calculations.
27 Sometimes software is described as doing one thing but is actually doing something else, or it may use a different convention than that presumed by the user. This can be especially confusing for novice users of the software or for users who are unfamiliar with the application area. We have already mentioned problems with histogram displays.
28 An example of such a problem occurs in the regression dialog box in Quattro Pro and in both versions of Excel. One of the Excel options is a normal probability plot; however, whereas one might expect a diagnostic tool to check the normality of the residuals, instead one gets a cumulative distribution plot of the dependent variable. Similarly, Quattro Pro produces a cumulative frequency table if one specifies "Probability Output."
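What a normal probability plot of residuals should supply -- ordered residuals paired with normal quantiles -- can be sketched with the Python standard library. The plotting-position formula (i - 0.5)/n for the i-th ordered residual is one common convention among several.

```python
# Points for a genuine normal probability (quantile-quantile) plot of
# regression residuals: each ordered residual is paired with the standard
# normal quantile at plotting position (i - 0.5)/n.
from statistics import NormalDist

def normal_plot_points(residuals):
    n = len(residuals)
    ordered = sorted(residuals)
    theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theo, ordered))
```

If the residuals are roughly normal, the points fall near a straight line; a cumulative distribution plot of the dependent variable, as delivered by the dialog boxes above, answers a different question entirely.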
29 Another example is Quattro Pro's two-sample t-test assuming unequal variances, which ends up calculating a pooled variance (as if one had assumed equal variances); the same is true of Excel 4.0. Both packages also have an entry for a "Pearson correlation" in the two-sample t-test assuming unequal variances; Quattro Pro will calculate it if the sample sizes are equal, whereas Excel 4.0 always returns a value of "#N/A." Excel 5.0 has removed both of these superfluous entries.
30 The expression -2^4 is evaluated by Excel as 16, implying that unary minus takes precedence over exponentiation, in contrast to the conventional order of operations, under which -2^4 = -16. While there are other occasions when spreadsheet software gives incorrect computational results in the form of the wrong output numbers for specified inputs (see, for example, Nash 1989), we believe that errors are more likely to arise when unreasonable inputs are provided.
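The contrast is easy to demonstrate in a language that follows the conventional precedence. Python, for instance, binds exponentiation more tightly than unary minus:

```python
# Python follows the mathematical convention: ** binds tighter than
# unary minus, so -2 ** 4 is read as -(2 ** 4).
print(-2 ** 4)    # -16, the conventional reading of -2^4
print((-2) ** 4)  # 16, what Excel computes for -2^4
```

A student moving between tools thus gets different answers for the same keystrokes, which is exactly the sort of silent discrepancy this section warns about.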
31 While the utility of the F-test for equality of variances may be questioned, it remains a part of many introductory textbooks and courses. Using the speedbar selection to open up a dialog box, Quattro Pro performs adequately if the first array has a larger sample variance than the second. However, if the arrays are exchanged so the first has the smaller sample variance, the F-statistic is still greater than one, but the degrees of freedom are reversed, and the corresponding one-sided critical value given is for double the alpha level of .05. Excel 4.0 is similar, except that the degrees of freedom are correct, but the corresponding critical value given is also for double the alpha-value. Excel 5.0 always calculates the F-statistic using the first sample variance over the second; however, when the first array has the smaller variance, the critical value is based on the reversed degrees of freedom. The use of the @FTEST function in Quattro Pro and Excel, however, results in correct (two-sided) p-values. Lotus only has the @FTEST option and this returns the two-sided p-value. The advantage of the @ function is that it is recalculated when the data change, whereas the more detailed output from Quattro Pro or Excel obtained through the dialog box results in numbers that are not subject to recalculation.
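The bookkeeping the dialog boxes get wrong is simply to keep the degrees of freedom in the same order as the variance ratio. A Python sketch of the statistic, returning the numerator degrees of freedom first whichever array is entered first:

```python
# F statistic for equality of variances, formed as larger sample
# variance over smaller so that F >= 1, with degrees of freedom
# reported in matching (numerator, denominator) order.
from statistics import variance

def f_test_stat(a, b):
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1
```

Because the result is invariant under exchanging the two arrays, the one-sided critical value looked up against these degrees of freedom is the same either way, which is the behaviour one would want from the dialog boxes.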
32 The issue of missing values has been a perennial concern in conventional statistical software. A simple search of the 1993 Current Index of Statistics using the search BATch file provided therein by Gerry Dallal for "missing value" yielded 104 references. A typical example is Donner (1982). Missing values are particularly dangerous in the context of spreadsheets, where a missing cell and one containing blank text may be presented identically on the screen or printouts. Worse, the software may take blank or empty cells as zeros, when in fact they represent data that are truly missing. Students often have very little understanding of the critical difference between zero and missing; textbooks could be more helpful in this regard.
33 Quattro Pro illustrates the difficulty in the analysis of variance calculation (and correspondingly in the two-sample t-tests using either the dialog box or the @TTEST function). For example, consider a one-way analysis of variance with three groups where one sample has only five elements and the other two have six. The means and standard deviations are computed correctly for the three samples, but the degrees of freedom in the analysis of variance summary appear to treat the missing value as a "zero." The Quattro Pro "Help" facility does state that the one-way analysis of variance should only be done with equal sample sizes, but this is an unnecessarily restrictive condition. Furthermore, the fact that the software does not trap the empty cell is an open invitation to serious errors. Excel 4.0 does not allow any empty cell in the input block of equal-size columns when performing the analysis of variance, but allows empty cells in the two-sample t-tests (using either the dialog boxes or the @TTEST function) and treats them properly. Excel 5.0 has been adjusted to allow for empty cells in a block of equal-size columns. Lotus does not handle the analysis of variance per se, but treats empty cells properly in the two-sample t-tests. None of the three packages will allow a regression analysis to proceed if there are any missing values. This is obviously an inconvenience that forces the user to filter the data before proceeding.
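The danger of treating an empty cell as zero is easy to quantify. In the Python sketch below, None stands in for an empty spreadsheet cell; the two computed means differ because the "zero" convention inflates the sample size.

```python
# Contrast a mean over the non-missing cells with one that silently
# treats an empty cell (None here) as 0 -- the behaviour described above.
cells = [4.0, 5.0, None, 6.0, 5.0]

present = [x for x in cells if x is not None]
correct_mean = sum(present) / len(present)  # n = 4 -> 5.0

as_zero = [0.0 if x is None else x for x in cells]
wrong_mean = sum(as_zero) / len(as_zero)    # n = 5 -> 4.0

print(correct_mean, wrong_mean)
```

The same inflation of n is what distorts the degrees of freedom in the analysis of variance summary discussed above.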
34 Spreadsheet vendors must be encouraged to do better. Closer attention to statistical issues would result in tools better suited to data exploration and analysis, and cleaner software design would avoid some obvious sources of errors. For example, because spreadsheet users are accustomed to instant recalculation, the use of dialog boxes resulting in output that is not dynamically linked to the data can lead to problems.
35 While we fully expect a continuing debate over the use of spreadsheet software for statistics (and other) teaching along with the evolution of the software itself, it is our opinion that professors are currently better served by traditional statistics packages for computational support of their courses.
The referees of this paper made a number of sensible and constructive comments that have been incorporated here.
Berk, K. N., and Carey, P. (1995), Data Analysis with Microsoft Excel 5.0 for Windows, Cambridge: Course Technology Inc.
Donner, A. (1982), "The Relative Effectiveness of Procedures Commonly Used in Multiple Regression Analysis for Dealing With Missing Values," The American Statistician, 36, 378-381.
Goldstein, R. (1993), "Statistical Computing: Editor's Notes," The American Statistician, 47, 46-47.
Nash, J. C. (1989), "Letter to the Editor," SIGNUM Newsletter, 24(4), October 1989, p. 16.
----- (1994), "Obstacles to Having Software Packages Cooperate on Problem Solving," in Computer Science and Statistics - Volume 25, Proceedings of the 25th Symposium on the Interface, eds. M. E. Tarter and M. D. Lock, Interface Foundation of North America, pp. 80-85.
O'Leary, Timothy J. (1989), The Student Edition of Lotus 1-2-3, Reading, MA: Addison-Wesley.
Siegel, A. F. (1990), Practical Business Statistics with Statpad, Boston: Irwin.