Journal of Statistics Education v.2, n.2 (1994)
Copyright (c) 1994 by Allan J. Rossman, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Correlation; Causation; Transformation; Prediction.
This dataset contains information on life expectancies in various countries of the world and the densities of people per television set and of people per physician in those countries. The example has proven very useful for helping students to discover the fundamental principle that correlation does not imply causation. The data also give students an opportunity to explore data transformations and to consider whether a causal connection is necessary for one variable to be a useful predictor of another.
1 One of the most important principles for introductory statistics students to grasp is that a strong association between two variables does not necessarily imply a cause-and-effect relationship between them. This dataset, taken from The World Almanac and Book of Facts 1993, helps students to discover this idea for themselves. For each of the 40 countries in the world with populations of more than 20 million as of 1990, the dataset records the life expectancy at birth, the number of people per television set, and the number of people per physician. Life expectancies are also provided for males and females separately, with the average of those two figures used as the country's overall life expectancy.
2 I use this dataset in an introductory course where students work through activities designed to help them discover statistical concepts and explore statistical techniques for themselves. The course is taught in a microcomputer-equipped classroom in which two students sit at each machine. The students work collaboratively with their partners and use the technology to facilitate the calculation of summary statistics and the production of graphical displays. Each student records his/her responses to the activities' questions in an "activity guide" which serves as a structured journal for the student. The course and its pedagogical approach are described in Rossman (1992).
3 I begin by asking students (in their activity guides) to make guesses about life expectancies in various countries; I then ask them to guess the number of people per television set in these countries. Since this variable is a somewhat unusual one, it forces them to stop and reflect. I also ask students to form a conjecture as to whether any relationship exists between a country's life expectancy and its density of people per television set and, if so, whether the association is positive or negative. Asking for these "guesses" often produces lively debate among the students. I believe that it also helps them to think about data not as naked numbers but as numbers with specific contexts. Each student records his/her guesses for future reference.
4 I then present the students with the actual data and have them comment on how well they have guessed. I ask them to identify the countries with the highest (Japan) and lowest (Ethiopia) life expectancies as well as those with the highest (Myanmar) and lowest (United States) ratios of people to television sets. Even though the issue at hand concerns the relationship between the variables, I first ask them to examine and comment on the distributions of each variable separately.
5 Students then use the computer to produce a scatterplot of life expectancy vs. people per television and comment on the obvious negative association between these variables. They also use the computer to calculate the value of the correlation coefficient (-0.606). In the absence of a computer-equipped classroom, the instructor could use an overhead projection device to display the scatterplot and correlation or supply students with handouts. Since the dataset contains only forty countries, students could even construct the scatterplot by hand.
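For readers who want to reproduce the correlation calculation outside a statistics package, the following is a minimal sketch in Python. The function name `pearson_r` and the convention of representing missing values as `None` are assumptions made for illustration, and the numbers shown are toy values, not entries from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation over the pairs in which neither value is missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Toy values for illustration only; NOT taken from the dataset.
toy_people_per_tv = [2.0, 5.0, None, 50.0, 200.0]
toy_life_expectancy = [78.0, 72.0, 70.0, 60.0, 55.0]
r = pearson_r(toy_people_per_tv, toy_life_expectancy)
print(f"toy correlation: {r:.3f}")
```

Dropping incomplete pairs before computing the means is one common way to handle missing values; a statistics package may offer other options.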
6 Having discovered the negative association between life expectancy and people per television, students then grapple with the issue of causation. I ask them if sending shiploads of televisions to countries with short life expectancies would cause their inhabitants to live longer. Most students find the question laughable, so I follow up by asking them to suggest a more plausible explanation for the association between life expectancy and people per television. Finally, I ask students to write a conclusion about whether correlation implies causation. The vast majority of students succeed in discovering this crucial distinction between correlation and causation for themselves.
7 Despite the lack of a cause-and-effect relationship between the variables, one can still ask whether the number of people per television set in a country is a useful predictor of that country's life expectancy. I like to complicate the issue by asking students which variable they would prefer to use as a predictor of life expectancy: number of people per television set or number of people per physician. Since the number of people per physician seems to be more directly related to life expectancy, most students choose that variable.
8 Upon analyzing scatterplots, students find that the association between life expectancy and people per physician is very similar to that between life expectancy and people per television. The correlation between life expectancy and people per physician (-0.666) is comparable to that between life expectancy and people per television (-0.606). In neither case is the association linear, however, so I ask students to consider transformations of the data.
9 It turns out that logarithmic transformations of people per television and of people per physician make the relationships with life expectancy fairly linear. The data suggest that people per television is a more useful predictor of life expectancy than is people per physician, even though the latter variable seems to be much more directly related to the response variable. The values of r-squared and mean square error turn out to be 0.731 and 4.10 for the model predicting life expectancy from the log of people per television, compared to 0.693 and 4.63 for the model predicting life expectancy from the log of people per physician. Analysis of residuals reveals no striking violations of the normal assumptions for either regression model.
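The comparison of the two models can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually used: the function name `fit_log_linear` is hypothetical, base-10 logarithms are assumed (the choice of base does not affect r-squared or the residuals), and the mean square error is computed as the residual sum of squares divided by n - 2. The text does not say whether the reported figures are this quantity or its square root, so the sketch makes no attempt to match them.

```python
import math

def fit_log_linear(xs, ys):
    """Least-squares fit of y = a + b*log10(x); returns (a, b, r_squared, mse)."""
    # Keep only complete pairs; missing values are assumed to be None.
    pairs = [(math.log10(x), y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete pairs")
    mean_u = sum(u for u, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    suu = sum((u - mean_u) ** 2 for u, _ in pairs)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if suu == 0 or syy == 0:
        raise ValueError("fit is undefined when a variable is constant")
    b = suy / suu                      # slope
    a = mean_y - b * mean_u            # intercept
    sse = sum((y - (a + b * u)) ** 2 for u, y in pairs)
    r_squared = 1.0 - sse / syy
    mse = sse / (n - 2)                # residual mean square (assumed definition)
    return a, b, r_squared, mse

# Toy values for illustration only; NOT taken from the dataset.
toy_people_per_tv = [2.0, 5.0, 20.0, 50.0, 200.0]
toy_life_expectancy = [78.0, 72.0, 70.0, 60.0, 55.0]
a, b, r_squared, mse = fit_log_linear(toy_people_per_tv, toy_life_expectancy)
print(f"slope={b:.2f}, r-squared={r_squared:.3f}, MSE={mse:.2f}")
```

Calling the function once with people per television and once with people per physician gives the two sets of summary values to compare.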
10 Students can also pursue other interesting questions raised by these data. They might explore outliers and their effects on the associations. Another issue might involve asking whether the people per television variable is better at predicting a country's male or female life expectancy. One can also examine differences among the continents or some other classification of the countries in terms of the strength of this relationship.
11 I have found this dataset to be extremely effective for leading students to discover some important ideas related to correlation, causation, and prediction. The variables are interesting and easily understood by students, who often engage in spirited discussions about their findings.
12 The file televisions.dat.txt contains the raw data. The file televisions.txt is a documentation file containing a brief description of the dataset.
Values are aligned and delimited by blanks.
Missing values are denoted with *.
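A blank-delimited file with * for missing values can be read with a few lines of Python. The sketch below rests on several assumptions that are not stated above: each row is taken to be a country name (possibly containing spaces) followed by five numeric fields, one for each variable described in the article, and the file is taken to have no header row. The function names `parse_row` and `load_rows` are hypothetical, and the column order should be checked against the documentation file.

```python
def parse_row(line, n_numeric=5):
    """Split one blank-delimited row into (name, values); '*' becomes None."""
    parts = line.split()
    if len(parts) <= n_numeric:
        raise ValueError(f"expected a name plus {n_numeric} fields: {line!r}")
    # Everything before the trailing numeric fields is treated as the name.
    name = " ".join(parts[:-n_numeric])
    values = [None if tok == "*" else float(tok) for tok in parts[-n_numeric:]]
    return name, values

def load_rows(path, n_numeric=5):
    """Read every non-blank line of the file at `path` with parse_row."""
    with open(path) as handle:
        return [parse_row(line, n_numeric) for line in handle if line.strip()]

# Placeholder row for illustration only; NOT a line from the data file.
name, values = parse_row("Country A   70.5   *   370   74   67")
print(name, values)
```

Taking the numeric fields from the end of the row, rather than splitting on the first blank, keeps multi-word country names intact.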
Rossman, A.J. (1992), "Introductory Statistics: The 'Workshop' Approach," in Proceedings of the Section on Statistical Education, American Statistical Association, pp. 352-357.
The World Almanac and Book of Facts 1993 (1993), New York: Pharos Books.
Allan J. Rossman
Department of Mathematics and Computer Science
P.O. Box 1773
Carlisle, PA 17013