Robert W. Hayden
statistics.com
Journal of Statistics Education Volume 13, Number 1 (2005), jse.amstat.org/v13n1/datasets.hayden.html
Copyright © 2005 by Robert W. Hayden, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Data displays; Inliers; Interpretation in context; Presidents.
For many years I have been urging my students to scan data for outliers. There are two common definitions of what an outlier is, and I have a strong preference between them. One definition concentrates on outliers that are unusually large or small. Here is an example from Bluman (2000, p. 123):
An “outlier” is an extremely high or an extremely low data value when compared with the rest of the data values.
Boxplots implement a specific version of this definition. However, this definition does not generalize well beyond a single variable.
Figure 1. A Plot of Points along y = 20 - x^2 including (0,0).
Nine points of the pseudodata in Figure 1 fall on a perfect parabolic curve while one point is quite far from that curve. However, neither the vertical nor horizontal coordinate of the "outlier" is unusually large. In fact, both coordinates are exactly equal to the mean (and median) of the corresponding coordinate of the other nine points.
I prefer an admittedly more subjective definition that covers a much wider class of situations. Here is an example from Ross (1996, p. 59):
... “outliers” ... are data points that do not appear to follow the pattern of the other data points.
To get my students started thinking in these terms, I wanted examples of data on a single variable for which the outliers were not very large or very small values. Such points are important in research as well as in teaching. They are called “inliers” by Winkler (www.census.gov/srd/papers/pdf/rr9805.pdf). His concern is with identifying such points when they represent errors in the data that are not apparent because they are not unusually large or small values.
In my classes, I use some pseudodata examples, such as
5 | 0
6 | 00
7 | 000
8 | 0000
9 | 0000
10 | 000
11 | 00
12 | 0
13 |
14 |
15 |
16 | 0
17 |
18 |
19 |
20 | 0
21 | 00
22 | 000
23 | 0000
24 | 0000
25 | 000
26 | 00
27 | 0
We might imagine this as prize monies in athletic events, with one peak representing males, another females, and an "outlier" in the middle that needs further investigation. Once again the "outlier" is at the center of the rest of the data.
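The point about the central "outlier" can be checked numerically. The following is a minimal sketch, not part of the original classroom material: it rebuilds the pseudodata from the stem-and-leaf display, assuming each stem is the data value itself, and uses a hypothetical helper `isolated_values` with an arbitrary gap threshold of 4 to look for values that sit apart from their neighbours.

```python
from statistics import mean, median

# Counts read off the stem-and-leaf display. Every leaf is 0, so each stem
# is treated as the data value itself (an assumption about the units).
counts = {5: 1, 6: 2, 7: 3, 8: 4, 9: 4, 10: 3, 11: 2, 12: 1,
          16: 1,
          20: 1, 21: 2, 22: 3, 23: 4, 24: 4, 25: 3, 26: 2, 27: 1}
data = sorted(v for v, c in counts.items() for _ in range(c))

center_mean = mean(data)
center_median = median(data)


def isolated_values(values, min_gap):
    """Return values at least min_gap away from both nearest neighbours.

    Hypothetical helper; min_gap is an arbitrary illustrative threshold.
    """
    ordered = sorted(values)
    found = []
    for i, v in enumerate(ordered):
        left = v - ordered[i - 1] if i > 0 else float("inf")
        right = ordered[i + 1] - v if i < len(ordered) - 1 else float("inf")
        if left >= min_gap and right >= min_gap:
            found.append(v)
    return found


inliers = isolated_values(data, min_gap=4)
print(center_mean, center_median, inliers)
```

Under these assumptions the isolated value coincides with both the mean and the median, so no rule based on distance from the center could flag it. A gap-based check is only one of several ways to look for such points.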
Figure 3. Boxplot of Days.
At first glance, the boxplot in Figure 3 suggests symmetry with no outliers – until we notice the location of the median at one end of the box, something beginners might not notice immediately. To more experienced eyes, this suggests a (single) sharp peak around 1500.
The histogram in Figure 4 suggests a bimodal distribution with no outliers.
Figure 4. Histogram of Days.
3 | 0 | 014
6 | 0 | 889
(19) | 1 | 0124444444444444444
18 | 1 | 568
15 | 2 | 00
13 | 2 | 788999999999
1 | 3 |
1 | 3 |
1 | 4 | 4
Leaf Unit = 100
The stem and leaf in Figure 5 suggests a bimodal distribution with a mild outlier at the high end.
Figure 6. Dotplot of Days.
The dotplot in Figure 6 is the most revealing of our displays. Most of the observations fall in two peaks around 1500 and 3000. Since the majority of the observations fall at these two sharp peaks, we might consider all of the remaining data to be "outliers".
We can look at the data in greater detail by tallying the values.
Days | Count |
---|---|
31 | 1 |
199 | 1 |
491 | 1 |
881 | 1 |
895 | 1 |
967 | 1 |
1036 | 1 |
1110 | 1 |
1260 | 1 |
1418 | 1 |
1427 | 1 |
1460 | 12 |
1461 | 2 |
1503 | 1 |
1655 | 1 |
1886 | 1 |
2027 | 1 |
2039 | 1 |
2727 | 1 |
2810 | 1 |
2864 | 1 |
2921 | 6 |
2922 | 3 |
4452 | 1 |
Table 1 shows that the peaks are very sharp indeed. There are 14 observations at 1460-1461 and 9 at 2921-2922. More than half the data take on one of these four values. It is interesting to note that the values at one peak are about two times the values at the other. Can you guess what these data are? (Hint: 1460 = 4 x 365)
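For readers who want to reproduce the tally, here is a small sketch using Python's standard library. The list `days` is copied from the unsorted data listing in this article; the variable names are my own.

```python
from collections import Counter

# Days in office, in the order given in the data listing.
days = [2864, 1460, 2921, 2921, 2921, 1460, 2921, 1460, 31,
        1427, 1460, 491, 967, 1460, 1460, 1503, 1418, 2921,
        1460, 199, 1260, 1460, 1460, 1460, 1655, 2727, 1460,
        2921, 881, 2039, 1460, 4452, 2810, 2922, 1036, 1886,
        2027, 895, 1461, 2922, 1461, 2922, 1110]

# Tally each distinct value, then print the table in increasing order.
tally = Counter(days)
for value in sorted(tally):
    print(f"{value:>5} {tally[value]:>3}")

# Share of the observations that take one of the four peak values.
peak_values = (1460, 1461, 2921, 2922)
at_peaks = sum(tally[v] for v in peak_values)
share = at_peaks / len(days)
print(f"{at_peaks} of {len(days)} observations ({share:.0%}) are at the peaks")
```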
Days
2864 | 1460 | 2921 | 2921 | 2921 | 1460 | 2921 | 1460 | 31 |
1427 | 1460 | 491 | 967 | 1460 | 1460 | 1503 | 1418 | 2921 |
1460 | 199 | 1260 | 1460 | 1460 | 1460 | 1655 | 2727 | 1460 |
2921 | 881 | 2039 | 1460 | 4452 | 2810 | 2922 | 1036 | 1886 |
2027 | 895 | 1461 | 2922 | 1461 | 2922 | 1110 |
Days (sorted)
31 | 199 | 491 | 881 | 895 | 967 | 1036 | 1110 | 1260 |
1418 | 1427 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 |
1460 | 1460 | 1460 | 1460 | 1460 | 1461 | 1461 | 1503 | 1655 |
1886 | 2027 | 2039 | 2727 | 2810 | 2864 | 2921 | 2921 | 2921 |
2921 | 2921 | 2921 | 2922 | 2922 | 2922 | 4452 |
No. | President | Days |
---|---|---|
1 | Washington | 2864 |
2 | Adams | 1460 |
3 | Jefferson | 2921 |
4 | Madison | 2921 |
5 | Monroe | 2921 |
6 | Adams | 1460 |
7 | Jackson | 2921 |
8 | Van Buren | 1460 |
9 | Harrison | 31 |
10 | Tyler | 1427 |
11 | Polk | 1460 |
12 | Taylor | 491 |
13 | Fillmore | 967 |
14 | Pierce | 1460 |
15 | Buchanan | 1460 |
16 | Lincoln | 1503 |
17 | Johnson | 1418 |
18 | Grant | 2921 |
19 | Hayes | 1460 |
20 | Garfield | 199 |
21 | Arthur | 1260 |
22 | Cleveland | 1460 |
23 | Harrison | 1460 |
24 | Cleveland | 1460 |
25 | McKinley | 1655 |
26 | Roosevelt | 2727 |
27 | Taft | 1460 |
28 | Wilson | 2921 |
29 | Harding | 881 |
30 | Coolidge | 2039 |
31 | Hoover | 1460 |
32 | Roosevelt | 4452 |
33 | Truman | 2810 |
34 | Eisenhower | 2922 |
35 | Kennedy | 1036 |
36 | Johnson | 1886 |
37 | Nixon | 2027 |
38 | Ford | 895 |
39 | Carter | 1461 |
40 | Reagan | 2922 |
41 | Bush | 1461 |
42 | Clinton | 2922 |
43 | Bush | 1110 |
Rank | President | Days |
---|---|---|
1 | Harrison | 31 |
2 | Garfield | 199 |
3 | Taylor | 491 |
4 | Harding | 881 |
5 | Ford | 895 |
6 | Fillmore | 967 |
7 | Kennedy | 1036 |
8 | Bush | 1110 |
9 | Arthur | 1260 |
10 | Johnson | 1418 |
11 | Tyler | 1427 |
12 | Adams | 1460 |
13 | Adams | 1460 |
14 | Van Buren | 1460 |
15 | Polk | 1460 |
16 | Pierce | 1460 |
17 | Buchanan | 1460 |
18 | Hayes | 1460 |
19 | Cleveland | 1460 |
20 | Harrison | 1460 |
21 | Cleveland | 1460 |
22 | Taft | 1460 |
23 | Hoover | 1460 |
24 | Carter | 1461 |
25 | Bush | 1461 |
26 | Lincoln | 1503 |
27 | McKinley | 1655 |
28 | Johnson | 1886 |
29 | Nixon | 2027 |
30 | Coolidge | 2039 |
31 | Roosevelt | 2727 |
32 | Truman | 2810 |
33 | Washington | 2864 |
34 | Jefferson | 2921 |
35 | Madison | 2921 |
36 | Monroe | 2921 |
37 | Jackson | 2921 |
38 | Grant | 2921 |
39 | Wilson | 2921 |
40 | Eisenhower | 2922 |
41 | Reagan | 2922 |
42 | Clinton | 2922 |
43 | Roosevelt | 4452 |
The two peaks in the data represent presidents who served one (1460 or 1461 days) or two (2921 or 2922) full terms. The fact that there are two values at each peak is due to changes in how the starting and ending dates of a standard term are defined. This is more pronounced in the case of Washington, who is actually a part of the upper peak. He served two full terms but his "start-up" term as the first President of the United States was shorter than subsequent terms. If we count Washington, there are 24 Presidents "in the pattern". The remaining 19 Presidents (44%) that fall off the two peaks are "outliers" in the sense that some explanation is required as to why these Presidents failed to serve one or two full terms.
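The counts in the preceding paragraph can be recovered from the tally in Table 1. This is a sketch only; the names `FULL_TERM_VALUES` and `WASHINGTON` are labels I have introduced for the illustration.

```python
# Counts copied from Table 1: days in office -> number of Presidents.
table1 = {31: 1, 199: 1, 491: 1, 881: 1, 895: 1, 967: 1, 1036: 1, 1110: 1,
          1260: 1, 1418: 1, 1427: 1, 1460: 12, 1461: 2, 1503: 1, 1655: 1,
          1886: 1, 2027: 1, 2039: 1, 2727: 1, 2810: 1, 2864: 1, 2921: 6,
          2922: 3, 4452: 1}

FULL_TERM_VALUES = {1460, 1461, 2921, 2922}  # one or two full terms
WASHINGTON = 2864  # two terms, the first one shortened

total = sum(table1.values())
in_pattern = sum(count for d, count in table1.items()
                 if d in FULL_TERM_VALUES or d == WASHINGTON)
off_pattern = total - in_pattern
percent_off = round(100 * off_pattern / total)

print(f"{in_pattern} in the pattern, {off_pattern} off it ({percent_off}%)")
```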
Franklin Roosevelt is the one high outlier because he was the only President elected to more than two terms. He was actually elected to four terms, but died in office during his fourth. This is probably the only outlier that is covered by the "too big or too small" definition of outliers. That depends, of course, on where you set the cut-off points for what counts as too extreme. For example, the definition built into the boxplot doesn't tag Roosevelt as an outlier.
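To see why a boxplot does not flag Roosevelt, one can compute the fences directly. The sketch below assumes the common rule that places fences 1.5 interquartile ranges beyond the quartiles, and it uses the default quartile convention of Python's `statistics.quantiles`. Statistical packages differ in how they compute quartiles, so the exact fences may vary slightly from one program to another.

```python
from statistics import quantiles

# Days in office, copied from the sorted data listing.
days_sorted = [31, 199, 491, 881, 895, 967, 1036, 1110, 1260,
               1418, 1427, 1460, 1460, 1460, 1460, 1460, 1460, 1460,
               1460, 1460, 1460, 1460, 1460, 1461, 1461, 1503, 1655,
               1886, 2027, 2039, 2727, 2810, 2864, 2921, 2921, 2921,
               2921, 2921, 2921, 2922, 2922, 2922, 4452]

# Quartiles under the default ("exclusive") method; other conventions exist.
q1, q2, q3 = quantiles(days_sorted, n=4)
iqr = q3 - q1

# Assumed fence rule: 1.5 * IQR beyond each quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [d for d in days_sorted if d < lower_fence or d > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, fences=({lower_fence}, {upper_fence})")
print("flagged as outliers:", flagged)
```

Under these assumptions the upper fence lies above 4452, so even the longest time in office is not marked, and the median sits close to the first quartile, as the boxplot shows.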
The clear links with history make this a good dataset to use with a colleague in that discipline. One possible exploration might involve the names listed twice on the list of Presidents. Your colleague can help your students look into history to find explanations for these apparent duplications. Some are father and son, one pair is grandfather and grandson, and the two Clevelands are the same man, elected to two nonconsecutive terms. Here we could discuss whether this is an "outlier" in the sense that it needs fixing. For some purposes it might make more sense to list Cleveland but once and total his days in office. Apparently not for every purpose, though; the U. S. State Department has ruled that Cleveland shall be counted as both the 22nd and 24th President.
One might also note that in addition to being an outlier as a result of being elected to the Presidency four times, Franklin Roosevelt also served a truncated fourth term due to death in office, and, like Washington, a truncated first term because the date of inauguration was changed.
Your students may have an almanac, a friendly, nearby history teacher, or their own knowledge of U. S. History to fall back on to answer such questions. In real studies, it may be that peculiarities in the data have no ready explanation. Then the analysis of the data may stimulate new research to find an explanation. A famous example from the history of science is the discovery of unknown planets (see O'Connor and Robertson, www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html). Here peculiarities in the data on the known planets suggested where to look for new planets.
My goals with this data are more modest than discovering new planets. I hope to illustrate an underappreciated kind of "outlier", to have students see that data displays can tell us much about the underlying situation, and that we may have to delve into the originating discipline to understand what we see in our displays.
The file outlier.dat.txt is a tab delimited text file containing 43 rows. The rows, in chronological order, list the President’s name (with no embedded spaces) and his number of days in office. The file outlier.txt is a documentation file containing a brief description of the dataset.
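As a hedged illustration of how the data file might be read, the sketch below parses tab-delimited rows of name and days. It assumes there is no header row and exactly two columns per line, which the description above suggests but does not guarantee. The function name `read_presidents` and the three sample rows are stand-ins for illustration.

```python
import csv
import io


def read_presidents(handle):
    """Parse tab-delimited (name, days) rows into a list of (str, int) tuples.

    Assumes no header row and two columns per line.
    """
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:  # skip blank lines
            continue
        rows.append((record[0], int(record[1])))
    return rows


# Stand-in for the first rows of the file. With the real file one would pass
# open("outlier.dat.txt", newline="") instead of this in-memory sample.
sample = io.StringIO("Washington\t2864\nAdams\t1460\nJefferson\t2921\n")
presidents = read_presidents(sample)
print(presidents)
```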
O’Connor, J.J. and Robertson, E.F., “Mathematical discovery of planets”, www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html.
Ross, Sheldon (1996), Introductory Statistics, New York: McGraw-Hill.
Winkler, William, “Problems with Inliers”, U.S. Census Bureau, www.census.gov/srd/papers/pdf/rr9805.pdf.
Robert W. Hayden
82 River Street
Ashland, NH 03217
USA
bob@statland.org