# A Dataset that is 44% Outliers

Robert W. Hayden
statistics.com

Journal of Statistics Education Volume 13, Number 1 (2005), ww2.amstat.org/publications/jse/v13n1/datasets.hayden.html

Copyright © 2005 by Robert W. Hayden, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

Key Words: Data displays; Inliers; Interpretation in context; Presidents.

## Abstract

The data illustrate outliers that are not mistakes and not observations that are unusually high or low. The reasons for them are all interesting historically. They illustrate that "outliers" need not be errors but may instead be particularly interesting cases. The data also illustrate that different data displays may differ in their ability to reveal interesting data structure.

## 1. Introduction

For many years I have been urging my students to scan data for outliers. There are two common definitions of what an outlier is, and I have a strong preference between them. One definition concentrates on outliers that are unusually large or small. Here is an example from Bluman (2000, p. 123):

An “outlier” is an extremely high or an extremely low data value when compared with the rest of the data values.

Boxplots implement a specific version of this definition. However, this definition does not generalize well beyond a single variable.

Figure 1. A Plot of Points along y = 20 - x² including (0,0).

Nine points of the pseudodata in Figure 1 fall on a perfect parabolic curve while one point is quite far from that curve. However, neither the vertical nor horizontal coordinate of the "outlier" is unusually large. In fact, both coordinates are exactly equal to the mean (and median) of the corresponding coordinate of the other nine points.

I prefer an admittedly more subjective definition that covers a much wider class of situations. Here is an example from Ross (1996, p. 59):

... “outliers” ... are data points that do not appear to follow the pattern of the other data points.

To get my students started thinking in these terms, I wanted examples of data on a single variable for which the outliers were not very large or very small values. Such points are important in research as well as in teaching. They are called “inliers” by Winkler (www.census.gov/srd/papers/pdf/rr9805.pdf). His concern is with identifying such points when they represent errors in the data that are not apparent because they are not unusually large or small values.

In my classes, I use some pseudodata examples, such as

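| Stem | Leaves |
|---|---|
| 5 | 0 |
| 6 | 00 |
| 7 | 000 |
| 8 | 0000 |
| 9 | 0000 |
| 10 | 000 |
| 11 | 00 |
| 12 | 0 |
| 13 | |
| 14 | |
| 15 | |
| 16 | 0 |
| 17 | |
| 18 | |
| 19 | |
| 20 | 0 |
| 21 | 00 |
| 22 | 000 |
| 23 | 0000 |
| 24 | 0000 |
| 25 | 000 |
| 26 | 00 |
| 27 | 0 |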

Figure 2. Stem and Leaf of Pseudodata Example.

We might imagine this as prize monies in athletic events, with one peak representing males, another females, and an "outlier" in the middle that needs further investigation. Once again the "outlier" is at the center of the rest of the data.

## 2. Data

Recently I became aware of a real dataset that can serve my purpose and also illustrates the strengths and weaknesses of different data displays. It would give away too much to tell you just what the numbers represent (though I should warn you that residents of the United States will have an advantage in guessing where these data came from), so let us begin with some displays.

Figure 3. Boxplot of Days.

At first glance, the boxplot in Figure 3 suggests symmetry with no outliers – until we notice the location of the median at one end of the box, something beginners might not notice immediately. To more experienced eyes, this suggests a (single) sharp peak around 1500.

The histogram in Figure 4 suggests a bimodal distribution with no outliers.

Figure 4. Histogram of Days.

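| Depth | Stem | Leaves |
|---|---|---|
| 3 | 0 | 014 |
| 6 | 0 | 889 |
| (19) | 1 | 0124444444444444444 |
| 18 | 1 | 568 |
| 15 | 2 | 00 |
| 13 | 2 | 788999999999 |
| 1 | 3 | |
| 1 | 3 | |
| 1 | 4 | 4 |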

Leaf Unit = 100

Figure 5. Stem and Leaf of Days.

The stem and leaf in Figure 5 suggests a bimodal distribution with a mild outlier at the high end.

Figure 6. Dotplot of Days.

The dotplot in Figure 6 is the most revealing of our displays. Most of the observations fall in two peaks around 1500 and 3000. Since the majority of the observations fall at these two sharp peaks, we might consider all of the remaining data to be "outliers".

We can look at the data in greater detail by tallying the values.

Table 1. Tally of Days.

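| Days | Count |
|---|---|
| 31 | 1 |
| 199 | 1 |
| 491 | 1 |
| 881 | 1 |
| 895 | 1 |
| 967 | 1 |
| 1036 | 1 |
| 1110 | 1 |
| 1260 | 1 |
| 1418 | 1 |
| 1427 | 1 |
| 1460 | 12 |
| 1461 | 2 |
| 1503 | 1 |
| 1655 | 1 |
| 1886 | 1 |
| 2027 | 1 |
| 2039 | 1 |
| 2727 | 1 |
| 2810 | 1 |
| 2864 | 1 |
| 2921 | 6 |
| 2922 | 3 |
| 4452 | 1 |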

Table 1 shows that the peaks are very sharp indeed. There are 14 observations at 1460-1461 and 9 at 2921-2922. More than half the data take on one of these four values. It is interesting to note that the values at one peak are about two times the values at the other. Can you guess what these data are? (Hint: 1460 = 4 x 365)

Days

 2864 1460 2921 2921 2921 1460 2921 1460 31 1427 1460 491 967 1460 1460 1503 1418 2921 1460 199 1260 1460 1460 1460 1655 2727 1460 2921 881 2039 1460 4452 2810 2922 1036 1886 2027 895 1461 2922 1461 2922 1110

Days (sorted)

 31 199 491 881 895 967 1036 1110 1260 1418 1427 1460 1460 1460 1460 1460 1460 1460 1460 1460 1460 1460 1460 1461 1461 1503 1655 1886 2027 2039 2727 2810 2864 2921 2921 2921 2921 2921 2921 2922 2922 2922 4452
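
The tally in Table 1 can be reproduced directly from the raw values. The following is a minimal sketch in Python, which is my choice of language here and not something the original analysis used; the variable names (`days`, `tally`, `peak_values`) are purely illustrative.

```python
from collections import Counter

# Days values in their original order, copied from the listing above.
days = [
    2864, 1460, 2921, 2921, 2921, 1460, 2921, 1460, 31, 1427, 1460,
    491, 967, 1460, 1460, 1503, 1418, 2921, 1460, 199, 1260, 1460,
    1460, 1460, 1655, 2727, 1460, 2921, 881, 2039, 1460, 4452, 2810,
    2922, 1036, 1886, 2027, 895, 1461, 2922, 1461, 2922, 1110,
]

# Count how often each distinct value occurs.
tally = Counter(days)
for value in sorted(tally):
    print(f"{value:>5} {tally[value]:>3}")

# The four values that make up the two sharp peaks.
peak_values = (1460, 1461, 2921, 2922)
at_peaks = sum(tally[v] for v in peak_values)
print(f"{at_peaks} of {len(days)} observations fall at the four peak values")
```

A tally like this works on exact values, which is why it exposes peaks that any display grouping values into wider classes tends to blur.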

## 3. Context of the Data

Table 2. Past Presidents of the United States.

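| No. | President | Days |
|---|---|---|
| 1 | Washington | 2864 |
| 2 | Adams | 1460 |
| 3 | Jefferson | 2921 |
| 4 | Madison | 2921 |
| 5 | Monroe | 2921 |
| 6 | Adams | 1460 |
| 7 | Jackson | 2921 |
| 8 | Van Buren | 1460 |
| 9 | Harrison | 31 |
| 10 | Tyler | 1427 |
| 11 | Polk | 1460 |
| 12 | Taylor | 491 |
| 13 | Fillmore | 967 |
| 14 | Pierce | 1460 |
| 15 | Buchanan | 1460 |
| 16 | Lincoln | 1503 |
| 17 | Johnson | 1418 |
| 18 | Grant | 2921 |
| 19 | Hayes | 1460 |
| 20 | Garfield | 199 |
| 21 | Arthur | 1260 |
| 22 | Cleveland | 1460 |
| 23 | Harrison | 1460 |
| 24 | Cleveland | 1460 |
| 25 | McKinley | 1655 |
| 26 | Roosevelt | 2727 |
| 27 | Taft | 1460 |
| 28 | Wilson | 2921 |
| 29 | Harding | 881 |
| 30 | Coolidge | 2039 |
| 31 | Hoover | 1460 |
| 32 | Roosevelt | 4452 |
| 33 | Truman | 2810 |
| 34 | Eisenhower | 2922 |
| 35 | Kennedy | 1036 |
| 36 | Johnson | 1886 |
| 37 | Nixon | 2027 |
| 38 | Ford | 895 |
| 39 | Carter | 1461 |
| 40 | Reagan | 2922 |
| 41 | Bush | 1461 |
| 42 | Clinton | 2922 |
| 43 | Bush | 1110 |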

Table 3. Past Presidents of the United States (sorted).

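| Rank | President | Days |
|---|---|---|
| 1 | Harrison | 31 |
| 2 | Garfield | 199 |
| 3 | Taylor | 491 |
| 4 | Harding | 881 |
| 5 | Ford | 895 |
| 6 | Fillmore | 967 |
| 7 | Kennedy | 1036 |
| 8 | Bush | 1110 |
| 9 | Arthur | 1260 |
| 10 | Johnson | 1418 |
| 11 | Tyler | 1427 |
| 12 | Adams | 1460 |
| 13 | Adams | 1460 |
| 14 | Van Buren | 1460 |
| 15 | Polk | 1460 |
| 16 | Pierce | 1460 |
| 17 | Buchanan | 1460 |
| 18 | Hayes | 1460 |
| 19 | Cleveland | 1460 |
| 20 | Harrison | 1460 |
| 21 | Cleveland | 1460 |
| 22 | Taft | 1460 |
| 23 | Hoover | 1460 |
| 24 | Carter | 1461 |
| 25 | Bush | 1461 |
| 26 | Lincoln | 1503 |
| 27 | McKinley | 1655 |
| 28 | Johnson | 1886 |
| 29 | Nixon | 2027 |
| 30 | Coolidge | 2039 |
| 31 | Roosevelt | 2727 |
| 32 | Truman | 2810 |
| 33 | Washington | 2864 |
| 34 | Jefferson | 2921 |
| 35 | Madison | 2921 |
| 36 | Monroe | 2921 |
| 37 | Jackson | 2921 |
| 38 | Grant | 2921 |
| 39 | Wilson | 2921 |
| 40 | Eisenhower | 2922 |
| 41 | Reagan | 2922 |
| 42 | Clinton | 2922 |
| 43 | Roosevelt | 4452 |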

The two peaks in the data represent Presidents who served one (1460 or 1461 days) or two (2921 or 2922 days) full terms. The fact that there are two values at each peak is due to changes in how the starting and ending dates of a standard term are defined. The effect is more pronounced in the case of Washington, who is actually a part of the upper peak. He served two full terms, but his "start-up" term as the first President of the United States was shorter than subsequent terms. If we count Washington, there are 24 Presidents "in the pattern". The remaining 19 Presidents (44%) who fall off the two peaks are "outliers" in the sense that some explanation is required as to why these Presidents failed to serve one or two full terms.
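
The 44% figure follows from simple counting, sketched here in Python. The sets `one_term` and `two_terms` and the decision to treat Washington's 2864 days as "in the pattern" simply encode the reasoning of the paragraph above; the names are hypothetical, not part of the dataset.

```python
# Days in office in chronological order, as in Table 2.
days = [
    2864, 1460, 2921, 2921, 2921, 1460, 2921, 1460, 31, 1427, 1460,
    491, 967, 1460, 1460, 1503, 1418, 2921, 1460, 199, 1260, 1460,
    1460, 1460, 1655, 2727, 1460, 2921, 881, 2039, 1460, 4452, 2810,
    2922, 1036, 1886, 2027, 895, 1461, 2922, 1461, 2922, 1110,
]

one_term = {1460, 1461}    # one full term
two_terms = {2921, 2922}   # two full terms

# Presidents whose days match a full-term value exactly.
in_pattern = sum(1 for d in days if d in one_term or d in two_terms)

# Assumption taken from the text: Washington (first entry) is counted
# with the two-term peak even though his first term was short.
in_pattern += 1

off_pattern = len(days) - in_pattern
share = off_pattern / len(days)
print(f"{in_pattern} in the pattern, {off_pattern} off it ({share:.0%})")
```

Leaving Washington out of the pattern would raise the count of "outliers" to 20 of 43, so the headline percentage depends on that judgment call.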

Franklin Roosevelt is the one high outlier because he was the only President elected to more than two terms. He was actually elected to four terms, but died in office during his fourth. This is probably the only outlier that is covered by the "too big or too small" definition of outliers. That depends, of course, on where you set the cut-off points for "too big" and "too small". For example, the definition built into the boxplot doesn't tag Roosevelt as an outlier.
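
To see why the boxplot leaves Roosevelt untagged, one can compute the fences by hand. The sketch below assumes the common 1.5 × IQR rule and the quartile convention that Python's `statistics.quantiles` uses by default; the software that produced Figure 3 may use a slightly different convention, so treat the exact fence values as illustrative.

```python
import statistics

# Days in office in chronological order, as in Table 2.
days = [
    2864, 1460, 2921, 2921, 2921, 1460, 2921, 1460, 31, 1427, 1460,
    491, 967, 1460, 1460, 1503, 1418, 2921, 1460, 199, 1260, 1460,
    1460, 1460, 1655, 2727, 1460, 2921, 881, 2039, 1460, 4452, 2810,
    2922, 1036, 1886, 2027, 895, 1461, 2922, 1461, 2922, 1110,
]

# Quartiles with the default ("exclusive") method; an assumption,
# since the article does not say which convention its boxplot used.
q1, median, q3 = statistics.quantiles(days, n=4)
iqr = q3 - q1

# Assumed 1.5 * IQR rule for flagging points beyond the whiskers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = [d for d in days if d < lower_fence or d > upper_fence]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}; flagged: {flagged}")
```

Under these assumptions the upper fence lies well above 4452, so nothing is flagged. The median also sits close to the first quartile, which matches the lopsided box described for Figure 3.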

## 4. Discussion

Some of my students subscribe to the OWTH (Off With Their Heads) school of thought on how to deal with outliers. They simply want to delete them. This dataset is a case where that is clearly a foolish policy. What we generally want to do with outliers is investigate them more fully and find out why they are special. Often this has some significance in the realm where the data were collected.

Here are some examples that can be turned into exercises for students. In many cases the Presidents who failed to serve one or two full terms died in office. (Who are they?) But for each of those, there is another who served a partial term by serving out the remaining term of the President who died. (Which Presidents are these?) There is also a President who resigned from office, and a matching one who served out his term. (Who are they?) Finally, there is the current President, whose term is not yet over. For him, there is no other President associated with the remainder of his term. (Should he even be included? Is the number of days served accurate in his case? Is he an outlier?) In general, there is a reason for each "outlier" that can be discovered by looking into the context of the data.

(I should note that one outlier was removed in the data gathering process. David Rice Atchison may have been Acting President for one day in 1849. See www.senate.gov/artandhistory/history/minute/President_For_A_Day.htm.)

The clear links with history make this a good dataset to use with a colleague in that discipline. One possible exploration might involve the names listed twice on the list of Presidents. Your colleague can help your students look into history to find explanations for these apparent duplications. Some are father-and-son, one pair are grandfather-and-grandson, and the two Clevelands are the same man, elected to two nonconsecutive terms. Here we could discuss whether this is an "outlier" in the sense that it needs fixing. For some purposes it might make more sense to list Cleveland but once and total his days in office. Apparently not for every purpose, though; the U. S. State Department has ruled that Cleveland shall be counted as both the 22nd and 24th President.

One might also note that in addition to being an outlier as a result of being elected to the Presidency four times, Franklin Roosevelt also served a truncated fourth term due to death in office, and, like Washington, a truncated first term because the date of inauguration was changed.

Your students may have an almanac, a friendly, nearby history teacher, or their own knowledge of U. S. History to fall back on to answer such questions. In real studies, it may be that peculiarities in the data have no ready explanation. Then the analysis of the data may stimulate new research to find an explanation. A famous example from the history of science is the discovery of unknown planets; see O'Connor and Robertson (www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html). Here peculiarities in the data on the known planets suggested where to look for new planets.

My goals with these data are more modest than discovering new planets. I hope to illustrate an underappreciated kind of "outlier", and to have students see both that data displays can tell us much about the underlying situation and that we may have to delve into the originating discipline to understand what we see in our displays.

## 5. Getting the Data

The file outlier.dat.txt is a tab-delimited text file containing 43 rows. The rows, in chronological order, list each President’s name (with no embedded spaces) and his number of days in office. The file outlier.txt is a documentation file containing a brief description of the dataset.

## References

Bluman, Allan (2000), Elementary Statistics, brief version, New York: McGraw-Hill.

O’Connor, J.J. and Robertson, E.F., “Mathematical discovery of planets”, www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html.

Ross, Sheldon (1996), Introductory Statistics, New York: McGraw-Hill.

Winkler, William, “Problems with Inliers”, U.S. Census Bureau, www.census.gov/srd/papers/pdf/rr9805.pdf.

Robert W. Hayden
82 River Street
Ashland, NH 03217
USA
bob@statland.org