A Dataset that is 44% Outliers

Robert W. Hayden
statistics.com

Journal of Statistics Education Volume 13, Number 1 (2005), jse.amstat.org/v13n1/datasets.hayden.html

Copyright © 2005 by Robert W. Hayden, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

Key Words: Data displays; Inliers; Interpretation in context; Presidents.

Abstract

The data illustrate outliers that are not mistakes and not observations that are unusually high or low. The reasons for them are all interesting historically. They illustrate that "outliers" need not be errors but may instead be particularly interesting cases. The data also illustrate that different data displays may differ in their ability to reveal interesting data structure.

1. Introduction

For many years I have been urging my students to scan data for outliers. There are two common definitions of what an outlier is, and I have a strong preference between them. One definition concentrates on outliers that are unusually large or small. Here is an example from Bluman (2000, p. 123):

An “outlier” is an extremely high or an extremely low data value when compared with the rest of the data values.

Boxplots implement a specific version of this definition. However, this definition does not generalize well beyond a single variable.

Figure 1. A Plot of Points along y=20-x² including (0,0).

Nine points of the pseudodata in Figure 1 fall on a perfect parabolic curve while one point is quite far from that curve. However, neither the vertical nor horizontal coordinate of the "outlier" is unusually large. In fact, both coordinates are exactly equal to the mean (and median) of the corresponding coordinate of the other nine points.

I prefer an admittedly more subjective definition that covers a much wider class of situations. Here is an example from Ross (1996, p. 59):

... “outliers” ... are data points that do not appear to follow the pattern of the other data points.

To get my students started thinking in these terms, I wanted examples of data on a single variable for which the outliers were not very large or very small values. Such points are important in research as well as in teaching. They are called “inliers” by Winkler (www.census.gov/srd/papers/pdf/rr9805.pdf). His concern is with identifying such points when they represent errors in the data that are not apparent because they are not unusually large or small values.

In my classes, I use some pseudodata examples, such as

 5  0
 6  00
 7  000
 8  0000
 9  0000
10  000
11  00
12  0
13
14
15
16  0
17
18
19
20  0
21  00
22  000
23  0000
24  0000
25  000
26  00
27  0

Figure 2. Stem and Leaf of Pseudodata Example.

We might imagine this as prize monies in athletic events, with one peak representing males, another females, and an "outlier" in the middle that needs further investigation. Once again the "outlier" is at the center of the rest of the data.

2. Data

Recently I became aware of a real dataset that can serve my purpose and also illustrates the strengths and weaknesses of different data displays. It would give away too much to tell you just what the numbers represent (though I should warn you that residents of the United States will have an advantage in guessing where these data came from), so let us begin with some displays.

Figure 3. Boxplot of Days.

At first glance, the boxplot in Figure 3 suggests symmetry with no outliers – until we notice the location of the median at one end of the box, something beginners might not notice immediately. To more experienced eyes, this suggests a (single) sharp peak around 1500.

The histogram in Figure 4 suggests a bimodal distribution with no outliers.

Figure 4. Histogram of Days.

   3   0  014
   6   0  889
 (19)  1  0124444444444444444
  18   1  568
  15   2  00
  13   2  788999999999
   1   3
   1   3
   1   4  4

Leaf Unit = 100

Figure 5. Stem and Leaf of Days.

The stem and leaf in Figure 5 suggests a bimodal distribution with a mild outlier at the high end.

Figure 6. Dotplot of Days.

The dotplot in Figure 6 is the most revealing of our displays. Most of the observations fall in two peaks around 1500 and 3000. Since the majority of the observations fall at these two sharp peaks, we might consider all of the remaining data to be "outliers".

We can look at the data in greater detail by tallying the values.

Table 1. Tally of Days.

Days	Count
31	1
199	1
491	1
881	1
895	1
967	1
1036	1
1110	1
1260	1
1418	1
1427	1
1460	12
1461	2
1503	1
1655	1
1886	1
2027	1
2039	1
2727	1
2810	1
2864	1
2921	6
2922	3
4452	1

Table 1 shows that the peaks are very sharp indeed. There are 14 observations at 1460-1461 and 9 at 2921-2922. More than half the data take on one of these four values. It is interesting to note that the values at one peak are about two times the values at the other. Can you guess what these data are? (Hint: 1460 = 4 x 365)

Days

2864 1460 2921 2921 2921 1460 2921 1460 31

1427 1460 491 967 1460 1460 1503 1418 2921

1460 199 1260 1460 1460 1460 1655 2727 1460

2921 881 2039 1460 4452 2810 2922 1036 1886

2027 895 1461 2922 1461 2922 1110

Days (sorted)

31 199 491 881 895 967 1036 1110 1260

1418 1427 1460 1460 1460 1460 1460 1460 1460

1460 1460 1460 1460 1460 1461 1461 1503 1655

1886 2027 2039 2727 2810 2864 2921 2921 2921

2921 2921 2921 2922 2922 2922 4452
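
The tally in Table 1, and the claim that more than half the data sit on four values, can be checked with a few lines of code. What follows is a minimal sketch in Python, which is not the software used for the displays in this article; the names `days`, `tally`, and `peak_values` are illustrative only, and the values are copied from the unsorted listing above.

```python
from collections import Counter

# The 43 values, copied from the unsorted "Days" listing.
days = [
    2864, 1460, 2921, 2921, 2921, 1460, 2921, 1460, 31,
    1427, 1460, 491, 967, 1460, 1460, 1503, 1418, 2921,
    1460, 199, 1260, 1460, 1460, 1460, 1655, 2727, 1460,
    2921, 881, 2039, 1460, 4452, 2810, 2922, 1036, 1886,
    2027, 895, 1461, 2922, 1461, 2922, 1110,
]

tally = Counter(days)

# Print the tally in ascending order of Days, as in Table 1.
for value in sorted(tally):
    print(f"{value:5d} {tally[value]:3d}")

# Count the observations that sit exactly on one of the four peak values.
peak_values = (1460, 1461, 2921, 2922)
at_peaks = sum(tally[v] for v in peak_values)
print(f"{at_peaks} of {len(days)} observations "
      f"({at_peaks / len(days):.0%}) are at the peaks")
```

The last line reports 23 of 43 observations, or about 53%, at the four peak values.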

3. Context of the Data

Table 2. Past Presidents of the United States.

	President	Days

1	Washington	2864
2	Adams	1460
3	Jefferson	2921
4	Madison	2921
5	Monroe	2921
6	Adams	1460
7	Jackson	2921
8	Van Buren	1460
9	Harrison	31
10	Tyler	1427
11	Polk	1460
12	Taylor	491
13	Fillmore	967
14	Pierce	1460
15	Buchanan	1460
16	Lincoln	1503
17	Johnson	1418
18	Grant	2921
19	Hayes	1460
20	Garfield	199
21	Arthur	1260
22	Cleveland	1460
23	Harrison	1460
24	Cleveland	1460
25	McKinley	1655
26	Roosevelt	2727
27	Taft	1460
28	Wilson	2921
29	Harding	881
30	Coolidge	2039
31	Hoover	1460
32	Roosevelt	4452
33	Truman	2810
34	Eisenhower	2922
35	Kennedy	1036
36	Johnson	1886
37	Nixon	2027
38	Ford	895
39	Carter	1461
40	Reagan	2922
41	Bush	1461
42	Clinton	2922
43	Bush	1110

Table 3. Past Presidents of the United States (sorted).

	President	Days

1	Harrison	31
2	Garfield	199
3	Taylor	491
4	Harding	881
5	Ford	895
6	Fillmore	967
7	Kennedy	1036
8	Bush	1110
9	Arthur	1260
10	Johnson	1418
11	Tyler	1427
12	Adams	1460
13	Adams	1460
14	Van Buren	1460
15	Polk	1460
16	Pierce	1460
17	Buchanan	1460
18	Hayes	1460
19	Cleveland	1460
20	Harrison	1460
21	Cleveland	1460
22	Taft	1460
23	Hoover	1460
24	Carter	1461
25	Bush	1461
26	Lincoln	1503
27	McKinley	1655
28	Johnson	1886
29	Nixon	2027
30	Coolidge	2039
31	Roosevelt	2727
32	Truman	2810
33	Washington	2864
34	Jefferson	2921
35	Madison	2921
36	Monroe	2921
37	Jackson	2921
38	Grant	2921
39	Wilson	2921
40	Eisenhower	2922
41	Reagan	2922
42	Clinton	2922
43	Roosevelt	4452

The two peaks in the data represent presidents who served one (1460 or 1461 days) or two (2921 or 2922) full terms. The fact that there are two values at each peak is due to changes in how the starting and ending dates of a standard term are defined. This is more pronounced in the case of Washington, who is actually a part of the upper peak. He served two full terms but his "start-up" term as the first President of the United States was shorter than subsequent terms. If we count Washington, there are 24 Presidents "in the pattern". The remaining 19 Presidents (44%) that fall off the two peaks are "outliers" in the sense that some explanation is required as to why these Presidents failed to serve one or two full terms.
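
The 44% figure follows directly from the tally. Here is a small sketch in Python that does the arithmetic; the constant names are hypothetical, and treating Washington's 2864 days as part of the two-term peak is the judgment call described above, not something the data force on us.

```python
# Tally of Days from Table 1: value -> count.
TALLY = {
    31: 1, 199: 1, 491: 1, 881: 1, 895: 1, 967: 1, 1036: 1, 1110: 1,
    1260: 1, 1418: 1, 1427: 1, 1460: 12, 1461: 2, 1503: 1, 1655: 1,
    1886: 1, 2027: 1, 2039: 1, 2727: 1, 2810: 1, 2864: 1, 2921: 6,
    2922: 3, 4452: 1,
}

ONE_TERM = {1460, 1461}
TWO_TERMS = {2921, 2922}
WASHINGTON = 2864  # assumption: counted with the two-term peak

total = sum(TALLY.values())
in_pattern = sum(
    count
    for value, count in TALLY.items()
    if value in ONE_TERM or value in TWO_TERMS or value == WASHINGTON
)
off_pattern = total - in_pattern

print(f"in the pattern: {in_pattern}")
print(f"off the peaks:  {off_pattern} ({off_pattern / total:.0%})")
```

This gives 24 Presidents in the pattern and 19 of 43, about 44%, off the peaks.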

Franklin Roosevelt is the one high outlier because he was the only President elected to more than two terms. He was actually elected to four terms, but died in office during his fourth. This is probably the only outlier that is covered by the "too big or too small" definition of outliers. That depends, of course, on your cut-off points for extremeness. For example, the definition built into the boxplot doesn't tag Roosevelt as an outlier.
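
To see why the boxplot does not flag Roosevelt, one can compute the usual fences at 1.5 times the interquartile range beyond the quartiles. The sketch below uses Python's `statistics.quantiles` with its default "exclusive" method. That quartile convention is an assumption: the software that drew Figure 3 may use a slightly different rule, which would move the fences somewhat.

```python
import statistics

# Tally of Days from Table 1: value -> count.
TALLY = {
    31: 1, 199: 1, 491: 1, 881: 1, 895: 1, 967: 1, 1036: 1, 1110: 1,
    1260: 1, 1418: 1, 1427: 1, 1460: 12, 1461: 2, 1503: 1, 1655: 1,
    1886: 1, 2027: 1, 2039: 1, 2727: 1, 2810: 1, 2864: 1, 2921: 6,
    2922: 3, 4452: 1,
}

# Expand the tally back into the 43 sorted observations.
days = sorted(value for value, count in TALLY.items() for _ in range(count))

# Quartiles under the default "exclusive" method (an assumed convention).
q1, median, q3 = statistics.quantiles(days, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations the boxplot rule would mark as outliers.
flagged = [d for d in days if d < lower_fence or d > upper_fence]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"flagged by the 1.5 x IQR rule: {flagged}")
```

Under this convention the quartiles are 1427 and 2864, so the upper fence falls at 5019.5 days, well above Roosevelt's 4452, and nothing is flagged. The median of 1460 sits almost on top of the first quartile, which is the feature of Figure 3 noted earlier.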

4. Discussion

Some of my students subscribe to the OWTH (Off With Their Heads) school of thought on how to deal with outliers. They simply want to delete them. This dataset is a case where that is clearly a foolish policy. What we generally want to do with outliers is investigate them more fully and find out why they are special. Often this has some significance in the realm where the data were collected. Here are some examples that can be turned into exercises for students. In many cases the Presidents who failed to serve one or two full terms died in office. (Who are they?) But for each of those, there is another who served a partial term by serving out the remaining term of the President who died. (Which Presidents are these?) There is also a President who resigned from office, and a matching one who served out his term. (Who are they?) Finally, there is the current President, whose term is not yet over. For him, there is no other President associated with the remainder of his term. (Should he even be included? Is the number of days served accurate in his case? Is he an outlier?) In general, there is a reason for each "outlier" that can be discovered by looking into the context of the data. (I should note that one outlier was removed in the data gathering process. David Rice Atchison may have been Acting President for one day in 1849. See www.senate.gov/artandhistory/history/minute/President_For_A_Day.htm.)

The clear links with history make this a good dataset to use with a colleague in that discipline. One possible exploration might involve the names listed twice on the list of Presidents. Your colleague can help your students look into history to find explanations for these apparent duplications. Some are father-and-son, one pair is grandfather-and-grandson, and the two Clevelands are the same man, elected to two nonconsecutive terms. Here we could discuss whether this is an "outlier" in the sense that it needs fixing. For some purposes it might make more sense to list Cleveland but once and total his days in office. Apparently not for every purpose, though; the U.S. State Department has ruled that Cleveland shall be counted as both the 22nd and 24th President.

One might also note that in addition to being an outlier as a result of being elected to the Presidency four times, Franklin Roosevelt also served a truncated fourth term due to death in office, and, like Washington, a truncated first term because the date of inauguration was changed.

Your students may have an almanac, a friendly, nearby history teacher, or their own knowledge of U.S. history to fall back on to answer such questions. In real studies, it may be that peculiarities in the data have no ready explanation. Then the analysis of the data may stimulate new research to find an explanation. A famous example from the history of science is the discovery of unknown planets; see O'Connor and Robertson (www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html). Here peculiarities in the data on the known planets suggested where to look for new planets.

My goals with these data are more modest than discovering new planets. I hope to illustrate an underappreciated kind of "outlier", to have students see that data displays can tell us much about the underlying situation, and to show that we may have to delve into the originating discipline to understand what we see in our displays.

5. Getting the Data

The file outlier.dat.txt is a tab delimited text file containing 43 rows. The rows, in chronological order, list the President’s name (with no embedded spaces) and his number of days in office. The file outlier.txt is a documentation file containing a brief description of the dataset.
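
For readers who want to load the file programmatically, here is a minimal sketch in Python. It assumes each row holds exactly two tab-separated fields (name, then days) and that there is no header row; neither assumption has been checked against the file itself, and the function name `read_terms` is hypothetical.

```python
import csv
import io


def read_terms(lines):
    """Parse tab-delimited rows of (name, days) into a list of tuples."""
    rows = []
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, days = fields[0].strip(), int(fields[1])
        rows.append((name, days))
    return rows


# Demonstration on the first three rows of Table 2, held in memory.
sample = io.StringIO("Washington\t2864\nAdams\t1460\nJefferson\t2921\n")
terms = read_terms(sample)
print(terms)

# With the real file on disk (the path is an assumption):
# with open("outlier.dat.txt", newline="") as f:
#     terms = read_terms(f)
```

Parsing from any iterable of lines, instead of from a fixed path, keeps the function easy to try on a few rows before pointing it at the downloaded file.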

References

Bluman, Allan (2000), Elementary Statistics, brief version, New York: McGraw-Hill.

O’Connor, J.J. and Robertson, E.F., “Mathematical discovery of planets”, www-gap.dcs.st-and.ac.uk/~history/HistTopics/Neptune_and_Pluto.html.

Ross, Sheldon (1996), Introductory Statistics, New York: McGraw-Hill.

Winkler, William, “Problems with Inliers”, U.S. Census Bureau, www.census.gov/srd/papers/pdf/rr9805.pdf.

Robert W. Hayden
82 River Street
Ashland, NH 03217
USA
bob@statland.org
