Eric Langford

California State University, Chico

Journal of Statistics Education Volume 14, Number 3 (2006), ww2.amstat.org/publications/jse/v14n3/langford.html

Copyright © 2006 by Eric Langford all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

**Key Words:** Percentiles; Quantiles

The emphasis in this paper is on the computation of quartiles, at a level suitable for classroom discussion in a first course in statistics. More general definitions of quantiles (percentiles) are given here, but not stressed. With the increasing emphasis on exploratory data analysis (EDA) in the elementary classroom, in particular the ideas of the five-number summary and box-and-whisker plots (boxplots), a thorough understanding of quartiles is mandatory, but a detailed discussion of quantiles may not be necessary for the beginning student.

Elementary discussion of quartiles can be found in Dr. Twe (2002), Freund and Perles (1987), Hayden (1997), Joarder and Firozzaman (2001), John (2000), Journet (1999), and Wessa (2006). Wessa’s website also contains a link to an online calculator which will calculate quartiles using eight different methods. For a more complete discussion of quantiles, together with a number of references, see Hyndman and Fan (1996).

It might be thought that with the increasing use of graphing calculators (for example, the TI-83 Plus) and computer
packages (MINITAB, SAS, *Mathematica*, JMP, *Microsoft Excel*) in the classroom, the need for consistency in the
textbook definition of quartiles would be lessened. But the widespread use of such tools makes the need for a consistent
definition of the quartiles *more* necessary, rather than less, since as we shall see later, the TI-83 Plus, MINITAB, SAS,
*Mathematica*, and *Microsoft Excel* use five different definitions of the quartiles! (The methods used by each
of these packages are summarized later in Table 1.) In fact, one recent text
(McClave and Sincich (2003)) reproduces results from a TI-83 (p. 46),
MINITAB (p. 48), and SAS (pp. 50 and 65), all of which use different methods. What are students to do when they check a
MINITAB or SAS or *Microsoft Excel* calculation on their TI-83 Plus calculator and get a different answer, all of
which differ from the answer in the back of the book? This is not an idle concern; a very confused student wrote to the
“Ask Dr. Math” section of The Math Forum@Drexel inquiring why his TI-83, *Excel*, MINITAB, and his own paper-and-pencil
calculations all gave different answers for the quartiles of his data set. (See Dr.
Twe (2002).)

There is a tendency for statisticians to say, “Why worry? The differences are small so who cares?” Freund and Perles (1987) answer this well:

“Before we go into any details, let us point out that the numerical differences between answers produced by the different methods are not necessarily large; indeed, they may be very small. Yet if quartiles are used, say to establish criteria for making decisions, the method of their calculation becomes of critical concern. For instance, if sales quotas are established from historical data, and salespersons in the highest quarter of the quota are to receive bonuses, while those in the lowest quarter are to be fired, establishing these boundaries is of interest to both employer and employee. In addition, computer-software users are sometimes unaware of the fact that different methods can provide different answers to their problems, and they may not know which method of calculating quartiles is actually provided by their software.”

If there are repeated data values, we must replace “greater than” by “to the right of” and similarly for “less than,”
“greater than or equal to,” and “less than or equal to.” But consider the data set (1, 2, 2, 3, 4). No one would
disagree that the median is 2. But it is the second “2” in the set and not the first “2” which has the above properties.
(All twos are equal, but some are more equal than others!) What we would like to have is a definition of the median (in
this case 2) that depends only on its *numerical value* and not on the particular occurrence of that value. Thus we
take the following definition (which is the key to defining percentiles in a precise fashion):

DEFINITION 1: The median is that number which puts at least half of the data values at that number or below and at least half of the data values at that number or above; if more than one such number exists, there will be an entire interval of such and the median is the midpoint of that interval.
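Definition 1 can be checked mechanically. The following is a minimal sketch, not taken from any of the cited texts; the function names `satisfies_definition_1` and `median` are illustrative. The first function tests whether a candidate number meets the "at least half at or below, at least half at or above" condition, and the second returns the midpoint of the interval of all such numbers.

```python
def satisfies_definition_1(data, m):
    """True if at least half the values are <= m and at least half are >= m."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= m)
    at_or_above = sum(1 for x in data if x >= m)
    return 2 * at_or_below >= n and 2 * at_or_above >= n


def median(data):
    """Midpoint of the interval of numbers satisfying Definition 1."""
    xs = sorted(data)
    n = len(xs)
    # For odd n both indices pick the middle value; for even n they pick
    # the two central values, which bound the interval of valid medians.
    lo, hi = xs[(n - 1) // 2], xs[n // 2]
    return (lo + hi) / 2


print(median([1, 2, 2, 3, 4]))                     # the repeated-value example
print(satisfies_definition_1([1, 2, 2, 3, 4], 2))  # depends only on the value 2
print(median([1, 2, 3, 4]))                        # midpoint of the interval [2, 3]
```

Note that the check uses only the numerical value of the candidate, never the position of a particular occurrence, which is the point of the definition.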

The most naive approach in defining quartiles is to think of the median as dividing the data set into halves (“bottom half”
and “top half”) and then defining the lower (first) quartile *Q*_{1} to be the median of the bottom
half, and the upper (third) quartile *Q*_{3} to be the median of the top half. This makes good
sense and is an easy “sell” to students. It works well if *n* is even, but if it is odd, the question remains:
“What do we do with the median value itself?” As you might expect, different authors give different answers. For the
remainder of this paper, *n* will denote the number of data values in the data set.

**METHOD 1 (“Inclusive”)**: Divide the data set into two halves, a bottom half and a top half. If *n* is odd,
*include* the median value in both halves. Then the lower quartile is the median of the bottom half and the upper
quartile is the median of the top half. As an example, if *S*_{5} = (1, 2, 3, 4, 5), then the
inclusive lower half is (1, 2, 3) and hence *Q*_{1} = 2. (A summary of all of the methods
considered will be given later in Table 2.)

This method is used by Siegel and Morgan (1996) and is equivalent to Method 3 below.

**METHOD 2 (“Exclusive”)**: As above except that in the case of *n* odd, the median value is *excluded* from
both halves. As an example, if *S*_{5} = (1, 2, 3, 4, 5), then the exclusive lower half is (1, 2)
and hence *Q*_{1} = 1.5.
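Methods 1 and 2 differ only in how the middle value is treated when *n* is odd, which a short sketch makes concrete. The function name `quartiles_by_halves` and its `include_median` flag are illustrative choices, not notation from the texts cited.

```python
def median(xs):
    """Median of an already sorted list."""
    n = len(xs)
    return (xs[(n - 1) // 2] + xs[n // 2]) / 2


def quartiles_by_halves(data, include_median):
    """Method 1 (include_median=True) or Method 2 (include_median=False)."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Odd n, inclusive: the middle value belongs to both halves.
        bottom, top = xs[:half + 1], xs[half:]
    else:
        # Even n, or odd n with the middle value dropped from both halves.
        bottom, top = xs[:half], xs[n - half:]
    return median(bottom), median(top)


s5 = [1, 2, 3, 4, 5]
print(quartiles_by_halves(s5, include_median=True))   # inclusive halves (1, 2, 3) and (3, 4, 5)
print(quartiles_by_halves(s5, include_median=False))  # exclusive halves (1, 2) and (4, 5)
```

For even *n* the flag has no effect, since there is no middle value to include or exclude.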

This method is used by Moore (2003), Peck, Olsen, and Devore (2001, p. 117), Brase and Brase (2003), and Moore and McCabe (2003). Because of this last reference, I have seen this method referred to as the “M&M Method.” Method 1 of Joarder and Firozzaman (2001) covers both of our Methods 1 and 2.

According to its instruction book (p. 12 - 29) the TI-83 Plus defines the lower quartile as being the “median of the points between the minimum and the median” and the upper quartile similarly. This would lead one to believe that Method 1 is being used. However, in using the TI-83 Plus on the test data sets defined later in this paper, it appears that Method 2 is actually being used. (The TI-84 Plus and TI-89 seem to use the same method.)

Before proceeding further, we will need some notation. To simplify matters, we always assume that the data values are
ordered in nondecreasing order: $x_1 \le x_2 \le \cdots \le x_n$. To say that we take value #(*k*)
where *k* is an integer is to say we take *x _{k}*. If

In his classic book on EDA, Tukey (1977) introduced the concepts of
*box-and-whisker plot* and *five-number summary* in terms of what he calls the upper and lower *hinges*
(see p. 33). The two hinges form the ends of the box in the box-and-whisker plot and, together with the maximum,
minimum, and median values, form the five-number summary. The *upward rank* of a data value
*x _{k}* is simply

Tukey is careful to define his box-and-whisker plots and five-number summaries entirely in terms of the hinges, and does not involve quartiles. However, many authors use the quartiles rather than the hinges in their definitions, which is where the confusion arises, because of the many different definitions of the quartiles. We shall formalize the Tukey hinges as Method 3, even though, strictly speaking, Method 3 is used to find hinges not quartiles. In Table 2 later on, we shall see that Tukey hinges are numerically equal to Method 1 quartiles, so we need not worry about what “Tukey quartiles” are.

**METHOD 3 (“Tukey”)**: Let the median be #(*M*) = #((*n* + 1)/2) and define
$H = (\lfloor M \rfloor + 1)/2$. Count *H* measurements from the bottom and *H* measurements
from the top to get the lower and upper hinges; if *H* is not an integer, then interpolate; i.e., the lower hinge is
#(*H*) and the upper hinge is #(*n* + 1 – *H*). As an example, if *S*_{5} = (1, 2, 3,
4, 5), then the median is #(*M*) = #(3) = 3 and so *H* = 2 making the lower hinge also 2.
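The hinge computation can be sketched as follows. Two assumptions are made here and are not spelled out in the text above: the hinge depth is taken to be *H* = (⌊*M*⌋ + 1)/2, Tukey's usual rule, which reproduces the worked example, and #(*k*) for fractional *k* is read as linear interpolation between neighboring order statistics. The helper names `value_at` and `tukey_hinges` are illustrative.

```python
import math


def value_at(xs, k):
    """#(k) on a sorted list: 1-based position, linear interpolation if k is fractional."""
    lo = math.floor(k)
    frac = k - lo
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])


def tukey_hinges(data):
    """Lower and upper hinges, assuming hinge depth H = (floor(M) + 1) / 2."""
    xs = sorted(data)
    n = len(xs)
    m = (n + 1) / 2                 # depth of the median
    h = (math.floor(m) + 1) / 2     # assumed depth of the hinges
    return value_at(xs, h), value_at(xs, n + 1 - h)


print(tukey_hinges([1, 2, 3, 4, 5]))        # M = 3, H = 2
print(tukey_hinges([1, 2, 3, 4, 5, 6, 7]))  # M = 4, H = 2.5, so interpolation is needed
```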

In addition to Tukey (1977), this approach is used by Milton, McTeer, and Corbet (1997). Also, MINITAB can be used to calculate the hinges by using the EDA option and asking for “letter values.” Curiously enough, MINITAB when asked to draw a box-and-whisker plot will use its own calculation (Method 11) of the quartiles, rather than the Tukey letter values.

In general, unless authors define quartiles using one of the three methods above, they define percentile values and let the
lower quartile (25^{th} percentile) and upper quartile (75^{th} percentile) be special cases of that
definition. These definitions are usually based on the generalization of the “definition” of the median as being that
value which puts “half of the data set above and half of the data set below.” (Recall our previous discussion, which
yielded Definition 1.) This generalized “definition” is: “The *P*^{th} percentile value puts *P*
percent of the data set below and (100 - *P*) percent of the data set above.” As we shall discuss in the next section,
this must be made more precise as we have already done for the median. (For simplicity of notation, we let *p* =
*P*/100, so that, for example, the 50^{th} percentile corresponds to *p* = 0.5.)

One method used is the following. We shall see in the next section that this method, although unwieldy to apply, is the only method that satisfies our precise definition of percentile. We call it the “CDF Method” since it is based on the CDF (cumulative distribution function) of the empirical distribution given by the data set. SAS refers to it as “empirical distribution function with averaging.”

**METHOD 4 (“CDF”)**: The *P*^{th} percentile value is found as follows. Calculate *np*. If *np*
is an integer, then the *P*^{th} percentile value is the average of #(*np*) and #(*np* + 1). If
*np* is not an integer, the *P*^{th} percentile value is
$x_{\lceil np \rceil}$; that is, we *round up*. Alternatively, one can look at
#(*np* + 0.5) and *round off* unless it is half an odd integer, in which case it is left unrounded. As an
example, if *S*_{5} = (1, 2, 3, 4, 5) and *p* = 0.25, then *np* = 1.25, which is not
an integer, so we take the next larger integer and hence *Q*_{1} = #(2) = 2. Using the alternative
calculation, we would look at #(*np* + 0.5) = #(1.75), which would again round off to #(2) = 2. Note that this method can be
considered as “Method 10 with rounding.”
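A minimal sketch of the CDF Method follows. The name `cdf_percentile` is illustrative, and *p* is passed as a `Fraction` (an implementation choice made here) so that the test "is *np* an integer" is exact rather than subject to floating-point error. The sketch assumes 0 < *p* < 1.

```python
import math
from fractions import Fraction


def cdf_percentile(data, p):
    """CDF Method 4 for 0 < p < 1, with p given as a Fraction."""
    xs = sorted(data)
    n_p = len(xs) * Fraction(p)
    if n_p.denominator == 1:
        # np is an integer: average #(np) and #(np + 1).
        k = int(n_p)
        return (xs[k - 1] + xs[k]) / 2
    # np is not an integer: round the position up.
    return xs[math.ceil(n_p) - 1]


s5 = [1, 2, 3, 4, 5]
print(cdf_percentile(s5, Fraction(1, 4)))            # np = 1.25, rounded up to position 2
print(cdf_percentile([1, 2, 3, 4], Fraction(1, 4)))  # np = 1, so average of #(1) and #(2)
```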

This method is used by Johnson and Bhattacharyya (1996), Johnson (2000), and Ross (1996). It is Definition 2 of Hyndman and Fan (1996) and Definition 4 of Joarder and Firozzaman (2001), who refer to Smith (1997), p. 36, who uses the alternative calculation. It is the default option PCTLDEF = 5 of the SAS System computer package and is also Method 4 of Wessa (2006).

Yet another method is found in Mendenhall and Sincich (1995).

**METHOD 5 (“M&S”)**: For the lower and upper quartile values take #((*n* + 1)*p*) with *p* = 0.25
for the lower quartile and *p* = 0.75 for the upper quartile. Then round to the nearest integer. If
(*n* + 1)*p* is half an odd integer, round *up* for the lower quartile and *down* for the upper quartile.
For example, if *S*_{5} = (1, 2, 3, 4, 5) and *p* = 0.75, then #((*n* + 1)*p*) =
#(4.5) and hence *Q*_{3} = 4. Note that this can be considered as “Method 11 with complete
rounding,” in the same way that Method 4 can be considered as “Method 10 with rounding.” For general percentiles, the
authors say to “take #(*n* + 1)*p*” and “round to the nearest integer,” perhaps implying the same kind of
rounding as for the quartiles when (*n* + 1)*p* is half an odd integer.

A method very similar to this is used by Lohninger (1999).

**METHOD 6 (“Lohninger”)**: This method is the same as the previous method except in the case of (*n* + 1)*p*
equal to half an odd integer we always round *up*. Using the same example as above, we would round up rather than
down and obtain *Q*_{3} = 5.
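Methods 5 and 6 differ only in how a position that is half an odd integer is rounded for the upper quartile. The sketch below is one way to write this down; the function name `rounded_quartiles` and the `lohninger` flag are illustrative and do not come from either source.

```python
import math


def rounded_quartiles(data, lohninger=False):
    """Method 5 (M&S) by default; Method 6 (Lohninger) if lohninger=True."""
    xs = sorted(data)
    n = len(xs)

    def pick(p, ties_up):
        k = (n + 1) * p          # exact in floating point for p = 0.25 and 0.75
        if k % 1 == 0.5:
            # Half an odd integer: the tie rule decides.
            k = math.ceil(k) if ties_up else math.floor(k)
        else:
            k = round(k)
        return xs[int(k) - 1]

    q1 = pick(0.25, ties_up=True)        # both methods round up for Q1
    q3 = pick(0.75, ties_up=lohninger)   # M&S rounds down, Lohninger rounds up
    return q1, q3


s5 = [1, 2, 3, 4, 5]
print(rounded_quartiles(s5))                  # (n + 1)p = 4.5 is rounded down for Q3
print(rounded_quartiles(s5, lohninger=True))  # the same 4.5 is rounded up
```

Because every position is rounded to an integer, both methods always return values that occur in the data set.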

Joarder and Firozzaman (2001) refer to a method of Vining (1998), p. 44:

**METHOD 7 (“Vining”)**: Define *Q*_{1} to be #((*n* + 3)/4) if *n* is odd and
#((*n* + 2)/4) if *n* is even and define *Q*_{3} to be #((3*n* + 1)/4) if *n*
is odd and #((3*n* + 2)/4) if *n* is even. For example, if *S*_{5} = (1, 2, 3, 4, 5),
then we take *Q*_{1} = #(8/4) = 2. (We shall see from
Table 2 that this is equivalent to Method 1.)

Joarder and Firozzaman (2001) also propose formulas which they call the
“Remainder Rule.” In terms of our notation, it looks like the following: First write *n* = 4*m* + *k*,
where *k* = 0, 1, 2, or 3. If *k* = 0 or 1, let *Q*_{1} be #(*m* + 0.5) and
*Q*_{3} be #(*n* – *m* + 0.5). If *k* = 2 or 3, let
*Q*_{1} be #(*m* + 1) and *Q*_{3} be #(*n* – *m*). After a
little algebra, this rule can be seen to be equivalent to the following:

**METHOD 8 (“J&F”)**: Define *Q*_{1} to be #((*n* + 1)/4) if *n* is odd and
#((*n* + 2)/4) if *n* is even and define *Q*_{3} to be #((3*n* + 3)/4) if
*n* is odd and #((3*n* + 2)/4) if *n* is even. For example, if *S*_{5} =
(1, 2, 3, 4, 5), then we take *Q*_{1} = #(6/4) = 1.5. (We shall see from
Table 2 that this is equivalent to Method 2.)
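The "little algebra" relating the Remainder Rule to Method 8 can also be checked numerically. The sketch below compares the two sets of positions for a range of *n*; the function names are illustrative, and exact rational arithmetic is used so that positions such as *m* + 0.5 compare exactly.

```python
from fractions import Fraction


def remainder_rule_positions(n):
    """Positions (q1, q3) from the Remainder Rule, writing n = 4m + k."""
    m, k = divmod(n, 4)
    if k in (0, 1):
        return m + Fraction(1, 2), n - m + Fraction(1, 2)
    return Fraction(m + 1), Fraction(n - m)


def method_8_positions(n):
    """Positions (q1, q3) from the Method 8 formulas."""
    if n % 2 == 1:
        return Fraction(n + 1, 4), Fraction(3 * n + 3, 4)
    return Fraction(n + 2, 4), Fraction(3 * n + 2, 4)


# The two rules give the same positions for every n tried.
print(all(remainder_rule_positions(n) == method_8_positions(n) for n in range(4, 200)))
```

A check over finitely many *n* is of course not a proof, but since both rules depend on *n* only through *n* mod 4 and a linear term, agreement over several full cycles is a useful sanity check.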

Still another method is used by Hogg and Ledolter (1992).

**METHOD 9 (“H&L”)**: The *P*^{th} percentile value is found by taking that value with #(*np* +
0.5). If this is not an integer, take the average (not the weighted average) of
$x_{\lfloor np + 0.5 \rfloor}$ and $x_{\lceil np + 0.5 \rceil}$.
As an example, if
*S*_{5} = (1, 2, 3, 4, 5) and *p* = 0.25, then #(*np* + 0.5) = #(1.75) and so we
average #(1) and #(2) implying that *Q*_{1} = 1.5.

These authors observe (p. 21, bottom) “alternatively, one could interpolate using the weighted averages ... [but that the]
differences, however, will usually be quite small.” This provides still another method, distinct from all of the others,
since it gives a value of 1.75 for *Q*_{1} when applied to *S*_{5}. Even
though this method was not actually used by any of the texts that I have examined, it is referred to in the literature and
is used by *Mathematica*. Note that it makes a nice complement to Methods 11 and 12.

**METHOD 10 (“H&L-2”)**: The *P*^{th} percentile value is found by taking that value with #(*np* +
0.5). If this is not an integer, take the *interpolated value* between
$x_{\lfloor np + 0.5 \rfloor}$ and $x_{\lceil np + 0.5 \rceil}$. As an
example, if *S*_{5} = (1, 2, 3, 4, 5) and *p* = 0.25, then #(*np* + 0.5) = #(1.75) and
so *Q*_{1} = 1.75.
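Methods 9 and 10 share the position *np* + 0.5 and differ only in how a fractional position is resolved. A minimal sketch, with the illustrative name `hl_percentile` and a `weighted` flag chosen here to switch between the two, and with "interpolated value" read as linear interpolation:

```python
import math


def hl_percentile(data, p, weighted=False):
    """Method 9 (plain average) or, with weighted=True, Method 10 (interpolation)."""
    xs = sorted(data)
    k = len(xs) * p + 0.5
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return xs[lo - 1]                 # the position is an integer
    if weighted:
        # Method 10: move the fraction (k - lo) of the way from x_lo to x_hi.
        return xs[lo - 1] + (k - lo) * (xs[hi - 1] - xs[lo - 1])
    # Method 9: unweighted average of the two neighbors.
    return (xs[lo - 1] + xs[hi - 1]) / 2


s5 = [1, 2, 3, 4, 5]
print(hl_percentile(s5, 0.25))                 # position 1.75, average of #(1) and #(2)
print(hl_percentile(s5, 0.25, weighted=True))  # position 1.75, interpolated
```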

This method is Method 5 of Hyndman and Fan (1996) who refer to it as
“a very old definition, proposed by Hazen (1914) and popular among hydrologists
... .” It is used by *Mathematica* in calculating “Quartiles” or “InterpolatedQuantiles.”

Other texts use a method which is used by MINITAB.

**METHOD 11 (“MINITAB”)**: The *P*^{th} percentile value is found by taking that value with
#((*n* + 1)*p*). If (*n* + 1)*p* is not an integer, then interpolate between
$x_{\lfloor (n+1)p \rfloor}$ and $x_{\lceil (n+1)p \rceil}$ as
explained previously. For example, if *S*_{5} = (1, 2, 3, 4, 5) and *p* = 0.25, then
#((*n* + 1)*p*) = #(1.5) and hence *Q*_{1} = 1.5.
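As a rough sketch of Method 11, assuming that "interpolate" means linear interpolation between the two neighboring order statistics and that the position (*n* + 1)*p* falls between 1 and *n* (the function name `minitab_percentile` is illustrative and is not a MINITAB command):

```python
import math


def minitab_percentile(data, p):
    """Method 11: value at position (n + 1)p, linearly interpolated."""
    xs = sorted(data)
    k = (len(xs) + 1) * p      # assumed to satisfy 1 <= k <= n
    lo = math.floor(k)
    frac = k - lo
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])


s5 = [1, 2, 3, 4, 5]
print(minitab_percentile(s5, 0.25), minitab_percentile(s5, 0.75))  # positions 1.5 and 4.5
```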

This method is used by Mendenhall, Beaver and Beaver (2003),
Hogg and Tanis (1997), and by
Khazanie (1996), as well as by MINITAB and JMP (See
*JMP® User’s Guide* (1994), p. 159). It is also Definition 6 of
Hyndman and Fan (1996) who refer to
Weibull (1939) and Gumbel (1939).
It is Method 5 of Joarder and Firozzaman (2001), Method 2 of
Wessa (2006), and it can also be found in
Snedecor (1946), p. 51. It is also the PCTLDEF = 4 option of the SAS
System computer package. Method 7 of Wessa, which he calls the
“TrueBasic” method is similar to this except it uses a “backwards interpolation”; for example,
*x*_{2.25} is calculated as one quarter of the way from *x*_{3} back to
*x*_{2}.

*Microsoft Excel* has a built-in quartile and percentile routine. Under its “Help Topics,” *Excel* states that
“If *k* is not a multiple of 1/(*n* – 1), PERCENTILE interpolates to
determine the value at the *k*’th percentile.” This implies that the method is given by the following:

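**METHOD 12 (“*Excel*”)**: To calculate the *P*^{th} percentile value, take #((*n* – 1)*p* + 1), interpolating if (*n* – 1)*p* + 1 is not an integer.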

I have not seen this method used by any textbook, but it is Method 7 of Hyndman and Fan (1996) who refer to Gumbel (1939). It can also be found in Freund and Perles (1987) and is Method 5 of Wessa (2006).

Note that all of the first twelve methods with the exception of the Lohninger Method 6 are what I call *symmetric*.
That is, the two quartiles *Q*_{1} and *Q*_{3} have equal depth in the
sense of Tukey. Symbolically, if *Q*_{1} = #(*q*_{1} ) and
*Q*_{3} = #(*q*_{3} ) then
*q*_{1} + *q*_{3} = *n* + 1. You can verify that this is indeed true
by looking at Table 2.
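The symmetry condition can also be verified directly from the position formulas, without reference to any particular data set. The sketch below does this for the methods whose positions are given explicitly above (Methods 7, 8, 10, and 11) and for the Lohninger Method 6; the function names and dictionary keys are illustrative.

```python
import math
from fractions import Fraction

HALF = Fraction(1, 2)


def quartile_positions(n):
    """Positions (q1, q3) with Q1 = #(q1) and Q3 = #(q3), keyed by method."""
    odd = n % 2 == 1
    return {
        "7 Vining": (Fraction(n + 3 if odd else n + 2, 4),
                     Fraction(3 * n + 1 if odd else 3 * n + 2, 4)),
        "8 J&F": (Fraction(n + 1 if odd else n + 2, 4),
                  Fraction(3 * n + 3 if odd else 3 * n + 2, 4)),
        "10 H&L-2": (Fraction(n, 4) + HALF, Fraction(3 * n, 4) + HALF),
        "11 MINITAB": (Fraction(n + 1, 4), Fraction(3 * (n + 1), 4)),
        # Method 6 rounds (n + 1)p to the nearest integer, ties always up.
        "6 Lohninger": (math.floor(Fraction(n + 1, 4) + HALF),
                        math.floor(Fraction(3 * (n + 1), 4) + HALF)),
    }


def asymmetric_methods(n):
    """Names of the methods for which q1 + q3 differs from n + 1."""
    return [name for name, (q1, q3) in quartile_positions(n).items()
            if q1 + q3 != n + 1]


print(asymmetric_methods(5))  # only the Lohninger method breaks symmetry
print(asymmetric_methods(6))  # no method does when n = 6
```

In this sketch the Lohninger positions fail the condition exactly when *n* is of the form 4*k* + 1, which is the case in which (*n* + 1)*p* is half an odd integer.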

The SAS System, in its univariate procedures, offers the user five different options for computing percentiles, using its
“PCTLDEF =” option. (See *SAS® Procedures Guide* (1990), p. 625.) As
noted before, the default option, PCTLDEF = 5 (“empirical distribution function with averaging”), is the same as our
Method 4 (“CDF”) and the PCTLDEF = 4 option is the same as our Method 11 (“MINITAB”). The first three options,
PCTLDEF = 1, 2, and 3, in certain circumstances give values for the median that are not consistent with the usual
definition. We present them here for completeness, but we shall not consider them further.

**METHOD 13 (“SAS-1”)**: To calculate the *P*^{th} percentile take #(*np*) with interpolation.
SAS refers to this as “PCTLDEF = 1.” This method gives in every case values for the median which are not the same as the
usual values. For example, if *S*_{3} = (1, 2, 3), this method would give the median as 1.5
rather than 2.

This method is Definition 4 of Hyndman and Fan (1996) who refer to
Parzen (1979) and is Method 1 of Wessa
(2006). It is also used by *Mathematica* in calculating “AsymmetricQuartiles.”

**METHOD 14 (“SAS-2”)**: To calculate the *P*^{th} percentile take *x _{k}* where

This method is Definition 3 of Hyndman and Fan (1996). A similar method is
Method 6 of Wessa (2006), which he refers to as the “closest observation”
method. Wessa’s method is: To calculate the *P*^{th} percentile take *x _{k}* where

**METHOD 15 (“SAS-3”)**: To calculate the *P*^{th} percentile take
$x_{\lceil np \rceil}$. SAS refers to this as “PCTLDEF = 3,” the “empirical distribution
function” method. It is not hard to see that this gives the usual value for the median if *n* is odd, but not if
*n* is even.
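The behavior of Methods 13 and 15 on the median can be illustrated with a short sketch. Two readings are assumed here: Method 13 is #(*np*) with linear interpolation, and Method 15 takes the order statistic at position *np* rounded up, which is how SAS describes its "empirical distribution function" definition. The function names are illustrative and are not SAS syntax.

```python
import math


def sas1_percentile(data, p):
    """Method 13: #(np) with linear interpolation (assumes 1 <= np <= n)."""
    xs = sorted(data)
    k = len(xs) * p
    lo = math.floor(k)
    frac = k - lo
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])


def sas3_percentile(data, p):
    """Method 15, read as the value at position ceil(np) (assumes 0 < p <= 1)."""
    xs = sorted(data)
    return xs[math.ceil(len(xs) * p) - 1]


print(sas1_percentile([1, 2, 3], 0.5))     # not the usual median 2
print(sas3_percentile([1, 2, 3], 0.5))     # usual median when n is odd
print(sas3_percentile([1, 2, 3, 4], 0.5))  # not the usual median 2.5 when n is even
```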

This method is Definition 1 of Hyndman and Fan (1996) and Method 3 of
Wessa (2006). It is also used by *Mathematica* in calculating “Quantiles.”

For the convenience of the user of calculator/computer statistical packages, we now give a table which gives the method each such package uses.

Computer/Calculator Package | Uses Following Method |
---|---|
SAS PCTLDEF = 1; *Mathematica* “AsymmetricQuartiles” | Method 13 (“SAS-1”) |
SAS PCTLDEF = 2 | Method 14 (“SAS-2”) |
SAS PCTLDEF = 3; *Mathematica* “Quantiles” | Method 15 (“SAS-3”) |
SAS PCTLDEF = 4; MINITAB; JMP | Method 11 (“MINITAB”) |
SAS PCTLDEF = 5 (default) | Method 4 (“CDF”) |
*Mathematica* “Quartiles” or “InterpolatedQuantiles” | Method 10 (“H&L-2”) |
*Excel* (late versions) | Method 12 (“Excel”) |
TI-83 Plus, TI-84 Plus, TI-89 | Method 2 (“Exclusive”) |
MINITAB EDA “letter values” | Method 3 (“Tukey”), equivalent to Method 1 (“Inclusive”) |

A little thought will show that if we are considering just quartiles, then the results that the various methods give depend
only on the congruence class (mod 4) in which *n* falls, that is, on the remainder that occurs when *n* is
divided by 4. It is also possible to show by taking the four cases of *n*= 4*k*, *n*= 4*k* + 1,
*n*= 4*k* + 2, *n*= 4*k* + 3 that we need look at only four “canonical” data sets:
*S*_{4}, *S*_{5}, *S*_{6},
*S*_{7}, consisting of (1, 2, 3, 4),
(1, 2, 3, 4, 5), (1, 2, 3, 4, 5, 6), and (1, 2, 3, 4, 5, 6, 7) respectively. (In a sense we are simply looking at the
position of the data value in the data set, rather than its actual numerical value.) As was observed by
Peck, Olsen, and Devore (2001), two methods are the same if and only if they
agree on these four data sets. (With one exception: Method 14. However we are not considering this method.) Here is a
table (Table 2) comparing the lower and upper quartile values
(*Q*_{1}, *Q*_{3}) given by each of the methods for each of the four
canonical data sets, together with the interquartile range (IQR).

Data sets: S_{4} = (1, 2, 3, 4); S_{5} = (1, 2, 3, 4, 5); S_{6} = (1, 2, 3, 4, 5, 6); S_{7} = (1, 2, 3, 4, 5, 6, 7).

Method | S_{4}: (Q_{1}, Q_{3}) | S_{4}: IQR | S_{5}: (Q_{1}, Q_{3}) | S_{5}: IQR | S_{6}: (Q_{1}, Q_{3}) | S_{6}: IQR | S_{7}: (Q_{1}, Q_{3}) | S_{7}: IQR |
---|---|---|---|---|---|---|---|---|
1 “Inclusive” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2.5, 5.5) | 3 |
2 “Exclusive” | (1.5, 3.5) | 2 | (1.5, 4.5) | 3 | (2, 5) | 3 | (2, 6) | 4 |
3 “Tukey” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2.5, 5.5) | 3 |
4 “CDF” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
5 “M&S” | (1, 4) | 3 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
6 “Lohninger” | (1, 4) | 3 | (2, 5) | 3 | (2, 5) | 3 | (2, 6) | 4 |
7 “Vining” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2.5, 5.5) | 3 |
8 “J&F” | (1.5, 3.5) | 2 | (1.5, 4.5) | 3 | (2, 5) | 3 | (2, 6) | 4 |
9 “H&L” | (1.5, 3.5) | 2 | (1.5, 4.5) | 3 | (2, 5) | 3 | (2.5, 5.5) | 3 |
10 “H&L-2” | (1.5, 3.5) | 2 | (1.75, 4.25) | 2.5 | (2, 5) | 3 | (2.25, 5.75) | 3.5 |
11 “MINITAB” | (1.25, 3.75) | 2.5 | (1.5, 4.5) | 3 | (1.75, 5.25) | 3.5 | (2, 6) | 4 |
12 “Excel” | (1.75, 3.25) | 1.5 | (2, 4) | 2 | (2.25, 4.75) | 2.5 | (2.5, 5.5) | 3 |

We can make several observations from the table. The Tukey Method 3 and the Vining Method 7 are seen to be the same as the Inclusive Method 1, whereas the J&F Method 8 is seen to be the same as the Exclusive Method 2. Henceforth, we shall not consider these to be separate methods. The first nine methods can be thought of as “averaging” methods, since their quartile (indeed, percentile values in the cases of the CDF Method 4 and the H&L Method 9) are always individual data values or halfway between two successive data values. The last three methods can be thought of as “interpolation” methods, since their quartile (and percentile) values may lie elsewhere between successive data values.

The M&S Method 5 and the Lohninger Method 6 are unique in the sense that they give only values which are data values
themselves. The other averaging methods all agree if *n* is even, whereas if *n* is odd, then the CDF Method 4
agrees with the Inclusive Method 1 if *n* is of the form 4*k* + 1 and with the Exclusive Method 2 if *n*
is of the form 4*k* + 3, whereas exactly the opposite is true for the H&L Method 9. Therefore these four methods
(remember that Methods 3, 7, and 8 are redundant) exhaust all possibilities for the inclusion and exclusion of the median
value in the “top-half, bottom-half” idea. More precisely, the Inclusive Method 1 *includes* the median
(in *both* halves) in
*both* of the cases 4*k* + 1 and 4*k* + 3; the Exclusive Method 2 *excludes* it in *both* of
the cases; the
CDF Method 4 *includes* it in the case 4*k* + 1 and *excludes* it in the case 4*k* + 3; and the
H&L Method 9
*excludes* it in the case 4*k* + 1 and *includes* it in the case 4*k* + 3.

The three interpolation methods can be thought of as different generalizations of the median value as
#(0.5(*n* – 1) + 1) = #(0.5*n* + 0.5) = #(0.5(*n* + 1)). The *Excel* Method 12 looks at the first form, the H&L-2
Method 10 looks at the second, and the MINITAB Method 11 looks at the third. As was noted by
Freund and Perles (1987), these three methods when applied to the quartiles
*Q _{i}* (

The interpolation methods can be viewed as various methods of “smearing” the data values so that the “stair-step” CDF is
replaced by a piecewise linear function from which the percentiles are calculated as they would be for a continuous
distribution. (Cf. the method discussed in the appendix.) See Journet (1999)
and John (2000) for graphs of some of these functions. If the data values are
distinct, this is fairly straightforward, but if there are repeated values, difficulties arise. For example, one would
expect that the quartile values for the data set *S*_{5} = (1, 2, 3, 4, 5) would be the same as for
the data set 2*S*_{5} = (1, 1, 2, 2, 3, 3, 4, 4, 5, 5), but they are not for Methods 10 and 11 as can
be seen by comparing Table 2 and
Table 3. Similarly Method 12 gives different results on the data sets
*S*_{7} = (1, 2, 3, 4, 5, 6, 7) and 2*S*_{7} =
(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7).
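The effect of doubling a data set can be reproduced with a short sketch. The three interpolation methods are treated here as differing only in the position formula, with linear interpolation between neighboring order statistics. The formula used for Method 12, (*n* – 1)*p* + 1, is an assumed reading of the *Excel* help text; it reproduces the Method 12 rows of Tables 2 and 3. All names in the code are illustrative.

```python
import math

# Position of the P-th percentile in the ordered data, by method.
POSITION = {
    "10 H&L-2": lambda n, p: n * p + 0.5,
    "11 MINITAB": lambda n, p: (n + 1) * p,
    "12 Excel": lambda n, p: (n - 1) * p + 1,   # assumed reading of Excel's rule
}


def value_at(xs, k):
    """#(k) on a sorted list, with linear interpolation for fractional k."""
    lo = math.floor(k)
    frac = k - lo
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])


def quartiles(data, method):
    xs = sorted(data)
    position = POSITION[method]
    return tuple(value_at(xs, position(len(xs), p)) for p in (0.25, 0.75))


def doubled(data):
    """The data set with every value repeated once, e.g. 2S_5 from S_5."""
    return sorted(list(data) * 2)


s5, s7 = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7]
for method in POSITION:
    print(method, quartiles(s5, method), quartiles(doubled(s5), method))
print("12 Excel", quartiles(s7, "12 Excel"), quartiles(doubled(s7), "12 Excel"))
```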

We see that we now have an entire infinite family of possible interpolation methods! For each of these, we can obtain other possible methods by “rounding” (i. e., by rounding to the nearest integer except when we get a value which is half an odd integer as in the CDF Method 4) and by “complete rounding” (i. e., by rounding to the nearest integer, with some rule as to what to do when we get a value which is half an odd integer as in Methods 5 and 6). For example, the CDF Method 4 is the case of = 1/2 with rounding, and Method 6 of Wessa (2006) is the same case with complete rounding. Method 8 of Wessa (2006) is the case of = 1 with rounding, whereas the M&S Method 5 and the Lohninger Method 6 are the same case with two different kinds of complete rounding.

Finally, looking at the IQRs, we can see, for example, that in every case, the *Excel* Method gives IQR values which are no
larger than those given by any other method. We can summarize all such relationships in the following diagram
(Figure 1) where if Method A lies above Method B in the figure, then the IQR
values of Method A are at least as large as those of Method B in every case.

DEFINITION 2: A *P*^{th} percentile value is a number which puts at least *P* percent of the data values at that number or below and at least (100 - *P*) percent of the data values at that number or above. If more than one such number exists, there will be an entire interval of such and we choose the *P*^{th} percentile value to be the midpoint of that interval.

The question remains, how are such values to be found? We claim that it is the CDF Method 4 which does the job. That the CDF Method meets the definition for all percentiles is not totally obvious and we include a proof for completeness.

**THEOREM**: The CDF Method 4 provides the *P*^{th} percentile value for all possible values of *P*.

**PROOF**: We first assume for the sake of simplicity that the data values are all distinct and are ordered. Consider
the random variable *X* which puts probability 1/*n* at each data value and let *F* be its cumulative
distribution function (CDF). In terms of the CDF, a number *x* is *a* *P*^{th} percentile value
(note the article) if and only if $P(X \le x) = F(x) \ge p$ and
$P(X \ge x) \ge 1 - p$. But $P(X \ge x) = 1 - F(x^-)$ where
$F(x^-) = \lim_{t \to x^-} F(t)$, so we have that a necessary and sufficient condition that *x* be a
*P*^{th} percentile value is that

$$F(x^-) \le p \le F(x). \tag{1}$$

We see then that we have two cases:

**Case 1**: The line *y* = *p* does *not* intersect the graph of *y* = *F*(*x*); it passes
through a jump at *x* = *x _{k + 1}*. This occurs if and only if

$$\frac{k}{n} < p < \frac{k + 1}{n}.$$

That is, this occurs if and only if *np* is not an integer and lies between *k* and *k* + 1. It is easy to
see that *x* = *x _{k + 1}* is the only value of

**Case 2**: The line *y* = *p* does intersect the graph of *y* = *F*(*x*). Since the graph of
the CDF has a “stair-step” shape, the line must intersect the graph along an entire interval, say the interval
[*x _{k}*,

If there are repeated values, the argument is similar. Suppose, for example that *x _{k - 1}* <

A little thought will show that if we are talking only about quartiles, then to meet Definition 2, the first quartile
values *Q*_{1} for *S*_{4}, *S*_{5},
*S*_{6}, *S*_{7} would have to be 1.5, 2, 2, and 2 respectively, as *any*
number between 1 and 2 inclusive would serve as a 25^{th} percentile value for *S*_{4}.
The Lohninger Method 6 does not even provide a 75^{th} percentile value in the case of *S*_{5},
but it appears that the M&S Method 5 gives quartile values consistent with the first part of Definition 2 anyway. This
is true, but the M&S Method fails to give values which meet even the first part of Definition 2 for other quantiles. As
an example, consider finding the second decile value *D*_{2}
(i. e. the first quintile) of *S*_{6}. Then (*n* + 1)*p* = 7/5 = 1.4 which rounds to 1,
implying that *D*_{2} = 1. But this puts only 1/6 = 17% of the data values at or below
*D*_{2}, rather than the required 20%. Looking at Table 2
we can see that the CDF Method 4 is the only method that provides quartile values consistent with the complete Definition 2.
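The first part of Definition 2 is easy to test for any proposed value. The following sketch (the name `meets_definition_2` is illustrative) counts the data values at or below and at or above a candidate and compares the counts with *np* and *n*(1 – *p*); *p* is passed as a `Fraction` so that the comparisons are exact. It does not test the midpoint convention in the second part of the definition.

```python
from fractions import Fraction


def meets_definition_2(data, value, p):
    """First part of Definition 2: is `value` a P-th percentile value (p = P/100)?"""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= value)
    at_or_above = sum(1 for x in data if x >= value)
    return at_or_below >= n * p and at_or_above >= n * (1 - p)


s5 = [1, 2, 3, 4, 5]
s6 = [1, 2, 3, 4, 5, 6]

# Lohninger's Q3 = 5 for S_5 leaves only 1 of 5 values at or above it.
print(meets_definition_2(s5, 5, Fraction(3, 4)))
# The M&S value Q3 = 4 for S_5 does qualify.
print(meets_definition_2(s5, 4, Fraction(3, 4)))
# The M&S second decile D_2 = 1 for S_6 puts only 1 of 6 values at or below it.
print(meets_definition_2(s6, 1, Fraction(1, 5)))
```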

Data sets: 2S_{4} = (1, 1, 2, 2, 3, 3, 4, 4); 2S_{5} = (1, 1, 2, 2, 3, 3, 4, 4, 5, 5); 2S_{6} = (1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6); 2S_{7} = (1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7).

Method | 2S_{4}: (Q_{1}, Q_{3}) | 2S_{4}: IQR | 2S_{5}: (Q_{1}, Q_{3}) | 2S_{5}: IQR | 2S_{6}: (Q_{1}, Q_{3}) | 2S_{6}: IQR | 2S_{7}: (Q_{1}, Q_{3}) | 2S_{7}: IQR |
---|---|---|---|---|---|---|---|---|
1 “Inclusive” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
2 “Exclusive” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
4 “CDF” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
5 “M&S” | (1, 4) | 3 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
6 “Lohninger” | (1, 4) | 3 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
9 “H&L” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
10 “H&L-2” | (1.5, 3.5) | 2 | (2, 4) | 2 | (2, 5) | 3 | (2, 6) | 4 |
11 “MINITAB” | (1.25, 3.75) | 2.5 | (1.75, 4.25) | 2.5 | (2, 5) | 3 | (2, 6) | 4 |
12 “Excel” | (1.75, 3.25) | 1.5 | (2, 4) | 2 | (2, 5) | 3 | (2.25, 5.75) | 3.5 |

I offer the following proposal for classroom use: *Define* the quartiles by using the “25% below, 75% above” idea and
present the Inclusive and Exclusive Methods 1 and 2, discussing the problem of the “middle measurement.” Then tell the
students that if they could split the middle measurement in half (one might discuss the doubling idea), they would get
quartile values that meet the definition. Then use the following method to calculate the quartiles. As noted before, the
CDF Method 4 *includes* the middle measurement in the case of *n*= 4*k* + 1 and *excludes* it in the case of
*n*= 4*k* + 3. But in each of these cases, we end up with an odd number of data values in both of the top and
bottom halves. Thus the following method is equivalent to the CDF Method 4, yet has the flavor of the Inclusive and
Exclusive Methods 1 and 2 and thus should be more accessible to students.

**SUGGESTED METHOD**: Divide the data set into two halves, a bottom half and a top half. If *n* is odd, include
or exclude the median in the halves so that each half has an *odd* number of elements. The lower and upper quartiles are
then the medians of the bottom and top halves respectively.
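As a sketch of how the suggested method might be written down, and as a check of its claimed equivalence with the CDF Method 4, consider the following. The function names are illustrative; the comparison is run on the data sets (1, 2, ..., *n*) only, which suffices because both methods depend only on positions in the ordered data.

```python
import math
from fractions import Fraction


def median(xs):
    """Median of an already sorted list."""
    n = len(xs)
    return (xs[(n - 1) // 2] + xs[n // 2]) / 2


def suggested_quartiles(data):
    """Halve the data; for odd n, keep or drop the median so each half has odd size."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1 and half % 2 == 0:
        # n = 4k + 1: including the median gives halves of odd size half + 1.
        bottom, top = xs[:half + 1], xs[half:]
    else:
        # n = 4k + 3: excluding the median gives halves of odd size half.
        # n even: there is no middle value, so the halves are just split.
        bottom, top = xs[:half], xs[n - half:]
    return median(bottom), median(top)


def cdf_percentile(data, p):
    """CDF Method 4 for 0 < p < 1, with p given as a Fraction."""
    xs = sorted(data)
    n_p = len(xs) * p
    if n_p.denominator == 1:
        k = int(n_p)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.ceil(n_p) - 1]


def cdf_quartiles(data):
    return cdf_percentile(data, Fraction(1, 4)), cdf_percentile(data, Fraction(3, 4))


# The two methods agree on (1, ..., n) for every n tried.
print(all(suggested_quartiles(range(1, n + 1)) == cdf_quartiles(range(1, n + 1))
          for n in range(4, 100)))
```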

I have not yet had the opportunity to test this method in the classroom, but in a statistics class I recently taught, I
used Hogg and Ledolter (1992). Not wishing to change the definition of
quartiles given in the book, I used the equivalent form which says: Divide the data set into two halves, a bottom half and
a top half. If *n* is odd, include or exclude the median in the halves so that each half has an *even* number of
elements. The lower and upper quartiles are then the medians of the bottom and top halves respectively. The class had no
trouble using this definition and thought that it was much easier to apply than the form given in the book. I expect that
the situation will be the same in using the suggested method.

Bain, L. J. and Engelhardt, M. (1992), *Introduction to Probability and Mathematical Statistics* (2^{nd} ed.),
Belmont, CA: Duxbury Press.

Benard, A. and Bos-Levenbach, E. C. (1953), “Het Uitzetten van Waarnemingen op Waarschijnlijkheitspapier,” *Statistica*,
7, 163-173.

Blom, G. (1958), *Statistical Estimates and Transformed Beta-Variables*, New York: John Wiley & Sons.

Brase, C. H. and Brase, C. P. (2003), *Understandable Statistics (Concepts and Methods)* (7^{th} ed.),
Lexington, MA: D. C. Heath and Company.

Dr. Twe (2002), Reply to “Tom” about quartiles, online at mathforum.org/library/drmath/view/60969.html.

Freund, J. E. and Perles, B. M. (1987), “A new look at quartiles of ungrouped data,” *The American Statistician*,
41(3), 200-203.

---------- (2004), *Statistics: A First Course* (8^{th} ed.), Upper Saddle River, NJ: Pearson Prentice Hall.

Gumbel, E. J. (1939), “La Probabilité des Hypothèses,” *Comptes Rendus de l’Académie des Sciences* (Paris), 209,
645-647.

Hayden, R. (1997), “Ticky-Tacky Boxes,” online at either exploringdata.cqu.edu.au/docs/tt_box2.doc or exploringdata.cqu.edu.au/ticktack.htm

Hazen, A. (1914), “Storage to be Provided in Impounding Reservoirs for Municipal Water Supply,” (with discussion),
*Transactions of the American Society of Civil Engineers*, 77, 1539-1669.

Hoaglin, D. C. (1983), “Letter Values: A Set of Selected Order Statistics” in Hoaglin, D. C., Mosteller, F., and Tukey, J.
W. (Editors), *Understanding Robust and Exploratory Data Analysis*, New York: John Wiley & Sons.

Hoel, P. G. (1966), *Elementary Statistics* (2^{nd} ed.), New York: John Wiley & Sons.

Hogg, R. V. and Ledolter, J. (1992), *Applied Statistics for Engineers and Physical Scientists*, New York: Macmillan.

Hogg, R. V. and Tanis, E. A. (1997), *Probability and Statistical Inference* (5^{th} ed.), Upper Saddle River,
NJ: Prentice Hall.

Hyndman, R. J. and Fan, Y. (1996), “Sample quantiles in statistical packages,” *The American Statistician*, 50(4),
361-365.

Joarder, A. H. and Firozzaman, M. (2001), “Quartiles for discrete data,” *Teaching Statistics*, 23, 86-89.

John, R. (2000), “How Statistics Packages Calculate Sample Quartiles”; an earlier version of this paper entitled “How to Calculate a Quartile (If You Must),” can be found online at www.maths.murdoch.edu.au/units/statsnotes/samplestats/quartilesmore.html

Johnson, R. A. (2000), *Miller and Freund’s Probability and Statistics for Engineers* (6^{th} ed.), Upper
Saddle River, NJ: Prentice Hall.

Johnson, R. A. and Bhattacharyya, G. K. (1996), *Statistics - Principles and Methods* (3^{rd} ed.), New York:
John Wiley & Sons.

Journet, D. (1999), “Quartiles: How to Calculate Them?” online at www.haiweb.org/medicineprices/manual/quartiles_iTSS.pdf

Khazanie, R. (1996), *Statistics in a World of Applications* (4^{th} ed.), New York: HarperCollins.

Lohninger, H. (1999), *Teach/Me Data Analysis*, Berlin-New York-Tokyo: Springer-Verlag.

McClave, J. T. and Sincich, T. (2003), *A First Course in Statistics* (8^{th} ed.), Upper Saddle River, NJ:
Prentice Hall.

Mendenhall, W., Beaver, R. J., and Beaver, B. M. (2003), *Introduction to Probability and Statistics* (11^{th}
ed.), Pacific Grove, CA: Brooks/Cole-Thomson.

Mendenhall, W. and Sincich, T. (1995), *Statistics for Engineering and the Sciences* (4^{th} ed.), Upper
Saddle River, NJ: Prentice Hall.

Milton, J. S., McTeer, P. M., and Corbet, J. J. (1997), *Introduction to Statistics*, New York: McGraw-Hill.

Moore, D. S. (1996), *Statistics - Concepts and Controversies* (4^{th} ed.), New York: W. H. Freeman and Co.

---------- (2003), *The Basic Practice of Statistics* (3^{rd} ed.), New York: W. H. Freeman and Co.

Moore, D. S. and McCabe, G. P. (2003), *Introduction to the Practice of Statistics* (4^{th} ed.), New York:
W. H. Freeman and Company.

Parrish, R. S. (1990), “Comparison of quantile estimators in normal sampling,” *Biometrics*, 46, 247-257.

Parzen, E. (1979), “Nonparametric Statistical Data Modeling” (with discussion), *Journal of the American Statistical
Association*, 74, 105-131.

Peck, R., Olsen, C., and Devore, J. (2001), *Introduction to Statistics and Data Analysis*, Pacific Grove, CA:
Duxbury Press.

Ross, S. M. (1996), *Introductory Statistics*, New York: McGraw-Hill.

SAS Institute, Inc. (1990), *SAS® Procedures Guide, Version 6* (3^{rd} ed.), Cary, NC: SAS Institute,
Inc.

SAS Institute, Inc. (1994), *JMP® User’s Guide, Version 3*, Cary, NC: SAS Institute, Inc.

Siegel, A. F. and Morgan, C. J. (1996), *Statistics and Data Analysis - An Introduction* (2^{nd} ed.), New
York: John Wiley & Sons.

Smith, P. J. (1997), *Into Statistics: A Guide to Understanding Statistical Concepts in Engineering and the Sciences*,
Berlin; New York: Springer.

Snedecor, G. W. (1946), *Statistical Methods Applied to Experiments in Agriculture and Biology* (4^{th} ed.),
Ames, IA: Iowa State College Press.

*TI-83 Plus Graphing Calculator Guidebook*, Texas Instruments Inc. (1999)

Tukey, J. W. (1977), *Exploratory Data Analysis*, Reading, MA: Addison-Wesley.

Vining, G. G. (1998), *Statistical Methods for Engineers*, Pacific Grove, CA: Duxbury Press.

Weibull, W. (1939), “The Phenomenon of Rupture in Solids,” *Ingeniörs Vetenskaps Akademien Handlingar*, 153, 17.

Wessa, P. (2006), *Free Statistics Software*, Office for Research Development and Education, version 1.1.18, online at
www.wessa.net

Eric Langford

Department of Mathematics and Statistics

California State University, Chico

Chico, CA 95929-0525

U.S.A.
*elangford@csuchico.edu*
