An Investigation of the Median-Median Method of Linear Regression

Elizabeth J. Walters, Christopher H. Morrell, and Richard E. Auer
Loyola College of Maryland

Journal of Statistics Education Volume 14, Number 2 (2006), jse.amstat.org/v14n2/morrell.html

Copyright © 2006 by Elizabeth J. Walters, Christopher H. Morrell, and Richard E. Auer, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.

Key Words: Least squares line.

Abstract

Least squares regression is the most common method of fitting a straight line to a set of bivariate data. A less well-known method, available on Texas Instruments graphing calculators, is median-median regression. It is proposed as a simple method that may be used with middle and high school students to motivate the idea of fitting a straight line to data. The median-median line may also be viewed as a method that is not greatly affected by outliers (robust to outliers). Our paper briefly reviews the median-median regression method, considers various examples to compare the median-median line to the least squares line, and investigates the properties of the median-median line versus the least squares line using a simulation study.

1. Introduction

Two trends have had a great impact on statistics education. One trend involves the continual advancement in computer technology that has systematically increased the power and refinement of statistical data analysis. In the 1960's, computers revolutionized statistical methodology. During the 1970's, personal computers placed computing capability right on the desks of individual researchers. This evolution has continued and now grammar school students learn data skills literally in the palm of their hands using graphing calculators.

The second trend is the introduction of statistical ideas earlier in the school curriculum. Younger students are being exposed to the question of what truths lurk beneath the surface of data. These students are now being trained to use statistics as the primary tool for researching any topic within any field.

Evidence of both trends is seen when elementary and secondary school students use graphing calculators to perform data analysis on exams and homework assignments. Much of the early statistical training is being accomplished by teaching the basics of what has been termed exploratory data analysis (EDA).

As the first pioneer in exploratory data analysis, Tukey (1971, p. v) effectively defined EDA as “… looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights.” In his recommendations for what type of activity belongs in introductory statistics coursework, Hahn (1988, p. 27) said: “When we get to data analysis, we should stress graphical methods or exploratory data analysis, over formal statistical procedures.” Cobb and Moore (1997, p. 815) echoed the sentiments of Hahn saying: “Students like exploratory analysis and find that they can do it, a substantial bonus when teaching a subject feared by many. Engaging them early on in the interpretation of results, before the harder ideas come to their attention, can help establish good habits when you get to inference.”

Similar thinking was the basis for the Quantitative Literacy program of the mid-1980’s which was sponsored by the American Statistical Association (2005a) and the National Council of Teachers of Mathematics Joint Committee on the Curriculum in Statistics and Probability. Funded in part by the National Science Foundation, this program involves “… education workshops and written materials to help elementary and secondary school teachers make statistics more accessible to students … with an emphasis on graphical techniques” (American Statistical Association, 2005b). These written works were made available by Dale Seymour Publications through a series of four books. All four of these books focus on a “hands-on” approach that utilizes data very familiar to students from areas like sports, television, music, etc. One of these, Exploring Data by Landwehr and Watkins (1986), introduces students to a method of fitting a line through bivariate data using a simple application of the median.

One major goal of young statisticians as they learn exploratory data analysis is studying the relationship between two variables. When the data are bivariate with two numerical variables, this usually involves fitting a straight line through points on a scatterplot. The graphing calculator is programmed with two methods of line fitting for bivariate data. One method, fitting the least squares line, is a procedure well studied and routinely applied. The median-median line, on the other hand, is not so well known or understood.

This paper considers the performance of these two line-fitting techniques as an aid to budding statisticians and their teachers as they encounter exploratory data analysis. We intend to illuminate the behavior of the relatively unknown median-median line with several data sets and through a small simulation study. Readers may compare how the median-median line reacts in various data settings to the performance of the least squares line.

2. Historical Evolution of the Median-Median Line

The median-median line traces back to a line-fitting approach proposed by Wald (1940). He suggested a very simple method where the points on a scatterplot are separated into a left half and a right half based on the median of the x-scores in a sample of bivariate data. The means of the x-scores and the y-scores are calculated using the data from only the left half of the scatterplot and then calculated using only the data on the right half. Concentrating on the two points $(\bar{x}_L, \bar{y}_L)$ and $(\bar{x}_R, \bar{y}_R)$, Wald proposed finding a line connecting these points that is then adjusted up or down to better fit the full array of points on the scatterplot. (The subscripts R and L denote which half of the data is used and no subscript implies the means are taken over the entire sample.) His line of fit has slope

$$\frac{\bar{y}_R - \bar{y}_L}{\bar{x}_R - \bar{x}_L}$$

and y-intercept

$$\bar{y} - \frac{\bar{y}_R - \bar{y}_L}{\bar{x}_R - \bar{x}_L}\,\bar{x}.$$

A similar procedure suggested by Nair and Shrivastava (1942) breaks up the points on a scatterplot into three regions with each region containing about the same number of points. The means of the x and y points in the left and right regions are used to find the slope of the line of fit much as Wald suggested.
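Wald's two-group construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wald_line` is hypothetical, points whose x-score equals the median are simply left out of both halves, and the intercept is taken to be the overall mean of y minus the slope times the overall mean of x, as the note on subscripts above suggests.

```python
from statistics import mean, median

def wald_line(xs, ys):
    """Two-group line: slope from the half means, intercept from the overall means.

    Assumption: points with x equal to the median are dropped from both halves.
    """
    x_med = median(xs)
    left = [(x, y) for x, y in zip(xs, ys) if x < x_med]
    right = [(x, y) for x, y in zip(xs, ys) if x > x_med]
    # Means of x and y within each half of the scatterplot
    xl, yl = mean(p[0] for p in left), mean(p[1] for p in left)
    xr, yr = mean(p[0] for p in right), mean(p[1] for p in right)
    slope = (yr - yl) / (xr - xl)
    # Shift the line so that it passes through the overall mean point
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

# Points lying exactly on y = 2x + 1 are recovered exactly
print(wald_line([1, 2, 3, 4, 5, 6], [3, 5, 7, 9, 11, 13]))
```

The three-region variant of Nair and Shrivastava would differ only in how the left and right groups are chosen.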

Brown and Mood (1951) used the two-region approach but found the slope of the line of fit using medians in place of means. The primary advantage of using this measure of center comes from the median’s inherent ability to resist the strong effect of outliers. Most students of statistics know that the mean can be affected greatly by outliers since they are included with equal weight with the rest of the data in the sum of the scores. But the median takes on the same value whether the largest score in a data set is just somewhat larger than the rest of the data or is much larger than the second biggest score. In the context of fitting points on a scatterplot, this implies that a single point far from the general sloping trend of the rest of the points would not apply such a large “tug” on the location of the line of fit if the median is used to find the line.
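A small numerical illustration of this resistance, using made-up scores that are not taken from any data set in this paper:

```python
from statistics import mean, median

scores = [2, 4, 5, 7, 9]
mild = scores[:-1] + [12]      # largest score somewhat larger than the rest
extreme = scores[:-1] + [900]  # largest score much larger than the rest

# The median is the same in both cases; the mean is pulled toward the extreme score.
print("medians:", median(mild), median(extreme))
print("means:  ", mean(mild), mean(extreme))
```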

Readers may recall that the least squares line is found by minimizing the sum of the squared distances that each point lies from the line. Since these distances are squared, Hartwig and Dearing (1979, p. 34) noted that “… cases lying farther and farther from the regression line increase the sum of the squared residuals at an increasing rate ... [and the line] will have to come reasonably close to them to satisfy the least squares criterion and, therefore, the least squares regression line will lack resistance to the excessive influence of a few atypical cases.” This means that Brown and Mood’s method is not only simple to apply, but also has the advantage of not allowing outlying cases to have undue impact on determining the line of fit.

Like Brown and Mood, Tukey (1971) utilized the medians in finding his line of fit, but he did so borrowing the three-region approach of Nair and Shrivastava. His line of fit, called the resistant line, is considered a basic methodology of exploratory data analysis. The median-median line provides the first iteration in the procedure to find Tukey’s resistant line. To obtain Tukey’s resistant line, the residuals are used to adjust the parameters in an iterative fashion. Appendix A provides the algorithm used by Texas Instruments to compute the median-median line. Tukey’s resistant line may be obtained using Minitab under the EDA submenu of Stat.

For those who wish to learn more about the broad methodology of exploratory data analysis, Velleman and Hoaglin (1981) provide a gentle review of the entire subject. To learn more about the median-median line at a more sophisticated mathematical level, readers are encouraged to consider Emerson and Hoaglin (1983) and Johnstone and Velleman (1985).

3. Examples

In order to illustrate the comparative performance of the median-median line and the least squares line, we consider three example data sets from popular introductory statistics texts. The first contains no outliers. The second illustrates the impact of influential observations and outliers on the least squares line. The third example examines the data in two ways: one way has a single extreme influential observation while the other has a single outlier.

3.1 Manatees and Motor Boats

An investigation of the relationship between the number of manatees killed by boats in Florida and the number of powerboat registrations in that state, from 1977 to 1990 (Moore, 1997, p. 347), shows a strong positive correlation with no obvious outliers or influential points (r = 0.941, p-value < 0.001). To calculate the equation of the median-median line, one first divides the data into three regions by x-score (in this case, the number of powerboat registrations, in thousands) (see Table 1). In the case when there are ties in the x-scores that would result in these points being in different regions, all points with the same x-score are placed in an outer region. This may result in the middle region having fewer than 1/3 of the observations (see Appendix A and Appendix B). While this does not occur with the data for Example 3.1, it does happen with the data for Example 3.2.

Table 1. Manatee data divided into three regions based on x-score. In each region, the first column x is the number of powerboat registrations, in thousands, and the second column y is the number of manatees killed.

	Region 1		Region 2		Region 3	
	x	y	x	y	x	y
	447	13	513	24	614	33
	460	21	526	15	645	39
	481	24	559	34	675	43
	498	16	585	33	711	50
	512	20			719	47
median	481	20	542.5	28.5	675	43

Once the data are divided into three regions, the median of the x-scores and the median of the y-scores are calculated for each region. The resulting three points for these data are termed the median-median points: $M_1 = (481, 20)$, $M_2 = (542.5, 28.5)$, and $M_3 = (675, 43)$. The slope of the median-median line is the slope of the line passing through $M_1$ and $M_3$; that is, $(43 - 20)/(675 - 481) = 23/194 \approx 0.1186$. The position of the median-median line is determined by looking at the line passing through $M_1$ and $M_3$ and a parallel line passing through $M_2$. Moving the line connecting the two outer points one third of the way to the line through the point in the center region yields the median-median line (see Figure 1). Note that the line that connects the two outer points is based on two of the median-median points, while the line through $M_2$ is based on only this median-median point. Thus, choosing to move the first line one third of the way towards the second effectively makes the median-median line a weighted average of these two lines, with weights 2/3 and 1/3. The resulting y-intercept is $[(20 + 28.5 + 43) - (23/194)(481 + 542.5 + 675)]/3 \approx -36.62$ (see Appendix A; note that TI calls the slope a and the y-intercept b). This yields the equation of the median-median line: $\hat{y} = -36.62 + 0.1186x$. Note that the median-median and least-squares lines have similar equations, as is evident from the scatterplot (see Figure 2).
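The calculation above can be sketched in Python as follows. This is a minimal sketch, not the calculator's actual code: the function name `median_median_line` is hypothetical, each outer region is assumed to hold ceil(n/3) points (which reproduces the 5-4-5 split of Table 1), and tied x-scores that straddle a region boundary are not handled.

```python
from statistics import median

def median_median_line(xs, ys):
    """Return (slope, intercept) of the median-median line.

    Assumptions: outer regions each take ceil(n/3) points; no special
    handling of tied x-scores at the region boundaries.
    """
    pts = sorted(zip(xs, ys))          # order the points by x-score
    n = len(pts)
    k = -(-n // 3)                     # ceil(n / 3)
    if n - 2 * k < 1:
        raise ValueError("middle region is empty")
    regions = [pts[:k], pts[k:n - k], pts[n - k:]]
    mx = [median(p[0] for p in r) for r in regions]
    my = [median(p[1] for p in r) for r in regions]
    # Slope of the line through the two outer median-median points
    slope = (my[2] - my[0]) / (mx[2] - mx[0])
    # Equivalent to moving the outer line one third of the way to the middle point
    intercept = (sum(my) - slope * sum(mx)) / 3
    return slope, intercept

registrations = [447, 460, 481, 498, 512, 513, 526, 559, 585, 614, 645, 675, 711, 719]
killed = [13, 21, 24, 16, 20, 24, 15, 34, 33, 33, 39, 43, 50, 47]
print(median_median_line(registrations, killed))
```

Averaging the three intercepts implied by the median-median points gives the same result as the one-third shift described in the text, because the two outer points share one intercept.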

Figure 1. Illustration of the median-median method for the manatee data.

Figure 2. Scatterplot of manatees killed by powerboats versus number of boat registrations.

3.2 Mental Aptitude and Age at First Word

A scatterplot of scores on a test of mental aptitude (Gesell Adaptive Test) versus age at first word, for 21 children (Moore and McCabe, 2003, p. 161), shows a moderate negative correlation (r = -0.640, p-value = 0.002). Three points in this scatterplot are notable. Two children (■) began speaking at a much later age than the rest of the children. Another child (♦) scored much higher than we would expect, given his age at first word. The scatterplot in Figure 3 compares the median-median line to two least-squares lines: one for the entire set of data, and the other without the data for the latest-speaking child. Omitting the data point for this child leads to a least-squares line that is closer to the median-median line. Omitting the data for the unusually high-scoring child (♦) increases the strength of the sample correlation but does not noticeably change the equation of the least-squares line; thus, that least squares line is not included in the scatterplot. The data are provided in Appendix B.

Figure 3. Scatterplot of Gesell Adaptive Score versus Age at Which Child Began Speaking.

3.3 Meal Choices and Body Habits

A paper by Jiang and Hunt (1983) titled “The Relation Between Freely Chosen Meals and Body Habits” reports on the relationship between diet and body build. In particular, the researchers were interested in determining how the energy intake in one’s diet is related to that individual’s body build. Figure 4 plots dietary energy density (DED = kilocalories consumed divided by the weight of food eaten in grams) against the body mass index (BMI) (also known as the Quetelet index, defined as weight in kg /(height in meters)²) for nine individuals (Devore and Peck, 2001, pp. 145-146). Including the outlying point (■) representing an individual with an unusually low energy density for his body build, these data have a moderate positive correlation (r = 0.658, p-value = 0.054). Note that the least-squares line calculated without the outlier differs greatly from the original least-squares line but is almost identical to the median-median line.

Figure 4. Scatterplot of dietary energy density versus body mass index.

For illustration, we also consider DED as the explanatory variable and BMI as the response variable. In this scenario, the observation that was previously a high leverage point is now an observation with a large residual (Figure 5). Now the least squares line is not affected as much by the outlying observation as in Figure 4, where the observation both has high leverage and is far from the trend formed by the remaining observations. The data are provided in Appendix B.

Figure 5. Scatterplot of body mass index versus dietary energy density.

4. Simulation Study

To compare the statistical properties of least squares and median-median estimates of the slope of a linear regression model, a simulation study is conducted that considers a variety of conditions. In every simulation, the underlying linear regression model is defined to be $Y = 0 + 1X + \varepsilon$, that is, a line with intercept 0 and slope 1. In order to create simulated Y-values, the expected Y-value is computed and then a normally distributed error is added. To create an outlier, 5 times the absolute value of the error is added to or subtracted from the expected Y-value at that particular point.

The conditions under which the simulation is performed are:

two sets of values of the explanatory variable:
Set 1: x-values = (1, 2, 3, …, 24), and
Set 2: x-values = (2, 2, 4, 4, …, 24, 24),
two levels of the error standard deviation ($\sigma$ = 1 and 5), and
a number of outlier possibilities.

Without loss of generality, in most cases we assume that the outlier occurs in the upper region of X-values, that is, the outlier is one or two of $Y_{17}, \ldots, Y_{24}$. Table 2 provides a summary of the outliers generated in the simulation study.
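The data-generation step can be sketched as follows. The function name `simulate_sample` and the `outliers` argument (a mapping from a 1-based observation index to +1 for a high outlier or -1 for a low one) are hypothetical conveniences, not part of the original study's code.

```python
import random

def simulate_sample(xs, sigma, outliers=None, rng=random):
    """Simulate Y = 0 + 1*X + error with normal errors of standard deviation sigma.

    `outliers` (hypothetical argument) maps a 1-based index to +1 or -1; at
    those points 5*|error| is added to or subtracted from the expected Y-value.
    """
    outliers = outliers or {}
    ys = []
    for i, x in enumerate(xs, start=1):
        err = rng.gauss(0.0, sigma)
        expected = 0 + 1 * x
        if i in outliers:
            ys.append(expected + outliers[i] * 5 * abs(err))
        else:
            ys.append(expected + err)
    return ys

set1 = list(range(1, 25))                        # Set 1: 1, 2, ..., 24
set2 = [2 * k for k in range(1, 13) for _ in range(2)]  # Set 2: 2, 2, 4, 4, ..., 24, 24

# One high outlier at the largest X-value, error standard deviation 5
sample = simulate_sample(set1, sigma=5, outliers={24: +1}, rng=random.Random(1))
print(sample[-1])
```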

Table 2. Types of outliers considered in the simulation study.

	Outlier Generation	Description
a)	No Outliers	No Outliers
b)	$Y_{13} = 0 + 1X_{13} + 5|\varepsilon|$	One high outlier in the middle of the middle region
c)	$Y_{17} = 0 + 1X_{17} + 5|\varepsilon|$; $Y_{18} = 0 + 1X_{18} - 5|\varepsilon|$	Outliers at the first two X-values in the upper region, one high and one low
d)	$Y_{17} = 0 + 1X_{17} + 5|\varepsilon|$	One high outlier at the first X-value in the upper region
e)	$Y_{23} = 0 + 1X_{23} - 5|\varepsilon|$; $Y_{24} = 0 + 1X_{24} + 5|\varepsilon|$	Outliers at the two largest X-values, one high and one low
f)	$Y_{17} = 0 + 1X_{17} + 5|\varepsilon|$; $Y_{18} = 0 + 1X_{18} + 5|\varepsilon|$	High outliers at the first two X-values in the upper region
g)	$Y_{17} = 0 + 1X_{17} - 5|\varepsilon|$; $Y_{24} = 0 + 1X_{24} + 5|\varepsilon|$	A low outlier at the first X-value in the upper region and a high outlier at the largest X-value
h)	$Y_{24} = 0 + 1X_{24} + 5|\varepsilon|$	A single high outlier at the largest X-value
i)	$Y_{23} = 0 + 1X_{23} + 5|\varepsilon|$; $Y_{24} = 0 + 1X_{24} + 5|\varepsilon|$	High outliers at the two largest X-values

For each type of outlier generation, 1000 samples are simulated. For each sample the least squares and median-median estimates of the intercept and slope are computed. Tables 3 to 6 provide the following statistics on the 1000 simulated least squares and median-median slope estimates: means, standard deviations, and mean square errors (the variance of the simulated slopes plus the squared bias, where bias = mean - 1) for each of the nine methods of generating the data. If the mean of the simulated slope estimates is 1, the estimator is considered unbiased. The standard deviation provides a measure of spread of the estimates of the slopes, and the mean square error measures how far the estimates vary from the true slope. Since the main focus of this paper is on the slopes of the estimated line, we do not consider the estimated intercepts in this discussion of the simulation results. These tables also show the leverage value (see Neter, Kutner, Nachtsheim, and Wasserman, 1996, pp. 375-377) for the design points at which the outlier is generated. Leverage values help to identify outlying X-values that, in conjunction with extreme Y-observations, may lead to data points with a large influence on the slope and intercept of the line. It is recommended to compare the leverage value with $2p/n$, where p is the number of parameters in the linear regression model. In our case p = 2 (for the intercept and slope) and n = 24, so $2p/n = 0.167$. In our example, no design point has a leverage that exceeds this value, though the leverage of the most extreme X-values (1 and 24) comes close to $2p/n$.
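Both the leverage values and the mean square error can be sketched in Python. The leverage formula used here, $1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, is the standard one for simple linear regression. The function names are hypothetical, and the use of the sample variance (divisor n - 1) in `mse` is an assumption, since the text does not say which divisor was used.

```python
from statistics import mean, variance

def leverages(xs):
    """Leverage of each design point in simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

def mse(estimates, true_value=1.0):
    """Variance of the estimates plus squared bias (sample variance assumed)."""
    bias = mean(estimates) - true_value
    return variance(estimates) + bias ** 2

set1 = list(range(1, 25))
h = leverages(set1)
threshold = 2 * 2 / len(set1)   # 2p/n with p = 2 and n = 24
print(round(h[16], 4), round(h[23], 4), round(threshold, 3))
```

For Set 1 the rounded leverages at the 17th and 24th design points agree with the values 0.0593 and 0.1567 listed in Table 3.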

Table 3. Monte Carlo Results: Means, standard deviations, and mean square errors (x 10³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications.

Model: $Y = 0 + 1X + \varepsilon$; X’s: Set 1; $\varepsilon \sim N(0, 1^2)$.

Lower region: $X_1, \ldots, X_8$; Middle region: $X_9, \ldots, X_{16}$; Upper region: $X_{17}, \ldots, X_{24}$.

		Mean Slope Estimate	Standard Deviation	MSE (x 10³)
Type of Outlier	Leverage	LS	MM	LS	MM	LS	MM
a) No outliers		1.0017	1.0037	0.0295	0.0509	0.873	2.604
b) One high in middle of middle region	0.0419	1.0035	1.0037	0.0295	0.0509	0.882	2.604
c) First two X-values in upper region, one high and one low	0.0593 0.0680	0.9980	1.0291	0.0349	0.0568	1.222	4.073
d) One high at first X-value in upper region	0.0593	1.0170	1.0295	0.0315	0.0564	1.285	4.051
e) Two largest X-values, one high and one low	0.1375 0.1567	1.0054	0.9703	0.0472	0.0557	2.257	3.985
f) High at first two X-values in upper region	0.0593 0.0680	1.0357	1.0571	0.0338	0.0571	2.417	6.521
g) Low at first X-value in upper region, high at largest X-value	0.0593 0.1567	1.0254	1.0037	0.0424	0.0511	2.442	2.625
h) High at largest X-value	0.1567	1.0405	1.0038	0.0405	0.0510	3.280	2.615
i) High at two largest X-values	0.1375 0.1567	1.0749	1.0047	0.0482	0.0513	7.933	2.654

Table 4. Monte Carlo Results: Means, standard deviations, and mean square errors (x 10³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications.

Model: $Y = 0 + 1X + \varepsilon$; X’s: Set 1; $\varepsilon \sim N(0, 5^2)$.

Lower region: $X_1, \ldots, X_8$; Middle region: $X_9, \ldots, X_{16}$; Upper region: $X_{17}, \ldots, X_{24}$.

		Mean Slope Estimate	Standard Deviation	MSE (x 10³)
Type of Outlier	Leverage	LS	MM	LS	MM	LS	MM
a) No outliers		1.0087	1.0141	0.1473	0.1850	21.773	34.424
b) One high in middle of middle region	0.0419	1.0173	1.0141	0.1475	0.1850	22.056	34.424
c) First two X-values in upper region, one high and one low	0.0593 0.0680	0.9902	1.0614	0.1746	0.1948	30.581	41.717
d) One high at first X-value in upper region	0.0593	1.0848	1.0895	0.1576	0.1898	32.029	44.034
e) Two largest X-values, one high and one low	0.1375 0.1567	1.0272	0.9619	0.2362	0.1985	56.530	40.854
f) High at first two X-values in upper region	0.0593 0.0680	1.1786	1.1702	0.1691	0.1952	60.493	67.071
g) Low at first X-value in upper region, high at largest X-value	0.0593 0.1567	1.1271	1.0155	0.2118	0.1969	61.014	39.010
h) High at largest X-value	0.1567	1.2023	1.0423	0.2026	0.1913	81.972	38.385
i) High at two largest X-values	0.1375 0.1567	1.3743	1.0888	0.2408	0.2009	198.085	48.246

Table 5. Monte Carlo Results: Means, standard deviations, and mean square errors (x 10³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications.

Model: $Y = 0 + 1X + \varepsilon$; X’s: Set 2; $\varepsilon \sim N(0, 1^2)$.

Lower region: $X_1, \ldots, X_8$; Middle region: $X_9, \ldots, X_{16}$; Upper region: $X_{17}, \ldots, X_{24}$.

		Mean Slope Estimate	Standard Deviation	MSE (x 10³)
Type of Outlier	Leverage	LS	MM	LS	MM	LS	MM
a) No outliers		1.0018	1.0040	0.0296	0.0504	0.879	2.556
b) One high in middle of middle region	0.0425	1.0053	1.0040	0.0297	0.0504	0.910	2.556
c) First two X-values in upper region, one high and one low	0.0635 0.0635	1.0015	1.0342	0.0350	0.0569	1.227	4.407
d) One high at first X-value in upper region	0.0635	1.0188	1.0343	0.0321	0.0567	1.384	4.391
e) Two largest X-values, one high and one low	0.1474 0.1474	1.0022	0.9741	0.0474	0.0553	2.252	3.729
f) High at first two X-values in upper region	0.0635 0.0635	1.0359	1.0573	0.0340	0.0572	2.445	6.555
g) Low at first X-value in upper region, high at largest X-value	0.0635 0.1474	1.0222	1.0039	0.0423	0.0509	2.282	2.606
h) High at largest X-value	0.1474	1.0390	1.0042	0.0400	0.0506	3.121	2.578
i) High at two largest X-values	0.1474 0.1474	1.0753	1.0047	0.0485	0.0507	8.022	2.593

Table 6. Monte Carlo Results: Means, standard deviations, and mean square errors (x 10³) of the least squares (LS) and median-median (MM) slope estimates for 1000 replications.

Model: $Y = 0 + 1X + \varepsilon$; X’s: Set 2; $\varepsilon \sim N(0, 5^2)$.

Lower region: $X_1, \ldots, X_8$; Middle region: $X_9, \ldots, X_{16}$; Upper region: $X_{17}, \ldots, X_{24}$.

		Mean Slope Estimate	Standard Deviation	MSE (x 10³)
Type of Outlier	Leverage	LS	MM	LS	MM	LS	MM
a) No outliers		1.0090	1.0138	0.1479	0.1849	21.955	34.378
b) One high in middle of middle region	0.0425	1.0263	1.0138	0.1485	0.1849	22.744	34.378
c) First two X-values in upper region, one high and one low	0.0635 0.0635	1.0076	1.0628	0.1751	0.1949	30.718	41.930
d) One high at first X-value in upper region	0.0635	1.0940	1.0876	0.1607	0.1898	34.660	44.698
e) Two largest X-values, one high and one low	0.1474 0.1474	1.0108	0.9632	0.2371	0.1988	56.333	40.874
f) High at first two X-values in upper region	0.0635 0.0635	1.1797	1.1700	0.1700	0.1942	61.192	66.614
g) Low at first X-value in upper region, high at largest X-value	0.0635 0.1474	1.1112	1.0154	0.2114	0.1980	57.055	39.441
h) High at largest X-value	0.1474	1.1952	1.0467	0.2001	0.1917	78.143	38.930
i) High at two largest X-values	0.1474 0.1474	1.3763	1.0886	0.2427	0.2008	200.505	48.171

Figures 6 to 8 provide density estimates for three of these examples for the first set of x-values (1, 2, 3, …, 24): no outliers, a moderate outlier example (high and low outliers at the end – Table 2, design (e)), and the most extreme outlier example (two high outliers at the end – Table 2, design (i)). The density estimates are simply the simulated distributions of the slope estimates and are computed from the 1000 slope estimates using the SPlus function density.

The simulation study comparing the performance of the two methods of regression when no outliers are present (Figure 6) shows that both the median-median and LS estimates are close to being unbiased. However, there is less variation in the least-squares slope estimates than in the median-median slope estimates.

Figure 6. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for $\sigma$ = 1 and $\sigma$ = 5;
X’s: Set 1; no outliers.

In the moderate case (Figure 7, one high and one low outlier at the end – Table 2, design (e)), the least-squares method still tends to give unbiased estimates of the slope while the median-median estimates are slightly biased to the low side and exhibit smaller MSE than the LS line when $\sigma$ = 5.

Figure 7. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for $\sigma$ = 1 and $\sigma$ = 5;
X’s: Set 1; one high and one low outlier at the end – Table 2, design (e).

In the extreme case (Figure 8, two high outliers at the end – Table 2, design (i)), however, the median-median method performs much better, with less variation and more accuracy than the least-squares method, especially in the case of greater population standard deviation ($\sigma$ = 5 vs. $\sigma$ = 1).

Figure 8. Density estimates of the slopes for the 1000 replications for least squares and median-median estimates and for $\sigma$ = 1 and $\sigma$ = 5;
X’s: Set 1; two high outliers at the end – Table 2, design (i).

5. Conclusion

The median-median regression line is intended as a simple, resistant alternative to the least-squares line. In particular, it provides a method of line fitting that is computationally accessible to young students of statistics, while at the same time giving estimates that are more robust, that is, less sensitive to outliers, than those given by least-squares regression.

The three examples provided in this paper show that, when an influential point is present, the median-median line may be less affected by the influential point than the least-squares line. The simulation study further supports the argument for using the median-median line as a resistant method of regression when outliers are present, especially in the most extreme cases (one or two high outliers at the end; see Table 2, designs (h) and (i)) and in cases with greater population variance when the outlier is on the most extreme x-value.

On the other hand, there exists a well-developed set of tools for inference and for detecting the presence of unusual observations when using least-squares regression under normal error assumptions. In contrast, inference about the median-median estimates would require bootstrapping to obtain approximate standard errors, confidence intervals/regions, and p-values for tests. Given this, in addition to the superior performance of least-squares estimation when either no outliers or moderate outliers are present, the median-median method of regression would not appear to be a valuable tool at either the professional or collegiate level.
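As a rough illustration of what such bootstrapping might look like, the sketch below resamples the observed (x, y) pairs with replacement and takes the standard deviation of the resulting median-median slopes as an approximate standard error. Case resampling is only one of several bootstrap schemes and is an assumption here, as are the function names, the ceil(n/3) size of the outer regions, and the choice to discard resamples whose outer x-medians coincide.

```python
import random
from statistics import median, stdev

def mm_slope(points):
    """Median-median slope from the two outer regions (ties not handled)."""
    pts = sorted(points)
    n = len(pts)
    k = -(-n // 3)                      # ceil(n / 3) points in each outer region
    lower, upper = pts[:k], pts[n - k:]
    dx = median(p[0] for p in upper) - median(p[0] for p in lower)
    if dx == 0:
        return None                     # slope undefined for this resample
    dy = median(p[1] for p in upper) - median(p[1] for p in lower)
    return dy / dx

def bootstrap_se(points, reps=1000, seed=0):
    """Approximate standard error of the median-median slope by case resampling."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        resample = [rng.choice(points) for _ in points]
        s = mm_slope(resample)
        if s is not None:
            slopes.append(s)
    return stdev(slopes)

manatees = list(zip(
    [447, 460, 481, 498, 512, 513, 526, 559, 585, 614, 645, 675, 711, 719],
    [13, 21, 24, 16, 20, 24, 15, 34, 33, 33, 39, 43, 50, 47],
))
print(mm_slope(manatees), bootstrap_se(manatees))
```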

However, in elementary and middle schools the median-median method of estimating a line may be a reasonable approach to describing a linear relationship in a scatterplot. The teacher may first introduce the idea of a scatterplot as a way of visualizing the association between a response variable and an explanatory variable. If there is a straight-line trend in the plot, a straight line may be fit by eye using a ruler to try to capture the trend. While each student will likely be satisfied that their eye-balled line fits the data well, they will also recognize that each student has chosen a slightly different line. This will suggest the need for a more objective and structured method of fitting a line. The median-median line will not only be simple to find, but may also feel closely connected to the visual process each student has just undertaken.

Appendix A: Texas Instruments median-median algorithm

The following is the documentation provided by Texas Instruments on their web page on how the TI graphing calculators compute Median-Median estimates.

Median-Median Line Algorithm on Graphing Handhelds. 

Solution:

What method is used to calculate the median-median line?

The goal of the median-median line is to:

1) Divide the data into three parts with an equal number of data points
2) Define a summary point for each part
3) Use the three summary points to define the median-median line

How the TI Calculator does this:

1) The calculator will attempt to break the list into three equal parts without breaking up data groups of equal x values. 
   In this case, the algorithm we use is designed to include at least 1/3 of the points in the left and right groups. Whatever 
   is left over is put into the center group, hopefully the remaining 1/3. The approach chosen was to fill the outside 
   groups first and allocate the remaining data points to the center group. If the center group is empty, an error message
   is generated.

   The reason equal x values are not split up is to ensure the same results are produced independent of the order the 
   data appears in the original list. Without this restriction, a different result could be produced depending on the 
   ordering in the data input lists.

2) A summary point is simply the median of all x's and y's in that part.  Let's call the summary points (x1,y1), (x2,y2), 
   and (x3,y3).

3) The median-median line will be parallel to the line going through the points (x1,y1) and (x3,y3)

                                                a = (y3 - y1) / (x3 - x1)

   and 1/3 the distance between the line through the two summary points and a parallel line going through the second 
   summary point (x2,y2).

                                        b = (y1 + y2 + y3 - a (x1 + x2 + x3)) / 3.

   An advantage of the median-median line over a least-squares line, is that stray data points do not affect the end result 
   very much.
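
The steps quoted above can be sketched in a few lines of Python. This is our own illustration, not Texas Instruments code: the function name is hypothetical, and the grouping rule (start each outer group with the smallest whole number of points that is at least n/3, then extend it so that equal x values stay together) is one reading of step 1 that we assume here. The usage lines apply it to the Example 1 data in Appendix B.

```python
import math
from statistics import median


def median_median_line(xs, ys):
    """Return (slope a, intercept b) of a median-median line.

    Assumption: each outer group starts with ceil(n/3) points and is then
    extended so that equal x values are never split between groups.
    """
    pts = sorted(zip(xs, ys))          # order by x (ties ordered by y)
    n = len(pts)
    k = math.ceil(n / 3)               # "at least 1/3" of the points

    left_end = k                       # left group is pts[:left_end]
    while 0 < left_end < n and pts[left_end][0] == pts[left_end - 1][0]:
        left_end += 1

    right_start = n - k                # right group is pts[right_start:]
    while 0 < right_start < n and pts[right_start - 1][0] == pts[right_start][0]:
        right_start -= 1

    if right_start <= left_end:        # nothing left for the center group
        raise ValueError("center group is empty")

    def summary(group):
        # Summary point: median of the x's and median of the y's.
        return median(x for x, _ in group), median(y for _, y in group)

    x1, y1 = summary(pts[:left_end])
    x2, y2 = summary(pts[left_end:right_start])
    x3, y3 = summary(pts[right_start:])

    a = (y3 - y1) / (x3 - x1)                      # slope from outer summary points
    b = (y1 + y2 + y3 - a * (x1 + x2 + x3)) / 3    # intercept shifted 1/3 toward center
    return a, b


# Example 1 data from Appendix B.
boats = [447, 460, 481, 498, 512, 513, 526, 559, 585, 614, 645, 675, 711, 719]
deaths = [13, 21, 24, 16, 20, 24, 15, 34, 33, 33, 39, 43, 50, 47]
a, b = median_median_line(boats, deaths)
print(f"deaths = {b:.2f} + {a:.4f} * boats")
```

With this rule the 14 points of Example 1 fall into groups of 5, 4, and 5, and the 21 points of Example 2 into groups of 8, 5, and 8, matching the blank-line groupings shown in Appendix B.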

Appendix B: Data sets used in the examples

Example 1. Manatee Deaths and Power Boat Registrations illustrating the 3 groups (data used with permission of W. H. Freeman and Company)

Power Boat Registrations	Manatee Deaths

447	13
460	21
481	24
498	16
512	20

513	24
526	15
559	34
585	33

614	33
645	39
675	43
711	50
719	47

Example 2. Gessell Scores and Age illustrating the 3 regions (data used with permission of W. H. Freeman and Company)

Age (x)	Gessell Score (y)

7	113
8	104
9	91
9	96
10	83
10	83
10	100
10	100

11	100
11	84
11	102
11	86
12	105

15	95
15	102
17	121
18	93
20	87
20	94
26	71
42	57

Example 3. Body Mass Index (BMI) and Dietary Energy Density (DED) illustrating the 3 regions (data used with permission of Thomson Learning)

BMI (x)	DED (y)	DED (x)	BMI (y)

21.1	0.54	0.44	21.5
21.5	0.44	0.54	21.1
22.1	0.67	0.67	22.1

22.3	0.78	0.78	22.3
22.4	0.90	0.86	22.8
22.8	0.86	0.90	22.4

23.1	0.91	0.91	23.1
23.3	0.94	0.93	26.8
26.8	0.93	0.94	23.3

From Statistics: The Exploration and Analysis of Data (with CD-ROM), 4th Ed., by Devore/Peck, 2001. Reprinted with permission of Brooks/Cole, a division of Thomson Learning: www.thomsonrights.com. Fax 800 730-2215.

References

American Statistical Association (2005a), Quantitative Literacy – An Overview from jse.amstat.org/education/index.cfm?fuseaction=QLworkshops

American Statistical Association (2005b), Materials for K-12 Statistics Education from jse.amstat.org/education/index.cfm?fuseaction=k12material

Brown, G. W. and Mood, A. M. (1951), “On Median Tests for Linear Hypotheses,” Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA: University of California Press, 159-166.

Cobb, G. W. and Moore, D. S. (1997), “Mathematics, Statistics, and Teaching,” The American Mathematical Monthly, 104(9), 801-823.

Devore, J. and Peck, R. (2001), Statistics: The Exploration and Analysis of Data, 4th Ed., Belmont, CA: Brooks/Cole.

Emerson, J. D. and Hoaglin, D. C. (1983), “Resistant Lines for y-versus-x,” in Understanding Robust and Exploratory Data Analysis, eds. D. C. Hoaglin, F. Mosteller, and J. W. Tukey, New York: John Wiley.

Hahn, G. J. (1988), “What Should the Introductory Statistics Course Contain?,” College Mathematics Journal, 19(1), 26-29.

Hartwig, F. and Dearing, B. E. (1979), Exploratory Data Analysis, Beverly Hills, CA: Sage Publications.

Jiang, C. L. and Hunt, J. N. (1983), “The Relation Between Freely Chosen Meals and Body Habits,” American Journal of Clinical Nutrition, 38, 32-40.

Johnstone, I. M. and Velleman, P. F. (1985), “The Resistant Line and Related Regression Methods,” Journal of the American Statistical Association, 80, 1041-1054.

Landwehr, J. A. and Watkins, A. E. (1986), Exploring Data, Palo Alto, CA: Dale Seymour Publications.

Moore, D. S. (1997), Statistics: Concepts and Controversies, 4th Ed., New York: W. H. Freeman and Co.

Moore, D. S. and McCabe, G. P. (2003), Introduction to the Practice of Statistics, 4th Ed., New York: W. H. Freeman and Co.

Nair, K. R. and Shrivastava, M. P. (1942), “On a Simple Method of Curve Fitting,” Sankhyā, 6, 121-132.

Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th Ed., Boston: McGraw Hill.

Tukey, J. W. (1971), Exploratory Data Analysis, Reading, MA: Addison-Wesley.

Velleman, P. F. and Hoaglin, D. C. (1981), Applications, Basics and Computing of Exploratory Data Analysis, Boston, MA: Duxbury Press.

Wald, A. (1940), “The Fitting of Straight Lines if Both Variables Are Subject to Error,” Annals of Mathematical Statistics, 11, 282-300.

Acknowledgements

We thank the editor and referees who provided comments that greatly improved the manuscript.

Elizabeth J. Walters
Mathematical Sciences Department
Loyola College in Maryland
Baltimore, MD 21210-2699
U.S.A.
ewalters@loyola.edu

Christopher H. Morrell
Mathematical Sciences Department
Loyola College in Maryland
Baltimore, MD 21210-2699
U.S.A.
chm@loyola.edu

Richard E. Auer
Mathematical Sciences Department
Loyola College in Maryland
Baltimore, MD 21210-2699
U.S.A.
rea@loyola.edu
