Iain Pardoe
Lundquist College of Business, University of Oregon
Journal of Statistics Education Volume 16, Number 2 (2008), jse.amstat.org/v16n2/datasets.pardoe.html
Copyright © 2008 by Iain Pardoe all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Graphics; Indicator variables; Interaction; Linear regression; Model building; Quadratic; Transformations.
It can be challenging when teaching regression concepts to find interesting real-life datasets that allow analyses that put all the concepts together in one large example. For example, concepts like interaction and predictor transformations are often illustrated through small-scale, unrealistic examples with just one or two predictor variables that make it difficult for students to appreciate how these concepts might be applied in more realistic multi-variable problems. This article addresses this challenge by describing a complete multiple linear regression analysis of home price data that covers many of the usual regression topics, including interaction and predictor transformations. The analysis also contains useful practical advice on model building—another topic that can be hard to illustrate realistically—and novel statistical graphics for interpreting regression model results. The analysis was motivated by the sale of a home by the author. The statistical ideas discussed range from those suitable for a second college statistics course to those typically found in more advanced linear regression courses.
This article describes a complete multiple linear regression analysis of home price data for a city in Oregon, USA in 2005. At the time the data were collected, I was preparing to place my home on the market and it was important to come up with a reasonable asking price. Whereas realtors use experience and local knowledge to subjectively value a home based on its characteristics (size, amenities, location, etc.) and the prices of similar homes nearby, regression analysis provides an alternative approach that more objectively models local home prices using these same data. Better still, realtor experience can help guide the modeling process to fine-tune a final predictive model.
The article discusses statistical ideas ranging from those suitable for the regression component of a second college statistics course to those typically found in more advanced linear regression courses. The analysis includes many elements covered in typical regression components of second statistics courses such as indicator variables for coding qualitative information, model building, hypothesis testing, diagnostics, and model interpretation. The analysis also provides a compelling application of more challenging topics including predictor interactions, predictor transformations, and understanding model results through the use of graphics. I have used this material in my own second statistics course, which is taken by business undergraduates at the University of Oregon. The example generates much discussion with students able to strongly relate to questions about home values either directly or through their parents’ homes. Students can engage with this application for a variety of reasons. For example, they could predict the sale price of a home similar to the one in which they grew up, or explore the relative values of different home characteristics such as, "Is an additional bathroom valued more than an additional bedroom?"
This article is based on a case study in Pardoe (2006), which also contains further details on the more routine aspects of a regression analysis. Here I complement that case study by providing additional motivation for the analysis and further background on actual use of this dataset in the classroom. The article is organized as follows: Section 2 describes the dataset; Section 3 shows the model building process to develop suitable multiple linear regression models; Section 4 discusses the final model found in Section 3 along with its potential use; Section 5 describes how to construct graphs to better understand the results of the final model; and Section 6 concludes with ideas for extending the analysis in class or in student assignments.
The data file containing information on 76 single-family homes in Eugene, Oregon during 2005 was provided by Victoria Whitman, a Eugene realtor. We will model single-family home sale prices (Price, in thousands of dollars), which range from $155,000 to $450,000, using these predictor variables:
It seems reasonable to expect that homes built on properties with a large amount of land area command higher sale prices than homes with less land, all else being equal. However, an increase in land area of (say) 2000 square feet from 4000 to 6000 should probably make a larger difference (to sale price) than going from 24,000 to 26,000. Thus, realtors have constructed lot size "categories," which in their experience correspond to approximately equal-sized increases in sale price. The categories (variable Lot) used in this dataset are:
To reflect the belief that half-bathrooms (i.e., those without a shower or bath-tub) are not valued by home-buyers nearly as highly as full bathrooms, the variable Bath records half-bathrooms with the value 0.1.
This particular housing market has a mix of homes that were built from 1905 to 2005, with a mean around 1970. Since from the realtor’s experience both very old homes and very new homes tend to command a price premium relative to "middle age" homes in this market, a quadratic effect might be expected for an age variable in a multiple linear regression model to predict price. To facilitate this we use a rescaled Age variable by subtracting 1970 from "year built" and dividing by 10. The resulting variable has a mean close to zero and a standard deviation just over 2, and represents the number of decades away from 1970.
This dataset includes three types of sales listings: homes that have recently sold, "pending sales" for which a sale price had been agreed but paperwork still needed to be completed, and homes that are "active listings" offered for sale but which have not yet sold. Since at the time these data were collected the final sale price of a home could sometimes be considerably less than the price for which it was initially offered, we define an indicator variable, Active, to model the average difference between actively listed homes (Active=1) and pending or sold homes (Active=0).
Different neighborhoods in this housing market have potentially different levels of housing demand. The strongest predictor of demand that is available with this dataset relates to the nearest elementary school (out of six) for each home. Thus, we define five indicator variables to serve as a proxy for the geographic neighborhood of each home. The most common elementary school in the dataset, Edgewood Elementary, is the "reference level" and the indicator variables Edison to Parker represent the difference in price between those schools and Edgewood. Figure 3 in the Appendix contains a map that shows the location of the six elementary schools used in the analysis. The numbers of homes in neighborhoods near to Crest (6 homes) and Adams (3 homes) are relatively small, which may limit our ability to say much about systematic differences in home prices for these two neighborhoods. We return to this question at the end of Section 3.
The first step of any data analysis should be to explore the data through drawing appropriate graphs and calculating various summary statistics—Section 6.1 of Pardoe (2006) contains examples for this dataset. The dataset is sufficiently rich and varied that in my experience students engage with the data quite readily and appear to enjoy constructing scatterplots and boxplots for the easily understood variables. After the students get a feel for the data we next apply multiple linear regression modeling to see whether the various home characteristics allow us to model sale prices with any degree of accuracy. All of the topics considered in this section (and also Section 4) — namely R2, adjusted R2, regression standard error, residual analysis, indicator variables, variable transformations, interactions, regression assumptions, variable selection, and nested model F-tests — would typically be covered during the regression component of a second college statistics course.
We first try a model with each of the predictors "as is" (no transformations or interactions):
E(Price) = b0 + b1Size + b2Lot + b3Bath + b4Bed + b5Age + b6Garage + b7Active + b8Edison + b9Harris + b10Adams + b11Crest + b12Parker.
However, the residuals from model 1 fail to satisfy the zero mean (linearity) assumption in a plot of the residuals versus Age, displaying a relatively pronounced curved pattern. The left-hand plot in Figure 1 displays this plot, together with the loess fitted line that provides a graphical representation of the average value of the residuals as we move across the plot (i.e., as Age increases). Pardoe (2006, p. 107) and Cook and Weisberg (1999, p. 44) provide more details on the use of loess fitted lines for assessing patterns in scatterplots—this might be a more suitable topic for a more advanced regression course.
To attempt to correct this failing, we will add an Age2 transformation to the model, which as discussed above was also suggested from the realtor’s experience. The finding that the residual plot with Age has a curved pattern does not necessarily mean that an Age2 transformation will correct this problem, but it is certainly worth trying.
In addition, both Bath and Bed have relatively large individual t-test p-values in model 1, which appears to contradict the notion that home prices should increase with the number of bedrooms and bathrooms. This provides an ideal opportunity to ask students why this might be happening. For example, do they think that adding extra bathrooms to homes with just two or three bedrooms might just be considered a waste of space and so have a negative impact on price. Conversely, is there a clearer benefit for homes with four or five bedrooms to have more than one bathroom, so that adding bathrooms for these homes probably has a positive impact on price. If they agree with these statements, how can the model be extended to allow it to capture these price impacts? In my experience, interaction in regression is a challenging concept, but this kind of plausible and realistic example can greatly ease understanding. The instructor can guide the students in seeing that to model such a relationship we need to add a Bath × Bed = BathBed interaction term to the model. I show in Section 5 how this interaction can then be displayed graphically too.
Therefore, we next try the following model:
E(Price) = b0 + b1Size + b2Lot + b3Bath + b4Bed + b34BedBath + b5Age + b52Age2 + b6Garage + b7Active + b8Edison + b9Harris + b10Adams + b11Crest + b12Parker.
Model 2 results in an increased value of R2 (coefficient of determination), a reduced value of the regression standard error, and residuals that appear to satisfy the four regression model assumptions of zero mean (linearity), constant variance, normality, and independence reasonably well (the residual plot with Age on the horizontal axis is displayed as the right-hand plot in Figure 1).
However, the model includes some terms with large individual t-test p-values, suggesting that perhaps it is more complicated than it needs to be and includes some redundant terms. In particular, the last three elementary school indicators (Adams, Crest, and Parker) have p-values of 0.310, 0.683, and 0.389. We can conduct a nested model F-test (also known as an "analysis of variance" test or "extra sum of squares" test) to see whether we can safely remove these three indicators from the model without significantly worsening its fit. The resulting p-value of 0.659 is more than any sensible significance level, so we cannot reject the null hypothesis that the last three school indicator regression parameters are all zero. In addition, removing these three indicators increases the value of adjusted R2 and reduces the regression standard error. Removing these three indicators from the model means that the school reference level now comprises Edgewood, Adams, Crest, and Parker (so that there are no systematic differences between these four schools with respect to home prices). This also provides the instructor with an opportunity to explore a challenging concept—the use of indicator variables to model qualitative data—within a realistic context.
Thus, a final model for these data is
E(Price) = b0 + b1Size + b2Lot + b3Bath + b4Bed + b34BedBath + b5Age + b52Age2 + b6Garage + b7Active + b8Edison + b9Harris,
with estimated regression equation:
= 332.48 + 56.72Size + 9.92Lot − 98.16Bath − 78.91Bed + 30.39BathBed + 3.30Age + 1.64Age2 + 13.12Garage + 27.42Active +67.06Edison + 47.27Harris.
Model 3 results in residuals that appear to satisfy the regression model assumptions reasonably well (residual plots not shown). Also, each of the individual t-test p-values is below the usual 0.05 threshold (including Bath, Bed, and the BathBed interaction), except Age (which is included to retain hierarchy since Age2 is included in the model) and Garage (which is nonetheless retained since its p-value is low enough to suggest a potentially important effect).
A potential use for the final model might be to narrow the range of possible values for the asking price of a home about to be put on the market. For example, consider a home with the following features: 1879 square feet, lot size category 4, two and a half bathrooms, three bedrooms, built in 1975, two-car garage, and near Parker Elementary School (this was my home at the time). A 95% prediction interval ignoring the model comes to ($164,800, $406,800); this is based on the formula: sample mean ± t-percentile × sample standard deviation × By contrast, a 95% prediction interval using the model results comes to ($197,100, $369,000), which is about 70% the width of the interval ignoring the model. A realtor could advise the vendors to price their home somewhere within this range depending on other factors not included in the model (e.g., toward the upper end of this range if the home is on a nice street, the property is in good condition, and landscaping has been done to the yard). As is often the case, the regression analysis results are more effective when applied in the context of expert opinion and experience.
These results also illustrate that "prediction is hard;" whereas the final model might be considered quite successful in terms of its ability to usefully explain the variation in sale price (R2 = 58.8%), more than 40% of the variation in price remains unexplained by the model. Further, the above 95% prediction interval of ($197,100, $369,000) is perhaps disappointingly wide. The dataset predictors can only go so far in helping to explain and predict home prices in this particular housing market. Students might like to consider ways to explain more variation and to tighten up this interval, for example, by collecting more data observations or thinking of new predictor variables, such as other factors related to the geographical neighborhood, condition of the property, landscaping, and features such as fireplaces and updated kitchens.
A further use for the model might be to utilize the specific findings relating to the effects of each of the predictors on the price. For example, since = 56.72, we expect sale price to increase by $5672 for each 100 square foot increase in floor size, all else held constant. Interpretations of for lot size and for garage size are similarly straightforward, but the interpretations for numbers of bedrooms/bathrooms and age are more complicated. Section 5 suggests graphical ideas for increasing understanding of regression parameter estimates when interactions and transformations are involved, as they are here.
The preceding discussion contains most of the standard topics that would typically be covered during the regression component of a second college statistics course. Such courses sometimes also cover more "advanced" topics, such as the role that individual data observations can play in a multiple linear regression model (e.g., outliers or high leverages); these topics would certainly be covered in a more advanced course dealing only with regression. To illustrate, calculation of studentized residuals, leverages, and Cook’s distances can help to identify overly influential observations. Finding such observations can suggest the need to investigate possible data errors, to add additional predictors to the model, to respecify the model in some other way, or to consider removing the influential observations from the dataset. However, in this case none of the final model studentized residuals are outside the ± 3 range, and so none of the observations would probably be considered outliers. Home 76 (with a large floor size) has the highest leverage, although home 54 (the oldest home) is not far behind. These two homes also have the two highest Cook’s distances, although neither is above a 0.5 threshold (see Cook and Weisberg 1999, p. 358), and neither dramatically changes the regression results if excluded.
Interpretation of the parameter estimates for Bath, Bed, and Age are complicated somewhat by their interactions and transformations. In such circumstances it can be helpful to use statistical graphics to help interpret the model results. Section 5.4 in Pardoe (2006, pp. 188–194) introduces "predictor effect plots," line plots that show graphically how a regression response variable (home price in this case) is associated with changes to the predictor variables. For the variables Size, Lot, and Garage, these line plots simply represent the corresponding parameter estimates as straightforward slopes. However, in the case of Bath, Bed, and Age the plots provide additional insights due to the presence of complicating interaction effects and transformations.
Note that estimated regression parameters cannot usually be interpreted causally. We can really only use the regression models described in this article to quantify relationships and to identify whether a change in one variable is associated with a change in another variable, not to establish whether changing one variable "causes" another to change. The term "predictor effect" in this section indicates how the model expects Price to change as each predictor changes (and all other predictors are held constant), but without suggesting at all that this is some kind of causal effect.
The basic ideas behind predictor effect plots are sufficiently straightforward that they could be comfortably covered in the regression component of a second college statistics course. First, use the estimated regression equation to consider how Price changes as Size changes when we hold the remaining predictors constant (say, at sample mean values for the quantitative predictors and zero for the indicator variables):
This shows how Price changes as Size changes for homes with average values for Lot,..., Garage that are in the Edgewood, Adams, Crest, or Parker neighborhoods. The corresponding predictor effect plot has this Size effect drawn as a single straight line on the vertical axis with Size on the horizontal axis. It is straightforward to construct similar predictor effect plots for Lot and Garage.
However, the "BathBed effect on Price" involves an interaction and so is more complicated:
The left-hand plot in Figure 2 shows a line plot with this BathBed effect on Price on the vertical axis, Bath on the horizontal axis, and lines marked by the value of Bed. Now students can easily visualize the interaction concept previously considered verbally in Section 3. In homes with just two or three bedrooms, additional bathrooms are associated with lower prices (holding all else constant), particularly two-bedroom homes. Conversely, in homes with four or five bedrooms, additional bathrooms are associated with higher prices (all else constant), particularly five-bedroom homes. The scale on the plot shows the approximate magnitude of average prices for different numbers of bathrooms and bedrooms (for homes with average values for Size, Lot, Age, and Garage that are in the Edgewood, Adams, Crest, or Parker neighborhoods). Homes with other predictor values tend to have price differences of a similar magnitude for similar changes in the numbers of bathrooms and bedrooms. Similar interpretations can be made for the right-hand plot in Figure 2, which shows a line plot with the BathBed effect on Price on the vertical axis and Bed on the horizontal axis, and lines marked by the value of Bath.
Figure 2: Predictor effect plots for Bath and Bed in the home prices example. In the left plot, the BathBed effect on Price of 504.2 − 98.16Bath − 78.91Bed + 30.39BathBed is on the vertical axis while Bath is on the horizontal axis and the lines are marked by the value of Bed. In the right plot, the BathBed effect on Price is on the vertical axis while Bed is on the horizontal axis and the lines are marked by the value of Bath.
The "Age effect on Price" is complicated by the transformation:
However, the corresponding predictor effect plot (not shown) is simply a quadratic line that shows average prices decreasing from approximately $295k to $245k from the early 1900s to 1960, and then increasing again up to approximately $280k in 2005 (for homes with average values for Size, Lot, Bath, Bed, and Garage that are in the Edgewood, Adams, Crest, or Parker neighborhoods).
This article has described a complete multiple linear regression analysis of a compelling real-life dataset on home prices. The analysis covers many of the usual topics in a regression course, but there are additional possibilities for investigating this dataset further. The following suggestions provide ideas for students to continue working with this dataset. Some, such as the first suggestion below would be reasonable for the regression component of a second college statistics course, while others, such as the final suggestion, might be better suited for a more advanced course dealing only with regression.
Figure 3: Map of the local area in Eugene, Oregon covered by the analysis, including the locations of the six elementary schools.
I thank Victoria Whitman for providing the data and two anonymous reviewers and editors Roger Johnson and Dex Whittinghill for their many helpful suggestions.
Cook, R. D. and S. Weisberg (1999). Applied Regression Including Computing and Graphics. Hoboken, NJ: Wiley.
Pardoe, I. (2006). Applied Regression Modeling: A Business Approach. Hoboken, NJ: Wiley.
Iain Pardoe
Lundquist College of Business
University of Oregon
Eugene, Oregon
U.S.A.
ipardoe@lcbmail.uoregon.edu