Journal of Statistics Education v.6, n.1 (1998)
Copyright (c) 1998 by Robert Carver, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Forecasting; Measurement; Regression; Time series; Variable transformation.
In a residential home, energy consumption is closely related to the outdoor temperature and size of the house. In a home of a given size, fuel consumption varies fairly predictably through the year. When homeowners add a room, other things being equal, energy consumption should increase. This dataset permits students to estimate the energy demand, make forecasts for future months, and investigate other relationships.
The dataset contains natural gas and electricity usage data for a single-family residence in the Boston area from September 1990 through May 1997, accompanied by monthly climatological data. The dataset is useful for illustrating the concepts and techniques of central tendency, dispersion, time series analysis, correlation, simple and multiple regression, and variable transformations.
1 Among the challenges of working with real data in an introductory statistics course is the limited "real world" experience of undergraduates. Many students may lack the background knowledge necessary to interpret or appreciate the story in a dataset. The dataset described in this article is easily accessible to students, as are the nature of the questions raised and the underlying causal relationships.
2 The dataset can be used to illustrate several techniques in a typical course, permitting students to return to familiar data as their statistical sophistication develops. More important than its use as grist for the computational mill, the data also provide opportunities for gaining insights into concepts with which students often struggle: variation, standard errors, and causal modeling.
3 My family lives near Boston, Massachusetts, in a house built in about 1890. Over the years, many changes have been made to the structure and its systems. Rooms have been added and indoor plumbing introduced. Electrical service and wiring have been changed and upgraded numerous times, and the web of cables in the basement is virtually a museum of the history of residential wiring. The heating system was originally fueled by coal, then oil, and now by natural gas. Though the coal-fired system depended on convection to distribute the warmed air, the current system is forced hot air, relying on an electric fan to push the heated air throughout a system of ducts.
4 Our furnace, stove, and water heater use natural gas. Our clothes dryer is all-electric. In early 1996, we added a bedroom and enlarged the kitchen, and we were interested in estimating the additional consumption of natural gas and electricity attributable to the new configuration. The family remained the same size, and our growing children have not yet started to change bathing habits, so demand for hot water has remained stable. The new construction improved the insulation in the affected areas, and several new lighting circuits were added. There is no air conditioning in the house.
5 The dataset contains monthly observations beginning in September 1990, and continuing through May 1997. The variables include mean temperature for the month in Boston, natural gas consumption, electricity consumption, heating and cooling degree days, and a dummy variable indicating the presence of the new room. This last variable is 0 for the months through November 1995, and 1 thereafter.
6 Degree days are a measure of temperature fluctuations that stimulate demand for heating or cooling. Specifically, heating degree days are sums of the absolute value of temperature deviations below a base temperature of 65 degrees Fahrenheit. For example, if the mean daily temperature were 60 degrees one day, that would represent five heating degree days. Conversely, cooling degree days sum the positive deviations from the base of 65 degrees.
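As a rough illustration of this definition, the following Python sketch computes heating and cooling degree days from a list of mean daily temperatures. The function name is hypothetical; only the 65-degree base and the 60-degree example come from the text.

```python
BASE_TEMP_F = 65  # base temperature from the text, in degrees Fahrenheit


def degree_days(daily_means, base=BASE_TEMP_F):
    """Return (heating, cooling) degree days for a sequence of mean daily temperatures."""
    # Heating degree days accumulate shortfalls below the base temperature.
    heating = sum(max(base - t, 0) for t in daily_means)
    # Cooling degree days accumulate excesses above the base temperature.
    cooling = sum(max(t - base, 0) for t in daily_means)
    return heating, cooling


# One day with a mean of 60 degrees contributes five heating degree days.
print(degree_days([60]))  # (5, 0)
```

A monthly total is simply the same sum taken over every day in the month.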
7 Since the dataset spans several years, I elected to express the utility usage in units of consumption, rather than in dollar amounts. This way, we do not have to be concerned with price inflation. The gas company computes bills on the basis of therms used, where a therm is an index representing the variable heating capacity of one cubic foot of natural gas. Since the volume of natural gas is a function of temperature, the heating capacity of one cubic foot varies throughout the year. Bay State Gas computes consumption in therms on each gas bill. Electrical usage is measured in kilowatt hours (kwh). Bay State Gas and Boston Edison supplied all of the consumption figures in the form of monthly invoices.
8 The mean daily temperature and degree day data were obtained from the National Weather Service. Note that the observation periods are not perfectly aligned. For example, the temperature data for January 1991 refer to the period from January 1 through January 31 of that year. The gas consumption observation for the month reflects the period from December 18, 1990, through January 17, 1991, while the electricity observation reflects the period from December 12, 1990, through January 11, 1991. These irregularities provide an opportunity for class discussion of a typical difficulty with observational data, and might occasion a writing exercise or discussion of how one might redesign this study.
9 Besides the problem of non-aligned months, there are missing observations in the dataset. The gas company typically skips one or two bills during the summer months. Specifically, in 1991 and 1992, there was no September bill. Starting in 1993, the company has sent bills in June, August, and October, and the latter two bills account for about 60 calendar days rather than 30. Therefore, it makes sense to divide the total billing figure by the length of the billing period, and analyze mean gas consumption per day.
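The per-day adjustment can be sketched in a few lines of Python. The function name and the bill totals below are hypothetical, chosen only to show why a 60-day bill must not be compared directly with a 30-day bill.

```python
def mean_usage_per_day(total_units, billing_days):
    """Average consumption per day over one billing period."""
    if billing_days <= 0:
        raise ValueError("billing period must cover at least one day")
    return total_units / billing_days


# Hypothetical bills: the same total spread over 30 and over 60 days.
print(mean_usage_per_day(60, 30))  # 2.0
print(mean_usage_per_day(60, 60))  # 1.0
```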
10 The major questions that prompted collection of the data are these:
In an average month, do we use additional natural gas as a result of adding a room? If so, how much?
In an average month, do we use additional electricity as a result of adding a room? If so, how much?
11 Underlying these questions, of course, is the issue of whether or not there is a sufficiently stable relationship between temperature and utility use to permit us to use the available data productively. Additionally, there are many useful questions that can be asked, permitting both practice with techniques and experience with statistical reasoning. For instance, Boston Edison only reads the electric meter every other month, but issues a monthly bill. In one month, the billing period starts with an actual reading and ends with an estimated reading; in the next, the converse is true. Is the reported usage consistently different in the two types of bills? Beyond such questions, the dataset provides good fodder for coming to grips with basic statistical concepts like the standard error of the mean.
12 In addition to estimating costs imposed by adding a room, the dataset lends itself to a variety of statistical techniques. This section reviews several of the many possibilities.
13 I first suggest that students simply look at the dataset, preferably on the printed page, and describe what they notice. There are two prominent features: the missing data (summertime gas bills) are conspicuous, as are the complementary wavelike patterns in the heating and cooling degree day columns. A moment's reflection brings an explanation for the latter. As for the former, we discuss why the data are missing, and begin to consider the implications for later analysis.
14 Students can construct and comment on histograms of the two dependent variables (daily gas and electricity usage). The kilowatt-hours-per-day distribution is symmetrical, bell-shaped, and quite well-behaved, while the gas consumption distribution, by comparison, is very irregular (Figure 1). The contrast in the two graphs lends itself to useful discussion.
Figure 1. Histograms of Mean Daily Gas Usage and Mean Daily Electricity Usage.
15 Since these are time series data, students should construct time series plots early in the analysis. I have them plot five variables against time: temperature, heating degree days, cooling degree days, gas consumption, and electricity consumption. Before constructing the graphs, I ask them what they expect to see in each. Then, after seeing the graphs, I ask them (a) if there were any surprises and (b) what they were. The graph of mean daily gas usage is shown in Figure 2. We discuss the differences among the five graphs, and whether the final two graphs provide any clues regarding the principal questions about increased consumption. I also like to ask students how the cooling and heating degree day graphs might look in other parts of the country.
Figure 2. Time Series Plot of Mean Daily Gas Usage.
16 Side-by-side boxplots of gas and electricity consumption pre- and post-construction (see Figure 3 for the gas boxplots) offer insight into the main questions in the case. The boxplots suggest that we consume more of each utility now that we have the new room, though the increase is more pronounced in the case of the natural gas.
Figure 3. Boxplots of Gas Usage Pre- and Post-Addition.
17 In addition to the boxplots, it is natural to compute means, standard deviations, and quartiles for the two dependent variables, pre- and post-construction of the new room. In doing so, it is clear that the means and quartiles are higher after the construction, but so are the standard deviations. Consumption became more volatile after the room was added. Moreover, when we compute the same measures for the climate variables, we find that, on average, the temperature has been lower, heating degree days higher, and cooling degree days fewer since the construction. As such, it is difficult to say whether the increased utility consumption should be attributed to the new room or to the chance variation in temperature.
18 Finally, scatterplots of daily usage versus temperature or degree days can begin to reveal the relationships at work. It is particularly useful to plot gas consumption against temperature, using different colors or symbols for observations before and after the construction (Figure 4). As expected, the relationship between temperature and gas consumption is quite strong, but that between electricity consumption and temperature is much weaker. Also as expected, post-addition consumption tends to be higher at each temperature level, though not always.
Figure 4. Scatterplot of Gas Usage vs. Temperature.
19 Because the cause and effect relationship between outdoor temperature and heating is so clear, the natural starting point for inference is a simple regression of daily gas usage (in therms) on either temperature or heating degree days. It is very useful to have some discussion about what "causes" gas consumption throughout the year, as well as a discussion of the relative strengths and weaknesses of the two climate measures. Mean daily temperature is easily understood and has no lower bound (unlike degree days), but does not reflect temperature variation during the month. Heating degree days require some translation for the class, but do capture variation around the mean temperature. As an explanatory variable, degree days makes good sense, and has the virtue that the y-intercept in a model featuring degree days has a natural interpretation: it is the amount of gas required for cooking and water heating, even when there is no need for heat.
20 When we examine correlation coefficients, we find that gas usage has a stronger correlation with temperature (-0.93) than with heating degree days (0.90). As such, the first regression we consider uses daily consumption and mean temperature. The slope of the line is negative: the higher the temperature, the less gas consumed. Minitab output for such a regression is shown here.
The regression equation is
GaspDay = 15.4 - 0.217 Temp

71 cases used 10 cases contain missing values

Predictor        Coef       StDev          T        P
Constant      15.3677      0.5049      30.44    0.000
Temp         -0.21696     0.01036     -20.94    0.000

S = 1.314       R-Sq = 86.4%     R-Sq(adj) = 86.2%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         1      757.09      757.09    438.54    0.000
Error             69      119.12        1.73
Total             70      876.21

Unusual Observations
Obs       Temp    GaspDay         Fit   StDev Fit    Residual    St Resid
 38       50.0      1.900       4.520       0.160      -2.620      -2.01R
 65       30.0     11.600       8.859       0.230       2.741       2.12R
 67       37.0     11.600       7.340       0.184       4.260       3.27R

R denotes an observation with a large standardized residual
21 These results are consistent with the hypothesized relationship, and all of the test statistics indicate a significant relationship. Of course, this model tells us nothing at all about the impact of the new room, but does lend credence to the idea that we should control for seasonal variation as we examine the mean consumption pre- and post-construction. This output also provides the occasion for a common sense discussion of unusual observations, and students are quite able to theorize about the processes which give rise to such observations.
22 The next regression introduces the presence of the new room into the model. This is accomplished via a dummy variable called NewRoom, which equals 0 prior to construction, and 1 thereafter. As such, the estimated coefficient of NewRoom is the marginal increase in gas usage, after controlling for variation in temperature.
The regression equation is
GaspDay = 15.0 - 0.215 Temp + 1.11 NewRoom

71 cases used 10 cases contain missing values

Predictor        Coef       StDev          T        P
Constant      15.0060      0.4885      30.72    0.000
Temp        -0.214562    0.009775     -21.95    0.000
NewRoom        1.1125      0.3521       3.16    0.002

S = 1.236       R-Sq = 88.1%     R-Sq(adj) = 87.8%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         2      772.34      386.17    252.80    0.000
Error             68      103.87        1.53
Total             70      876.21

Source       DF      Seq SS
Temp          1      757.09
NewRoom       1       15.25

Unusual Observations
Obs       Temp    GaspDay         Fit   StDev Fit    Residual    St Resid
  8       50.0      7.000       4.278       0.169       2.722       2.22R
 31       33.0     10.800       7.925       0.216       2.875       2.36R
 67       37.0     11.600       8.180       0.317       3.420       2.86R
 74       53.0      1.900       4.747       0.321      -2.847      -2.38R
 75       40.0      5.000       7.536       0.312      -2.536      -2.12R

R denotes an observation with a large standardized residual
23 This model is a slight improvement in several respects: the standard error is reduced (from 1.314 to 1.236), the adjusted coefficient of multiple determination is slightly increased (from 86.2% to 87.8%), and all p-values indicate significant results as before. Moreover, we have our first estimate of the increase in demand for gas: 1.11 therms per day, since the dependent variable is mean daily usage.
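For students who want to see where the summary figures come from, the short Python sketch below recomputes them from the sums of squares and degrees of freedom in the analysis of variance table for the model with NewRoom. The variable names are hypothetical; the numbers are copied from the output.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table for the NewRoom model.
ss_regression, ss_error, ss_total = 772.34, 103.87, 876.21
df_error, df_total = 68, 70

# R-Sq is the share of total variation explained by the regression.
r_sq = ss_regression / ss_total
# Adjusted R-Sq compares mean squares, penalizing the extra predictor.
adj_r_sq = 1 - (ss_error / df_error) / (ss_total / df_total)
# S is the square root of the error mean square.
s = math.sqrt(ss_error / df_error)

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {adj_r_sq:.1%}, S = {s:.3f}")
```

Running the same arithmetic on the first model's table (error sum of squares 119.12 on 69 degrees of freedom) gives the figures being compared.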
24 This model, however, has some problems. Some of them, like serial correlation, are probably beyond the scope of a first course, but must be pointed out. A more compelling and easily addressed problem, though, is non-linearity. A plot of residuals vs. fitted values reveals a markedly curvilinear (concave up) pattern, suggesting that a linear model is not quite appropriate.
25 As a logical matter, students can quickly see that, despite the linear-looking scatter plot, it is not possible that temperature and gas usage are linearly related over all possible temperature values. In this downward sloping relationship, temperature can keep increasing, but gas consumption must stop at zero. I invite the class to apply the model to a pre-construction month in which the mean temperature was 75 degrees, and they quickly see the problem: the model predicts negative gas consumption.
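The exercise takes only a few lines of Python. The coefficients are those reported in the Minitab output for the model with the NewRoom dummy; the function name is hypothetical.

```python
# Coefficients from the Minitab output for the model with the NewRoom dummy.
INTERCEPT = 15.0060
TEMP_COEF = -0.214562
NEWROOM_COEF = 1.1125


def predicted_gas_per_day(temp_f, new_room=0):
    """Fitted mean daily gas usage (therms) from the linear model."""
    return INTERCEPT + TEMP_COEF * temp_f + NEWROOM_COEF * new_room


# A pre-construction month (new_room=0) with a mean temperature of 75 degrees.
print(round(predicted_gas_per_day(75), 2))  # -1.09
```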
26 Therefore, a better model would be one in which we fit a curve that flattens out at some appropriately high temperature. The next section discusses one such model.
27 Obviously, there are several other possible regression models involving degree days and electricity consumption. This first example can serve to illustrate how the analysis might proceed in those cases.
28 Before turning to other matters, consider a regression model of electricity consumption. This model (shown below) has three predictors: temperature, the new room dummy, and a dummy variable indicating whether electric usage is estimated. Three points deserve attention. First, the relationship between electricity consumption and temperature is much weaker than that for gas usage, with an R2 of only 35.8% (33.3% adjusted). Second, we can be confident that usage has increased with the addition of the room, on the order of about 6 kwh per day, because we estimate a coefficient of 6.181 (p-value approximately zero). Third, there is no significant difference in reported usage when the electric company estimates the meter reading, according to the p-value for the estimated coefficient of "Est."
The regression equation is
KWHpDay = 21.9 - 0.118 Temp + 6.18 NewRoom - 1.34 Est

Predictor        Coef       StDev          T        P
Constant       21.850       1.811      12.07    0.000
Temp         -0.11837     0.03209      -3.69    0.000
NewRoom         6.181       1.218       5.08    0.000
Est            -1.338       1.015      -1.32    0.192

S = 4.543       R-Sq = 35.8%     R-Sq(adj) = 33.3%

Analysis of Variance

Source            DF          SS          MS         F        P
Regression         3      886.01      295.34     14.31    0.000
Error             77     1589.31       20.64
Total             80     2475.31

Source       DF      Seq SS
Temp          1      323.71
NewRoom       1      526.48
Est           1       35.81

Unusual Observations
Obs       Temp    KWHpDay         Fit   StDev Fit    Residual    St Resid
 25       61.0     32.200      14.630       0.845      17.570       3.94R
 27       40.0      6.000      17.116       0.855     -11.116      -2.49R
 36       72.0     22.400      11.990       1.067      10.410       2.36R
 37       62.0      3.600      14.511       0.857     -10.911      -2.45R
 67       37.0     37.800      23.651       1.263      14.149       3.24R

R denotes an observation with a large standardized residual
29 After discussing the non-linearity in these data, it is a natural extension to use the data to illustrate the application of various transformations. The class should see the desirability of fitting a curve to the data, and using the natural logarithm of temperature works well in the earlier regression model. It is also instructive to consider the same transformation in a model with heating degree days as the independent variable; because degree days often equal zero, the transformation presents students with another puzzle to resolve.
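The puzzle with degree days can be made concrete with a small Python sketch. The monthly totals below are hypothetical, and the log(1 + x) shift is one common workaround offered here as an assumption, not a method prescribed in this article.

```python
import math

heating_degree_days = [0, 150, 900]  # hypothetical monthly totals

for hdd in heating_degree_days:
    try:
        print(hdd, math.log(hdd))
    except ValueError:
        # math.log(0) raises ValueError: the logarithm is undefined at zero.
        print(hdd, "log undefined")

# One possible workaround (an assumption, not the article's method): log(1 + x).
shifted = [math.log1p(hdd) for hdd in heating_degree_days]
print(shifted)
```

Whether to shift the variable, drop the zero months, or choose a different transformation is itself a useful topic for class discussion.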
30 The data are also useful for construction of seasonal indices. The climate and gas consumption data illustrate series with little trend, but very regular seasonal variation. In contrast, the electricity consumption series is much more irregular, and has trended up since the room was added. The temperature and electricity series are complete, but the gas consumption series is interrupted.
31 Perhaps more important than the computational possibilities in this dataset are the conceptual ones. Due in part to the familiarity of the subject of the data, students are able to bring their intuitions to bear on some of the fundamental, yet thorny, ideas about statistics. This section reviews a few of the ways in which core concepts in the course can be explored with this dataset.
32 This dataset provides illustrations of categorical, interval, and ratio scales of measurement. It thus provides an opportunity to introduce or review those definitions in a practical context.
33 More interestingly, we have choices to make in selecting variables for analysis. Is "therms" a better variable than "cubic feet"? What is a degree day? What difference does it make to base a regression analysis on heating degree days rather than mean daily temperature? What is lost or gained by using climatological data from Boston, which is on the coast and about twenty miles from the house?
34 Why would gas usage be different in two months with the same average temperature? For instance, the mean temperature was 33 degrees in February 1991 and in March 1992. In the former month, mean consumption was 8.5 therms per day, but it was 8.7 therms per day in the latter. Why would this occur? Students are apt to suggest behavioral factors: perhaps we were at home more often in March, and turned up the heat. They are less likely to identify differences in the variance of temperature as an explanation, though they can be persuaded with the following line of reasoning:
35 Suppose we keep the thermostat set at 65 degrees. When the temperature goes below 65, the heat comes on. Consider two months in which the mean daily temperature is 66 degrees. In the first month, the temperature miraculously remains constant at that level. How much gas would we use? In the second month, the temperature fluctuates, but the mean is 66. Would we use the same amount of gas?
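The thought experiment can be played out numerically. In the Python sketch below, the two 30-day temperature series are hypothetical, and heating degree days stand in for the demand for heat; only the 65-degree setting and the 66-degree mean come from the text.

```python
BASE = 65  # thermostat setting from the thought experiment

constant_month = [66] * 30         # temperature never leaves 66
fluctuating_month = [56, 76] * 15  # hypothetical swings, mean still 66


def heating_degree_days(temps, base=BASE):
    """Sum of shortfalls below the base temperature."""
    return sum(max(base - t, 0) for t in temps)


for month in (constant_month, fluctuating_month):
    mean = sum(month) / len(month)
    print(mean, heating_degree_days(month))
```

Both months have a mean of 66 degrees, yet only the fluctuating month accumulates any heating degree days.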
36 Is there a single concept in the introductory course that baffles students more predictably and profoundly than the standard error? The formula is straightforward enough, and students can quickly work with data to "get the right answer." But ask a class to explain what the standard error represents, and puzzled stares abound.
37 The mean temperature variable in this dataset might help a few students to get an insight into the concept. In this part of the country, temperatures in September are highly variable. Nights tend to be cool, but daytime temperatures are wildly erratic. Were the class to record temperatures hourly for the month and compute the sample standard deviation, it would be fairly high. Like the sample mean, the sample standard deviation would be different next September, and indeed it would be different if Amy records the temperatures on the hour, and Bill records them on the half hour. Any sample statistic will depend on the particular sample we have and on the unknown value of the population parameters.
38 In our dataset, we have mean temperatures for seven consecutive Septembers. The mean values are these:
62 61 61 62 64 64 64
Asking the class (a) what they notice about these seven mean values, and (b) why they are all so similar can go a long way towards an understanding of what the standard error of the mean represents, and why the standard error is so much smaller than the standard deviation of x.
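As a quick check on how tightly the seven September means cluster, their sample standard deviation can be computed directly. This Python sketch uses only the seven values quoted above; the variable names are hypothetical.

```python
import statistics

# Mean September temperatures for the seven years in the dataset.
september_means = [62, 61, 61, 62, 64, 64, 64]

# Sample standard deviation of the monthly means.
spread_of_means = statistics.stdev(september_means)
print(round(spread_of_means, 2))  # 1.4
```

A spread of about 1.4 degrees among monthly means gives students a concrete number to set against the much larger variability they would expect among individual hourly readings.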
39 It is easy for students to see that warmer temperatures "cause" us to use less natural gas: we don't turn up the heat, we may cook a bit less, and the hot water stays hot. That fairly clear causal chain manifests itself in variables that have a high negative correlation.
40 In the case of electricity usage, though, the chain is much less clear. What does outdoor temperature have to do with electricity consumption, absent air conditioning? The class can speculate about why we might use more or less electricity in summer or winter. The correlation between kwh per day and mean temperature is -.362, suggesting that we use more electricity in colder months. Via what mechanism does cold weather cause us to use more kilowatt hours? Surely, the fan in the furnace runs more in cold weather, but then again, the refrigerator and window fans work harder in warm weather. Almost certainly, the real factor is daylight -- itself a correlate of temperature.
41 Though other datasets provide more drama or more direct connection to a student's major field of study, this set is surprisingly rich. I think that my students connect with it due to two factors: the basic story is familiar and understandable, and it's my house. The second factor obviously is not transferable to other settings, though anyone with a tendency to hoard old utility bills can easily construct a similar dataset. In fact, students may want to compile a similar dataset from their family files. From my perspective, the homeliness of this dataset is its greatest virtue, followed closely by the fact that it can be revisited and referred to at several points during the course.
42 The file utility.dat.txt contains the raw data. The file utility.txt is a documentation file containing a brief description of the dataset.
I would like to acknowledge considerable help from Bob Hayden and Norton Starr. Participation at last year's New England Isolated Statisticians' Meeting was part of the inspiration for this work.
Columns
 1 -  7   Observation month (formatted mmm-yy)
10 - 11   Number of days in the month
14 - 15   Mean monthly temperature in Boston, in degrees Fahrenheit
17 - 20   Mean natural gas usage per day for the month, in therms
23 - 25   Total therms used for the month
28 - 29   Days in the gas company billing cycle for the month
31 - 34   Total kilowatt hours consumed in the month
36 - 39   Mean kilowatt hours per day for the month
42 - 43   Days in the electric company billing cycle for the month
46        Dummy variable for method of determining kwh for the month (0 = actual month-end meter reading, 1 = estimated reading)
48 - 51   Total heating degree days for the month
54 - 56   Total cooling degree days for the month
58        Dummy variable for the new room (0 = pre-addition, 1 = post-addition)

Values are aligned and delimited by blanks. A therm is a standard measure of the heating capacity of a cubic foot of natural gas. Due to changes in air temperature during the year, the heating capacity varies from month to month.
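Readers who prefer to load the file outside a statistics package could slice each record by the documented column positions. The Python sketch below is one possible approach under stated assumptions: the field names are hypothetical labels, and treating any unparseable numeric field as missing is a guess about how missing values are marked in the file.

```python
# (name, first column, last column), 1-indexed and inclusive, as documented above.
# The field names are hypothetical labels chosen for this sketch.
FIELDS = [
    ("month", 1, 7),
    ("days_in_month", 10, 11),
    ("temp", 14, 15),
    ("gas_per_day", 17, 20),
    ("therms", 23, 25),
    ("gas_days", 28, 29),
    ("kwh", 31, 34),
    ("kwh_per_day", 36, 39),
    ("elec_days", 42, 43),
    ("estimated", 46, 46),
    ("hdd", 48, 51),
    ("cdd", 54, 56),
    ("new_room", 58, 58),
]


def parse_line(line):
    """Split one fixed-width record into a dict; unparseable numbers become None."""
    record = {}
    for name, first, last in FIELDS:
        # Convert the 1-indexed inclusive range to a Python slice.
        text = line[first - 1:last].strip()
        if name == "month":
            record[name] = text
            continue
        try:
            record[name] = float(text)
        except ValueError:
            record[name] = None  # assumed handling of missing-value markers
    return record
```

Applied to the data file, usage might look like `records = [parse_line(row) for row in open("utility.dat.txt")]`, after which the missing gas observations appear as `None`.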
Department of Business Administration
320 Washington Street
Easton, MA 02357-1150