Constance H. McLaren and Bruce J. McLaren
Indiana State University
Journal of Statistics Education Volume 11, Number 1 (2003), ww2.amstat.org/publications/jse/v11n1/datasets.mclaren.html
Copyright © 2003 by Constance H. McLaren, Bruce J. McLaren, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Budgeting; Electrical utilities; Forecasting; Introductory Statistics; Spreadsheet; Time Series.
The Electric Bill dataset contains monthly household electric billing charges for ten years. In addition, there are values for such potential explanatory variables as temperature, heating and cooling degree days, number in household, and indicator variables for a new electric meter and new heat pumps. The values provide a real dataset to use for applications ranging from simple graphical analysis through a variety of time series and causal forecasting methods. The dataset also is suited to spreadsheet applications for break-even calculations and optimization. With knowledge of the utility’s tiered rate function, the bill amount can be converted to an estimate of the number of kilowatt hours used. A series of assignment questions is included and the accompanying Instructor’s Manual provides solutions.
The dataset contains actual data for an all-electric home (approximately 2700 square feet) in the Midwest. The data have been used in introductory business statistics classes, computer tools classes, and forecasting classes. Students seem to have an intuitive understanding of the data’s inherent fluctuations and realize the advantage of understanding the series for household budgeting purposes. The variable of primary interest is the dollar amount of the monthly electric bill. (Although monthly energy consumption may have been a more realistic dependent variable for forecasting than the amount billed, this information was not initially available. However, estimated consumption can be calculated if the rate function is known. The tiered rate function and other necessary values have been obtained and are provided in the dataset.) Other variables are provided as candidates for inclusion in multiple regression models. The analysis that follows is restricted to using the monthly bill as the dependent variable, but instructors may alternatively use estimated energy consumption if they wish.
The data were first gathered to help determine the effect of purchasing new heating and cooling equipment. The local utility offered both a rebate for equipment purchase and lowered electrical rates for a limited time for homes with more efficient units. After using the data for analysis and equipment purchase decisions, the problem was taken to a college level computing tools class for use in a spreadsheet application of break-even analysis. Later, the complete time series was used to illustrate concepts in college level business statistics and forecasting classes.
Students benefit from seeing classroom concepts applied to personal decision making as well as to “business” problems. Many of them feel (incorrectly) that the statistical concepts they are required to study will be used by someone else and not be part of their professional lives, let alone be something they might use voluntarily. By incorporating data such as these in the classroom, they learn that there are extensive and ongoing connections between what they study and what they can use to make decisions.
The bill amount and other data are taken from personal records. Temperatures and heating and cooling degree days information was taken from NOAA sites such as lwf.ncdc.noaa.gov/oa/documentlibrary/hcs/hcs.html. Heating and cooling degree days computation methods were revised July 2002.
Rate and tariff information needed to convert the monthly bill amount to the number of kilowatt hours used was supplied by CinergyPSI and is available online at www.cinergypsi.com/residential_services/tariffs/psi_energy_tariff.asp. The authors are particularly grateful for the guidance and data provided by CinergyPSI.
For each month, the BILL (in dollars including sales tax) is provided. The sales tax rate did not change from 5% during this period. Potential explanatory variables are included for multiple regression. The values for TEMP (average temperature), HDD (heating degree days), and CDD (cooling degree days) repeat every twelve months. Average temperature is based upon the average of the high and low values per month over a thirty year period. A complete description of the extensive statistical procedures followed by the National Climatic Data Center to remove bias and ensure a “midnight to midnight” time horizon for the daily reporting that is used to create the monthly averages can be found at lwf.ncdc.noaa.gov/oa/climate/normals/usnormalsprods.html (see "Climatography of the US No. 85"). Because we consider forecasting to be the primary objective of multiple regression models for this dataset, using long-run heating and cooling degree day averages is appropriate; students would not be in the position to forecast values for these independent variables. The family SIZE (number in household) varies throughout the period according to school schedules. The electric meter failed in 1999 and a new meter was installed. The larger of two heat pumps was replaced in 1997 and the smaller heat pump was replaced in 1998; the new units are significantly more energy efficient. The METER, PUMP1, and PUMP2 indicator variables in the dataset reflect these changes.
Converting the bill amount to the estimated number of kilowatt hours used is the subject of one of the spreadsheet assignments. Billing amounts at this utility are based on a connection charge, a tiered tariff function, and a collection of as many as eight riders that apply to all kilowatt hours used. Riders are amounts approved by the state utility rate commission to compensate for items such as fuel costs, environmental issues, and capital recovery. Positive amounts reflect additional costs passed to the consumer. Negative amounts indicate rebates. The variable RIDER TOTAL shows the net amount for each month of the time series. Rates for the connection charge and tariff function appear in a table in the Instructor’s Manual. The solution to this assignment is provided in the CONSUMPTION column.
There is one missing value for BILL and there are several outliers (including an actual value of $0.00) during the months that the meter was failing. Instructors can use these instances to address data cleansing, or they can replace these values prior to giving the students access to the dataset and skip the discussion. The CONSUMPTION amounts for the months with outliers are based on the suggested replacement values. The time series plot of BILL is shown in Figure 1 below. Minitab leaves out the missing value from January 1994 (index 37) and connects the adjacent values. The value of 0 for August 1999 is accurate.
Figure 1. BILL vs. Index.
This dataset can be used at many points in a statistics class. It is also appropriate for a class in operations management or a class dedicated to forecasting. The spreadsheet assignments are applicable to a second level computing tools class. If more than just a few of the observations are used, students should have access to software. The data have been used in several different classes to illustrate graphing, descriptive statistics, correlation, seasonal decomposition, a variety of smoothing methods, ARIMA models, and multiple regression. Minitab and Excel are both applicable tools for analysis.
We offer the following successive assignments for use in the classroom. Instructors would certainly have to choose those assignments that fit the educational objectives of the class and the abilities of the students. A detailed set of assignment questions and solutions is found in the Instructor’s Manual.
Assignment 1: Graphing and Data Examination
Use only the monthly BILL data. Ask the students to examine it both numerically and visually. The data are markedly seasonal and moderate trend exists. Students should notice that there is a problem with outliers and missing values, so data cleansing becomes an issue. Comparing measures of dispersion by month can prompt interesting discussions. Plotting each year as a separate series on the same axes can reveal patterns.
Assignment 2: Seasonal Decomposition
Use only the monthly BILL data for an assignment on seasonal decomposition. This time series is well suited for a group project or an individual assignment, or it could serve as a demonstration for teaching the subject. Students must determine whether an additive or multiplicative model is more appropriate for this time series. This assignment provides a good opportunity for students to use their models to develop fitted values and to calculate errors. We suggest you hold out the last year’s values to test the forecasting model and to be sure that students understand how to use their model for prediction.
Assignment 3: Smoothing Methods
Use only the monthly BILL data for forecasting based on smoothing methods. Students are asked to apply simple exponential smoothing and should realize that a seasonal exponential smoothing method is more appropriate. If there is access to forecasting software, or if students are adept with spreadsheet operations, other smoothing methods should be applied. Students should continue to record error measurements.
Assignment 4: ARIMA Models (Advanced)
Use only the monthly BILL data and apply ARIMA methods to the time series. In order to determine an effective model, students must analyze and transform the data. The resulting information fosters discussion on appropriate model parameters. This assignment requires the use of Minitab or another package and is best suited for advanced study.
Assignment 5: Regression with Monthly Indicator Variables
Use the monthly BILL data and the observation number, and create patterned indicator variables for the months. Develop a multiple regression model that has an intercept, a time variable, and eleven monthly indicator variables. Lead a discussion on why eleven rather than twelve indicator variables are appropriate. This assignment can be done with Excel or Minitab. At their discretion, instructors could include a quadratic time variable.
Assignment 6: Multiple Regression
Lead a classroom discussion on what factors other than seasonality might cause fluctuations in the size of the electric bill. Students always seem eager to find explanatory relationships. Reveal the additional data, and/or ask students to search for information on heating/cooling degree days, electric rates, and other topics that may have a bearing on the size of the bill. Discuss the logic of including the average temperature variable. Discussion about data sources and data availability for a causal model is appropriate. The instructor should remind students that if the forecasting model is to be used for prediction, the values of the independent variables must be available.
Assignment 7: Report Writing
If the data are used to support an ongoing project, have the students write a final report to explain why it is important to study this kind of a time series and to address the practicality of the methods applied. They should summarize their results and defend their choice of forecasting method.
A related issue deals with a break-even decision this family had to make in 1997 when the first heat pump was replaced. The issue was whether to replace the second heat pump then or wait until it failed. This provides a good exercise for your students to work with spreadsheets to do break-even analysis. Specific cost and rate information and assignment questions can be found in the Instructor’s Manual in Appendix B.
Assignment 2: Conversion of billing amount to kilowatt hours used
When the data were gathered to use in statistics classes, the number of kilowatt hours used was not available. However, a case can be made that forecasting the consumption rather than the amount spent would lead to better analysis. In this assignment, students should gather information on the cost function in use during a particular month and any riders that were in effect and learn how the bill amount is computed. This is not a trivial assignment. Once this information is clearly understood, students can use the rate categories to calculate the number of kwh. More advanced students can develop an integer programming model and use Excel’s Solver to convert the bill amount to kwh used.
The file electricbill.dat.txt contains the raw data. The file electricbill.txt is a documentation file containing a brief description of the dataset.
1 - 3
5 - 8
10 - 12
14 - 19
21 - 24
26 - 29
31 - 33
43 - 52
54 - 58
Amount of bill (in dollars), includes 5% sales tax
Average temperature (in degrees Fahrenheit)
Heating Degree Days
Cooling Degree Days
Number of family members at home
New meter? (indicator variable, 1 = yes)
New heat pump 1? (indicator variable, 1= new)
New heat pump 2? (indicator variable, 1= new)
Total charge (per kwh) for all riders
Calculated consumption (in kwh)
The dataset contains values for January 1991 through December 2000. The values are aligned and delimited by spaces. Missing values are denoted by "*."
For more information about using the dataset and to see specific assignment questions and their solutions, see the Instructor’s Manual.
Constance H. McLaren
Indiana State University
Terre Haute, IN 47809
Bruce J. McLaren
Indiana State University
Terre Haute, IN 47809