Mitchell R. Watnik
University of Missouri-Rolla
Journal of Statistics Education v.6, n.2 (1998)
Copyright (c) 1998 by Mitchell R. Watnik, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the author and advance notification of the editor.
Key Words: Exploratory data analysis; Model selection and validation; Regression; Stepwise model selection.
Well-defined measures of performance are readily available for baseball players, making the modeling of their salaries a popular statistical exercise. In this article, the salaries for non-pitchers for the 1992 Major League Baseball season are provided, along with numerous measures of the players' previous year's performances. Also included are indicators of each player's ability to switch teams. This dataset is useful in upper-division regression analysis courses because it exhibits many "real world" difficulties that can be remedied using techniques outlined in the course.
1 Linear regression is a core course in most statistics programs, and many science and social science programs also employ linear regression techniques. Frequently, textbooks provide datasets that exhibit properties discussed in the section immediately preceding the question involving the data, but datasets whose analysis requires many of the techniques covered in even an elementary course are harder to come by.
2 The dataset discussed here is different. Its analysis can result in a model that explains much of the variation in salary, but the process requires employment of many of the techniques covered in regression analysis courses. This type of dataset was used with great success as part of the 1988 ASA Data Analysis Exposition; see Denby (1988) or Hoaglin and Velleman (1995). A "good" analysis of this dataset results in a model that has many features interesting to economics students or students who are baseball fans. I used this dataset as part of a take-home final in my upper-division regression modeling class.
2. The Dataset
3 The dataset consists of information about Major League Baseball players. The response variable is their 1992 salaries (measured in thousands of dollars and obtained from the New York Times of November 19, 1992). Possible explanatory variables include various measures of the players' 1991 performance. (See the Appendix for descriptions of each variable.) These data were obtained from the Sacramento Bee of October 15, 1991. Students who are not familiar with baseball may be made aware that, with the exception of strike-outs and errors, all of these variables would sensibly be positively correlated with salary.
4 The last four numeric variables are dummy variables indicating "free agency eligibility," "free agent in 1991/2," "arbitration eligibility," and "arbitration in 1991/2." The special 1991/2 dummy variables are used because the players' union argued that owners colluded to keep the salary of free agents in 1991-2 lower. A list of free agents was obtained from the New York Times of November 13, 1991, and a list of players undergoing arbitration in 1992 was published in the New York Times on February 23, 1992. The reason these variables are important is that, at the time, baseball had rules stating that a player could not go to the team of his choice unless he was "free agent eligible," and he could only be eligible if he had a certain amount of experience. From an economics point of view, it seems reasonable that if a player is not able to market himself to the highest bidder, his salary will not be as high. At the time, "arbitration" was for players who did not have enough experience to be free agents, but had some experience in the league. In this case, the player and his team would go to an appointed "arbitrator" who would choose between the player's suggested salary and the team's suggested salary. Players who were neither "free agent eligible" nor "arbitration eligible" either accepted what their team was willing to pay them or did not play.
5 There are some possibly significant interactions between the four dummy variables and the quantitative variables "runs," "runs batted in," "home runs," and "batting average." For example, an analyst might multiply runs by each of the four dummy variables to get four interaction terms. These interactions could be interesting, if found to be significant, because home runs and batting average are measures of individual performance, while runs and runs batted in are measures of a player's contribution to the team. If free agents increase their salaries for better individual performances, it would give one some insight into the priorities owners use to determine their salary structures. Similarly, the significance of interaction terms involving arbitration, if any, would give insight into the arbitrators' decision-making process.
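The interaction construction described above can be sketched as follows. This is a minimal illustration and not the author's code: the short field names and the single player record are hypothetical placeholders, not values taken from the dataset.

```python
# Hypothetical short names for the four performance measures and the four dummies.
QUANT = ["runs", "rbi", "home_runs", "batting_avg"]
DUMMIES = ["fa_eligible", "fa_1991_92", "arb_eligible", "arb_1991_92"]

def add_interactions(row):
    """Return a copy of row with one product term per (quantitative, dummy) pair."""
    out = dict(row)
    for q in QUANT:
        for d in DUMMIES:
            out[f"{q}_x_{d}"] = row[q] * row[d]
    return out

# Made-up player record, for illustration only.
player = {"runs": 60, "rbi": 55, "home_runs": 12, "batting_avg": 0.280,
          "fa_eligible": 1, "fa_1991_92": 0, "arb_eligible": 0, "arb_1991_92": 0}
expanded = add_interactions(player)
```

Crossing four quantitative variables with four dummies yields sixteen product terms. Each product is zero for players outside the corresponding category.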
6 The last variable in each data row is the player's name, enclosed in double quotes. If your software has difficulty handling this text data, you may choose to manually delete the character information. This information was obtained from the Society for American Baseball Research (SABR) at ftp://skypoint.com/pub/members/a/ashbury/sabr/SALARIES/1992_salaries_baseball and CNN Sports Illustrated at http://www.cnnsi.com/baseball/mlb/historical_profiles/. The careful reader might observe that some players' salaries as listed on the SABR web site differ from the ones in the dataset -- especially the outlying observations pointed out in Section 3. This is caused by SABR's using salaries on Opening Day, while the salaries obtained from the New York Times are recorded as of the trading deadline on September 1, 1992.
7 Referees noted that career variables such as number of games played or number of at-bats, which I have left out, could have an impact on the model. This is, of course, true. I would welcome hearing of any such discovery. Students can obtain career data for players at the CNNSI site referred to above.
8 Once all of the explanatory variables are in place, students may start their analyses. First, they should obtain a histogram of the response values and notice that it is highly right-skewed. (This implies that a few players are making substantially more than the rest.) They may model the response using all of the independent variables and look at a QQ plot of the residuals; the residuals are also heavy-tailed. Therefore, the response should be transformed. Taking the log of the salary is an appropriate transformation here, but it is worth noting that the response is now the log of salary; the interpretations of the beta estimates will differ from those for the players' actual salaries.
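The effect of the log transformation on skewness can be checked numerically as well as graphically. The sketch below is only an illustration under stated assumptions: the salary values are invented to mimic a right-skewed shape and are not taken from the dataset, and the moment-based skewness coefficient is one of several reasonable summaries.

```python
import math

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up salaries in thousands of dollars (NOT from the dataset):
# most values are small and a few are very large.
salaries = [110, 120, 150, 200, 300, 450, 800, 1500, 3000, 6000]
log_salaries = [math.log(s) for s in salaries]

raw_skew = skewness(salaries)       # strongly positive for these values
log_skew = skewness(log_salaries)   # noticeably closer to zero
```

A histogram and a QQ plot remain the primary diagnostics. The coefficient simply puts a number on what those plots show.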
9 After transforming the response variable, students can start employing their stepwise or other model-building techniques. This dataset is large enough to allow splitting it in order to use part of the data to select a model and the rest to be held aside for model validation. Nonetheless, computer programs may not be able to do an exhaustive search, since around 30 independent variables (including interactions) give on the order of 2^30, or roughly one billion, candidate subsets.
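The data split described above might be sketched as below. The 70/30 proportion, the fixed seed, and the row count of 100 are all assumptions made for illustration; the article does not prescribe any of them.

```python
import random

def split_indices(n_rows, selection_fraction=0.7, seed=0):
    """Shuffle row indices and split them into selection and validation sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(round(selection_fraction * n_rows))
    return sorted(indices[:cut]), sorted(indices[cut:])

# n_rows is a placeholder; use the number of players actually read from the file.
selection_rows, validation_rows = split_indices(n_rows=100)
```

Model selection would then use only the rows in selection_rows, with validation_rows held back to check how well the chosen model predicts.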
10 Once candidate models are obtained, students should be encouraged to obtain a QQ plot of the residuals and a plot of the residuals versus the fitted values. Both indicate that there are outliers! The same three outliers were consistently identified in my students' final models: Gary Pettis, Juan Samuel, and Lance Parrish. (See the comment in Section 2 about differences between SABR and the dataset provided here.) These outliers are influential, and, as it turns out, they are "unfairly" included in the dataset because these players were actually paid much more than the dataset implies. However, because of a few obscure rules in baseball regarding the "waiving" of players, their salaries from their current teams were substantially less than they should have been. That is, these players were actually being paid a much higher salary, but by their former teams. Students will thus have to find those observations and delete them.
11 After deleting the outliers, students can start to seriously think about model selection techniques. There may be two or three "good" models singled out by any criterion, and different criteria often point to different models, as is the case with this dataset. After their previous experience with the data not conforming to ideal standards, the students should know to plot the residuals against the fitted values and to obtain QQ plots of the residuals for each of their models. Now, it is up to the students to choose a final model and justify it!
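To see how two criteria can disagree, consider the sketch below. AIC and BIC are used here only as familiar examples, since the article does not name the criteria involved, and the residual sums of squares and parameter counts are invented for illustration. Both formulas are the usual Gaussian linear-model versions, stated up to an additive constant.

```python
import math

def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC for a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical fits to the same n observations (numbers invented for illustration):
# model A is smaller; model B has more terms and a slightly lower residual sum of squares.
n = 100
models = {"A": {"rss": 40.0, "k": 5}, "B": {"rss": 37.0, "k": 8}}

best_by_aic = min(models, key=lambda m: aic(models[m]["rss"], n, models[m]["k"]))
best_by_bic = min(models, key=lambda m: bic(models[m]["rss"], n, models[m]["k"]))
```

With these invented numbers, AIC prefers the larger model B while BIC, which penalizes each extra parameter more heavily, prefers model A. Such comparisons are meaningful only when every model is fit to the same response (here, log salary) and the same observations.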
12 As part of this assignment, I asked my students to interpret at least one of the parameter estimates for the quantitative variables in terms of a player's estimated salary (as opposed to log salary). Similarly, I asked them to interpret at least one of the parameter estimates for the dummy variables. In this non-standard scenario, the interpretation of the estimates is not "as Xi increases by one unit, the estimated mean increase in Y is b units, holding all other Xj variables constant," and I wanted to keep my students from getting in the habit of using that interpretation without thinking about the situation. My linear regression course had a substantial number of graduate students, and I always try to ask questions requiring interpretation of results.
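One way to carry out the back-transformation is sketched below. When the response is log salary, a coefficient b acts multiplicatively on the back-transformed salary prediction: raising X by one unit multiplies it by exp(b). The coefficient values in the example are hypothetical and chosen only to show the arithmetic.

```python
import math

def percent_change(b, delta=1.0):
    """Percent change in back-transformed salary when X rises by delta in a log-salary model."""
    return 100.0 * (math.exp(b * delta) - 1.0)

# Hypothetical coefficient values, not estimates from the dataset.
effect_per_unit = percent_change(0.02)   # slope on a quantitative variable
effect_of_dummy = percent_change(0.50)   # dummy variable switching from 0 to 1
```

For small coefficients the percent change is close to 100 times b. For larger ones, as is typical for dummy variables, the two can differ substantially, and that is the reason the exact expression is worth using.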
13 It turned out that, in all of the different models my students chose as their "final" models, the estimate for the constant term was remarkably close to the natural log of the minimum salary in 1992. This led the students who made that discovery to state that at least that statistic had an intuitively sensible value in their interpretation. Furthermore, it strengthened their belief in the model building process and in their choice for the final model.
14 This paper discusses the modeling of baseball players' salaries as a function of their performance the previous year and their ability to market their skills to other teams. The process of properly analyzing this dataset requires students in a linear regression course to employ many of the tools introduced in such a course -- including diagnostics of the assumptions associated with standard linear regression and remedial measures to be taken when the assumptions are not met -- because it has a few properties not found in "textbook cases." My original analysis of this dataset led me to discover that different information criteria chose different models, and, because the selected models were not nested, standard hypothesis testing procedures did not apply. I investigated non-nested model selection tests and wrote my Ph.D. thesis on the subject.
15 The file baseball.dat.txt contains the raw data. The file baseball.txt is a documentation file containing a brief description of the dataset.
I wish to thank Richard Green of the UC Davis Agricultural Economics Department for encouraging me to pursue the use and publication of this study beyond his econometrics course. In addition, I offer thanks to Tom Kirchoff, the anonymous referees employed by the Journal of Statistics Education on my paper, and its section editor and editor, Robin Lock and Jackie Dietz, respectively, for their constructive criticisms and suggestions on the final draft which improved this paper. I take full responsibility for any typos in the dataset and any errors in the text of this paper which may remain.
Appendix - Key to Variables in baseball.dat.txt
Columns    Variable
1 - 4      Salary (in thousands of dollars)
6 - 10     Batting average
12 - 16    On-base percentage (OBP)
18 - 20    Number of runs
22 - 24    Number of hits
26 - 27    Number of doubles
29 - 30    Number of triples
32 - 33    Number of home runs
35 - 37    Number of runs batted in (RBI)
39 - 41    Number of walks
43 - 45    Number of strike-outs
47 - 48    Number of stolen bases
50 - 51    Number of errors
53         Indicator of "free agency eligibility"
55         Indicator of "free agent in 1991/2"
57         Indicator of "arbitration eligibility"
59         Indicator of "arbitration in 1991/2"
61 - 79    Player's name (in quotation marks)

Players' batting averages are calculated as the ratio of number of hits to the number of hits plus the number of outs. On-base percentage is the ratio of number of hits plus the number of walks to the number of hits plus the number of walks plus the number of outs. Therefore, the batting average is less than or equal to the on-base percentage. A batting average above .300 is very good; OBP above .400 is excellent. An RBI is obtained when a runner scores as a direct result of a player's at-bat.
I believe that number of hits serves as a proxy for the amount of playing individuals did in the year. There is a statistic for number of games played available, but this statistic counts any entry into the game, even defensive participation for a single out, the same as participating for the entire contest.
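The column key in this appendix can be turned into a small fixed-width parser, sketched below. The field names are hypothetical short labels, the sample line is invented to match the column layout, and the exact formatting of numbers in baseball.dat.txt (for example, whether averages carry a leading zero) is an assumption.

```python
# (field name, first column, last column), using the 1-based columns from the key.
LAYOUT = [
    ("salary", 1, 4), ("batting_avg", 6, 10), ("obp", 12, 16),
    ("runs", 18, 20), ("hits", 22, 24), ("doubles", 26, 27),
    ("triples", 29, 30), ("home_runs", 32, 33), ("rbi", 35, 37),
    ("walks", 39, 41), ("strikeouts", 43, 45), ("stolen_bases", 47, 48),
    ("errors", 50, 51), ("fa_eligible", 53, 53), ("fa_1991_92", 55, 55),
    ("arb_eligible", 57, 57), ("arb_1991_92", 59, 59), ("name", 61, 79),
]

def parse_record(line):
    """Slice one fixed-width data row into a dictionary of typed values."""
    record = {}
    for field, first, last in LAYOUT:
        text = line[first - 1:last].strip()   # convert 1-based columns to a slice
        if field == "name":
            record[field] = text.strip('"')   # drop the surrounding double quotes
        elif field in ("batting_avg", "obp"):
            record[field] = float(text)
        else:
            record[field] = int(text)
    return record

# Invented row assembled from fixed-width pieces; not a real player.
sample = " ".join(["1000", "0.280", "0.350", " 60", "120", "20", " 3", "12",
                   " 55", " 40", " 70", " 8", " 5", "1", "0", "0", "0",
                   '"Sample Player"'])
rec = parse_record(sample)
```

Parsing by column position sidesteps the difficulty some software has with the quoted name field, because the name is simply the last slice of each row.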
References
Denby, L. (1988), Dataset from Poster Session sponsored by the Section on Statistical Graphics of the American Statistical Association, on Statlib, ed. Michael Myers. (http://stat.lib.cmu.edu/datasets)
Hoaglin, D., and Velleman, P. (1995), "A Critical Look at Some Analyses of Major League Baseball Salaries," The American Statistician, 49, 277-285.
Mitchell R. Watnik
Department of Mathematics and Statistics
University of Missouri-Rolla
Rolla, MO 65409-0020