NAME: CentralPark.dat
TYPE: Weather Data (Markov Chains)
SIZE: 10,354 observations, 128 variables

ARTICLE:
To ski or not to ski: Estimating transition matrices to predict tomorrow's snowfall using real data

DESCRIPTIVE ABSTRACT:
The Global Historical Climatology Network (GHCN)-Daily database contains extensive daily temperature (TMAX, TMIN), precipitation (PRCP), and snowfall (SNOW, SNWD) records from around the world. With these records, real data can be used for a number of weather-related problems. In particular, we focus on the Markov chain precipitation model, as it is discussed in a variety of courses and serves as a 'standard' Markov chain introduction. Specifically, a seasonal variation of the classic Markov chain precipitation example is discussed: predicting a significant snow depth tomorrow from today's snow depth conditions. This file description should be used with the accompanying Central Park.dly data set and the included Central Park R Commands.R code. Note that the data layout for the other included location files is analogous to the Central Park data.

SOURCE:
The data were downloaded from the GHCN-Daily archive at: http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/hcn/. Note that the original filename is: USC00305801.dly. Alternatively, the file is in the accompanying zip distribution. In addition, the following variable descriptions are adapted from the readme.txt file of the GHCN archive. The reader may thus refer to: http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt for additional data description.

VARIABLE DESCRIPTIONS:

Central Park.dat
------------------------------
Variable   Position   Type
------------------------------
ID         1-11       Character
YEAR       12-15      Integer
MONTH      16-17      Integer
ELEMENT    18-21      Character
VALUE1     22-26      Integer
MFLAG1     27-27      Character
QFLAG1     28-28      Character
SFLAG1     29-29      Character
VALUE2     30-34      Integer
MFLAG2     35-35      Character
QFLAG2     36-36      Character
SFLAG2     37-37      Character
.          .          .
.          .          .
.          .          .
VALUE31    262-266    Integer
MFLAG31    267-267    Character
QFLAG31    268-268    Character
SFLAG31    269-269    Character
------------------------------

These variables have the following definitions:

ID is the station identification code.

YEAR is the year of the record.

MONTH is the month of the record.

ELEMENT is the element type. The five core elements are:

   PRCP = Precipitation (tenths of mm)
   SNOW = Snowfall (mm)
   SNWD = Snow depth (mm)
   TMAX = Maximum temperature (tenths of degrees C)
   TMIN = Minimum temperature (tenths of degrees C)

VALUE1 is the value on the first day of the month (missing = -9999).

MFLAG1 is the measurement flag for the first day of the month. There are five possible values:

   Blank = no measurement information applicable
   B = precipitation total formed from two 12-hour totals
   D = precipitation total formed from four six-hour totals
   L = temperature appears to be lagged with respect to reported hour of observation
   T = trace of precipitation, snowfall, or snow depth

QFLAG1 is the quality flag for the first day of the month. There are fourteen possible values:

   Blank = did not fail any quality assurance check
   D = failed duplicate check
   G = failed gap check
   I = failed internal consistency check
   K = failed streak/frequent-value check
   M = failed megaconsistency check
   N = failed naught check
   O = failed climatological outlier check
   R = failed lagged range check
   S = failed spatial consistency check
   T = failed temporal consistency check
   W = temperature too warm for snow
   X = failed bounds check

SFLAG1 is the source flag for the first day of the month. There are twenty possible values (including blank):

   Blank = No source (i.e., data value missing)
   0 = U.S. Cooperative Summary of the Day (NCDC DSI-3200)
   1 = U.S. Preliminary Cooperative Summary of the Day -- Transmitted
   2 = U.S. Preliminary Cooperative Summary of the Day -- Keyed from paper forms
   6 = CDMP Cooperative Summary of the Day (NCDC DSI-3206)
   A = U.S. Automated Surface Observing System (ASOS) real-time data (since January 1, 2006)
   a = Australian data from the Australian Bureau of Meteorology
   B = U.S. ASOS data for October 2000-December 2005 (NCDC DSI-3211)
   b = Belarus update
   F = U.S. Fort data
   G = Official Global Climate Observing System (GCOS) or other government-supplied data
   H = High Plains Regional Climate Center real-time data
   I = International collection (non U.S. data received through personal contacts)
   M = Monthly METAR Extract (additional ASOS data)
   N = Community Collaborative Rain, Hail, and Snow (CoCoRaHS)
   Q = Data from several African countries that had been "quarantined", that is, withheld from public release until permission was granted from the respective meteorological services
   R = NCDC Reference Network Database (Climate Reference Network and Historical Climatology Network-Modernized)
   S = Global Summary of the Day (NCDC DSI-9618)
       NOTE: "S" values are derived from hourly synoptic reports exchanged on the Global Telecommunications System (GTS). Daily values derived in this fashion may differ significantly from "true" daily data, particularly for precipitation (i.e., use with caution).
   u = Ukraine update
   X = U.S. First-Order Summary of the Day (NCDC DSI-3210)
   z = Uzbekistan update

VALUE2 is the value on the second day of the month.

MFLAG2 is the measurement flag for the second day of the month.

QFLAG2 is the quality flag for the second day of the month.

SFLAG2 is the source flag for the second day of the month.

... and so on through the 31st day of the month.

Note: If the month has fewer than 31 days, then the remaining variables are set to missing (e.g., for April, VALUE31 = -9999, MFLAG31 = blank, QFLAG31 = blank, SFLAG31 = blank).

Note that we do not incorporate any quality control measures in the described approach; however, the quality flags are available if this is desired.

STORY BEHIND THE DATA:
Despite its simplicity, a Markov chain is a reasonable model for precipitation data.
For this reason, a number of courses include the elementary Markov chain example of predicting whether or not it will rain tomorrow from today's rainfall conditions (Ross, 2003, e.g. 4.1). The purpose of this paper was to show how real data may be used for these examples, and to introduce a 'holiday' variation of the classic example. Specifically, we consider predicting a significant snow depth tomorrow from today's snow depth conditions.

Additional information about these data can be found in the "Datasets and Stories" article "To ski or not to ski: Estimating transition matrices to predict tomorrow's snowfall using real data" in the Journal of Statistics Education (Rotondi 2011).

PEDAGOGICAL NOTES:
The described data may be included in a preliminary introduction or review of Markov chains in an elementary course in stochastic analysis or applied probability. Application of these transition matrices may involve standard Markov chain analysis questions. For example, upon presentation of the transition matrix (P), the student could be asked to determine characteristics of the Markov process, such as the limiting probabilities. A possible sample worksheet is included in Rotondi (2011, Appendix A).

SUBMITTED BY:
Michael A Rotondi
Department of Epidemiology and Biostatistics
The University of Western Ontario
Room K201
London, Ontario, Canada N6A 5C1
mrotondi@uwo.ca
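
ILLUSTRATION (READING ONE RECORD):
The fixed-width layout given under VARIABLE DESCRIPTIONS can be read with simple string slicing. The following Python sketch is not part of the accompanying R code; the names parse_dly_line, DailyValue and make_demo_line are hypothetical, and the demonstration record is synthetic rather than taken from the Central Park file. It assumes each record is 269 characters wide and that short lines have only lost trailing blanks.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

MISSING = -9999  # missing-value code given in the variable descriptions


class DailyValue(NamedTuple):
    day: int               # day of the month, 1-31
    value: Optional[int]   # None when the file holds -9999
    mflag: str             # measurement flag
    qflag: str             # quality flag
    sflag: str             # source flag


def parse_dly_line(line: str) -> Tuple[Dict[str, object], List[DailyValue]]:
    """Split one fixed-width record into header fields and 31 daily values."""
    # Pad to the full record width in case trailing blanks were stripped.
    line = line.rstrip("\n").ljust(269)
    header = {
        "id": line[0:11],           # positions 1-11
        "year": int(line[11:15]),   # positions 12-15
        "month": int(line[15:17]),  # positions 16-17
        "element": line[17:21],     # positions 18-21
    }
    days = []
    for d in range(31):
        start = 21 + 8 * d          # VALUE1 begins at position 22 (index 21)
        raw = int(line[start:start + 5])
        days.append(DailyValue(
            day=d + 1,
            value=None if raw == MISSING else raw,
            mflag=line[start + 5],
            qflag=line[start + 6],
            sflag=line[start + 7],
        ))
    return header, days


def make_demo_line() -> str:
    """Build a synthetic SNWD record (made-up values, for illustration only)."""
    values = [0, 25, 51] + [MISSING] * 28
    body = ""
    for v in values:
        flags = "   " if v == MISSING else "  0"  # blank M/Q flags, source flag 0
        body += f"{v:5d}" + flags
    return "USC00305801" + "2010" + "12" + "SNWD" + body


header, days = parse_dly_line(make_demo_line())
print(header["element"], header["year"], header["month"])
print([dv.value for dv in days[:4]])
```

Each day occupies eight characters (a five-character value followed by three one-character flags), which is why the slice for day d starts at index 21 + 8(d - 1). Missing values are returned as None so that they cannot be mistaken for a measured depth.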
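
ILLUSTRATION (ESTIMATING P AND THE LIMITING PROBABILITIES):
The questions suggested under PEDAGOGICAL NOTES can be tried on a small scale with the Python sketch below. It estimates a two-state transition matrix by row-normalising the observed transition counts (a common estimator, not necessarily the exact procedure of the article) and then computes the limiting probabilities in closed form. The 50 mm threshold for a "significant" snow depth, the function names, and the short depth series are all assumptions made for illustration; they are not taken from the data set or from Rotondi (2011).

```python
from typing import List, Optional, Sequence

THRESHOLD_MM = 50  # hypothetical cut-off for a "significant" snow depth (SNWD, mm)


def to_states(depths_mm: Sequence[Optional[int]],
              threshold: int = THRESHOLD_MM) -> List[Optional[int]]:
    """Map depths to states: 1 = significant, 0 = not significant, None = missing."""
    return [None if d is None else int(d >= threshold) for d in depths_mm]


def estimate_transition_matrix(states: Sequence[Optional[int]]) -> List[List[float]]:
    """Estimate P[i][j] = P(tomorrow = j | today = i) from consecutive days."""
    counts = [[0, 0], [0, 0]]
    for today, tomorrow in zip(states, states[1:]):
        if today is None or tomorrow is None:
            continue  # skip any pair that touches a missing day
        counts[today][tomorrow] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("a state was never observed as 'today'")
        matrix.append([c / total for c in row])
    return matrix


def limiting_probabilities(P: Sequence[Sequence[float]]) -> List[float]:
    """Solve pi = pi P for a two-state chain.

    With a = P[0][1] and b = P[1][0], pi = (b, a) / (a + b). This is the
    limiting distribution when 0 < a + b < 2; for a = b = 1 the chain is
    periodic and the result is only a stationary distribution.
    """
    a, b = P[0][1], P[1][0]
    if a + b == 0:
        raise ValueError("both states are absorbing; no unique solution")
    return [b / (a + b), a / (a + b)]


# Made-up daily snow depths in mm (None marks a missing day).
demo_depths = [0, 0, 60, 80, 70, 0, None, 0, 55, 60, 0, 0]
P = estimate_transition_matrix(to_states(demo_depths))
pi = limiting_probabilities(P)
print("P  =", P)
print("pi =", pi)
```

When several winters are combined, the last day of one season should not be paired with the first day of the next; placing a None between seasons in the depth series is one simple way to prevent such a pair from being counted.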