Mary Richardson
Grand Valley State University
Neal Rogness
Grand Valley State University
Byron Gajewski
The University of Kansas Medical Center
Journal of Statistics Education Volume 13, Number 3 (2005), ww2.amstat.org/publications/jse/v13n3/richardson.html
Copyright © 2005 by Mary Richardson, Neal Rogness, and Byron Gajewski, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Active learning; Assessing normality; Blinding; Confounding variable; Kaplan-Meier survival function; Paired difference experiment; Randomization; Randomized block design; Right-censored data; Two-way analysis of variance; Wilcoxon signed rank test.
In addition to discussing the use of the activity in the introductory course, we will discuss extensions that can be used in intermediate or upper-level courses. The extensions involve assessing normality, the Wilcoxon signed rank test, Kaplan-Meier survival functions, two-way analysis of variance, and the randomized block design.
The randomized block design experiment to be discussed in Section 3.4 was assessed in a graduate course in experimental design for non-statistics majors, taught in the spring of 2003 at the University of Kansas Medical Center. The class consisted of ten Ph.D. students, seven from Nursing and three from the Department of Hearing and Speech. Blinded student evaluations (not examined by the instructor until the grades were complete) assessed the “likeability” of the experiment for learning.
For completeness, we define relevant experimental design terminology.
Units are the objects upon which measurements are made or observed.
In an experiment, the researcher actively imposes some treatment on the units in order to observe the responses.
The response variable measures an outcome of the experiment. It is the variable that is thought to depend on the explanatory variable.
The explanatory variable is a variable that is thought to explain or cause the observed outcomes. It is the variable that explains changes in the response variable.
The possible values of the explanatory variable are called the levels of that explanatory variable.
A treatment is a specific combination of the levels of the explanatory variables.
A confounding variable is a variable whose effect on the response variable cannot be separated from the effect of the explanatory variable on the response variable.
A treatment group is a group of experimental units which receive an actual treatment.
Random allocation is a planned use of chance for assigning units to treatments. Randomization tends to produce groups of experimental units that are similar with respect to potential confounding variables.
A single-blind experiment is one in which the units are ignorant of which treatment they receive. A double-blind experiment is one in which neither the units nor those working with the units knows who is receiving which treatment.
A completely randomized design is a design for which independent random samples of experimental units are selected for each treatment.
An experiment in which observations are paired and the differences are analyzed is called a paired difference experiment (matched pairs experiment). Making comparisons within groups of similar experimental units is called blocking, and the paired difference experiment is a simple example of a randomized block experiment.
Assuming a class size of 30 students, the instructor will need 60 small bags, 60 sticks of chewing gum (30 sticks of each of two brands: Brand 1 and Brand 2), 60 sticky notes, and a set of plastic gloves (for sanitary purposes when unwrapping the gum and placing the sticks in a bag). We have used various brands of gum for this activity, both regular and sugar-free stick gums. The only chewing gums we suggest avoiding for this activity are the ‘intense flavor’ gums, whose flavor duration may exceed a typical 50-minute class session. Different choices of gum brands will produce different results in terms of whether a significant difference in flavor longevity is found. However, this activity can be used irrespective of the conclusion reached from the collected data.
We collect the flavor longevity data during each of two class periods. We ask students to give suggestions for how the experiment should be carried out. For example, should everyone chew Brand 1 on the first day and Brand 2 on the second day? We agree that everyone should not chew the same brand on each day. If everyone chewed the same brand on each day, then order of chewing the brands would be a confounding variable. To prevent chewing order from being confounded with the flavor duration time, we develop a procedure for randomizing the chewing order. One possible approach is to assign each student a two-digit label, and randomly select labels assigning the corresponding students to chew Brand 1 on the first day and Brand 2 on the second day, with the remaining students chewing Brand 2 on the first day and Brand 1 on the second day. Another approach is to have each student flip a coin. If the coin lands heads, that student will chew Brand 1 then Brand 2; otherwise he/she will chew Brand 2 then Brand 1.
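The label-based randomization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a class of 30 students labeled 01 through 30, exactly half of the labels drawn for the Brand 1 first order, and a fixed seed so the output is reproducible. The helper name `assign_chewing_order` is hypothetical.

```python
import random

def assign_chewing_order(n_students=30, seed=2005):
    """Randomly split students into two equal-sized chewing-order groups.

    Hypothetical helper: labels are the two-digit strings '01'..'30'.
    """
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    labels = [f"{i:02d}" for i in range(1, n_students + 1)]
    # Randomly select half of the labels to chew Brand 1 on the first day.
    brand1_first = set(rng.sample(labels, n_students // 2))
    return {
        label: ("Brand 1", "Brand 2") if label in brand1_first
        else ("Brand 2", "Brand 1")
        for label in labels
    }

order = assign_chewing_order()
n_brand1_first = sum(1 for days in order.values() if days[0] == "Brand 1")
print(n_brand1_first, "students chew Brand 1 on the first day")
```

Drawing a fixed number of labels guarantees balanced groups, whereas the coin-flip approach randomizes each student independently and will usually produce somewhat unequal group sizes.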
We agree that the experiment should be single-blinded. That is, students should not know which gum brand they are chewing. The instructor’s knowledge of which gum brand is being chewed should not introduce bias into the experiment, so there is no need to use double-blinding.
We ask students to think about to what population the results of this experiment can be generalized. Can the results be generalized to all adults? Is it reasonable to assume that the results obtained from using students as experimental units can be generalized to the entire adult population? Are there characteristics present in college students that would influence their judgment on flavor duration, but not influence the judgment of the adult population? We agree that it seems reasonable to generalize the results to the population of all adults.
We ask students if we can make a cause and effect conclusion based on this experiment. Based upon the results of our experiment, can we conclude that a certain brand of gum results in a longer or shorter flavor duration value than another brand? We agree that our experiment may allow us to make such a conclusion.
It is important to note that we have randomized the order of the brands across days to eliminate possible confounding between day and brand of gum. Statisticians call this a “cross-over” design. Cross-over designs must account for a “learning” effect, or carry-over of the experience by the subject. Accounting for this statistically is beyond the scope of this paper, so we will assume that there is no carry-over effect. Note that when multiple sections are available, one might implement a similar design but give all subjects the same brand of gum; one may then test for day effects to test our hypothesis without resorting to more complex cross-over analyses.
We hope that, by participating in this experiment, students will have a much better feel for sampling situations in which data are paired or matched. Students are asked to explain why the two samples of flavor longevity values are not independent.
Here are some examples of correct student answers:
Students are asked to discuss a modification of the data collection scheme for the experiment that would result in independent samples.
Here are some examples of correct student answers:
(Brand 1, Brand 2) Flavor Duration in Minutes
(10,35) (30,35) (27,29) (25,15) (26,45) (25,40)
(43,30) (45,40) (30,15) (18,30) (39,20) ( 5,15)
(35,30) (45,32) (35,14) (21,22) (45,19) (45,12)
(40,23) ( 7, 9) (45,36) (18,16) (22,27) (20,21)
( 7,23) (13,30) ( 9, 8) (47,36) (20,24) (35,44)
Figure 1. Example Class Data for Sugar-free Cinnamon Stick Gum
Students calculate the differences in flavor duration values (Brand 1 minus Brand 2) and enter the differences into the appropriate column on the Data Sheet. Once these differences have been calculated, students construct a boxplot to display the distribution of the differences. Based on the boxplot, students explain whether they believe that the flavor duration differs for the two gum brands. Figure 2 shows descriptive statistics and a boxplot for the example class differences.
mean = 1.90, standard deviation = 14.17, min = -25.00, first quartile = -9.25, median = 0.00, third quartile = 13.00, max = 33.00
Figure 2. Descriptive Statistics and Boxplot of Example Class Differences (differences are formed by: Brand 1 - Brand 2)
From the boxplot we see that roughly half of the differences are negative and half are positive, indicating that there does not seem to be a difference in the flavor duration of the two brands of gum.
The test statistic formula for performing a hypothesis test on the mean difference for paired samples is introduced. Students perform a hypothesis test to determine if there is a significant difference in the mean number of minutes that the flavor lasts for Brand 1 and Brand 2 gums. The test statistic value for the example class data is 0.73, with a corresponding P-value of 0.47, well above any reasonable level of significance. Thus, we cannot conclude that there is a significant difference in the mean flavor duration, in minutes, for Brand 1 and Brand 2 chewing gums. We note that the result obtained here is typical when comparing two brands of sugar-free cinnamon stick gum.
Students give a practical interpretation of the P-value that was calculated in performing the hypothesis test. Students construct a 95% confidence interval for the mean difference in flavor duration, in minutes, for gum Brands 1 and 2 (Brand 1 - Brand 2) and explain how the confidence interval gives the same conclusion as the hypothesis test. For the example class differences, the 95% confidence interval is: (-3.39,7.19).
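The paired analysis of the Figure 1 data can be checked with a short Python sketch that uses only the standard library. The critical value 2.045 (t distribution, 29 degrees of freedom, upper-tail area 0.025) is taken from a t table and hard-coded here as an assumption of the example rather than computed.

```python
import math
import statistics

# (Brand 1, Brand 2) flavor durations in minutes, from Figure 1.
pairs = [
    (10, 35), (30, 35), (27, 29), (25, 15), (26, 45), (25, 40),
    (43, 30), (45, 40), (30, 15), (18, 30), (39, 20), (5, 15),
    (35, 30), (45, 32), (35, 14), (21, 22), (45, 19), (45, 12),
    (40, 23), (7, 9), (45, 36), (18, 16), (22, 27), (20, 21),
    (7, 23), (13, 30), (9, 8), (47, 36), (20, 24), (35, 44),
]

# Step 1: reduce the two dependent samples to one sample of differences.
diffs = [b1 - b2 for b1, b2 in pairs]
n = len(diffs)

# Step 2: one-sample t statistic on the differences (H0: mean difference = 0).
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
se_d = sd_d / math.sqrt(n)
t_stat = mean_d / se_d

# Step 3: 95% confidence interval; table value assumed for df = n - 1 = 29.
T_CRIT_29 = 2.045
ci = (mean_d - T_CRIT_29 * se_d, mean_d + T_CRIT_29 * se_d)

print(f"mean = {mean_d:.2f}, sd = {sd_d:.2f}, t = {t_stat:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the interval contains zero, it leads to the same conclusion as the two-sided test at the 5% level.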
Note to the Instructor: We have found that the safest choice for a gum flavor to use that will not result in any censored data values or data values that are close to the 50-minute cut-off is sugar-free bubble gum flavored stick gum.
Students are asked if they feel that this activity should be used in future introductory statistics classes. Although we have not found that exactly 80% of students would recommend continuing this activity (i.e., 4 out of 5), the good news is that our 95% confidence interval estimate of the percentage of students who would recommend this activity is: (81%,100%).
Students are asked if they feel that the instructions for completing the activity are clear, and if they do not think the instructions are clear, they are asked to state what they would change in order to make them clear. Overwhelmingly, students respond that they completely understood what was expected of them in completing this activity. Some example student responses are:
Students are asked if they think that participating in this activity helped them to think about independent samples versus dependent samples. Some example student responses are:
Students are asked to state why we cannot ignore pairing and analyze paired samples data as if we had two independent samples. Some example student responses are:
Exam Question: (Adapted from a question found in McClave and Sincich (2003).)
In each scenario described below, we are interested in comparing a variable measured for two different groups.
#1: A pupillometer is a device used to observe changes in an individual’s pupil dilations as he or she is exposed to different visual stimuli. The Design and Market Research Laboratories of the Container Corporation of America used a pupillometer to evaluate consumer reaction to different silverware patterns for one of its clients. Suppose five consumers were chosen at random and each was shown two different silverware patterns.
Consumer | 1 | 2 | 3 | 4 | 5 |
Pattern 1 | 1.00 | 0.95 | 1.45 | 1.20 | 0.75 |
Pattern 2 | 0.80 | 0.65 | 1.25 | 1.00 | 0.80 |
(a) These samples are (circle one): independent matched (or paired)
Because ...
(b) Calculate the value of the test statistic for performing a hypothesis test to determine if the data provide significant evidence to indicate that there is a difference in the mean pupillometer readings for the two patterns. Do not perform the hypothesis test, only calculate the value of the test statistic.
#2: A pupillometer is a device used to observe changes in an individual’s pupil dilations as he or she is exposed to different visual stimuli. The Design and Market Research Laboratories of the Container Corporation of America used a pupillometer to evaluate consumer reaction to different silverware patterns for one of its clients. Suppose ten consumers were chosen at random. Five of the consumers were shown silverware Pattern 1, and the other five consumers were shown silverware Pattern 2.
Pattern 1 readings | 1.10 | 0.90 | 1.40 | 1.25 | 0.85 |
Pattern 2 readings | 1.00 | 0.75 | 1.25 | 1.00 | 0.90 |
(a) These samples are (circle one): independent matched (or paired)
Because ...
(b) Calculate the value of the test statistic for performing a hypothesis test to determine if the data provide significant evidence to indicate that there is a difference in the mean pupillometer readings for the two patterns. Do not perform the hypothesis test, only calculate the value of the test statistic.
In the table below, we provide an example of student performance on this exam question. Recall that results are presented for fifteen student responses.
Example Exam Results:
Correctly Identified Sampling Scheme for Scenario 1 | Correctly Identified Sampling Scheme for Scenario 2 | Correctly Identified Test Statistic for Scenario 1 | Correctly Identified Test Statistic for Scenario 2 |
---|---|---|---|
15/15 | 15/15 | 9/15 | 8/15 |
The results are quite mixed. A very positive aspect of the results is that every one of the students was able to correctly classify the two sampling scenarios as either independent or dependent. However, a negative aspect of the results is that some students have difficulty identifying the appropriate hypothesis testing procedure to apply, after they have categorized samples as being dependent or independent. We believe that the biggest obstacle that prevents some students from correctly identifying the paired-difference test statistic formula is that, even though they realize they are dealing with dependent samples, they still focus on the fact that they have two samples. They are unable to form differences within the pairs as a first step and then proceed to apply the correct test statistic formula to the differences. Instead, they apply the two independent samples formula.
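The distinction between the two test statistics can be made concrete with the exam data. The sketch below is an illustration only: it assumes the pooled-variance form of the two-sample t statistic for Scenario 2 (the exam question does not say which independent-samples formula is expected), and the function names are hypothetical.

```python
import math
import statistics

def paired_t(x, y):
    """t statistic for paired samples: form differences first, then one-sample t."""
    d = [a - b for a, b in zip(x, y)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def pooled_t(x, y):
    """Two independent samples t statistic with a pooled variance (assumed form)."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x)
           + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Scenario 1: the same five consumers see both patterns (paired).
s1_pattern1 = [1.00, 0.95, 1.45, 1.20, 0.75]
s1_pattern2 = [0.80, 0.65, 1.25, 1.00, 0.80]

# Scenario 2: ten different consumers, five per pattern (independent).
s2_pattern1 = [1.10, 0.90, 1.40, 1.25, 0.85]
s2_pattern2 = [1.00, 0.75, 1.25, 1.00, 0.90]

t_scenario1 = paired_t(s1_pattern1, s1_pattern2)
t_scenario2 = pooled_t(s2_pattern1, s2_pattern2)
# The common mistake: treating the paired data as two independent samples.
t_scenario1_wrong = pooled_t(s1_pattern1, s1_pattern2)

print(f"Scenario 1 (paired):          t = {t_scenario1:.2f}")
print(f"Scenario 2 (independent):     t = {t_scenario2:.2f}")
print(f"Scenario 1, pairing ignored:  t = {t_scenario1_wrong:.2f}")
```

Comparing the first and third printed values shows how much information is lost when the differences within pairs are not formed first.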
It is important that students recognize when to use the paired test statistic and when to use the independent samples test statistic. However, it is also important for students to check the condition needed to use the intended probability distribution for the test statistic. This condition is that the distribution of the x-bar values follows a normal curve (or is at least quite close to a normal curve). The operational check for this is either to observe that the original values are essentially normally distributed or that the sample size is reasonably large. Students should be aware that the probability distribution called for in a hypothesis test or confidence interval is not always the correct description of the probabilities involved; certain requirements need to be satisfied in order to appeal to that distribution. This is particularly true for the normal, t, chi-squared, and F distributions. Therefore, we dedicate the subsequent section to exploring how close the raw data are to being normally distributed.
(Brand 1, Brand 2) Flavor Duration in Minutes (42, 8) (13, 7) ( 4, 5) (49,22) (11, 7) (16, 7) ( 9, 5) ( 7, 4) (38,25) (41,16) (22,17) ( 9,23) (37,23) (14, 8) (10,15) (16,16) (18, 7) ( 3, 6) (48,50) (12, 7) (34,29) (37,16) (31,15) (13, 9) ( 6, 6)
Figure 3. Example Class Data for Spearmint Stick Gum
Students are asked to explain why the two samples are not independent. Students calculate the differences in the number of minutes the flavor lasted for the two gums (Brand 1 minus Brand 2) and enter the differences into the appropriate column on the Data Sheet.
Students calculate the mean, median, and quartiles for the differences and use these calculations to help determine if it can be assumed that the distribution of the differences is a normal distribution. Students construct a stem-and-leaf plot of the differences and check for non-normal features. The mean and median differences are compared, as are the distances from the quartiles to the median. Students use a statistical software package to construct a Q-Q plot of the differences. Students write a summary paragraph to explain whether they believe it can be assumed that the distribution of the differences is a normal distribution. Figure 4 displays results for assessing the normality of the example class differences for the spearmint flavored stick gum.
Brand 1 - Brand 2 Stem-and-Leaf Plot
Stem & Leaf
-1|4
-0|5
-0|123
0|003444
0|555669
1|134
1|6
2|1
2|57
3|4
Stem width: 10.00
Each leaf: 1 case(s)
mean = 7.48, standard deviation = 10.83, min = -14.00, first quartile = 0.00, median = 5.00, third quartile = 13.50, max = 34.00
Figure 4. Assessing Normality for the Example Class Differences
For the example class differences, the stem-and-leaf plot shows a slightly right skewed distribution. A mean of 7.48 minutes compared to a median of 5.00 minutes also indicates that the distribution is right skewed. The first quartile is 5.00 minutes below the median, while the third quartile is 8.50 minutes above the median, indicating a right skew. However, the Q-Q plot does not show a marked departure from linearity.
The results of the normality assessment will depend on the brands and flavors of gum used. This extension can be used irrespective of the conclusion reached concerning normality.
Students construct a boxplot to display the distribution of the differences. Based on the boxplot, students explain whether they believe that the flavor duration differs for the two gum brands. Figure 5 shows a boxplot of the differences for the spearmint flavored gum. The circle on the boxplot indicates an outlying difference value. An outlier is defined as a difference value that is more than 1.5 times the IQR beyond Quartile 1 or Quartile 3 (where IQR = Quartile 3 - Quartile 1).
Figure 5. Boxplot of Example Class Differences (differences are formed by: Brand 1 - Brand 2)
From the boxplot we see that roughly 75% of the differences are positive, indicating that the flavor duration of the Brand 1 gum appears to last longer than that of Brand 2.
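The summary statistics and the 1.5 × IQR outlier rule for the spearmint differences can be verified with the following Python sketch. Quartile conventions differ between software packages; the `method="exclusive"` option of `statistics.quantiles` is chosen here as an assumption because it matches the quartiles reported for these data.

```python
import statistics

# (Brand 1, Brand 2) flavor durations in minutes, from Figure 3.
pairs = [
    (42, 8), (13, 7), (4, 5), (49, 22), (11, 7), (16, 7), (9, 5),
    (7, 4), (38, 25), (41, 16), (22, 17), (9, 23), (37, 23), (14, 8),
    (10, 15), (16, 16), (18, 7), (3, 6), (48, 50), (12, 7), (34, 29),
    (37, 16), (31, 15), (13, 9), (6, 6),
]
diffs = sorted(b1 - b2 for b1, b2 in pairs)

mean_d = statistics.mean(diffs)
q1, median_d, q3 = statistics.quantiles(diffs, n=4, method="exclusive")
iqr = q3 - q1

# Fences for the 1.5 * IQR rule used to flag outliers on the boxplot.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [d for d in diffs if d < lower_fence or d > upper_fence]

# Informal skewness checks: mean vs. median and quartile distances.
dist_below = median_d - q1
dist_above = q3 - median_d
share_positive = sum(d > 0 for d in diffs) / len(diffs)

print(f"mean = {mean_d:.2f}, Q1 = {q1}, median = {median_d}, Q3 = {q3}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
print(f"share of positive differences = {share_positive:.0%}")
```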
Students conduct an appropriate statistical hypothesis test to determine if there is a significant difference in the flavor duration for Brand 1 and Brand 2 gums. The hypothesis testing procedure that is applied will depend on whether the distribution of differences is judged to be non-normal.
For the example class data, since the normality checks indicate that it may not be safe to assume that the population of differences is normal, students might apply the Wilcoxon signed rank test. For these data, the Wilcoxon signed rank test statistic produces a P-value of 0.001, well below any reasonable level of significance. It is therefore concluded that there is a significant difference in the typical flavor duration of the two gum brands. Students give a practical interpretation of this P-value.
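For readers who want to see the mechanics of the signed rank test, the sketch below works through the spearmint differences using only the Python standard library. Several choices here are assumptions of the illustration, since the text does not state how the software computed the P-value: zero differences are dropped, tied absolute differences receive average ranks, and the P-value uses the normal approximation with a tie correction and no continuity correction.

```python
import math
from statistics import NormalDist

# (Brand 1, Brand 2) flavor durations in minutes, from Figure 3.
pairs = [
    (42, 8), (13, 7), (4, 5), (49, 22), (11, 7), (16, 7), (9, 5),
    (7, 4), (38, 25), (41, 16), (22, 17), (9, 23), (37, 23), (14, 8),
    (10, 15), (16, 16), (18, 7), (3, 6), (48, 50), (12, 7), (34, 29),
    (37, 16), (31, 15), (13, 9), (6, 6),
]
diffs = [b1 - b2 for b1, b2 in pairs if b1 != b2]  # drop zero differences
n = len(diffs)

# Rank the absolute differences, giving tied values their average rank.
abs_sorted = sorted(abs(d) for d in diffs)
rank_of = {}
i = 0
while i < n:
    j = i
    while j < n and abs_sorted[j] == abs_sorted[i]:
        j += 1
    rank_of[abs_sorted[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
    i = j

w_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
w_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)

# Normal approximation with tie correction (assumed; no continuity correction).
mean_w = n * (n + 1) / 4
tie_sizes = [abs_sorted.count(v) for v in set(abs_sorted)]
var_w = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
z = (min(w_plus, w_minus) - mean_w) / math.sqrt(var_w)
p_value = 2 * NormalDist().cdf(z)

print(f"n = {n}, W+ = {w_plus}, W- = {w_minus}, z = {z:.2f}, P = {p_value:.3f}")
```

Under these assumptions the sketch gives a two-sided P-value of about 0.001, in line with the value reported above; an exact test or a continuity correction would change the later decimal places.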
40c | 40 | 22 | 35 | 30 |
35 | 7 | 40c | 40c | 32 |
40c | 40 | 20 | 20 | 27 |
40c | 40c | 7 | 40c | 31 |
40c | 30 | 13 | 40c | 40c |
Figure 6. Example Censored Class Data for Sugar-free Cinnamon Stick Gum (Brand 1)
During the next classroom period, we give students the flavor longevity data values, using a ‘40c’ to indicate censoring at 40 minutes. We introduce the concept of censored data and note that many of the longevity values are censored. We ask students to view the flavor longevity values as the survival lengths of experimental units in a study. We discuss with students how the censoring might affect the analysis of the data, introduce the concept of a survival function, and discuss the interpretation of a Kaplan-Meier life table and survival function. Students are given a Worksheet (see Appendix C.1) that must be completed as homework. The Worksheet formally introduces students to introductory terminology pertaining to censoring and survival analysis.
Students enter the data into a statistical software package and generate the Kaplan-Meier life table. Students generate a plot of the Kaplan-Meier survival function. In Figure 7, we have included the Kaplan-Meier life table and survival function for the example class data.
Survival Analysis for Brand 1 Flavor Longevity

Time | Status | Cumulative Survival | Standard Error | Cumulative Events | Number Remaining |
---|---|---|---|---|---|
7.00 | Not Censored | | | 1 | 24 |
7.00 | Not Censored | .9200 | .0543 | 2 | 23 |
13.00 | Not Censored | .8800 | .0650 | 3 | 22 |
20.00 | Not Censored | | | 4 | 21 |
20.00 | Not Censored | .8000 | .0800 | 5 | 20 |
22.00 | Not Censored | .7600 | .0854 | 6 | 19 |
27.00 | Not Censored | .7200 | .0898 | 7 | 18 |
30.00 | Not Censored | | | 8 | 17 |
30.00 | Not Censored | .6400 | .0960 | 9 | 16 |
31.00 | Not Censored | .6000 | .0980 | 10 | 15 |
32.00 | Not Censored | .5600 | .0993 | 11 | 14 |
35.00 | Not Censored | | | 12 | 13 |
35.00 | Not Censored | .4800 | .0999 | 13 | 12 |
40.00 | Not Censored | | | 14 | 11 |
40.00 | Not Censored | .4000 | .0980 | 15 | 10 |
40.00 | Censored | | | 15 | 9 |
40.00 | Censored | | | 15 | 8 |
40.00 | Censored | | | 15 | 7 |
40.00 | Censored | | | 15 | 6 |
40.00 | Censored | | | 15 | 5 |
40.00 | Censored | | | 15 | 4 |
40.00 | Censored | | | 15 | 3 |
40.00 | Censored | | | 15 | 2 |
40.00 | Censored | | | 15 | 1 |
40.00 | Censored | | | 15 | 0 |

Number of Cases: 25 Censored: 10 (40.00%) Events: 15
Figure 7. Survival Table and Survival Function for Example Censored Class Data
Students interpret the survival table and the plot of the survival function. An estimated 75% of the chewers consider their gum to still have flavor at 25 minutes. However, by 35 minutes, this percentage drops to approximately 50%. Approximately 40% of the chewers consider the gum to maintain flavor for at least 40 minutes.
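The cumulative survival column of the life table can be reproduced by hand with the product-limit calculation. The Python sketch below is a minimal illustration for the Brand 1 data in Figure 6; it assumes the usual convention that events at a given time are processed before censorings at that time, and it assumes Greenwood's formula for the standard errors. The function name `kaplan_meier` is hypothetical.

```python
import math
from collections import Counter

# Brand 1 data from Figure 6: 15 observed flavor-loss times and
# 10 values censored at 40 minutes (the '40c' entries).
event_times = [40, 22, 35, 30, 35, 7, 32, 40, 20, 20, 27, 7, 31, 30, 13]
censor_times = [40] * 10

def kaplan_meier(event_times, censor_times):
    """Return rows (time, number at risk, events, survival, std. error)."""
    events = Counter(event_times)
    censored = Counter(censor_times)
    at_risk = len(event_times) + len(censor_times)
    survival, greenwood_sum, rows = 1.0, 0.0, []
    for t in sorted(set(events) | set(censored)):
        d = events.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= (at_risk - d) / at_risk
            if at_risk > d:  # guard against division by zero
                greenwood_sum += d / (at_risk * (at_risk - d))
            std_err = survival * math.sqrt(greenwood_sum)
            rows.append((t, at_risk, d, survival, std_err))
        # Events are removed first, then units censored at the same time.
        at_risk -= d + censored.get(t, 0)
    return rows

for t, n_risk, d, s, se in kaplan_meier(event_times, censor_times):
    print(f"t = {t:>2}  at risk = {n_risk:>2}  events = {d}  "
          f"S(t) = {s:.4f}  SE = {se:.4f}")
```

Each printed row corresponds to the last line for that time in the life table, where the software reports the survival estimate after all tied events have been counted.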
We discuss with students that our task is to determine which of two treatments is more effective in prolonging the survival time of the experimental units. We note that we are assuming that the two treatment groups (the chewers of Brands 1 and 2 gums) are both representative of their respective populations. We give students the data, along with a Worksheet (see Appendix C.2) that must be completed as homework. In Figure 8, we have included example class datasets. For these example datasets, our class size is 25 students for Brand 1 and 18 students for Brand 2. The results are for two brands of sugar-free cinnamon flavored stick gum.
Brand 1 (n=25) | | | | | | Brand 2 (n=18) | | | |
---|---|---|---|---|---|---|---|---|---|
40c | 40 | 22 | 35 | 30 | | 8 | 16 | 20 | 28 |
35 | 7 | 40c | 40c | 32 | | 40c | 22 | 40 | 25 |
40c | 40 | 20 | 20 | 27 | | 17 | 26 | 40c | 30 |
40c | 40c | 7 | 40c | 31 | | 35 | 40c | 35 | |
40c | 30 | 13 | 40c | 40c | | 40c | 30 | 28 | |
Figure 8. Example Censored Class Data for two Brands of Sugar-free Cinnamon Stick Gum
In extension 2A, students used a statistical software package to construct the Kaplan-Meier life table and survival function for the Brand 1 flavor duration values. Students use a statistical software package to construct the Kaplan-Meier life table for the Brand 2 flavor duration values. Students plot Kaplan-Meier survival functions for the flavor duration values of both gum brands on the same graph. In Figure 9, we include a plot of the Kaplan-Meier survival functions for the example class data.
Figure 9. Survival Functions for Example Censored Class Data
Students compare the overall survival rates for the two brands and examine the plot of the survival functions to determine if it appears that one of the brands tends to have longer flavor duration values than the other brand. At 10 minutes, the percentage of chewers still detecting flavor in the gum is approximately 95% for Brand 2 and 90% for Brand 1. At 15 minutes, the percentage still detecting flavor drops to 88% for Brand 1 and stays at 95% for Brand 2. However, at 20 and 25 minutes, the percentages still detecting flavor are higher for Brand 1, by approximately 2% and 8%, respectively. The median flavor duration (the time at which the estimated survival function reaches 50%) is approximately 28 minutes for Brand 2 and approximately 35 minutes for Brand 1. The percentage still detecting flavor at 40 minutes and beyond is approximately 40% for Brand 1 and only 22% for Brand 2.
Students use a statistical software package to perform a hypothesis test to determine if there is a significant difference in the flavor duration values for the two brands. Figure 10 shows test statistics and P-values for the Log Rank and Wilcoxon (Breslow’s generalized Wilcoxon) test procedures.
Test | Statistic | df | Significance |
---|---|---|---|
Log Rank | 1.74 | 1 | 0.1866 |
Breslow | 1.45 | 1 | 0.2289 |
Figure 10. Test Statistics and P-Values for Determining a Significant Difference Between the Two Survival Rates
The P-values for the test procedures are both above any reasonable level of significance. Thus, we cannot conclude that there is a significant difference in the flavor duration values for the two brands.
We explain to students that it is desired to perform an experiment that involves a quantitative response variable and at least two qualitative attributes, all of which are related to gum. Students are divided into groups and asked to brainstorm potential variables of interest. The various ideas are collected via the whiteboard and the merits of each are discussed. Potential qualitative variables include flavor of gum, sugar vs. sugar-free, brand of gum, and type of piece of gum (i.e., stick vs. tablet). Potential quantitative variables include a rating of the gum flavor, the flavor intensity, the texture of the gum, and the length of flavor duration.
For this example, the independent variables chosen were Flavor of gum (spearmint vs. winterfresh) and Piece of gum (stick vs. tablet) and the response variable was the Texture of gum (using a rating scale where 0 = very soft, pliable and 10 = very hard, rubbery). To reduce confounding, all four gums were manufactured by the same company and were all sugar-free.
Assuming a class size of 30 students, the instructor will need 120 small paper cups, 120 pieces of chewing gum (30 sticks of each of two flavors: spearmint and winterfresh and 30 tablets of each of two flavors: spearmint and winterfresh), three sets of playing cards, 120 sticky notes (preferably 30 of four different colors), 30 index cards, and a set of plastic gloves (for sanitary purposes when unwrapping gum and placing the pieces in a cup). In addition, use of a statistical software package is required.
During a class period before the data collection begins, each student is given an index card and asked to write his or her name on each of the four sticky notes. Then each student is given a set of cards and asked to shuffle the cards. When the top card is flipped over, the students are told to write the name of the suit on the first sticky note. This process is repeated using the remaining cards and sticky notes. After class, the sticky notes are removed from the index cards and each is affixed to a small paper cup. A piece of unwrapped gum, corresponding to the suit on the sticky note, is placed in a cup and the cup is stapled shut.
The instructor assigns a treatment number to each of the four combinations of Flavor and Piece and on the first day of data collection, those cups with the color of sticky notes corresponding to Treatment One are taken to class. A Ratings Sheet is passed out (see Appendix D), as are the cups. Each student is asked to claim the cup with his or her name on it and to record his or her name on the Ratings Sheet, along with the suit indicated on the sticky note. All students begin chewing at the same time. Students are told to record a texture score on the data collection sheet at the time that the chewed gum is discarded. This process is repeated on the subsequent three class periods. The color of sticky notes is useful in helping to determine which gum to chew in the event that a student is absent on a data collection day and has more than one cup in the set of cups being passed around. In Figure 11, we have included an example class dataset.
Tablet/Spearmint: | 4, 1, 2, 4, 2, 4, 2, 7, 3, 1, 3, 4, 0, 2, 2, 4, 3, 6, 2, 1, 0 |
Tablet/Winterfresh: | 2, 3, 1, 5, 3, 1, 4, 4, 1, 3, 1, 1, 1, 0 |
Stick/Spearmint: | 5, 6, 7, 7, 7, 3, 7, 5, 7, 5, 7, 2, 6, 2, 10, 6, 3, 5, 6, 2, 4, 4 |
Stick/Winterfresh: | 3, 3, 5, 7, 7, 6, 6, 7, 8, 7, 4, 3, 1, 5, 3, 8, 6, 3, 4, 7, 2, 6, 1 |
Figure 11. Example Class Texture Scores for Piece (Tablet vs. Stick) and Flavor (Spearmint vs. Winterfresh) Two-Way ANOVA
We ask students to input the class data into a statistical software package and perform a two-way analysis of variance using the response variable Texture and the main effects of Flavor and Piece of gum. Students are instructed to prepare a report of their analysis and are asked to address several questions in their report (see Appendix E for the questions).
For each treatment, students are asked to generate the mean and standard deviation of the Texture scores. In Table 1, we have included these descriptive statistics for the four treatments.
PIECE | FLAVOR | Variable | N | Mean | Std. Deviation |
---|---|---|---|---|---|
Tablet | Spearmint | TEXTURE | 21 | 2.71 | 1.793 |
Tablet | Winterfresh | TEXTURE | 22 | 2.41 | 1.563 |
Stick | Spearmint | TEXTURE | 22 | 5.27 | 2.051 |
Stick | Winterfresh | TEXTURE | 23 | 4.87 | 2.181 |
Students are asked to generate an ANOVA summary table and make a conclusion regarding whether Flavor of gum and Piece of gum interact to affect the mean Texture score. Students are also asked to generate an interaction plot and discuss whether the conclusion regarding interaction agrees with what the plot shows. The profile plot in Figure 12 shows the absence of interaction between the independent variables Piece and Flavor, which is consistent with the information given in Table 2 (F = 0.01, P = 0.91).
Dependent Variable: TEXTURE
Source | Type III Sum of Squares | df | Mean Square | F | P-value |
---|---|---|---|---|---|
Model | 1439.424 | 4 | 359.856 | 97.959 | 0.000 |
PIECE | 138.399 | 1 | 138.399 | 37.675 | 0.000 |
FLAVOR | 2.757 | 1 | 2.757 | 0.750 | 0.389 |
PIECE*FLAVOR | 0.053 | 1 | 0.053 | 0.014 | 0.905 |
Error | 308.576 | 84 | 3.674 | | |
Total | 1748.000 | 88 | | | |
R Squared = 0.823 (Adjusted R Squared = 0.815)
Figure 12. Profile Plot of Piece (Tablet vs. Stick) and Flavor (Spearmint vs. Winterfresh) for the Response Variable Texture.
In the absence of interaction, students are asked to use the appropriate generated P-values to make conclusions regarding whether Flavor or Piece of gum are significant. Students are asked to compare means corresponding to the levels of the significant factors.
Table 2 shows that Flavor is not significant with respect to Texture (F = 0.75, P = 0.39), whereas Piece is significant (F = 37.68, P < 0.001), with the mean texture for stick gum (5.07) significantly higher than the mean texture for tablet gum (2.56) (see Table 3 and Table 4).
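As a quick arithmetic check on the ANOVA summary, each F statistic is the effect mean square divided by the error mean square. The Python sketch below recomputes these ratios from the sums of squares and degrees of freedom printed in Table 2; the P-values are not recomputed because that would require an F distribution routine outside the standard library.

```python
# Sums of squares and degrees of freedom as printed in Table 2.
effects = {
    "PIECE": (138.399, 1),
    "FLAVOR": (2.757, 1),
    "PIECE*FLAVOR": (0.053, 1),
}
ss_error, df_error = 308.576, 84
ss_total = 1748.000  # uncorrected total, as reported by the software

# Mean square = sum of squares / df; F = effect mean square / error mean square.
ms_error = ss_error / df_error
f_ratios = {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}

# R squared as reported: 1 - SS(Error) / SS(Total).
r_squared = 1 - ss_error / ss_total

print(f"MS(Error) = {ms_error:.3f}")
for name, f in f_ratios.items():
    print(f"F({name}) = {f:.3f}")
print(f"R squared = {r_squared:.3f}")
```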
FLAVOR | Variable | N | Mean | Std. Deviation |
---|---|---|---|---|
Spearmint | TEXTURE | 43 | 4.02 | 2.304 |
Winterfresh | TEXTURE | 45 | 3.67 | 2.256 |

PIECE | Variable | N | Mean | Std. Deviation |
---|---|---|---|---|
Tablet | TEXTURE | 43 | 2.56 | 1.666 |
Stick | TEXTURE | 45 | 5.07 | 2.104 |
At this point we ask the students to step back from these conclusions and consider the statistical assumptions that validate the inference made from the ANOVA table. We remind the students that there are three assumptions. The first is that values from the populations for each cell come from a normal distribution, second that those distributions have a common value for their variances, and third that the responses are independent. Students can investigate the first two assumptions by observing a Q-Q plot of the residuals (which for brevity we skip) and the summary statistics presented in Table 1 respectively. The standard deviations are fairly close; assuming equal variances is not unreasonable. Independence is the assumption in clear violation and can be shown to students by considering the following question:
Given their answers to this question, students are asked to discuss what concerns they may have about using a two-way analysis of variance design for the gum texture ratings data. In addition, they are asked to suggest possible alterations to the experimental design to correct for this concern. Readers may have noted that the experiment discussed in this section is not a true completely randomized design. Rather, the design can be viewed as a randomized block design with each student as a block.
After discussing the results for a two-way analysis of variance, we ask students to reformat the class data and use a statistical software package to run a randomized block design (blocking on student) using the response variable Texture and the main effects of Flavor and Piece of gum. Students are instructed to prepare a report of their analysis and are asked to address questions [2], [3], and [4] from the Appendix E questions sheet.
In Figure 13, we modify the example class texture data to include fixed blocking. The data contain missing values, so we include only those cases for which there are complete data (n = 20); dealing with missing data is beyond the scope of this activity.
Student (Block) | Stick/Spearmint (Treatment 1) | Tablet/Spearmint (Treatment 2) | Tablet/Winterfresh (Treatment 3) | Stick/Winterfresh (Treatment 4) |
---|---|---|---|---|
1 | 5 | 4 | 4 | 3 |
2 | 6 | 1 | 1 | 3 |
3 | 7 | 4 | 6 | 7 |
4 | 7 | 2 | 2 | 7 |
5 | 3 | 4 | 3 | 6 |
6 | 7 | 2 | 1 | 6 |
7 | 5 | 7 | 3 | 7 |
8 | 5 | 3 | 2 | 7 |
9 | 7 | 1 | 1 | 3 |
10 | 2 | 3 | 5 | 1 |
11 | 6 | 4 | 3 | 5 |
12 | 2 | 0 | 1 | 3 |
13 | 10 | 2 | 4 | 8 |
14 | 6 | 2 | 4 | 6 |
15 | 3 | 4 | 1 | 3 |
16 | 5 | 3 | 3 | 4 |
17 | 6 | 6 | 1 | 7 |
18 | 2 | 2 | 1 | 2 |
19 | 4 | 1 | 1 | 6 |
20 | 4 | 0 | 0 | 1 |
Figure 13. Example Class Texture Scores for Randomized Block Design
Table 5 shows that there is no significant interaction between the independent variables Piece and Flavor (F = 0.005, P = 0.94). Flavor is not significant with respect to Texture (F = 1.11, P = 0.30), whereas Piece is significant (F = 44.47, P < 0.001), with the mean texture for the stick gums significantly higher than the mean texture for the tablet gums. Table 6 displays estimates for individual treatment means.
Dependent Variable: TEXTURE
Source | Type III Sum of Squares | df | Mean Square | F | P-value |
---|---|---|---|---|---|
Model | 1378.388 | 23 | 59.930 | 23.622 | 0.000 |
PIECE | 112.812 | 1 | 112.812 | 44.466 | 0.000 |
FLAVOR | 2.813 | 1 | 2.813 | 1.109 | 0.297 |
PIECE*FLAVOR | 0.013 | 1 | 0.013 | 0.005 | 0.944 |
SUBJECT | 145.237 | 19 | 7.644 | 3.013 | 0.001 |
Error | 144.613 | 57 | 2.537 | | |
Total | 1523.000 | 80 | | | |
R Squared = 0.905 (Adjusted R Squared = 0.867)
Dependent Variable: TEXTURE
Treatment | Mean | Std. Error | 95% Confidence Interval Lower Bound | 95% Confidence Interval Upper Bound |
---|---|---|---|---|
Stick Spearmint | 5.100 | 0.356 | 4.387 | 5.813 |
Tablet Spearmint | 2.750 | 0.356 | 2.037 | 3.463 |
Tablet Winterfresh | 2.350 | 0.356 | 1.637 | 3.063 |
Stick Winterfresh | 4.750 | 0.356 | 4.037 | 5.463 |
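For readers who want to see where the randomized block results come from, the following Python sketch recomputes the sums of squares, F ratios, treatment means, and standard error directly from the Figure 13 scores. It is a hand computation for this balanced 2 x 2 layout with one observation per student per treatment, not the output of the statistical package used in class. The function name `rbd_anova` and its return layout are our own choices. With 20 students, the SUBJECT (block) term carries 20 - 1 = 19 degrees of freedom and the error term carries 3 x 19 = 57.

```python
import math

# Texture scores from Figure 13, one tuple per student (block). Column order:
# Stick/Spearmint, Tablet/Spearmint, Tablet/Winterfresh, Stick/Winterfresh.
SCORES = [
    (5, 4, 4, 3), (6, 1, 1, 3), (7, 4, 6, 7), (7, 2, 2, 7), (3, 4, 3, 6),
    (7, 2, 1, 6), (5, 7, 3, 7), (5, 3, 2, 7), (7, 1, 1, 3), (2, 3, 5, 1),
    (6, 4, 3, 5), (2, 0, 1, 3), (10, 2, 4, 8), (6, 2, 4, 6), (3, 4, 1, 3),
    (5, 3, 3, 4), (6, 6, 1, 7), (2, 2, 1, 2), (4, 1, 1, 6), (4, 0, 0, 1),
]


def rbd_anova(scores):
    """ANOVA quantities for a 2x2 treatment layout blocked on student."""
    b = len(scores)                      # number of blocks (students)
    n = 4 * b                            # total number of observations
    grand = sum(sum(row) for row in scores) / n
    m = [sum(row[j] for row in scores) / b for j in range(4)]  # treatment means
    # Each contrast uses coefficients of +1 or -1 on the four treatment means.
    contrasts = {
        "PIECE": m[0] - m[1] - m[2] + m[3],         # stick minus tablet
        "FLAVOR": m[0] + m[1] - m[2] - m[3],        # spearmint minus winterfresh
        "PIECE*FLAVOR": m[0] - m[1] + m[2] - m[3],
    }
    ss = {name: b * c ** 2 / 4 for name, c in contrasts.items()}
    df = {name: 1 for name in contrasts}
    ss["SUBJECT"] = 4 * sum((sum(row) / 4 - grand) ** 2 for row in scores)
    df["SUBJECT"] = b - 1
    ss_total = sum((y - grand) ** 2 for row in scores for y in row)
    ss_error = ss_total - sum(ss.values())
    df_error = 3 * (b - 1)
    mse = ss_error / df_error
    table = {k: (ss[k], df[k], (ss[k] / df[k]) / mse) for k in ss}
    std_error = math.sqrt(mse / b)       # standard error of a treatment mean
    return table, ss_error, df_error, m, std_error


table, ss_error, df_error, means, std_error = rbd_anova(SCORES)
for source, (ss_value, df_value, f_value) in table.items():
    print(f"{source:13s} SS = {ss_value:8.3f}  df = {df_value:2d}  F = {f_value:7.3f}")
print(f"{'Error':13s} SS = {ss_error:8.3f}  df = {df_error:2d}")
print("Treatment means:", [round(x, 3) for x in means])
```

Comparing the printed rows with Table 5 and Table 6 lets students confirm that blocking on student removes the between-student variation from the error term, which is why the F statistic for Piece is larger here than in the two-way analysis.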
This experiment was conducted in the graduate course. In response to the multiple response evaluation item “The following teaching strategy assisted in my learning: class experiments,” three students “agreed” and seven “strongly agreed.”
Students at all levels enjoy participating in this activity and develop an interest in analyzing the data to determine whether there is a significant difference in the flavor duration of the brands (or types) of chewing gum.
The activity provides introductory students with a concrete example of paired or matched samples. In an intermediate course, the activity provides a paired dataset for which students must determine the appropriate statistical procedure to apply. Altering the data collection scheme allows the instructor to use the activity to introduce a basic analysis of right-censored data and to discuss the comparison of survival rates. Introducing different gum types and flavors gives the instructor an opportunity to discuss principles of experimental design and allows students to interactively generate a dataset that can be analyzed using a two-way analysis of variance or a randomized block design.
One obvious conclusion from this paper is that the gum experiment looks promising for helping beginning, non-mathematical college students understand the difference between paired and independent data. In addition, the later activities look promising given the enthusiastic response in the graduate course. Now hopefully at least 4 out of 5 readers will give these activities, or modifications of them, a try in class!
A.1 Data Sheet
Student | Brand 1 Flavor Duration (minutes) | Brand 2 Flavor Duration (minutes) | Difference in Flavor Duration (Brand 1 - Brand 2) |
---|---|---|---|
1 | |||
2 | |||
3 | |||
4 | |||
5 | |||
6 | |||
7 | |||
8 | |||
9 | |||
10 | |||
11 | |||
12 | |||
13 | |||
14 | |||
15 | |||
16 | |||
17 | |||
18 | |||
19 | |||
20 | |||
21 | |||
22 | |||
23 | |||
24 | |||
25 | |||
26 | |||
27 | |||
28 | |||
29 | |||
30 |
A.2 Worksheet
Which Gum Lasts Longer?
Background: (Background taken from: Brandt (2001) “FORMULATION CHALLENGE: CONFECTIONERY - A STICKY Situation,” found online at: www.preparedfoods.com/CDA/ArticleInformation/features/BNP__Features__Item/0,1231,114008,00.html.) “Gum chewing dates back to ancient civilizations. Ancient Greeks chewed mastic tree resin, ancient Central American Mayans chewed chicle, and American Indians chewed gum made from spruce tree resin. This gum was eventually replaced by paraffin wax gum. Today’s chewing gums are made mostly of synthetic materials.”
“Long-lasting flavor is one of the ‘holy grails’ of the chewing gum industry. For most chewing gums today, flavor lasts about 12 to 13 minutes as a standard ...”
Problem: We want to determine if there is a significant difference in the mean number of minutes it takes for two different brands of chewing gum to lose their flavor.
Instructions: The Data Sheet contains the gum data that was collected in class (the length of time, in minutes, that the flavor lasted for Brand 1 and Brand 2 gums).
min = _____ quartile 1 = _____ median = _____ quartile 3 = _____ max = _____
Boxplot:
-40 -30 -20 -10 0 10 20 30 40
Based on the boxplot, would you conclude that there is a difference in the number of minutes that flavor is retained for Brand 1 and Brand 2 gums? Explain.
Let $\mu_D$ = the mean difference in the number of minutes that flavor is retained (Brand 1 minus Brand 2).
To test $H_0: \mu_D = D_0$, based on a simple random sample of $n_D$ differences from the population, we use the test statistic $t = \dfrac{\bar{x}_D - D_0}{s_D / \sqrt{n_D}}$, where $D_0$ is the null hypothesized difference, $\bar{x}_D$ is the mean of the sample differences, $s_D$ is the standard deviation of the sample differences, and $n_D - 1$ degrees of freedom are used for the test.
Perform a hypothesis test to determine if there is a significant difference in the mean number of minutes that the flavor lasts for Brand 1 and Brand 2 gums.
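Instructors who want to check student hand calculations may find a short script useful. The Python sketch below computes the paired t statistic as the mean of the sample differences minus the hypothesized difference, divided by the standard deviation of the differences over the square root of the number of pairs. The function name `paired_t` is ours, and the five pairs of flavor durations are hypothetical values made up for illustration; they are not class data.

```python
import math
import statistics


def paired_t(brand1, brand2, d0=0.0):
    """Return (t, df) for the paired difference test of H0: mu_D = d0."""
    diffs = [x - y for x, y in zip(brand1, brand2)]   # Brand 1 minus Brand 2
    n_d = len(diffs)
    mean_d = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)                     # sample standard deviation
    t = (mean_d - d0) / (s_d / math.sqrt(n_d))
    return t, n_d - 1


# Hypothetical flavor durations in minutes (NOT the class data).
brand1 = [30, 25, 40, 18, 22]
brand2 = [20, 27, 31, 15, 12]
t_stat, df = paired_t(brand1, brand2)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

The P-value would then be read from a t table or from statistical software using the returned degrees of freedom.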
Which Gum Lasts Longer?
Problem:
We want to determine if there is a significant difference in the typical flavor duration, in minutes, for two different brands of chewing gum.
Instructions:
The Data Sheet contains the gum data that was collected in class (the length of time, in minutes, that the flavor lasted for Brand 1 and Brand 2 gums).
min = _____ quartile 1 = _____ median = _____ quartile 3 = _____ max = _____ mean = _____
min = _____ quartile 1 = _____ median = _____ quartile 3 = _____ max = _____
Boxplot:
-40 -30 -20 -10 0 10 20 30 40
Based on the box plot, would you conclude that there is a difference in the number of minutes that flavor is retained for Brand 1 and Brand 2 gums? Explain.
calculated value of test statistic =
C.1 One Sample
How Long does the Gum Last?
Problem:
We want to determine the flavor duration, in minutes, for a certain brand of chewing gum. However, some of our data values have been censored at 40 minutes. How can we analyze this data?
Background: (Collett (1996); Lang and Secic (1997))
In analyzing the gum flavor duration data, we are analyzing the times to an event (the event that the gum flavor has expired). Survival analysis is a common application of time-to-event analysis. Estimates can be obtained of the probability of survival (the event does not occur) as a function of time from a starting point. Any event occurring at the end of some time interval, such as the death of a medical patient, or the failure of a part in a piece of equipment, can be viewed as the event in a survival analysis.
Survival analysis requires special statistical methods due to the fact that the event of interest may not yet have occurred when the data analysis is performed. When data are collected on occurrence times for an event of interest and the data includes events that have not yet occurred, these data are said to be right-censored. Survival analysis methodology can incorporate right-censored data. Note that the gum flavor duration values are right-censored (at 40 minutes).
One way to summarize survival data is through estimates of the survival function. The survival function, S(t), is the probability that the survival time is greater than or equal to time t. That is, S(t) = P(survival time ≥ t).
The Kaplan-Meier procedure is a nonparametric (distribution-free) method of estimating survival rates at each point in time. Kaplan-Meier is said to be nonparametric since it does not require specific assumptions to be made about the underlying distribution of the survival times.
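To make the procedure concrete, here is a minimal Python sketch of the Kaplan-Meier product-limit calculation. At each distinct time at which an event is observed, the running estimate is multiplied by one minus the number of events at that time divided by the number of units still at risk just before it. Censored values stay in the risk set up to their censoring time but never count as events. The value reported at each event time is the estimated probability of lasting beyond that time. The function name `kaplan_meier` and the eight flavor durations are hypothetical and chosen only for illustration; in practice a statistical package would be used.

```python
def kaplan_meier(times, censored):
    """Return a list of (event time, estimated survival just after that time).

    times    -- observed durations (event time or censoring time)
    censored -- parallel list of booleans, True if the value is right-censored
    """
    event_times = sorted({t for t, c in zip(times, censored) if not c})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)     # still under observation
        events = sum(1 for u, c in zip(times, censored) if u == t and not c)
        survival *= 1 - events / at_risk              # product-limit step
        curve.append((t, survival))
    return curve


# Hypothetical flavor durations in minutes; two values are censored at 40.
times = [7, 13, 20, 20, 30, 40, 40, 40]
censored = [False, False, False, False, False, False, True, True]
curve = kaplan_meier(times, censored)
for t, s in curve:
    print(f"t = {t:2d}  estimated survival = {s:.3f}")
```

Because two of the values at 40 minutes are censored, the estimate does not drop to zero at the end of the observation period, which is the feature that distinguishes this analysis from simply tabulating the observed durations.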
Note to the Instructor: For additional references that discuss basic biostatistics concepts, see: Lachin (2000) and Rosner (2000).
Instructions:
C.2 Two Sample
Which Gum Lasts Longer?
Problem:
We want to compare the flavor durations, in minutes, for two brands of chewing gum. However, some of our data values have been censored at 40 minutes. How can we analyze this data?
Background:
We have seen how to analyze data for a single right-censored sample. In order to compare two right-censored samples, we will extend what we have learned for the one sample case.
Instructions:
Brand 1 (n=25) | | | | | | Brand 2 (n=18) | | | |
---|---|---|---|---|---|---|---|---|---|
40c | 40 | 22 | 35 | 30 | | 8 | 16 | 20 | 28 |
35 | 7 | 40c | 40c | 35 | | 40c | 22 | 40 | 25 |
40c | 40 | 20 | 20 | 27 | | 17 | 26 | 40c | 30 |
40c | 40c | 7 | 40c | 31 | | 35 | 40c | 35 | |
40c | 30 | 13 | 40c | 40c | | 40c | 30 | 28 | |
Name______________________________
(0 = no intensity, 10 = extreme intensity) 0 1 2 3 4 5 6 7 8 9 10
(0 = no flavor, 10 = extremely flavorful) 0 1 2 3 4 5 6 7 8 9 10
____________________ minutes
(0 = very soft, pliable, 10 = very hard, rubbery) 0 1 2 3 4 5 6 7 8 9 10
Brandt, L. A. (2001), “FORMULATION CHALLENGE: CONFECTIONERY - A STICKY Situation,” [Online], www.preparedfoods.com/CDA/ArticleInformation/features/BNP__Features__Item/0,1231,114008,00.html
Cobb, G. (1992), “Teaching Statistics,” in Heeding the Call for Change: Suggestions for Curricular Action, ed. L. Steen, MAA Notes, 22, 3-43.
Collett, D. (1996), Modelling Survival Data in Medical Research, New York: Chapman and Hall.
Gnanadesikan, M., Scheaffer, R., Watkins, A., and Witmer, J. (1997), “An Activity-Based Statistics Course,” Journal of Statistics Education [Online], 5(2). ww2.amstat.org/publications/jse/v5n2/gnanadesikan.html
Lachin, J. M. (2000), Biostatistical Methods: The Assessment of Relative Risks, New York: John Wiley and Sons.
Lang, T. A. and Secic, M. (1997), How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers, Philadelphia: American College of Physicians.
McClave, J. T. and Sincich, T. (2003), Statistics, 9^{th} edition, Upper Saddle River, New Jersey: Prentice Hall.
Rosner, B. (2000), Fundamentals of Biostatistics, 5^{th} edition, Pacific Grove, California: Duxbury.
Scheaffer, R. (1996), Overview for Activity-Based Statistics: Instructor Resources, New York: Key Curriculum Press; Springer.
Mary Richardson
Department of Statistics
Grand Valley State University
Allendale, MI 49401
U.S.A.
richamar@gvsu.edu
Neal Rogness
Department of Statistics
Grand Valley State University
Allendale, MI 49401
U.S.A.
rognessn@gvsu.edu
Byron Gajewski
The University of Kansas Medical Center
Kansas City, KS 66160
U.S.A.
bgajewski@kumc.edu