Paul Roback
St. Olaf College
Beth Chance
California Polytechnic State University, San Luis Obispo
Julie Legler
St. Olaf College
Tom Moore
Grinnell College
Journal of Statistics Education Volume 14, Number 2 (2006), ww2.amstat.org/publications/jse/v14n2/roback.html
Copyright © 2006 by Paul Roback, Beth Chance, Julie Legler, and Tom Moore, all rights reserved. This text may be freely shared among individuals, but it may not be republished in any medium without express written consent from the authors and advance notification of the editor.
Key Words: Goodness-of-fit test; Mathematical statistics; Sampling distribution; Student-learning focus; Teacher collaboration.
We began our undertaking intrigued by what we knew of Japanese Lesson Study, but extremely “green” with respect to its implementation. Not only were we newcomers to the ideas shaping lesson study, but we could find nothing in the literature to guide the implementation of lesson study specifically at the college level, especially in an upper-level statistics course. Thus, we embarked on a pilot implementation: a preliminary attempt to assess the feasibility of Japanese Lesson Study principles in upper-level undergraduate statistics courses. We hoped to gain insight into concrete benefits and potential pitfalls. In this manuscript, we will first describe Japanese Lesson Study: its philosophy, its process, its desired outcomes, and early findings from its implementation at the K-12 level in the United States. Second, we will outline the process we followed in implementing lesson study principles in a Mathematical Statistics course at St. Olaf College in the spring semester of 2004. With a prerequisite of Probability Theory, this course was targeted toward juniors and seniors who were mathematics majors or statistics concentrators with no previous course in statistics. Finally, we will report the results of our implementation and offer suggestions and recommendations for others who might consider this approach.
As outlined in Curcio (2002), the lesson study process involves several important steps:
Each lesson study group must identify a broad overarching goal and develop a set of specific objectives. The broad goal contains a vision of the type of student the educational community wishes to produce, identifying gaps between the ideal student and what teachers typically observe. Examples cited include themes such as “be active problem-solvers” or “develop scientific ways of thinking” (Lewis and Tsuchida 1998, p. 14). Once the broad goal is formulated, the group then identifies a specific lesson topic that might address this goal. Selection of the lesson topic customarily involves deep study of the current curriculum, student backgrounds, and other initiatives designed to address the broad goal. The group then forms a second, more specific, set of objectives related to the lesson topic selected. For example, Lewis and Tsuchida (1998) describe a lesson designed with the broad goal of encouraging fifth-grade Japanese students to demonstrate scientific thinking. Specifically, the teacher asks pairs of students to study the effects of three variables generated by the class on the cycle time of a pendulum, with the objective of being able to separate the effects.
A strength of lesson study is the atmosphere of collaboration that it fosters. Teachers bring different perspectives and experiences to a common task. As noted by U.S. educational researcher Richard Elmore, “isolation is the enemy of improvement” (Lewis 2002, p. 11). However, lesson study differs from other collaborative activities because it “makes teacher collaboration concrete and focuses on a specific goal: better understanding of student thinking in order to develop lessons that advance student learning” (Wang-Iverson 2002, paragraph 7). Lesson study guidelines (Fernandez and Chokshi 2002) advise teachers to make the most of their limited meetings, working out fine details of lesson plans and handouts between meetings, while using meeting times for examination of materials, plotting general strategy, and discussion of larger issues.
In lesson study, collaboration does not end with the development of a lesson plan; rather, the collaboration has just begun. As one group member teaches the lesson, the others (and possibly outsiders) observe the class with a careful eye toward how students engage with and process the material, guided by questions posed and objectives stated during the planning process. Thus, the focus of observation is not the teacher, but the students and their learning. Class observers should follow a clear protocol of behavior (Curcio 2002); for instance, they should refrain from interfering in the lesson, e.g., answering student questions, but they should be free to ask clarification questions of students. As discussed in Watanabe (2002), insightful observation does not happen automatically; rather, it is a skill teachers must learn. Part of that skill is the ability to gather meaningful information beyond what can be gleaned from tests, written assignments, or even videotape. Lewis (2002) cites records of student engagement, persistence, degree of interest, emotional reactions, and quality of small-group discussion as examples of meaningful data.
Collaboration then continues as the group reconvenes to reflect thoughtfully on the class period. Again, there are suggested protocols for these feedback sessions (Curcio 2002), which can be summarized by some basic tenets:
Ideally, the group will modify their lesson plan based on these discussions and a different teacher will teach the lesson to a new group of students. Other group members will observe this class session, and the process will be repeated.
Lesson study, though, is more than just a collaborative activity, and the development of a study lesson is much more than a set of lecture notes. One crucial product created by the group is the lesson study plan. As described in Fernandez and Chokshi (2002), Japanese teachers often use a four-column chart. (Curcio (2002) describes a slightly different four-column chart.) Column One contains the steps of the lesson: the sequence of topics, examples, and questions that the teacher has planned. Column Two contains student activities and expected student responses and reactions for each step in the lesson. Column Three contains points for the teacher to remember, ways in which the teacher might deal with student responses, and ways to tie the lesson together. Finally, Column Four lists methods for evaluating whether each segment of the lesson was successful in achieving its goals.
Through the organization of these columns and the creation of a lesson study plan, some points of emphasis become evident. For instance, the lesson plans most of us write for everyday teaching would stop with Column One. The expected student responses in Column Two and the teacher reaction to these responses in Column Three illustrate the focus on student learning: considering a priori how students will be processing information, forming questions, and constructing new knowledge. Through lesson study, teachers develop “the eyes to see students (kodomo wo miru me)” (Lewis 2002, p. 12). Furthermore, the evaluation methods in Column Four illustrate the focus on research, as “the classroom becomes the teachers’ laboratory for continuous improvement of teaching and learning,” (Wang-Iverson 2002, paragraph 9) and one can assess objectively the success of the lesson in meeting stated goals.
The reader can find more detailed guides to the implementation of lesson study in the references (see especially Stigler and Hiebert 1999; Curcio 2002; Fernandez 2002; Fernandez and Chokshi 2002; Lewis 2002; Watanabe 2002) and at the following websites of lesson study research groups: www.tc.columbia.edu/lessonstudy and www.lessonresearch.net. Researchers in these groups and elsewhere identify the many benefits noticed in Japan and the U.S. of adopting a culture of lesson study. Frequently cited benefits (Lewis and Tsuchida 1998; Lewis 2002; Lewis, Perry, and Hurd 2004) for teachers and teaching practice include increased knowledge of subject matter, increased knowledge of instruction, increased ability to observe students, increased focus on student learning, stronger collegial networks, stronger support for novice teachers, stronger connection of daily practice to long-term goals, stronger motivation and sense of efficacy, support for taking risks, and improved quality of available lesson plans. Benefits for students include improved achievement, learning more carefully considered content more deeply, enhanced ability to make connections, and a higher level of engagement with the material.
The first lesson study groups in the United States were formed only five years ago, and educational researchers caution that “lesson study is easy to learn but difficult to master” (Chokshi and Fernandez 2004, p. 524). Fernandez, Cannon, and Chokshi (2003) describe the development of three new lenses for examining lessons:
Through these lenses, U.S. practitioners of lesson study have documented challenges which researchers (Stigler and Hiebert 1999; Fernandez 2002; Lewis 2002; Wang-Iverson 2002) maintain must be overcome before obtaining the successful outcomes common in Japan.
Through the curriculum developer lens, one immediately notes that the curriculum in the United States is overprescribed compared to that in Japan, leaving less time to explore topics in depth. If U.S. teachers choose to lead students to knowledge construction, they run the risk of not finishing the race to complete a lengthy list of topics. This pressure exists in many undergraduate statistics courses, too, as instructors try to pacify client disciplines or other teachers using a particular course as a prerequisite. Another challenge cited for K-12 educators in the United States (Fernandez 2002) is the lack of common curricular ground, in contrast with the national curriculum strictly followed by Japanese teachers. A successful lesson cannot be planned without carefully considering student backgrounds and the place of a lesson in the larger curriculum. College instructors may face an even bigger challenge in this regard with their considerable freedom to choose material to cover, presentation style, and primary texts. For example, a lesson study group of undergraduate instructors may find themselves debating at length which topics to include and in what order to present them before finally focusing on a specific lesson plan.
Lack of proficiency using the student lens is another barrier to successful implementation of lesson study. U.S. teachers are not often trained to analyze each problem, each question posed, and each choice in idea development from the perspective of the student. For instance, when posing a problem to the class, it is not enough to list potential student solutions; a teacher must consider what each solution says about student understanding and processing. Effective use of the student lens requires teamwork among educators and careful assessment of student learning. One barrier, then, is the independent nature of those attracted to teaching, which may become more pronounced with more experience. U.S. teachers at all levels customarily teach in isolation, not routinely opening their classrooms to outside observers and constructive criticism. Yet observation of students during lessons is essential to the development of a student lens.
From the viewpoint of some researchers (Fernandez, Cannon, and Chokshi 2003), the biggest challenge to the successful implementation of lesson study is the ability of teachers to examine lessons through a researcher’s lens. As Fernandez and her colleagues observed Japanese teachers mentoring U.S. fifth- and sixth-grade teachers on the lesson study process, they noted that “the Japanese teachers emphasized four critical aspects of good research: the development of meaningful and testable hypotheses, the use of appropriate means for exploring these hypotheses, the reliance on evidence to judge the success of research endeavors, and the interest in generalizing research findings to other applicable contexts” (p. 173). In adopting this researcher lens, a practitioner of lesson study must continually relate lesson steps to overall goals and objectives, carefully consider how to gather evidence to assess whether or not objectives are being met, and reflect on which insights gained might apply to future classroom settings.
After spending the first meeting watching a videotape overview of Japanese Lesson Study (Curcio 2002) and discussing the lesson study philosophy and process, we came to the second meeting ready to brainstorm about big goals and lesson content. The discussion was predictably wide-ranging, and it consumed much of the next few meetings. We discussed the important ideas we’d like students to remember from a statistics class, how to make those important ideas stick, which important ideas students struggle to understand, how to tie several ideas together, and how to manage the level of detail presented. We also mentioned good lessons and activities on which we could build: Cents and the Central Limit Theorem (Scheaffer, Gnanadesikan, Watkins, and Witmer 1996), the German Tank problem (e.g., Scheaffer et al. 1996), golf tees inscribed with numbers from different distributions, etc. The concept of sampling distributions became the primary theme we wished to incorporate, in the context of goodness-of-fit tests.
Our target audience was 23 students (primarily juniors and seniors) in Math 312B: Mathematical Statistics at St. Olaf College. This section of Math 312, taught by one of the authors (Roback), consisted of students with no previous course in statistics; the prerequisite was Probability Theory, which the majority of students had taken the previous semester. The required textbook for Math 312 was An Introduction to Mathematical Statistics and Its Applications by Larsen and Marx (2001); in addition, S-Plus programming was used on a weekly basis for running simulations, exploring properties of test statistics, and analyzing data (for more detail see the course syllabus). Weekly homework assignments contained a mixture of mathematical derivation, applied data analysis, and S-Plus simulation. The class met three times a week for 55-minute sessions, which consisted of lecture and whole-class problem solving, with occasional small group activities. Students were expected to attend every class session so that full participation in classroom activities and take-home assignments could be assumed.
The study lesson on goodness-of-fit tests and sampling distributions was conducted in the next-to-last week of the 16-week semester, immediately after a unit on regression analysis and inference. Another author (Moore) observed the class and took notes of his observations. We also arranged to have both lessons videotaped, but between an absent videographer and marginal video quality, we could not gather as much information from the videotapes as we hoped.
Table 1 contains a short outline of our study lesson plan; more details can be found in the partial four-column study lesson plan in Table 2 and in the complete four-column study lesson plan; handouts from class can also be found at handouts. Specific objectives for our study lesson included:
The plan in Table 1 is the last of several iterations, and it reflects the efforts from group meetings over 12 weeks, as well as efforts by several individuals between group meetings to fill in details and provide the group rough drafts to discuss.
| Time | Steps in Study Lesson |
| --- | --- |
| Day One | Discuss the general problem: How would an M&M manufacturer decide whether the colors of M&Ms are being produced in the correct proportions? |
| | Discuss potential sample results: How much deviation is too much? |
| | Pass out M&M samples and form groups of two. Each group must devise a test statistic to measure the deviance of their sample from what they would expect if the process is working correctly. |
| | Groups of two combine to form groups of four, and each group of four selects and presents one of their test statistics (with rationale) to the class. The rationale should be based on deliberations about what defines a good test statistic. |
| | Pose the next problem: Based on their chosen test statistic, would they conclude that their original sample of M&Ms contains convincing evidence that the manufacturing process is malfunctioning? |
| | Begin to investigate empirical sampling distributions and p-values with hand calculations from simulated samples generated by S-Plus under the null hypothesis. |
| At-home before Day Two | Generate an empirical sampling distribution for the group test statistic and also for the chi-square goodness-of-fit statistic. |
| | Find empirical p-values for group’s original data and 10 prototype samples designed to illustrate the performance of the test statistic under specific cases. |
| Day Two | Discuss results from in-class and take-home assignments. Think about criteria for good test statistics. |
| | Groups work on the Fumble Problem (Larsen and Marx 2001, p. 253): students investigate how to extend the chi-square goodness-of-fit test to discrete probability distributions. |
| | Guide groups to think about issues such as estimating model parameters, adjusting degrees of freedom, and avoiding small cells. |
| At-home before Day Three | Conduct simulations to see the value in adjusting degrees of freedom in the chi-square distribution when parameters are estimated. |
| | Examine how the simulation extends goodness-of-fit tests to continuous probability distributions. |
| Day Three (not officially part of the study lesson) | Discuss results from Days One and Two, and the take-home assignment for Day Three. |
| | Develop the chi-square test of independence for two-way tables. |
It is important to recognize that lengthy discussion preceded the formulation of this lesson plan. For example, we spent considerable time planning the first few steps of Day One; we wanted to provide motivation with a realistic problem, and we wanted to lead students to develop a test statistic on their own. We ended up using the standard M&M multinomial distribution problem, partly because this Math 312 class was about to tragically complete their first full statistics course without ever eating or counting M&Ms, but mainly because it was a simple problem with a real context.
The introductory portion of Day One represents a unique product of this study lesson that never would have materialized without thoughtful collaboration (and which really never would have materialized under a typical presentation of this material based on Larsen and Marx (2001)). Students, in groups of two, were asked to think about how one could separate sample results into those that favor the null hypothesis and those that favor the alternative hypothesis. While designing their tests, students were asked to consider what properties a good test statistic should possess. Based on these properties, students then had two opportunities to present, defend, and potentially modify their invented test statistics: first as two groups of two came together to compare their respective test statistics, and second as the groups of four presented their chosen test statistics to the class. The four-column lesson plan for this introductory part of Day One is shown in Table 2; the remainder can be found in the complete four-column lesson plan.
| Learning Activities and Key Questions | Student Activity and Expected Responses | Teacher’s Response and Things to Remember | Goals and Evaluation |
| --- | --- | --- | --- |
| Day One. Introduce general problem: How would an M&M manufacturer decide whether the colors of M&Ms are being produced in the correct proportions? | A candy manufacturer is told to make 13% brown, 14% yellow, 13% red, 24% blue, 20% orange, and 16% green candies, but he believes the manufacturing process is malfunctioning. | | |
| Brainstorm plan | Suggest ways to evaluate the claim | Get students to suggest | How easily do students consider sampling variability? |
| Discuss potential sample results: How much deviance is too much? | Expect students to be okay with a little deviance from null, but unsure of where to draw the line. | Present possible ways multinomial sample of size 40 could turn out; ask if each one provides significant evidence of malfunctioning. | Do they understand that some variability from what is expected is natural? |
| Introduce data | Each student receives a bag, work in pairs to get the tally for the first 40 M&Ms | Blindly take 20 candies from the big bag of M&Ms. | |
| Examine sample | Students tally the colors | Look at your sample results. Do they support the manufacturer’s claim? | Do students think beyond the sample? |
| Develop “custom” test statistics: While we expect some discrepancy, how can you decide if your sample is “too different” from expected? How can you measure how “deviant” your sample is? Can you express this as one number? | Students brainstorm ways to measure the deviation. Groups of 2 for 5 minutes. | What are some properties of your measurement technique? Do you expect the results to be large, small? Positive, negative? Pass out Handout #1. Students who want to use z-scores need to combine them in some way to come up with one number. | Which ideas from course do students latch onto? Do their custom statistics separate samples which agree with the null from those which agree with alternative? |
| Combine with another group: Decide which of two test statistics is preferable. | Prepare to defend choice to class. Groups of 4 (2 groups of 2) for 5 minutes | Encourage groups to be able to defend choice based on desirable properties. | What are seen as good properties of a test statistic? |
| Share with class | “Defend” their test statistic (and its properties) to the rest of the class | The formula you have come up with is a “test statistic”. | |
After selecting a final test statistic, students were asked to specify which values of their test statistic would provide strong evidence against the null hypothesis. Until this point in the course, any test statistic the students had considered had magically, after pulling a couple of theorems out of a hat, followed a well-known distributional form under the null hypothesis. Now, with their own creations, students needed to think about empirical sampling distributions and empirical p-values. We had defined and discussed sampling distributions at various points, and we had frequently used simulations in S-Plus to investigate issues of robustness, so the groundwork had been laid to explore empirical sampling distributions and p-values (or so we believed). By focusing on empirical sampling distributions, we were placing the specific objectives of developing goodness-of-fit tests within the broader goals of promoting statistical thinking and understanding the role of sampling distributions.
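The class ran its simulations in S-Plus, and that code is not reproduced here. As a rough illustration only, the Python sketch below builds an empirical sampling distribution under the null hypothesis and reads off an empirical p-value. The color proportions are those stated in the lesson plan, and the two samples are prototypes A and F from Table 3; the function names, the seed, the number of replications, and the choice of the mean squared difference as the example statistic are assumptions made for this sketch.

```python
import random

# Null-hypothesis color proportions from the lesson plan.
NULL_PROPS = {"blue": 0.24, "orange": 0.20, "green": 0.16,
              "yellow": 0.14, "red": 0.13, "brown": 0.13}
COLORS = list(NULL_PROPS)


def mean_squared_difference(counts):
    """Example custom statistic: average squared gap between observed and expected counts."""
    n = sum(counts.values())
    return sum((counts[c] - n * NULL_PROPS[c]) ** 2 for c in COLORS) / len(COLORS)


def simulate_null_counts(n, rng):
    """Draw one multinomial sample of n candies under the null proportions."""
    draws = rng.choices(COLORS, weights=[NULL_PROPS[c] for c in COLORS], k=n)
    return {c: draws.count(c) for c in COLORS}


def empirical_p_value(observed, statistic, reps=5000, seed=312):
    """Share of simulated null samples whose statistic is at least the observed value."""
    rng = random.Random(seed)  # hypothetical seed, fixed only for reproducibility
    n = sum(observed.values())
    obs_stat = statistic(observed)
    hits = sum(statistic(simulate_null_counts(n, rng)) >= obs_stat
               for _ in range(reps))
    return hits / reps


# Prototype samples A and F from Table 3.
sample_a = {"blue": 10, "orange": 8, "green": 7, "yellow": 5, "red": 5, "brown": 5}
sample_f = {"blue": 40, "orange": 0, "green": 0, "yellow": 0, "red": 0, "brown": 0}

print(empirical_p_value(sample_a, mean_squared_difference))  # large: typical under the null
print(empirical_p_value(sample_f, mean_squared_difference))  # near zero: extreme under the null
```

Because the reference distribution comes from simulation rather than a theorem, any sensible statistic can be passed in place of `mean_squared_difference`.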
Another source of discussion and disagreement was the order of topics for Day One. To transition from the test statistics constructed by students to the chi-square statistic, we attempted to create “prototype samples” to allow students to examine the behavior of their test statistic in specific cases. The prototype samples were designed to illustrate extreme cases and introduce subtle cases that would show the advantages of the chi-square statistic. Table 3 shows some of these prototypes. For example, Sample A reflects the most likely multinomial sample under the null hypothesis. Samples B and C were designed to illustrate how the test statistic handles discrepancies of the same absolute size, one in the more abundant categories and one in the less abundant categories. This comparison provided the biggest departure between the chi-square statistic and the most popular student choice (the average squared difference between observed and expected counts). Samples D and E were designed to illustrate the effect of sample size, and Sample F was designed to illustrate how extreme results are handled.
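| Sample | Blue | Orange | Green | Yellow | Red | Brown |
| --- | --- | --- | --- | --- | --- | --- |
| A | 10 | 8 | 7 | 5 | 5 | 5 |
| B | 10 | 8 | 7 | 5 | 9 | 1 |
| C | 14 | 4 | 7 | 5 | 5 | 5 |
| D | 13 | 11 | 10 | 2 | 2 | 2 |
| E | 26 | 22 | 20 | 4 | 4 | 4 |
| F | 40 | 0 | 0 | 0 | 0 | 0 |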
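To make the comparison among the prototypes concrete, the following Python sketch evaluates two statistics on the Table 3 samples: the average squared difference between observed and expected counts, and the chi-square goodness-of-fit statistic. The counts and null proportions come from the lesson materials; the function and variable names are illustrative assumptions, and this is not the S-Plus code distributed to the class.

```python
# Null proportions in Table 3's column order: blue, orange, green, yellow, red, brown.
NULL_PROPS = [0.24, 0.20, 0.16, 0.14, 0.13, 0.13]

PROTOTYPES = {
    "A": [10, 8, 7, 5, 5, 5],
    "B": [10, 8, 7, 5, 9, 1],
    "C": [14, 4, 7, 5, 5, 5],
    "D": [13, 11, 10, 2, 2, 2],
    "E": [26, 22, 20, 4, 4, 4],
    "F": [40, 0, 0, 0, 0, 0],
}


def expected_counts(counts):
    """Expected count per color under the null, for this sample's size."""
    n = sum(counts)
    return [n * p for p in NULL_PROPS]


def mean_squared_difference(counts):
    """Average of (observed - expected)^2 across the six colors."""
    exp = expected_counts(counts)
    return sum((o - e) ** 2 for o, e in zip(counts, exp)) / len(counts)


def chi_square(counts):
    """Sum of (observed - expected)^2 / expected across the six colors."""
    exp = expected_counts(counts)
    return sum((o - e) ** 2 / e for o, e in zip(counts, exp))


for label, counts in PROTOTYPES.items():
    print(f"{label}: mean squared difference = {mean_squared_difference(counts):8.2f}, "
          f"chi-square = {chi_square(counts):8.2f}")
```

With these definitions the two statistics order Samples B and C differently: the chi-square statistic is larger for B, where the discrepancy falls in the scarcer colors, while the mean squared difference is larger for C. Because Sample E is exactly double Sample D, its chi-square value is twice that of D, whereas its mean squared difference is four times as large.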
Originally, we planned to have students examine the prototype samples immediately after developing their own test statistic, as a way to determine properties, strengths, and weaknesses of their test statistic. However, we decided to follow the development of an invented test statistic with empirical sampling distributions and p-values, allowing the class to spend more time with this fundamental idea. We then introduced the prototype samples, hoping to lead students to see inefficiencies with their developed statistic and motivation for the chi-square statistic. In fact, we decided that the chi-square statistic could be effectively introduced between Day One and Day Two; students could simulate empirical sampling distributions for their statistic and the chi-square statistic, and compare the performance of both statistics on the prototype samples. In this way, we expected some thoughtful, empirically motivated discussions at the beginning of Day Two with the underlying purpose of addressing the original question about the desired production proportions.
The development of customized test statistics on Day One took longer than expected, so we spent less time than expected (under 15 minutes) examining the empirical sampling distribution. As a result, the instructor gave a hurried summary of empirical sampling distributions and p-values at the end of Day One, and students were responsible for examining both the prototype samples and the chi-square test statistic before Day Two, with the guidance of S-Plus template code provided in a handout. The logjam spilled into Day Two as well. The class and instructor spent more time than expected (over 20 minutes) sharing and summarizing what they had learned about empirical p-values and the performance of their test statistic compared to the chi-square statistic, but the discussion was too valuable to cut short. One problem was that, in most of the prototype samples the students investigated, the differences between the two test statistics were too subtle to be meaningful. Students, however, were intrigued with the idea that, by using simulation under the null hypothesis, they were free to employ any test statistic that they deemed sensible. In retrospect, it is not surprising that the subtleties of this new approach (i.e., empirical sampling distributions) took a while to sink in, despite the groundwork in place.
On Day Two, after a reflective discussion of Day One and a little theory about the chi-square goodness-of-fit test, we spent all our remaining time with the Football Fumbles example (see handouts). The agenda for Day Two had been slightly modified when the four authors met after Day One to review successes, surprises, and opportunities for improvement. Instead of the typical “present the formula, then trudge out an example” format, we designed Day Two (goodness-of-fit tests for specific distributions with parameters unknown) as a natural extension of Day One. Students were asked, based on how we attacked the M&M problem, to develop methodology for determining whether the number of fumbles in a game for each college football team could be reasonably modeled with a Poisson distribution. Student groups with a little prodding were able to extend from categories determined by M&M colors to categories determined by number of fumbles in a game. They then hit stumbling points our lesson study group had anticipated, and the instructor was able to direct them to think about issues such as: How do we determine the expected number of teams in each group? How do we handle the unknown parameter from a Poisson distribution? How do we ensure that the expected number of teams in each group exceeds some minimum (after the instructor cautioned about small expected values in light of model assumptions)? How do we determine p-values for our test of hypothesis? Once again, the class spent more time than we had expected on the Football Fumbles example, and we were not able to attack the Cockpit Noise example, in which students would make a further extension to goodness-of-fit tests for continuous distributions. However, in planning for the at-home activity following Day Two, we illustrated (through S-Plus code) how one might use a goodness-of-fit test to determine if a set of data was sampled from a normal distribution.
This illustration was housed in a simulation built to examine the advisability of adjusting the degrees of freedom in the chi-square test statistic when model parameters are estimated from sample data.
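The questions raised in the Football Fumbles discussion (expected counts, the unknown Poisson parameter, small cells, and degrees of freedom) can be sketched in a few lines of Python. The fumble data are not reproduced in this article, so the counts below are invented purely for illustration; the function name, the pooling rule, and the minimum expected count of 5 are likewise assumptions of this sketch rather than the procedure used in class. The sketch stops at the statistic and its degrees of freedom; a p-value would then come from a chi-square table or a simulation.

```python
import math


def poisson_gof(observed, min_expected=5.0):
    """Chi-square goodness-of-fit statistic for a Poisson model with estimated mean.

    observed[k] is the number of teams with k fumbles; the last entry is
    treated as "that many or more".
    """
    n = sum(observed)
    # Estimate the Poisson mean (the open-ended last cell is counted at its lower bound).
    lam = sum(k * count for k, count in enumerate(observed)) / n

    # Fitted probabilities: exact Poisson terms, then the remaining upper tail.
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(len(observed) - 1)]
    probs.append(1.0 - sum(probs))

    obs = list(observed)
    exp = [n * p for p in probs]

    # Pool sparse end cells into their neighbors until expected counts are large enough.
    while len(exp) > 2 and exp[-1] < min_expected:
        tail_exp, tail_obs = exp.pop(), obs.pop()
        exp[-1] += tail_exp
        obs[-1] += tail_obs
    while len(exp) > 2 and exp[0] < min_expected:
        head_exp, head_obs = exp.pop(0), obs.pop(0)
        exp[0] += head_exp
        obs[0] += head_obs

    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    df = len(exp) - 1 - 1  # one df lost to the fixed total, one to the estimated mean
    return lam, stat, df, list(zip(obs, exp))


# Invented counts for illustration only: teams with 0, 1, ..., 6, and 7+ fumbles.
hypothetical_fumbles = [8, 24, 27, 20, 17, 10, 3, 1]
lam, stat, df, cells = poisson_gof(hypothetical_fumbles)
print(f"estimated mean = {lam:.3f}, chi-square = {stat:.2f}, df = {df}")
```

The subtraction of an extra degree of freedom for the estimated mean is the adjustment whose value the at-home simulation was designed to examine.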
Day Three was not part of the study lesson planned by the group, but it used the ideas and activities from Days One and Two to bring the unit on goodness-of-fit tests to a satisfying (and efficient) close. After reflecting as a class on the main ideas from Day Two and the simulation results completed at home following Day Two, we extended the chi-square goodness-of-fit test to two-way tables of categorical variables as a test of independence.
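As a brief illustration of that extension, the Python sketch below computes the chi-square statistic for a two-way table, taking each expected count as (row total × column total) / grand total and the degrees of freedom as (rows − 1)(columns − 1). The table is hypothetical and the function name is an assumption; the Day Three materials are not reproduced here.

```python
def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical 2 x 3 table of counts, for illustration only.
example_table = [[20, 30, 50],
                 [30, 30, 40]]
stat, df = chi_square_independence(example_table)
print(f"chi-square = {stat:.3f}, df = {df}")
```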
Just as we had following Day One, the four authors met to debrief after Day Three. Beginning with comments from the instructor and the observer, we reviewed the entire lesson, comparing our plans and intentions with how the class actually proceeded and how the students reacted. Many of our observations relating to our original list of goals and objectives are included in the next section. This reflective session was absolutely vital, but it would have been even more valuable if all group members and even some outsiders had been able to observe the lessons being taught. Unfortunately, we were limited to a single observer because of teaching conflicts. Ideally, this reflective meeting would then be followed by the planning and implementation (by a different instructor) of a revised lesson based on insights acquired during the first teaching. In our case, the repeat session fell victim to lack of time at the end of the semester, although repeat sessions in the same semester are inherently challenging since different sections of the same course tend to move at similar paces. We address implications of the lack of additional observers and ongoing revisions in upcoming sections.
Objective #1 on student engagement was successfully accomplished. Notes by the classroom observer (Moore) mentioned that the problem setup created interest from the outset and that student groups actively sought solutions to questions posed. Over the previous 12 weeks of Math 312B, student engagement was, with a few exceptions, limited to working out problems in pairs, explaining concepts to partners, and class discussions. So the activities designed for knowledge construction for the study lesson required a higher level of engagement from the students. Although students were actively collaborating at several points during the lesson, we were often surprised at the slow progress of their collaborations. Perhaps our expectations were too high, given that some of us had not previously inserted activities with this level of student content responsibility in an upper-level statistics course, but the fact that students were being asked to step out of their Math 312B “comfort zone” in the second-to-last week could have also played a role. For example, the observer noted that groups did not want to write things down, hoping to avoid commitment to written answers, although time limits eventually prodded groups to stick with an answer. Also, the observer noted that the 20-minute wrap-up at the beginning of Day Two was a valuable conversation even though students did more responding than inquiring. A high level of student engagement was still evident; the quality of that engagement could be improved by implementing lessons such as this throughout the semester, or perhaps by planning a study lesson prior to the start of the semester and implementing it early in the semester.
Objective #2, on statistical thinking about test statistics, was the primary focus of Day One, and it was met with fair success according to observer and teacher notes. The observer was asked, in the study lesson plan, to note “How easily do students consider sampling variability?” (Answer: pretty naturally, as they recognized that different sample results could come from the same underlying production process.), and “Do students think beyond the sample?” (Answer: Yes, although no group came up with the idea of examining hypothetical samples which might have occurred.). Groups varied in their abilities to generate ideas, but most ended up with a reasonable test statistic (for example, three of the five larger groups settled on the average squared difference between observed and expected values). We had just completed our regression unit, so many groups leaned heavily on the idea of the sum of squared residuals from that unit. Our predictions for student-generated test statistics (e.g., maximum absolute difference between observed and expected, average z-score for categories) were not realized, in part because we failed to account for the carryover effect of the previous topic studied. We would expect to find much more variability in proposed test statistics if this lesson were presented earlier in the course. At other times, groups promoted certain oversimplifications, such as the idea that test statistics are only valuable when their distributions are well-known (and preferably normal). Most promising, though, was the observer’s note that, through the lesson, students began to realize that performing hypothesis tests is a process and not just a formula. By being confronted with questions about how to design test statistics and what criteria to use to evaluate them, students began to see beyond simple formulas.
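To make the candidate test statistics named above concrete, the following Python sketch computes three of them for a single sample of category counts: the average squared difference that several groups proposed, the maximum absolute difference we had predicted, and the classic chi-square statistic for comparison. The counts, the equal null proportions, and all function names are hypothetical illustrations, not data or code from the class (which used S-Plus).

```python
# Hypothetical sample: counts of 100 candies in five color categories,
# with a null hypothesis of equal proportions. Not data from the lesson.
observed = [18, 22, 25, 15, 20]
null_probs = [0.2, 0.2, 0.2, 0.2, 0.2]


def expected_counts(n, probs):
    """Expected count in each category under the null proportions."""
    return [n * p for p in probs]


def avg_squared_diff(obs, exp):
    """Average squared difference between observed and expected counts."""
    return sum((o - e) ** 2 for o, e in zip(obs, exp)) / len(obs)


def max_abs_diff(obs, exp):
    """Largest absolute difference between observed and expected counts."""
    return max(abs(o - e) for o, e in zip(obs, exp))


def pearson_chi_square(obs, exp):
    """Classic chi-square statistic: squared differences scaled by expected."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


expected = expected_counts(sum(observed), null_probs)
print("average squared difference:", avg_squared_diff(observed, expected))
print("maximum absolute difference:", max_abs_diff(observed, expected))
print("chi-square statistic:", pearson_chi_square(observed, expected))
```

All three statistics grow as the sample departs from the null proportions; they differ in how they weight the categories, which is the kind of criterion the groups were asked to weigh.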
Objective #3, seeing the need for an empirical sampling distribution, was the transition about which we most fretted, and it proved to be a lofty goal. In the Evaluation Column, we asked, “Do students think about sampling distributions? If not, what are their natural inclinations?” According to observer records, the teacher “struggled mightily” to get the students to suggest looking at an empirical sampling distribution to determine if the calculated test statistic provided convincing evidence against the null hypothesis. Even after walking the class through the analogy of one-sample tests of proportions in the context of black and white M&Ms from the Teacher Response Column, the instructor eventually posed the idea of empirical sampling distributions himself. We should not have been so surprised by this difficult transition. From the teacher notes, the students were looking for “some statistic [which] magically followed a known null distribution.” Upon reflection, we realized that this thinking followed the pattern found in all previous cases during the semester. The idea of a sampling distribution had been defined, discussed, and illustrated through S-Plus simulations at various points during the semester; nevertheless, every important test statistic seemed to, with the introduction of a few magical theorems, follow a known, convenient null distribution. Our planning for and reflection on Objective #3 proved to be one of the most valuable parts of our lesson study-based process; since we believed that students should leave Math 312 with (among other things) a strong notion of sampling distributions, it was apparent that the idea of sampling distributions needs to be stressed and illustrated earlier, more often, and in more effective ways. Indeed, using this lesson earlier in the term would be one effective way to introduce this concept.
The statement of Objective #4 was, in retrospect, poorly written. Although the central statistical content of this unit was indeed the chi-square goodness-of-fit test, our objective, as worded, merely stated that this topic was to be “introduced.” Instead, we sought to provide content and motivation from which the chi-square goodness-of-fit test and its null distribution would naturally proceed. We hoped the students themselves would see the need for and the utility of these ideas. In fact, students first encountered the chi-square test statistic when completing their take-home assignment following Day One. Ideally, in this way, students would be more likely to recall the rationale behind goodness-of-fit tests in general and the chi-square test in particular, which feeds into one of our broad overall goals of making important themes (like sampling distributions) more memorable to students. Teacher notes indicated that students responded favorably to the idea of empirical sampling distributions and p-values: that we could obtain p-values for their customized test statistic nearly as easily as for any classic test statistic.
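The idea of obtaining a p-value for a customized test statistic from an empirical sampling distribution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the S-Plus code used in the course: the observed counts, the equal null proportions, the number of simulated samples, the seed, and all function names are hypothetical.

```python
import random


def avg_squared_diff(obs, exp):
    """Customized statistic: average squared observed-minus-expected difference."""
    return sum((o - e) ** 2 for o, e in zip(obs, exp)) / len(obs)


def simulate_counts(n, probs, rng):
    """Draw one sample of size n under the null and tally counts per category."""
    counts = [0] * len(probs)
    for category in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[category] += 1
    return counts


def empirical_p_value(obs, probs, statistic, n_sims=5000, seed=2004):
    """Proportion of simulated null statistics at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = sum(obs)
    exp = [n * p for p in probs]
    observed_stat = statistic(obs, exp)
    null_stats = [statistic(simulate_counts(n, probs, rng), exp)
                  for _ in range(n_sims)]
    p_value = sum(s >= observed_stat for s in null_stats) / n_sims
    return observed_stat, p_value


# Hypothetical data: 100 candies in five color categories, equal null proportions.
observed = [18, 22, 25, 15, 20]
null_probs = [0.2] * 5
stat, p = empirical_p_value(observed, null_probs, avg_squared_diff)
print("observed statistic:", stat)
print("empirical p-value:", p)
```

Because the statistic is passed in as a function, any statistic a group proposes can be substituted without knowing its theoretical null distribution.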
Finally, Objective #5 on extending the goodness-of-fit tests to various scenarios showed satisfying progress. The first extension, from the categorical to the discrete case, was trickier than envisioned. As the observer noted, “Something about the switch from colors to number of fumbles, defining the categories, caught the groups up. Prompts were needed from the professor.” The next extension, from the discrete to the continuous case, was made by the students themselves for the take-home assignment following Day Two, and a wrap-up discussion at the beginning of Day Three made it apparent students were okay with this extension. The final extension, then, to the chi-square test of independence, seemed almost automatic to the students on Day Three. In considering these extensions, students appeared to be putting the main ideas behind the chi-square goodness-of-fit test together; for example, one student asked insightfully on Day Two, “Now let me get this straight…we’re using a chi-square distribution to test whether or not data follows a Poisson distribution.”
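The step that caught the groups up, defining categories for a discrete count, can be illustrated with a short Python sketch of a chi-square goodness-of-fit statistic for a Poisson model. The frequency table, the choice to pool counts of four or more into a single category, and the function names are hypothetical assumptions for illustration; they are not the fumble data or the code from the lesson.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_gof(freq, max_cat):
    """Chi-square statistic for Poisson fit; values >= max_cat are pooled.

    freq maps each observed count value to the number of times it occurred.
    """
    n = sum(freq.values())
    lam = sum(k * c for k, c in freq.items()) / n  # estimate the mean from the data
    # Categories: 0, 1, ..., max_cat - 1, and "max_cat or more".
    observed = [freq.get(j, 0) for j in range(max_cat)]
    observed.append(sum(c for k, c in freq.items() if k >= max_cat))
    probs = [poisson_pmf(j, lam) for j in range(max_cat)]
    probs.append(1.0 - sum(probs))  # remaining upper-tail probability
    expected = [n * p for p in probs]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # One degree of freedom lost for the total, one for the estimated mean.
    df = len(observed) - 1 - 1
    return lam, observed, expected, chi_sq, df


# Hypothetical frequency table: number of events per game over 60 games.
freq = {0: 12, 1: 20, 2: 18, 3: 8, 4: 2}
lam, observed, expected, chi_sq, df = poisson_gof(freq, max_cat=4)
print("estimated mean:", lam)
print("chi-square statistic:", chi_sq, "on", df, "degrees of freedom")
```

The statistic would then be compared with a chi-square distribution on the stated degrees of freedom, or with an empirical sampling distribution obtained by simulation.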
We also assessed our lesson study from the students’ viewpoint. An end-of-semester, online, anonymous evaluation was completed by 13 of the 23 members of the class (a low response, due in part to ill-timed campus-wide computer system breakdowns). One question specifically asked students “Did you like the format we used in class with the Chapter 10 (chi-square) material, i.e., developing ideas in small groups and testing them between classes with S-Plus simulations? What did you like or dislike about these classes compared to others?” Seven of the 13 respondents reported liking the approach in the study lesson, 2 did not like it, 3 were neutral, and 1 did not respond to this question. Those who liked it reported that our study lesson “captured my interest”, “allows for new types of mental connections to be made and to see things in ways different and perhaps richer than before”, and “help[ed] me remember the basic ideas behind chi-squared [sic]”. Others made suggestions and comments such as “lecture main points/ideas at end” (instead of at the beginning of the next class period), “pace seemed a little slow”, “not always clear on what objective was”, “nice balance; difficult to use throughout a course”, and “there was a guy looking over our shoulder and taking notes on us.” (Note: we did discuss the lesson study process and its purpose with the students prior to commencement of the lesson in class.) Since the first 12 weeks of class had featured lectures with examples and some small group activities, but nothing as active and open-ended as our study lesson, it is not surprising that some students were longing for their familiar routine with only one week to go.
In addition, on the final examination, one of the five questions (Question 3 of the Final Exam) was devoted to goodness-of-fit tests, including an S-Plus simulation. Given that no book problems were assigned on this topic, students performed very well, producing a raw median score of 21 out of 25, the second highest of the five questions (see Final Exam Rubric for broad scoring rubric). Unfortunately, these results could not be compared with historical results, since this final examination differed greatly in format from past finals in Mathematical Statistics given by the instructor.
Even as we offer our insights, we recognize that many open questions exist about the application of Japanese Lesson Study at the undergraduate level. Some questions for future attention include:
True implementation of lesson study is not easy. For this process to be effective, it is essential that instructors feel comfortable devoting sufficient time to the process, sharing their ideas, spending class time on open student investigations, and observing and reflecting on each other’s teaching. Yet our experience indicates that Japanese Lesson Study principles can be implemented successfully in upper-level undergraduate statistics courses. Despite our inexperience and imperfect implementation, all involved found the application of lesson study principles to be valuable and worthwhile, an experience which has had a lasting impact on our teaching beyond the single lesson on which we collaborated.
Chokshi, S., and Fernandez, C. (2004), “Challenges to Importing Japanese Lesson Study: Concerns, Misconceptions, and Nuances,” Phi Delta Kappan, 85(7), 520-525.
Curcio, F. R. (2002), A User’s Guide to Japanese Lesson Study: Ideas for Improving Mathematics Teaching, Reston, VA: National Council of Teachers of Mathematics.
Fernandez, C. (2002), “Learning from Japanese Approaches to Professional Development: The Case of Lesson Study,” Journal of Teacher Education, 53(5), 393-405.
Fernandez, C., Cannon, J., and Chokshi, S. (2003), “A US-Japan lesson study collaboration reveals critical lenses for examining practice,” Teaching and Teacher Education, 19, 171-185.
Fernandez, C., and Chokshi, S. (2002), “A Practical Guide to Translating Lesson Study for a U.S. Setting,” Phi Delta Kappan, 84(2), 128-136.
Garfield, J., delMas, R., and Chance, B. (2005), “The Impact of Japanese Lesson Study on Teachers of Statistics,” paper presented at the Joint Statistical Meetings, Minneapolis, MN.
Larsen, R. J., and Marx, M. L. (2001), An Introduction to Mathematical Statistics and Its Applications, 3rd Ed., Upper Saddle River, NJ: Prentice-Hall.
Lewis, C. (1995), Educating Hearts and Minds: Reflections on Japanese Preschool and Elementary Education, New York, NY: Cambridge University Press.
Lewis, C. (2002), “Does Lesson Study Have a Future in the United States?” Nagoya Journal of Education and Human Development, 1, 1-23.
Lewis, C., Perry, R., and Hurd, J. (2004), “A Deeper Look at Lesson Study,” Educational Leadership, February 2004, pp. 18-22.
Lewis, C., and Tsuchida, I. (1998), “A Lesson is Like a Swiftly Flowing River: Research lessons and the improvement of Japanese education,” American Educator, Winter, 14-17 and 50-52.
Scheaffer, R. L., Gnanadesikan, M., Watkins, A., and Witmer, J. A. (1996), Activity-Based Statistics, New York, NY: Springer-Verlag.
Stigler, J. W., and Hiebert, J. (1999), The Teaching Gap: Best Ideas from the World’s Teachers for Improving Education in the Classroom, New York: The Free Press.
Wang-Iverson, P. (2002), “Why Lesson Study?” in Papers and Presentations: An Introduction from RBS Lesson Study Conference 2002. (www.rbs.org/lesson_study/conference/2002/papers/wang.shtml)
Watanabe, T. (2002), “Learning from Japanese Lesson Study,” Educational Leadership, 59, 36-39.
Watanabe, T. (2003), “Lesson Study: A New Model of Collaboration,” Academic Exchange Quarterly, Winter, 180-184.
Paul Roback
Department of Mathematics, Statistics, and Computer Science
St. Olaf College
Northfield, MN 55057
U.S.A.
roback@stolaf.edu
Beth Chance
Department of Statistics
California Polytechnic State University
San Luis Obispo, CA 93407
U.S.A.
bchance@calpoly.edu
Julie Legler
Department of Mathematics, Statistics, and Computer Science
St. Olaf College
Northfield, MN 55057
U.S.A.
legler@stolaf.edu
Tom Moore
Department of Mathematics and Computer Science
Grinnell College
Grinnell, IA 50112-1690
U.S.A.
mooret@grinnell.edu