Learning About Assessment:

A Case Study of a Multimedia Language Program

Clara Román-Odio & Bradley Hartlaub

The Ohio5 Foreign Language & Technology Consortium

The Ohio5 group, like many language educators in this country, believes that multimedia-based approaches offer significant advantages over traditional teaching methods by accelerating and sustaining the process of language acquisition. Our success in procuring long-term support that will nurture new advances in the field depends on the extent to which we can articulate in quantitative and objective terms the pedagogical benefits provided by these new technologies.

The primary objective of this work is to describe the major steps involved in designing a comparative study that provides a rigorous and systematic evaluation of the use of multimedia in the language classroom. We hope this initiative will help others in the Ohio5 to plan and carry out studies that will yield new and reliable information about the impact that these technologies are having in language learning.

The work will address basic questions regarding methods of assessment in language teaching, including the use of statistical design and analysis, such as:

It will also discuss the importance of key aspects of experimental design including: sample size determination; eliminating bias; testing techniques to assess writing, oral, reading and listening skills, grammar and vocabulary; statistical significance; and examples of assessment instruments (quizzes, student evaluations, minute cards, etc.). The methods will be illustrated by a case study describing Román-Odio and Hartlaub’s assessment of Kenyon’s Spanish music program, Ingenio, and by a sample of successful studies intended to measure qualitative and quantitative effects of CMI (Computer Mediated Instruction) on language learning. The final section of the document will provide a summary of recommendations made by CALL (Computer Assisted Language Learning) researchers on assessment of educational technologies in the languages.

I. Where Do I Start?

As with any research situation, the very first step is to state as clearly as possible what you want to know and for what purpose. Without a clear statement of the research question it is difficult to collect useful information and make appropriate inferences. The following questions can help you draft a research statement:

Once the research statement is clear, steps can be taken to turn your research question into an evaluation plan. One way to begin expressing such a plan in concrete terms is by outlining specifications for a hypothesis test. Arthur Hughes suggests a useful framework for test construction, including the following categories:

A. Framework for Test Construction:

Research Statement: For example, I created a nifty multimedia-based program for my second and third year Spanish students. The program has . . . . The students learn . . . . I wonder whether this is an effective way to help students develop their . . . (speaking / writing / reading / listening / oral) ability. To answer this question, I would like to study the effects of . . . . (technology / methodology) on the development of the (speaking / writing / reading / listening) ability of my students.

Operations: The tasks that students have to be able to carry out. For a reading test, for example, these might include: the ability to scan a text to locate specific information, construe the meaning of complex passages, guess the meaning of words from context, identify the referents of pronouns, etc. For a speaking test it could be giving directions.

Types of text: For a reading test these might include passages from textbooks and newspapers.

For a writing test they could be letters, post cards, notes, descriptions, narrations, short essays.

Addressees: These are the people to whom the student is expected to be able to write or speak. For example, native speakers of the same age and peer group.

Topics: The subject areas reflect the kinds of written and spoken texts that are to be found in the teaching materials at various levels.

Format and Timing: This aspect should specify test structure and / or type of treatment, including time allocated to components and elicitation procedures, with examples. It should include how many passages will be presented (for reading or listening) or required (for writing), the number of items in each component and what weighting is to be assigned to each component. It should also include a description of the treatment applied to the experimental and the control groups.

Criteria to Determine Levels of Performance: This refers to the required level(s) of performance for different levels of success. It may consist of a simple statement to the effect that, to demonstrate mastery, 80 percent of the items must be answered correctly. It may also be more specific, using, for example, descriptors from the ACTFL (American Council for the Teaching of Foreign Languages) scales or from other methods of scoring such as the analytical method devised by John Anderson.

Scoring Procedures: Scoring procedures should be specified. There could be a detailed key, making scoring almost entirely objective, or there could be multiple scorers where scoring is subjective. According to Hughes, scoring will be valid and reliable if:

Sampling: For content validity, pre-testing and beneficial backwash (the effect of testing on teaching and learning) Hughes recommends the creation of several versions of the test. The best way to identify items that are faulty is through teamwork. Your colleagues can be an excellent resource to identify items in your assessment plan that need improvement. Critical questions that may be asked include:

In addition, one version of the test should be administered to native speakers with an educational background similar to that of your students; these native speakers should score 100 percent or close to it.

Pre-testing: Pre-testing can help you identify problems in administration and scoring. It is also essential for calculating the reliability coefficients of the test or treatment (Hughes 49-52).

Clearly, Hughes’ framework is theoretical, specifying only main facets of testing. In order to turn your research question into an evaluation plan it is essential to define the behavior or characteristics that could reasonably constitute the variables that you are interested in assessing. The Follow-up Activity sections of this work are intended as an opportunity to practice the process whereby an assessment question is operationalized. Take a few minutes to read them and jot down your ideas.

B. Follow-up Activity for Developing an Assessment Plan

As part of an assessment team effort you have been asked to write a draft of the research statement, operations and format-timing portions of the evaluation plan for the following situations:

1) A college Russian teacher has developed a series of digital movies, captioned with subtitles in the target language, for her first-semester students. The digital movies, which include authentic video clips, were created with QuickTime software. Students can view them with the captions or without. She wants to know what effect captioning has on listening-comprehension skills (Collentine & Sweeney, "An Introduction to Conducting CALL Research" CALICO 99 Workshop, June 2, 1999).

2) Our French colleagues asked their students to dialogue in the target language with other classmates and other French classes via chatrooms. We would like to know if this is an effective way to help students develop their speaking abilities. How can we measure this effect? What type of treatment (conditions / methodologies) can we apply to demonstrate the efficacy of this technology-based teaching practice?

II. What Instruments am I Going to Use to Assess Student Learning?

This will depend on two factors:

We make the assumption that the best way to test, for example, students’ writing abilities is to ask them to write. Since the same applies for oral, reading and listening abilities, we will discuss testing techniques for each skill separately. The most important principle to keep in mind is that we want to set tasks that form a representative sample of the tasks that we expect our students to be able to perform. To identify and select a representative sample of such tasks we can use Hughes’ framework presented above. For example:

A. Writing:

To obtain samples that properly represent the student’s writing ability Hughes offers the following recommendations.

1) Set as many tasks as possible so that you offer students as many fresh starts as possible. The more scores you have for each student the more reliable the total score should be (81).

2) Test only writing and nothing else. You don’t want to test if the student is creative, imaginative, intelligent or has a good reason to hold an opinion. One way of reducing this risk is to make use of visual materials. For example:

3) Restrict students’ arguments to keep them from going far astray. This will make comparisons between students’ writing abilities easier.

4) Determine a method to obtain reliable scoring for writing. Scoring can be holistic (based on an overall impression or on descriptors of existing scales such as the ACTFL scale for writing) or analytical, requiring a separate score for each of a number of aspects of a task (81-97).

5) Follow-up Activity for Assessment of Writing Abilities

  1. Design a writing exercise that can be performed both in the classroom and in the computer room. Following the advice given above, develop an assessment plan to determine the effect of CMI on your students’ writing skills. You may ask, for example:

B. Oral Ability

Oral ability, or the ability to interact successfully in the target language, requires speech comprehension and production. To test oral ability, as for writing, we want to choose tasks that form a representative sample of the tasks that we expect our students to be able to perform. To specify such tasks we can make use of the ACTFL rating scales for oral proficiency. Thus, if we test at the intermediate level, our test operations may include the language functions presented in the ACTFL description of the Intermediate-Low / Mid / Advanced ability. Once the operations or tasks are established we can select a format and elicitation techniques. For example:

1) Format:

2) Elicitation Techniques:

3) Follow-up Activity for Assessment of Oral Abilities

a) Using passages (oral interactions) based on a captioned digital movie, construct a series of tasks to predict oral ability. You may use, for example, cloze passages, short answer questions, or narrations. Repeat the exercise using the same passages, but presented on a tape recording. How can you demonstrate whether the technical variables (captioned digital movie vs. audio tape) have a different effect on the oral ability of your students? Write a draft of the research statement, operations and format-timing portions of an assessment plan to answer this question.

C. Reading Ability

1) In order to select a representative sample of the reading abilities that you want to test, it is necessary to specify them as accurately as possible. What are the macro-skills directly related to reading abilities?

2) Underlying these are other micro-skills including:

3) The types of texts you select (e.g. newspaper editorial, letter, poem, timetable, 2,000 words from a novel or an essay) will depend on the macro-skill you want to test. For example, to test scanning you will need passages which contain plenty of discrete pieces of information. Detailed reading, on the other hand, can be tested using a few sentences. Assuming that what you want to test is only reading ability, Hughes recommends choosing texts which students have not read and which are not too culturally specific. Once the texts and tasks have been selected you can choose elicitation techniques, such as:

4) Elicitation Techniques:

Keep in mind that the elicitation techniques should make minimal demands on the writing abilities of your students. As in a listening test, errors of grammar, spelling or punctuation should not be penalized, since doing so will make the measurement of reading ability less accurate.

5) Follow-up Activity for Assessment of Reading Abilities

a) Can you think of one or two ways by which you could measure the effect that CMI may have on the development of your students’ reading abilities? Are there software mechanisms which can effectively promote reading abilities? Design a series of computer-based tasks to respond to this inquiry. You may consider, for example, the following alternatives:

How would you incorporate these features into your research question?

D. Listening

Because listening (like reading) is a receptive skill, it parallels in many ways the testing of reading. As with reading, we should first specify the macro and micro skills directly related to listening abilities, including:

1) Macro-Skills:

2) Micro-Skills:

3) The types of texts you select will depend, once again, on the macro-skill you want to test. For example, if you are interested in how students handle language intended for native speakers, then you should use samples from authentic speech (e.g. radio broadcast, T.V. news, dialogue from a soap opera or from a multi-participant talk show). If, on the other hand, you want to test the students’ abilities to follow an academic lecture, you may need a native speaker reading from a longer text. The elicitation techniques will also depend on the objective of the test. For example:

4) Elicitation Techniques:

5) Follow-up Activity for Assessment of Listening Abilities

a) The Spanish faculty members at your institution are using a new video program in their beginning Spanish classes which includes traditional and digital video (QuickTime). They would like to determine whether digital video is more efficient than traditional video technology in developing students’ listening skills. Following the advice offered above, design a series of tasks to respond to this question.

E. Grammar and Vocabulary

While grammar and vocabulary are essential to the development and demonstration of communicative skills, they are rarely regarded as ends in themselves (Hughes 150). Most of our final exams, however, test these underlying abilities in some way. Ideally, in testing grammar, students should supply the appropriate grammatical structures through the use of paraphrase, completion, and modified cloze. Regarding vocabulary, if it has been consciously taught, then all the words presented should be included in the specifications. Otherwise, words can be grouped according to their frequency and usefulness. Definitions, synonyms, gap filling, and matching words with pictures constitute the most frequently used testing techniques for vocabulary. In this area it would be useful to investigate whether computer-based instruction of grammar (encouraging form-focus) and of vocabulary is more effective than teacher-based instruction.

III. Experimental Design

The most useful resource we have found for providing a starting point for any researcher designing a new experiment is "A Checklist for Planning Experiments" in Dean and Voss (1999) (Appendix A).

A. Blocking, Randomization and Replication

While we are certainly not intending to write a complete introduction to design and analysis of experiments in this document, we would like to briefly mention three primary techniques that are fundamental to experimental design: blocking, randomization and replication.

1) Blocking is when you purposely design your study to reduce the variability due to a known source of variation (called a block). For example, many agricultural studies use fields as blocks because it is well known that soil type and fertility vary from location to location. To try to reduce this source of variation the treatments (perhaps fertilizers) are randomly applied within each field, which has similar soil type and fertility.

2) Randomization: when we say "randomly" in design we do not mean haphazardly or with no apparent structure; we mean the deliberate use of chance in a planned way. While this step may seem unnecessary and time consuming, it is the most important and most overlooked step in the entire process. The primary reason for using randomization is to reduce bias. Although there are many types of bias, two of the most common are researcher bias and systematic bias. Researcher bias refers to the bias that results from a researcher using their personal opinions to influence, either deliberately or unintentionally, any aspect of the experiment. Systematic bias refers to a bias that often results from the order in which the treatments are applied. For example, a researcher may decide to conduct all trials for treatment 1 and then all trials for treatment 2. However, if the measurements are collected by laboratory technicians, the technicians may improve as they continue to apply the experimental procedure or they may tire from repeated application of the same task. Either way, any comparisons of the responses from treatment 1 with those from treatment 2 are confounded (mixed up) with the order in which the observations were collected. (Refer to Appendix B for an example).

3) The primary importance of replication is to ensure precision. Replication refers to the fact that many experiments are repeated under the same conditions on different experimental units so we can estimate the variation within a set of experimental conditions and also compare the variability across differing experimental conditions. Both sources of variation (variation due to error and variation due to the treatments) need to be estimated for a complete analysis. It is important to distinguish replication from repeated measures. Replication refers to settings where independent pieces of information will be collected each time the experiment is conducted. Repeated measures refers to settings where multiple observations are recorded for the same experimental unit (e.g., recording student performance in several testing periods).
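The combination of blocking and randomization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the section names, student identifiers and the function name randomize_within_blocks are hypothetical, and class sections are treated as the blocks.

```python
import random

def randomize_within_blocks(blocks, treatments, seed=None):
    """Randomly assign treatments to units separately inside each block."""
    rng = random.Random(seed)  # a fixed seed makes the plan reproducible
    assignment = {}
    for block_name, units in blocks.items():
        # One treatment label per unit, cycling through the treatments
        labels = [treatments[i % len(treatments)] for i in range(len(units))]
        rng.shuffle(labels)  # the deliberate, planned use of chance
        for unit, label in zip(units, labels):
            assignment[unit] = (block_name, label)
    return assignment

# Hypothetical blocks: two class sections of four students each
blocks = {
    "section_A": ["s01", "s02", "s03", "s04"],
    "section_B": ["s05", "s06", "s07", "s08"],
}
plan = randomize_within_blocks(blocks, ["multimedia", "traditional"], seed=1)
for unit, (block, treatment) in sorted(plan.items()):
    print(unit, block, treatment)
```

Because the shuffle happens inside each block, every section contributes the same number of students to each treatment, so differences between sections cannot be mistaken for a treatment effect.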

B. Control and Experimental Groups

In all comparative experiments it is essential to have a comparison group, typically referred to as a control group. The control group is formed and treated the same as the experimental group, except these experimental units do not receive the treatment of interest. They may receive a placebo or some form of a standard treatment we are trying to compare with, but they do not receive the treatment of interest in the experiment. The idea is to form two identical groups that differ in only one way, the treatment of interest.

C. Matched Pair Design / Paired Replicates Design

While trying to form two groups of identical students sounds like a reasonable idea, it is not as easy as you might think. Therefore, rather than using a completely randomized design, it is sometimes preferable to use a matched pairs or paired replicates design. In a matched pairs design, pairs of subjects from the population are formed so that members of a given pair are as similar as possible with respect to one or more characteristics that might potentially affect the measurement to be investigated. Identical twins are commonly used for matched pairs designs on human subjects. In language studies we can form pairs of students by ranking them according to their language abilities. One member of each pair will be randomly assigned to the treatment group and the other to the control group. For a paired replicates design, subjects serve as their own controls in providing baseline measurements prior to being exposed to the experimental set of circumstances or treatment and then they are measured once again after such exposure. Differences between the pre and post (or before and after) observations are then used to assess the effect of the treatment.
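A minimal sketch of the matched pairs idea follows, assuming that a single pretest score is an adequate ranking of language ability. The student names, scores and the function name matched_pairs are hypothetical.

```python
import random

def matched_pairs(scores, seed=None):
    """Rank students by pretest score, pair neighbours in the ranking,
    and randomly send one member of each pair to each group."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get, reverse=True)
    treatment, control = [], []
    # Step through the ranking two at a time; with an odd number of
    # students the last one is left unassigned
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # chance decides who receives the treatment
        treatment.append(pair[0])
        control.append(pair[1])
    return treatment, control

# Hypothetical pretest scores
pretest = {"Ana": 91, "Ben": 88, "Cam": 75, "Dee": 73, "Eli": 60, "Fay": 58}
treatment, control = matched_pairs(pretest, seed=7)
print("treatment:", treatment)
print("control:  ", control)
```

In a real study the ranking would normally draw on more than one measure, but the random assignment within each pair would work the same way.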

D. Sample Size Determination

Two different methods are commonly used to determine the appropriate sample size or number of observations needed on each treatment. Both methods require some specific information that will often require a pilot study. For example, we must have an estimate for the amount of variability due to experimental error. How much variation do we expect to see in the exam scores? We must also decide how confident the researcher wants to be in their results. Since it is impossible to be 100% certain of the outcome in a random experiment, we must be willing to accept some chance of error. Most journals require 95% confidence, or a 5% error rate, to publish manuscripts. However, it is also reasonable to require 96, 98, or 99% confidence in certain situations. The third piece of information required for sample size determination problems is some idea of what size of difference is practically meaningful in the given situation. For example, if we are giving a proficiency test and scores in the 80s demonstrate a certain level of proficiency while scores in the 90s demonstrate an improved level of proficiency, then we may want to design our study to detect differences of size 10. For more algorithmic approaches to calculating sample sizes, see Dean and Voss (1999).

There are several important points to keep in mind when determining the sample size.

  1. The more confident you want to be, the more observations you will need.
  2. The algorithms used by statisticians for determining sample sizes will not include budget or time restrictions of the experiment.
  3. To reduce the sample sizes you may have to reduce the confidence level slightly, refine the experiment in order to reduce the amount of error variability, or allow for less precision (wider interval estimates).
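To see how the three pieces of information interact, the sketch below uses a common normal-approximation formula for comparing two group means, n = 2((z_alpha + z_power) * sigma / delta)^2 per group. This is not necessarily the algorithm given in Dean and Voss; the power setting of 0.80, the standard deviation of 12 points and the function name n_per_group are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, confidence=0.95, power=0.80):
    """Approximate observations needed per group to detect a difference
    `delta` between two means when scores have standard deviation `sigma`.
    `power` (chance of detecting a real difference) is an assumed input."""
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sigma / delta) ** 2
    return math.ceil(n)  # round up to a whole student

# Hypothetical pilot estimate: exam scores vary with sigma = 12 points,
# and a 10-point difference is the smallest one of practical interest
for confidence in (0.95, 0.99):
    print(confidence, n_per_group(sigma=12, delta=10, confidence=confidence))
```

Raising the confidence level, increasing the error variability, or shrinking the difference to be detected each pushes the required sample size up, which matches the points listed above.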

IV. Statistical Data Analysis

A. Descriptive Statistics and Graphs

A thorough statistical analysis of your experiment should include descriptive statistics, graphical summaries, and formal decision rules (known as hypothesis tests) to test your hypotheses. Common descriptive statistics that are used to measure the center of a distribution are the mean and the median. The mean is the average of the observations and the median is the "middle" observation. We placed middle in quotes because the actual computation of the median depends on whether you have an even or odd number of observations. If you have some unusual observations in your data, then the median is the preferred measure of center.
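The difference between the two measures of center can be seen in a short Python sketch. The scores are hypothetical; the value 15 plays the role of an unusual observation.

```python
from statistics import mean, median

# Hypothetical exam scores; 15 is an unusual observation
scores = [72, 75, 78, 80, 81, 83, 15]

print("mean:  ", round(mean(scores), 2))   # pulled down by the unusual score
print("median:", median(scores))           # middle value of the sorted scores

# With an even number of observations the median is the average
# of the two middle values
even_scores = [72, 75, 78, 80]
print("median (even n):", median(even_scores))
```

The single low score lowers the mean noticeably while leaving the median where most of the students actually scored.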

Two common visual displays that can be used to provide a graphical summary of your data are histograms and boxplots. Histograms are formed by partitioning the range of the data into small intervals and finding out how many of the observations fall into each of these intervals. Bars are then erected over the intervals. The heights of the bars are frequencies, relative frequencies, or percentages. Boxplots are constructed from five descriptive statistics (minimum, 25th percentile, median, 75th percentile, maximum) and they are extremely useful for comparing two or more sets of observations, for example the scores of the control group and the scores of the treatment group.
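The numbers behind both displays can be computed directly, as in the following sketch. The scores and function names are hypothetical, and percentile conventions differ slightly between software packages; the "inclusive" method of Python's statistics module is used here as one reasonable choice.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(data):
    """Minimum, 25th percentile, median, 75th percentile, maximum."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def histogram_counts(data, width):
    """Count the observations falling in each interval of the given width,
    keyed by the lower edge of the interval."""
    counts = Counter((x // width) * width for x in data)
    return dict(sorted(counts.items()))

# Hypothetical scores for a control group
control = [55, 60, 62, 68, 70, 74, 79, 81, 90]
print(five_number_summary(control))
print(histogram_counts(control, width=10))
```

Computing the same summary for the treatment group and setting the two side by side gives the comparison that a pair of boxplots shows graphically.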

Although these exploratory data analysis techniques are extremely useful and should be included in any analysis, many journals require formal statistical inference for publication. The most commonly used vehicle for completing the inference is a hypothesis test. Moore and McCabe (1999) provide a complete introduction to hypothesis testing, but the major elements of a hypothesis test are summarized below.

B. Hypothesis Testing Definitions

1. The null hypothesis H0 is a statement we want to evaluate with the collected observations. This statement is usually given by specifying the value of one or more parameters and typically corresponds to the hypothesis of "no change" or "no effect" from a new procedure under investigation. (e.g. H0: µ = 10 or H0: mean of the control group = mean of treatment group)

2. The alternative hypothesis Ha corresponds to the claim that is being made in the problem of interest. This claim is almost always the primary reason for conducting the experiment in the first place. Often it is also what the experimenter hopes or strongly believes to be true. It is important that we completely specify the alternative to H0. (e.g. Ha: µ ≠ 10, Ha: µ > 10, or Ha: µ < 10; Ha: the mean of the control group is not equal to the mean of the treatment group, Ha: the mean of the control group is greater than the mean of the treatment group, or Ha: the mean of the control group is less than the mean of the treatment group)

3. A hypothesis test is a rule that, on the basis of appropriately collected sample data, leads to a decision whether or not to reject H0.

4. The significance level for a test is the probability of incorrectly rejecting H0. The significance level is usually denoted by α and is typically set at .01 or .05.

5. The null distribution of a test statistic S is the probability distribution (or sampling distribution) of S when H0 is true. This probability distribution tells us how S should behave if H0 is true.

6. The P-value of the test is the probability, computed under the assumption that H0 is true, that the test statistic will take a value at least as extreme as that actually observed.

A P-value conveys information about the strength of evidence against H0 and allows an individual decision maker to draw a conclusion at any specified significance level. The conclusion at any particular α level results from comparing the P-value to α:

If P-value ≤ α, then reject H0 in favor of Ha at level α,

or

If P-value > α, then do NOT reject H0 at level α.

C. Hypothesis Testing Procedure

1. State the null hypothesis H0 and the alternative hypothesis Ha.

2. Specify the significance level α.

3. Calculate the value of the test statistic.

4. Find the P-value for the observed data.
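The procedure can be followed step by step in a small sketch for a paired replicates design. Note the substitution: instead of a t-test, the sketch uses an exact sign-flip randomization test, chosen only because it needs no statistical tables. The differences, the one-sided alternative and the function names are hypothetical.

```python
from itertools import product

def sign_flip_p_value(differences):
    """Exact one-sided P-value for H0: no treatment effect, against
    Ha: the mean post-minus-pre difference is greater than zero.
    Under H0 each difference is equally likely to be positive or negative."""
    observed = sum(differences)  # test statistic: total improvement
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        if sum(s * d for s, d in zip(signs, differences)) >= observed:
            as_extreme += 1
    return as_extreme / total

def decide(p_value, alpha=0.05):
    """Compare the P-value to the significance level."""
    return "reject H0" if p_value <= alpha else "do not reject H0"

# Hypothetical post-minus-pre score differences for eight students
differences = [2, 1, 3, 2, 1, 2, 4, 1]
p = sign_flip_p_value(differences)
print("P-value:", p, "->", decide(p, alpha=0.05))
```

The enumeration covers 2^n sign patterns, so this exact version is practical only for small samples; larger studies would use a t-test or a random sample of sign patterns.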

D. Eliminating Bias

This will involve ensuring that your assessment instruments are reliable and valid, using multiple scorers and providing a detailed scoring key at the outset of scoring.

1) Test Reliability

If a test measures consistently it is considered reliable. There are standard procedures to estimate a test’s reliability (Henning 80-87). The first requirement is to have two sets of scores for comparison. The most common method of obtaining such scores is the split-half method. This method involves one administration of one test that is split into two halves which are made equivalent through a careful matching of items (Henning 83). Each student is given two scores. If the two scores are strongly correlated the test may be considered reliable. It is possible to quantify the reliability of a test through a mathematical measure of similarity known as the reliability coefficient. Perfect agreement between two sets of scores will result in a reliability coefficient of one. Total lack of agreement will result in a zero. Thus, test reliability coefficients will range from 0 to 1. (Refer to Appendix E for procedure).
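A rough sketch of the split-half calculation is given below. It makes two assumptions that go beyond the description above: the halves are formed from odd and even items rather than by careful matching, and the half-test correlation is adjusted with the Spearman-Brown formula, a widely used correction for the fact that each half is only half as long as the full test. The item data and function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_scores(item_matrix):
    """Give each student two scores: one on odd items, one on even items."""
    first = [sum(row[0::2]) for row in item_matrix]
    second = [sum(row[1::2]) for row in item_matrix]
    return first, second

def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)

# Hypothetical results: one row per student, 1 = item answered correctly
items = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
first, second = split_half_scores(items)
r = pearson_r(first, second)
print("half-test correlation:", round(r, 3))
print("estimated reliability:", round(spearman_brown(r), 3))
```

With only five students and six items the estimate is very unstable; the example is meant to show the arithmetic, not a realistic test length.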

2) Test Validity

A test is said to be valid if it measures accurately what it is intended to measure (Hughes 22). How do we know that a test is really measuring what it is supposed to measure? Test specifications and representative samples of the functions that are meant to be covered (as shown in part II of this work) are essential to ensure content validity. Pre-testing can also help you determine the validity coefficient of a test (for a better understanding of the value of this procedure see Hughes 23-24). Finally, keep in mind that the best test can give invalid and unreliable results if it is not carefully administered. For a more in-depth understanding of the concept of validity see Chapelle’s Internal and External Validity Issues.

3) Scorer Reliability and Scoring Keys

Another aspect of eliminating bias is related to scoring. If no judgement is required on the part of the scorer, the scoring is objective. If judgement is required, multiple scorers are recommended. With the holistic method it is essential to identify benchmarks that typify key levels of ability. Only when there is an agreement on these benchmarks should scoring begin. Scorers should also specify acceptable answers and assign points for partially correct responses. (For a detailed explanation of holistic, analytical and objective scoring methods see Bailey 185-203).

V. A Case Study of a Multimedia Language Program

A. Objective

The purpose of this study was to investigate the effects of CD dictation and multimedia-based text on the listening and comprehension skills of 25 students enrolled in an introduction to Latin American literature and popular culture course. The study also inquired about the students’ response to using CD dictation and Ingenio Romance y Protesta en la Música Popular Hispana, a music-based multimedia program developed by the Spanish faculty at Kenyon College.

B. Research Background

Previous research on music and language acquisition indicates that music and language are closely related, both neurologically and developmentally (Crooswhite, 1996; Homburg, 1980; Levman, 1992; Zierer, 1985). Some researchers theorize that music and language follow similar developmental patterns in adjacent areas of the brain (Radocy & Boyle 174). But why use music in the foreign language classroom? Based on a study of students enrolled in a beginning French course at the University of Florida, Leutenegger and Mueller (1964) suggest that musical aptitudes--pitch, loudness, rhythm, time, timbre, and tonal memory--might be important in foreign language learning, with tonal memory the most significant predictor. Abrate (1983) lists other benefits that accrue through music, including the fact that songs:

Wilcox (1985) further proposes that music may aid language learners in acquiring the prosody (rhythm, tone and stress patterns) of the target language. This seems to be mediated by:

Music also has a therapeutic value. As Graham (1989) explains, music "can be used to establish a mood, to lessen anxiety, to encourage calmness, to ease loneliness and soothe irritability" (3). Furthermore, the language of emotions is commonly encoded in songs and this language can support the acquisition of the target culture. Once again, to cite Graham: "the vocabulary fluency people learn in song allows a fluency of emotion and a communication of their most sincere feelings" (3).

In view of this background, this study seeks to explore the potential benefits of using music via multimedia, by examining the relationship between aural / visual exploration of key cultural values and language learning. CALL researchers have indicated that the simultaneous presentation of language through multimedia can substantially improve student language achievement. Focusing on aural input, for example, Mann (1995) reports that hypermedia learning environments are particularly effective when text is accompanied by sound. Leow (1995) confirms this claim by showing that language learners are more likely to notice language structures when they are presented as aural input, or when text and aural stimuli are presented simultaneously. Research on the effect of captioning (where video, sound and text concur) supports a similar claim by demonstrating that the textual modeling of the captions has a significant impact on the comprehension and oral abilities of the students (Garza, 1991; Borras, 1995). In the area of colorization, investigators have shown that highlighting a target structure can enhance not only listening and reading skills but productive skills as well (Bell, 1984; Garza, 1996). The research question underlying this study is whether the simultaneous presentation of spoken (songs) and written (texts) language through multimedia has an effect on the listening and comprehension abilities of the student.

C. Method

Using each student as his / her own control, the study traces the development of the listening and comprehension skills of 25 Spanish students during twelve 1 ½ hour training periods. Songs were randomly assigned to 12 time slots. During periods 1-4 students received traditional instruction, which consisted of listening to and analyzing songs. During periods 5-12 CD dictation and the cultural contexts of the two musical genres being studied--salsa and nueva canción--were introduced using multimedia. Salsa was taught in periods 1, 2, 5, and 8 and nueva canción in periods 3, 4, 9 and 12. At the end of each period, students’ listening and comprehension skills were tested using a cloze exercise of song transcription and three short answer questions. All of the tests were administered in the same way each time and had the same format: 30 fill-in-the-blank items based on a song (10 nouns, 10 verbs, 10 adjectives or adverbs) and three short answer questions. CD dictation allows students to transcribe a song from a CD and to see the song text through the feedback feature of the program. Ingenio combines text, music, images, video and interactive exercises in the teaching of several genres of Hispanic popular music. The effect of the new instructional method was evaluated in periods 6, 8, 10 and 12, using the same type of cloze and short answer exercise employed during the first four periods of training. There were eight testing sessions (periods 1-4 and periods 6, 8, 10 and 12) and students did not know that the study was being conducted. By the end of the experimental period we had approximately 440 scores (16 scores for each of the 25 students). Formal hypothesis tests were used to analyze the data and a questionnaire was administered to gauge students’ response to using CD dictation and Ingenio in the course.

D. Results and Discussion

1) Listening

A formal statistical analysis of the data indicates that there is a significant increasing trend (t=8.23, p=.0001) in the listening scores for the first four sessions, where the traditional method was used, as well as for the last four sessions (t=8.44, p=.0001), where the new method was used. The average listening score in the last four sessions, however, is significantly higher than in the first four sessions (6.863 vs. 5.620). This suggests that the listening skills of students improved significantly with the introduction of multimedia and CD dictation. Nevertheless, a caveat that must be kept in mind when interpreting these results is the difficulty of discriminating time-dependent from multimedia-associated improvements. In a future experiment this limitation may be overcome by a randomized alternation of multimedia and traditional methods of instruction, or by randomly assigning students to two groups--with multimedia and without multimedia.
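The text does not say which procedure produced the reported t values. One common way to test for a linear trend is the t statistic for the slope of a least-squares line, sketched below as an illustration only. The function name `slope_t_statistic` and the scores are hypothetical and are not the study's data; the sketch also treats all observations as independent, which repeated scores from the same students would not be.

```python
from math import sqrt


def slope_t_statistic(x, y):
    """Return (slope, t) for the least-squares line y = a + b*x.

    t is the slope divided by its standard error and is compared
    against a t distribution with n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = sqrt(sse / (n - 2) / sxx)
    return slope, slope / standard_error


# Hypothetical scores for illustration only; NOT the study's data.
sessions = [1, 2, 3, 4]
scores = [1.0, 2.0, 3.0, 5.0]
slope, t = slope_t_statistic(sessions, scores)
print(f"slope = {slope:.2f}, t = {t:.2f}")
```

A large positive t indicates scores rising across sessions; the p-value would come from the t distribution with n - 2 degrees of freedom.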

In order to interpret our results it is essential to ask what the precise object of our testing is. In assessing students’ listening skills we are not testing what vocabulary they acquired, nor whether their pronunciation or their ability to distinguish between the letters c / z / s improved after listening to songs. Indeed, the only question that our assessment instrument or test can answer is whether CD dictation and multimedia have a significant effect on attentive listening and speed writing. In our testing, each song represents a fresh start for the students. Therefore, over time students learn to pay a higher level of attention to the song language. Only through a fine tuning of this micro-skill do they become able to transcribe fast enough to keep up with the song. Anyone who has attempted to transcribe a song written in a foreign language knows the time and the difficulty involved in this process. Our question is, then, can students truly develop attentive listening through multimedia? CD dictation, the training tool that we used, allows students to listen to, type and see the song language. The results of our study (the listening scores in the last four sessions are substantially higher than those of the first four sessions) suggest that by simultaneously engaging the aural and visual sensorial receptors of students, this training tool boosted the students’ ability to understand and transcribe songs. Previous research supports the claim that CALL applications can facilitate and advance various language skills simultaneously by engaging the speaker in integrative tasks (i.e. requiring listening, reading, speaking and writing abilities). In view of these findings, perhaps the most significant conclusion of the listening aspect of our study is that multimedia may help teachers to target areas of language skills in such a specific and concurrent way (i.e. attentive listening and speed writing) that students’ overall foreign language development is inevitably achieved.

2) Comprehension

Comparing the comprehension scores for the first four sessions with those in the last four sessions, we find that the average scores are significantly higher (8.385 vs. 8.944) in the last four sessions. There is also a significant increasing trend (t=4.55, p=.0001) in the scores during the first four sessions, while there is not a significant linear trend (t=1.64, p=.1032) in the scores during the last four sessions. Thus, the results indicate that, although overall students’ comprehension skills improved significantly, they reached a plateau during the last three testing periods of the experiment. In order to explain this outcome we need to take a closer look at the overall trend in the scores, for which we used a scatterplot smoothing method. The overall increasing trend is clearly visible for the listening scores in Figure 1. For the comprehension scores in Figure 2 we observe a significant increasing trend, culminating at test 5, and a subsequent plateau.

Figure 1: Scatterplot of Listening Scores with Smoothing

 

Figure 2: Scatterplot of Comprehension Scores with Smoothing

This can be explained as the result of at least two confounding factors related to the assessment instrument used to test comprehension. First, it is possible that in answering the same type of content-based question (analysis of theme, rhetorical elements and musical components of the song), the students learned how to respond in order to obtain a higher score. Second, the comprehension questions referred exclusively to the song being tested and, therefore, do not reflect students’ acquisition of overall comprehension and cultural competence obtained after introducing Ingenio. This aspect was tested in the course’s final exam but was not part of this experimental design and data analysis.

Boxplots were used to compare the distributions of the scores for the eight testing sessions. Figure 3 contains side-by-side boxplots of the listening test scores.

 

Figure 3: Boxplots of Listening Scores

Notice that the overall trend is increasing, but students did not do as well on tests 3 and 5. A careful examination of these two tests, as well as comments made by students, suggests that song #3 was particularly challenging due to drastic changes in pitch voice, rhythm and musical arrangement. The low scores of test #5 seem to be the result of a difficulty in dealing with new vocabulary. Figure 4 contains side-by-side boxplots of the comprehension scores.

 

Figure 4: Boxplots of Comprehension Scores

The comprehension scores tend to be higher than the listening scores, and therefore the pattern over time is not as clear. The use of other elicitation and testing techniques to assess comprehension (such as recording an oral statement from the students based on the song, identifying the order of events presented in the multimedia program, or a summary cloze about the evolution of a particular genre) would have yielded more detailed information about the effect of our multimedia program on students’ comprehension skills. As assessed in this experiment, however, students’ comprehension skills do not appear to be affected by the use of CD dictation and multimedia.

E. Conclusion

This study represents an initial step in an effort to understand the role that music, informed by hypermedia, might play in fostering students’ language development. The study supports the hypothesis that the simultaneous presentation of spoken and written language through multimedia can substantially improve the listening skills of the language learner. Compared to traditional training, the new instructional method, including CD dictation and multimedia, offered more opportunities for student exposure to the target language. Simultaneously engaging the aural and visual receptors of students had a positive effect on students’ language output as expressed in the listening scores of the last four testing sessions of the study. Moreover, in the open-ended items of the questionnaire students reported that the exploration of the songs’ cultural context stimulated their interest in Hispanic popular music and led them to rely more on music for language learning. The large majority of students enjoyed using CD dictation and Ingenio and felt that their ability to understand the target language from a CD had dramatically improved. Several students reported residual learning or the involuntary rehearsal of a song long after the music stopped, which supports Wilcox’s claim.

Contrary to our expectations, students’ comprehension skills did not appear to be affected by the myriad of stimuli and data provided by multimedia. This could have been the result of a deficiency in the assessment instrument used to test comprehension, in terms of the elicitation techniques used to test the skill. Due to the small sample size and the manifestation of some confounding variables these results need further empirical assessment.

One of the most challenging aspects of this type of study is the handling of a class as an experimental unit, while avoiding detractions from the learning experience. Conversely, trying to execute an experiment across different classes or sessions, using different teachers and materials in each, raises issues of confounding variables such as differences in teaching styles and staff, seasonal issues that affect performance, and class composition. To deal with these difficulties CALL researchers recommend focusing more specifically on finding what components of CALL lessons are effective, with what kind of lesson and for what kind of students (Miech, Nave & Mosteller 78). This perspective, coupled with a rigorous experimental design that includes quantitative and qualitative analysis, can turn our empirical assessment efforts into a fruitful source of information. (Other published studies on assessment of multimedia instruction can be found in Appendix C).

 

 

V. Summary of recommendations by CALL researchers

Appendix A

A Checklist for Planning Experiments

Define the objectives of the experiment.

Identify all sources of variation, including:

treatment factors and their levels,

experimental units,

blocking factors, noise factors, and covariates.

Choose a rule for assigning the experimental units to the treatments.

Specify the measurements to be made, the experimental procedure, and the anticipated difficulties.

Run a pilot experiment.

Outline the analysis.

Calculate the number of observations that need to be taken.

Review the above decisions. Revise, if necessary.

Appendix B

Randomization: An Example

Faculty members are evaluated by students in a variety of ways. Mandatory course and instructor evaluation forms, minute cards, informal word of mouth, reputation, and letter writing are just a few examples. Suppose the college policy at a particular institution requires the Provost to randomly select students from a faculty member’s classes to participate in the evaluation process. The Provost decides that 5 students are to be selected from an introductory class of 25 students. How should she randomly select these 5 students? One method is to put the 25 names on 25 slips of paper, put the 25 slips of paper in a bag, shuffle the contents, and then pick 5 slips. The names appearing on the selected slips will be the 5 students who are asked to participate in the evaluation. Another method relies on the use of random numbers. Each of the 25 students is assigned a random number using a random number generator. The list of names is then sorted according to the random numbers and the students with the 5 smallest numbers are asked to participate in the evaluation. Table 3.3 contains a list of 25 students and random numbers assigned using the random number generator in Minitab. Adam, Akilah, James, Jason, and Morgan would be the five students asked to participate in the faculty review.

Table 3.3. A Class List and Assigned Random Numbers.

Name Random Number

Christian 0.359234

Luke 0.940090

Molly 0.934925

JoAnne 0.809115

Elkinsette 0.787670

Akilah 0.154661

Joiel 0.873139

Christian 0.812036

Cheshe 0.923244

Patricia 0.340109

Elizabeth 0.446112

Sara 0.482329

Jacob 0.927142

Chris 0.754434

Jed 0.521955

Jessica 0.403488

Adam 0.102106

James 0.197304

Whitney 0.480334

Vanessa 0.900897

William 0.582211

Morgan 0.323056

Chad 0.968394

Nuntanit 0.485099

Jason 0.288986
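The second method can be sketched in a few lines of Python. This is an illustration, not the Minitab procedure used above: the function names `smallest_k` and `draw_sample` are hypothetical, and `draw_sample` relies on Python's own random number generator. The first call reuses the numbers printed in Table 3.3, so it reproduces the five names given in the text.

```python
import random

# Names and random numbers exactly as listed in Table 3.3.
TABLE_3_3 = [
    ("Christian", 0.359234), ("Luke", 0.940090), ("Molly", 0.934925),
    ("JoAnne", 0.809115), ("Elkinsette", 0.787670), ("Akilah", 0.154661),
    ("Joiel", 0.873139), ("Christian", 0.812036), ("Cheshe", 0.923244),
    ("Patricia", 0.340109), ("Elizabeth", 0.446112), ("Sara", 0.482329),
    ("Jacob", 0.927142), ("Chris", 0.754434), ("Jed", 0.521955),
    ("Jessica", 0.403488), ("Adam", 0.102106), ("James", 0.197304),
    ("Whitney", 0.480334), ("Vanessa", 0.900897), ("William", 0.582211),
    ("Morgan", 0.323056), ("Chad", 0.968394), ("Nuntanit", 0.485099),
    ("Jason", 0.288986),
]


def smallest_k(pairs, k):
    """Sort (name, number) pairs by number and return the k names
    with the smallest numbers."""
    ranked = sorted(pairs, key=lambda pair: pair[1])
    return [name for name, _ in ranked[:k]]


def draw_sample(names, k, seed=None):
    """Assign a fresh random number to each name, then keep the k smallest.

    The optional seed only makes a draw repeatable for checking.
    """
    rng = random.Random(seed)
    pairs = [(name, rng.random()) for name in names]
    return smallest_k(pairs, k)


selected = smallest_k(TABLE_3_3, 5)
print(selected)  # ['Adam', 'Akilah', 'James', 'Jason', 'Morgan']
```

For a new class list, `draw_sample(names, 5)` performs the same steps with freshly generated numbers.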

Table of Random Numbers

0.878819

0.549603

0.453608

0.609469

0.634254

0.330177

0.60724

0.222232

0.850828

0.029715

0.07934

0.934133

0.10173

0.601621

0.273403

0.100041

0.42852

0.741463

0.551085

0.352237

0.536098

0.025498

0.30029

0.827358

0.035664

0.312134

0.096739

0.64487

0.189481

0.982802

0.297564

0.729729

0.692749

0.388914

0.631802

0.459895

0.051262

0.306832

0.64062

0.803317

0.940491

0.499471

0.761715

0.496942

0.915528

0.056258

0.372853

0.600413

0.440219

0.695414

0.974192

0.247599

0.297069

0.809012

0.075612

0.735718

0.366852

0.879186

0.217656

0.048043

0.729668

0.355231

0.639844

0.516062

0.313893

0.019824

0.47391

0.774601

0.128501

0.566829

0.664449

0.654828

0.57704

0.866866

0.061258

0.126239

0.501144

0.581869

0.94424

0.063932

0.787398

0.46408

0.734462

0.604207

0.142745

0.053057

0.020121

0.302605

0.432813

0.285031

0.025101

0.571211

0.043746

0.435974

0.2799

0.825289

0.662033

0.185876

0.512332

0.791342

0.103254

0.865562

0.911864

0.47286

0.498155

0.476135

0.712345

0.545798

0.595233

0.681434

0.326138

0.634816

0.101723

0.444297

0.488589

0.137517

0.556241

0.36218

0.858581

0.415811

0.147372

0.077353

0.614026

0.606726

0.344891

0.526913

0.815674

0.520363

0.491291

0.161394

0.5525

0.533506

0.787208

0.46308

0.205291

0.255734

0.698992

0.890235

0.778869

0.842154

0.752982

0.056228

0.128288

0.301327

0.667771

0.358825

0.806762

0.306027

0.429014

0.669834

0.801728

0.546724

0.762362

0.910176

0.145774

0.559048

0.082452

0.85642

0.274657

0.930562

0.914642

0.838622

0.425688

0.081112

0.262177

0.944538

0.213022

0.377282

0.750432

0.285422

0.289305

0.074575

0.116353

0.380803

0.620498

0.062911

0.89095

0.84395

0.49802

0.66727

0.256996

0.724075

0.601206

0.430073

0.839108

0.743886

0.080198

0.983947

0.088531

0.879963

0.358892

0.405225

0.621738

0.685484

0.54854

0.36069

0.961591

0.926311

0.110459

0.290861

0.851023

0.123994

0.674273

0.966814

0.20949

0.882468

0.465821

0.422374

0.747957

0.356492

0.118989

0.959147

0.261494

0.190247

0.650699

0.044016

0.641095

0.362214

0.427302

0.957047

0.885416

0.628619

0.182553

0.077038

0.804266

0.775129

0.79665

0.922524

0.741767

0.230726

0.531467

0.282828

0.298985

0.619347

0.633199

0.878333

0.836959

0.388298

0.673041

0.456292

0.084696

0.48672

0.23152

0.037715

0.817859

0.531484

0.809632

0.255799

0.849658

0.201824

0.564028

0.124263

0.76125

0.510714

0.293818

0.832935

0.491989

0.196214

0.511139

0.31598

0.529629

0.823424

0.435863

0.031781

0.996003

0.90811

0.43741

0.244594

0.111258

0.109704

0.14203

0.478585

0.22853

0.370359

0.352245

0.294499

0.894256

0.92331

0.829151

0.222004

0.629117

0.914682

0.249719

0.255092

0.550339

0.616591

0.655875

0.507987

0.235199

0.297251

0.242903

0.394758

0.158394

0.757802

0.330833

0.394185

0.574149

0.981972

0.436377

0.145997

 

Appendix C

Samples of Published Studies on Assessment of Multimedia Instruction

Questionnaires and minute cards are typical assessment instruments which are used to gauge what students and teachers believe to be advantages and disadvantages of using computer-based technology for the teaching and learning of foreign languages (for examples see Appendix D). More important, however, is to get a feel for a good experimental design. For this reason we have selected samples of successful studies intended to measure qualitative and quantitative effects of CMI on language learning. We hope that these samples will give you new insights into possible ways of assessing the impact of technology in your language classes. The selection criteria used in this sampling include: type of technology and treatment employed, methods for data collection and analysis, results, and date of publication.

Objective. The purpose of this study was to determine what qualitative and quantitative differences exist between discussion via networked computer and normal oral class discussion, in terms of student participation and quality of expression. It also inquired about teachers’ and students’ response to using Daedalus InterChange.

Method. The study compared the quantity and characteristics of discourse produced by two groups (40 students total) of second semester French during an InterChange session and an oral class discussion on the same topic. Three types of data were collected: a) transcripts of students’ writing during the fifth InterChange session (on Friday); b) transcriptions of students’ productions during the fifth oral discussion session (on Monday); c) students’ and teachers’ responses to questionnaires to assess the technology. Both oral and InterChange transcripts were coded for discourse functions (greetings, assertions, questions, commands, self corrections), verbal tense and mood, syntactic characteristics (coordination, subordination, negation, comparative and superlative structures, relative pronouns), length of turns, and students’ use of English. The unit of analysis was the clause, except for the tabulation of indicative mood, which was done by sentences. T-units were used to segment oral discourse. Reliability of coding was assessed by two specially trained raters. Due to the sample size and the descriptive nature of the study, formal statistical analysis was not considered appropriate and hence generalizations to other populations should be made cautiously.

Results. The study reported that students had over twice as many turns, produced two to four times more sentences, and used a much greater variety of discourse functions when working in InterChange than they did in their oral discussion. The distribution and direction of turns were different in the two conditions, with much more direct student-to-student exchange in the InterChange condition. Both students and instructors responded favorably to using InterChange although students were more enthusiastic than instructors. Features of InterChange that may be unsettling for teachers include: less attention to grammatical accuracy and less coherence and continuity of discussions.

Objective. The purpose of this study was to investigate the effect of captions on the listening comprehension and oral abilities of 70 students enrolled in an intermediate / advanced English as a Second Language (ESL) course and 40 students enrolled in an advanced Russian course.

Method. Within each language, students were randomly assigned to two groups--with captions and without captions. Both groups attended a one-hour testing session where they viewed five "authentic" video segments. Students in the experimental group watched the video segments with captions, while students in the control group watched the same video segments without captions. For each segment, both groups answered 10 multiple-choice questions written in the target language. Students were instructed to mark only answers of which they were certain and to leave others blank. At the end of each testing session, five students were randomly selected for five-minute interviews, in which they were asked to retell one video segment of their choosing, keeping as close as possible to the original language of the segment. The interviews were tape-recorded.

Results. Students who watched the segments with captions had a mean gain of 75 percent in correct answers, a mean decrease of 61 percent in incorrect answers, and a mean decrease of 84 percent in unanswered questions over students who watched the video segments without captions. Average gains in correct answers were higher for Russian students (90 percent) than for ESL students (60 percent). In the interviews, students who saw captioned segments consistently demonstrated greater ability to recall language of the video than students who did not see captions. This study supports the claim that "captioning can substantially improve student comprehension" (Miech, Nave & Mosteller 65) by engaging both the aural and visual sensory receptors of the students.

Objective. The purpose of this study was to investigate the differential effect of two types of CALL feedback --conventional and "intelligent" feedback-- on 34 college students enrolled in an intermediate Japanese course.

Method. Students were randomly assigned to two groups. One group used a CALL program that provided conventional feedback on a lesson about the construction of passive sentences, and the other group used an "intelligent" CALL program that provided detailed error analysis on the same subject. The CALL program offering conventional feedback gave information in English about what was wrong with the student’s answers and compared the student response with the correct answer. The "intelligent" CALL program explained in English why a response was incorrect, offering a detailed grammatical explanation about the errors. Students spent four hours studying their respective CALL lessons and did not know that the comparison of the two types of feedback was being conducted. Shortly after the last CALL session students were evaluated using a 20-question achievement test on passive sentence construction. Students were also assessed three weeks later on the final exam using four questions pertaining to passive structures.

Results. The study reported that students in the "intelligent" CALL group significantly outscored the students in the "conventional" CALL group on both the 20-question achievement test and the final exam questions on passive sentence construction. Nagata concluded that the "intelligent" CALL feedback, "which explained the functions and semantic relations of nominal phrases in the sentences, was especially helpful to them for understanding the concepts of the particles and passive structures" (337).

 

Objective. The purpose of this study was to investigate the effect of two Internet-based technologies--on-line chatrooms and on-line newspapers--on the acquisition of cultural knowledge and the writing and oral abilities of 62 students enrolled in two advanced Spanish courses.

Method. The study compared oral and written abilities of students before and after using Internet-based activities during the fifth semester of an advanced Spanish course. Three types of data were collected: a) students’ portfolios containing writing samples from on-line chatroom sessions, one page journal entries based on newspaper readings and a final written report; b) two surveys to assess students’ background, attitudes, and experience with the Internet (before and after introducing this technology); and c) two oral proficiency tests given at the beginning and the end of the semester. Students were asked to engage in intercultural exchanges via chatrooms and to read on-line Hispanic newspapers on a weekly basis. On-line discussions and students’ responses to press articles were evaluated using holistic portfolio assessment. Oral abilities were assessed through an oral proficiency exit interview. Besides the survey, there was no formal assessment of students’ acquisition of cultural knowledge.

Results. The study reported that most students found the on-line activities "to be helpful in enhancing their development of language skills and their understanding of Hispanic culture and people" (112). Interviews with students indicated that the use of synchronous on-line chatroom sessions created a less stressful environment for foreign language learning than in-class discussions. On-line writing also had a positive effect on oral ability. Most of the students attained a higher level of oral proficiency, progressing from an Intermediate-Mid to an Intermediate-High. This result led to the conclusion that "on-line chatroom not only helped students write better but also enabled them to speak more fluently" (115). The researcher and the students reported that on-line newspaper activities increased students’ cultural knowledge and reading skills. This claim, however, was not supported by formal assessment. Overall the study supports the claim that "the use of on-line newspapers and on-line chatrooms heightened students’ interest and motivation for learning the foreign language and the foreign culture in a dynamic rather than in a passive way" (116).

Objective. The purpose of this study was to investigate whether on-line help is useful, how often learners tend to use it, and what kind of help is crucial for "ineffective" as well as for "effective" language learners.

Method. The study compared the use of on-line help by two groups of EFL college learners--"ineffective" and "effective"--while using an interactive videodisc program (IVD) designed to strengthen listening comprehension skills. Ten first-year "ineffective" learners and ten second-year "effective" learners were selected, using TOEFL scores, direct observation of students’ abilities, and other course records as criteria for selection. While working with an interactive video unit, students had access to two functions to control the videodisc player (pause and rewind) and to eight types of on-line help:

To assess listening comprehension and to elicit self report data, 54 comprehension questions were displayed at different points during the playing of the video. Students were asked to "think aloud" in Chinese and answer questions such as, "What are you thinking?" "What don’t you understand?" "How did you know the answer?" (87). Their answers were recorded and transcribed for analysis. The program’s tracker recorded students’ names, ID numbers, their start time, frequency of access to specific on-line help, and the starting and ending frame numbers of videodisc segments in order to identify the video portions where help was requested. To elicit students’ attitudes towards the material, the researcher distributed a questionnaire at the end of the experimental period. Descriptive statistical methods were used to analyze the amount of time students spent on task, the kind of help requested, and their listening comprehension scores. T-tests and correlation procedures were used to compare frequency of strategy use between the two groups of students.

Results. The study reported that the members of the "effective" group completed a total of 397 comprehension questions, while the members of the "ineffective" group completed a total of 294 questions. Overall, the "effective" learners requested less than half the help requested by the "ineffective" learners. Members of the "ineffective" group used almost exclusively one type of on-line help for each comprehension question. The functions most frequently used were the video controller rewind function and the English and the Chinese scripts. T-test procedures showed no significant differences between the groups’ time on task or comprehension scores. The results suggest that "effective" and "ineffective" learners use similar types of help and frequency of help but different amounts of help. As for time on task and comprehension of the video, the learners do not vary greatly from each other. Qualitative analysis shows that "effective" and "ineffective" learners use learning strategies differently. For example, "effective" learners attend to more audio and video clues at the same time, use various strategies with more flexibility, use both personal experience and linguistic context to determine the meaning of the utterance, and think aloud systematically. Responses to the questionnaire items show that students felt positive about the design of the courseware. The most significant implication that can be drawn from this study is that some of the help features are seldom used. Hence, the investigator asks "whether learners really use optimal learning strategies or whether CALL developers implement more help than needed" (94).

Appendix D

Questionnaires and minute cards

A. Questionnaire used in the Ingenio study

Multimedia and Web-based Materials Evaluation Form

Students: The MLL department is interested in the role that multimedia and web-based materials have played in your education. Based on your experience in this course, please answer the following questions.

Circle the symbol that best describes your reaction to each of the following statements concerning the course or the instructor.

(SA = Strongly Agree, A = Agree, N = Neutral, D = Disagree, SD = Strongly Disagree)

1. The utilization of multimedia technology was beneficial for the development of my language skills, that is:

    a. Reading SA A N D SD

    b. Writing SA A N D SD

    c. Cultural understanding SA A N D SD

    d. Oral ability SA A N D SD

    e. Listening ability SA A N D SD

2. Learning the cultural context presented in our music program Ingenio improved my language skills.

    SA A N D SD

3. CD dictation improved my ability to listen to, and understand a song.

    SA A N D SD

4. Multimedia approach to language acquisition enhanced my learning of the language more than traditional approaches.

    SA A N D SD

5. The use of technology on this course sparked my interest in this field of study (Spanish language, culture, and literature).

    SA A N D SD

6. Please explain how using multimedia and web-based materials affected the development of your language skills.

7. What did you find most valuable about using multimedia in this course?

8. What did you find least valuable about using multimedia in this course.

 

Circle the letter that best describes your overall evaluation of the technologies used in class.

(E = Excellent, G = Good, F = Fair, P = Poor, D = Dismal)

 

Overall evaluation E G F P D

 

B. Minute card (example)

Math 6: Minute Card for Lab #1 (provided by Prof. Hartlaub)

1. Did you find the lab to be useful? Yes or No

    A. If yes, what was useful?

    B. If no, what was missing that if it would have been present you would have found to be a useful exercise?

2. Are you having problems with Minitab? Yes or No

    A. If yes, what problems are you having?

3. Please add any other comments or suggestions about the class or the instructor.

4. Thank you!

 

 

 

Appendix E

Computing the Reliability Coefficient

The listening scores for a group of 25 students are shown below. An asterisk denotes a missing score (the student did not take the test for some reason). Students with a missing score cannot be used to calculate the reliability coefficient because a pair of scores is needed for each student.

Data Display

Row List1 List2

1 6.0 6.0

2 4.4 5.0

3 5.0 6.0

4 3.6 5.0

5 4.0 4.0

6 5.0 *

7 6.0 7.5

8 5.0 7.0

9 6.0 7.0

10 4.0 4.0

11 3.5 5.0

12 5.0 6.0

13 3.0 *

14 4.0 5.0

15 6.0 7.0

16 4.0 6.0

17 4.4 6.0

18 6.0 6.0

19 * 6.0

20 6.0 6.0

21 6.0 *

22 7.0 7.0

23 6.0 6.5

24 7.0 7.0

25 6.0 7.0

The value of the correlation coefficient for the 21 students with both scores was computed with Minitab, a statistical software package, and the results are shown below.

Correlations (Pearson)

Correlation of List1 and List2 = 0.786

To find the reliability coefficient for these scores, we simply multiply the value of the correlation coefficient, r, by 2 and then divide by (1+r). Thus, the value of the reliability coefficient for the two sets of listening scores is 2(0.786) / (1 + 0.786) = 0.880.
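The same calculation can be checked without Minitab. The sketch below is illustrative: it enters the two score lists from the data display (with `None` standing in for an asterisk), drops students who lack a pair of scores, computes the Pearson correlation, and applies the 2r / (1 + r) adjustment described above, commonly known as the Spearman-Brown formula. The function names are hypothetical.

```python
from math import sqrt

# Scores from the data display; None marks a missing score (*).
LIST1 = [6.0, 4.4, 5.0, 3.6, 4.0, 5.0, 6.0, 5.0, 6.0, 4.0, 3.5, 5.0, 3.0,
         4.0, 6.0, 4.0, 4.4, 6.0, None, 6.0, 6.0, 7.0, 6.0, 7.0, 6.0]
LIST2 = [6.0, 5.0, 6.0, 5.0, 4.0, None, 7.5, 7.0, 7.0, 4.0, 5.0, 6.0, None,
         5.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, None, 7.0, 6.5, 7.0, 7.0]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def reliability_coefficient(r):
    """Apply the 2r / (1 + r) adjustment to a correlation r."""
    return 2 * r / (1 + r)


# Keep only students who have both scores.
pairs = [(a, b) for a, b in zip(LIST1, LIST2) if a is not None and b is not None]
r = pearson_r([a for a, _ in pairs], [b for _, b in pairs])
reliability = reliability_coefficient(r)
print(f"n = {len(pairs)}, r = {r:.3f}, reliability = {reliability:.3f}")
```

Run on these data, the sketch keeps 21 pairs and gives r = 0.786, matching the Minitab output, and a reliability coefficient of 0.880.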

 

Useful CALL / Language Assessment and Statistical Resources

A. On CALL & Language Assessment

Abrate, Jayne Halsne. "Pedagogical Applications of the French Popular Song in the Foreign Language Classroom." The Modern Language Journal 67 (1983): 8-12.

Alderson, J.C., Clapham, C. & Wall, D. Test Specification, Language Test Construction and Evaluation. Cambridge: Cambridge UP, 1995.

Bachman, Lyle. "What Does Language Testing Have to Offer?" TESOL Quarterly 25.4 (1991):

- - -. Fundamental Considerations in Language Testing. Oxford: Oxford UP, 1990.

Bailey, K. Learning About Language Assessment. Pacific Grove, CA: Heinle & Heinle, 1998.

Bell, J.M. "The ColorSounds Story." ColorSound Monthly 1 (1984): 1-2.

Berge, Z. & M. Collins. Computer Mediated Communication and the Online Classroom in Distance Learning. Cresskill, N.J: Hampton P., 1995.

Borras, Isabel. "Effects of Multimedia Courseware Subtitling on the Speaking Performance of College Students of French." The Modern Language Journal 78 (1994): 61-75.

Brown, J.D. & Hudson, T. "The Alternatives in Language Assessment." TESOL Quarterly 32.4 (1998): 653-676.

Chapelle, C. & J. Jamieson. "Computer Assisted Language Learning as a Predictor of Success in Acquiring English as a Second Language." TESOL Quarterly 20 (1986): 27-46.

- - -. "Research Trends in Computer-Assisted Language Learning." Teaching Languages with Computers. Ed. M. Pennington. La Jolla, CA: Athelstan, 1989. 49-59.

- - -. "Internal and External Validity Issues in Research on CALL Effectiveness." Computer-

Assisted Language Learning and Testing. Ed. P. Dunkel. Rowley, MA, 1991. (37-57).

- - -. "A framework for the investigation of CÆLL as a Context fo SLA." CALL Journal 6.3 (1995): 2-8.

Chun, D. "Using Computer Networking to Facilitate the Acquisition of Interactive Competence." System 22.1 (1994): 17-31.

Collentine, J. "Cognitive Principles and CALL Grammar Instruction: A Mind-Centered, Input Approach." CALICO Journal 15 (1998): 1-18.

Crosswhite, Jeanette. "Effect of Music on Language Development of Preschool Children." Ph.D. Thesis. University of North Carolina at Greensboro, 1996.

Emerson, John D. and Frederick Mosteller. "Interactive Multimedia in College Teaching. Part I: A Ten Year Review of Reviews." Educational Media and Technology Yearbook 1998. Ed. R.M. Branch. Englewood, CO: Libraries Unlimited 23 (1998): 43-58.

- - -. "Interactive Multimedia in College Teaching. Part II: Lessons from Research in the Sciences." Educational Media and Technology Yearbook 1998. Ed. R.M. Branch. Englewood, CO: Libraries Unlimited 23 (1998): 59-74.

Garrett, N. "ICALL and Second Language Acquisition." Intelligent Language Tutors: Theory Shaping Technology. Eds. V.M. Holland, J. Kaplan and M. Sams. Mahwah, New Jersey: Lawrence Erlbaum, 1995.

Garza, Thomas. "Evaluating the Use of Captioned Video Materials in Advanced Foreign Language Learning." Foreign Language Annals 24.3 (1991): 239-57.

- - -. "The Message is the Medium: Using Video Materials to Facilitate Foreign Language Performance." Texas Papers in Foreign Language Education 2 (1996): 1-18.

Graham, David B. "Using Audiotapes for Instruction and Assessment: La Music C’Est Quelque Chose de Magique." Language Association Bulletin 40.3 (1989): 3-6, 27.

Henning, Grant. A Guide to Language Testing: Development, Evaluation and Research. Boston: Heinle & Heinle, 1987.

Higgins, J. Language, Learners and Computers: Human Intelligence and Artificial Unintelligence. Singapore: Longman, 1988.

Homburg, Taco Justus. "A Comparison Between the Acquisition of Music and of Language." Papers from the 1979 Mid-America Linguistics Conference 2.3 (1980): 24-31.

Hughes, Arthur. Testing for Language Teachers. Cambridge: Cambridge UP, 1989.

Kelm, O. "The Use of Synchronous Computer Networks in Second Language Instruction: A Preliminary Report." Foreign Language Annals 25.5 (1992): 441-454.

- - -. "E-mail Discussion Groups in Foreign Language Education: Grammar Follow-up." Telecollaboration in Foreign Language Learning: Proceedings of the Hawaii Symposium. Ed. M. Warschauer. Honolulu, HI: U of Hawaii, Second Language Teaching and Curriculum Center, 1995.

Kenning, M. M. & Kenning, M.J. Computers and Language Learning: Current Theory and Practice. New York: Ellis Horwood, 1990.

Kern, R. "Restructuring Classroom Interactions with Networked Computers: Effects on Quantity and Quality of Language Production." Modern Language Journal 79.4 (1995): 457-476.

Kleinmann, H.H. "The Effects of Computer-Assisted Instruction on ESL Reading Achievement." Modern Language Journal 71 (1987): 267-276.

Lee, L. "Going Beyond the Classroom Learning: Acquiring Cultural Knowledge Via On-line Newspapers and Intercultural Exchanges Via On-Line Chat Rooms." CALICO Journal 16 (1998): 101-120.

Leow, R. "Modality and Intake in Second Language Acquisition." Studies in Second Language Acquisition 17 (1995): 79-90.

Levman, Bryan G. "The Genesis of Music and Language." Ethnomusicology: Journal of the Society For Ethnomusicology 36.2 (1992): 147-170.

Liou, H. "Research on On-Line Help as Learner Strategies for Multimedia CALL Evaluation." CALICO Journal 14 (1997): 81-98.

Leutenegger, R.R. and T.H. Mueller. "Auditory Factors and the Acquisition of French Language Mastery." Modern Language Journal 47 (1964): 141-46.

Mann, B. "Focusing Attention with Temporal Sound." Journal of Research on Computing in Education 27 (1995): 402-424.

Miech, Edward J., Bill Nave & Frederick Mosteller. "On CALL: A Review of Computer-Assisted Language Learning in U.S. Colleges and Universities." Educational Media and Technology Yearbook 1997. Ed. By Robe. Vol 22. Englewood, CO: Libraries Unlimited, Inc.

Moeller, A.J. "Moving from Instruction to Learning with Technology: Where’s the Content?" CALICO Journal 14 (1997): 5-14.

Nagata, Noriko. "Intelligent Computer Feedback for Second Language Instruction." Modern Language Journal 77.3 (1993): 330-39.

Nutta, J. "Is Computer-Based Grammar Instruction as Effective as Teacher-Directed Grammar Instruction for Teaching L2 Structures?" CALICO Journal 16 (1998): 49-62.

Oxford, R.L. Language Learning Strategies: What Every Teacher Should Know. Boston: Heinle & Heinle, 1990.

- - -. "Linking Theory of Learning with Intelligent Computer Assisted Language Learning." Intelligent Language Tutors: Theory Shaping Technology. Eds. V.M. Holland, J. Kaplan and M. Sams. Mahwah, New Jersey: Lawrence Erlbaum, 1995.

Price, K. "The Use of Technology: Varying the Medium in Language Teaching." Interactive Language Teaching. Ed. W. Rivers. New York: Cambridge UP, 1987.

Radocy & Boyle. Psychological Foundations of Musical Behavior. Springfield, IL: Charles C. Thomas, 1979.

Reeves, T. "Pseudoscience in Computer Based Instruction: The Case of Learner Control Research." Journal of Computer Based Instruction 20 (1993): 39-46.

Robinson, P. "Generalizability and Automaticity of Second Language Learning Under Implicit, Incidental, Enhanced and Instructed Conditions." Studies in Second Language Acquisition 19 (1997): 223-48.

Salaberry, M.R. "A Theoretical Foundation for the Development of Pedagogical Tasks in Computer Mediated Communication." CALICO Journal 14 (1996): 5-35.

Sells, P., S.M. Shieber & T. Wasow, eds. Foundational Issues in Natural Language Processing. Cambridge, MA: MIT Press, 1991.

Stenson, N., B. Downing, J. Smith. "The Effectiveness of Computer-Assisted Pronunciation Training." CALICO Journal 9.4 (1992): 5-19.

Turner, J. "Assessing Speaking." Annual Review of Applied Linguistics 18 (1998): 192-207.

- - -. "Creating Content-Based Language Tests: Guidelines for Teachers." The Content-Based Classroom. Eds. M.A. Snow & D. M. Brinton. White Plains, NY: Addison Wesley Longman, 1992. 187-200.

Underwood, J.H. Linguistics, Computers and the Language Teacher: A Communicative Approach. Rowley, MA: Newbury House, 1984.

Warschauer, M. "Comparing Face-to-Face and Electronic Communication in the Second Language Classroom." CALICO Journal 13.2 (1996): 7-26.

- - -. "Electronic Literacies: Language, Culture and Power in Online Education." Unpublished Doctoral Dissertation, U of Hawaii, 1997.

- - -. "Virtual Connections: Online Activities and Projects for Networking Language Learners." Honolulu, HI: U of Hawaii, Second Language Teaching and Curriculum Center.

Wesche, M. "Communicative Testing in a Second Language." Modern Language Journal 67.1 (1983).

Wilcox, Wilma B. "Music Cues from Classroom Singing for Second Language Acquisition: Prosodic Memory for Pronunciation of Target Vocabulary by Adult Non-Native English Speakers." Ph.D. Dissertation. University of Kansas, 1995.

Zierer, Ernesto. "Algunos aspectos neuropsíquicos del lenguaje y de la vivencia musical." Lenguaje y Ciencias 25.4 (1985): 117-130.

B. Internet CALL Bibliographies

Athelstan. (1999) CALL Bibliography.

http://www.nol.net/~athel/athelbib.html

Coski, C. & C. Kinginger. (1999) CMC in FLE: Annotated Bibliography.

http://www.lll.hawaii.edu/nflrc/networks/nw3/

Chantrill, R. (1999). A Bibliography of Evaluation of Interactive Multimedia for Language Teaching.

http://www.cltr.uq.oz.au:8000/%7erichardc/immevalbib.html

Higgins, J. (1999). CALL: A Bibliography.

http://www.stir.ac.uk/celt/staff/higdox/callbib.htm

Sedgwick, R. (1997). Annotated Bibliography of the Effectiveness of CALL.

http://www.cltr.uq.oz.au:8000/interest/biblio.html

Sharp, S.K. & P. Liu (1997). Computer-Assisted Language Learning Bibliography.

http://www.inform.umd.edu/edres/colleges/arhu/depts/langctr/flit/biblio/bibliography.html

C. Statistical References

Moore, D. S. and McCabe, G. P. (1999), Introduction to the Practice of Statistics, 3rd Edition, New York: W. H. Freeman and Company. (This is the most widely used introductory statistics book in the country. David Moore is a past president of the American Statistical Association and has an outstanding reputation for his clear and thorough explanations of statistical concepts. He is a leader in the area of statistical education, and this resource is extremely valuable for anyone who wants a basic introduction to statistics. It was one of the first to include an entire chapter on Producing Data. The chapter includes all of the major concepts in design of experiments, sampling, and the groundwork for statistical inference.)

Dean, A. M. and Voss, D. T. (1999), Design and Analysis of Experiments, Springer-Verlag New York, Inc. (This is one of several new textbooks on design and analysis of experiments. What sets it apart from its competitors is its first two chapters. Chapter 1 covers the three major areas of design: blocking, randomization, and replication (BRR!), and properly refers to this topic of design as an art. Chapter 2 includes the most useful instructional tool we have found for explaining and applying the major principles in design: a checklist for planning experiments. The authors step through the checklist with examples dealing with cotton spinning, soap, batteries, and cake baking. Chapters 3-19 are intended for upper-level undergraduates or graduate students, but the first two chapters provide an outstanding reference for all researchers.)

 

* We would like to express our gratitude to Susan Palmer for her helpful insights on, and critical evaluation of, this work. We would also like to acknowledge the support we have received from the Ohio 5 Consortium and the Steering Committee.