Constructing a Chinese as Second Language Learner Corpus for Language Learning and Research

Chen, Howard, National Taiwan Normal University, Taiwan, hjchen@ntnu.edu.tw

Many researchers and language teachers around the world believe that language corpora have great potential for improving second/foreign language learning and teaching. Among the various types of corpus resources, learner corpora in particular have received much attention recently. One of the most influential projects is the International Corpus of Learner English (ICLE) project led by Professor Granger at the University of Louvain, Belgium. The new version of the ICLE corpus contains 3.7 million words of EFL writing from learners representing 16 mother tongue backgrounds. The ICLE corpus has helped to produce many research papers and pedagogical materials within the past decade. Although useful English learner corpora such as ICLE are widely available, very few learner corpora exist for other target languages.

Recently, because of the rapid economic growth in China, an increasing number of students are learning Chinese as a second language. Although the number of Chinese as a second language (CSL) learners is increasing rapidly around the world, very few CSL learner corpora are available for teaching, learning, and research. For CSL research, a learner corpus can play an important role. CSL teachers and researchers can conduct research on learners’ interlanguage development and gain insights into learners’ difficulties and needs. Material writers can further use the results of error analysis to produce useful pedagogical materials. A CSL learner corpus might also be used to better understand the differences among learners at different proficiency levels (cf. Cambridge English Profile Project).

This paper will introduce a new Chinese as a second language learner corpus and related corpus search tools developed by MTC (Mandarin Teaching Center) and SC-TOP (Steering Committee of Test of Proficiency) in Taiwan. MTC is located at National Taiwan Normal University and is one of the largest Chinese teaching centers in Taiwan. More than 1,600 students from more than 70 countries are enrolled each quarter, and the center has more than 150 teachers. SC-TOP is a language testing research center sponsored by the Ministry of Education to develop various Chinese as a second language proficiency tests. Based on the data provided by these two centers, a 3-million-word Chinese as a second language learner corpus has been developed. The MTC-TOP learner corpus includes the following two types of learner data:

CSL learners’ short essays written for various TOP tests.
CSL students’ writing assignments at MTC.

To facilitate corpus search, the learner corpus was automatically tagged with the CKIP (Chinese Knowledge Information Processing) tagger, a Chinese part-of-speech (POS) tagger developed by Academia Sinica, Taiwan. The POS-tagged CSL corpus is useful for research and teaching. In addition to the learner corpus, a web concordancer with several different search options was also developed. This web concordancer allows users to retrieve specific words and phrases from the CSL learner corpus. Thus, various CSL learners’ usage can be retrieved and studied more easily and systematically. Furthermore, the POS-tagged learner corpus can be used to search for collocates used by CSL learners.
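
As a rough illustration of the two kinds of search described above (keyword-in-context retrieval and POS-filtered collocate search), the following minimal Python sketch works over a tiny POS-tagged sample. It is not the MTC-TOP concordancer: the sentences are invented, the `(word, tag)` data layout and the function names `concordance` and `collocates` are assumptions made for this example, and the tag labels only imitate the general style of the Academia Sinica tagset.

```python
from collections import Counter

# Hypothetical sample, invented for illustration (not corpus data).
# Each sentence is a list of (word, POS tag) pairs; the tag labels are assumptions.
SENTENCES = [
    [("我", "Nh"), ("買", "VC"), ("了", "Di"), ("三", "Neu"), ("本", "Nf"), ("書", "Na")],
    [("他", "Nh"), ("有", "V_2"), ("一", "Neu"), ("個", "Nf"), ("書", "Na")],
    [("我", "Nh"), ("喜歡", "VK"), ("這", "Nep"), ("本", "Nf"), ("書", "Na")],
]


def concordance(sentences, keyword, window=2):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for sent in sentences:
        words = [w for w, _ in sent]
        for i, w in enumerate(words):
            if w == keyword:
                left = " ".join(words[max(0, i - window):i])
                right = " ".join(words[i + 1:i + 1 + window])
                hits.append((left, w, right))
    return hits


def collocates(sentences, keyword, pos, window=2):
    """Count words carrying tag `pos` within `window` tokens of the keyword."""
    counts = Counter()
    for sent in sentences:
        for i, (w, _) in enumerate(sent):
            if w != keyword:
                continue
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j][1] == pos:
                    counts[sent[j][0]] += 1
    return counts


if __name__ == "__main__":
    for left, kw, right in concordance(SENTENCES, "書"):
        print(f"{left:>10} [{kw}] {right}")
    # Which measure words ("Nf" in this sample) appear near the noun?
    print(collocates(SENTENCES, "書", "Nf").most_common())
```

Restricting collocates by tag is what makes the POS layer useful here: a search for measure words near a given noun surfaces the classifier choices learners actually made, which a plain string search would not isolate.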

In addition, it is important to analyze the various errors made by CSL learners. Because computers cannot reliably identify learner errors, native speakers were asked to tag this learner corpus. Twenty-five major error types were first identified, and about 800,000 words have been tagged so far. These manually tagged errors can also be searched via a web interface. Because the learner data were produced by learners from various L1 backgrounds, teachers and researchers can also find errors and patterns produced by CSL learners from different native language backgrounds.
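
To make the idea of searching tagged errors by L1 background concrete, here is a small Python sketch that tabulates error annotations by learners' first language. It is only an assumption about how such records might be organized: the field names (`l1`, `error_type`, `text`), the error labels, and the sample records are all hypothetical and do not reflect the actual MTC-TOP annotation scheme or data.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records, invented for illustration.
# Field names and error labels are assumptions, not the corpus's scheme.
RECORDS = [
    {"l1": "English", "error_type": "classifier", "text": "一個書"},
    {"l1": "English", "error_type": "particle_le", "text": "我昨天去了學校了"},
    {"l1": "Japanese", "error_type": "classifier", "text": "三個紙"},
    {"l1": "English", "error_type": "classifier", "text": "兩個馬"},
]


def errors_by_l1(records, error_type=None):
    """Count error types per L1; optionally keep only one error type."""
    table = defaultdict(Counter)
    for rec in records:
        if error_type is None or rec["error_type"] == error_type:
            table[rec["l1"]][rec["error_type"]] += 1
    return table


if __name__ == "__main__":
    for l1, counts in sorted(errors_by_l1(RECORDS).items()):
        print(l1, dict(counts))
```

Passing `error_type="classifier"` narrows the table to a single error category, which mirrors the kind of comparison across native language backgrounds described above.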

To illustrate how the MTC-TOP learner corpus can benefit CSL research, we used it to conduct a study on the acquisition of Chinese classifiers by various CSL learners. Chinese classifiers are a notoriously difficult language feature for many CSL learners. In the past, most studies on the learning of Chinese classifiers involved few subjects and were based on a very limited number of learner errors. With the help of this 3-million-word CSL corpus, the classifier errors made by many learners from various first language backgrounds can be found more easily and examined more systematically. The analysis of classifier errors produced by different CSL learners also helps researchers to better identify common Chinese classifier errors. In addition to classifiers, other common errors made by CSL learners can also be investigated with the help of the computerized corpus. Several problematic areas (Chinese synonyms and the Chinese particle le) will be further discussed in this paper. The CSL learner corpus and the web concordancer should help more researchers uncover CSL interlanguage patterns and conduct various types of research. It is evident that this CSL learner corpus can make a significant impact on CSL teaching, learning, and research.