The "Mandarin spoken corpora project" is part of the Language Archives Project (National Digital Archives Project). Its main aim is to collect a wide variety of Taiwan Mandarin speech data and to digitally archive the use of Taiwan Mandarin in audio and video formats. The project consists of (1) speech data collection and processing, (2) toolkit and database development, (3) metadata management, (4) speech annotation design, and (5) web query system construction. Three main Mandarin spoken corpora are currently under development, funded by the Institute of Linguistics, the National Science Council, and the National Digital Archives Project.
These are the "Mandarin Topic-oriented Conversation Corpus" (MTCC), the "Mandarin Conversational Dialogue Corpus" (MCDC), and the "Mandarin Map Task Corpus" (MMTC). The annotation systems include "discourse annotation", "detailed spontaneous speech phenomena", and "particular phonetic phenomena". Web users can also search the corpora listed above for keywords and annotations through our web query system.