IberSPEECH 2012 – VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop

MAVIR: a corpus of spontaneous formal speech in Spanish and English

Antonio Moreno Sandoval and Leonardo Campillos Llanos
Laboratorio de Lingüística Informática (LLI), Universidad Autónoma de Madrid, 28049 Madrid, SPAIN
{antonio.msandoval, leonardo.campillos}@uam.es

Abstract. The MAVIR corpus is a collection of audio and video recordings with their corresponding orthographic transcriptions and prosodic annotation. Its main purpose is to support research in Natural Language Processing and Speech Technology. The recordings come from lectures and talks on language technologies held within the framework of the MAVIR consortium. The corpus is made up of 13 recordings (audio and video) in Spanish and English, collected during the I, II and III MAVIR Conferences, held in Madrid in 2006, 2007 and 2008 respectively.

Keywords: Language resources, spontaneous speech, formal speech.

1 Introduction

Spoken language resources are indispensable for developing and evaluating speech systems. In this paper, we describe a collection of audio and video recordings of formal, spontaneous speech in Spanish and English. The recordings were made at a series of lectures and panels organized by the MAVIR consortium¹ between 2006 and 2008.

Spoken language resources are typically divided into speech databases and spontaneous speech corpora. The former are collections of high-quality recordings, made in controlled environments, with detailed phonetic transcriptions. The latter are typically collections covering a wide variety of spoken registers and non-scripted speech. Such corpora are collected mainly for linguistic analysis and for applications such as language teaching or the writing of grammars and dictionaries. The first spoken corpora were collected as part of general national reference corpora such as the BNC [2] or CREA [4].
Spanish research groups have not been very active in the compilation of spontaneous speech corpora. Accordingly, few corpora of spontaneous speech in Spanish are available [1,5]. The LLI-UAM has a long history of developing such resources. Starting in 1990-92, the group collected the first spontaneous speech corpus of Spanish, CORLEC [8]². A decade later, with a new team, it was responsible for developing the Spanish corpus within the European project C-ORAL-ROM [5]³. This project, along with its contemporary, the Dutch Spoken Corpus [6], was conducted during the early years of the past decade. The two projects improved on the aforementioned national corpora in several respects, detailed in [11]:

1. The acoustic quality: from analog tapes to digital recording. In the CORLEC days there were simply no digital recorders. One of the aims of the C-ORAL-ROM project was to provide the language technology community with data of sufficient quality.
2. A clear separation of the metadata (header) from the text transcription.
3. The synchronization of transcription and audio (by utterances). This is useful for segmenting the signal according to the text, but also for checking the quality of the transcription.
4. Prosodic and Part-of-Speech annotations are provided in different layers.
5. The legal rights of the speakers and of the copyright holders (in media recordings) are preserved. Every recording has the written permission of the participant.

Thus the oral corpus has clearly evolved while maintaining its essence: the recording of spontaneous speech in its context of use.

¹ MAVIR: Mejorando el Acceso y Visibilidad de la Información en Red (http://www.mavir.net/) is a research consortium funded by the Madrid Regional Government under the grants S0505/TIC-0267 and S2009/TIC-1542.
2 Description of the MAVIR Corpus

The MAVIR corpus builds on the experience gained with two previous corpora, C-ORAL-ROM and CHIEDE⁴ [7]. Unlike them, MAVIR is a bilingual corpus (Spanish and English), and it differs from them in several important respects (see Table 1).

Table 1. Distinctive features of the three corpora.

Feature | C-ORAL-ROM | CHIEDE | MAVIR
General type | Reference corpus | Child corpus | Topic-oriented
Design | Formal vs. informal | By child ages | By topic
Interactional type | Monologues, dialogues, conversations | Dialogues and conversations | Monologues and panel discussions
Typical length of recordings | Between 5 and 30 minutes | Between 10 and 30 minutes | Between 20 minutes and one hour

² http://www.lllf.uam.es/ING/Corlec.html. The transcription is available for downloading.
³ http://www.lllf.uam.es/ING/Coralrom.html
⁴ http://www.lllf.uam.es/ING/Chiede.html

MAVIR is a corpus of formal speech, in contrast with C-ORAL-ROM and CHIEDE, where informal speech is the characteristic feature. Another relevant aspect is its topic orientation: lectures on language technology topics such as information retrieval or the Semantic Web. Table 2 shows the text distribution by language and topic. In figures, MAVIR consists of 13 files (9 in Spanish and 4 in English) with a total duration of more than 10 hours and over 100,000 words, including 3 hours and 10 minutes in English and over seven hours in Spanish (Table 2).

Table 2. MAVIR text distribution

File | Title | Duration | Nº of words⁵ | Nº of utterances | Lang.
mavir01 | Challenges for Information Extraction | 1h 07' 39" | 9113 | 597 | Eng
mavir02 | Proceso de innovación de tecnologías de acceso a la información: ¿Cómo llegar al mercado? | 1h 14' 32" | 13422 | 682 | Spa
mavir03 | España y los buscadores: un mercado potencial | 38' 11" | 6681 | 481 | Spa
mavir04 | Aplicaciones en dominios médico y cultural | 57' 22" | 9310 | 347 | Spa
mavir05 | On-demand Information Extraction | 36' 08" | 4461 | 464 | Eng
mavir06 | Buscador General Panhispánico | 29' 09" | 4332 | 140 | Spa
mavir07 | Tecnología de la Web Semántica | 21' 47" | 3831 | 190 | Spa
mavir08 | Premio MAVIR 2007 | 18' 55" | 3356 | 189 | Spa
mavir09 | Buenas prácticas en presencia web para grupos de investig. | 1h 10' 03" | 11179 | 650 | Spa
mavir10 | Multimedia Retrieval and Evaluation | 1h 27' 24" | 15659 | 657 | Eng
mavir11 | Premio MAVIR 2008 | 20' 20" | 3130 | 152 | Spa
mavir12 | Beyond Text-based Multimedia Retrieval | 1h 07' 40" | 11168 | 741 | Eng
mavir13 | Buscando cangrejos en Flickr | 43' 38" | 7837 | 531 | Spa
TOTAL | | 10h 38' 48" | 103479 | 7902 |

⁵ The word count has been provisionally performed by counting every item between two spaces; thus a multiword expression such as es decir ('that is') counts as two words.

The total number of words per language is 63078 in Spanish and 40401 in English. With respect to participants, the four texts in English are monologues, while the nine Spanish-language recordings are split between monologues and round tables, with a total of 19 different speakers.

3 Methodology

The recordings were made on site at the conferences (in different venues). In most cases the signal was taken directly from the audio system; in the remaining cases, the lectures were recorded with a DAT recorder. The speech signal was down-sampled to 16 kHz, 16-bit, mono. The recordings were edited with the program CoolEdit©, which allows one to manipulate the sound, improving its quality, removing noise, or cutting out parts that are not relevant.

The corpus was transcribed and prosodically annotated by several transcribers, all of them Ph.D. students with a background in linguistics. Experienced members of the LLI-UAM supervised the whole task.
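As a rough illustration of the audio format described in this section, the following sketch checks whether a sound file is 16 kHz, 16-bit, mono, using Python's standard `wave` module. It assumes the recordings are available as WAV files, which is not stated in this paper, and the helper names `is_corpus_format` and `duration_seconds` are hypothetical.

```python
import wave

# Target format stated in the Methodology section: 16 kHz, 16-bit, mono.
TARGET_RATE_HZ = 16_000
TARGET_SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
TARGET_CHANNELS = 1             # mono


def is_corpus_format(path):
    """Return True if the WAV file at `path` is 16 kHz, 16-bit, mono.

    Hypothetical helper: the file layout of the distributed corpus is
    not described in the paper, so WAV input is an assumption.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_RATE_HZ
            and wav.getsampwidth() == TARGET_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == TARGET_CHANNELS
        )


def duration_seconds(path):
    """Duration of a WAV file, computed as number of frames / frame rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

At this format, one second of uncompressed audio occupies 16,000 × 2 × 1 = 32,000 bytes, so the size of a file gives a quick sanity check on its duration.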
The transcribers based their annotation on the transcription guidelines, following the C-ORAL-ROM conventions [10]. Each transcriber produced a first version, which was revised by another transcriber; after the revision, they discussed their disagreements and agreed on a final version. Transcriptions were made from the processed sound files.

Each transcription file is divided into two parts: header and transcription. The header contains information about the participants and the communicative situation; for instance, speaker data (sex, education, dialect, etc.), topic, duration, transcribers and revisers. The second part of the file, after the metadata section, is the text transcription: an orthographic rendering of the recording that follows conventions developed specifically for spoken language, since the punctuation system established for written language is not suitable for it. We briefly describe these conventions next.

4 Transcription conventions

Figure 1 below shows an example of the transcription of recording mavir05.

Figure 1. Fragment of the transcription from file mavir05

*SEK: ok /// so this is the result /// you can tell /// right ? this is a result xxx was supposed to get /// I can pick one of them /// maybe this one /// {%com: he waits until the page loads} Netherlands beats Spain /// hhh {%act: interjection} beat hhh {%act: laugh} I didn't know /// you know what I'm forward to xxx /// so / yeah ? this is what / maybe / &ah we can expect from / a question like country name ///

The following table summarizes the transcription conventions used in the corpora.

Table 3. Transcription conventions

Mark | Description | Meaning | Example
/ | Non-terminal prosodic break | Non-autonomous tonal units | *GRI: thank you Antonio / and thank you (…)
// | Non-terminal autonomous pros. break | Independent tonal units | *GRI: ok // so for example / &ah (…)
/// | Terminal prosodic break | Informative units (complete meaning) | *SEK: that's the idea ///
¡! | Exclamat. utterance | Exclamation | *GRI: this problem got solved !
¿? | Interrogat. utterance | Interrogation | *SEK: you get idea ?
… | Not-finished utterance | Suspended intonation | *SEK: but / at the moment ...
= | Self-interruption | Intentional interruption | *GRI: here = ups! / excuse me
+ | Interruption | Speaker is interrupted. | *LRO: el caso de xxx +
¬ | Turn continued after an interruption | It is used at the beginning of the interrupted turn. | *IRA: millones / *ENR: no /// no /// *IRA: ¬ se gastaron
! | Lengthening | Long vowel/conson. | *SEK: all ! one thousand
[/] | Simple retracting | Repetition or retrace | *GRI: not [/] no job get started
[///] | Retracting | Syntactic reformulation | *GRI: Booth was &assassina [///] sorry ///
<> | Overlapping | It is used when two people speak at the same time. | *JSL: <es capaz de resolvérselo> *LRO: [<] <de resolvérselo>
# | Non-prosodic break | A long break (not expressive intention) | *COR: herramientas sencillas # {%com: consults laptop}
xxx | Not-transcribed words | Passage not understood | *GRI: literature xxx
& | Before a fragment or unfinished word | A non-complete word (self-correction) | *GRI: so being &ab [/] being able to pull out
&eh &ah &mm | Vocalic support or filler | The speaker uses it to keep his / her turn. | *GRI: &eh similar observation can be made
hhh {%act:} | Paralinguistic or non-linguistic elem. | An onomatopoeia, laugh, assent, click… | *GRI: hhh {%act: cough}
{%alt:} | Production errors | A wrong word or mispronunciation. | *SEK: &ah promotion {%alt: promo-tion} to xxx
{%com:} | Comments | It comments an event | *GRI: {%com: drinks}

5 Alignment

Alignment consists of synchronizing the text with the original sound, either by conversational turns or by utterances (the MAVIR corpus is aligned by utterances).
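Because the corpus is aligned by utterances, the terminal marks of the transcription conventions are what delimit the units to be time-stamped. The sketch below shows one possible way to split a transcribed turn into utterances; it is not the tool used to build the corpus. It assumes that an utterance ends at a terminal break `///` or at an interrogation mark `?`, and it deliberately ignores `!` (which also marks lengthening) and `...`. The names `split_turn`, `TURN_RE` and `BOUNDARY_RE` are hypothetical.

```python
import re

# "*SEK: text" -> speaker code and turn text
# (three-letter speaker codes, as in the examples of Table 3).
TURN_RE = re.compile(r"^\*([A-Z]{3}):\s*(.*)$", re.DOTALL)

# Assumed utterance boundary: a terminal break "///" that is not part of
# the retracting mark "[///]", or the whitespace that follows a "?".
BOUNDARY_RE = re.compile(r"\s*(?<!\[)///(?!\])\s*|(?<=\?)\s+")


def split_turn(line):
    """Return (speaker, [utterances]) for one transcribed turn."""
    match = TURN_RE.match(line.strip())
    if match is None:
        raise ValueError("not a turn line: %r" % line)
    speaker, text = match.groups()
    # Drop the empty piece left after the final terminal break.
    utterances = [u.strip() for u in BOUNDARY_RE.split(text) if u.strip()]
    return speaker, utterances


if __name__ == "__main__":
    turn = ("*SEK: ok /// so this is the result /// you can tell /// "
            "right ? this is a result xxx was supposed to get ///")
    speaker, utterances = split_turn(turn)
    for utterance in utterances:
        print(speaker, "|", utterance)
```

Applied to the whole turn in Figure 1, this rule yields twelve utterances, the same segmentation as the fragment's twelve time-stamped units in the aligned XML file. A fuller treatment would also need to handle exclamations, suspended utterances, and interruptions.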
In the first stage, every text fragment is synchronized with the corresponding sound. Trained linguists carry out this work manually with professional software; it is a painstaking task, since it requires precision in marking the initial and final time codes of every utterance. After the synchronization is finished, the conversion to XML format is performed automatically. The text is broken down into utterances (according to the time codes marked by the linguist), each delimited by a time stamp at its beginning and at its end. Figure 2 shows the synchronized transcription corresponding to the fragment in Figure 1.

Figure 2. Fragment from an XML file (mavir05) with the transcription and the time codes

<UNIT speaker="SEK" startTime="543.109" endTime="544.578"> ok </UNIT>
<UNIT speaker="SEK" startTime="544.578" endTime="545.99"> so this is the result </UNIT>
<UNIT speaker="SEK" startTime="545.99" endTime="547.125"> you can tell </UNIT>
<UNIT speaker="SEK" startTime="547.125" endTime="548.962"> right ?</UNIT>
<UNIT speaker="SEK" startTime="548.962" endTime="552.248"> this is a result xxx was supposed to get </UNIT>
<UNIT speaker="SEK" startTime="552.248" endTime="554.452"> I can pick one of them </UNIT>
<UNIT speaker="SEK" startTime="554.452" endTime="555.572"> maybe this one </UNIT>
<UNIT speaker="SEK" startTime="555.572" endTime="565.385"> {%com: he waits until the page loads} Netherlands beats Spain </UNIT>
<UNIT speaker="SEK" startTime="565.385" endTime="568.103"> hhh {%act: interjection} beat hhh {%act: laugh} I didn't know </UNIT>
<UNIT speaker="SEK" startTime="568.103" endTime="570.489"> you know what I'm forward to xxx </UNIT>
<UNIT speaker="SEK" startTime="570.489" endTime="571.605"> so / yeah ?</UNIT>
<UNIT speaker="SEK" startTime="571.605" endTime="576.171"> this is what / maybe / &ah we can expect from / a question like country name </UNIT>

6 Conclusions

With regard to applications, the MAVIR corpus has been applied
to date for the following tasks:

1. A descriptive study of speech dysfluencies in Spanish in the formal register [3].
2. Development, training and testing of several ASR systems, among them the AVTS and the THALES-UPM systems. In addition, researchers working in the European project transLectures [13] have shown interest in the corpus and have been given a copy of the resource.
3. The corpus will be used in the word-spotting test task to be held during the IberSPEECH 2012 conference. The LLI-UAM group has manually annotated more than 5000 words to be used in this competitive evaluation.

The MAVIR corpus is a contribution to the resources available to the speech technology research community. The corpus is freely available for research purposes. Please contact Dr. Antonio Moreno Sandoval to obtain a copy of the DVDs ([email protected]).

References

1. Benedí, J.-M., Lleida, E., Varona, A., Castro, M.-J., Galiano, I., Justo, R., López de Letona, I., Miguel, A.: Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA. In Proc. of the Fifth International Conference on Language Resources and Evaluation, LREC 2006. Genova, Italy (2006)
2. British National Corpus. http://www.natcorp.ox.ac.uk/
3. Campillos, L., Alcántara, M.: Speech Dysfluencies in Formal Context. Analysis based on Spontaneous Speech Corpora. In Proc. Corpus Linguistics Conference 2009 (2009)
4. Corpus de Referencia del Español Actual. http://corpus.rae.es/creanet.html
5. Cresti, E., Moneglia, M., Bacelar do Nascimento, F., Moreno-Sandoval, A., Veronis, J., Martin, P., Choukri, K., Mapelli, V., Falavigna, D., Cid, A., Blum, C.: The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus. In Proc. of Language Resources and Evaluation Conference 2002. Las Palmas, Spain (2002). C-ORAL-ROM official web site: http://lablita.dit.unifi.it/coralrom
6. Dutch Spoken Corpus. http://lands.let.ru.nl/cgn/ehome.htm
7. Garrote Salazar, M.: CHIEDE. Corpus de habla infantil espontánea del español. Ph.D. Thesis. Madrid: UAM Publishing Service (2008)
8. González Ledesma, A., De la Madrid, G., Alcántara Plá, M., De la Torre, R., Moreno-Sandoval, A.: Orality and Difficulties in the Transcription of Spoken Corpora. In Proc. of the Workshop on Compiling and Processing Spoken Language Corpora, LREC 2004, Lisbon (2004)
9. Marcos Marín, F.: El Corpus Oral de Referencia de la Lengua Española contemporánea. Project Report. Madrid (1992). http://www.lllf.uam.es/ESP/Info%20Corlec.html
10. Moneglia, M.: The C-ORAL-ROM resource. In Cresti, E., Moneglia, M. (eds.) p. 27 (2005)
11. Moreno Sandoval, A.: La evolución de los corpus de habla espontánea: la experiencia del LLI-UAM. Actas de las Segundas Jornadas de Tecnologías del Habla. Granada (2002)
12. Moreno, A., De la Madrid, G., Alcántara, M., González, A., Guirao, J.M., de la Torre, R.: The Spanish Corpus. In Cresti, E., Moneglia, M. (eds.) C-ORAL-ROM: Integrated reference Corpora for Spoken Romance Languages, pp. 135-161. Amsterdam: John Benjamins (2005)
13. TransLectures project (Transcription and Translation of Video Lectures). http://llach.dsic.upv.es/~translectures/