IberSPEECH 2012 – VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop
MAVIR: a corpus of spontaneous formal speech in Spanish and English
Antonio Moreno Sandoval and Leonardo Campillos Llanos
Laboratorio de Lingüística Informática (LLI), Universidad Autónoma de Madrid,
28049 Madrid, SPAIN
{antonio.msandoval, leonardo.campillos}@uam.es
Abstract. The MAVIR corpus is a collection of audio and video recordings with their corresponding orthographic transcriptions and prosodic annotation. The main aim of the corpus is to support research in Natural Language Processing and Speech Technology. The recordings come from lectures and talks on language technologies held within the framework of the MAVIR consortium. The corpus is made up of 13 recordings (audio and video) in Spanish and English, collected during the I, II and III MAVIR Conferences, held in Madrid in 2006, 2007 and 2008 respectively.
Keywords: Language resources, spontaneous speech, formal speech.
1 Introduction
Spoken language resources are indispensable for developing and evaluating speech systems. In this paper, we describe a collection of audio and video recordings of formal, spontaneous speech in Spanish and English. The recordings were made at a series of lectures and panels organized by the MAVIR consortium¹ between 2006 and 2008. Spoken language resources are typically divided into speech databases and spontaneous speech corpora. The former are collections of high-quality recordings and detailed phonetic transcriptions of speech in controlled environments. The latter are typically collections of a wide variety of spoken registers and non-scripted speech. Such corpora are collected mainly for linguistic analysis and for applications such as language teaching or the writing of grammars and dictionaries. The first spoken corpora were collected as part of general national reference corpora such as the BNC [2] or CREA [4].
Spanish research groups have not been very active in compiling spontaneous speech resources, and consequently only a few corpora of spontaneous speech in Spanish are available [1,5]. The LLI-UAM has a long history of building such resources.
¹ MAVIR: Mejorando el Acceso y Visibilidad de la Información en Red (http://www.mavir.net/) is a research consortium funded by the Madrid Regional Government under grants S0505/TIC-0267 and S2009/TIC-1542.
Starting in 1990-92, the group collected the first spontaneous speech corpus of Spanish, CORLEC [8]². A decade later, with a new team, it was responsible for developing the Spanish corpus within the European project C-ORAL-ROM [5]³. This project, along with the contemporary Dutch Spoken Corpus [6], was conducted during the early years of the past decade. The two projects improved on the aforementioned national corpora in several respects, detailed in [11]:
1. Acoustic quality: from analog tapes to digital recording. In the CORLEC days there were simply no digital recorders. One of the aims of the C-ORAL-ROM project was to provide the language technology community with data of sufficient quality.
2. Clear separation of the metadata (header) from the text transcription.
3. Synchronization of transcription and audio (by utterances). This is useful for segmenting the signal according to the text, but also for checking the quality of the transcription.
4. Prosodic and part-of-speech annotations provided in separate layers.
5. Legal rights of the speakers and of copyright holders (in media recordings) are preserved. Every recording has the written permission of the participant.
A clear evolution of oral corpora can thus be seen, while the essence is maintained: recording spontaneous speech in its context of use.
2 Description of the MAVIR Corpus
The MAVIR corpus builds on the experience gained with previous corpora, C-ORAL-ROM and CHIEDE⁴ [7], but it is a bilingual corpus (Spanish and English) and differs from them in important respects (see Table 1).
Table 1. Distinctive features of the three corpora.
Feature                       C-ORAL-ROM                            CHIEDE                       MAVIR
General type                  Reference corpus                      Child corpus                 Topic-oriented
Design                        Formal vs. informal                   By child ages                By topic
Interactional type            Monologues, dialogues, conversations  Dialogues and conversations  Monologues and panel discussions
Typical length of recordings  Between 5 and 30 minutes              Between 10 and 30 minutes    Between 20 minutes and one hour
² http://www.lllf.uam.es/ING/Corlec.html. The transcription is available for downloading.
³ http://www.lllf.uam.es/ING/Coralrom.html
⁴ http://www.lllf.uam.es/ING/Chiede.html
MAVIR is a corpus of formal speech, in contrast with C-ORAL-ROM and CHIEDE, where informal speech is the characteristic feature. Another relevant aspect is its topic orientation: the lectures deal with language technology topics such as information retrieval or the semantic web. Table 2 shows the text distribution by language and topic.
In figures, MAVIR consists of 13 files (9 in Spanish and 4 in English) with a total duration of more than 10 hours and over 100,000 words, including 3 hours and 10 minutes in English and over seven hours in Spanish (Table 2).
Table 2. MAVIR text distribution
File     Duration     Nº of words⁵  Nº of utterances  Lang.  Title
mavir01  1h 07' 39"   9113          597               Eng    Challenges for Information Extraction
mavir02  1h 14' 32"   13422         682               Spa    Proceso de innovación de tecnologías de acceso a la información: ¿Cómo llegar al mercado?
mavir03  38' 11"      6681          481               Spa    España y los buscadores: un mercado potencial
mavir04  57' 22"      9310          347               Spa    Aplicaciones en dominios médico y cultural
mavir05  36' 08"      4461          464               Eng    On-demand Information Extraction
mavir06  29' 09"      4332          140               Spa    Buscador General Panhispánico
mavir07  21' 47"      3831          190               Spa    Tecnología de la Web Semántica
mavir08  18' 55"      3356          189               Spa    Premio MAVIR 2007
mavir09  1h 10' 03"   11179         650               Spa    Buenas prácticas en presencia web para grupos de investig.
mavir10  1h 27' 24"   15659         657               Eng    Multimedia Retrieval and Evaluation
mavir11  20' 20"      3130          152               Spa    Premio MAVIR 2008
mavir12  1h 07' 40"   11168         741               Eng    Beyond Text-based Multimedia Retrieval
mavir13  43' 38"      7837          531               Spa    Buscando cangrejos en Flickr
TOTAL    10h 38' 48"  103479        7902
⁵ The word count has been performed provisionally by counting every item between two spaces; a multiword expression such as es decir ('that is') therefore counts as two words.
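The provisional counting rule described in the footnote to Table 2 can be illustrated with a minimal Python sketch. The function name count_words is hypothetical, and the sketch assumes that an item is any whitespace-delimited string. Whether prosodic marks were excluded from the published figures is not stated, so the sketch does not filter them.

```python
def count_words(text):
    """Count every whitespace-delimited item as one word (provisional rule)."""
    return len(text.split())


# A multiword expression such as "es decir" is counted as two words.
print(count_words("es decir"))
```

A linguistically motivated count would need a list of multiword expressions, which this sketch deliberately leaves out.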
The total number of words per language is 63078 in Spanish and 40401 in English. With respect to participants, the four texts in English are monologues, while the nine Spanish-language recordings are split between monologues and round tables, with a total of 19 different speakers.
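The per-language word totals can be checked against the figures listed in Table 2. The following Python sketch is only an illustration: the (file, words, language) triples are copied from the table, and the variable names are hypothetical.

```python
# (file, number of words, language) copied from Table 2.
ROWS = [
    ("mavir01", 9113, "Eng"), ("mavir02", 13422, "Spa"),
    ("mavir03", 6681, "Spa"), ("mavir04", 9310, "Spa"),
    ("mavir05", 4461, "Eng"), ("mavir06", 4332, "Spa"),
    ("mavir07", 3831, "Spa"), ("mavir08", 3356, "Spa"),
    ("mavir09", 11179, "Spa"), ("mavir10", 15659, "Eng"),
    ("mavir11", 3130, "Spa"), ("mavir12", 11168, "Eng"),
    ("mavir13", 7837, "Spa"),
]

# Accumulate the word counts by language.
totals = {}
for _file, words, lang in ROWS:
    totals[lang] = totals.get(lang, 0) + words

print(totals)                # words per language
print(sum(totals.values()))  # words in the whole corpus
```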
3 Methodology
The recordings were made on site at the conference venues (at different locations). In most cases the signal was taken directly from the audio system; in the remaining cases, the lectures were recorded with a DAT recorder. The speech signal was down-sampled to 16 kHz, 16-bit mono.
The recordings were edited with CoolEdit©. This software allows one to manipulate the sound, improving its quality, eliminating noise or cutting out parts that are not relevant.
The corpus was transcribed and prosodically annotated by several transcribers, all
of them Ph.D. students with a background in linguistics. Experienced members of the
LLI-UAM supervised the whole task. The transcribers based their annotation on the
transcription guidelines, following the C-ORAL-ROM conventions [10]. Each transcriber made a first version, which was revised by another transcriber; after revision,
they discussed disagreements and reached a final version.
Transcriptions were carried out from the processed sound files. Each transcription file is divided into two parts: header and transcription. The header contains information on the participants and the communicative situation, for instance speaker data (sex, education, dialect, etc.), topic, duration, transcribers and revisers.
The second part of the file, after the metadata section, is the text transcription: an orthographic transcription of the recording that follows conventions developed specifically for spoken language, since the punctuation system established for written language is not suitable for speech. These conventions are briefly described next.
4 Transcription conventions
Figure 1 below shows an example of the transcription of recording mavir05.
Figure 1. Fragment of the transcription from file mavir05
*SEK: ok /// so this is the result /// you can tell /// right ? this is a result xxx was supposed to
get /// I can pick one of them /// maybe this one /// {%com: he waits until the page loads}
Netherlands beats Spain /// hhh {%act: interjection} beat hhh {%act: laugh} I didn't know ///
you know what I'm forward to xxx /// so / yeah ? this is what / maybe / &ah we can expect
from / a question like country name ///
Table 3 summarizes the transcription conventions used in the corpus.
Table 3. Transcription conventions
Mark         Description                               Meaning                                        Example
/            Non-terminal prosodic break               Non-autonomous tonal units                     *GRI: thank you Antonio / and thank you (…)
//           Non-terminal autonomous prosodic break    Independent tonal units                        *GRI: ok // so for example / &ah (…)
///          Terminal prosodic break                   Informative units (complete meaning)           *SEK: that's the idea ///
¡!           Exclamatory utterance                     Exclamation                                    *GRI: this problem got solved !
¿?           Interrogative utterance                   Interrogation                                  *SEK: you get idea ?
…            Unfinished utterance                      Suspended intonation                           *SEK: but / at the moment ...
=            Self-interruption                         Intentional interruption                       *GRI: here = ups! / excuse me
+            Interruption                              Speaker is interrupted                         *LRO: el caso de xxx +
¬            Turn continued after an interruption      Used at the beginning of the interrupted turn  *IRA: millones /
                                                                                                      *ENR: no /// no ///
                                                                                                      *IRA: ¬ se gastaron
!            Lengthening                               Long vowel/consonant                           *SEK: all ! one thousand
[/]          Simple retracting                         Repetition or retrace                          *GRI: not [/] no job get started
[///]        Retracting                                Syntactic reformulation                        *GRI: Booth was &assassina [///] sorry ///
<>           Overlapping                               Used when two people speak at the same time    *JSL: <es capaz de resolvérselo>
                                                                                                      *LRO: [<] <de resolvérselo>
#            Non-prosodic break                        A long break (no expressive intention)         *COR: herramientas sencillas # {%com: consults laptop}
xxx          Non-transcribed words                     Passage not understood                         *GRI: literature xxx
&            Before a fragment or unfinished word      An incomplete word (self-correction)           *GRI: so being &ab [/] being able to pull out
&eh &ah &mm  Vocalic support or filler                 The speaker uses it to keep his/her turn       *GRI: &eh similar observation can be made
hhh {%act:}  Paralinguistic or non-linguistic element  An onomatopoeia, laugh, assent, click…         *GRI: hhh {%act: cough}
{%alt:}      Production errors                         A wrong word or mispronunciation               *SEK: &ah promotion {%alt: promo-tion} to xxx
{%com:}      Comments                                  Comment on an event                            *GRI: {%com: drinks}
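As an illustration of how these marks can be read by a program, the following Python sketch splits one transcribed turn into utterances. It is not the authors' tool: the function name split_utterances is hypothetical, and the rule it applies (an utterance ends at the terminal break /// or at the interrogation mark ?) is an assumption inferred from comparing Figure 1 with the units of Figure 2. Other terminal marks are not handled.

```python
def split_utterances(turn):
    """Split one transcribed turn into (speaker, list of utterances)."""
    # The speaker label precedes the first colon, e.g. "*SEK".
    label, _, body = turn.partition(":")
    speaker = label.strip().lstrip("*")
    utterances, current = [], []
    for token in body.split():
        if token == "///":
            # Terminal prosodic break: close the utterance, drop the mark.
            if current:
                utterances.append(" ".join(current))
            current = []
        elif token == "?":
            # Interrogation mark: close the utterance, keep the mark.
            current.append(token)
            utterances.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:  # the turn ended without a terminal mark
        utterances.append(" ".join(current))
    return speaker, utterances


# Opening of the fragment shown in Figure 1.
turn = ("*SEK: ok /// so this is the result /// you can tell /// right ? "
        "this is a result xxx was supposed to get ///")
speaker, utterances = split_utterances(turn)
print(speaker, utterances)
```

Non-terminal breaks (/ and //) and annotations such as {%com: ...} are left inside the utterance text, as they are in the published transcriptions.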
5 Alignment
Alignment synchronizes the text with the original sound, either by conversational turns or by utterances (the MAVIR corpus is aligned by utterances).
In the first stage, every text fragment is synchronized with the corresponding sound. Trained linguists carry out this work manually with professional software; it is a painstaking task, since it requires precision when marking the initial and final time codes of every utterance. Once synchronization is finished, the conversion to XML is performed automatically. The text is broken down into utterances (according to the time codes marked by the linguist), each delimited by a time stamp at its beginning and at its end. Figure 2 shows the synchronized transcription corresponding to the fragment in Figure 1.
Figure 2. Fragment from an XML file (mavir05) with the transcription and the time codes
<UNIT speaker="SEK" startTime="543.109" endTime="544.578"> ok </UNIT>
<UNIT speaker="SEK" startTime="544.578" endTime="545.99"> so this is the result </UNIT>
<UNIT speaker="SEK" startTime="545.99" endTime="547.125"> you can tell </UNIT>
<UNIT speaker="SEK" startTime="547.125" endTime="548.962"> right ?</UNIT>
<UNIT speaker="SEK" startTime="548.962" endTime="552.248"> this is a result xxx was supposed to get </UNIT>
<UNIT speaker="SEK" startTime="552.248" endTime="554.452"> I can pick one of them </UNIT>
<UNIT speaker="SEK" startTime="554.452" endTime="555.572"> maybe this one </UNIT>
<UNIT speaker="SEK" startTime="555.572" endTime="565.385"> {%com: he waits until the page loads} Netherlands beats Spain </UNIT>
<UNIT speaker="SEK" startTime="565.385" endTime="568.103"> hhh {%act: interjection} beat hhh {%act: laugh} I didn't know </UNIT>
<UNIT speaker="SEK" startTime="568.103" endTime="570.489"> you know what I'm forward to xxx </UNIT>
<UNIT speaker="SEK" startTime="570.489" endTime="571.605"> so / yeah ?</UNIT>
<UNIT speaker="SEK" startTime="571.605" endTime="576.171"> this is what / maybe / &amp;ah we can expect from / a question like country name </UNIT>
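The UNIT elements of Figure 2 can be read with a standard XML parser. The following Python sketch is a minimal illustration under stated assumptions: the TRANSCRIPT wrapper is a hypothetical root element added only so that the fragment parses on its own (the root element of the distributed files is not shown in this paper), and read_units is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Two UNIT elements copied from Figure 2. The <TRANSCRIPT> wrapper is a
# hypothetical root added only so that the fragment is well-formed.
SAMPLE = """<TRANSCRIPT>
<UNIT speaker="SEK" startTime="543.109" endTime="544.578"> ok </UNIT>
<UNIT speaker="SEK" startTime="544.578" endTime="545.99"> so this is the result </UNIT>
</TRANSCRIPT>"""


def read_units(xml_text):
    """Return (speaker, start, end, text) tuples for every UNIT element."""
    root = ET.fromstring(xml_text)
    units = []
    for unit in root.iter("UNIT"):
        start = float(unit.get("startTime"))  # time codes are in seconds
        end = float(unit.get("endTime"))
        text = (unit.text or "").strip()
        units.append((unit.get("speaker"), start, end, text))
    return units


units = read_units(SAMPLE)
for speaker, start, end, text in units:
    # Print the speaker, the start time, the utterance duration and the text.
    print(f"{speaker} {start:9.3f} {end - start:6.3f}s  {text}")
```

The start and end times make it possible to cut the audio signal utterance by utterance, which is the use of synchronization mentioned in the Introduction.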
6 Conclusions
With regard to applications, the MAVIR corpus has so far been used for the following tasks:
1. A descriptive study of speech dysfluencies in formal-register Spanish [3].
2. Development, training and testing of several ASR systems, among them the AVTS and the THALES-UPM systems. In addition, researchers working in the European project transLectures [13] have shown interest in the corpus and have been given a copy of the resource.
3. The corpus will be used in the word-spotting test task to be held during the IberSpeech 2012 conference. The LLI-UAM group has manually annotated more than 5000 words for use in this competitive evaluation.
The MAVIR corpus is a contribution to the resources available to the speech technology research community. The corpus is freely available for research purposes. Please contact Dr. Antonio Moreno Sandoval to obtain a copy of the DVDs ([email protected]).
References
1. Benedí, J.-M., Lleida, E., Varona, A., Castro, M.-J., Galiano, I., Justo, R., López de Letona, I., Miguel, A.: Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA. In Proc. of Fifth International Conference on Language Resources and Evaluation, LREC 2006. Genova, Italy (2006)
2. British National Corpus. http://www.natcorp.ox.ac.uk/
3. Campillos, L., Alcántara, M.: Speech Dysfluencies in Formal Context. Analysis based on
Spontaneous Speech Corpora. In Proc. Corpus Linguistics Conference 2009 (2009)
4. Corpus de Referencia del Español Actual. http://corpus.rae.es/creanet.html
5. Cresti, E., Moneglia, M., Bacelar do Nascimento, F., Moreno-Sandoval, A., Veronis, J.,
Martin, P., Choukri, K., Mapelli, V., Falavigna, D., Cid, A., Blum, C.: The C-ORAL-ROM
Project. New methods for spoken language archives in a multilingual romance corpus. In
Proc. of Language Resources and Evaluation Conference 2002. Las Palmas, Spain. (2002).
C-ORAL-ROM official web site: http://lablita.dit.unifi.it/coralrom
6. Dutch Spoken Corpus. http://lands.let.ru.nl/cgn/ehome.htm
7. Garrote Salazar, M.: CHIEDE. Corpus de habla infantil espontánea del español. Ph.D.
Thesis. Madrid: UAM Publishing Service. (2008)
8. González Ledesma, A., De la Madrid, G., Alcántara Plá, M., De la Torre, R., Moreno-Sandoval, A.: Orality and Difficulties in the Transcription of Spoken Corpora. In Proc. of the Workshop on Compiling and Processing Spoken Language Corpora, LREC 2004, Lisbon (2004)
9. Marcos Marín, F.: El Corpus Oral de Referencia de la Lengua Española contemporánea.
Project Report. Madrid (1992). http://www.lllf.uam.es/ESP/Info%20Corlec.html
10. Moneglia, M.: The C-ORAL-ROM resource. In Cresti, E., Moneglia, M. (eds.) p. 27
(2005)
11. Moreno Sandoval, A.: La evolución de los corpus de habla espontánea: la experiencia del
LLI-UAM. Actas de las Segundas Jornadas de Tecnologías del Habla. Granada (2002)
12. Moreno, A., De la Madrid, G., Alcántara, M., González, A., Guirao, JM., de la Torre, R.:
The Spanish Corpus. In Cresti, E., Moneglia, M. (eds.) C-ORAL-ROM: Integrated reference Corpora for Spoken Romance Languages, pp. 135-161. Amsterdam: John Benjamins
(2005)
13. TransLectures project (Transcription and Translation of Video Lectures).
http://llach.dsic.upv.es/~translectures/