Jornada de Seguimiento de Proyectos, 2010
Programa Nacional de Tecnologías Informáticas
BRAVO: Búsqueda de Respuestas
Avanzada Multimodal y Multilingüe
TIN2007-67407-C03
Paloma Martínez Fernández *
Universidad Carlos III de Madrid
José Miguel Goñi Menoyo**
Universidad Politécnica de Madrid
Antonio Moreno Sandoval***
Universidad Autónoma de Madrid
Abstract
The project aims to create a multimodal (text and voice) and multilingual question answering
platform that integrates the modules developed by the participating groups. The starting
hypothesis is that the question answering performance of current systems can be improved by
working on the modules that make up the architecture of such a system: specifically, the
multilingual IR modules, indexing, speed of information access, answer extraction and ranking,
and question analysis. We deal with web information, encyclopaedic resources, scientific
documents and news. Linguists' work is therefore essential to develop and adapt appropriate
resources, as well as to integrate lexical and software resources. We also aim to apply these
techniques and this methodology to other areas, such as ontologies and information retrieval,
named entities and voice interaction, investigating ways of adapting these tasks to new domains
and languages.
Keywords: question answering, information retrieval, linguistic resources
1 Project Goals and resources
The aim of this project is to develop a platform for question answering over multimedia content.
This environment will allow the analysis of available techniques and methods in multilingual
Information Retrieval (IR), question answering (QA), information and ontology extraction, and
automatic recognition of spontaneous speech. Moreover, it is important to focus on Spanish, both
as the query language and in the document collections. This objective also implies applying new
techniques and enhancing current ones by defining and evaluating hybrid techniques. The scope of
the project is not limited to textual objects: it extends to multimedia objects, which are
described using particular cases of the document representations used for textual objects.
Partial objectives are:
• To create a multimodal (text and voice) and multilingual QA Platform to access multimedia
contents.
• To integrate into this platform the components for the different online and offline phases of
QA systems, advancing the state of the art in this field (information retrieval, answer
extraction and ranking, and question analysis).
• To define, implement and evaluate the updates needed to integrate an IR subsystem into the QA
platform. In particular, units smaller than documents (sentences, paragraphs, etc.) will be
handled in order to locate the required information, together with the intelligent treatment of
entities (named entity recognition) and the integration of lexical and semantic knowledge in
query expansion.
• To evaluate the platform in international forums, mainly CLEF and TAC, among others.
• To develop linguistic resources for Arabic and Japanese.
• To integrate linguistic resources that allow better processing of spontaneous speech, in order
to adapt a speech recognizer to user queries.
• To design data models in specific domains in order to build ontologies using semi-automatic
techniques.
To achieve these goals, the project was assigned 8 EDP (full-time equivalent researchers) from
the LABDA-UC3M subproject, 5.5 from the GSI-UPM subproject and 9 from the LLI-UAM subproject.
The planning of each subproject is shown in Annex I.
2 Level of achievement
BRAVO has three subprojects. The first, BRAVO-BR (LABDA-UC3M), carries out the platform
development tasks and the components for information extraction, question analysis with temporal
expressions, and speech recognition adapted to QA systems. The second, BRAVO-RI (GSI-UPM), is in
charge of the multimedia information retrieval components (storage and access optimization,
paragraph retrieval, integration of resources) as well as domain semantics modelled using
metadata and ontologies for restricted application domains. The third, BRAVO-RL (LLI-UAM),
provides the multilingual resources needed to extend the platform to non-Western languages. In
particular, it develops resources and linguistic tools for Arabic and Japanese, as well as a
spoken corpus for processing Spanish questions. Figure 1 shows the participation of each team in
the project.
2.1 Question Answering and Information Extraction
The main result is a software platform that includes a QA system whose original modules have
been improved; a language-independent named entity recognizer, SPINDEL, which applies machine
learning based on bootstrapping techniques [2, 3, 4]; and a temporal expression recognizer
[20, 21, 22]. An evaluation module has been added with a twofold function. First, it tests the QA
system in different domains. Second, for voice input, a software tool, RET (Recognition
Evaluation Tool), has been developed to test the output of commercial ASR systems. RET has been
used in three scenarios: queries to the QA system, automatic transcription of audio from video
files [27, 44], and a real-time captioning system used in the classroom for deaf students [9].
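The report does not state which measure RET computes. A common way to score ASR output against a
reference transcript is the word error rate (WER), and the following is a minimal sketch under
that assumption. The function name is illustrative and is not part of RET.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words read so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A spoken query transcribed with one missing word out of four would score 0.25 under this
measure. WER can exceed 1.0 when the recognizer inserts many spurious words.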
Regarding the work on building our own ASR system (activities 1.5.2 to 1.5.5, see planning in
Annex I), we decided to work exclusively with commercial ASR systems (Via Voice and Dragon) in
activities 1.5.1 and 1.5.6, because only one technical engineer was recruited and he is working
on integrating the modules developed in the project into the QA platform.
These developments have been tested in three domains: news collections (EFE), Wikipedia and
scientific documentation from Medline (biomedical texts).
In the biomedical domain, a prototype has been developed for drug name recognition and drug-drug
interaction extraction in the medical literature, using UMLS, dictionaries and the USAN drug
naming rules. As a result, a corpus is available that was automatically annotated by the DrugNer
system with generic drug names and other biomedical concepts and then manually evaluated by a
pharmacological expert. The corpus consists of 849 abstracts downloaded from PubMed and is
available at http://basesdatos.uc3m.es/index.php?id=359. The system combines information obtained
by the UMLS MetaMap Transfer (MMTx) program with the nomenclature rules recommended by the World
Health Organization (WHO) International Nonproprietary Names (INN) Programme to identify and
classify pharmaceutical substances [13, 14, 15, 16, 17, 18].
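The way nomenclature rules can flag and classify drug names can be sketched as a suffix lookup
over INN/USAN stems. This is a minimal illustration, not the DrugNer implementation: the stem
table is a small hand-picked subset, and the function name and the minimum prefix length are
assumptions made for the example.

```python
import re

# Hand-picked subset of INN/USAN stems (assumption: the prototype's full rule set is not listed).
STEM_CLASSES = {
    "cillin": "penicillin antibiotic",
    "olol": "beta-blocker",
    "pril": "ACE inhibitor",
    "vastatin": "HMG-CoA reductase inhibitor",
    "azepam": "benzodiazepine",
    "mab": "monoclonal antibody",
}

MIN_PREFIX = 3  # hypothetical guard: require at least 3 letters before the stem


def classify_drug_candidates(text):
    """Return (token, drug class) pairs for tokens that end in a known stem."""
    # Try longer stems first so the most specific stem wins.
    stems = sorted(STEM_CLASSES, key=len, reverse=True)
    found = []
    for token in re.findall(r"[a-z]+", text.lower()):
        for stem in stems:
            if token.endswith(stem) and len(token) - len(stem) >= MIN_PREFIX:
                found.append((token, STEM_CLASSES[stem]))
                break
    return found
```

Suffix rules alone over-generate: without the prefix check, "April" would match the stem -pril.
This is one reason to combine them with dictionary lookups such as those provided by MMTx.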
These techniques have been evaluated in several forums: the CLEF 2008 and 2009
(http://www.clef-campaign.org) track on Multilingual Question Answering (QA@CLEF) [5, 6, 7], the
Second Web People Search Evaluation [3, 26] and the Text Analysis Conference (TAC 2009) [2].
2.2 Multimedia Information Retrieval
During the first two years of the project, the GSI-UPM team has been working on the development
and continuous improvement of the IDRA tool, and on testing this and other available tools (some
developed previously in the research group) by participating in international competitions on
Information Retrieval and related disciplines.
Although several tools and indexing systems were available, the team decided to develop a new
one. The new tool had to be open to different formats and functionality for evaluating new
techniques in multimedia information retrieval (in particular, text and image annotations). A
prerequisite was that the parameters used to compute relevance and similarity among documents,
news, technical reports, or even simple image annotations could be easily changed for
experiments. In addition to basic functionality, IDRA offers advanced functions for managing and
storing content efficiently. Its design is flexible and it is very well documented, in order to
facilitate future enhancement.
Regarding the development of IDRA [24, 39, 40], the first tasks were the review and adaptation
of previously existing resources. The key points since then are: (a) It is fully implemented in
Java, using the most appropriate data structures for index management. As a result, indexing
capacity has increased and response time for index queries has decreased; (b) Its interface
offers new functionality: more text formats can be indexed, LUCENE is integrated for comparing
results, the data and data structures stored after indexing can be viewed, browsed and managed,
and results can be analysed using different evaluation metrics; (c) IDRA is distributed under the
GPL 3.0 license (see http://sourceforge.net/projects/idraproject/).
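The idea of an index whose relevance parameters can be changed between experiments can be
illustrated with a toy inverted index. This sketch is written in Python for brevity and is not
IDRA code (IDRA is implemented in Java). The class name, the whitespace tokenizer and the choice
of BM25 as ranking function are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index with BM25 ranking; k1 and b are exposed for experiments."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1                        # term-frequency saturation
        self.b = b                          # strength of document-length normalisation
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_len = {}                   # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = text.lower().split()
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs sorted by decreasing BM25 score."""
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because `k1` and `b` are constructor arguments, the same collection can be re-ranked under
different settings. Setting `b=0.0`, for instance, switches off length normalisation, which can
be enough to reorder the top results on a small collection.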
A set of activities related to sentiment analysis has been started [28]. A full review of
available resources, taking multilinguality into account, was completed, together with a
comparative evaluation. The resources analysed included SentiWordNet, WordNet-Affect, VerbNet and
ConceptNet. Unfortunately, it was not possible to participate in the "SemEval Task on Affective
Text" competition, which would have allowed a more complete analysis.
LABDA-UC3M is working on a methodological approach that applies metadata of multimedia
content to improve web accessibility [7, 8, 10, 11].
Regarding evaluation activities, in 2008 the team participated in NTCIR-7, an international
competition on information retrieval issues for Asian languages (as well as English). We took
part in the task on multilingual sentiment analysis for Asian languages and English, submitting a
few experiments.
We participated in the 2008 and 2009 CLEF editions, continuing our uninterrupted participation
since the 2003 edition, and submitted several experiments in different tracks. In CLEF 2008 the
tasks were ImageCLEFphoto, ImageCLEFmed, ImageCLEF Medical Image Annotation and VideoCLEF. In
CLEF 2009 the tasks were ImageCLEFphoto and ImageCLEFmed [25, 29, 30, 31, 32, 33, 34, 35]. In the
ImageCLEFmed tasks we tried to improve the retrieval of medical images from multilingually
annotated, heterogeneous collections using semantic expansion techniques [36, 37, 42]. In
ImageCLEFphoto the IDRA tool was used in some experiments (integrating text retrieval data and
content-based image retrieval data), and in others clustering techniques were tried for ordering
the results returned by the queries.
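The report does not say how the text-based and content-based result lists were combined. One
simple option is late fusion: normalise each list and take a weighted sum. The sketch below
assumes that approach, and all names in it are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to [0, 1]; a constant list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(text_scores, image_scores, text_weight=0.5):
    """Weighted linear combination of two normalised result lists."""
    text_norm = min_max(text_scores)
    image_norm = min_max(image_scores)
    fused = {}
    for doc in set(text_norm) | set(image_norm):
        # A document missing from one list contributes 0 for that modality.
        fused[doc] = (text_weight * text_norm.get(doc, 0.0)
                      + (1 - text_weight) * image_norm.get(doc, 0.0))
    # Sort by decreasing score, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

With `text_weight=1.0` the ranking reduces to the text-only run and with `text_weight=0.0` to the
image-only run, so the weight can be tuned between the two extremes.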
2.3 Linguistic Resources
The main tasks of LLI-UAM in BRAVO are: (a) creation of new multilingual resources in Arabic,
Spanish and Japanese [52, 54]; (b) design and annotation of a Spanish speech corpus of questions
[9]; (c) definition of a model for question classification; (d) addition of linguistic resources
to improve the handling of spontaneous speech, in order to adapt a voice recognizer to the way
questions are formulated.
Of these goals, (a) is the most important and time-consuming for the subproject. In this task
LLI-UAM has worked alone, without coordination with the other two subprojects. For the last three
tasks, in contrast, LLI-UAM has worked in close collaboration with the LABDA-UC3M team, as those
linguistic resources are basic for training the QA system [5, 6, 20].
As for the language resources (LR) developed, current work includes: improvement of a Spanish
PoS tagger and phonological transcriber; development of an Arabic PoS tagger; development of a
child corpus of Spanish [50]; development of an acoustic database of questions for Spanish and
Arabic; development of a spontaneous speech corpus of Japanese; and development of a basic audio
lexicon of Japanese for teaching purposes.
The most outstanding resources, in terms of innovation, are those devoted to Arabic (the tagger
and the acoustic database) since there are few groups in the world working on Arabic NLP and LR.
3 BRAVO mid-term results
3.1 Personnel in training
With respect to the training of human resources, several doctoral dissertations have been
completed or are in progress:
• César de Pablo Sánchez, “Semisupervised learning of patterns for answer extraction in QA
systems” (July 2010), European mention, LABDA-UC3M
• Isabel Segura Bedmar, “Application of information extraction techniques to pharmacological
domain: extracting drug-drug interactions” (April 2010), European mention, LABDA-UC3M
• Lourdes Moreno, “AWA, a methodological Framework specific of accessibility to develop
web applications” (March 2010), LABDA-UC3M
• Marta Garrote Salazar: “CHIEDE: corpus de habla infantil espontánea del español”. 2008.
LLI-UAM.
• Ana González Ledesma: “Los marcadores del discurso en el corpus C-ORAL-ROM: anotación
pragmática, estrategias computacionales de etiquetado y aplicaciones a otros campos”. 2010.
LLI-UAM.
• Julio Villena: “Hybrid Models for Information Retrieval”, GSI-UPM, in progress.
• Sara Lana: “Cognitive models of feedback for Information Retrieval”, GSI-UPM, in progress.
• Mª Teresa Vicente Díez, “Reconocimiento expresiones temporales en castellano y su
aplicación a la extracción de información”, LABDA-UC3M, in progress.
In addition, two new trainee researchers have joined the LLI-UAM team: Alicia González (FPU
grant) and Leonardo Campillos (predoctoral contract funded by the Madrid Regional Government).
Ten undergraduate students have also carried out their master's theses on topics related to the
project research.
3.2 Coordination
The coordination of the three subprojects is reflected in the evaluation of the platform in the
international CLEF forum as the MIRACLE team, which includes the three research teams plus the
company DAEDALUS (EPO in the project proposal). CLEF participation (as shown in the publications
section) has taken place in the “Multilingual Question Answering (QA@CLEF)” and “Cross-Language
Image Retrieval (ImageCLEF)” tracks.
Moreover, the three research groups belong to the MAVIR consortium (a network of excellence
funded by the Madrid Regional Government, www.mavir.net), in which they have actively
participated in several workshops, conferences and other projects. LABDA-UC3M organized the
Spanish Conference on Natural Language Processing (SEPLN 2008)
(http://basesdatos.uc3m.es/sepln2008/web/), where researchers from the three groups took part in
sessions about language technologies. LLI-UAM organized the VIII Congreso Nacional de Lingüística
General (http://elvira.lllf.uam.es/clg8/), with a session on multilingual natural language
processing with GSI-UPM researchers.
The joint research between the LLI-UAM and GSI-UPM teams goes back more than 15 years. Both
groups have participated in several coordinated projects, as well as in joint publications and
software development. The relationship with LABDA-UC3M is more recent, but has been very
intensive over the last five years: both teams participate jointly in BRAVO and in MAVIR. UAM and
UC3M have exchanged researchers (Dr. Doaa Samy and Dr. Marta Garrote) for a few months, with
excellent results for research output.
As a final proof of the satisfactory research experience, the three groups in the project are
again submitting a coordinated proposal to the next R&D call.
3.3 Collaboration with other national and international research groups
LLI-UAM has strengthened its relations with international groups closely related to the research
interests of the project and to the previous connections of the LLI members:
Cairo University (CU): Dr. Samy is an Associate Professor of Spanish and Computational
Linguistics there. In 2009, with a grant from the AECID, a Spanish-Egyptian Workshop on NLP and
LR for Spanish and Arabic was held in Cairo, co-organized with Dr. Moreno. The most important
result of this international cooperation has been the signing of a research agreement between
UAM and CU, promoted by Moreno and Samy respectively.
Tokyo University of Foreign Studies (TUFS): there is already a student/teacher exchange agreement
between UAM and TUFS, for which Kimura is responsible at UAM. In 2009-10 a UAM-Banco Santander
grant for research with Asian institutions was received. During this period several visits to
Tokyo have been scheduled to record spontaneous speech for our corpus of spoken Japanese.
Language Technology Lab at DFKI, Saarbrücken: Dr. Alcántara has been a postdoctoral visiting
researcher for two years, working on different projects related to multimodal processing. The
relationship will be maintained in the future through the participation of T. Declerck in the
next project proposal, as an external member of the LLI.
LABDA-UC3M: In the last two years (2008 and 2009) a considerable effort has been made to promote
mobility, with the aim of exchanging knowledge with other relevant national and international
research groups. The researchers taking part in this project proposal have completed several
stays: in 2008 Lourdes Moreno spent three months at DSIC at UPV under the supervision of
Dr. Oscar Pastor, and Isabel Segura visited the Natural Language Engineering group under the
supervision of Dr. Paolo Rosso; in 2009 César de Pablo and Isabel Segura spent 6 months at DFKI,
Saarbrücken (Germany), with Thierry Declerck. Finally, José Luis Martínez is finishing his PhD,
entitled “Incorporating semantics in a software process development through Business Rules”
(April 2010), with José Carlos González and Paloma Martínez as supervisors.
3.4 Technology transfer
The BRAVO project is of great importance to the company DAEDALUS because of its interest in QA
technology and its integration into voice user interfaces. For this reason, the company has
developed, in collaboration with the teams, a web QA system working on the Spanish Wikipedia
called respond.es, available at http://miracle2.uc3m.es:8180/QAGWTInterface/, and has funded
grants for three undergraduate students from UC3M to do their master's theses on this
demonstrator. This has enabled DAEDALUS to follow advances in the state of the art in QA
technologies as well as the application of ASR technology to this kind of application. If the
results of the BRAVO project make it viable to define a product that could be commercialised, an
agreement could be signed among the authors. GSI-UPM and LABDA-UC3M work as DAEDALUS university
partners in the CENIT-E project BUSCAMEDIA: Hacia una adaptación semántica de medios digitales
multirred-multiterminal (CEN-20091026, 2009-2012).
As a result of research in QA, the system “SQUASH: A Question Answering System for Spanish” was
jointly developed by researchers from the LABDA-UC3M and LLI-UAM teams. It is part of the
Technology Portfolio, Technical Services and R&D Networks promoted by the Fundación para el
conocimiento madrid+d in 2008. SQUASH is a modular question answering system for the Spanish
language. It enhances traditional search engine functionality by providing precise answers in
real time to questions in natural language.
The usefulness of the results for society is related to the impact of language technologies on
the Information Society. Language resources, the main line of work of LLI-UAM, provide data for
inferring knowledge and for training NLP systems. The multimodal (audio and text) and
multilingual nature of the resources compiled during the project is a clear sign of the
innovation of the research. In addition to supporting NLP, some of these linguistic resources are
also used by the team's researchers in teaching spoken language, especially Spanish and Japanese.
This latter application was not foreseen in the project proposal and is becoming a very active
and productive line of work (a couple of books will soon be in print).
4 References
LABDA-UC3M Publications
[1] Castro, E. Castaño, L. and Martínez, P. Evaluation of a named entity recognition system over
SNOMED CT., Simposio OpenHealth-Spain, Universidad de Alcalá, 29-30 April 2009.
[2] César de Pablo-Sanchez, Juan Perea, Isabel Segura-Bedmar, Paloma Martinez. The UC3M
team at the Knowledge Base Population task. Text Analysis Conference (TAC 2009), November
2009.
[3] De Pablo Sánchez C. and Martínez, P. UC3M at WePS2-AE: Acquiring Patterns for People
Attribute Extraction from Webpages. 2nd Web People Search Evaluation Workshop, April 21st Madrid, Spain, Co-located with the WWW2009 conference.
[4] De Pablo, C.; Martínez, P. (2009). Building a Graph of Names and Contextual Patterns for
Name. ECIR 2009, LNCS 5478 Springer 2009, pp. 530-537.
[5] De Pablo-Sánchez, C., Martínez-Fernández, J.L., González-Ledesma, A., Samy D., Martínez
P., Moreno-Sandoval A. and Al-Jumaily, H. Combining Wikipedia and Newswire Texts for
Question Answering in Spanish. Advances in Multilingual and Multimodal Information Retrieval,
CLEF 2007, Revised Selected Papers, LNCS 5152, pp. 352-355.
[6] Martínez-González, A., de Pablo-Sánchez, C., Polo-Bayo, C., Vicente-Díez, M.T., Martinez-Fernández, P., Martínez-Fernández, J.L. 2008. The MIRACLE Team at the CLEF 2008
Multilingual Question Answering Track. CLEF 2008, Revised Selected Papers. LNCS 5706, pp.
409-420.
[7] Moreno, L., Martínez, P. and Ruiz, B. “Disability Standards for Multimedia on the Web”.
Volume 15, issue 4, 2008, IEEE Multimedia pp:52-54.
Moreno L., Martínez P. and Ruiz B. “Guiding accessibility issues in the design of Websites”,
SIGDOC´08. Sep 22-24, Lisboa, Portugal, 2008.
[8] Moreno L., Martínez P. and Ruiz B. “Integrating HCI in a Web accessibility engineering
approach”. 13th International Conference on Human-Computer Interaction. HCI 2009 19-24 July
09, San Diego, CA, USA.
[9] Moreno, J., Garrote, M., Martínez, P. and Martínez-Fernández, J.L. Some experiments in
evaluating ASR systems applied to multimedia retrieval, 7th Workshop on Adaptive Multimedia
Retrieval, 24-25 September, Madrid 2009.
[10] Moreno, L., Martínez, P. and Ruiz-Mezcua, B. «A bridge to Web Accessibility from the
Usability Heuristics», 5th annual Usability Symposium USAB 2009. Usability & HCI eInclusion
Springer LNCS 5889. November 09-10, 2009. Linz, Austria.
[11] Moreno, L., Martínez, P. and Ruiz-Mezcua, B. «Guías metodológicas para contenidos
multimedia accesibles en la Web». Interacción 2009, X Congreso Internacional de Interacción
Persona-Ordenador, 7-9 Septiembre 2009, Barcelona, Spain.
[12] Pérez-Lainez, R. Iglesias, A., de Pablo-Sanchez, C. ANONIMYTEXT: Anonimization of
Unstructured Documents, KDIR 2009, November 2009.
[13] Segura-Bedmar, I., Crespo, M. de Pablo-Sánchez, C (2009) Score-based approach for
Anaphora Resolution in Drug-Drug Interactions Documents. 14th International Conference on
Applications of Natural Language to Information Systems (NLDB 2009).
[14] Segura-Bedmar, I., Crespo, M. de Pablo-Sánchez, C., Martínez, P. (2009) DrugNerAR:
Linguistic Rule-Based Anaphora Resolver for Drug-Drug Interaction Extraction in
Pharmacological Documents. ACM Third International Workshop on Data and Text Mining in
Bioinformatics (DTMBIO 09), november 2009.
[15] Segura-Bedmar, I.; Martínez, P.; Samy, D. (2008). A preliminary approach to recognize generic
drug names by combining UMLS resources and USAN naming conventions. BIONLP'08,
Association for Computational Linguistics (ACL), Columbus, Ohio, 19 June 2008.
[16] Segura-Bedmar, I.; Martínez, P.; Samy, D. (2008). Detección de fármacos genéricos en textos
biomédicos. Revista Española para el procesamiento del lenguaje natural. 40, 27-34.
[17] Segura-Bedmar, I.; Martínez, P.; Segura-Bedmar, M. (2008). Drug name recognition and
classification in biomedical texts. Drug Discovery Today. 13, (17/18), 816-823.
[18] Segura-Bedmar, Isabel, Crespo, Mario, de Pablo-Sánchez, Cesar, Martínez, Paloma. (2010).
Resolving anaphoras for the extraction of drug-drug interactions in pharmacological documents.
To appear in BMC BioInformatics.
[19] Vicente-Díez, M.T. and Martínez, P. Aplicación de técnicas de extracción de información
temporal a los sistemas de búsqueda de respuestas. Revista Procesamiento del Lenguaje Natural. N.
42 (marzo 2009); pp.25-30.
[20] Vicente-Díez, M.T., de Pablo-Sánchez, C., Martinez-Fernández, P., Moreno, J. and Garrote,
M. 2009. Are Passages Enough? The MIRACLE Team Participation at QA@CLEF2009 . In
Cross-Language Evaluation Forum (CLEF) 2009 Working Notes, in ECDL 2009 conference.
Corfú, Greece. September 2009.
[21] Vicente-Díez, M.T., Martínez P. 2009. Temporal Semantics Extraction for Improving Web
Search. 8th International Workshop on Web Semantics (WebS' 09), in Proceedings of the 20th,
DEXA 2009, Linz, Austria, 31 August - 4 September, 2009.
[22] Vicente-Díez, M.T., Samy, D. and Martínez, P. An empirical approach to a preliminary successful
identification and resolution of temporal expressions in Spanish news corpora. In Proceedings of
the Sixth International Language Resources and Evaluation Conference (LREC'08). European
Language Resources Association (ELRA). Marrakech, Morocco. 28-30 May 2008.
GSI-UPM Publications
[23] Ana García-Serrano and José Miguel Goñi-Menoyo. Applied Research in Linguistic
Engineering: Resources and Tools. “Egyptian-Hispanic Meeting on Language Processing and
Language Resources in Spanish and Arabic” Cairo University, Egypt, 1-4 November 2009.
Supported by AECID and Mavir Consortium.
[24] Ana García-Serrano, Xaro Benavent, Rubén Granados and José Miguel Goñi-Menoyo. Some
Results Using Different Approaches to Merge Visual and Text-Based Features in CLEF’08 Photo
Collection, Evaluating Systems for Multilingual and Multimodal Information Access. 9th
Workshop of the Cross-Language Evaluation Forum, CLEF 2008, LNCS 5706.
[25] González-Cristóbal, José C.; Goñi-Menoyo, José M.; Villena-Román, Julio; and Lana-Serrano,
Sara. (2008) MIRACLE Progress in Monolingual Information Retrieval at Ad-Hoc CLEF 2007.
Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 156-159.
[26] González-Cristóbal, José C.; Maté, Pablo; Vadillo, Laura; Sotomayor, Rocío; and Carrera,
Álvaro. Learning by doing: A baseline approach to the clustering of web people search results. In
Proceedings of the 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW
Conference. Madrid, Spain, April 2009.
[27] Julio Villena-Román, Sara Lana-Serrano (2008, September) MIRACLE at VideoCLEF 2008:
Classification of Multilingual Speech Transcripts. Working Notes for the CLEF 2008 Workshop.
[28] Julio Villena-Román, Sara Lana-Serrano and José C. González-Cristóbal (2008, December)
MIRACLE at NTCIR-7 MOAT: First Experiments on Multilingual Opinion Analysis.
[29] Julio Villena-Román, Sara Lana-Serrano and José Carlos González-Cristóbal (2008)
MIRACLE at ImageCLEFmed 2007: Merging Textual and Visual Strategies to Improve Medical
Image Retrieval. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop of
the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 593-596.
[30] Julio Villena-Román, Sara Lana-Serrano, José C. González-Cristóbal (2008) MIRACLE-GSI
at ImageCLEFphoto 2008: Experiments on Semantic and Statistical Topic Expansion. Working
Notes for the CLEF 2008 Workshop.
[31] Julio Villena-Román, Sara Lana-Serrano, José C. González-Cristóbal (2009) MIRACLE-GSI
at ImageCLEFphoto 2009: Comparing Clustering vs. Classification for Result Reranking. Working
Notes for the CLEF 2009.
[32] Julio Villena-Román, Sara Lana-Serrano, José Luis Martínez-Fernández and José Carlos
González-Cristóbal. (2008) MIRACLE at ImageCLEFphoto 2007: Evaluation of Merging
Strategies for Multilingual and Multimedia Information Retrieval,” Advances in Multilingual and
Multimodal Information Retrieval. 8th Workshop of the Cross-Language Evaluation Forum,
CLEF 2007, LNCS, vol. 5152, pp. 500-503.
[33] Lana-Serrano, Sara; Villena-Román, Julio; and González-Cristóbal, José-C.: MIRACLE at
ImageCLEFmed 2008: Semantic vs. Statistical Strategies for Topic Expansion, Evaluating Systems
for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language
Evaluation Forum, CLEF 2008, LNCS 5706.
[34] Lana-Serrano, Sara; Villena-Román, Julio; and González-Cristóbal, José-C.: MIRACLE-GSI at
ImageCLEFphoto 2008: Different Strategies for Automatic Topic Expansion., Evaluating Systems
for Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language
Evaluation Forum, CLEF 2008, LNCS 5706.
[35] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José C.; and Goñi-Menoyo,
José M. (2008) MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of
Geographical Information. Advances in Multilingual and Multimodal Information Retrieval. 8th
Workshop of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 786-793.
[36] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José C.; and Goñi-Menoyo,
José M. (2008) MIRACLE at ImageCLEFanot 2007: Machine Learning Experiments on Medical
Image Annotation. Advances in Multilingual and Multimodal Information Retrieval. 8th Workshop
of the Cross-Language Evaluation Forum, CLEF 2007, LNCS, vol. 5152, pp. 597-600.
[37] Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José Carlos; and Goñi-Menoyo,
José Miguel. (2009) MIRACLE at ImageCLEFannot 2008: Nearest Neighbour Classification of
Image Feature Vectors for Medical Image Annotation. Evaluating Systems for Multilingual and
Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation Forum, CLEF
2008, LNCS 5706.
[38] R. Granados, X. Benavent, A. García-Serrano, J.M. Goñi. (2008) MIRACLE-FI at
ImageCLEFphoto 2008: Experiences in merging Text-based and Content-based Retrievals.
Working Notes for the CLEF 2008 Workshop.
[39] R. Granados, X. Benavent, R. Agerri, A. García-Serrano, J.M. Goñi, J. Gomar, E. De Ves, J.
Domingo, G. Ayala. (2009) MIRACLE-FI at ImageCLEFphoto 2009. Working Notes for the
CLEF 2009 Workshop
[40] Rubén Granados Muñoz, Ana García Serrano, José M. Goñi Menoyo. La herramienta IDRA
(Indexing and Retrieving Automatically). XXV Conferencia de la Sociedad Española para el
Procesamiento del Lenguaje Natural (SEPLN’09). San Sebastián, 2009.
[41] Sara Lana-Serrano, Julio Villena-Román, José C. González-Cristóbal (2008) MIRACLE at
ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion. Working Notes for
the CLEF 2008 Workshop.
[42] Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal, José Miguel Goñi-Menoyo. (2008) MIRACLE at ImageCLEFannot 2008: Classification of Image Features for
Medical Image Annotation. Working Notes for the CLEF 2008 Workshop
[43] Sara Lana-Serrano, Julio Villena-Román, José Carlos González-Cristóbal. (2009) MIRACLE
at ImageCLEFmed 2009: Reevaluating Strategies for Automatic Topic Expansion. Working Notes
for the CLEF 2009 Workshop
[44] Villena-Román, Julio; and Lana-Serrano, Sara: MIRACLE at VideoCLEF 2008: Topic
Identification and Keyframe Extraction in Dual Language Videos, Evaluating Systems for
Multilingual and Multimodal Information Access. 9th Workshop of the Cross-Language Evaluation
Forum, CLEF 2008, LNCS 5706.
LLI-UAM Publications
[45] Alcántara, M. Introducción al análisis de estructuras lingüísticas en corpus. Aproximación
semántica. Madrid: Servicio de Publicaciones de la UAM, 2007.
[46] Alcántara, M. "El análisis lingüístico en la transcripción automática de la lengua hablada, el
Proyecto COAST", VIII Congreso de Lingüística General: El valor de la diversidad
[meta]lingüística, Madrid, 2008.
[47] Alcántara, M. "La anotación del habla en corpus de vídeo" in Revista de Procesamiento del
Lenguaje Natural, 8, 2007.
[48] Alcántara, M. "Uso de corpus de habla espontánea en la enseñanza de la cortesía en español"
in Nicolás, Carlota: Ricerche sul Corpus del parlato romanzo C-ORAL-ROM. Studi linguistici e
applicazioni didattiche per l'insegnamento di L2. Firenze: Firenze University Press, 2007.
[49] Campillos, L. "Las expresiones causales en el corpus de habla espontánea C-ORAL-ROM".
In Actas del 8º Congreso de Lingüística General, UAM, 25-28 June 2008.
[50] Garrote, M., Guirao, J.M. and Moreno, A. "Extracción de unidades distintivas en adultos y niños
de un corpus de lengua oral espontánea". 8º Congreso de Lingüística General, UAM, June 2008.
[51] González-Ledesma, A. "Pragmatext, Annotating the Spanish C-ORAL-ROM Corpus with
Pragmatic Knowledge", 4th Corpus Linguistics Conference, University of Birmingham, July 2007.
[52] González-Ledesma, A. and Samy, D. "Marcadores discursivos en árabe y español: un estudio
computacional basado en corpus paralelos con anotación pragmática". 8º Congreso de Lingüística
General, UAM, 25-28 June 2008.
[53] Gozalo, P. "Reflexiones sobre el futuro. Los datos del español no nativo". 8º Congreso de
Lingüística General, UAM, 25-28 June 2008.
[54] Kimura, C. "The constancy and alteration in the respect language of Japanese". Panel titled
"Re-creation of Identities in East Asia: Literature and Linguistics", 5th International Convention of
Asia Scholars, Kuala Lumpur, 2007.
[55] Moreno Sandoval, A., Guirao, J.M. and Torre Toledano, D. "Herramientas de anotación de
corpus de habla espontánea del Laboratorio de Lingüística Informática de la UAM". Revista de la
Sociedad Española para el Procesamiento del Lenguaje Natural, No. 41, 2008.
[56] Moreno Sandoval, A., T. Toledano, D., De La Torre, R., Garrote, M. and Guirao, J.M.
"Developing a Phonemic and Syllabic Frequency Inventory for Spontaneous Spoken Castilian
Spanish and their Comparison to Text-Based Inventories". LREC 2008, Marrakech.
[57] Moreno Sandoval, A. (editor). Actas del VIII Congreso de Lingüística General: El valor de la
diversidad [meta]lingüística. Madrid. CD-ROM. ISBN 978-84-96487-19-9, 2008.
[58] Samy, D. and González-Ledesma, A. "Pragmatic Annotation of Discourse Markers in a
Multilingual Parallel Corpus (Arabic-Spanish-English)". LREC 2008, Marrakech, May 2008.
Figure 1: Linguistic and software modules and Participants in BRAVO project
ANNEX I: PLANNING SUBPROJECT 1
Activities/Tasks
1.1.1 Project Management and Coordination
1.1.2 Subproject 1 Management
1.2 Platform for resources integration
1.3.1 Definition of the architecture
1.3.2 Adjustment of existing resources, modules and prototypes
1.3.3 Integration in the environment
1.4.1 Analysis of the state of the art in answer extraction
1.4.2 Analysis and implementation of a module for temporal exp..
1.4.3 Flexible answer extraction
1.4.4 Validation and ranking of answers
1.5.1 Evaluation of commercial and open solutions
1.5.2 Evaluation and implementation of a SAD
1.5.3 Module for feature extraction
1.5.4 Estimation of the acoustic model
1.5.5 Estimation of a Language model
1.5.6 Validation of the system in several environments
1.6.1 Textual questions analysis module
1.6.2 Oral questions analysis module
1.6.3 Validation of prototype
1.7.1 Design and implementation of a probabilistic QA model
1.7.2 Integration and validation of the prototype
1.8.1 Participation in International Evaluation forums
1.9.1 Web demonstrator of QA prototypes
1.9.2 Publication of research results
PLANNING SUBPROJECT 2
Activities/Tasks
2.1.1 Management of subproject 2
2.1.2 Coordination of subproject 2
2.2.1 Debugging and enlargement of existing linguistic …
2.2.2 Compilation of semantic resources based on ….
2.3.1 Debugging and enlargement of linguistic processing …
2.3.2 Development of entities recognition module
2.3.3 Development of semantic processing module
2.3.4 Modules for multimedia processing
2.4.1 Domain modeling
2.5.1 Analysis of actual status of IR in QA systems
2.5.2 Integration/adaptation of specific improvements for QA
PLANNING SUBPROJECT 3
Activities/Tasks
3.1.1. Management of subproject 3
3.2.1. Development of Arabic resources
3.2.2. Development of Spanish resources
3.2.3. Development of Japanese resources
3.3.1. Study of domain and design of recordings collec.
3.3.2. Collection of subcorpus of read speech
3.3.3. Collection of subcorpus of spontaneous speech
3.3.4. Annotation of the corpus of questions
3.3.5. Splitting the corpus: training and evaluation
3.4.1. Model to classify textual questions in Spanish
3.4.2. Model to classify textual questions in Arabic
3.4.3. Grammars for oral questions in Spanish
[Gantt charts for subprojects 1-3: monthly schedule of the tasks listed above across the first, second and third project years, with each active month marked by an x.]
