D3.1 Spanish and Italian text correction modules adapted to web
DELIVERABLE

Project Acronym: FLAVIUS
Grant Agreement number: ICT-PSP-250528
Project Title: Foreign LAnguage Versions of Internet and User generated Sites

D3.1 Spanish and Italian text correction modules adapted to web environment

Revision: 1.0
Authors: Sonia Collada Pérez, Julio Villena Román (Daedalus)

Project co-funded by the European Commission within the ICT Policy Support Programme
Dissemination Level: C. Confidential, only for members of the consortium and the Commission Services

Revision History

Revision  Date               Author                                     Organisation  Description
0.1       January 7th, 2011  Sonia Collada Pérez, Julio Villena Román   Daedalus
1.0

Statement of originality: This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.

1. Introduction

FLAVIUS aims to provide an easy and cost-effective way for webmasters to have their website translated and indexed into several languages. The spelling and grammar checker module aims to improve the quality of the source text so as to obtain a higher translation quality afterwards. Within the FLAVIUS project, the source languages taken into account are English, French, Spanish and Italian.

The aim of this document is to describe the modifications made to the spelling and grammar checker module for Spanish and Italian in order to adapt it to the web environment and, mainly, to user-generated content.

2. Daedalus contribution

Daedalus is a company established by a group of specialists in research, development, innovation and transfer of technology in the field of Information and Communications Technology (ICT).
In the field of text correction, Daedalus has developed a spelling and grammar checker that meets the needs of the media and publishing sectors and provides the quality required in these areas.

The FLAVIUS project aims to provide an easy and cost-effective way for webmasters to have their website translated and indexed into several languages. The aim of the Daedalus spelling and grammar checker within this project is to improve the quality of the source text so as to obtain a higher translation quality afterwards. Since the text to be translated will be user-generated content, it has been necessary to adjust the spelling and grammar checker to adapt its behaviour to this new scenario.

3. Error detection

The objective of this task is to carry out any modification needed for the Spanish and Italian spelling and grammar checker modules to be able to process user-generated content such as blog posts, reviews, etc. The first step is therefore to identify how this type of content differs from the formal language typically used by professional writers such as journalists.

In order to collect a text corpus, an RSS monitoring robot has been developed. This robot automatically downloads and processes RSS channels published on different sites belonging to Qype, one of our content-provider partners. The following table shows the list of RSS channels that are currently fed into the robot.
Table 1: List of RSS channels for Spanish and Italian

Spanish:
    http://www.qype.es/es300-madrid/rss
    http://www.qype.es/es511-barcelona/rss
    http://www.qype.es/es213-bilbao/rss
    http://www.qype.es/es523-valencia/rss
    http://www.qype.es/es530-palma-de-mallorca/rss
    http://www.qype.es/es212-donostia-san-sebastian/rss
    http://www.qype.es/es111-santiago-de-compostela/rss
    http://www.qype.es/es243-zaragoza/rss
    http://www.qype.es/es618-sevilla/rss
    http://www.qype.es/es/rss
    http://www.qype.es/uk/rss
    http://www.qype.es/fr/rss
    http://www.qype.es/it/rss

Italian:
    http://www.qype.it/es/rss
    http://www.qype.it/uk/rss
    http://www.qype.it/fr/rss
    http://www.qype.it/it/rss

In fact, we are already collecting information for English and French, as shown in the next table.

Table 2: List of RSS channels for English and French

English:
    http://www.qype.co.uk/es/rss
    http://www.qype.co.uk/uk/rss
    http://www.qype.co.uk/fr/rss
    http://www.qype.co.uk/it/rss

French:
    http://www.qype.fr/es/rss
    http://www.qype.fr/uk/rss
    http://www.qype.fr/fr/rss
    http://www.qype.fr/it/rss

Twice a day, checking reports are automatically built with the most up-to-date engines and then sent by email to a group of expert reviewers. These reports contain a list of errors, which are analyzed in order to detect false positives and false negatives. These errors are stored in a database, which makes it possible to assess the status of the checking engines.

4. System adaptation to web environment

The monitoring approach has highlighted the need for enhancements to the system, such as including a language detector, modifying the lexical base and enhancing the grammar engine.
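The monitoring robot itself is not described in detail in this document. As an illustration only, the collection step could be sketched as follows, assuming that the channels are standard RSS 2.0 feeds whose items carry guid, title and description elements. The function names (fetch_feed, extract_items, new_items) and the sample feed are hypothetical and are not part of the Daedalus system.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=10):
    """Download the raw XML of one RSS channel (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_items(rss_xml):
    """Return the <item> entries of an RSS 2.0 document as dictionaries."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            # Assumption: guid (or, failing that, link) identifies a review.
            "id": item.findtext("guid") or item.findtext("link") or "",
            "title": (item.findtext("title") or "").strip(),
            "text": (item.findtext("description") or "").strip(),
        })
    return items


def new_items(items, seen_ids):
    """Keep only items not processed before; updates seen_ids in place."""
    fresh = [it for it in items if it["id"] not in seen_ids]
    seen_ids.update(it["id"] for it in fresh)
    return fresh


# Made-up sample feed, used here instead of a real download.
SAMPLE = """<rss version="2.0"><channel>
<item><guid>r1</guid><title>Plaza Mayor</title>
<description>Muy bonita.</description></item>
<item><guid>r2</guid><title>Tour Eiffel</title>
<description>Bon Appétit !</description></item>
</channel></rss>"""

seen = set()
for review in new_items(extract_items(SAMPLE), seen):
    print(review["id"], review["text"])
```

In a real run, extract_items(fetch_feed(url)) would be called for each channel in the tables above, and only the fresh items would be passed on to the checking engines that build the reports.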
Language detector

One of the most common errors is related to different languages appearing in the same text, mainly when analyzing reviews from Qype (and also, manually, from TVTrip). Even when a review is written in Spanish, it may refer to a foreign place whose name is written in a different language, for instance:

    En mi reciente viaje a París visité otra vez la Tour Eiffel.

    Un tempo arena di corride e poi teatro, oggi è un tranquillo punto di ritrovo e di sosta. Particolarmente belli e caratteristici sono i caffè che si trovano ai suoi angoli. Vi sono molte bancarelle dove comprare stupendi souvenir.La Plaza Mayor acquista molto fascino sotto il periodo natalizio.

In this case, foreign proper nouns require special treatment. Our approach is to include a list of widely used (understood) proper names for important places in all the languages that the user could speak and the system is able to check.

Besides, many other expressions in a foreign language can be used to allude to a foreign place, and they could cause false positives, for instance:

    Creo que es uno de los mejores restaurantes de todo Paris los platos son maravillosos, hechos con ingredientes muy frescos. Todo está fantástico. Bon Appétit !

In this case, the strategy is to try to detect the fragment written in a different language and inhibit the checking process for it.

Lexical base

Web monitoring provides new lexical resources that are added daily to the spelling and grammar checker, such as:

• Proper names.
• Foreign words which are frequently used in blogs: brunch, risotto, cool, glamouroso, pen_drive, chic, chill-out.
• Colloquial and new expressions: corta_y_pega, nacho, calidad-precio, rojo_pasión.
• SMS-like writing.
• Emoticons.
• Interjections frequently used in blogs.

About 180 new proper nouns and common words were added daily during November and December (see Appendix 1).

Grammar engine

Because the content available on provider sites is generated by users, some of the system's grammar rules have been slightly modified or disabled to meet providers' needs.

• The typographic rule used to detect multiple punctuation marks has been disabled. For instance, the following is accepted as correct:
      ¡¡¡hola!!!
• The rule that returns a single error when several punctuation marks are unbalanced has been modified. For example:
      Hola!
  This sentence is now considered correct (instead of requiring ¡Hola!), although exclamation and question marks should be balanced with an opening and a closing mark.
• Email addresses, which might otherwise be flagged as errors, are now ignored, for example: [email protected]. However, the system cannot detect and ignore user names such as:
      chicoestelar / guillermoacosta / Misterpollo / carlitos77
  as it is not possible to tell whether such a token is a mistake or a user name.
• Configuration parameters have been modified so that style and typography correction can be disabled, given the type of content considered. These are the API input parameters:
  o txt: input text, UTF-8 encoding, in plain text, HTML or XML.
  o key: access key, needed for making any request. To get a valid access key, contact [email protected].
  o clang: language of the text. The allowed values are the following:
      es: Spanish
      it: Italian
      en: English
      fr: French
  o ilang: language of the interface. Valid values are the same as clang or en (English).
  o format: format of the output.
    The allowed values are the following (described later):
      xml: XML (default)
      json: JSON format
      html: HTML format
      check: returns a tagged version of the text
  o offset: offset where to start the revision of the text, starting from 0 (default).
  o mode: check mode, according to the following values:
      all: get all errors (default)
      next: get only the next error (from the given offset)
  o config: settings for the revision process. This parameter is a string with a list of values, separated by semicolons, indicating one or more of the following values:
      pp=<value>: try to guess words with known prefixes: 0=inactive, 1=active (default).
      dh=<value>: handle hyphenation at the end of line: 0=inactive (default), 1=active.
      aqoi=<value>: accept words within quotes or in italics: 0=inactive (default), 1=active.
      tls=<value>: warn of too long sentences: 0=inactive, 1=active (default).
      dpn=<value>: try to guess unknown proper nouns: 0=inactive, 1=active (default).
      stme=<value>: behaviour in sentences with many errors: 0=show individual messages, 1=group messages and ignore, 2=group messages and give a warning (default).
      mf=<value>: try to guess mathematical formulas: 0=inactive, 1=ignore formulas, 2=check formulas (default).
      red=<value>: check redundancy in sentence: 0=inactive, 1=active (default).
      spa=<value>: check spacing: 0=inactive, 1=active (default).
      comppunc=<value>: check unbalanced punctuation: 0=inactive, 1=active (default).
      corrpunc=<value>: check incorrect punctuation: 0=inactive, 1=active (default).
      fwi=<value>: foreign words should be written in italics: 0=inactive, 1=active (default).
      alw=<value>: suggest alternatives to loan words: 0=inactive, 1=active (default).
      dic=<value>: list of active dictionaries [NOT SUPPORTED YET].
      level=<value>: knowledge of language, CEFR level (Common European Framework of Reference for Languages):
        • A1 (Breakthrough)
        • A2 (Waystage)
        • B1 (Threshold)
        • B2 (Vantage)
        • C1 (Proficiency)
        • C2 (Mastery) (default)

The configuration recommended for user-provided content such as Qype reviews is the following:

  o config:
      pp=1: try to guess words with known prefixes.
      dh=0: do not handle hyphenation at the end of line.
      aqoi=1: accept words within quotes or in italics.
      tls=0: do not warn of too long sentences.
      dpn=1: try to guess unknown proper nouns.
      stme=2: group messages and give a warning.
      mf=0: do not try to guess mathematical formulas.
      red=0: ignore redundancy in sentence.
      spa=0: ignore spacing mistakes.
      comppunc=0: ignore unbalanced punctuation.
      corrpunc=0: ignore incorrect punctuation.
      fwi=0: foreign words do not need to be written in italics.
      alw=1: suggest alternatives to loan words.
      level=A1: knowledge of language: Breakthrough, CEFR level (Common European Framework of Reference for Languages).

When texts contained wrong accentuation, the checker used to build some incorrect grammatical structures. In order to avoid this type of analysis, the orthographic module has been swapped with the disambiguation module, and the syntactic analysis is now performed with a disambiguation process that is insensitive to accents.

Another typical problem related to accents is caused by words that are frequently written without accents in user-generated content. The rules that deal with this type of problem have been improved.
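The idea of an accent-insensitive lookup can be illustrated with a short sketch. This is not the checker's actual implementation: it only assumes a lexicon that maps word forms to sets of tags, and it retrieves every form that equals the input word once acute and grave accents are removed. The toy lexicon, the tag names and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Combining acute and grave accents; the tilde of "ñ" and the
# diaeresis of "ü" are deliberately left untouched.
ACCENT_MARKS = {"\u0301", "\u0300"}


def strip_accents(word):
    """Remove acute/grave accents from a word, keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", kept)


def build_index(lexicon):
    """Map each accent-free form to all lexicon entries that share it."""
    index = defaultdict(list)
    for form, tags in lexicon.items():
        index[strip_accents(form.lower())].append((form, tags))
    return index


def accented_candidates(word, index, wanted_tag):
    """Forms equal to `word` up to accents that carry `wanted_tag`."""
    key = strip_accents(word.lower())
    return [form for form, tags in index.get(key, [])
            if wanted_tag in tags and form != word.lower()]


# Toy lexicon with hypothetical tags (illustration only).
LEXICON = {
    "ultimo": {"verb"},
    "último": {"adjective", "noun"},
    "ultimó": {"verb"},
    "esta": {"demonstrative"},
    "está": {"verb"},
}

index = build_index(LEXICON)
print(accented_candidates("ultimo", index, "adjective"))
print(accented_candidates("esta", index, "verb"))
```

A contextual rule can then decide, from the tags of the neighbouring words, whether one of the accented candidates is more plausible than the form actually written, and offer it as a suggestion.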
For instance, the following pairs of words are frequently misspelt and need disambiguation:

o ultimo-último

    el <error>ultimo</error> de la clase
    el <error>ejercito</error> de Israel
    he leído el <error>articulo</error>

    RULE(L"ReglaUltimoPorÚltimo")
    (
      EXISTENTIAL_TAG(POS(N), TagVerb) AND
      EXISTENTIAL_TAG(POS(N), TagSingularVerb) AND
      EXISTENTIAL_TAG(POS(N), TagPresentIndicative) AND
      !(EXISTENTIAL_TAG(POS(N), TagNoun OR_TAG TagAdjective) AND
        !EXISTENTIAL_TAG(POS(N), TagNounAppreciativeY)) AND
      (EXISTENTIAL_TAG(POS(N-1), TagArticle OR_TAG TagNumeral OR_TAG TagDemonstrative OR_TAG
                                 TagPossesive OR_TAG TagPrenominalAdjective) AND
       UNIVERSAL_TAG(POS(N-1), TagMasculine) AND
       UNIVERSAL_TAG(POS(N-1), TagSingular) OR
       FORM(POS(N-1), L"al|del") OR
       EXISTENTIAL_TAG(POS(N-1), TagPreposition)) AND
      EXISTENTIAL_TAG_WITH_ACCENTS(POS(N), TagMascSingNoun OR_TAG TagMascSingAdjective, sTemp)
    ) AND
    SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1), POS_WITHOUT_OVERFLOW(N))
    THEN
      SUG_WORD(POS(N), sTemp);
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posiblemente falte el acento.",
            L"en", L"Maybe the diacritical mark is missing.",
            L"fr", L"Assurez-vous de ne pas avoir oublié un accent.",
            L"it", L"Forse manca l'accento."),
        Error_Poco_Seguro, L"", B1, L"ReglaUltimoPorÚltimo");
    END_RULE

o replica-réplica

    los familiares de la <error>victima</error>
    trabajo en aquella <error>fabrica</error>

    RULE(L"ReglaReplicaPorRéplica")
    (
      EXISTENTIAL_TAG(POS(N), TagVerb3Singular) AND
      EXISTENTIAL_TAG(POS(N), TagSingularVerb) AND
      EXISTENTIAL_TAG(POS(N), TagVerbPresent) AND
      EXISTENTIAL_TAG(POS(N), TagVerb2) AND
      EXISTENTIAL_TAG(POS(N), TagImperativeVerb) AND
      !(EXISTENTIAL_TAG(POS(N), TagNoun OR_TAG TagAdjective) AND
        !EXISTENTIAL_TAG(POS(N), TagNounAppreciativeY)) AND
      ((EXISTENTIAL_TAG(POS(N-1), TagNumeral OR_TAG TagDemonstrative OR_TAG
                                  TagPossesive OR_TAG TagPrenominalAdjective) OR
        FORM(POS(N-1), L"una") OR
        FORM(POS(N-1), L"la")) AND
       UNIVERSAL_TAG(POS(N-1), TagFeminine) AND
       UNIVERSAL_TAG(POS(N-1), TagSingular) OR
       EXISTENTIAL_TAG(POS(N-1), TagPreposition)) AND
      !(EXISTENTIAL_TAG(POS(N-1), TagDemonstrative) AND IS_FIRST_WORD(POS(N-1))) AND
      EXISTENTIAL_TAG_WITH_ACCENTS(POS(N), TagCommonNounFemenine OR_TAG
                                           TagFeminineSingularAdjective, sTemp)
    ) AND
    SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1), POS_WITHOUT_OVERFLOW(N))
    THEN
      SUG_WORD(POS(N), sTemp);
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posiblemente falte el acento.",
            L"en", L"Maybe the diacritical mark is missing.",
            L"fr", L"Assurez-vous de ne pas avoir oublié un accent.",
            L"it", L"Forse manca l'accento."),
        Error_Poco_Seguro, L"", B1, L"ReglaReplicaPorRéplica");
    END_RULE

o esta-está

    El campesino <error>esta</error> asustado
    me <error>esta</error> yendo muy bien
    ¿<error>Estas</error> furioso por eso?
    no <error>esta</error> mal en absoluto que seas de un equipo como todo el mundo
    se <error>esta</error> de maravilla

    RULE(L"ReglaEstaPorEstá")
    (
      FORM(POS(N), L"esta|estas") AND
      (FORM(POS(N-1), L"cómo|dónde") AND
       !EXISTENTIAL_TAG(POS(N+1), TagFemNoun) AND
       NUMBER_AGREES(POS(N), POS(N+1)) OR
       EXISTENTIAL_TAG(POS(N+1), TagAdjectiveOrParticiple OR_TAG TagNoun) AND
       !GENDER_AND_NUMBER_AGREE(POS(N), POS(N+1)) OR
       EXISTENTIAL_TAG(POS(N+1), TagGerundVerb) AND
       !EXISTENTIAL_TAG(POS(N+1), TagFemNoun) AND
       !NUMBER_AGREES(POS(N), POS(N+1)) OR
       FORM(POS(N-1), L"se"))
    ) AND
    SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N), POS_WITHOUT_OVERFLOW(N+2))
    THEN
      IF FORM(POS(N), L"esta") THEN
        SUG_WORD(POS(N), L"está");
      ELSEIF FORM(POS(N), L"estas") THEN
        SUG_WORD(POS(N), L"estás");
      END
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posiblemente falte el acento.",
            L"en", L"Maybe the diacritical mark is missing.",
            L"fr", L"Assurez-vous de ne pas avoir oublié un accent.",
            L"it", L"Forse manca l'accento."),
        Error_Poco_Seguro, L"", B1, L"ReglaEstaPorEstá");
    END_RULE

o hacia-hacía

    la limpieza se <error>hacia</error> casi imposible
    Se <error>amplia</error> el carácter participativo de los carteles

    RULE(L"ReglaHaciaPorHacía")
    (
      FORM(POS(N), L"se") AND
      FORM_ENDING(POS(N+1), L"ia|ian") AND
      EXISTENTIAL_TAG_WITH_ACCENTS(POS(N+1), TagVerb3, sTemp)
    ) AND
    SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N), POS_WITHOUT_OVERFLOW(N+1))
    THEN
      SUG_WORD(POS(N+1), sTemp);
      ADD_ERROR(Error_Spelling, POS(N+1), POS(N+1),
        msg(L"es", L"Posiblemente falte el acento.",
            L"en", L"Maybe the diacritical mark is missing.",
            L"fr", L"Assurez-vous de ne pas avoir oublié un accent.",
            L"it", L"Forse manca l'accento.",
            L"esDidac", L"Quizá te hayas olvidado de poner el acento."),
        Error_Poco_Seguro, L"", B1, L"ReglaHaciaPorHacía");
    END_RULE

o mas-más

    Newton dedicó <error>mas</error> tiempo a la investigación química que a la física
    Fue el pensador de mentalidad <error>mas</error> científica que produjo la Edad Media.
    La complejidad del genoma es mayor en los organismos <error>mas</error> evolucionados.
    Se añadirán tres meses <error>mas</error> de reclusión.
    Aquí el fiscal no es <error>mas</error> que un hombre que denuncia.
    El estado, <error>mas</error> comprometido en esa época.
    todos los partidos políticos, <error>mas</error> de mil talleres de capacitación
    con la sed permanente de <error>mas</error> y <error>mas</error> consumo

    RULE(L"ReglaMasPorMás")
    (
      FORM(POS(N), L"mas") AND
      EXISTENTIAL_TAG(POS(N), TagConjunction) AND
      ((UNIVERSAL_TAG(POS(N+1), TagNoun OR_TAG TagAdjective OR_TAG TagParticiple) AND
        !EXISTENTIAL_TAG(POS(N+1), TagProperNoun)) OR
       EXISTENTIAL_TAG(POS(N+1), MorStrongPunctuationC OR_TAG TagPreposition OR_TAG TagConjunction) OR
       EXISTENTIAL_TAG(POS(N-1), TagConjunction OR_TAG TagPreposition)) AND
      !FORM(POS(N+1), L"sin_embargo")
    ) AND
    SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1), POS_WITHOUT_OVERFLOW(N+1))
    THEN
      SUG_WORD(POS(N), L"más");
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posible confusión al emplear la conjunción <i>mas</i> en vez del adverbio <i>más</i>, conviene revisar el acento.",
            L"en", L"It may be a confusion using the conjunction <i>mas</i> instead of the adverb <i>más</i>, it is convenient to check the diacritical mark.",
            L"fr", L"Il y a peut-être une confusion entre la conjonction mas et l’adverbe más.",
            L"it", L"Possibile confusione nell'uso della congiunzione <i>mas</i> invece dell'avverbio <i>más</i>. Controllare l'accento."),
        Error_Poco_Seguro, L"Check_Panhisp", C2, L"ReglaMasPorMás");
    END_RULE

In addition, the rules used to detect proper nouns (Name + Surname) have also been modified.

Other modifications

Most of the problems detected during the monitoring phase were related to sentences with a high number of prepositions or to sentences with coordinated structures.

In the first case, the threshold used to decide whether a sentence is correct has been modified, which yields a more adequate performance.

    “la víctima y su marido vivían en el piso” [the victim and her husband lived in the apartment]

In the second case, the main problem arises when checking the agreement between verb and subject when the subject is a coordinated structure. This problem has now been solved, and sentences such as the following are correctly analyzed:

    “los postres sobre todo el tiramisú estaban muy buenos” [the desserts, especially the tiramisu, were very good]

Some other rules have been included to distinguish between pairs such as:

o haber vs. a_ver

    Mira <error>haber</error> si no llegas a tiempo

    RULE(L"ReglaHaberPorAVer")
      FORM(POS(N), L"haber") AND
      IS_ELEMENTAL_TOKEN(POS(N)) AND
      FORM(POS(N+1), L"si") AND
      !EXISTENTIAL_TAG(POS(N-1), TagDeterminer OR_TAG TagPreposition)
    THEN
      SUG_WORD(POS(N), L"a");
      SUG_SPACE();
      SUG_WORD(POS(N+1), L"ver");
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posible confusión al emplear la forma verbal <i>haber</i> inadecuadamente.",
            L"en", L"Possible confusion using the verb <i>haber</i> improperly.",
            L"fr", L"Le verbe <i>haber</i> n'est pas employé correctement.",
            L"it", L"Possibile confusione facendo uso della forma verbale <i>haber</i> inadeguatamente."),
        Error_Poco_Seguro, L"", C1, L"ReglaHaberPorAVer");
    END_RULE

o sino vs. si_no

    Mi mujer me sabía inofensivo, pues <error>sino</error> habría ideado otra solución

    RULE(L"ReglaSinoPorSi_No")
      FORM(POS(N), L"sino") AND
      !(EXISTENTIAL_TAG(POS(N-1), TagDeterminer) AND
        FORM(POS(N-1), L"suyo|suya|suyos|suyas|mío|mía|míos|mías|tuyo|tuya|tuyos|tuyas")) AND
      (UNIVERSAL_TAG(POS(N+1), TagPersonalVerb) AND
       !FORM(POS(N+1), L"hace") OR
       EXISTENTIAL_TAG(POS(N+1), TagAccusativePersonal OR_TAG TagDativePersonal) AND
       UNIVERSAL_TAG(POS(N+2), TagPersonalVerb) OR
       FORM(POS(N+1), L"no") AND
       UNIVERSAL_TAG(POS(N+2), TagPersonalVerb) OR
       FORM(POS(N+1), L"no") AND
       EXISTENTIAL_TAG(POS(N+2), TagAccusativePersonal OR_TAG TagDativePersonal) AND
       UNIVERSAL_TAG(POS(N+3), TagPersonalVerb) OR
       FORM(POS(N+1), L"no") AND
       EXISTENTIAL_TAG(POS(N+2), TagDativePersonal) AND
       EXISTENTIAL_TAG(POS(N+3), TagAccusativePersonal) AND
       UNIVERSAL_TAG(POS(N+4), TagPersonalVerb)) AND
      SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1), POS_WITHOUT_OVERFLOW(N+4))
    THEN
      SUG_WORD(POS(N), L"si no");
      NEW_SUG();
      SUG_WORD(POS(N), L"sino que");
      ADD_ERROR(Error_Spelling, POS(N), POS(N),
        msg(L"es", L"Posible confusión al emplear la conjunción <i>sino</i> inadecuadamente.",
            L"en", L"Possible confusion using the conjunction <i>sino</i> improperly.",
            L"fr", L"Il s'agit peut-être d'un usage incorrect de la conjonction <i>sino</i>.",
            L"it", L"Possibile confusione nell'uso della congiunzione <i>sino</i>."),
        Error_Poco_Seguro, L"", C1, L"ReglaSinoPorSi_No");
    END_RULE

Furthermore, a new rule has been developed to detect unknown abbreviations and return an error without suggestions; previously the system suggested many possible forms.

Consideration has also been given to adding to the resources the determiners that contain @ to indicate invariable gender (tod@s, ningun@s, algun@s...)
and creating new rules that make it possible to analyze sentences containing pattern mismatches due to the use of oral patterns in written language.

Finally, since 23rd December the orthographic rules have been under modification, following the recommendations of the new RAE (Real Academia Española) orthography.

5. Appendix 1

This section contains the list of 172 words extracted from the checking reports of 21st December. These words are analyzed by our experts and added to the resource files with the corresponding information. Each type of resource is included in a different file, as shown below.

File: LAST_NAMES.db (15 terms included)

This file contains a list of last names in different languages.

    Bakiyev Crace Faili Fakhrizadeh Ayad Balfe Bersani Yeates Bongiorno Brisman Cardle Creevey Asato Hasseloff Baqeri

Each of these terms is added to the resources together with some grammatical and semantic information.

Table 3: Entry example for LAST_NAMES.db

Grammatical info:
    BAKIYEV
      form = Bakiyev
      es = ok
      en = ok
      it = ok
      fr = ok
      cat = noun
      noun_type = proper
      dic = DIC_DEF
      gender = undef
      number = undef
      concepts = C_BAKIYEV

Semantic info:
    C_BAKIYEV
      sementity_type = PERSON
      sementity_subtype = LAST_NAME
      sementity_class = instance
      sementity_fiction = nofiction

File: CELEBRITIES.db (8 terms included)

File containing the names of celebrities.

    Jonas_Brothers Kanye_West Alain_Bashung Kurmanbek_Bakiyev Usher Blondie The_New_Romantiques Sparklehorse

Each of these terms is added to the resources together with some grammatical and semantic information.
Table 4: Entry example for CELEBRITIES.db

Grammatical info:
    JONAS_BROTHERS
      form = Jonas_Brothers
      en = ok
      it = ok
      fr = ok
      es = ok
      cat = noun
      noun_type = proper
      gender = undef
      number = undef
      dic = DIC_SOC
      concepts = C_JONAS_BROTHERS

Semantic info:
    C_JONAS_BROTHERS
      semtheme_type = ARTS
      semtheme_subtype = MUSIC
      sementity_type = ORGANIZATION
      sementity_subtype = ORGANIZATION_OTHER
      sementity_subsubtype = ARTISTIC_ORGANIZATION
      sementity_class = instance
      sementity_fiction = nofiction

File: ECONOMY.db (10 terms included)

Set of economic entities such as companies.

    UKFI UK_Financial_Investments Nutella ParkSinta Rodilla WD-40 Yonhap_News_Agency Yonhap Groupon Bangladesh_Garments_Manufacturers_and_Exporters_Associations

Each of these terms is added to the resources together with some grammatical and semantic information.
Table 5: Entry example for ECONOMY.db

Grammatical info:
    UKFI
      form = UKFI
      es = ok
      en = ok
      it = ok
      fr = ok
      cat = noun
      noun_type = proper
      src = INTERNET
      gender = undef
      number = sing
      concepts = C_UKFI
      text_form = initialism
      remission_entitykey = UK_Financial_Investments
    UK_Financial_Investments
      form = UK_Financial_Investments
      es = ok
      en = ok
      it = ok
      fr = ok
      cat = noun
      noun_type = proper
      src = INTERNET
      gender = undef
      number = sing
      concepts = C_UK_Financial_Investments

Semantic info:
    C_UKFI
      sementity_type = ORGANIZATION
      sementity_subtype = COMPANY
      sementity_subsubtype = FINANCIAL_COMPANY
      sementity_subsubsubtype = INVESTMENT_COMPANY
      sementity_class = instance
    C_UK_Financial_Investments
      sementity_type = ORGANIZATION
      sementity_subtype = COMPANY
      sementity_subsubtype = FINANCIAL_COMPANY
      sementity_subsubsubtype = INVESTMENT_COMPANY
      sementity_class = instance
      sementity_fiction = nofiction

File: INSTITUTIONS.db (13 terms included)

File containing a set of institutions.

    British_Council Consejo_Británico CFJ Centre_de_Formation_des_Journalistes Atelier_de_Poésie_Ouverte Solidaritat_Catalana ANC African_National_Congress Campaign_for_an_English_Parliament CEP FoE PIRC Parental_Information_and_Resource_Centers

Each of these terms is added to the resources together with some grammatical and semantic information.
Table 6: Entry example for INSTITUTIONS.db

Grammatical info:
    BRITISH_COUNCIL
      form = British_Council
      en = ok
      es = ok
      fr = ok
      it = ok
      cat = noun
      noun_type = proper
      gender = masc
      number = sing
      concepts = C_BRITISH_COUNCIL
      es_checkinfo_id = foreign_word
      es_checkinfo_level = b1
      es_checkinfo_form = Consejo_Británico
      es_checkinfo_lang = english
    CONSEJO_BRITÁNICO
      form = Consejo_Británico
      es = ok
      cat = noun
      noun_type = proper
      gender = masc
      number = sing
      concepts = C_CONSEJO_BRITÁNICO

Semantic info:
    C_BRITISH_COUNCIL
      sementity_type = ORGANIZATION
      sementity_subtype = INSTITUTE
      sementity_subsubtype = INSTITUTE_OTHER
      semtheme_type = SOCIETY
      semtheme_subtype = EDUCATION
      sementity_class = instance
      sementity_fiction = nofiction
    C_CONSEJO_BRITÁNICO
      sementity_type = ORGANIZATION
      sementity_subtype = INSTITUTE
      sementity_subsubtype = INSTITUTE_OTHER
      semtheme_type = SOCIETY
      semtheme_subtype = EDUCATION
      sementity_class = instance
      sementity_fiction = nofiction

File: FIRST_NAMES.db (28 terms included)

This file contains a list of first names in different languages.

    Colima Bashir Ginni Laureline Marielle Batefimbi Suransky Barbaree Rumack Mirlande Mohamud Philomene Siân Uri Zwelakhe Andi Cath Cheng Choe Chun Fereidoun Kurmanbek Abdur Aisling Catia Ayad Brigid Caity

Each of these terms is added to the resources together with some grammatical and semantic information.
Table 7: Entry example for FIRST_NAMES.db

Grammatical info:
    COLIMA#2
      form = Colima
      es = ok
      en = ok
      it = ok
      fr = ok
      cat = noun
      noun_type = proper
      dic = DIC_DEF
      gender = masc
      number = sing
      concepts = C_COLIMA#2

Semantic info:
    C_COLIMA#2
      sementity_type = PERSON
      sementity_subtype = FIRST_NAME
      sementity_class = instance
      sementity_fiction = nofiction

File: OTHER_NAMES.db (20 terms included)

This file contains a list of various proper names such as films, products and film festivals.

    Ampera El_Día_Después meridiano_de_Greenwich Waverley Mediator American_Music_Awards AMA Complexo_do_Alemão Panopticon REDD Yeddah Scary_Movie Qasam Sipdis Secret_Internet_Protocol_Distribution Secret_Internet_Protocol_Router_Network SIPRNet Wrongfully_Accused Broxden_Junction Broxden

Each of these terms is added to the resources together with some grammatical and semantic information.
Table 8: Entry example for OTHER_NAMES.db Grammatical info AMPERA form = Ampera es = ok en = ok it = ok fr = ok cat = noun noun_type = proper dic = DIC_DEF src = INTERNET gender = undef number = sing concepts = C_AMPERA EL_DÍA_DESPUÉS form = El_Día_Después es = ok en = ok it = ok fr = ok cat = noun noun_type = proper src = INTERNET gender = masc number = sing concepts = C_EL_DÍA_DESPUÉS Semantic info C_AMPERA semtheme_type = SOCIETY semtheme_subtype = TRANSPORT sementity_type = PRODUCT sementity_subtype = PHYSICAL_PRODUCT sementity_subsubtype = VEHICLE sementity_class = instance sementity_fiction = nofiction C_EL_DÍA_DESPUÉS sementity_subtype = CULTURAL_PRODUCT sementity_subsubtype = SHOW semtheme_type = SPORT semtheme_subtype = FOOTBALL sementity_class = instance sementity_fiction = nofiction File: LOCATION.db(20 terms included) File containing a list of locations. Ergenekon Côte_d'Azur Costa_Azul Costa_Azzurra Cité_Soleil Ceyhan 23 Project co-funded by the European Commission within the ICT Policy Support Programme Dissemination Level C Confidential, only for members of the consortium and the Commission Services Kouriles Kuriles Curiles Terzigno Mirail Cadix Paestum Thézy-Glimont Etampes Colombey Greymouth Amilly Yssingeaux Mohammédia Each of these terms is incorporated to resources specifying some grammatical and semantic information. Table 9: Entry example for LOCATION.db Grammatical info ERGENEKON form = Ergenekon es = ok en = ok fr = ok it = ok cat = noun noun_type = proper gender = undef number = sing concepts = C_ERGENEKON % En la mitología turca, Ergenekon es un lugar mítico localizado en los inaccesibles valles de los montes Altaï. 
CÔTE_D'AZUR
  form = Côte_d'Azur
  en = ok
  fr = ok
  it = ok
  es = ok
  cat = noun
  noun_type = proper
  gender = undef
  number = sing
  concepts = C_CÔTE_D'AZUR
  en_checkinfo_form = French_Riviera
  en_checkinfo_id = wrong_adaptation
  en_checkinfo_lang = french
  en_checkinfo_level = b2
  es_checkinfo_form = Costa_Azul
  es_checkinfo_id = foreign_word
  es_checkinfo_lang = french
  es_checkinfo_level = b2
  it_checkinfo_form = Costa_Azzurra
  it_checkinfo_id = foreign_word
  it_checkinfo_lang = french
  it_checkinfo_level = b2

Semantic info
C_ERGENEKON
  semgeo_countrykey = C_TURQUÍA
  sementity_type = LOCATION
  sementity_subtype = LOCATION_OTHER
  sementity_class = instance
  sementity_fiction = fiction
  semtheme_type = HUMANITIES
  semtheme_subtype = MYTHOLOGY
C_CÔTE_D'AZUR
  semgeo_countrykey = C_FRANCIA#1
  sementity_type = LOCATION
  sementity_subtype = GEOLOGICAL_REGION
  sementity_subsubtype = GEOLOGICAL_REGION_OTHER
  sementity_subsubsubtype = NATURAL_REGION
  sementity_class = instance
  sementity_fiction = nofiction

File: GENT.db (3 terms included)
This file contains demonyms, that is, names given to the people from a particular region or country.

mecano
grisón
panyabí

Each of these terms is incorporated into the resources together with its grammatical and semantic information.
Table 10: Entry example for GENT.db

Grammatical info
MECANO#1
  form = mecano
  model = A_NINO
  model = DIC_DEF
  model = A
  concepts = C_MECANO
  src = DRAE
MECANO#2
  form = mecano
  model = N_NINO
  model = DIC_DEF
  model = N
  concepts = C_MECANO

Semantic info
C_MECANO (shared by MECANO#1 and MECANO#2)
  model = CITYGENT
  semgeo_citykey = C_LA_MECA
  semgeo_countrykey = C_ARABIA_SAUDÍ

File: FOREIGN.db (7 terms included)
This file contains a set of foreign words that are commonly used in different languages.

currie
post
glamouroso
risotto
brunch
death_metal
trash_metal

Each of these terms is incorporated into the resources together with its grammatical and semantic information.

Table 11: Entry example for FOREIGN.db

Grammatical info
CURRIE
  form = currie
  model = ESTRES
  model = DIC_DEF
  concepts = C_CURRIE
  model = N
  checkinfo_orig = curry
  checkinfo_lang = english
  checkinfo_id = wrong_adaptation
  checkinfo_form = curri
  checkinfo_level = a1
  checkinfo_src = panhisp

Semantic info
C_CURRIE
  model = GASTRONOMY

File: NOMS.db (8 terms included)
This file contains a list of nouns.

ye
relaciones_públicas
calidad-precio
cronicidad
autopase
neurodesarrollo
tempura
contraproyecto

Each of these terms is incorporated into the resources together with its grammatical and semantic information.
Table 12: Entry example for NOMS.db

Grammatical info
YE
  form = ye
  lex = ye
  model = N_LUNA
  model = DIC_DEF
  concepts = C_YE
  model = N
  src = DRAE
  % sin: i griega
RELACIONES_PÚBLICAS
  form = relaciones_públicas
  lex = relaciones_públicas
  model = RUBIALES
  model = DIC_DEF
  concepts = C_RELACIONES_PÚBLICAS
  model = N
  src = DAL DRAE

Semantic info
C_YE
  model = LINGUISTICS
C_RELACIONES_PÚBLICAS
  model = SOCIETY

File: ADJS.db (7 terms included)
This file contains a list of adjectives.

cachifo
ortotelefónico
calentorro
cantoso
orujero
hímnico
siseante

Each of these terms is incorporated into the resources together with its grammatical and semantic information.

Table 13: Entry example for ADJS.db

Grammatical info
CACHIFO
  form = cachifo
  model = A_NINO
  model = COLOQ
  model = A
  src = SM3

File: ABREV-SMS.db (17 terms included)
This file contains SMS-style abbreviations.

4ever
=mnt
1bs
1bso
awa
crk
cta
ksa
ksi
ktl?
mvil
npi
ns
prnt
qdamos
to2
ymm

Each of these terms is incorporated into the resources together with its grammatical and semantic information.

Table 14: Entry example for ABREV-SMS.db

Grammatical info
4ever
  form = 4ever
  model = COLOQ
  cat = sc
  tag = Z
  es_checkinfo_id = wrong_form
  es_checkinfo_level = a1
  es_checkinfo_form = para siempre

File: EMOTIC.db (10 terms included)
This file contains a set of emoticons.

=)
:'(
:-X
:O
u_u
U_U
'O'
:S
:P
:*

Each of these terms is incorporated into the resources together with its grammatical and semantic information.
Table 15: Entry example for EMOTIC.db

Grammatical info
=)
  form = =)
  model = COLOQ
  lex = contento
  cat = sc
  tag = Z
  src = DAEDALUS

File: CL-IJ.db (6 terms included)
This file contains interjections frequently used in blogs.

jejeje
jejejeje
jajaja
juas
jajajaja
jeajeajeajea

Each of these terms is incorporated into the resources together with its grammatical and semantic information.

Table 16: Entry example for CL-IJ.db

Grammatical info
jejeje
  form = jejeje
  lex = jejeje
  cat = ij
  src = INTERNET
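The entry examples above all follow the same pattern: an entry key followed by a list of "attribute = value" pairs, where an attribute such as model may appear more than once and a checkinfo attribute carries the suggested correction. The following Python fragment is a minimal sketch of how such entries could be read and used to propose a replacement for a user-generated token. It is not the module's actual implementation: the on-disk syntax of the .db files is not described in this document, so the one-attribute-per-line layout is an assumption, and the names parse_entries, index_by_form and suggest are hypothetical.

```python
from collections import defaultdict

# Hypothetical excerpt laid out like Tables 14 and 16: an entry key on its own
# line, followed by "attribute = value" lines. The real .db syntax may differ.
SAMPLE = """\
4ever
form = 4ever
model = COLOQ
cat = sc
tag = Z
es_checkinfo_id = wrong_form
es_checkinfo_level = a1
es_checkinfo_form = para siempre

jejeje
form = jejeje
lex = jejeje
cat = ij
src = INTERNET
"""


def parse_entries(text):
    """Return {entry_key: {attribute: [values]}}, keeping repeated attributes in order."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and % comments
            continue
        if " = " in line and current is not None:
            # Split on the first " = " only, so values such as "=)" survive.
            name, value = line.split(" = ", 1)
            current[name].append(value)
        else:
            # Any other line is assumed to open a new entry.
            current = entries.setdefault(line, defaultdict(list))
    return entries


def index_by_form(entries):
    """Index entries by surface form; a later entry with the same form overwrites an earlier one."""
    return {e["form"][0]: e for e in entries.values() if e.get("form")}


def suggest(token, by_form, lang="es"):
    """Return the form given by <lang>_checkinfo_form, or the token unchanged."""
    entry = by_form.get(token)
    if entry is None:
        return token
    forms = entry.get(f"{lang}_checkinfo_form")
    return forms[0] if forms else token


entries = parse_entries(SAMPLE)
by_form = index_by_form(entries)
print(" ".join(suggest(t, by_form) for t in "amigos 4ever jejeje".split()))
```

With the sample data, the SMS abbreviation 4ever is replaced by the Spanish form stored in es_checkinfo_form, while the interjection jejeje, which has no checkinfo attribute, is left untouched. Values are stored as lists because attributes such as model are repeated within a single entry.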