D3.1 Spanish and Italian text correction modules adapted to web


DELIVERABLE
Project Acronym: FLAVIUS
Grant Agreement number: ICT-PSP-250528
Project Title: Foreign LAnguage Versions of Internet and User generated Sites
D3.1 Spanish and Italian text correction
modules adapted to web environment
Revision: 1.0
Authors:
Sonia Collada Pérez, Julio Villena Román (Daedalus)
Project co-funded by the European Commission within the ICT Policy Support Programme
Dissemination Level
C
Confidential, only for members of the consortium and the Commission Services
Revision History
Revision   Date                Author                                      Organisation   Description
0.1        January 7th, 2011   Sonia Collada Pérez, Julio Villena Román    Daedalus
1.0
Statement of originality:
This deliverable contains original unpublished work except where clearly
indicated otherwise. Acknowledgement of previously published material
and of the work of others has been made through appropriate citation,
quotation or both.
1. Introduction
FLAVIUS aims to provide an easy and cost-effective way for webmasters to have their websites
translated and indexed in several languages. The spelling and grammar checker module aims to
improve the quality of the source text so as to obtain a higher translation quality afterwards. Within
the FLAVIUS project, the source languages taken into account are English, French, Spanish
and Italian.
This document describes the modifications made to the spelling and grammar
checker module for Spanish and Italian in order to adapt it to the web environment and, mainly, to user
generated content.
2. Daedalus contribution
Daedalus is a company established by a group of specialists in research, development, innovation
and technology transfer in the field of Information and Communications Technology (ICT). In the
area of text correction, Daedalus has developed a spelling and grammar checker that meets
the needs of media and publishing companies and provides the quality required in these areas.
Within FLAVIUS, the role of the Daedalus spelling and grammar checker is to improve the quality
of the source text so as to obtain a higher translation quality afterwards. Since the text to be
translated is user generated content, it has been necessary to adjust the spelling and grammar
checker to this new scenario.
3. Error detection
The objective of this task is to carry out any modifications that the spelling and grammar
checker modules for Spanish and Italian need in order to process user generated content such as blog
posts, reviews, etc. The first step is therefore to identify how this type of content differs from
the formal language typically used by professional writers such
as journalists.
In order to collect a text corpus, an RSS monitoring robot has been developed. This robot
automatically downloads and processes RSS channels published on different sites belonging to Qype,
one of our content-provider partners. The following table shows the list of RSS channels that are currently
fed into the robot.
Table 1: List of RSS channels for Spanish and Italian
Spanish:
http://www.qype.es/es300-madrid/rss
http://www.qype.es/es511-barcelona/rss
http://www.qype.es/es213-bilbao/rss
http://www.qype.es/es523-valencia/rss
http://www.qype.es/es530-palma-de-mallorca/rss
http://www.qype.es/es212-donostia-san-sebastian/rss
http://www.qype.es/es111-santiago-de-compostela/rss
http://www.qype.es/es243-zaragoza/rss
http://www.qype.es/es618-sevilla/rss
http://www.qype.es/es/rss
http://www.qype.es/uk/rss
http://www.qype.es/fr/rss
http://www.qype.es/it/rss
Italian:
http://www.qype.it/es/rss
http://www.qype.it/uk/rss
http://www.qype.it/fr/rss
http://www.qype.it/it/rss
In fact, we are already collecting content for English and French as well, as shown in the next table.
Table 2: List of RSS channels for English and French
English:
http://www.qype.co.uk/es/rss
http://www.qype.co.uk/uk/rss
http://www.qype.co.uk/fr/rss
http://www.qype.co.uk/it/rss
French:
http://www.qype.fr/es/rss
http://www.qype.fr/uk/rss
http://www.qype.fr/fr/rss
http://www.qype.fr/it/rss
Twice a day, checking reports are automatically built using the most up-to-date engines and then
sent by email to a group of expert reviewers. These reports contain a list of errors, which are analyzed
in order to detect false positives and false negatives. The errors are stored in a database, which
makes it possible to assess the status of the checking engines.
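The collection step described in this section can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the actual robot: the function names fetch_feed and parse_rss_items and the SAMPLE_FEED document are hypothetical, and the feeds are assumed to be plain RSS 2.0 with title and description elements.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_feed(url, timeout=10):
    """Download one RSS channel and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_rss_items(xml_text):
    """Extract (title, description) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        items.append((title, description))
    return items


# Invented sample feed, used only to show the expected input shape.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample channel</title>
    <item>
      <title>Restaurante de ejemplo</title>
      <description>Todo está fantástico.</description>
    </item>
  </channel>
</rss>"""

reviews = parse_rss_items(SAMPLE_FEED)
```

In a real run, each URL from the tables above would be passed to fetch_feed and the extracted descriptions handed to the checking engines.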
4. System adaptation to web environment
Monitoring has highlighted the need to enhance the system in several ways:
adding a language detector, extending the lexical base and improving the grammar engine.
Language detector
One of the most common sources of errors is the presence of several languages in
the same text, mainly when analyzing reviews from Qype (and also, manually, from TVTrip). Even
when a review is written in Spanish, it may refer to a foreign place
whose name is written in a different language, for instance:
En mi reciente viaje a París visité otra vez la Tour Eiffel.
Un tempo arena di corride e poi teatro, oggi è un tranquillo punto di ritrovo e di sosta.
Particolarmente belli e caratteristici sono i caffè che si trovano ai suoi angoli. Vi sono molte
bancarelle dove comprare stupendi souvenir.La Plaza Mayor acquista molto fascino sotto il
periodo natalizio.
In this case, foreign proper nouns need special treatment. Our approach is to
include a list of widely used (and widely understood) proper names of important places in all the languages that
the user could speak and the system is able to check.
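The proper-name list approach can be sketched roughly as follows. This is a minimal Python illustration, not the checker's actual implementation: the names KNOWN_PLACE_NAMES, protected_spans and tokens_to_check are hypothetical, and the three-entry list stands in for the much larger resource files.

```python
import re

# Hypothetical multilingual list of well-known place names.
KNOWN_PLACE_NAMES = ["Tour Eiffel", "Plaza Mayor", "Côte d'Azur"]


def protected_spans(text, names=KNOWN_PLACE_NAMES):
    """Return sorted (start, end) character spans that the checker should skip."""
    spans = []
    for name in names:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end()))
    return sorted(spans)


def tokens_to_check(text, names=KNOWN_PLACE_NAMES):
    """Return the word tokens that fall outside every protected span."""
    spans = protected_spans(text, names)
    result = []
    for match in re.finditer(r"\w+", text):
        inside = any(s <= match.start() and match.end() <= e for s, e in spans)
        if not inside:
            result.append(match.group())
    return result


sentence = "En mi reciente viaje a París visité otra vez la Tour Eiffel."
remaining = tokens_to_check(sentence)
```

With this sketch, the tokens of "Tour Eiffel" are never handed to the Spanish checker, while the rest of the sentence is checked as usual.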
In addition, many other foreign-language expressions may be used when alluding to a foreign
place, and they can cause false positives, for instance:
Creo que es uno de los mejores restaurantes de todo Paris los platos son maravillosos, hechos
con ingredientes muy frescos. Todo está fantástico. Bon Appétit !
In this case, the strategy is to detect the fragment written in a different language and inhibit the
checking process for it.
Lexical base
Web monitoring provides new lexical resources that are added daily to the spelling and grammar
checker, such as:
• Proper names.
• Foreign words which are frequently used in blogs: brunch, risotto, cool, glamouroso,
pen_drive, chic, chill-out.
• Colloquial and new expressions: corta_y_pega, nacho, calidad-precio, rojo_pasión.
• SMS-like writing.
• Emoticons.
• Interjections frequently used in blogs.
About 180 new proper nouns and common words were added daily during November and
December (see Appendix 1).
Grammar engine
Because the content available on provider sites is generated by users, some of the system's
grammar rules have been slightly modified or disabled so as to comply with the providers' needs.
• The typographic rule used to detect multiple punctuation marks has been disabled. For
instance, the following is accepted as correct:
¡¡¡hola!!!
• The rule that returns a single error when several punctuation marks are
unbalanced has been modified, for example:
Hola!
This sentence is now considered correct (instead of ¡Hola!), although in Spanish exclamation and
question marks should be balanced with an opening and a closing mark.
• Email addresses, which might otherwise be flagged as
errors, are now ignored, for example: [email protected].
However, the system cannot detect and ignore user names such as:
chicoestelar / guillermoacosta / Misterpollo / carlitos77
as it is not possible to tell whether such a token is a mistake or a user name.
• Configuration parameters have been modified so that style and
typography correction can be disabled, given the type of content considered.
These are the API input parameters:
  o txt: input text, UTF-8 encoded, in plain text, HTML or XML.
  o key: access key, required for every request. To get a valid access key, contact
    [email protected].
  o clang: language of the text. The allowed values are the following:
      es: Spanish
      it: Italian
      en: English
      fr: French
  o ilang: language of the interface. Valid values are the same as for clang, or en (English).
  o format: format of the output. The allowed values are the following (described later):
      xml: XML (default)
      json: JSON format
      html: HTML format
      check: returns a tagged version of the text
  o offset: offset at which to start the revision of the text, starting from 0 (default).
  o mode: check mode, according to the following values:
      all: get all errors (default)
      next: get only the next error (from the given offset)
  o config: settings for the revision process. This parameter is a string containing a
    semicolon-separated list of one or more of the following values:
      pp=<value>: try to guess words with known prefixes: 0=inactive, 1=active (default).
      dh=<value>: handle hyphenation at the end of line: 0=inactive (default), 1=active.
      aqoi=<value>: accept words within quotes or in italics: 0=inactive (default), 1=active.
      tls=<value>: warn of too long sentences: 0=inactive, 1=active (default).
      dpn=<value>: try to guess unknown proper nouns: 0=inactive, 1=active (default).
      stme=<value>: behaviour in sentences with many errors: 0=show individual
        messages, 1=group messages and ignore, 2=group messages and give a warning (default).
      mf=<value>: try to guess mathematical formulas: 0=inactive, 1=ignore
        formulas, 2=check formulas (default).
      red=<value>: check redundancy in sentence: 0=inactive, 1=active (default).
      spa=<value>: check spacing: 0=inactive, 1=active (default).
      comppunc=<value>: check unbalanced punctuation: 0=inactive, 1=active (default).
      corrpunc=<value>: check incorrect punctuation: 0=inactive, 1=active (default).
      fwi=<value>: foreign words should be written in italics: 0=inactive, 1=active (default).
      alw=<value>: suggest alternatives to loan words: 0=inactive, 1=active (default).
      dic=<value>: list of active dictionaries [NOT SUPPORTED YET].
      level=<value>: knowledge of language, CEFR level (Common European
        Framework of Reference for Languages):
          • A1 (Breakthrough)
          • A2 (Waystage)
          • B1 (Threshold)
          • B2 (Vantage)
          • C1 (Proficiency)
          • C2 (Mastery) (default)
The configuration that should be used when dealing with user-provided content
such as Qype reviews is the following:
  o config:
      pp=1: try to guess words with known prefixes.
      dh=0: do not handle hyphenation at the end of line.
      aqoi=1: accept words within quotes or in italics.
      tls=0: do not warn of too long sentences.
      dpn=1: try to guess unknown proper nouns.
      stme=2: group messages and give a warning.
      mf=0: do not try to guess mathematical formulas.
      red=0: ignore redundancy in sentence.
      spa=0: ignore spacing mistakes.
      comppunc=0: ignore unbalanced punctuation.
      corrpunc=0: ignore incorrect punctuation.
      fwi=0: foreign words do not need to be written in italics.
      alw=1: suggest alternatives to loan words.
      level=A1: knowledge of language at Breakthrough level, CEFR (Common
        European Framework of Reference for Languages).
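The recommended settings for user generated content can be assembled into the semicolon-separated config string as in the sketch below. It only illustrates how the parameters fit together: the helper names build_config and build_query are hypothetical, "YOUR_KEY" is a placeholder, and no endpoint URL is shown because the document does not give one.

```python
from urllib.parse import urlencode

# Recommended values for user generated content, as listed above.
UGC_CONFIG = {
    "pp": 1, "dh": 0, "aqoi": 1, "tls": 0, "dpn": 1, "stme": 2, "mf": 0,
    "red": 0, "spa": 0, "comppunc": 0, "corrpunc": 0, "fwi": 0, "alw": 1,
    "level": "A1",
}


def build_config(settings):
    """Join name=value pairs with semicolons, as the config parameter expects."""
    return ";".join(f"{name}={value}" for name, value in settings.items())


def build_query(text, key, clang, settings=UGC_CONFIG):
    """Build a URL-encoded query string with the documented input parameters."""
    params = {
        "txt": text,
        "key": key,          # placeholder; a real key must be requested
        "clang": clang,
        "format": "json",
        "mode": "all",
        "config": build_config(settings),
    }
    return urlencode(params)


config_string = build_config(UGC_CONFIG)
query = build_query("Hola!", "YOUR_KEY", "es")
```

Keeping the settings in a dictionary makes it easy to override a single value, for example enabling spacing checks for one provider, without retyping the whole string.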
• When texts contained wrong accentuation, the checker used to build incorrect
grammatical structures. To avoid this, the order of the orthographic module and the
disambiguation module has been swapped, and the syntactic analysis now relies on
a disambiguation process that is insensitive to accents.
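One way to picture an accent-insensitive lookup is the sketch below. It is an assumption about how such a step could work, not a description of the Daedalus modules: the function names and the tiny lexicon are hypothetical. Only acute and grave accents are removed, so letters such as ñ are preserved.

```python
import unicodedata

# Combining acute and grave accents; the tilde of "ñ" is deliberately kept.
ACCENT_MARKS = {"\u0301", "\u0300"}


def strip_accents(word):
    """Remove acute/grave accents while leaving other diacritics untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", kept)


def build_index(lexicon):
    """Map each accent-free form to every lexicon form that shares it."""
    index = {}
    for form in lexicon:
        index.setdefault(strip_accents(form.lower()), set()).add(form)
    return index


def candidates(word, index):
    """Return all lexicon forms matching the word when accents are ignored."""
    return sorted(index.get(strip_accents(word.lower()), set()))


# Hypothetical mini-lexicon built from the examples in this section.
INDEX = build_index(["ultimo", "\u00faltimo", "ejercito", "ej\u00e9rcito", "a\u00f1o"])
matches = candidates("ultimo", INDEX)
```

Here a lookup for the unaccented form returns both the verb form and the accented noun or adjective, leaving the choice to context rules such as the ones shown in this section.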
Another typical accent-related problem involves words that are frequently written
without accents in user generated content. The rules that deal with this type of problem
have been improved. For instance, the following pairs of words are frequently misspelt
and need disambiguation:
o ultimo-último
el <error>ultimo</error> de la clase
el <error>ejercito</error> de Israel
he leído el <error>articulo</error>
RULE(L"ReglaUltimoPorÚltimo")
(
EXISTENTIAL_TAG(POS(N), TagVerb) AND
EXISTENTIAL_TAG(POS(N), TagSingularVerb) AND
EXISTENTIAL_TAG(POS(N), TagPresentIndicative) AND
!(EXISTENTIAL_TAG(POS(N), TagNoun OR_TAG TagAdjective) AND
!EXISTENTIAL_TAG(POS(N), TagNounAppreciativeY)) AND
(EXISTENTIAL_TAG(POS(N-1), TagArticle OR_TAG TagNumeral OR_TAG
TagDemonstrative OR_TAG
TagPossesive OR_TAG
TagPrenominalAdjective) AND
UNIVERSAL_TAG(POS(N-1), TagMasculine) AND
UNIVERSAL_TAG(POS(N-1), TagSingular) OR
FORM(POS(N-1), L"al|del") OR
EXISTENTIAL_TAG(POS(N-1), TagPreposition)) AND
EXISTENTIAL_TAG_WITH_ACCENTS(POS(N), TagMascSingNoun OR_TAG
TagMascSingAdjective, sTemp)
)
AND SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1),
POS_WITHOUT_OVERFLOW(N))
THEN
SUG_WORD(POS(N), sTemp);
ADD_ERROR(Error_Spelling, POS(N), POS(N),
msg(L"es", L"Posiblemente falte el acento.",
L"en", L"Maybe the diacritical mark is missing.",
L"fr", L"Assurez-vous de ne pas avoir oublié un
accent.",
L"it", L"Forse manca l'accento."),
Error_Poco_Seguro,
L"",
B1,
L"ReglaUltimoPorÚltimo");
END_RULE
o replica-réplica
los familiares de la <error>victima</error>
trabajo en aquella <error>fabrica</error>
RULE(L"ReglaReplicaPorRéplica")
(
EXISTENTIAL_TAG(POS(N), TagVerb3Singular) AND
EXISTENTIAL_TAG(POS(N), TagSingularVerb) AND
EXISTENTIAL_TAG(POS(N), TagVerbPresent) AND
EXISTENTIAL_TAG(POS(N), TagVerb2) AND
EXISTENTIAL_TAG(POS(N), TagImperativeVerb) AND
!(EXISTENTIAL_TAG(POS(N), TagNoun OR_TAG TagAdjective) AND
!EXISTENTIAL_TAG(POS(N), TagNounAppreciativeY)) AND
((EXISTENTIAL_TAG(POS(N-1), TagNumeral OR_TAG
TagDemonstrative OR_TAG
TagPossesive OR_TAG
TagPrenominalAdjective) OR
FORM(POS(N-1), L"una") OR
FORM(POS(N-1), L"la")) AND
UNIVERSAL_TAG(POS(N-1), TagFeminine) AND
UNIVERSAL_TAG(POS(N-1), TagSingular) OR
EXISTENTIAL_TAG(POS(N-1), TagPreposition)) AND
!(EXISTENTIAL_TAG(POS(N-1), TagDemonstrative) AND
IS_FIRST_WORD(POS(N-1))) AND
EXISTENTIAL_TAG_WITH_ACCENTS(POS(N), TagCommonNounFemenine OR_TAG
TagFeminineSingularAdjective,
sTemp)
)
POS_WITHOUT_OVERFLOW(N))
THEN
SUG_WORD(POS(N), sTemp);
accent.",
L"it", L"Forse manca l'accento."),
Error_Poco_Seguro,
L"",
B1,
L"ReglaReplicaPorRéplica");
END_RULE
o esta-está
El campesino <error>esta</error> asustado
me <error>esta</error> yendo muy bien
¿<error>Estas</error> furioso por eso?
no <error>esta</error> mal en absoluto que seas de un equipo como todo el
mundo
se <error>esta</error> de maravilla
RULE(L"ReglaEstaPorEstá")
(
FORM(POS(N), L"esta|estas") AND
(FORM(POS(N-1), L"cómo|dónde") AND
!EXISTENTIAL_TAG(POS(N+1), TagFemNoun) AND
NUMBER_AGREES(POS(N), POS(N+1)) OR
EXISTENTIAL_TAG(POS(N+1), TagAdjectiveOrParticiple OR_TAG TagNoun)
AND
!GENDER_AND_NUMBER_AGREE(POS(N), POS(N+1)) OR
EXISTENTIAL_TAG(POS(N+1), TagGerundVerb) AND
!EXISTENTIAL_TAG(POS(N+1), TagFemNoun) AND
!NUMBER_AGREES(POS(N), POS(N+1)) OR
FORM(POS(N-1), L"se"))
)
AND SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N),
POS_WITHOUT_OVERFLOW(N+2))
THEN
IF
FORM(POS(N), L"esta")
THEN
SUG_WORD(POS(N), L"está");
ELSEIF
FORM(POS(N), L"estas")
THEN
SUG_WORD(POS(N), L"estás");
END
accent.",
L"it", L"Forse manca l'accento "),
Error_Poco_Seguro,
L"",
B1,
L"ReglaEstaPorEstá");
END_RULE
o hacia-hacía
la limpieza se <error>hacia</error> casi imposible
Se <error>amplia</error> el carácter participativo de los carteles
RULE(L"ReglaHaciaPorHacía")
(
FORM(POS(N), L"se") AND
FORM_ENDING(POS(N+1), L"ia|ian") AND
EXISTENTIAL_TAG_WITH_ACCENTS(POS(N+1), TagVerb3, sTemp)
)
AND SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N),
THEN
SUG_WORD(POS(N+1), sTemp);
ADD_ERROR(Error_Spelling, POS(N+1), POS(N+1),
accent.",
L"it", L"Forse manca l'accento "),
L"esDidac", L"Quizá te hayas olvidado de poner el
acento."),
Error_Poco_Seguro,
L"",
B1,
L"ReglaHaciaPorHacía");
END_RULE
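For readers unfamiliar with the rule language, the core condition of ReglaHaciaPorHacía ("se" followed by a word ending in -ia or -ian that has an accented verb reading) can be approximated in Python as below. This is a simplified sketch, not a translation of the rule: VERB_FORMS_3RD is a hypothetical stand-in for the morphological lexicon, the accent is assumed to fall on the final í, and the SAFE_CONTEXT check is omitted.

```python
# Hypothetical set of accented third-person verb forms.
VERB_FORMS_3RD = {"hacía", "hacían", "amplía", "amplían"}


def missing_accent_after_se(tokens, verb_forms=VERB_FORMS_3RD):
    """Return (index, word, suggestion) for unaccented verb forms after 'se'."""
    suggestions = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() != "se":
            continue
        word = tokens[i + 1]
        lower = word.lower()
        if lower.endswith("ian"):
            accented = lower[:-3] + "ían"
        elif lower.endswith("ia"):
            accented = lower[:-2] + "ía"
        else:
            continue
        # Only suggest when the accented variant is a known verb form.
        if accented in verb_forms:
            suggestions.append((i + 1, word, accented))
    return suggestions


found = missing_accent_after_se("la limpieza se hacia casi imposible".split())
```

The preposition "hacia" without a preceding "se" is left alone, which mirrors the way the rule restricts itself to the pronominal context.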
o mas-más
Newton dedicó <error>mas</error> tiempo a la investigación química que a la
física
Fue el pensador de mentalidad <error>mas</error> científica que produjo la
Edad Media.
La complejidad del genoma es mayor en los organismos <error>mas</error>
evolucionados.
Se añadirán tres meses <error>mas</error> de reclusión.
Aquí el fiscal no es <error>mas</error> que un hombre que denuncia.
El estado, <error>mas</error> comprometido en esa época.
todos los partidos políticos, <error>mas</error> de mil talleres de capacitación
con la sed permanente de <error>mas</error> y <error>mas</error> consumo
RULE(L"ReglaMasPorMás")
(
FORM(POS(N), L"mas") AND
EXISTENTIAL_TAG(POS(N), TagConjunction) AND
((UNIVERSAL_TAG(POS(N+1), TagNoun OR_TAG TagAdjective OR_TAG
TagParticiple) AND
!EXISTENTIAL_TAG(POS(N+1), TagProperNoun)) OR
EXISTENTIAL_TAG(POS(N+1), MorStrongPunctuationC OR_TAG
TagPreposition OR_TAG TagConjunction) OR
EXISTENTIAL_TAG(POS(N-1), TagConjunction OR_TAG TagPreposition)) AND
!FORM(POS(N+1), L"sin_embargo")
)
THEN
SUG_WORD(POS(N),L"más");
msg(L"es", L"Posible confusión al emplear la conjunción
mas en vez del adverbio más,
conviene revisar el acento.",
L"en", L"It may be a confusion using the conjunction
mas instead of the adverb más,
it is convenient to check the diacritical
mark.",
L"fr", L"Il y a peut-être une confusion entre la
conjonction mas et l’adverbe más.",
L"it", L"Possibile confusione nell'uso della
congiunzione mas invece dell'avverbio
más. Controllare l'accento."),
Error_Poco_Seguro,
L"Check_Panhisp",
C2,
L"ReglaMasPorMás");
END_RULE
In addition, the rules used to detect proper nouns (Name + Surname) have been modified.
Other modifications
Most of the problems detected during the monitoring phase were related to sentences with a high
number of prepositions or sentences with coordinated structures.
In the first case, the threshold used to decide whether a sentence is correct has been modified,
which yields more adequate performance.
“la víctima y su marido vivían en el piso” [the victim and her husband lived in the
apartment]
In the second case, the main problem arises when checking agreement between verb and
subject when the subject is a coordinated structure. This problem has been solved, and
sentences such as the following are now correctly analyzed:
“los postres sobre todo el tiramisú estaban muy buenos” [the desserts, especially the
tiramisu, were very good]
Some other rules have been included to distinguish between pairs such as:
o haber vs. a_ver
Mira <error>haber</error> si no llegas a tiempo
RULE(L"ReglaHaberPorAVer")
FORM(POS(N), L"haber") AND
IS_ELEMENTAL_TOKEN(POS(N)) AND
FORM(POS(N+1), L"si") AND
!EXISTENTIAL_TAG(POS(N-1), TagDeterminer OR_TAG TagPreposition)
THEN
SUG_WORD(POS(N), L"a");
SUG_SPACE();
SUG_WORD(POS(N+1), L"ver");
msg(L"es", L"Posible confusión al emplear la forma verbal
haber inadecuadamente.",
L"en", L"Possible confusion using the verb
haber improperly.",
L"fr", L"Le verbe haber n'est pas employé
correctement.",
L"it", L"Possibile confusione facendo uso della forma
verbale haber inadeguatamente."),
Error_Poco_Seguro,
L"",
C1,
L"ReglaHaberPorAVer");
END_RULE
o sino vs. si_no
Mi mujer me sabía inofensivo, pues <error>sino</error> habría ideado otra solución
RULE(L"ReglaSinoPorSi_No")
FORM(POS(N), L"sino") AND
!(EXISTENTIAL_TAG(POS(N-1), TagDeterminer) AND
FORM(POS(N-1),
L"suyo|suya|suyos|suyas|mío|mía|míos|mías|tuyo|tuya|tuyos|tuyas")AND
(UNIVERSAL_TAG(POS(N+1), TagPersonalVerb) AND
!FORM(POS(N+1), L"hace") OR
EXISTENTIAL_TAG(POS(N+1), TagAccusativePersonal OR_TAG TagDativePersonal)
AND
UNIVERSAL_TAG(POS(N+2), TagPersonalVerb) OR
FORM(POS(N+1), L"no") AND
EXISTENTIAL_TAG(POS(N+2), TagAccusativePersonal OR_TAG TagDativePersonal)
AND
EXISTENTIAL_TAG(POS(N+2), TagDativePersonal) AND
EXISTENTIAL_TAG(POS(N+3), TagAccusativePersonal) AND
UNIVERSAL_TAG(POS(N+4), TagPersonalVerb)) AND
SAFE_CONTEXT(POS_WITHOUT_OVERFLOW(N-1), POS_WITHOUT_OVERFLOW(N+4))
THEN
SUG_WORD(POS(N), L"si no");
NEW_SUG();
SUG_WORD(POS(N), L"sino que");
msg(L"es", L"Posible confusión al emplear la conjunción
sino inadecuadamente.",
L"en", L"Possible confusion using the conjunction
sino improperly.",
L"fr", L"Il s'agit peut-être d'un usage incorrect de la
conjonction sino.",
L"it", L"Possibile confusione nell'uso della congiunzione
sino."),
Error_Poco_Seguro,
L"",
C1,
L"ReglaSinoPorSi_No");
END_RULE
Furthermore, a new rule has been developed to detect unknown abbreviations and return
an error without suggestions; previously the system suggested a large number of possible forms.
We have also considered whether the determiners that use @
to indicate unspecified gender (tod@s, ningun@s, algun@s...) should be added to the resources, and whether to create new rules
for analyzing sentences whose structure does not match the expected patterns because oral patterns are used in written
language.
Finally, since 23 December the orthographic rules are being modified to follow the
recommendations of the new RAE (Real Academia Española) orthography.
5. Appendix 1
This section contains the list of 172 words extracted from the checking reports of 21 December. These
words are analyzed by our experts and added to the resource files with the corresponding
information. Each type of resource is stored in a different file, as shown below.
File: LAST_NAMES.db (15 terms included)
This file contains a list of last names in different languages.
Bakiyev
Crace
Faili
Fakhrizadeh
Ayad
Balfe
Bersani
Yeates
Bongiorno
Brisman
Cardle
Creevey
Asato
Hasseloff
Baqeri
Each of these terms is incorporated into the resources together with some grammatical and semantic
information.
Table 3: Entry example for LAST_NAMES.db
Grammatical info
BAKIYEV
form = Bakiyev
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
dic = DIC_DEF
gender = undef
number = undef
concepts = C_BAKIYEV
Semantic info
C_BAKIYEV
sementity_type = PERSON
sementity_subtype = LAST_NAME
sementity_class = instance
sementity_fiction = nofiction
File: CELEBRITIES.db (8 terms included)
File containing the names of celebrities.
Jonas_Brothers
Kanye_West
Alain_Bashung
Kurmanbek_Bakiyev
Usher
Blondie
The_New_Romantiques
Sparklehorse
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 4: Entry example for CELEBRITIES.db
Grammatical info
JONAS_BROTHERS
form = Jonas_Brothers
en = ok
it = ok
fr = ok
es = ok
cat = noun
noun_type = proper
gender = undef
number = undef
dic = DIC_SOC
concepts = C_JONAS_BROTHERS
Semantic info
C_JONAS_BROTHERS
semtheme_type = ARTS
semtheme_subtype = MUSIC
sementity_type = ORGANIZATION
sementity_subtype = ORGANIZATION_OTHER
sementity_subsubtype =
ARTISTIC_ORGANIZATION
File: ECONOMY.db (10 terms included)
Set of economic entities such as companies.
UKFI
UK_Financial_Investments
Nutella
ParkSinta
Rodilla
WD-40
Yonhap_News_Agency
Yonhap
Groupon
Bangladesh_Garments_Manufacturers_and_Exporters_Associations
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 5: Entry example for ECONOMY.db
Grammatical info
UKFI
form = UKFI
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
src = INTERNET
gender = undef
number = sing
concepts = C_UKFI
text_form = initialism
remission_entitykey = UK_Financial_Investments
UK_Financial_Investments
form = UK_Financial_Investments
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
src = INTERNET
gender = undef
number = sing
concepts = C_UK_Financial_Investments
Semantic info
C_UKFI
sementity_subtype = COMPANY
FINANCIAL_COMPANY
sementity_subsubsubtype =
INVESTMENT_COMPANY
C_UK_Financial_Investments
sementity_subtype = COMPANY
FINANCIAL_COMPANY
INVESTMENT_COMPANY
File: INSTITUTIONS.db (13 terms included)
File containing a set of institutions.
British_Council
Consejo_Británico
CFJ
Centre_de_Formation_des_Journalistes
Atelier_de_Poésie_Ouverte
Solidaritat_Catalana
ANC
African_National_Congress
Campaign_for_an_English_Parliament
CEP
FoE
PIRC
Parental_Information_and_Resource_Centers
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 6: Entry example for INSTITUTIONS.db
Grammatical info
BRITISH_COUNCIL
form = British_Council
en = ok
es = ok
fr = ok
it = ok
cat = noun
noun_type = proper
gender = masc
number = sing
concepts = C_BRITISH_COUNCIL
es_checkinfo_id = foreign_word
es_checkinfo_level = b1
es_checkinfo_form = Consejo_Británico
es_checkinfo_lang = english
CONSEJO_BRITÁNICO
form = Consejo_Británico
es = ok
cat = noun
noun_type = proper
gender = masc
number = sing
concepts = C_CONSEJO_BRITÁNICO
Semantic info
C_BRITISH_COUNCIL
sementity_subtype = INSTITUTE
sementity_subsubtype = INSTITUTE_OTHER
semtheme_type = SOCIETY
semtheme_subtype = EDUCATION
C_CONSEJO_BRITÁNICO
sementity_subtype = INSTITUTE
sementity_subsubtype = INSTITUTE_OTHER
semtheme_subtype = EDUCATION
File: FIRST_NAMES.db (28 terms included)
This file contains a list of first names in different languages.
Colima
Bashir
Ginni
Laureline
Marielle
Batefimbi
Suransky
Barbaree
Rumack
Mirlande
Mohamud
Philomene
Siân
Uri
Zwelakhe
Andi
Cath
Cheng
Choe
Chun
Fereidoun
Kurmanbek
Abdur
Aisling
Catia
Ayad
Brigid
Caity
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 7: Entry example for FIRST_NAMES.db
Grammatical info
COLIMA#2
form = Colima
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
dic = DIC_DEF
gender = masc
number = sing
concepts = C_COLIMA#2
Semantic info
C_COLIMA#2
sementity_type = PERSON
sementity_subtype = FIRST_NAME
File: OTHER_NAMES.db (20 terms included)
This file contains a list of various proper names such as films, products, film festivals.
Ampera
El_Día_Después
meridiano_de_Greenwich
Waverley
Mediator
American_Music_Awards
AMA
Complexo_do_Alemão
Panopticon
REDD
Yeddah
Scary_Movie
Qasam
Sipdis
Secret_Internet_Protocol_Distribution
Secret_Internet_Protocol_Router_Network
SIPRNet
Wrongfully_Accused
Broxden_Junction
Broxden
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 8: Entry example for OTHER_NAMES.db
Grammatical info
AMPERA
form = Ampera
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
dic = DIC_DEF
src = INTERNET
gender = undef
number = sing
concepts = C_AMPERA
EL_DÍA_DESPUÉS
form = El_Día_Después
es = ok
en = ok
it = ok
fr = ok
cat = noun
noun_type = proper
src = INTERNET
gender = masc
number = sing
concepts = C_EL_DÍA_DESPUÉS
Semantic info
C_AMPERA
semtheme_subtype = TRANSPORT
sementity_type = PRODUCT
sementity_subtype = PHYSICAL_PRODUCT
sementity_subsubtype = VEHICLE
C_EL_DÍA_DESPUÉS
sementity_subtype = CULTURAL_PRODUCT
sementity_subsubtype = SHOW
semtheme_type = SPORT
semtheme_subtype = FOOTBALL
File: LOCATION.db (20 terms included)
File containing a list of locations.
Ergenekon
Côte_d'Azur
Costa_Azul
Costa_Azzurra
Cité_Soleil
Ceyhan
Kouriles
Kuriles
Curiles
Terzigno
Mirail
Cadix
Paestum
Thézy-Glimont
Etampes
Colombey
Greymouth
Amilly
Yssingeaux
Mohammédia
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 9: Entry example for LOCATION.db
Grammatical info
ERGENEKON
form = Ergenekon
es = ok
en = ok
fr = ok
it = ok
cat = noun
noun_type = proper
gender = undef
number = sing
concepts = C_ERGENEKON
% En la mitología turca, Ergenekon es un lugar mítico
localizado en los inaccesibles valles de los montes
Altaï.
CÔTE_D'AZUR
form = Côte_d'Azur
en = ok
fr = ok
it = ok
es = ok
cat = noun
noun_type = proper
gender = undef
number = sing
concepts = C_CÔTE_D'AZUR
en_checkinfo_form = French_Riviera
en_checkinfo_id = wrong_adaptation
en_checkinfo_lang = french
en_checkinfo_level = b2
es_checkinfo_form = Costa_Azul
es_checkinfo_id = foreign_word
es_checkinfo_lang = french
es_checkinfo_level = b2
it_checkinfo_form = Costa_Azzurra
it_checkinfo_id = foreign_word
it_checkinfo_lang = french
it_checkinfo_level = b2
Semantic info
C_ERGENEKON
semgeo_countrykey = C_TURQUÍA
sementity_type = LOCATION
sementity_subtype = LOCATION_OTHER
sementity_fiction = fiction
semtheme_type = HUMANITIES
semtheme_subtype = MYTHOLOGY
C_CÔTE_D'AZUR
semgeo_countrykey = C_FRANCIA#1
sementity_type = LOCATION
sementity_subtype = GEOLOGICAL_REGION
GEOLOGICAL_REGION_OTHER
NATURAL_REGION
File: GENT.db (3 terms included)
Names given to the people from a particular region or country.
mecano
grisón
panyabí
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 10: Entry example for GENT.db
Grammatical info
MECANO#1
form = mecano
model = A_NINO
model = DIC_DEF
model = A
concepts = C_MECANO
src = DRAE
MECANO#2
form = mecano
model = N_NINO
model = DIC_DEF
model = N
concepts = C_MECANO
Semantic info
C_MECANO
model = CITYGENT
semgeo_citykey = C_LA_MECA
semgeo_countrykey = C_ARABIA_SAUDÍ
C_MECANO
model = CITYGENT
semgeo_citykey = C_LA_MECA
semgeo_countrykey = C_ARABIA_SAUDÍ
File: FOREIGN.db (7 terms included)
Set of foreign words that are commonly used in different languages.
currie
post
glamouroso
risotto
brunch
death_metal
trash_metal
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 11: Entry example for FOREIGN.db
Grammatical info
CURRIE
form = currie
model = ESTRES
model = DIC_DEF
concepts = C_CURRIE
model = N
checkinfo_orig = curry
checkinfo_lang = english
checkinfo_id = wrong_adaptation
checkinfo_form = curri
checkinfo_level = a1
checkinfo_src = panhisp
Semantic info
C_CURRIE
model = GASTRONOMY
File: NOMS.db (8 terms included)
List of nouns.
ye
relaciones_públicas
calidad-precio
cronicidad
autopase
neurodesarrollo
tempura
contraproyecto
Each of these terms is incorporated into the resources together with some grammatical and semantic information.
Table 12: Entry example for NOMS.db
Grammatical info
YE
form = ye
lex = ye
model = N_LUNA
model = DIC_DEF
concepts = C_YE
model = N
src = DRAE
% sin: i griega
RELACIONES_PÚBLICAS
form = relaciones_públicas
lex = relaciones_públicas
model = RUBIALES
model = DIC_DEF
concepts = C_RELACIONES_PÚBLICAS
model = N
src = DAL DRAE
Semantic info
C_YE
model = LINGUISTICS
C_RELACIONES_PÚBLICAS
model = SOCIETY
File: ADJS.db (7 terms included)
List of adjectives.
cachifo
ortotelefónico
calentorro
cantoso
orujero
hímnico
siseante
Each of these terms is incorporated into the resources together with some grammatical information.
Table 13: Entry example for ADJS.db
Grammatical info
CACHIFO
form = cachifo
model = A_NINO
model = COLOQ
model = A
src = SM3
File: ABREV-SMS.db (17 terms included)
File that contains SMS-like writing.
4ever
=mnt
1bs
1bso
awa
crk
cta
ksa
ksi
ktl?
mvil
npi
ns
prnt
qdamos
to2
ymm
Each of these terms is incorporated into the resources together with some grammatical information.
Table 14: Entry example for ABREV-SMS.db
Grammatical info
4ever
form = 4ever
model = COLOQ
cat = sc
tag = Z
es_checkinfo_id = wrong_form
es_checkinfo_level = a1
es_checkinfo_form = para siempre
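The entry examples in this appendix share a simple "key = value" layout, so a reader could load one with a few lines of Python. The sketch below is only an illustration of that layout: the functions parse_entry and suggested_form are hypothetical, the real resource file syntax may differ, and pairs are kept in a list because keys such as model can appear more than once in an entry.

```python
def parse_entry(lines):
    """Parse 'key = value' lines into an ordered list of (key, value) pairs."""
    pairs = []
    for line in lines:
        if "=" not in line:
            continue  # skip entry headers and other lines without a pair
        key, _, value = line.partition("=")
        pairs.append((key.strip(), value.strip()))
    return pairs


def suggested_form(pairs, lang):
    """Return the <lang>_checkinfo_form value of an entry, if it has one."""
    wanted = f"{lang}_checkinfo_form"
    for key, value in pairs:
        if key == wanted:
            return value
    return None


# The 4ever entry from Table 14, copied line by line.
ENTRY_4EVER = [
    "4ever",
    "form = 4ever",
    "model = COLOQ",
    "cat = sc",
    "tag = Z",
    "es_checkinfo_id = wrong_form",
    "es_checkinfo_level = a1",
    "es_checkinfo_form = para siempre",
]

suggestion = suggested_form(parse_entry(ENTRY_4EVER), "es")
```

Splitting on the first "=" only also keeps values that themselves contain the character, such as the emoticon form "=)", intact.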
File: EMOTIC.db (10 terms included)
Set of emoticons.
=)
:'(
:-X
:O
u_u
U_U
'O'
:S
:P
:*
Each of these terms is incorporated into the resources together with some grammatical information.
Table 15: Entry example for EMOTIC
Grammatical info
=)
form = =)
model = COLOQ
lex = contento
cat = sc
tag = Z
src = DAEDALUS
File: CL-IJ.db (6 terms included)
Interjections frequently used in blogs
jejeje
jejejeje
jajaja
juas
jajajaja
jeajeajeajea
Each of these terms is incorporated into the resources together with some grammatical information.
Table 16: Entry example for CL-IJ
Grammatical info
jejeje
form = jejeje
lex = jejeje
cat = ij
src = INTERNET
