Extracting A Parallel Corpus from the Common Crawl


Extracting A Parallel Corpus from the Common Crawl
● Candidate document pairs identified via URLs
○ http://europa.eu/index_de.htm
○ http://europa.eu/index_en.htm
● HTML documents aligned using tags
○ misaligned pairs are dropped
● Sentences aligned using Gale & Church
○ i.e. based on sentence length only
● A handful of heuristics to check sentence alignment
○ e.g. if there are numbers they must match
● Currently no charset detection or language detection (!)
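The sentence-level steps above can be illustrated with a minimal Python sketch. It is not the code behind these slides: `align_by_length`, `length_cost`, `numbers_match` and the penalty values in `BEAD_PENALTY` are hypothetical names and illustrative numbers, and the cost function is a simplified stand-in for the probabilistic cost of the published length-based algorithm. It only shows the general idea of aligning on character length alone and then rejecting pairs whose numbers disagree.

```python
import math
import re

# Assumed penalties per alignment type (source sentences, target sentences).
# Illustrative values only: 1-1 matches are preferred over insertions,
# deletions and 2-1 / 1-2 merges.
BEAD_PENALTY = {(1, 1): 0.0, (1, 0): 4.5, (0, 1): 4.5, (2, 1): 2.3, (1, 2): 2.3}


def length_cost(src_len, tgt_len):
    """Penalty that grows with the normalised difference in character length."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    return abs(src_len - tgt_len) / math.sqrt(src_len + tgt_len)


def align_by_length(src, tgt):
    """Align two sentence lists by dynamic programming over lengths only."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(src_len, tgt_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the aligned groups.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads


def numbers_match(src_sentence, tgt_sentence):
    """Heuristic check: both sides must contain the same digit sequences."""
    src_numbers = sorted(re.findall(r"\d+", src_sentence))
    tgt_numbers = sorted(re.findall(r"\d+", tgt_sentence))
    return src_numbers == tgt_numbers
```

A caller would typically run `align_by_length` on the sentences of one aligned HTML segment, keep only the 1-1 groups, and discard any pair for which `numbers_match` returns False. Because nothing here looks at the words themselves, a wrong-language or mis-encoded page can still pass, which is the gap the last bullet points at.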
Accessing CommonCrawl Data
● 60TB of data on Amazon S3, only feasible to access through Elastic MapReduce
● Strategy:
○ Mappers search for language codes in URLs:
■ http://europa.eu/index_de.htm and http://europa.eu/index_en.htm are both mapped to http://europa.eu/index_*.htm
○ Reducers receive candidate bilingual document pairs and return aligned parallel sentences
● Implementation:
○ Hadoop code (Java) which works on Amazon Elastic MapReduce or on a local Hadoop install
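The mapper strategy above can be sketched in a few lines of Python. This is only an illustration of the key-normalisation idea, not the Hadoop (Java) implementation the slides refer to: `map_url`, `group_candidates`, the `LANG_CODES` tuple and the rule that a code must be delimited by non-alphanumeric characters are all assumptions made for the example. The grouping function stands in for the shuffle step that would hand each key's documents to a reducer.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Assumed language codes (the four languages that appear in the results).
LANG_CODES = ("de", "en", "es", "fr")

# Assumed rule: a code only counts when it is not part of a longer word,
# e.g. "_de." or "/en/", so the "de" inside "index" is ignored.
LANG_IN_PATH = re.compile(
    r"(?<![A-Za-z0-9])(%s)(?![A-Za-z0-9])" % "|".join(LANG_CODES)
)


def map_url(url):
    """Mapper step: return (wildcard_key, language) or None if no code is found."""
    parts = urlsplit(url)
    # Search the path only, so country-code domains such as ".de" are not matched.
    match = LANG_IN_PATH.search(parts.path)
    if match is None:
        return None
    path = parts.path[:match.start()] + "*" + parts.path[match.end():]
    key = urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
    return key, match.group(1)


def group_candidates(urls):
    """Stand-in for the shuffle: collect URLs per key, keep multilingual keys."""
    groups = defaultdict(dict)
    for url in urls:
        mapped = map_url(url)
        if mapped is not None:
            key, lang = mapped
            groups[key][lang] = url
    return {key: langs for key, langs in groups.items() if len(langs) > 1}
```

Each surviving key then carries one URL per language, and every pair of languages under a key is a candidate bilingual document pair for the reducer to align.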
More choices, better coverage
Más opciones y mejor cobertura
We pay 100% of covered costs directly
Pagamos directamente el 100% de los costos
Clear terms, flexible payment options
Condiciones claras y opciones de pago flexibles
24-hour roadside assistance
Asistencia en carretera las 24 horas
Your choice of repair shops
Usted elige el taller de reparación
Towing and rental car provided
Servicio de remolque y alquiler de automóviles
A company you can trust
Una empresa en la que usted puede confiar
          sents    words     chars
de-en:     5025    15971    122701
de-es:      408      883      8068
de-fr:      186      477      3435
en-es:      390      902      6016
en-fr:     5535    25673    161938
es-fr:      284     1106      7649
(from a 1GB sample of the 60TB corpus)
