STEP 2014 - Modeling linguistic variation with machine learning, R

Transcript

Modeling linguistic variation using machine learning, R, and the WFU DEAC cluster
TechXploration, Wake Forest University
Dr. Jerid Francom, Department of Romance Languages
Dr. Damian Valles, Information Systems
Parallel processing advantage

Working with the DEAC team, I learned about strategies to overcome the processing bottleneck by sending portions of the computational task to multiple cores on a single workstation (PC) or to the CPUs of multiple nodes in a cluster. Not all tasks benefit from parallel computation.
[Figure: Elapsed time (mins) by sample size (tweets) for Naive Bayes text classification, comparing MBA (64 bit) and DEAC (64 bit) performance across the create_splits, create_model, and test_model steps.]
Using the R packages doMC, doParallel, and Rmpi, I wrote my own NB classifier that implemented parallel processing.
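As a rough illustration of that pattern (not the poster's actual classifier code), the sketch below splits a set of tweets into chunks and hands each chunk to its own core with doParallel and foreach; classify_chunk() and the toy objects are stand-ins.

## Illustrative doParallel/foreach pattern: process chunks of tweets on separate cores.
library(doParallel)
library(foreach)

registerDoParallel(cores = 8)                        # 8 cores was the best-performing DEAC setup

tweets <- sprintf("tweet text %d", 1:10000)          # stand-in for the real tweet texts
chunks <- split(tweets, cut(seq_along(tweets), 8))   # one chunk of tweets per core

classify_chunk <- function(chunk) {
  # placeholder for per-chunk work (e.g., running predict() over a block of tweets)
  nchar(chunk)
}

results <- foreach(ch = chunks, .combine = c) %dopar% classify_chunk(ch)
stopImplicitCluster()

The same foreach code can be pointed at multiple cluster nodes by registering a different parallel backend (the Rmpi route mentioned above) rather than local cores.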
Parallel computing

[Figure: Elapsed time vs. number of cores (4, 8, 12, 16) on DEAC (64 bit) for the 15k-tweet set, the 1+ million-tweet set, and a 'GLM' task, broken down by step (create_splits, create_model, test_model_mc), with a 'Baseline 42.72' reference marked.]

Model improvement

Standard 'tuning' of the model parameters was performed in order to improve performance and classification accuracy (i.e. removing common and/or sparse terms), increasing the chances that the most relevant features are reflected in the model and, in turn, reducing or eliminating those features that may blur correct generalizations. Run times were improved on the 15k set, and a 1+ million tweet classification model was trained in just over an hour.
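For the common/sparse-term pruning mentioned above, one typical route in R is the tm package; the sketch below illustrates the idea with a toy corpus and is not necessarily the exact tooling behind the poster.

library(tm)

docs <- c("no sabe que hacer",
          "el chico va vestido de traje",
          "la chica se enoja con el papa")               # toy stand-in for tweet texts

corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("spanish"))  # drop very common (stop) words
dtm  <- DocumentTermMatrix(corp)
dtm  <- removeSparseTerms(dtm, sparse = 0.99)            # drop terms rarer than the sparsity cutoff
inspect(dtm)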
Recent applications
Asked various native speakers to watch a video and then send a brief narrative of the events in the video.
A comparable corpus based on 430 TV/film transcripts from Argentina, Mexico, and Spain (3.9+ million total word tokens).
Overall test accuracy of 72% using the 1+ million tweet model.
Llega un morrito puberto a la casa de su supuesta novia o cita. Lo recibe un señor que es el papá de la chica. Se ve que es una familia fresa. El chico va vestido de traje y lleva un ramo de flores para ella. El don se porta amable y le empieza a ofrecer un condón, una pachita y un gallo. El chico no quiere esas cosas porque lo comprometen y está sacadísimo de onda, no sabe qué hacer. El pobre acepta todo porque se siente comprometido y probablemente piensa "el ruco es buena onda". Baja la chica toda guapa y se saludan. El papá empieza a actuar diferente y le va sacando todas las cosas que le ofreció al chico para hacerlo quedar mal. Al final el pobre chico queda como el pirata y el papá queda como el bueno. La chica se enoja con el papá.

[English translation: A teenage kid shows up at the house of his supposed girlfriend or date. He is greeted by a man who turns out to be the girl's father. You can tell it is a posh family. The boy is dressed in a suit and carries a bouquet of flowers for her. The gentleman acts friendly and starts offering him a condom, a flask, and a joint. The boy does not want those things because they compromise him; he is completely thrown off and does not know what to do. The poor guy accepts everything because he feels obligated and probably thinks "the old man is cool." The girl comes down looking lovely and they greet each other. The father starts acting differently and takes back everything he offered, to make the boy look bad. In the end the poor boy comes off looking bad and the father comes off as the good guy. The girl gets angry with her father.]
Scores
Argentina: 32.4
Mexico: 36.5 ✓
Spain: 31.1
Task
The goal of this task is to use status posts acquired through the Twitter API to develop a text classification model capable of recognizing and classifying tweets and related language as associated with one of three countries (Argentina, Mexico, and Spain). The accuracy of the classifier depends on a few things: 1) the language from each of the countries should show sufficiently different distributional patterns, 2) there should be sufficient data at model training time to provide robust estimates for each of the classes (countries, in this case), and 3) the model should be tuned correctly to provide the most efficient algorithm to detect any potentially indicative information useful for making accurate predictions.
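The timing figures break the workflow into steps named create_splits, create_model, and test_model; a create_splits-style step might look roughly like the stratified train/test split sketched below (the data frame, proportions, and implementation are made up; only the step name comes from the poster).

set.seed(42)

tweets <- data.frame(
  text    = sprintf("tweet %d", 1:3000),                # stand-in texts
  country = rep(c("Argentina", "Mexico", "Spain"), each = 1000),
  stringsAsFactors = FALSE
)

create_splits <- function(df, train_prop = 0.8) {
  # sample training rows within each country so every class keeps the same proportion
  train_idx <- unlist(lapply(split(seq_len(nrow(df)), df$country), function(idx) {
    sample(idx, size = floor(train_prop * length(idx)))
  }))
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

splits <- create_splits(tweets)
table(splits$train$country)   # roughly equal training data per class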
(2) Parallel approach: custom functions infused with code to implement doMC/doParallel parallel processing.
Language variation
My interest has been to approach the documentation and research of language variation through access to emerging data sources on the web (TV/film transcripts, online newspapers, and other data repositories). A recent interest of mine is to mine the social media network Twitter for information about what is common and unique across varieties of the Spanish language, in particular those of the political entities Mexico, Argentina, and Spain. Knowledge of this sort is valuable for research into dialect variation and can help students of Spanish gain a better appreciation of the common and unique characteristics of regional varieties.
[Figure: word-by-word scores for the content words of the sample narrative, plotted by corpus (Argentina, Mexico, Spain).]
Serial processing bottleneck

(1) Serial approach: e1071 and base functions
Out of the box, the e1071 R package's naiveBayes() algorithm and the base predict() function showed infeasible processing times for the 10+ million words to be processed.
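For reference, a minimal sketch of that serial pipeline on a toy term table (the region-flavored feature words and counts are invented for illustration; the real model used document-term data built from millions of tweets):

library(e1071)

# toy presence/absence features for three region-flavored words
train_x <- data.frame(
  vos  = factor(c("yes", "no",  "no",  "yes"), levels = c("no", "yes")),
  wey  = factor(c("no",  "yes", "no",  "no"),  levels = c("no", "yes")),
  vale = factor(c("no",  "no",  "yes", "no"),  levels = c("no", "yes"))
)
train_y <- factor(c("Argentina", "Mexico", "Spain", "Argentina"))

nb <- naiveBayes(train_x, train_y, laplace = 1)   # estimate class priors and conditionals

test_x <- data.frame(
  vos  = factor("yes", levels = c("no", "yes")),
  wey  = factor("no",  levels = c("no", "yes")),
  vale = factor("no",  levels = c("no", "yes"))
)
predict(nb, test_x)                # predicted country
predict(nb, test_x, type = "raw")  # per-country posterior probabilities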
Goals

STEP 2014 funds were awarded to:
✓ pilot various computing implementations on a text classification task involving millions of Twitter status posts for an ongoing study of Spanish language variation.
✓ support research, development, and documentation of high-performance computing strategies to process 'big data' resources for applications in the Humanities at WFU.

Naive Bayes Text Classification

Naive Bayes text classification: a robust machine learning algorithm that provides language models which can be explored (unlike some other algorithms, such as Support Vector Machines).
Memory management and performance can be problematic for larger data processing tasks.

Conditional probability

ĉ = argmax_{c ∈ C} [ P̂(c) · ∏ P̂(wi | c) ]

where the words w1, …, wn are used as features, the countries are the classes c ∈ C, and the product runs over the words of the tweet. The probability that a sequence of words predicts a class is estimated by calculating the prior probability P̂(c) of a country being the source of any given tweet and then the individual conditional probabilities P̂(wi | c) that each word is associated with that country (cf. Manning, Raghavan and Schütze 2008).
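A worked sketch of that scoring rule in R, using invented counts and add-one smoothing (the words, counts, priors, and vocabulary size below are purely illustrative):

words      <- c("che", "boludo", "vale")   # tokens of a toy tweet
vocab_size <- 1000                         # assumed vocabulary size

# invented training counts of each word within each country's tweets
counts <- rbind(
  Argentina = c(che = 50, boludo = 30, vale = 2),
  Mexico    = c(che = 5,  boludo = 1,  vale = 4),
  Spain     = c(che = 1,  boludo = 1,  vale = 60)
)
total_tokens <- c(Argentina = 5000, Mexico = 5000, Spain = 5000)
prior        <- c(Argentina = 1/3,  Mexico = 1/3,  Spain = 1/3)   # P^(c)

# log P^(c) + sum_i log P^(wi | c), with add-one (Laplace) smoothing
score <- sapply(rownames(counts), function(cl) {
  unname(log(prior[cl]) +
           sum(log((counts[cl, words] + 1) / (total_tokens[cl] + vocab_size))))
})
score                     # higher (less negative) = more likely source country
names(which.max(score))   # predicted country for the toy tweet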
R & text classification

R: a free and open-source software environment with a comprehensive package library and an active development community.

WorldTweets 2014 Corpus

50+ million tweets collected via the Twitter API over three weeks in Jan/Feb 2014.
[Figure: collected tweets plotted by long (longitude) and lat (latitude).]
Humanities & Big Data

Big data is a problem in the Humanities. Given the promise of increasing amounts of data in electronic form, this may sound counter-intuitive; clearly more data is better. Yet the leaps in the size of data have outpaced the power to process such data on the typical PC. Industry and computationally minded sectors of higher education have long turned to high-performance computing to take advantage of these information-rich resources, and have done so with much success. But high-performance computing is not a household term in the Humanities, in large part because faculty lack access to and knowledge of computing resources and programming strategies.
Spring 2015
Results
✓ 8 cores on the DEAC workstation was the optimal configuration, showing major gains: from 47 to 5.5 minutes on 15k tweets.
✓ After model tuning, 1+ million tweets were processed in just over 1 hour.
Upshot
Processing increasingly large amounts of textual data can be difficult in traditional, single-PC work environments. Typical alternatives, such as increased CPU power and RAM size, can be employed without adjusting programming routines; however, in some cases these two strategies are not enough to overcome the computational requirements. From the tests conducted in this report, parallel alternatives in R can be employed effectively on iterative tasks quite simply with the doMC or doParallel packages by interjecting for-loop-like code.
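To make that concrete, here is a generic serial for loop next to its foreach/%dopar% counterpart using doMC; work() is a placeholder for a heavier per-iteration task (doMC relies on forking, so it targets Unix-like systems):

library(doMC)                 # loads foreach as a dependency
registerDoMC(cores = 4)

work <- function(i) sqrt(i)   # stand-in for one iteration of a heavier task

res_serial <- vector("list", 100)
for (i in 1:100) res_serial[[i]] <- work(i)           # plain serial for loop

res_parallel <- foreach(i = 1:100) %dopar% work(i)    # same loop, run in parallel

identical(unlist(res_serial), unlist(res_parallel))   # TRUE: identical results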
These findings pave the way for other scholars on campus with computational tasks that prove too unwieldy for standard PCs. The DEAC team and environment are now equipped with the knowledge of how to implement parallelization in R; this will help scholars with varying programming backgrounds access high-performance computing resources more easily and effectively.
References
Manning, C. D., P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
