Practical Data Mining Tutorial 1: Introduction to the WEKA Explorer

Mark Hall, Eibe Frank and Ian H. Witten
May 5, 2011
© 2006-2012 University of Waikato
1 Getting started
This tutorial introduces the main graphical user interface for accessing WEKA’s facilities, called the
WEKA Explorer. We will work with WEKA 3.6
(although almost everything is the same with other
versions), and we assume that it is installed on
your system.
Invoke WEKA from the Windows START menu
(on Linux or the Mac, double-click weka.jar or
weka.app). This starts up the WEKA GUI
Chooser. Click the Explorer button to enter the
WEKA Explorer.
Just in case you are wondering about the other
buttons in the GUI Chooser: Experimenter
is a user interface for comparing the predictive
performance of learning algorithms; KnowledgeFlow is a component-based interface that offers
functionality similar to the Explorer’s; and Simple CLI opens a command-line interface that emulates a terminal and lets you interact with WEKA
in this fashion.
2 The panels in the Explorer
The user interface to the Explorer consists of six
panels, invoked by the tabs at the top of the window. The Preprocess panel is the one that is
open when the Explorer is first started. This
tutorial will introduce you to two others as well:
Classify and Visualize. (The remaining three
panels are explained in later tutorials.) Here’s a
brief description of the functions that these three
panels perform.
Preprocess is where you load and preprocess
data. Once a dataset has been loaded, the
panel displays information about it. The
dataset can be modified, either by editing
it manually or by applying a filter, and the
modified version can be saved. As an alternative to loading a pre-existing dataset, an
artificial one can be created by using a generator. It is also possible to load data from
a URL or from a database.
Classify is where you invoke the classification
methods in WEKA. Several options for the
classification process can be set, and the result of the classification can be viewed. The
training dataset used for classification is the
one loaded (or generated) in the Preprocess
panel.
Visualize is where you can visualize the dataset
loaded in the Preprocess panel as two-dimensional scatter plots. You can select the
attributes for the x and y axes.
3 The Preprocess panel
Preprocess is the panel that opens when the
WEKA Explorer is started.
3.1 Loading a dataset
Before changing to any other panel, the Explorer
must have a dataset to work with. To load one
up, click the Open file... button in the top
left corner of the panel. Look around for the
folder containing datasets, and locate a file named
weather.nominal.arff (this file is in the data
folder that is supplied when WEKA is installed).
This contains the nominal version of the standard
“weather” dataset. Open this file. Now your
screen will look like Figure 1.
The weather data is a small dataset with only 14
examples for learning. Because each row is an independent example, the rows/examples are called
“instances.” The instances of the weather dataset
have 5 attributes, with names ‘outlook’, ‘temperature’, ‘humidity’, ‘windy’ and ‘play’. If you click
on the name of an attribute in the left sub-panel,
information about the selected attribute will be
shown on the right. You can see the values of the
attribute and how many times an instance in the
dataset has a particular value. This information is
also shown in the form of a histogram.
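The per-attribute summary described above amounts to counting how often each value occurs. The following is a minimal Python sketch of that idea, not WEKA code. The rows are transcribed from the standard weather data as it is commonly distributed; treat them as an assumption and compare them with your own copy of weather.nominal.arff. The helper name value_counts is hypothetical.

```python
from collections import Counter

# Attribute names as listed in the tutorial text.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]

# Assumption: rows transcribed from the standard weather.nominal.arff;
# check them against the file shipped with your WEKA installation.
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def value_counts(rows, attribute):
    """Count how many instances take each value of the named attribute."""
    column = ATTRIBUTES.index(attribute)
    return Counter(row[column] for row in rows)


counts = value_counts(WEATHER, "outlook")
for value, n in counts.items():
    # A crude text histogram: one '#' per instance with this value.
    print(f"{value:10s} {n:2d} {'#' * n}")
```

Calling value_counts with another attribute name gives the counts that the right-hand sub-panel would show for that attribute.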
Figure 1: The Explorer’s Preprocess panel.
All attributes in this dataset are “nominal,” i.e.
they have a predefined finite set of values. Each
instance describes a weather forecast for a particular day and whether to play a certain game on that
day. It is not really clear what the game is, but let
us assume that it is golf. The last attribute ‘play’
is the “class” attribute—it classifies the instance.
Its value can be ‘yes’ or ‘no’. Yes means that the
weather conditions are OK for playing golf, and no
means they are not OK.
3.2 Exercises
To familiarize yourself with the functions discussed
so far, please do the following two exercises. The
solutions to these and other exercises in this tutorial are given at the end.
Ex. 1: What are the values that the attribute
‘temperature’ can have?
Ex. 2: Load a new dataset. Press the ‘Open file’
button and select the file iris.arff. How
many instances does this dataset have? How
many attributes? What is the range of possible values of the attribute ‘petallength’?
3.3 The dataset editor
It is possible to view and edit an entire dataset
from within WEKA. To experiment with this, load
the file weather.nominal.arff again. Click the
Edit... button from the row of buttons at the
top of the Preprocess panel. This opens a new
window called Viewer, which lists all instances of
the weather data (see Figure 2).
3.3.1 Exercises
Ex. 3: What is the function of the first column in
the Viewer?
Ex. 4: Considering the weather data, what is the
class value of instance number 8?
Ex. 5: Load the iris data and open it in the editor. How many numeric and how many nominal attributes does this dataset have?
3.4 Applying a filter
In WEKA, “filters” are methods that can be used
to modify datasets in a systematic fashion—that
is, they are data preprocessing tools. WEKA
has several filters for different tasks. Reload the
weather.nominal dataset, and let’s remove an attribute from it. The appropriate filter is called
Remove; its full name is:
weka.filters.unsupervised.attribute.Remove
Figure 2: The data viewer.
Examine this name carefully. Filters are organized
into a hierarchical structure whose root is weka.
Those in the unsupervised category don’t require
a class attribute to be set; those in the supervised
category do. Filters are further divided into ones
that operate primarily on attributes/columns (the
attribute category) and ones that operate primarily on instances/rows (the instance category).
If you click the Choose button in the Preprocess
panel, a hierarchical editor opens in which you select a filter by following the path corresponding to
its full name. Use the path given in the full name
above to select the Remove filter. Once it is selected, the text “Remove” will appear in the field
next to the Choose button.
Click on the field containing this text. A window
opens called the GenericObjectEditor, which is
used throughout WEKA to set parameter values
for all of the tools. It contains a short explanation of the Remove filter—click More to get a
fuller description. Underneath there are two fields
in which the options of the filter can be set. The
first option is a list of attribute numbers. The second option, InvertSelection, is a switch. If it
is ‘false’, the specified attributes are removed; if it
is ‘true’, the specified attributes are kept and all others are removed.
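The effect of the two options can be sketched in a few lines of Python. This is an illustration of the behaviour described above, not WEKA’s implementation; the function name remove_attributes and its parameters are hypothetical, and attribute numbers are assumed to start at 1, as in the tutorial’s use of index 3 for the third attribute.

```python
def remove_attributes(rows, indices, invert_selection=False):
    """Return new rows without the listed attributes (1-based indices).

    With invert_selection=True the listed attributes are the ones kept
    and every other attribute is dropped. The input rows are not modified.
    """
    selected = {i - 1 for i in indices}          # convert to 0-based columns
    width = len(rows[0]) if rows else 0
    if invert_selection:
        keep = [c for c in range(width) if c in selected]
    else:
        keep = [c for c in range(width) if c not in selected]
    return [tuple(row[c] for c in keep) for row in rows]


# Two illustrative rows: outlook, temperature, humidity, windy, play.
rows = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
]

without_third = remove_attributes(rows, [3])
only_third = remove_attributes(rows, [3], invert_selection=True)
print(without_third)  # third attribute dropped from every row
print(only_third)     # only the third attribute survives
```

Because the function builds new rows, the original list is untouched, which loosely mirrors the way a filter changes only the data held in memory and never the file on disk.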
Enter “3” into the attributeIndices field and
click the OK button. The window with the filter options closes. Now click the Apply button
on the right, which runs the data through the filter. The filter removes the attribute with index 3
from the dataset, and you can see that the set of
attributes has been reduced. This change does not
affect the dataset in the file; it only applies to the
data held in memory. The changed dataset can be
saved to a new ARFF file by pressing the Save...
button and entering a file name. The action of the
filter can be undone by pressing the Undo button.
Again, this applies to the version of the data held
in memory.
What we have described illustrates how filters in
WEKA are applied to data. However, in the particular case of Remove, there is a simpler way of
achieving the same effect. Instead of invoking the
Remove filter, attributes can be selected using the
small boxes in the Attributes sub-panel and removed using the Remove button that appears at
the bottom, below the list of attributes.
3.4.1 Exercises
Ex. 6: Ensure that the weather.nominal
dataset is loaded. Use the filter
weka.filters.unsupervised.instance.RemoveWithValues
to remove all instances in which the ‘humidity’ attribute has the value ‘high’. To do
this, first make the field next to the Choose
button show the text ‘RemoveWithValues’.
Then click on it to get the GenericObjectEditor window and figure out how to
change the filter settings appropriately.
Ex. 7: Undo the change to the dataset that you
just performed, and verify that the data is
back in its original state.
4 The Visualize panel
We now take a look at WEKA’s data visualization
facilities. These work best with numeric data, so
we use the iris data.
First, load iris.arff. This data contains flower
measurements. Each instance is classified as one
of three types: iris-setosa, iris-versicolor and iris-virginica. The dataset has 50 examples of each
type: 150 instances in all.
Click the Visualize tab to bring up the visualization panel. It shows a grid containing 25 two-dimensional scatter plots, with every possible combination of the five attributes of the iris data on
the x and y axes. Clicking the first plot in the second row opens up a window showing an enlarged
plot using the selected axes. Instances are shown
as little crosses whose color depends on the
instance’s class. The x axis shows the ‘sepallength’
attribute, and the y axis shows ‘petalwidth’.
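The count of 25 plots is simply every (x, y) pairing of the five attributes. The sketch below builds such a grid in Python. Two things are assumptions: the attribute name ‘sepalwidth’, which this tutorial does not mention, and the grid ordering (columns in attribute order, rows in reverse attribute order), which is inferred only from the statement that the first plot in the second row shows ‘sepallength’ against ‘petalwidth’.

```python
# Assumed attribute order of iris.arff; 'sepalwidth' is not named in the text.
IRIS_ATTRIBUTES = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]

# Each cell holds the (x attribute, y attribute) pair drawn in that plot.
# Assumption: x runs left to right in attribute order, y runs top to bottom
# in reverse attribute order.
grid = [[(x, y) for x in IRIS_ATTRIBUTES] for y in reversed(IRIS_ATTRIBUTES)]

plot_count = sum(len(row) for row in grid)
print(plot_count)   # 5 attributes on each axis give 5 * 5 plots
print(grid[1][0])   # first plot in the second row
```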
Clicking on one of the crosses opens up an Instance Info window, which lists the values of all
attributes for the selected instance. Close the Instance Info window again.
The selection fields at the top of the window that
contains the scatter plot can be used to change the
attributes used for the x and y axes. Try changing
the x axis to ‘petalwidth’ and the y axis to ‘petallength’. The field showing “Colour: class (Num)”
can be used to change the colour coding.
Each of the colorful little bar-like plots to the right
of the scatter plot window represents a single attribute. Clicking a bar uses that attribute for the
x axis of the scatter plot. Right-clicking a bar does
the same for the y axis. Try to change the x and
y axes back to ‘sepallength’ and ‘petalwidth’ using
these bars.
The Jitter slider displaces the cross for each instance randomly from its true position, and can
reveal situations where instances lie on top of one
another. Experiment a little by moving the slider.
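To see why jitter helps, consider several instances with identical coordinates: they are drawn as a single cross until each is nudged by a small random offset. The following Python sketch illustrates the idea with uniform random offsets. The distribution WEKA actually uses is not stated in this tutorial, so the uniform choice, the function name jitter and the sample points are all assumptions.

```python
import random


def jitter(points, amount, seed=0):
    """Displace every point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Hypothetical measurements: three instances share exactly the same position.
points = [(1.4, 0.2), (1.4, 0.2), (1.4, 0.2), (4.7, 1.4)]
jittered = jitter(points, amount=0.05)

print(len(set(points)))    # distinct positions before jitter
print(len(set(jittered)))  # distinct positions after jitter
```

A larger amount separates overlapping crosses more clearly but also moves them further from their true positions, which is the trade-off the slider controls.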
The Select Instance button and the Reset,
Clear and Save buttons let you change the
dataset. Certain instances can be selected and the
others removed. Try the Rectangle option: select
an area by left-clicking and dragging the mouse.
The Reset button now changes into a Submit
button. Click it, and all instances outside the rectangle are deleted. You could use Save to save the
modified dataset to a file, while Reset restores the
original dataset.
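The Rectangle selection can be thought of as a filter that keeps only the instances whose plotted coordinates fall inside the dragged area. A minimal Python sketch of that logic follows; the function name select_rectangle and the sample points are hypothetical, and the sketch says nothing about how WEKA implements the feature internally.

```python
def select_rectangle(instances, x_range, y_range):
    """Keep only the instances whose (x, y) lies inside the rectangle."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [
        (x, y) for x, y in instances
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


# Hypothetical (x, y) values for four instances.
original = [(0.2, 1.4), (1.3, 4.0), (2.1, 5.5), (1.5, 4.5)]

# "Submit": instances outside the rectangle are dropped from the working copy.
working = select_rectangle(original, x_range=(1.0, 1.8), y_range=(3.0, 5.0))
print(working)

# "Reset": the working copy goes back to the original data.
working = list(original)
print(len(working))
```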
5 The Classify panel
Now you know how to load a dataset from a file
and visualize it as two-dimensional plots. In this
section we apply a classification algorithm—called
a “classifier” in WEKA—to the data. The classifier builds (“learns”) a classification model from
the data.
In WEKA, all schemes for predicting the value of a
single attribute based on the values of some other
attributes are called “classifiers,” even if they are
used to predict a numeric target—whereas other
people often describe such situations as “numeric
prediction” or “regression.” The reason is that,
in the context of machine learning, numeric prediction has historically been called “classification
with continuous classes.”
Before getting started, load the weather data again. Go to the Preprocess panel,
click the Open file button, and select
weather.nominal.arff from the data directory. Then switch to the classification panel
by clicking the Classify tab at the top of the
window. The result is shown in Figure 3.
Figure 3: The Classify panel.
5.1 Using the C4.5 classifier
A popular machine learning method for data mining is the C4.5 algorithm, which builds decision trees. In WEKA, it is implemented in a
classifier called “J48.” Choose the J48 classifier
by clicking the Choose button near the top of the
Classify tab. A dialogue window appears showing various types of classifier. Click the trees entry
to reveal its subentries, and click J48 to choose the
J48 classifier. Note that classifiers, like filters, are
organized in a hierarchy: J48 has the full name
weka.classifiers.trees.J48.
The classifier is shown in the text box next to the
Choose button: it now reads J48 -C 0.25 -M 2.
The text after “J48” gives the default parameter
settings for this classifier. We can ignore these, because they rarely require changing to obtain good
performance from C4.5.
Decision trees are a special type of classification
model. Ideally, models should be able to predict
the class values of new, previously unseen instances
with high accuracy. In classification tasks, accuracy is often measured as the percentage of correctly classified instances. Once a model has been
learned, we should test it to see how accurate it is
when classifying instances.
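As a concrete reading of this definition, accuracy is the number of correct predictions divided by the number of test instances, expressed as a percentage. The short Python sketch below computes it; the function name accuracy_percent and the example labels are hypothetical.

```python
def accuracy_percent(actual, predicted):
    """Percentage of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on zero instances")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Hypothetical labels: three of the four predictions are correct.
actual = ["yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "yes"]
print(accuracy_percent(actual, predicted))  # 75.0
```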
One option in WEKA is to evaluate performance
on the training set—the data that was used to
build the classifier. This is NOT generally a good
idea because it leads to unrealistically optimistic
performance estimates. You can easily get 100%
accuracy on the training data by simple rote learning, but this tells us nothing about performance
on new data that might be encountered when the
model is applied in practice. Nevertheless, for illustrative purposes it is instructive to consider performance on the training data.
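The point about rote learning can be made concrete with a classifier that simply memorizes its training instances. The Python sketch below is illustrative only: the class name RoteLearner, the default prediction and the handful of rows are hypothetical, and it assumes that no two training instances have identical attribute values but different classes (otherwise even a rote learner cannot reach 100%).

```python
class RoteLearner:
    """Memorize every training instance and look its class up at prediction time."""

    def __init__(self, default):
        self.default = default   # returned for instances never seen in training
        self.memory = {}

    def fit(self, rows):
        # Each row is (attribute values..., class value).
        self.memory = {tuple(row[:-1]): row[-1] for row in rows}
        return self

    def predict(self, features):
        return self.memory.get(tuple(features), self.default)


# A few illustrative training rows: outlook, temperature, humidity, windy, play.
train = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
]

model = RoteLearner(default="yes").fit(train)
correct = sum(model.predict(row[:-1]) == row[-1] for row in train)
training_accuracy = 100.0 * correct / len(train)
print(training_accuracy)  # every memorized instance is reproduced exactly

# An instance that was not in the training data falls back to the default,
# so the perfect training score says nothing about it.
unseen = ("sunny", "cool", "normal", "TRUE")
print(model.predict(unseen))
```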
In WEKA, the data that is loaded using the Preprocess panel is the “training data.” To evaluate on the training set, choose Use training
set from the Test options panel in the Classify panel. Once the test strategy has been set,
the classifier is built and evaluated by pressing the
Start button. This processes the training set using the currently selected learning algorithm, C4.5
in this case. Then it classifies all the instances in
the training data—because this is the evaluation
option that has been chosen—and outputs performance statistics. These are shown in Figure 4.
5.2 Interpreting the output
The outcome of training and testing appears in
the Classifier output box on the right. You can
scroll through the text to examine it. First, look at
the part that describes the decision tree that was
generated:
J48 pruned tree
------------------

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  :     5

Size of the tree :     8

Figure 4: Output after building and testing the classifier.
This represents the decision tree that was built,
including the number of instances that fall under
each leaf. The textual representation is clumsy to
interpret, but WEKA can generate an equivalent
graphical representation. You may have noticed
that each time the Start button is pressed and a
new classifier is built and evaluated, a new entry
appears in the Result List panel in the lower left
corner of Figure 4. To see the tree, right-click on
the trees.J48 entry that has just been added to
the result list, and choose Visualize tree. A window pops up that shows the decision tree in the
form illustrated in Figure 5. Right-click a blank
spot in this window to bring up a new menu enabling you to auto-scale the view, or force the tree
to fit into view. You can pan around by dragging
the mouse.
Esto representa el árbol de decisión que fue construido, incluyendo el número de casos que corresponden a cada hoja. La representación textual es
torpe de interpretar, pero WEKA puede generar
una representación gráfica equivalente. Puede
haber notado que cada vez que el botón se pulsa
Start y un clasificador de nueva construcción y se
evaluó, una nueva entrada aparece en el panel de
Result List en la esquina inferior izquierda de la
Figure 4. Para ver el árbol, haga clic en la entrada
trees.J48 que acaba de ser añadido a la lista de resultados, y elija Visualize tree. Aparece una ventana que muestra el árbol de decisión en la forma
ilustrada en la Figure 5. Haga clic en un punto en
blanco en esta ventana para que aparezca un nuevo
menú que le permite auto-escala de la vista, o la
fuerza del árbol para ajustarse a la vista. Puede
desplazarse por arrastrando el ratón.
Figure 5: The decision tree that has been built.
This tree is used to classify test instances. The
first condition is the one in the so-called “root”
node at the top. In this case, the ‘outlook’ attribute is tested at the root node and, depending
on the outcome, testing continues down one of the
three branches. If the value is ‘overcast’, testing
ends and the predicted class is ‘yes’. The rectangular nodes are called “leaf” nodes, and give the
class that is to be predicted. Returning to the root
node, if the ‘outlook’ attribute has value ‘sunny’,
the ‘humidity’ attribute is tested, and if ‘outlook’
has value ‘rainy’, the ‘windy’ attribute is tested. No
path through this particular tree has more than
two tests.
Este árbol se utiliza para clasificar los casos de
prueba. La primera condición es la de la llamada
“raı́z” del nodo en la parte superior. En este caso,
el atributo ‘perspectivas’ se prueba en el nodo raı́z
y, dependiendo del resultado, la prueba continúa
por una de las tres ramas. Si el valor es ‘cubierto’,
finaliza las pruebas y la clase predicha es ‘sı́’. Los
nodos rectangulares se denominan “hojas” nodos,
y dar la clase que se predijo. Volviendo al nodo
raı́z, si el atributo ‘perspectivas’ tiene un valor
‘sol’, el atributo ‘humedad’ se prueba, y si ‘perspectivas’ tiene un valor de ‘lluvias’, el atributo
‘viento’ se prueba. No hay caminos a través de
este árbol en particular tiene más de dos pruebas.
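The walk from the root to a leaf described above can be written as a few nested tests. The following is a minimal sketch, not WEKA code: the function name `classify_weather` and the dictionary representation of an instance are assumptions made for illustration, while the branch conditions are copied from the J48 text output.

```python
def classify_weather(instance):
    """Follow the J48 tree from the root to a leaf and return the class."""
    outlook = instance["outlook"]          # test at the root node
    if outlook == "overcast":
        return "yes"                       # leaf reached after a single test
    if outlook == "sunny":
        # second and last test on this branch
        return "no" if instance["humidity"] == "high" else "yes"
    if outlook == "rainy":
        return "no" if instance["windy"] == "TRUE" else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


# 'temperature' is present in the data but never tested by this tree.
example = {"outlook": "overcast", "temperature": "hot",
           "humidity": "high", "windy": "FALSE"}
print(classify_weather(example))  # prints: yes
```

Every path performs at most two attribute tests, which matches the depth of the tree in Figure 5.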
Now let us consider the remainder of the information in the Classifier output area. The next
two parts of the output report on the quality of the
classification model based on the testing option we
have chosen.
Consideremos ahora el resto de la información en
el área de Classifier output. Las dos siguientes
partes del informe de salida en la calidad del modelo de clasificación basado en la opción de prueba
que hemos elegido.
The following states how many and what proportion of test instances have been correctly classified:
Los siguientes estados cuántos y qué proporción de
casos de prueba han sido correctamente clasificados:
Correctly Classified Instances          14              100 %
This is the accuracy of the model on the data
used for testing. It is completely accurate (100%),
which is often the case when the training set is
used for testing. There are some other performance measures in the text output area, which we
won’t discuss here.
Esta es la precisión del modelo sobre los datos
utilizados para la prueba. Es totalmente preciso
(100%), que es a menudo el caso cuando el conjunto de entrenamiento se utiliza para la prueba.
Hay algunas medidas de desempeño en la zona de
salida de texto, que no vamos a discutir aquı́.
At the bottom of the output is the confusion matrix:
En la parte inferior de la salida es la matriz de
confusión:
=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no
Each element in the matrix is a count of instances.
Rows represent the true classes, and columns represent the predicted classes. As you can see, all 9
‘yes’ instances have been predicted as yes, and all
5 ‘no’ instances as no.
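The link between the confusion matrix and the reported accuracy can be checked by hand: the correctly classified instances sit on the diagonal. The sketch below is illustrative only; the helper name `accuracy` and the nested-list layout are assumptions, and the counts are the ones shown in the matrix above.

```python
# Rows are true classes, columns are predicted classes (order: yes, no).
confusion = [
    [9, 0],  # true class 'yes'
    [0, 5],  # true class 'no'
]


def accuracy(matrix):
    """Fraction of instances on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print(f"{accuracy(confusion):.0%}")  # prints: 100%
```

Any non-zero entry off the diagonal would count as an error and lower this figure.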
Cada elemento de la matriz es un recuento de los
casos. Las filas representan las clases de verdad, y
las columnas representan las clases previsto. Como
puede ver, todos los 9 ‘sí’ casos se han previsto
como sí, y los 5 ‘no’ casos como no.
5.2.1
Exercise
Ex. 8: How would the following instance be classified using the decision tree?
Ex. 8: ¿Cómo se clasificaría la siguiente instancia con el árbol de decisión?
outlook = sunny, temperature = cool, humidity = high, windy = TRUE
perspectivas = soleado, temperatura = fría, humedad = alta, viento = TRUE
5.3
Setting the testing method
When the Start button is pressed, the selected
learning algorithm is started and the dataset that
was loaded in the Preprocess panel is used to
train a model. A model built from the full training set is then printed into the Classifier output
area: this may involve running the learning algorithm one final time.
Cuando el botón se pulsa Start, el algoritmo de
aprendizaje seleccionadas se inicia y el conjunto
de datos que se cargó en el panel de Preprocess
se utiliza para entrenar a un modelo. Un modelo
construido a partir del conjunto de entrenamiento
completo se imprime en el área de Classifier output: esto puede implicar que ejecuta el algoritmo
de aprendizaje por última vez.
The remainder of the output in the Classifier
output area depends on the test protocol that
was chosen using Test options. The Test options box gives several possibilities for evaluating
classifiers:
El resto de la producción en el área de Classifier
output depende del protocolo de prueba que fue
elegido con Test options. El cuadro de Test options da varias posibilidades para la evaluación de
los clasificadores:
Use training set: Uses the same dataset that was
used for training (the one that was loaded in
the Preprocess panel). This is the option
we used above. It is generally NOT recommended because it gives over-optimistic performance estimates.
Usar el conjunto de la formacion Utiliza el
mismo conjunto de datos que se utilizó para
la formación (la que se cargó en el panel
de Preprocess). Esta es la opción que
usamos anteriormente. Por lo general, no
se recomienda porque da estimaciones de
rendimiento demasiado optimistas.
Supplied test set: Lets you select a file containing a separate dataset that is used exclusively
for testing.
prueba suministrados conjunto Permite
seleccionar un archivo que contiene un
conjunto de datos independiente que se
utiliza exclusivamente para la prueba.
Cross-validation: This is the default option, and
the most commonly used one. It first splits
the training set into disjoint subsets called
“folds.” The number of subsets can be entered in the Folds field. Ten is the default, and in general gives better estimates
than other choices. Once the data has been
split into folds of (approximately) equal size,
all but one of the folds are used for training and the remaining, left-out one is used
for testing. This involves building a new
model from scratch from the corresponding
subset of data and evaluating it on the left-out fold. Once this has been done for the
first test fold, a new fold is selected for testing and the remaining folds are used for training. This is repeated until all folds have
been used for testing. In this way each instance in the full dataset is used for testing
exactly once, and an instance is only used
for testing when it is not used for training. WEKA’s cross-validation is a stratified cross-validation, which means that
the class proportions are preserved when dividing the data into folds: each class is represented by roughly the same number of instances in each fold. This gives slightly improved performance estimates compared to
unstratified cross-validation.
La validacion cruzada Esta es la opción por defecto, y el más comúnmente utilizado. En
primer lugar, se divide el conjunto de entrenamiento en subconjuntos disjuntos llamados “pliegues”. El número de subconjuntos se pueden introducir en el campo Folds.
Diez es el valor predeterminado, y en general proporciona mejores estimaciones que
otras opciones. Una vez que los datos se
ha dividido en los pliegues de (aproximadamente) igual tamaño, todos menos uno de
los pliegues se utilizan para la formación y el
restante a cabo, a la izquierda-, uno se utiliza
para la prueba. Esto implica la construcción
de un nuevo modelo a partir de cero desde el
subconjunto de datos correspondientes y la
evaluación que sobre la que-a veces. Una vez
que esto se ha hecho para la primera prueba
doble, una nueva tapa está seleccionado para
las pruebas y los pliegues restante utilizado
para el entrenamiento. Esto se repite hasta
que todos los pliegues se han utilizado para
la prueba. De esta manera, cada instancia del conjunto de datos completo se utiliza
para probar una sola vez, y una instancia
sólo se utiliza para la prueba cuando no se
utiliza para el entrenamiento. WEKA cruz
de la validación es una stratified crossvalidation, lo que significa que las proporciones de clase se conservan al dividir los
datos en los pliegues: cada clase está representada por aproximadamente el mismo
número de casos en cada pliegue. Esto
proporciona un rendimiento mejorado ligeramente en comparación con las estimaciones
sin estratificar la validación cruzada.
Percentage split: Shuffles the data randomly
and then splits it into a training and a test
set according to the proportion specified. In
practice, this is a good alternative to cross-validation if the size of the dataset makes
cross-validation too slow.
Shuffles Porcentaje dividir los datos al azar y
luego se divide en un entrenamiento y un
conjunto de pruebas de acuerdo a la proporción especificada. En la práctica, esta es
una buena alternativa a la validación cruzada
si el tamaño del conjunto de datos hace que
la validación cruzada demasiado lento.
The first two testing methods, evaluation on the
training set and using a supplied test set, involve
building a model only once. Cross-validation involves building a model N +1 times, where N is the
chosen number of folds. The first N times, a fraction (N − 1)/N (90% for ten-fold cross-validation)
of the data is used for training, and the final
time the full dataset is used. The percentage split
method involves building the model twice, once on
the reduced dataset and again on the full dataset.
Los dos primeros métodos de prueba, la evaluación
en el conjunto de entrenamiento y el uso de una
unidad de prueba suministrada, implicarı́a la construcción de un modelo de una sola vez. La validación cruzada consiste en la construcción de un
modelo de N + 1 veces, donde N es el número
elegido de los pliegues. Los primeros N veces, una
fracción (N − 1)/N (90% de diez veces la validación cruzada) de los datos se utiliza para el entrenamiento y el tiempo final del conjunto de datos
completo se utiliza. El método de dividir el porcentaje implica la construcción del modelo en dos
ocasiones, una vez en el conjunto de datos reducidos y de nuevo en el conjunto de datos completo.
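The bookkeeping behind N-fold stratified cross-validation can be sketched in a few lines. This is not WEKA's implementation: the function names, the round-robin dealing of instances and the placeholder class labels are assumptions made for illustration, and a real tool would typically shuffle the data before forming the folds.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Deal instance indices into n_folds folds, one class at a time."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            # round-robin dealing keeps class counts per fold nearly equal
            folds[position % n_folds].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, n_folds):
    """Yield one (train_indices, test_indices) pair per fold."""
    folds = stratified_folds(labels, n_folds)
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 150 instances in three equally sized classes (placeholder labels).
labels = ["a"] * 50 + ["b"] * 50 + ["c"] * 50
splits = list(cross_validation_splits(labels, n_folds=10))

train, test = splits[0]
models_built = len(splits) + 1   # N fold models plus one on the full dataset
print(len(train), len(test), models_built)  # prints: 135 15 11
```

With ten folds each training set holds 135 of the 150 instances, the 90% fraction (N − 1)/N mentioned above, and every instance appears in exactly one test fold.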
5.3.1
Exercise
Ex. 9: Load the iris data using the Preprocess
panel. Evaluate C4.5 on this data using
(a) the training set and (b) cross-validation.
What is the estimated percentage of correct
classifications for (a) and (b)? Which estimate is more realistic?
Ex. 9: Cargue los datos del iris mediante el panel
de Preprocess. Evalúe C4.5 en estos datos
utilizando (a) el conjunto de entrenamiento
y (b) la validación cruzada. ¿Cuál es el porcentaje estimado de clasificaciones correctas
para (a) y (b)? ¿Qué estimación es más realista?
5.4
Visualizing classification errors
WEKA’s Classify panel provides a way of visualizing classification errors. To do this, right-click
the trees.J48 entry in the result list and choose
Visualize classifier errors. A scatter plot window pops up. Instances that have been classified
correctly are marked by little crosses; whereas ones
that have been classified incorrectly are marked by
little squares.
Panel de WEKA de Classify proporciona una
manera de visualizar los errores de clasificación.
Para ello, haga clic en la entrada trees.J48 en
la lista de resultados y elegir Visualize classifier errors. Una ventana gráfica de dispersión
aparece. Casos que han sido clasificados correctamente marcadas por pequeñas cruces, mientras
que los que han sido clasificados incorrectamente
están marcados por pequeños cuadrados.
5.4.1
Exercise
Ex. 10: Use the Visualize classifier errors function to find the wrongly classified test instances for the cross-validation performed in
Exercise 9. What can you say about the location of the errors?
Ex. 10: Utilice la función de Visualize classifier errors para encontrar las instancias de
prueba de mal clasificadas para la validación
cruzada realizada en el ejercicio 9. Qué
puede decir acerca de la ubicación de los errores?
6
Answers To Exercises
1. Hot, mild and cool.
1. caliente, suave y fresco.
2. The iris dataset has 150 instances and 5 attributes. So far we have only seen nominal values, but the attribute ‘petallength’ is
a numeric attribute and contains numeric
values. In this dataset the values for this
attribute lie between 1.0 and 6.9 (see Minimum and Maximum in the right panel).
2. El conjunto de datos del iris tiene 150 casos y
atributos 5. Hasta ahora sólo hemos visto
los valores de nominal, pero ‘petallength’ el
atributo es un atributo de numeric y contiene valores numéricos. En este conjunto
de datos los valores de este atributo se encuentran entre 1.0 y 6.9 (véase Minimum
Maximum y en el panel derecho).
3. The first column is the number given to an instance when it is loaded from the ARFF file.
It corresponds to the order of the instances
in the file.
3. La primera columna es el número dado en una
instancia cuando se carga desde el archivo
ARFF. Se corresponde con el orden de las
instancias en el archivo.
4. The class value of this instance is ‘no’. The row
with the number 8 in the first column is the
instance with instance number 8.
4. El valor de la clase de esta instancia es “no”. La
fila con el número 8 en la primera columna
es la instancia con el número de instancia 8.
5. This can be easily seen in the Viewer window.
The iris dataset has four numeric and one
nominal attribute. The nominal attribute is
the class attribute.
5. Esto puede verse fácilmente en la ventana de
Viewer. El conjunto de datos del iris tiene
cuatro numérico y un atributo nominal. El
atributo nominal es el atributo de clase.
6. Select the RemoveWithValues filter after
clicking the Choose button. Click on the
field that is located next to the Choose button and set the field attributeIndex to 3
and the field nominalIndices to 1. Press
OK and Apply.
6. Seleccione el RemoveWithValues filtro después de hacer clic en el botón de Choose.
Haga clic en el campo que se encuentra
al lado del botón de Choose y establezca
el campo attributeIndex a 3 y el campo
nominalIndices a 1. Pulse OK y Apply.
7. Click the Undo button.
7. Haga clic en el botón de Undo.
8. The test instance would be classified as ’no’.
8. La instancia de prueba serı́a clasificado como
‘no’.
9. Percent correct on the training data is 98%.
Percent correct under cross-validation is
96%. The cross-validation estimate is more
realistic.
9. porcentaje correcto en los datos de entrenamiento es de 98%. Porcentaje de respuestas correctas en la validación cruzada es del
96%. La estimación de la validación cruzada
es más realista.
10. The errors are located at the class boundaries.
10. Los errores se encuentran en los lı́mites de
clase.
Tutorial 2: Nearest Neighbor Learning and Decision Trees
Eibe Frank and Ian H. Witten
May 5, 2011
© 2006-2012
1
Introduction
In this tutorial you will experiment with nearest
neighbor classification and decision tree learning.
For most of it we use a real-world forensic glass
classification dataset.
En este tutorial podrás experimentar con la clasificación más cercano vecino y árbol de decisión
aprendizaje. Para la mayorı́a de los que usamos
un mundo real forenses conjunto de datos de clasificación de vidrio.
We begin by taking a preliminary look at this
dataset. Then we examine the effect of selecting
different attributes for nearest neighbor classification. Next we study class noise and its impact
on predictive performance for the nearest neighbor
method. Following that we vary the training set
size, both for nearest neighbor classification and
decision tree learning. Finally, you are asked to
interactively construct a decision tree for an image
segmentation dataset.
Empezamos por echar un vistazo preliminar a esta
base de datos. A continuación, examinamos el
efecto de la selección de atributos diferentes para
la clasificación del vecino más cercano. A continuación se estudia el ruido de clase y su impacto
en el rendimiento predictivo del método del vecino más cercano. Después de que variar el tamaño
del conjunto de la formación, tanto para la clasificación del vecino más cercano y el árbol de decisión
aprendizaje. Por último, se le pide para construir
de forma interactiva un árbol de decisión para un
conjunto de datos de segmentación de la imagen.
Before continuing with this tutorial you should review in your mind some aspects of the classification
task:
Antes de continuar con este tutorial es necesario
que revise en su mente algunos aspectos de la tarea
de clasificación:
• How is the accuracy of a classifier measured?
• Cómo es la precisión de un clasificador de
medir?
• What are irrelevant attributes in a data set,
and can additional attributes be harmful?
• Cuáles son los atributos irrelevantes en un
conjunto de datos y atributos adicionales
pueden ser perjudiciales?
• What is class noise, and how would you measure its effect on learning?
• Cuál es el ruido de clase, y cómo medir su
efecto en el aprendizaje?
• What is a learning curve?
• Qué es una curva de aprendizaje?
• If you, personally, had to invent a decision
tree classifier for a particular dataset, how
would you go about it?
• Si usted, personalmente, tenı́a que inventar
un clasificador de árbol de decisión para un
conjunto de datos particular, cómo hacerlo?
2
The glass dataset
The glass dataset glass.arff from the US Forensic Science Service contains data on six types of
glass. Glass is described by its refractive index and
the chemical elements it contains, and the aim is
to classify different types of glass based on these
features. This dataset is taken from the UCI data
sets, which have been collected by the University
of California at Irvine and are freely available on
the World Wide Web. They are often used as a
benchmark for comparing data mining algorithms.
El conjunto de datos de cristal glass.arff de
los EE.UU. Servicio de Ciencias Forenses contiene
datos sobre los seis tipos de vidrio. El vidrio es
descrito por su ı́ndice de refracción y los elementos
quı́micos que contiene, y el objetivo es clasificar
los diferentes tipos de vidrio sobre la base de estas caracterı́sticas. Este conjunto de datos se ha
tomado de los conjuntos de datos de la UCI, que
han sido recogidos por la Universidad de California en Irvine y están disponibles libremente en la
World Wide Web. A menudo se utilizan como referencia para comparar los algoritmos de minerı́a
de datos.
Find the dataset glass.arff and load it into the
WEKA Explorer. For your own information, answer the following questions, which review material
covered in Tutorial 1.
Encontrar el conjunto de datos glass.arff y cargarlo en la Explorer WEKA. Para su propia información, conteste las siguientes preguntas, que
el material objeto de examen en el Tutorial 1.
Ex. 1: How many attributes are there in the glass
dataset? What are their names? What is the
class attribute?
Ex. 1: Cómo los atributos con los que cuenta el
conjunto de datos de cristal? Cuáles son sus
nombres? Cuál es el atributo de la clase?
Run the classification algorithm IBk
(weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the
number of folds at the default value of 10. Recall
that you can examine the classifier options in
the GenericObjectEditor window that pops
up when you click the text beside the Choose
button. The default value of the KNN field is 1:
this sets the number of neighboring instances to
use when classifying.
Ejecutar el algoritmo de clasificación IBK
(weka.classifiers.lazy.IBk). Utilice la validación cruzada para probar su funcionamiento, dejando el número de pliegues en el valor predeterminado de 10. Recuerde que usted puede examinar las opciones del clasificador en la ventana de
GenericObjectEditor que aparece al hacer clic
en el texto junto al botón Choose. El valor por
defecto del campo KNN es una: este establece el
número de casos de vecinos a utilizar en la clasificación.
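To make the role of the KNN field concrete, here is a minimal sketch of k-nearest-neighbor classification by majority vote. It is not IBk itself: the function name `knn_predict`, the plain Euclidean distance and the two-attribute data are all assumptions for illustration, and the sketch does no attribute normalization or distance weighting, which a real implementation may apply.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    # rank training instances by Euclidean distance to the query
    ranked = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # majority vote among the k closest instances
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up data, purely for illustration.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((1.4, 1.4), "B")]
query = (1.5, 1.5)
print(knn_predict(train, query, k=1))  # prints: B
print(knn_predict(train, query, k=5))  # prints: A
```

In this toy data the single closest instance has class B, but it is outvoted once more neighbors are consulted, which is the kind of change in behaviour that altering k can produce.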
Ex. 2: What is the accuracy of IBk (given in the
Classifier output box)?
Ex. 2: Qué es la exactitud de IBk (que figuran
en el cuadro de Classifier output)?
Run IBk again, but increase the number of neighboring instances to k = 5 by entering this value in
the KNN field. Here and throughout this tutorial,
continue to use cross-validation as the evaluation
method.
Ejecutar IBK otra vez, pero aumentar el número
de casos de vecinos a k = 5 por entrar en este valor
en el campo KNN. Aquı́ ya lo largo de este tutorial, seguir utilizando la validación cruzada como
el método de evaluación.
Ex. 3: What is the accuracy of IBk with 5 neighboring instances (k = 5)?
Ex. 3: Qué es la exactitud de IBk con 5 casos de
vecinos (k = 5)?
3
Attribute selection for glass classification
Now we find what subset of attributes produces
the best cross-validated classification accuracy for
the IBk nearest neighbor algorithm with k = 1 on
the glass dataset. WEKA contains automated attribute selection facilities, which we examine in a
later tutorial, but it is instructive to do this manually.
Ahora nos encontramos con lo subconjunto de los
atributos produce la exactitud de la clasificación
mejor validación cruzada para el algoritmo de vecino más cercano IBk con k = 1 en el conjunto
de datos de vidrio. WEKA contiene automatizado
instalaciones para la selección de atributos, que se
examinan más adelante en un tutorial, pero es instructivo para hacerlo manualmente.
Performing an exhaustive search over all possible subsets of the attributes is infeasible (why?),
so we apply a procedure called “backwards selection.” To do this, first consider dropping each
attribute individually from the full dataset consisting of nine attributes (plus the class), and run
a cross-validation for each reduced version. Once
you have determined the best 8-attribute dataset,
repeat the procedure with this reduced dataset to
find the best 7-attribute dataset, and so on.
Realización de una búsqueda exhaustiva sobre todos los posibles subconjuntos de los atributos no es
factible (por qué?), por lo que aplicar un procedimiento llamado “al revés de selección.” Para ello,
en primer lugar considerar abandonar cada atributo individual del conjunto de datos completa que
consiste en nueve atributos (además de la clase), y
ejecutar una validación cruzada para cada versión
reducida. Una vez que haya determinado el conjunto de datos más de 8 atributo, repita el procedimiento con este conjunto de datos reduce a encontrar el mejor conjunto de datos 7-atributo, y
ası́ sucesivamente.
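The backwards selection procedure described above can be sketched as a greedy loop. This is an illustration under stated assumptions, not a WEKA facility: `backward_selection` and `toy_evaluate` are hypothetical names, the attribute names are placeholders, and the scores are invented. In the exercise, the role of `evaluate` is played by running a cross-validation in the Explorer and reading off the accuracy.

```python
def backward_selection(attributes, evaluate):
    """Greedy backward elimination.

    evaluate(subset) is a caller-supplied function returning an accuracy
    estimate for that attribute subset. Returns a list of
    (subset, accuracy) pairs, one per subset size.
    """
    current = list(attributes)
    history = [(tuple(current), evaluate(current))]
    while current:
        # try dropping each remaining attribute in turn, keep the best result
        candidates = [[a for a in current if a != dropped] for dropped in current]
        current = max(candidates, key=evaluate)
        history.append((tuple(current), evaluate(current)))
    return history


# Invented scores: attribute 'c' hurts accuracy, 'd' is irrelevant.
TOY_EFFECT = {"a": 30, "b": 20, "c": -10, "d": 0}


def toy_evaluate(subset):
    return 40 + sum(TOY_EFFECT[name] for name in subset)


for subset, score in backward_selection(["a", "b", "c", "d"], toy_evaluate):
    print(len(subset), subset, score)
```

Each pass evaluates one candidate per remaining attribute, so the number of cross-validation runs grows far more slowly than it would for a search over every possible subset.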
Ex. 4: Record in Table 1 the best attribute set
and the greatest accuracy obtained in each
iteration.
Ex. 4: Registro en la Table 1 el mejor conjunto
de atributos y la mayor precisión obtenida
en cada iteración.
Table 1: Accuracy obtained using IBk, for different attribute subsets

Subset size      Attributes in “best” subset      Classification accuracy
9 attributes
8 attributes
7 attributes
6 attributes
5 attributes
4 attributes
3 attributes
2 attributes
1 attribute
0 attributes
The best accuracy obtained in this process is quite
a bit higher than the accuracy obtained on the full
dataset.
La mejor precisión obtenida en este proceso es un
poco mayor que la precisión obtenida en el conjunto de datos completo.
Ex. 5: Is this best accuracy an unbiased estimate
of accuracy on future data? Be sure to explain your answer.
Ex. 5: Es esto mejor precisión una estimación no
sesgada de precisión en los datos de futuro?
Asegúrese de explicar su respuesta.
(Hint: to obtain an unbiased estimate of accuracy
on future data, we must not look at the test data
at all when producing the classification model for
which we want to obtain the estimate.)
(Sugerencia: para obtener una estimación objetiva
de la exactitud en los datos de futuro, no debemos
mirar en absoluto los datos de prueba cuando se produce el modelo de clasificación para el que queremos obtener la estimación.)
4
Class noise and nearest-neighbor learning
Nearest-neighbor learning, like other techniques,
is sensitive to noise in the training data. In this
section we inject varying amounts of class noise
into the training data and observe the effect on
classification performance.
Aprendizaje más cercana al vecino, al igual que
otras técnicas, es sensible al ruido en los datos de
entrenamiento. En esta sección se inyectan cantidades variables de class noise en los datos de entrenamiento y observar el efecto en el rendimiento
de la clasificación.
You can flip a certain percentage of class labels in
the data to a randomly chosen other value using an
unsupervised attribute filter called AddNoise, in
weka.filters.unsupervised.attribute. However, for our experiment it is important that the
test data remains unaffected by class noise.
Puede invertir un cierto porcentaje de las etiquetas de clase en los datos a un valor escogido de forma aleatoria otras mediante un atributo sin supervisión filtro llamado AddNoise,
en weka.filters.unsupervised.attribute. Sin
embargo, para nuestro experimento es importante
que los datos de prueba no se ve afectado por el
ruido de la clase.
Filtering the training data without filtering the
test data is a common requirement, and is achieved
using a “meta” classifier called FilteredClassifier, in weka.classifiers.meta. This meta classifier should be configured to use IBk as the classifier and AddNoise as the filter. The FilteredClassifier applies the filter to the data before running the learning algorithm. This is done in two
batches: first the training data and then the test
data. The AddNoise filter only adds noise to the
first batch of data it encounters, which means that
the test data passes through unchanged.
Filtrado de los datos de entrenamiento sin filtrar los datos de prueba es un requisito común, y
se realiza con un “meta” clasificador denominado
FilteredClassifier, en weka.classifiers.meta.
Este clasificador meta debe estar configurado para
utilizar como IBk AddNoise el clasificador y el
filtro. El FilteredClassifier se aplica el filtro a
los datos antes de ejecutar el algoritmo de aprendizaje. Esto se hace en dos tandas: en primer lugar
los datos de entrenamiento y, a continuación los
datos de prueba. El AddNoise filtro sólo hacı́a
que el primer lote de datos que encuentra, lo que
significa que los datos de prueba pasa a través de
cambios.
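The idea of corrupting only the training labels can be sketched as follows. This is not the AddNoise filter or the FilteredClassifier: the function name `add_class_noise`, its parameters and the example labels are assumptions for illustration. The point it shows is that a chosen percentage of training labels is flipped to a different class while the test labels are never touched.

```python
import random


def add_class_noise(labels, percent, seed=1):
    """Return a copy of labels with percent% of them flipped to another class."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    noisy = list(labels)
    n_flip = round(len(labels) * percent / 100)
    # pick distinct positions, then move each to a randomly chosen other class
    for index in rng.sample(range(len(labels)), n_flip):
        others = [c for c in classes if c != noisy[index]]
        noisy[index] = rng.choice(others)
    return noisy


train_labels = ["yes"] * 9 + ["no"] * 5
test_labels = ["yes", "no", "yes"]   # never passed through add_class_noise

noisy_train = add_class_noise(train_labels, percent=50)
changed = sum(a != b for a, b in zip(train_labels, noisy_train))
print(changed)  # prints: 7
```

Because the evaluation labels stay clean, any drop in measured accuracy reflects what the noise did to the learned model rather than to the test data.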
Table 2: Effect of class noise on IBk, for different neighborhood sizes

Percent noise      k=1      k=3      k=5
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Ex. 6: Reload the original glass dataset, and
record in Table 2 the cross-validated accuracy estimate of IBk for 10 different percentages of class noise and neighborhood sizes
k = 1, k = 3, k = 5 (determined by the value
of k in the k-nearest-neighbor classifier).
Ex. 6: Actualizar el conjunto de datos de vidrio
original, y registrar en la Table 2 la exactitud
validación cruzada estimación de IBk por 10
diferentes porcentajes de ruido de la clase y
el barrio tamaños k = 1, k = 3, k = 5 (determinado por el valor de k en el clasificador
k vecino más cercano).
Ex. 7: What is the effect of increasing the amount
of class noise?
Ex. 7: Cuál es el efecto de aumentar la cantidad
de ruido de clase?
Ex. 8: What is the effect of altering the value of
k?
Ex. 8: Qué elemento es el efecto de modificar el
valor de k?
5
Varying the amount of training data
In this section we consider “learning curves,”
which show the effect of gradually increasing the
amount of training data. Again we use the glass
data, but this time with both IBk and the C4.5
decision tree learner, implemented in WEKA as
J48.
En esta sección tenemos en cuenta “las curvas de
aprendizaje”, que muestran el efecto de aumentar gradualmente la cantidad de datos de entrenamiento. Una vez más se utilizan los datos de
vidrio, pero esta vez con dos IBk y la decisión C4.5
alumno árbol, implementado en WEKA como J48.
To obtain learning curves, use the FilteredClassifier again, this time in conjunction with
weka.filters.unsupervised.instance.Resample,
which extracts a certain specified percentage of a
given dataset and returns the reduced dataset (see footnote 1).
Again this is done only for the first batch to which
the filter is applied, so the test data passes unmodified through the FilteredClassifier before
it reaches the classifier.
Para obtener las curvas de aprendizaje, el uso de
la FilteredClassifier, esta vez en relación con el
weka.filters.unsupervised.instance.Resample,
que extrae un porcentaje especificado de un conjunto de datos y devuelve el conjunto de datos
reducidos.2 Una vez más esto se hace sólo para el
primer grupo al que se aplica el filtro, por lo que
los datos de prueba pasa sin modificar a través
de la FilteredClassifier antes que alcanza el
clasificador.
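Reducing the training data to a given percentage can be sketched as below. This is an illustration rather than the Resample filter: the function name `resample_with_replacement`, the seed parameter and the stand-in data are assumptions. It samples with replacement, as the footnote to this section says the filter does, so the reduced set may contain repeated instances.

```python
import random


def resample_with_replacement(instances, percent, seed=1):
    """Draw percent% of len(instances) items, allowing repeats."""
    rng = random.Random(seed)
    n_keep = round(len(instances) * percent / 100)
    return [rng.choice(instances) for _ in range(n_keep)]


train = list(range(200))   # stand-in for 200 training instances
for percent in (10, 50, 100):
    reduced = resample_with_replacement(train, percent)
    print(percent, len(reduced))   # sizes 20, 100 and 200
```

Training on each reduced set and evaluating on untouched test data gives one point of the learning curve per percentage.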
Ex. 9: Record in Table 3 the data for learning curves for both the one-nearest-neighbor
classifier (i.e., IBk with k = 1) and J48.
Ex. 9: Registro en la Table 3 los datos de las
curvas de aprendizaje tanto para el unoclasificador del vecino más cercano (es decir,
IBk con k = 1) y J48.
1 This filter performs sampling with replacement, rather than sampling without replacement, but the effect is minor and
we will ignore it here.
2 Este filtro realiza el muestreo con reemplazo, en lugar de muestreo sin reemplazo, pero el efecto es menor y se lo ignora
aquí.
Table 3: Effect of training set size on IBk and J48

Percentage of training set      IBk      J48
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Ex. 10: What is the effect of increasing the
amount of training data?
Ex. 10: Cuál es el efecto de aumentar la cantidad
de datos de entrenamiento?
Ex. 11: Is this effect more pronounced for IBk or
J48?
Ex. 11: Es este tema efecto más pronunciado para
IBk o J48?
6
Interactive decision tree construction
One of WEKA’s classifiers is interactive: it lets
the user—i.e., you!—construct your own classifier.
Here’s a competition: let’s see who can build a
classifier with the highest predictive accuracy!
Uno de los clasificadores WEKA es interactiva:
permite que el usuario—es decir, que—construir
su propio clasificador. Aquı́ hay una competencia: a ver quién puede construir un clasificador
con mayor precisión de predicción!
Load the file segment-challenge.arff (in the
data folder that comes with the WEKA distribution). This dataset has 20 attributes and 7 classes.
It is an image segmentation problem, and the task
is to classify images into seven different groups
based on properties of the pixels.
Cargar el archivo segment-challenge.arff (en
la carpeta de datos que viene con la distribución
de WEKA). Este conjunto de datos cuenta con 20
atributos y las clases 7. Se trata de un problema
de segmentación de la imagen, y la tarea consiste
en clasificar las imágenes en siete grupos diferentes
basados en las propiedades de los pı́xeles.
Set the classifier to UserClassifier, in the
weka.classifiers.trees package. We will use a
supplied test set (performing cross-validation with
the user classifier is incredibly tedious!). In the
Test options box, choose the Supplied test set
option and click the Set... button. A small
window appears in which you choose the test
set. Click Open file... and browse to the file
segment-test.arff (also in the WEKA distribution’s data folder). On clicking Open, the small
window updates to show the number of instances
(810) and attributes (20); close it.
Ajuste el clasificador a UserClassifier, en el
weka.classifiers.trees paquete. Vamos a utilizar una unidad de prueba suministrada (realizar
la validación cruzada con el clasificador de usuario
es muy aburrido!). En el cuadro de Test options, seleccione la opción de Supplied test set
y haga clic en el botón de Set.... Aparecerá una
pequeña ventana en la que usted elija el equipo
de prueba. Haga clic en Open file... y busque
el archivo segment-test.arff (también en la carpeta de datos de la distribución de WEKA). Al
hacer clic en Open, las actualizaciones pequeña
ventana para mostrar el número de casos (810) y
atributos (20), ciérrelo.
Click Start. The behaviour of UserClassifier differs from all other classifiers. A special window appears and WEKA waits for you to use it to build
your own classifier. The tabs at the top of the
window switch between two views of the classifier.
The Tree visualizer view shows the current state
of your tree, and each node shows the number of instances of each class
that reach it. The aim is to come up with a tree
where the leaf nodes are as pure as possible. To
begin with, the tree has just one node—the root
node—containing all the data. More nodes will
appear when you proceed to split the data in the
Data visualizer view.
Click the Data visualizer tab to see a 2D plot in
which the data points are colour coded by class.
You can change the attributes used for the axes
either with the X and Y drop-down menus at the
top, or by left-clicking (for X) or right-clicking (for
Y) the horizontal strips to the right of the plot
area. These strips show the spread of instances
along each particular attribute.
You need to try different combinations of X and
Y axes to get the clearest separation you can find
between the colours. Having found a good separation, you then need to select a region in the plot:
this will create a branch in your tree. Here is a hint
to get you started: plot region-centroid-row on
the X-axis and intensity-mean on the Y-axis.
You will see that the red class (‘sky’) is nicely separated from the rest of the classes at the top of the
plot.
There are three tools for selecting regions in the
graph, chosen using the drop-down menu below
the Y-axis selector:
1. Rectangle allows you to select points by
dragging a rectangle around them.
2. Polygon allows you to select points by drawing a free-form polygon. Left-click to add
vertices; right-click to complete the polygon.
The polygon will be closed off by connecting
the first and last points.
3. Polyline allows you to select points by drawing a free-form polyline. Left-click to add
vertices; right-click to complete the shape.
The resulting shape is open, as opposed to
the polygon which is closed.
When you have selected an area using any of these
tools, it turns gray. Clicking the Clear button
cancels the selection without affecting the classifier. When you are happy with the selection, click
Submit. This creates two new nodes in the tree,
one holding all the instances covered by the selection and the other holding all remaining instances.
These nodes correspond to a binary split that performs the chosen geometric test.
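The binary split created by a selection can be illustrated outside WEKA. The following is a minimal Python sketch, not WEKA's own code: the function names, the sample instances and the rectangle coordinates are all hypothetical, and only closed regions (rectangle or polygon) are handled, not the open polyline.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the closed polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wraps around, closing the polygon
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def split_by_region(instances, vertices):
    """Binary split: instances covered by the region vs. all remaining ones."""
    covered, remaining = [], []
    for inst in instances:
        if point_in_polygon(inst[0], inst[1], vertices):
            covered.append(inst)
        else:
            remaining.append(inst)
    return covered, remaining


# Hypothetical (x, y, class) instances and a rectangular selection.
data = [(1.0, 1.0, "sky"), (2.0, 3.0, "sky"), (5.0, 1.0, "grass"), (6.0, 6.0, "path")]
rectangle = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
covered, remaining = split_by_region(data, rectangle)
print(len(covered), len(remaining))
```

Each list corresponds to one of the two new nodes: a node is pure when all of its instances share a class, as the covered list does here.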
Switch back to the Tree visualizer view to examine the change in the tree. Clicking on different
nodes alters the subset of data that is shown in the
Data visualizer section. Continue adding nodes
until you obtain a good separation of the classes—
that is, the leaf nodes in the tree are mostly pure.
Remember, however, that you do not want to overfit the data, because your tree will be evaluated on
a separate test set.
When you are satisfied with the tree, right-click
any blank space in the Tree visualizer view and
choose Accept The Tree. WEKA evaluates your
tree against the test set and outputs statistics that
show how well you did.
You are competing for the best accuracy score
of a hand-built UserClassifier produced on the
‘segment-challenge’ dataset and tested on the
‘segment-test’ set. Try as many times as you like.
A good score is anything close to 90% correct or
better. Run J48 on the data to see how well an automatic decision tree learner performs on the task.
Ex. 12: When you think you have a good score,
right-click the corresponding entry in the
Result list, save the output using Save result buffer, and copy it into your answer for
this tutorial.
Tutorial 3: Classification Boundaries
Eibe Frank and Ian H. Witten
May 5, 2011
© 2008-2012
1
Introduction
In this tutorial you will look at the classification
boundaries that are produced by different types
of models. To do this, we use WEKA’s BoundaryVisualizer. This is not part of the WEKA Explorer that we have been using so far. Start up the
WEKA GUI Chooser as usual from the Windows
START menu (on Linux or the Mac, double-click
weka.jar or weka.app). From the Visualization
menu at the top, select BoundaryVisualizer.
The boundary visualizer shows two-dimensional
plots of the data, and is most appropriate for
datasets with two numeric attributes. We will use
a version of the iris data without the first two
attributes. To create this from the standard iris
data, start up the Explorer, load iris.arff using the Open file button and remove the first
two attributes (‘sepallength’ and ‘sepalwidth’) by
selecting them and clicking the Remove button
that appears at the bottom. Then save the modified dataset to a file (using Save) called, say,
iris.2D.arff.
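The same attribute removal could also be scripted instead of done by hand in the Explorer. The following Python sketch is only an illustration under stated assumptions: the function name drop_first_attributes is hypothetical, the sample text is an abbreviated stand-in for iris.arff with a couple of illustrative rows, and the parsing only copes with simple, dense ARFF files whose values contain no quoted commas.

```python
def drop_first_attributes(arff_text, how_many):
    """Remove the first `how_many` attributes from a simple, dense ARFF file."""
    out, seen, in_data = [], 0, False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lower = stripped.lower()
        if in_data and stripped and not stripped.startswith("%"):
            # Data row: drop the leading comma-separated values.
            out.append(",".join(stripped.split(",")[how_many:]))
        elif lower.startswith("@attribute"):
            seen += 1
            if seen > how_many:  # keep only the later attribute declarations
                out.append(line)
        else:
            if lower.startswith("@data"):
                in_data = True
            out.append(line)
    return "\n".join(out)


# Abbreviated, illustrative stand-in for iris.arff (assumption, not the full file).
sample = """@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor"""

print(drop_first_attributes(sample, 2))
```

The result keeps two numeric attributes plus the class, which is the shape the boundary visualizer works best with.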
Now leave the Explorer and open this file for visualization using the boundary visualizer’s Open
File... button. Initially, the plot just shows the
data in the dataset.1
2
Visualizing 1R
Just plotting the data is nothing new. The real
purpose of the boundary visualizer is to show the
predictions of a given model for each location in
space. The points representing the data are color
coded based on the prediction the model generates.
We will use this functionality to investigate the decision boundaries that different classifiers generate
for the reduced iris dataset.
1 There is a bug in the initial visualization. To get a true plot of the data, select a different attribute for either the x or y axis by clicking the appropriate button.
We start with the 1R rule learner. Use the
Choose button of the boundary visualizer to select weka.classifiers.rules.OneR. Make sure
you tick Plot training data, otherwise only the
predictions will be plotted. Then hit the Start
button. The program starts plotting predictions
in successive scan lines. Hit the Stop button once
the plot has stabilized—as soon as you like, in this
case—and the training data will be superimposed
on the boundary visualization.
Ex. 1: Explain the plot based on what you know
about 1R. (Hint: use the Explorer to look at
the rule set that 1R generates for this data.)
Ex. 2: Study the effect of the minBucketSize
parameter on the classifier by regenerating
the plot with values of 1, and then 20, and
then some critical values in between. Describe what you see, and explain it. (Hint:
you could speed things up by using the Explorer to look at the rule sets.)
3
Visualizing nearest-neighbor learning
Now we look at the classification boundaries created by the nearest neighbor method. Use the
boundary visualizer’s Choose... button to select
the IBk classifier (weka.classifiers.lazy.IBk)
and plot its decision boundaries for the reduced
iris data.
In WEKA, OneR’s predictions are categorical: for
each instance they predict one of the three classes.
In contrast, IBk outputs probability estimates for
each class, and these are used to mix the colors
red, green, and blue that correspond to the three
classes. IBk estimates class probabilities by counting the number of instances of each class in the set
of nearest neighbors of a test case and uses the
resulting relative frequencies as probability estimates. With k = 1, which is the default value, you
might expect there to be only one instance in the
set of nearest neighbors of each test case (i.e. pixel
location). Looking at the plot, this is indeed almost always the case, because the estimated probability is one for almost all pixels, resulting in a
pure color. There is no mixing of colors because
one class gets probability one and the others probability zero.
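The relative-frequency estimate and the colour mixing described above can be sketched in a few lines of Python. This is an illustration only, not IBk's implementation: the function names, the training points and the assignment of red, green and blue to particular classes are assumptions, and the sketch ignores how ties in distance are handled.

```python
import math
from collections import Counter


def knn_class_probabilities(train, query, k, classes):
    """Relative class frequencies among the k nearest training instances."""
    # Sort training instances by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda inst: math.dist(inst[0], query))
    counts = Counter(label for _, label in by_distance[:k])
    return [counts[c] / k for c in classes]


def mix_colour(probabilities):
    """Blend pure red, green and blue in proportion to the class probabilities."""
    return tuple(round(255 * p) for p in probabilities)


# Hypothetical (petallength, petalwidth) training points.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
train = [
    ((1.4, 0.2), "Iris-setosa"),
    ((1.5, 0.2), "Iris-setosa"),
    ((4.5, 1.5), "Iris-versicolor"),
    ((4.7, 1.4), "Iris-versicolor"),
    ((5.1, 1.9), "Iris-virginica"),
]
pixel = (4.8, 1.6)

print(mix_colour(knn_class_probabilities(train, pixel, 1, classes)))  # pure colour
print(mix_colour(knn_class_probabilities(train, pixel, 3, classes)))  # mixed colour
```

With k = 1 a single neighbour decides, so one class gets probability one and the pixel is a pure colour; with k = 3 the neighbours disagree and the colours blend.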
Ex. 3: Nevertheless, there is a small area in the
plot where two colors are in fact mixed. Explain this. (Hint: look carefully at the data
using the Visualize panel in the Explorer.)
Ex. 4: Experiment with different values for k, say
5 and 10. Describe what happens as k increases.
4
Visualizing naive Bayes
Turn now to the naive Bayes classifier. This assumes that attributes are conditionally independent given a particular class value. This means
that the overall class probability is obtained by
simply multiplying the per-attribute conditional
probabilities together. In other words, with two
attributes, if you know the class probabilities along
the x-axis and along the y-axis, you can calculate
the value for any point in space by multiplying
them together. This is easier to understand if you
visualize it as a boundary plot.
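The multiplication can be made concrete with a small Python sketch. It assumes the usual naive Bayes form, in which the class prior is multiplied by each per-attribute conditional probability and the results are normalized to sum to one; the function name and all the probability values below are hypothetical.

```python
def naive_bayes_posterior(prior, p_x_given_class, p_y_given_class):
    """Combine per-attribute conditional probabilities into class probabilities."""
    scores = {
        c: prior[c] * p_x_given_class[c] * p_y_given_class[c]
        for c in prior
    }
    total = sum(scores.values())
    # Normalize so that the class probabilities sum to one.
    return {c: s / total for c, s in scores.items()}


# Hypothetical values for one pixel location (one x interval, one y interval).
prior = {"setosa": 1 / 3, "versicolor": 1 / 3, "virginica": 1 / 3}
p_x = {"setosa": 0.8, "versicolor": 0.1, "virginica": 0.1}
p_y = {"setosa": 0.5, "versicolor": 0.4, "virginica": 0.1}

posterior = naive_bayes_posterior(prior, p_x, p_y)
print(posterior)
```

Because the x and y contributions enter only as a product, every pixel in the same column shares one x factor and every pixel in the same row shares one y factor.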
Plot the predictions of naive Bayes. But first, you
need to discretize the attribute values. By default,
NaiveBayes assumes that the attributes are normally distributed given the class (i.e., they follow
a bell-shaped distribution). You should override
this by setting useSupervisedDiscretization to
true using the GenericObjectEditor. This will
cause NaiveBayes to discretize the numeric attributes in the data using a supervised discretization technique.3
In almost all practical applications of NaiveBayes, supervised discretization works better
than the default method, and that is why we consider it here. It also produces a more comprehensible visualization.
Ex. 5: The plot that is generated by visualizing the predicted class probabilities of naive
Bayes for each pixel location is quite different
from anything we have seen so far. Explain
the patterns in it.
5
Visualizing decision trees and rule sets
Decision trees and rule sets are similar to nearest-neighbor learning in the sense that they are also
quasi-universal: in principle, they can approximate
any decision boundary arbitrarily closely. In this
section, we look at the boundaries generated by
JRip and J48.
Generate a plot for JRip, with default options.
Ex. 6: What do you see? Relate the plot to the
output of the rules that you get by processing
the data in the Explorer.
Ex. 7: The JRip output assumes that the rules
will be executed in the correct sequence.
Write down an equivalent set of rules that
achieves the same effect regardless of the order in which they are executed.
3 The technique used is “supervised” because it takes the class labels of the instances into account to find good split points for the discretization intervals.
Generate a plot for J48, with default options.
Ex. 8: What do you see? Relate the plot to the
output of the tree that you get by processing
the data in the Explorer.
One way to control how much pruning J48 performs before it outputs its tree is to adjust the
minimum number of instances required in a leaf,
minNumObj.
Ex. 9: Suppose you want to generate trees with
3, 2, and 1 leaf nodes respectively. What are
the exact ranges of values for minNumObj
that achieve this, given default values for all
other parameters?
6
Messing with the data
With the BoundaryVisualizer you can modify
the data by adding or removing points.
Ex. 10: Introduce some “noise” into the data and
study the effect on the learning algorithms
we looked at above. What kind of behavior do you observe for each algorithm as you
introduce more noise?
7
1R revisited
Return to the 1R rule learner on the reduced iris
dataset used in Section 2 (not the noisy version
you just created). The following questions will require you to think about the internal workings of
1R. (Hint: it will probably be fastest to use the Explorer to look at the rule sets.)
Ex. 11: You saw in Section 2 that the plot always
has three regions. But why aren’t there more
for small bucket sizes (e.g., 1)? Use what
you know about 1R to explain this apparent
anomaly.
Ex. 12: Can you set minBucketSize to a value
that results in fewer than three regions? What
is the smallest possible number of regions?
What is the smallest value for minBucketSize that gives you this number of regions?
Explain the result based on what you know
about the iris data.
Tutorial 4: Preprocessing and Parameter Tuning
May 5, 2011
© 2008-2012
1
Introduction
Data preprocessing is often necessary to get data
ready for learning. It may also improve the outcome of the learning process and lead to more accurate and concise models. The same is true for
parameter tuning methods. In this tutorial we
will look at some useful preprocessing techniques,
which are implemented as WEKA filters, as well
as a method for automatic parameter tuning.
2
Discretization
Numeric attributes can be converted into discrete
ones by splitting their ranges into numeric intervals, a process known as discretization. There are
two types of discretization techniques: unsupervised ones, which are “class blind,” and supervised
ones, which take the class value of the instances into
account when creating intervals. The aim with supervised techniques is to create intervals that are
as consistent as possible with respect to the class
labels.
The main unsupervised technique for discretizing numeric attributes in WEKA is
weka.filters.unsupervised.attribute.Discretize. It implements two straightforward
methods: equal-width and equal-frequency discretization. The first simply splits the numeric
range into equal intervals. The second chooses
the width of the intervals so that they contain
(approximately) the same number of instances.
The default is to use equal width.
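The difference between the two methods is easy to see on a small skewed sample. The Python sketch below is an illustration of the general idea rather than WEKA's Discretize code: the function names, the choice of three bins and the sample values are assumptions, and details such as how WEKA treats tied values are not modelled.

```python
def equal_width_cut_points(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equal_frequency_cut_points(values, bins):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, bins):
        idx = i * n // bins
        # Place the cut halfway between neighbouring sorted values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def bin_counts(values, cuts):
    """Number of values falling into each interval defined by the cut points."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[sum(v > c for c in cuts)] += 1
    return counts


# Hypothetical skewed attribute: eight small values and one outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_cuts = equal_width_cut_points(values, 3)
freq_cuts = equal_frequency_cut_points(values, 3)
print(width_cuts, bin_counts(values, width_cuts))
print(freq_cuts, bin_counts(values, freq_cuts))
```

Equal width leaves almost everything in the first interval because the outlier stretches the range, whereas equal frequency spreads the instances evenly at the cost of intervals with very different widths.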
Find the glass dataset glass.arff and load it
into the Explorer. Apply the unsupervised discretization filter in the two different modes discussed above.
Ex. 1: What do you observe when you compare
the histograms obtained? Why is the one for
equal-frequency discretization quite skewed
for some attributes?
The main supervised technique for discretizing numeric attributes in WEKA is
weka.filters.supervised.attribute.Discretize. Locate the iris data, load it in,
apply the supervised discretization scheme, and
look at the histograms obtained. Supervised
discretization attempts to create intervals such
that the class distributions differ between intervals
but are consistent within intervals.
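One common way to measure how consistent the class labels are within intervals is the entropy of the class distribution, weighted by interval size. The Python sketch below illustrates that general idea only; it is not a description of the exact criterion the WEKA filter applies, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of the class distribution in one interval."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_entropy(values, labels, cut):
    """Size-weighted entropy of the two intervals created by a cut point."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)


# Hypothetical attribute values with their class labels.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]

print(weighted_entropy(values, labels, 2.5))  # both intervals are pure
print(weighted_entropy(values, labels, 1.5))  # right interval mixes classes
```

A lower value means the intervals are more consistent with the class labels, so a supervised method would prefer the cut at 2.5 here.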
Ex. 2: Based on the histograms obtained, which
of the discretized attributes would you consider the most predictive ones?
Reload the glass data and apply supervised discretization to it.
Ex. 3: There is only a single bar in the histograms
for some of the attributes. What does that
mean?
Discretized attributes are normally coded as nominal attributes, with one value per range. However,
because the ranges are ordered, a discretized attribute is actually on an ordinal scale. Both filters
also have the ability to create binary attributes
rather than multi-valued ones, by setting the option makeBinary to true.
Ex. 4: Choose one of the filters and apply it
to create binary attributes. Compare to
the output generated when makeBinary is
false. What do the binary attributes represent?
3
More on Discretization
Here we examine the effect of discretization when
building a J48 decision tree for the data in
ionosphere.arff. This dataset contains information about radar signals returned from the ionosphere. “Good” samples are those showing evidence of some type of structure in the ionosphere,
while for “bad” ones the signals pass directly
through the ionosphere. For more details, take a
look at the comments in the ARFF file. Begin with
unsupervised discretization.
Ex. 5: Compare the cross-validated accuracy of
J48 and the size of the trees generated for
(a) the raw data, (b) data discretized by the
unsupervised discretization method in default mode, (c) data discretized by the same
method with binary attributes.
Now turn to supervised discretization. Here a subtle issue arises. If we simply repeated the previous
exercise using a supervised discretization method,
the result would be over-optimistic. In effect, since
cross-validation is used for evaluation, the data in
the test set has been taken into account when determining the discretization intervals. This does not
give a fair estimate of performance on fresh data.
To evaluate supervised discretization in a fair fashion, we use the FilteredClassifier from WEKA’s
meta-classifiers. This builds the filter model from
the training data only, before evaluating it on the
test data using the discretization intervals computed for the training data. After all, that is how
you would have to process fresh data in practice.
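The order of operations described above can be sketched in a few lines of Python. This is a minimal illustration and not WEKA's FilteredClassifier: the function names are hypothetical, and simple equal-width binning stands in for the supervised discretization method so that the example stays short. What it shows is only that the cut points are learned from the training part of each fold and then reused, unchanged, on the test part.

```python
def fit_cut_points(train_values, bins=3):
    """Learn equal-width cut points from TRAINING values only (stand-in discretizer)."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def apply_cut_points(value, cut_points):
    """Return the interval index: the number of cut points strictly below the value."""
    return sum(value > c for c in cut_points)


def discretized_folds(values, n_folds, bins=3):
    """For each fold, fit on the training part and apply the same cut points to both parts."""
    results = []
    for f in range(n_folds):
        test = [v for i, v in enumerate(values) if i % n_folds == f]
        train = [v for i, v in enumerate(values) if i % n_folds != f]
        cuts = fit_cut_points(train, bins)  # the test values are never looked at here
        results.append(([apply_cut_points(v, cuts) for v in train],
                        [apply_cut_points(v, cuts) for v in test]))
    return results
```

A test value outside the range seen during training simply falls into the first or last interval, which is also what happens to fresh data in practice.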
Ex. 6: Compare the cross-validated accuracy and
the size of the trees generated using the FilteredClassifier and J48 for (d) supervised
discretization in default mode, (e) supervised
discretization with binary attributes.
Ex. 7: Compare these with the results for the raw
data ((a) above). Can you think of a reason why decision trees generated from discretized data can potentially be more accurate predictors than those built from raw numeric data?
4
Automatic Attribute Selection
In most practical applications of supervised learning not all attributes are equally useful for predicting the target. Depending on the learning scheme
employed, redundant and/or irrelevant attributes
can result in less accurate models being generated.
The task of manually identifying useful attributes
in a dataset can be tedious, as you have seen in the
second tutorial—but there are automatic attribute
selection methods that can be applied.
They can be broadly divided into those that rank
individual attributes (e.g., based on their information gain) and those that search for a good subset
of attributes by considering the combined effect
of the attributes in the subset. The latter methods can be further divided into so-called filter and
wrapper methods. Filter methods apply a computationally efficient heuristic to measure the quality
of a subset of attributes. Wrapper methods measure the quality of an attribute subset by building
and evaluating an actual classification model from
it, usually based on cross-validation. This is more
expensive, but often delivers superior performance.
In the WEKA Explorer, you can use the Select attributes panel to apply an attribute selection method on a dataset. The default is CfsSubsetEval. However, if we want to rank individual attributes, we need to use an attribute
evaluator rather than a subset evaluator, e.g., the
InfoGainAttributeEval. Attribute evaluators
need to be applied with a special “search” method,
namely the Ranker.
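To make the idea of ranking individual attributes concrete, here is a small Python sketch that scores nominal attributes by information gain and sorts them. It is an illustration under simplifying assumptions, not WEKA's InfoGainAttributeEval or Ranker: the function names are hypothetical, only nominal attributes are handled, and missing values are ignored.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Class entropy minus the weighted class entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Return attribute names ordered from highest to lowest information gain."""
    scored = [(information_gain(col, labels), name) for name, col in columns.items()]
    return [name for gain, name in sorted(scored, key=lambda t: t[0], reverse=True)]
```

Each attribute is scored on its own, so two attributes that carry the same information both receive a high score; this is the main difference from the subset-based methods.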
Ex. 8: Apply this technique to the labour negotiations data in labor.arff. What are the
four most important attributes based on information gain? (See footnote 1.)
Footnote 1: Note that most attribute evaluators, including InfoGainAttributeEval, discretize numeric attributes using WEKA’s
supervised discretization method before they are evaluated. This is also the case for CfsSubsetEval.
WEKA’s default attribute selection method, CfsSubsetEval, uses a heuristic attribute subset evaluator in a filter search method. It aims to identify a subset of attributes that are highly correlated with the target while not being strongly correlated with each other. By default, it searches
through the space of possible attribute subsets
for the “best” one using the BestFirst search
method (see footnote 3). You can choose others, like a genetic
algorithm or even an exhaustive search. In fact,
choosing GreedyStepwise and setting searchBackwards to true gives “backwards selection,”
the search method you used manually in the second tutorial.
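The trade-off between correlation with the target and correlation among the attributes themselves is usually written as a single merit score in the literature on correlation-based feature selection. The formula below is that commonly cited form, given here as background; the tutorial itself does not state it, and the symbols are introduced only for this illustration.

```latex
\mathrm{Merit}_S \;=\; \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
```

Here $S$ is a candidate subset containing $k$ attributes, $\overline{r_{cf}}$ is the average correlation between an attribute in $S$ and the class, and $\overline{r_{ff}}$ is the average correlation between pairs of attributes in $S$. The numerator rewards attributes that predict the class; the denominator grows when the attributes are redundant with one another.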
To use the wrapper method rather than a filter
method like CfsSubsetEval, you need to select
WrapperSubsetEval. You can configure this by
choosing a learning algorithm to apply. You can
also set the number of folds for the cross-validation
that is used to evaluate the model on each subset
of attributes.
Ex. 9: On the same data, run CfsSubsetEval
for correlation-based selection, using BestFirst search. Then run the wrapper method
with J48 as the base learner, again using
BestFirst search. Examine the attribute
subsets that are output. Which attributes
are selected by both methods? How do they
relate to the output generated by ranking using information gain?
5
More on Automatic Attribute Selection
The Select attributes panel allows us to gain insight into a dataset by applying attribute selection
methods to it. However, using this information to reduce a dataset becomes problematic
if we use some of the reduced data for testing the
model (as in cross-validation).
Footnote 3: BestFirst is a standard search method from AI.
The reason is that, as with supervised discretization, we have actually looked at the class labels in
the test data while selecting attributes—the “best”
attributes were chosen by peeking at the test data.
As we already know (see Tutorial 2), using the test
data to influence the construction of a model biases the accuracy estimates obtained: measured
accuracy is likely to be greater than what will be
obtained when the model is deployed on fresh data.
To avoid this, we can manually divide the data into
training and test sets and apply the attribute selection panel to the training set only.
A more convenient method is to use the
AttributeSelectedClassifier, one of WEKA’s
meta-classifiers. This allows us to specify an attribute selection method and a learning algorithm
as part of a classification scheme. The AttributeSelectedClassifier ensures that the chosen set of
attributes is selected based on the training data
only, in order to give unbiased accuracy estimates.
Now we test the various attribute selection methods used above in conjunction with NaiveBayes.
Naive Bayes assumes (conditional) independence
of attributes, so it can be affected if attributes
are redundant, and attribute selection can be very
helpful.
You can see the effect of redundant attributes on naive Bayes by adding copies of an
existing attribute to a dataset using the unsupervised filter class
weka.filters.unsupervised.attribute.Copy in the Preprocess panel. Each copy is obviously
perfectly correlated with the original.
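The following Python sketch shows why copies matter to naive Bayes in particular. It is a toy calculation, not WEKA's NaiveBayes: the function names are hypothetical and the likelihoods 0.6 and 0.3 are made-up numbers. Because the classifier multiplies one likelihood per attribute, each copy of an attribute counts the same evidence again and pushes the posterior further toward one class.

```python
def naive_bayes_posterior(prior_pos, likelihoods_pos, likelihoods_neg):
    """Posterior of the positive class for a two-class naive Bayes model."""
    score_pos = prior_pos
    score_neg = 1.0 - prior_pos
    for lp, ln in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= lp  # one factor per attribute, treated as independent evidence
        score_neg *= ln
    return score_pos / (score_pos + score_neg)


def posterior_with_copies(copies, prior_pos=0.5, p_pos=0.6, p_neg=0.3):
    """Posterior when the same attribute value is present `copies` times in the data."""
    return naive_bayes_posterior(prior_pos, [p_pos] * copies, [p_neg] * copies)


for n in (1, 2, 3):
    print(n, round(posterior_with_copies(n), 3))
```

With these illustrative numbers the posterior moves from about 0.667 with one copy to 0.8 with two, although no new information has been added.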
Ex. 10: Load the diabetes classification data in
diabetes.arff and start adding copies of
the first attribute in the data, measuring the
performance of naive Bayes (with useSupervisedDiscretization turned on) using
cross-validation after you have added each
copy. What do you observe?
Let us now check whether the three attribute selection methods from above, used in conjunction
with AttributeSelectedClassifier and NaiveBayes, successfully eliminate the redundant attributes. The methods are:
• InfoGainAttributeEval with Ranker (8 attributes)
• CfsSubsetEval with BestFirst
• WrapperSubsetEval with NaiveBayes and BestFirst.
Run each method from within AttributeSelectedClassifier to see the effect on cross-validated
accuracy and check the attribute subset selected
by each method. Note that you need to specify the
number of ranked attributes to use for the Ranker
method. Set this to eight, because the original diabetes data contains eight attributes (excluding the
class). Note also that you should specify NaiveBayes as the classifier to be used inside the wrapper method, because this is the classifier that we
want to select a subset for.
Ex. 11: What can you say regarding the performance of the three attribute selection methods? Do they succeed in eliminating redundant copies? If not, why not?
6
Automatic parameter tuning
Many learning algorithms have parameters that
can affect the outcome of learning. For example,
the decision tree learner C4.5 (J48 in WEKA) has
two parameters that influence the amount of pruning that it does (we saw one, the minimum number
of instances required in a leaf, in the last tutorial).
The k-nearest-neighbor classifier IBk has one that
sets the neighborhood size. But manually tweaking
parameter settings is tedious, just like manually
selecting attributes, and presents the same problem: the test data must not be used when selecting
parameters—otherwise the performance estimates
will be biased.
WEKA has a “meta” classifier, CVParameterSelection, that automatically searches for the
“best” parameter settings by optimizing cross-validated accuracy on the training data.
By default, each setting is evaluated using 10-fold cross-validation. The parameters to
optimize are specified using the CVParameters field in the GenericObjectEditor. For each
one, we need to give (a) a string that names it using its letter code, (b) a numeric range
of values to evaluate, and (c) the number of steps to try in this range. (Note that the
parameter is assumed to be numeric.) Click on the More button in the GenericObjectEditor
for more information, and an example.
For the diabetes data used in the previous section,
use CVParameterSelection in conjunction with
IBk to select the “best” value for the neighborhood size, ranging from 1 to 10 in ten steps. The
letter code for the neighborhood size is K. The
cross-validated accuracy of the parameter-tuned
version of IBk is directly comparable with its accuracy using default settings, because tuning is performed by applying inner cross-validation runs to
find the best parameter setting for each training
set occurring in the outer cross-validation, and the
latter yields the final performance estimate.
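The inner and outer cross-validation loops described above can be sketched as follows. This is a toy Python illustration and not WEKA's CVParameterSelection or IBk: all function names are hypothetical, the data are assumed to be (number, label) pairs with a single numeric attribute, and folds are formed by taking every n-th instance rather than by random stratified splitting.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (one numeric attribute)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)


def split_folds(data, n_folds):
    """Yield (train, test) pairs, assigning instance i to fold i % n_folds."""
    for f in range(n_folds):
        test = [d for i, d in enumerate(data) if i % n_folds == f]
        train = [d for i, d in enumerate(data) if i % n_folds != f]
        yield train, test


def cv_accuracy(data, k, n_folds):
    """Cross-validated accuracy of k-nearest-neighbor on the given data."""
    correct = 0
    for train, test in split_folds(data, n_folds):
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


def tune_k(train, candidates, inner_folds):
    """Inner loop: choose k by cross-validation on the training data only."""
    return max(candidates, key=lambda k: cv_accuracy(train, k, inner_folds))


def nested_cv_accuracy(data, candidates, outer_folds, inner_folds):
    """Outer loop: tune k on each training set, then score on the held-out fold."""
    correct = 0
    for train, test in split_folds(data, outer_folds):
        best_k = tune_k(train, candidates, inner_folds)  # the test fold is not used here
        correct += sum(knn_predict(train, x, best_k) == y for x, y in test)
    return correct / len(data)
```

The held-out fold of the outer loop never influences the choice of k, which is why the resulting estimate can be compared fairly with the accuracy obtained using default settings.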
Ex. 12: What accuracy is obtained in each case?
What value is selected for the parameter-tuned version based on cross-validation on
the full training set? (Note: this value is
output in the Classifier output text area.)
Now consider parameter tuning for J48. We can
use CVParameterSelection to perform a grid
search on both pruning parameters simultaneously
by adding multiple parameter strings in the CVParameters field. The letter code for the pruning
confidence parameter is C, and you should evaluate values from 0.1 to 0.5 in five steps. The letter
code for the minimum leaf size parameter is M,
and you should evaluate values from 1 to 10 in ten
steps.
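To see how large the resulting grid is, the Python sketch below expands two parameter descriptions into all combinations. It rests on assumptions that should be checked against the More button's documentation: each description is taken to be a string of the form "letter lower upper steps", and the values are taken to be evenly spaced with both ends of the range included. The function names are hypothetical.

```python
from itertools import product


def expand(spec):
    """Expand an assumed 'CODE lower upper steps' string into (code, list of values)."""
    code, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps == 1:
        return code, [lower]
    increment = (upper - lower) / (steps - 1)
    # Rounding only tidies floating-point noise such as 0.30000000000000004.
    return code, [round(lower + i * increment, 10) for i in range(steps)]


def grid(specs):
    """Return every combination of parameter values as a list of dictionaries."""
    names, value_lists = zip(*(expand(s) for s in specs))
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


settings = grid(["C 0.1 0.5 5", "M 1 10 10"])
print(len(settings), settings[0], settings[-1])
```

Under these assumptions the two ranges give 5 x 10 = 50 settings, and each one is evaluated by its own cross-validation, so the search is noticeably slower than a single J48 run.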
Ex. 13: Run CVParameterSelection to find
the best parameter values in the resulting
grid. Compare the output you get to that
obtained from J48 with default parameters.
Has accuracy changed? What about tree
size? What parameter values were selected
by CVParameterSelection for the model
built from the full training set?
Tutorial 5: Document Classification
May 5, 2011
© 2008-2012
1
Introduction
Text classification is a popular application of machine learning. You may even have used it: email
spam filters are classifiers that divide email messages, which are just short documents, into two
groups: junk and not junk. So-called “Bayesian”
spam filters are trained on messages that have been
manually labeled, perhaps by putting them into
appropriate folders (e.g. “ham” vs “spam”).
In this tutorial we look at how to perform document classification using tools in WEKA. The raw
data is text, but most machine learning algorithms
expect examples that are described by a fixed set
of attributes. Hence we first convert the text data
into a form suitable for learning. This is usually
done by creating a dictionary of terms from all
the documents in the training corpus and making
a numeric attribute for each term. Then, for a
particular document, the value of each attribute is
based on the frequency of the corresponding term
in the document. There is also the class attribute,
which gives the document’s label.
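The conversion described above can be sketched in Python as follows. This is a simplified illustration, not WEKA's StringToWordVector: the function names are hypothetical, the tokenizer just lowercases the text and keeps runs of letters, and the attribute values are raw term counts, whereas the WEKA filter has its own tokenizer and options.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(documents):
    """Collect every distinct term in the training corpus, in sorted order."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def to_vector(document, dictionary):
    """One numeric attribute per dictionary term: how often it occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in dictionary]


corpus = ["Crude oil is in short supply", "The food was very oily"]
dictionary = build_dictionary(corpus)
print(dictionary)
print(to_vector("Oil platforms extract crude oil", dictionary))
```

Terms that do not occur in the training corpus, such as "platforms" here, have no attribute and are simply ignored when a new document is converted.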
2
Data with string attributes
WEKA’s unsupervised attribute filter StringToWordVector can be used to convert
raw text into term-frequency-based attributes.
The filter assumes that the text of the documents
is stored in an attribute of type String, which is
a nominal attribute without a pre-specified set of
values. In the filtered data, this string attribute is
replaced by a fixed set of numeric attributes, and
the class attribute is put at the beginning, as the
first attribute.
To perform document classification, we first
need to create an ARFF file with a string
attribute that holds the documents’ text—
declared in the header of the ARFF file using
@attribute document string, where document
is the name of the attribute. We also need a nominal attribute that holds the document’s classification.
Document text                                         Classification
The price of crude oil has increased significantly    yes
Demand of crude oil outstrips supply                  yes
Some people do not like the flavor of olive oil       no
The food was very oily                                no
Crude oil is in short supply                          yes
Use a bit of cooking oil in the frying pan            no
Table 1: Training “documents”.
Document text                                         Classification
Oil platforms extract crude oil                       Unknown
Canola oil is supposed to be healthy                  Unknown
Iraq has significant oil reserves                     Unknown
There are different types of cooking oil              Unknown
Table 2: Test “documents”.
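As a rough guide to the file layout, the Python sketch below writes the Table 1 documents as ARFF text with a string attribute and a nominal class attribute. It is one possible arrangement, not the only valid one: the relation name oil_documents, the attribute name class, and the helper functions are all illustrative choices made here. Text values are put in single quotes because they contain spaces.

```python
def quote(text):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def make_arff(relation, rows, class_values):
    """Build ARFF text with one string attribute and one nominal class attribute."""
    lines = [
        "@relation " + relation,
        "",
        "@attribute document string",
        "@attribute class {" + ",".join(class_values) + "}",
        "",
        "@data",
    ]
    for text, label in rows:
        lines.append(quote(text) + "," + label)
    return "\n".join(lines) + "\n"


TRAINING = [
    ("The price of crude oil has increased significantly", "yes"),
    ("Demand of crude oil outstrips supply", "yes"),
    ("Some people do not like the flavor of olive oil", "no"),
    ("The food was very oily", "no"),
    ("Crude oil is in short supply", "yes"),
    ("Use a bit of cooking oil in the frying pan", "no"),
]

print(make_arff("oil_documents", TRAINING, ["yes", "no"]))
```

For the Table 2 documents the same function can be called with a question mark as the label of every row, since ARFF uses ? for a missing value.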
Ex. 1: To get a feeling for how this works,
make an ARFF file from the labeled mini-“documents” in Table 1 and run StringToWordVector with default options on
this data. How many attributes are generated? Now change the value of the option
minTermFreq to 2. What attributes are
generated now?
Ex. 2: Build a J48 decision tree from the last version of the data you generated. Give the tree
in textual form.
Usually, the purpose of a classifier is to classify new
documents. Let’s classify the ones given in Table 2,
based on the decision tree generated from the documents in Table 1. To apply the same filter to both
training and test documents, we can use the FilteredClassifier, specifying the StringToWordVector filter and the base classifier that we want
to apply (i.e., J48).
Ex. 3: Create an ARFF file from Table 2, using question marks for the missing class labels. Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify your new ARFF
file as the test set. Make sure that you select Output predictions under More options... in the Classify panel. Look at the
model and the predictions it generates, and
verify that they are consistent. What are the
predictions (in the order in which the documents are listed in Table 2)?
3
Classifying actual short text documents
There is a standard collection of newswire articles that is widely used for evaluating
document classifiers. ReutersCorn-train.arff and ReutersGrain-train.arff are sets
of training data derived from this collection; ReutersCorn-test.arff and
ReutersGrain-test.arff are corresponding
test sets. The actual documents in the corn and
grain data are the same; just the labels differ.
In the first dataset, articles that talk about
corn-related issues have a class value of 1 and the
others have 0; the aim is to build a classifier that
can be used to identify articles that talk about
corn. In the second, the analogous labeling is
performed with respect to grain-related issues,
and the aim is to identify these articles in the test
set.
Ex. 4: Build document classifiers for the two
training sets by applying the FilteredClassifier with StringToWordVector using (a)
J48 and (b) NaiveBayesMultinomial, in
each case evaluating them on the corresponding test set. What percentage of correct classifications is obtained in the four scenarios?
Based on your results, which classifier would
you choose?
The percentage of correct classifications is not the
only evaluation metric used for document classification. WEKA includes several other per-class
evaluation statistics that are often used to evaluate information retrieval systems like search engines. These are tabulated under Detailed Accuracy By Class in the Classifier output text
area. They are based on the number of true positives (TP), number of false positives (FP), number
of true negatives (TN), and number of false negatives (FN) in the test data. A true positive is
a test instance that is classified correctly as belonging to the target class concerned, while a false
positive is a (negative) instance that is incorrectly
assigned to the target class. FN and TN are defined analogously. The statistics output by WEKA
are computed as follows:
• TP Rate: TP / (TP + FN)
• FP Rate: FP / (FP + TN)
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-Measure: the harmonic mean of precision and recall
(2/F = 1/precision + 1/recall).
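The formulas in this list can be checked with a few lines of code. The following Python sketch is not part of WEKA; the function name per_class_stats and the counts in the usage lines are invented for illustration, and the code assumes that none of the denominators is zero.

```python
def per_class_stats(tp, fp, tn, fn):
    """Compute the per-class statistics from raw counts for one target class."""
    tp_rate = tp / (tp + fn)      # identical to recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall: 2/F = 1/precision + 1/recall
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, not taken from the Reuters data.
stats = per_class_stats(tp=30, fp=10, tn=50, fn=10)
for name, value in stats.items():
    print(name, round(value, 3))
```

Note that TP Rate and Recall are the same quantity under two names, which is why they always agree in WEKA's output.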
Ex. 5: Based on the formulas, what are the best
possible values for each of the statistics in
this list? Describe in English when these values are attained.
The Classifier Output table also gives the ROC
area, which differs from the other statistics because it is based on ranking the examples in the
test data according to how likely they are to belong to the positive class. The likelihood is given
by the class probability that the classifier predicts.
(Most classifiers in WEKA can produce probabilities in addition to actual classifications.) The ROC
area (which is also known as AUC) is the probability that a randomly chosen positive instance in
the test data is ranked above a randomly chosen
negative instance, based on the ranking produced
by the classifier.
The best outcome is that all positive examples are
ranked above all negative examples. In that case
the AUC is one. In the worst case it is zero. In
the case where the ranking is essentially random,
the AUC is 0.5. Hence we want an AUC that is at
least 0.5, otherwise our classifier has not learned
anything from the training data.
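The probabilistic definition of the ROC area can be turned directly into code. The sketch below is an illustration, not WEKA's implementation: the function name auc_by_pairs and the scores are made up, and counting a tie between a positive and a negative instance as half a win is a common convention that is assumed here.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs where the positive is ranked higher.

    scores: predicted probability of the positive class for each test instance
    labels: 1 for positive instances, 0 for negative instances
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # assumed convention: a tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(auc_by_pairs([0.9, 0.8, 0.3, 0.1], labels))  # every positive above every negative
print(auc_by_pairs([0.1, 0.2, 0.8, 0.9], labels))  # every positive below every negative
print(auc_by_pairs([0.5, 0.5, 0.5, 0.5], labels))  # ranking carries no information
```

The three calls correspond to the best case, the worst case, and an uninformative ranking described above.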
Ex. 6: Which of the two classifiers used above
produces the best AUC for the two Reuters
datasets? Compare this to the outcome for
percent correct. What do the different outcomes mean?
Ex. 7: Interpret in your own words the difference
between the confusion matrices for the two
classifiers.
There is a close relationship between ROC Area
and the two statistics TP Rate and FP Rate. Rather than
just obtaining a single pair of values for the true
and false positive rates, a whole range of value
pairs can be obtained by imposing different classification thresholds on the probabilities predicted
by the classifier.
By default, an instance is classified as “positive”
if the predicted probability for the positive class is
greater than 0.5; otherwise it is classified as negative. (This is because an instance is more likely
to be positive than negative if the predicted probability for the positive class is greater than 0.5.)
Suppose we change this threshold from 0.5 to some
other value between 0 and 1, and recompute TP Rate and FP Rate. Repeating this with different thresholds produces what is called an ROC
curve. You can show it in WEKA by right-clicking
on an entry in the result list and selecting Visualize threshold curve.
When you do this, you get a plot with FP Rate on
the x axis and TP Rate on the y axis. Depending
on the classifier you use, this plot can be quite
smooth, or it can be fairly discrete. The interesting
thing is that if you connect the dots shown in the
plot by lines, and you compute the area under the
resulting curve, you get the ROC Area discussed
above! That is where the acronym AUC for the
ROC Area comes from: “Area Under the Curve.”
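The construction just described can be sketched in Python. This is an illustration only and not how WEKA computes the curve internally: the function names roc_points and area_under are hypothetical, the scores are invented, and an instance is treated as positive when its score is at least the threshold (an assumption made so that each distinct score yields one point).

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs, one per threshold, strictest threshold first."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Connect consecutive points by straight lines and sum the trapezoid areas."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical predicted probabilities for the positive class, with true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(area_under(curve))
```

For this toy data, 8 of the 9 positive/negative pairs are ranked correctly, and the area under the connected points is the same fraction, 8/9.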
Ex. 8: For the Reuters dataset that produced the
most extreme difference in Exercise 6 above,
look at the ROC curves for class 1. Make a
very rough estimate of the area under each
curve, and explain it in words.
Ex. 9: What does the ideal ROC curve corresponding to perfect performance look like (a
rough sketch, or a description in words, is
sufficient)?
Using the threshold curve GUI, you can also plot
other types of curves, e.g. a precision/recall curve,
with Recall on the x axis and Precision on the
y axis. This plots precision against recall for each
probability threshold evaluated.
Ex. 10: Change the axes to obtain a precision/recall curve. What shape does the ideal
precision/recall curve corresponding to perfect performance have (again a rough sketch
or verbal description is sufficient)?
4
Exploring the StringToWordVector filter
By default, the StringToWordVector filter simply makes the attribute value in the transformed
dataset 1 or 0 for all raw single-word terms, depending on whether the word appears in the document or not. However, there are many options
that can be changed, e.g.:
• outputWordCounts causes actual word
counts to be output.
• IDFTransform and TFTransform: when
both are set to true, term frequencies are
transformed into so-called TF × IDF values
that are popular for representing documents
in information retrieval applications.
• stemmer allows you to choose from different
word stemming algorithms that attempt to
reduce words to their stems.
• useStopList allows you to determine whether
or not stop words are deleted. Stop words are
uninformative common words (e.g. a, the).
• tokenizer allows you to choose a different tokenizer for generating terms, e.g. one
that produces word n-grams instead of single
words.
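To make the effect of these options concrete, here is a small Python sketch of a word-vector transformation. It only imitates the idea behind StringToWordVector and is not the filter itself: the whitespace tokenizer, the function names, and the weighting log(1 + f) × log(N / df) used for the TF × IDF case are assumptions chosen for illustration, and WEKA's exact formulas and defaults may differ.

```python
import math


def tokenize(text):
    """Assumed tokenizer: lowercase the text and split on whitespace."""
    return text.lower().split()


def word_vectors(docs, counts=False, tfidf=False):
    """Turn documents into one numeric attribute per word.

    Default: 1 if the word occurs in the document, else 0.
    counts=True: raw word counts (in the spirit of outputWordCounts).
    tfidf=True: log(1 + count) * log(N / document frequency), an assumed
    textbook form of TF x IDF.
    """
    vocab = sorted({w for d in docs for w in tokenize(d)})
    n_docs = len(docs)
    doc_freq = {w: sum(1 for d in docs if w in tokenize(d)) for w in vocab}
    rows = []
    for d in docs:
        tokens = tokenize(d)
        row = []
        for w in vocab:
            f = tokens.count(w)
            if tfidf:
                row.append(math.log(1 + f) * math.log(n_docs / doc_freq[w]))
            elif counts:
                row.append(f)
            else:
                row.append(1 if f > 0 else 0)
        rows.append(row)
    return vocab, rows


# Hypothetical mini-corpus, not taken from the Reuters files.
docs = ["corn prices rise", "grain and corn exports", "oil prices fall"]
vocab, binary = word_vectors(docs)
print(vocab)
print(binary[0])
```

With the TF × IDF weighting, a word that occurs in every document gets the value 0 everywhere, which reflects the intuition that such words do not help to tell documents apart.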
There are several other useful options. For more
information, click on More in the GenericObjectEditor.
Ex. 11: Experiment with the options that are
available. What options give you a good
AUC value for the two datasets above, using
NaiveBayesMultinomial as the classifier?
(Note: an exhaustive search is not required.)
Often, not all attributes (i.e., terms) are important
when classifying documents, because many words
may be irrelevant for determining the topic of an
article. We can use WEKA’s AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try and
eliminate attributes that are not so useful. As
before we need to use the FilteredClassifier to
transform the data before it is passed to the AttributeSelectedClassifier.
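The idea of ranking attributes by information gain and keeping only the best ones can be sketched as follows. This Python code illustrates the principle and is not WEKA's InfoGainAttributeEval or Ranker: all function names are hypothetical, the parameter num_to_select merely mirrors the role of numToSelect, and the tiny dataset is invented.

```python
import math


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * math.log2(p)
    return result


def info_gain(column, labels):
    """Class entropy minus the expected class entropy after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(column):
        subset = [y for x, y in zip(column, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def top_attributes(rows, names, labels, num_to_select):
    """Rank attributes by information gain and return the best num_to_select names."""
    gains = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        gains.append((info_gain(column, labels), name))
    gains.sort(key=lambda g: (-g[0], g[1]))  # ties broken alphabetically
    return [name for _, name in gains[:num_to_select]]


# Hypothetical word-presence data: one row per document, one column per word.
names = ["corn", "prices", "the"]
rows = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
print(top_attributes(rows, names, labels, num_to_select=1))
```

In this toy example the word "corn" separates the classes perfectly, while "the" occurs in every document and therefore has zero information gain.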
Ex. 12: Experiment with this set-up, using default options for StringToWordVector
and NaiveBayesMultinomial as the classifier. Vary the number of most-informative
attributes that are selected from the infogain-based ranking by changing the value
of the numToSelect field in the Ranker.
Record the AUC values you obtain. What
number of attributes gives you the best AUC
for the two datasets above? What AUC
values are the best you manage to obtain?
(Again, an exhaustive search is not required.)
Tutorial 6: Mining Association Rules
May 5, 2011
© 2008-2012
1
Introduction
Association rule mining is one of the most prominent data mining techniques. In this tutorial, we
will work with Apriori—the association rule mining algorithm that started it all. As you will see, it
is not straightforward to extract useful information
using association rule mining.
2
Association rule mining in WEKA
In WEKA’s Explorer, techniques for association
rule mining are accessed using the Associate
panel. Because this is a purely exploratory data
mining technique, there are no evaluation options,
and the structure of the panel is simple. The default method is Apriori, which we use in this tutorial. WEKA contains a couple of other techniques
for learning associations from data, but they are
probably more interesting to researchers than practitioners.
To get a feel for how to apply Apriori, we start
by mining rules from the weather.nominal.arff
data that we used in Tutorial 1. Note that this algorithm expects data that is purely nominal: numeric attributes must be discretized first. After
loading the data in the Preprocess panel, hit
the Start button in the Associate panel to run
Apriori with default options. It outputs ten rules,
ranked according to the confidence measure given
in parentheses after each one. The number following a rule’s antecedent shows how many instances
satisfy the antecedent; the number following the
conclusion shows how many instances satisfy the
entire rule (this is the rule’s “support”). Because
both numbers are equal for all ten rules, the confidence of every rule is exactly one.
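The two numbers printed with each rule, and the confidence derived from them, can be reproduced with a short Python sketch. The function names and the five instances below are invented for illustration; they are not the weather data and this is not WEKA code.

```python
def count_matching(instances, items):
    """Number of instances that contain every attribute=value pair in items."""
    return sum(
        1 for inst in instances
        if all(inst.get(attr) == value for attr, value in items.items())
    )


def rule_stats(instances, antecedent, consequent):
    """Return (instances matching the antecedent, instances matching the
    whole rule, confidence) for the rule antecedent ==> consequent."""
    n_antecedent = count_matching(instances, antecedent)
    n_rule = count_matching(instances, {**antecedent, **consequent})
    return n_antecedent, n_rule, n_rule / n_antecedent


# Hypothetical toy data with nominal attributes only.
instances = [
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "overcast", "windy": "TRUE", "play": "yes"},
    {"outlook": "sunny", "windy": "FALSE", "play": "no"},
    {"outlook": "sunny", "windy": "TRUE", "play": "yes"},
    {"outlook": "rainy", "windy": "TRUE", "play": "no"},
]
print(rule_stats(instances, {"outlook": "overcast"}, {"play": "yes"}))
print(rule_stats(instances, {"outlook": "sunny"}, {"play": "no"}))
```

When the two counts are equal, as for the first rule, the confidence is exactly one; the second rule holds for only one of the two instances that satisfy its antecedent.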
In practice, it is tedious to find minimum support and confidence values that give satisfactory
results. Consequently WEKA’s Apriori runs the
basic algorithm several times. It uses the same user-specified minimum confidence value throughout,
given by the minMetric parameter. The support level is expressed as a proportion of the total
number of instances (14 in the case of the weather
data), as a ratio between 0 and 1. The minimum
support level starts at a certain value (upperBoundMinSupport, which should invariably be
left at 1.0 to include the entire set of instances).
In each iteration the support is decreased by a
fixed amount (delta, default 0.05, 5% of the instances) until either a certain number of rules has
been generated (numRules, default 10 rules) or
the support reaches a certain “minimum minimum” level (lowerBoundMinSupport, default
0.1—typically rules are uninteresting if they apply
to only 10% of the dataset or less). These four
values can all be specified by the user.
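The iteration described in this paragraph can be sketched in Python. This is a simplified illustration under stated assumptions, not WEKA's implementation: mine_rules is a hypothetical brute-force stand-in for one run of the basic algorithm (rules with a single item in the consequent, transactions given as sets of items), the exact stopping details in WEKA may differ, and the transactions are invented. The parameter names only mirror minMetric, upperBoundMinSupport, lowerBoundMinSupport, delta and numRules.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force stand-in for one run of the basic algorithm."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            covered = sum(1 for t in transactions if set(itemset) <= t)
            if covered / n < min_support:
                continue
            for consequent in itemset:
                antecedent = tuple(i for i in itemset if i != consequent)
                n_antecedent = sum(1 for t in transactions if set(antecedent) <= t)
                if covered / n_antecedent >= min_confidence:
                    rules.append((antecedent, consequent))
    return rules


def iterate_support(transactions, min_metric=0.9, upper=1.0, lower=0.1,
                    delta=0.05, num_rules=10):
    """Lower the minimum support step by step until enough rules are found
    or the lower bound is reached."""
    support = upper
    cycles = 0
    rules = []
    while len(rules) < num_rules and round(support - delta, 10) >= lower:
        support = round(support - delta, 10)  # decreased before each run
        cycles += 1
        rules = mine_rules(transactions, support, min_metric)
    return rules[:num_rules], support, cycles


# Hypothetical transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
]
rules, support, cycles = iterate_support(transactions, num_rules=2)
print(rules, support, cycles)
```

In this toy run the pair bread and butter occurs in 3 of 4 transactions, so the two rules between them first appear when the minimum support has dropped to 0.75, which takes five cycles.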
This sounds pretty complicated, so let us examine
what happens on the weather data. From the output in the Associator output text area, we see
that the algorithm managed to generate ten rules.
This is based on a minimum confidence level of 0.9,
which is the default, and is also shown in the output. The Number of cycles performed, which
is shown as 17, tells us that Apriori was actually
run 17 times to generate these rules, with 17 different values for the minimum support. The final
value, which corresponds to the output that was
generated, is 0.15 (corresponding to 0.15 × 14 ≈ 2
instances).
By looking at the options in the GenericObjectEditor, we can see that the initial value for
the minimum support (upperBoundMinSupport) is 1 by default, and that delta is 0.05. Now,
1 − 17 × 0.05 = 0.15, so this explains why a minimum support value of 0.15 is reached after 17 iterations. Note that upperBoundMinSupport is
decreased by delta before the basic Apriori algorithm is run for the first time.
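The arithmetic above can be verified with a short Python sketch. Exact fractions are used to avoid floating-point rounding; the variable names are illustrative and not WEKA identifiers.

```python
from fractions import Fraction

upper_bound = Fraction(1)      # default upperBoundMinSupport
delta = Fraction(5, 100)       # default delta
n_instances = 14               # size of the weather data

# The support is decreased by delta before each run, so run k uses 1 - k * delta.
supports = [upper_bound - k * delta for k in range(1, 18)]

final_support = supports[-1]
print(float(supports[0]))                    # support used in the first run
print(float(final_support))                  # support used in the 17th run
print(float(final_support * n_instances))    # corresponding number of instances
```

The last value is 2.1, which is why the text speaks of a minimum support of roughly two instances.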
Minimum confidence    Minimum support    Number of rules
0.9                   0.3
0.9                   0.2
0.9                   0.1
0.8                   0.3
0.8                   0.2
0.8                   0.1
0.7                   0.3
0.7                   0.2
0.7                   0.1
Table 1: Total number of rules for different values of minimum confidence and support
The Associator output text area also shows the
number of frequent item sets that were found,
based on the last value of the minimum support
that was tried (i.e. 0.15 in this example). We
can see that, given a minimum support of two instances, there are 12 item sets of size one, 47 item
sets of size two, 39 item sets of size three, and 6
item sets of size four. By setting outputItemSets
to true before running the algorithm, all those different item sets and the number of instances that
support them are shown. Try this.
Ex. 1: Based on the output, what is the support
of the item set
outlook=rainy humidity=normal windy=FALSE play=yes?
Ex. 2: Suppose we want to generate all rules with
a certain confidence and minimum support.
This can be done by choosing appropriate
values for minMetric, lowerBoundMinSupport, and numRules. What is the total number of possible rules for the weather
data for each combination of values in Table 1?
Apriori has some further parameters. If significanceLevel is set to a value between zero and
one, the association rules are filtered based on a
χ² test with the chosen significance level. However, applying a significance test in this context
is problematic because of the so-called “multiple
comparison problem”: if we perform a test hundreds of times for hundreds of association rules, it
is likely that a significant effect will be found just
by chance (i.e., an association seems to be statistically significant when really it is not). Also, the
χ² test is inaccurate for small sample sizes (in this
context, small support values).
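A rough calculation shows the size of the multiple comparison problem. The Python sketch below assumes that none of the tested rules reflects a real association and that the tests are independent; the second assumption does not hold for real association rules, so the numbers are only indicative. The function names and the figure of 200 rules are invented for illustration.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of tests that come out 'significant' purely by chance."""
    return num_tests * alpha


def prob_at_least_one(num_tests, alpha):
    """Probability of at least one chance finding, assuming independent tests."""
    return 1 - (1 - alpha) ** num_tests


# Hypothetical setting: 200 candidate rules tested at significance level 0.05.
print(expected_false_positives(200, 0.05))
print(prob_at_least_one(200, 0.05))
```

Under these assumptions about ten rules would pass the test by chance alone, and it is almost certain that at least one does.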
There are alternative measures for ranking rules.
As well as Confidence, Apriori supports Lift,
Leverage, and Conviction. These can be selected using metricType. More information is
available by clicking More in the GenericObjectEditor.
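For orientation, the sketch below computes the four measures from instance counts using their usual textbook definitions. It is an assumption that these match WEKA's implementation in every detail (the More button gives the authoritative description); the function name rule_metrics and the counts are invented.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Textbook definitions of four rule-ranking measures for A ==> B.

    n_antecedent: instances satisfying A
    n_consequent: instances satisfying B
    n_both:       instances satisfying A and B
    """
    p_a = n_antecedent / n_total
    p_b = n_consequent / n_total
    p_ab = n_both / n_total
    confidence = p_ab / p_a
    lift = confidence / p_b            # 1 means A and B are independent
    leverage = p_ab - p_a * p_b        # 0 means A and B are independent
    if n_both == n_antecedent:
        conviction = float("inf")      # the rule is never violated
    else:
        conviction = (1 - p_b) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical counts, not taken from the weather data.
print(rule_metrics(n_total=20, n_antecedent=10, n_consequent=8, n_both=6))
```

Unlike confidence, the other three measures take into account how common the consequent is on its own, so a rule whose consequent is true for almost every instance does not automatically score well.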
Ex. 3: Run Apriori on the weather data with
each of the four rule ranking metrics, and
default settings otherwise. What is the top-ranked rule that is output for each metric?
3
Mining a real-world dataset
Now consider a real-world dataset, vote.arff,
which gives the votes of 435 U.S. congressmen on
16 key issues gathered in the mid-80s, and also includes their party affiliation as a binary attribute.
This is a purely nominal dataset with some missing values (actually, abstentions). It is normally
treated as a classification problem, the task being
to predict party affiliation based on voting patterns. However, we can also apply association rule
mining to this data and seek interesting associations. More information on the data appears in
the comments in the ARFF file.
Ex. 4: Run Apriori on this data with default settings. Comment on the rules that are generated. Several of them are quite similar. How
are their support and confidence values related?
Ex. 5: It is interesting to see that none of
the rules in the default output involve
Class=republican. Why do you think that
is?
4
Market basket analysis
A popular application of association rule mining is
market basket analysis—analyzing customer purchasing habits by seeking associations in the items
they buy when visiting a store. To do market basket analysis in WEKA, each transaction is coded
as an instance whose attributes represent the items
in the store. Each attribute has only one value: if a
particular transaction does not contain it (i.e., the
customer did not buy that particular item), this is
coded as a missing value.
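The coding scheme can be illustrated by generating a small ARFF file from a list of baskets. The Python sketch below is an assumption-laden example: the function name baskets_to_arff, the relation name, the items, and the choice of t as the single attribute value are all made up for illustration, and item names containing spaces or special characters would need quoting, which is not handled here.

```python
def baskets_to_arff(relation, baskets):
    """Encode each basket as one instance: 't' if the item was bought,
    '?' (missing value) if it was not."""
    items = sorted({item for basket in baskets for item in basket})
    lines = ["@relation " + relation, ""]
    for item in items:
        lines.append("@attribute " + item + " {t}")  # single-valued attribute
    lines += ["", "@data"]
    for basket in baskets:
        lines.append(",".join("t" if item in basket else "?" for item in items))
    return "\n".join(lines)


# Hypothetical transactions.
baskets = [{"bread", "milk"}, {"milk"}, {"bread", "butter"}]
print(baskets_to_arff("toy_baskets", baskets))
```

Coding absent items as missing rather than as a second value means that only purchased items form item sets, so the rules are about what was bought and not about what was left out.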
Your job is to mine supermarket checkout data for
associations. The data in supermarket.arff was
collected from an actual New Zealand supermarket. Take a look at this file using a text editor
to verify that you understand the structure. The
main point of this exercise is to show you how difficult it is to find any interesting patterns in this
type of data!
Ex. 6: Experiment with Apriori and investigate
the effect of the various parameters discussed
above. Write a brief report on your investigation and the main findings.