Development of a representative image detector for
digital newspapers
O. Malingre-Pérez
Higher Technical School of Computer Engineering, University of Vigo
As Lagoas Campus, 32004 Ourense, Spain
[email protected]
Abstract. Relating images to the content of a web page is quite difficult due to
the number of images present in any web site. As images we consider any icon,
background, banner, ad, picture, thumbnail, etc. This paper addresses the
image classification problem in order to determine the most relevant image
inside a given web page. This work analyses the images in order to extract their
features, selects the most suitable ones and finally classifies them. We have used
several selection algorithms in order to optimize the classification process. In
addition, we have experimented with different numbers of attributes to determine
whether this is a relevant analysis factor. Furthermore, a set of classifiers has been
used and compared according to performance requirements. The three factors are
evaluated through kappa and accuracy to measure their relevance. The results
show the best selection method to combine with the best classifier and
determine the optimal number of features.
Keywords: image feature selection, number of features, image classification
features.
1 Introduction and Motivation
Nowadays people get in touch with each other by using Internet social networks,
such as 'Facebook' [1], 'Twitter' [2] or 'Linked-in' [3], as well as derivative services,
such as 'Paper.li' [4], a personal on-line newspaper. There are also multiple social
groups with their own Internet social network, such as the University of Vigo with
ESEINET [5].
Using these services implies 'content sharing', including links to external web
pages, typically news, articles and blog entries, among others. These resources usually
contain attached graphical information, which may or may not be related to the
textual content. In this sense, the images embedded in the shared documents can be
divided into the following groups according to their purpose: (i) decorative images,
those included to ornament the web page, like the background, image header, icons or
button images; (ii) advertising images that link to other web pages, usually
called banners; and finally (iii) content-related images, which are deliberately
inserted by the author of the content. The main problem is how to differentiate between
these images, since all of them are represented identically in HTML code. All graphic
objects are inserted using the IMG (image) tag [6,7], so it is difficult to identify the
relevant images.
This paper addresses the image classification problem in order to determine the
most relevant image inside a given web page. The starting point of this work is
previous research on ranking the features of web images [8]. We will study the accuracy
of classification algorithms, as well as of feature selection techniques, over an image
dataset containing 2300 instances coming from several web pages, including blogs,
magazines, newspapers, etc. Those images are represented with 210 HTML and EXIF
(Exchangeable image file format) [9] attributes. In this sense, we will (i) apply feature
selection techniques [10] to pick the best features for determining the relevance
of an image and (ii) compare several classifiers in order to establish which one
solves this problem most accurately.
The present work is structured as follows. First, Section 2 introduces the related
work in image classification. Section 3 describes the experimental setup used to carry
out the tests, while Section 4 presents and discusses the experimental results obtained.
Finally, Section 5 presents the conclusions and further work of the present study.
2 Related work
There are many works using classification methods applied to many different areas.
We can find applications in biology, for example classifying proteins [11], as well as in
anthropological studies, medical or psychological diagnosis, systems modelling and
other subjects. All these classification studies share the same difficulty: they
intend to solve complex classification problems with large data sets.
Classification methods can be divided into the following groups: (i) statistical
processes, where the classification is made by considering the statistical features of
the data and where we can find Bayesian-based classifiers; (ii) linear classifiers; (iii)
instance-based learners; (iv) decision trees, where the classifier looks for promising
attributes to iteratively split the instances into groups based on decision branches;
and (v) Artificial Intelligence-based techniques.
As an example of image classification we can point to interesting research on image
spam filtering techniques [12], where the authors use pattern recognition and image
classification to detect whether an image is spam.
Another work shows that support vector machines (SVM) can generalize well on
difficult image classification problems where the only features are high-dimensional
histograms [13].
Other related works use image classification to build indices and classify images
according to a semantic meaning [14].
3 Experimental setup
The proposed study implements a workflow comprising several steps, including (i)
analysis and image feature extraction, (ii) feature selection and (iii) classification, as
shown in Figure 1. Due to the size of the main problem, we started from a
consolidated dataset, optimized in an explicit feature-ranking research work [8].
This section describes how we have performed (i) the evaluation of the selected
classification techniques, (ii) the study of the best number of features to build up the
testing process, (iii) the execution of the classification methods, to learn their
behaviour over the dataset, and (iv) the comparison of results, by means of the
ANOVA-Tukey test [15].
[Figure: images coming from web pages are preprocessed and 210 HTML and EXIF
features are extracted per image; feature selection reduces the dataset to 50 features,
from which subsets of different sizes are fed to the classifiers under
5 x (cross-validation = 10), producing the results.]
Figure 1. Experiment workflow
3.1 Feature selection
As mentioned before, we use a dataset obtained from previous research on
feature extraction from web page images. We will focus on studying the
behaviour of the model-evaluation methods.
Most datasets contain a lot of non-relevant information, and sometimes the data
are inconsistent or irrelevant. The aims of feature selection are to: (i) reduce the
training dataset by deleting irrelevant features, (ii) improve the quality of the model
and (iii) reduce the problem dimension.
The feature selection [16] methods used are: (i) the ReliefF algorithm (ReF) [17,18];
(ii) Chi-Squared (Chi) [19]; (iii) Information Gain (InG) [20,21]; and (iv) Information
Gain Ratio (GaR) [22]. In addition, we have tested the model without using any
feature selection algorithm; this is referred to as No-Selection (NOS). Table 1
summarizes the algorithms used.
Table 1: Feature selection algorithms

Selection Algorithm            Concept
ReliefF (ReF)                  Evaluates the worth of an attribute by repeatedly
                               sampling an instance and considering the value of the
                               given attribute for the nearest instance of the same
                               and of a different class.
Chi-Squared (Chi)              Evaluates the worth of an attribute by computing the
                               value of the chi-squared statistic with respect to the
                               class.
Information Gain (InG)         Evaluates the worth of an attribute by measuring the
                               information gain with respect to the class.
Information Gain Ratio (GaR)   Evaluates the worth of an attribute by measuring the
                               gain ratio with respect to the class.
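The ReliefF idea summarized in Table 1 can be sketched as follows. This is a simplified binary, single-nearest-neighbour variant written for illustration; the function name and the per-feature range normalisation are our own choices, not the exact algorithm of [17,18]:

```python
import math

def relieff_weights(X, y, n_samples=None):
    """Simplified binary ReliefF: for each sampled instance, penalize features
    that differ from the nearest hit (same class) and reward features that
    differ from the nearest miss (different class)."""
    n, m = len(X), len(X[0])
    w = [0.0] * m

    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Per-feature value ranges, used to normalise feature differences.
    ranges = [max(col) - min(col) or 1.0 for col in zip(*X)]

    samples = range(n) if n_samples is None else range(min(n_samples, n))
    for i in samples:
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        nh = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        nm = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(m):
            w[f] -= abs(X[i][f] - X[nh][f]) / ranges[f] / n
            w[f] += abs(X[i][f] - X[nm][f]) / ranges[f] / n
    return w
```

Features whose values separate the classes accumulate positive weights, while features that are constant or class-independent stay near zero, which is why ReliefF favours discriminative attributes.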
In order to compare the selection methods, we have included a scoreboard with the
results of executing the algorithms over the training dataset; it is presented in
Section 4.
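The information-gain and gain-ratio criteria of Table 1 admit a compact sketch for discrete-valued features. This is an illustrative pure-Python implementation under our own naming, not the exact code of the selection tool used in the experiments:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the entropy remaining after splitting the
    instances by the (discrete) feature value."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(feature_values, labels):
    """Information gain divided by the split information, which penalises
    features with many distinct values."""
    split_info = entropy(feature_values)
    return information_gain(feature_values, labels) / split_info if split_info else 0.0
```

A feature that perfectly predicts a balanced binary class has an information gain of 1 bit, while a constant feature scores 0; ranking features by these scores and keeping the top k is the selection step described above.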
3.2 Number of features
Starting from a dataset with a reduced number of features, we still have to optimize
that number, since it is one of the important issues when estimating feature relevance.
In order to establish this number of features, we have run the training with samples
of different sizes, and the results are shown in a table.
We have tried different numbers of features to determine which option achieves the
highest performance with the classification methods. We will see that this factor must
be considered in order to obtain better results.
3.3 Classification
Classification methods [23] are highly domain-dependent, and it is often not possible
to apply all known classifiers to the problem under study. We have chosen the
algorithms that best fit the nature of the present problem, including (i)
probabilistic models, (ii) instance-based learning, (iii) decision trees, (iv) Support
Vector Machines (SVM), (v) boosting and (vi) decision tables.
As probabilistic models, we have chosen Bayesian networks (BaN) [24], Naïve
Bayes (NaB) [25], Naïve Bayes Simple (NBS) and Naïve Bayes Kernel (NBK). All of
these algorithms are based on Bayes' theorem and on the variable-independence
assumption.
As the instance-based learning algorithm we chose the K-nearest neighbours (KNN)
algorithm [26]. The training stage just consists of storing the training samples. To
classify a new case, its K nearest neighbours (i.e. the K samples most similar to it in
terms of the concrete problem) are selected. The most frequent class among these K
neighbours is the class predicted for the new case.
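The prediction step just described can be sketched in a few lines. This is a minimal illustration with Euclidean distance; the class labels in the usage below are hypothetical examples, not data from the experiments:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority class among the k stored training samples
    closest to the query (Euclidean distance)."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples labelled by image type:
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = ["decorative", "decorative", "content", "content"]
```

There is no model-building phase: all the cost is paid at query time, which is consistent with the description above that training merely stores the samples.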
As decision tree learning algorithms, we chose C4.5 [27,28] and Random Forest
[29]. A Random Forest is an ensemble classifier that consists of many decision trees,
where the output class is a combination of the output classes of the individual
trees. Each decision tree is built by recursively partitioning the training set
with the aim of maximizing the class homogeneity of the resulting subsets; the
selected variable is the one that ensures the maximal reduction of class heterogeneity.
In the case of C4.5, this measure is the entropy.
Regarding Support Vector Machines (SVMs) [30], we chose Sequential Minimal
Optimization (SMO) [31], a kind of SVM where the optimization step is streamlined,
as well as two further SVMs: one using a radial basis function kernel (SVR) and the
other using a linear kernel (SVL).
The boosting algorithm we have chosen is AdaBoost M1 (AdB) [32]. It is a machine
learning meta-algorithm that can be used in conjunction with many other learning
algorithms to improve their performance. This algorithm is adaptive in the sense that
the classifiers built subsequently are tweaked in favour of those instances
misclassified by previous classifiers.
Finally, we use decision tables (DeT) [33]. This algorithm is based on choosing
attribute subsets one by one, adding to each set the attributes that have not yet been
included. The algorithm tests the subset accuracy using cross-validation or the
'leave-one-out' error estimation method.
Figure 2 shows the workflow of the classification tasks.
[Figure: the 50-feature dataset is fed to the classifier groups: probabilistic models
(BaN, NaB, NBS, NBK), instance-based learning (K nearest neighbours), decision trees
(Random Forest, C4.5), linear classifiers (SMO, SVL, SVR), boosting (AdaBoost) and
decision tables; kappa and accuracy performance results are obtained under
5 x (cross-validation = 10).]
Figure 2. Classification workflow
In order to measure the accuracy of each classifier we can use two indicators: (i)
the percentage of correct classifications (accuracy) and (ii) Cohen's Kappa statistic.
The Kappa statistic [34] is a chance-corrected measure of agreement between the
classifications and the true classes. Its value is calculated with the following
expression:

    Kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))                                (1)

where Pr(a) is the percentage agreement (for example, between the classifier and the
ground truth) and Pr(e) is the chance agreement. Kappa = 1 indicates perfect
agreement, whereas Kappa = 0 indicates chance agreement.
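Expression (1) can be computed directly from the two label sequences. The following is a minimal Python illustration; the function name and the handling of the degenerate single-class case are our own:

```python
def cohens_kappa(y_true, y_pred):
    """Kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)): Pr(a) is the observed
    agreement, Pr(e) the agreement expected by chance from the marginal
    class frequencies of both label sequences."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    pr_a = sum(t == p for t, p in zip(y_true, y_pred)) / n
    pr_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    if pr_e == 1.0:  # degenerate case: one class everywhere
        return 1.0 if pr_a == 1.0 else 0.0
    return (pr_a - pr_e) / (1 - pr_e)
```

A classifier that merely matches the chance agreement gets Kappa = 0 even if its raw accuracy looks high, which is exactly why kappa complements accuracy on imbalanced image datasets.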
3.4 Test ANOVA-Tukey
The ANOVA-Tukey procedure is designed to determine which factor means, amongst
a set of factors, present statistically significant differences. ANOVA (analysis of
variance) [35] tries to find the factors that affect classifier performance, as well as
the existence of interactions between these factors; that is, it determines whether
changing one factor alters the effect of another.
The Tukey test is executed after ANOVA to compare all possible pairs of means. It
is a method to compute the probability that two factor levels share the same mean,
given their means and variances (null hypothesis). This test tries to determine the
value of each factor that yields the best performance.
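To make the ANOVA step concrete, the sketch below computes the one-way F statistic for a single factor with several levels. This is an illustrative pure-Python version of the variance decomposition, not the multi-factor procedure actually run in the experiments:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: the between-group mean square divided by
    the within-group mean square. A large F (hence a small p-value) means
    the factor levels differ more than chance variation would explain."""
    k = len(groups)                       # number of factor levels
    n = sum(len(g) for g in groups)       # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    ms_between = ss_between / (k - 1)     # Df = k - 1
    ms_within = ss_within / (n - k)       # Df = n - k
    return ms_between / ms_within
```

In the tables of Section 4, each factor (classifier, selection method, number of features) plays the role of `groups`, and the huge F-values reported there correspond to p-values far below 0.05.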
4 Experimental results
In this section we show the results of the statistical tests applied over the dataset,
comparing the results by pairs of factor values (as described in Subsection 3.1,
'Feature selection').
In order to evaluate the classifiers, we have executed five cross-validations with ten
folds each; that is, the cross-validation [36] is repeated five times over ten folds
(5 x 10 stratified cross-validation).
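The stratified splitting used by this protocol can be sketched as follows; this is an illustrative index-splitting helper under our own naming, not the exact implementation of the experimental tool:

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split instance indices into n_folds folds that preserve the class
    proportions of `labels`, as in stratified cross-validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)                 # randomise within each class
        for pos, idx in enumerate(indices):  # deal indices round-robin
            folds[pos % n_folds].append(idx)
    return folds
```

Running this with five different seeds and, for each seed, training on nine folds while testing on the remaining one, reproduces the 5 x 10 protocol described above.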
The goal is to obtain the best-fitting value of each factor when applied over the
dataset. The tables score each classifier in comparison with the other methods used.
In addition, factor values appear organized in descending order of their capacity for
evaluating image features.
Table 2 shows the results provided by the ANOVA-Tukey test. This test allows us to
see (i) for each parameter, whether there are significant differences among its possible
values and (ii) whether there are interactions between parameters. The purpose of this
test is to detect parameters that impact the classification significantly, as well as
possible interactions among them.
The ANOVA test reveals that all factors (Feature Selection, Num. Features,
Classifier) influence the classification performance, because the resulting p-value
is under 0.05.
Table 2: ANOVA-Tukey

FACTORS
FACTOR                                     Df   Sum Sq   Mean Sq   F-value    p-value
classifier                                 11   886.07   80.552    8727.311   <2.2e-16
featureSelection                            4    30.20    7.549     817.891   <2.2e-16
numFeatures                                 7    32.29    4.631     499.747   <2.2e-16

INTERACTIONS
FACTOR                                     Df   Sum Sq   Mean Sq   F-value    p-value
classifier:featureSelection                44   122.82    2.791     302.425   <2.2e-16
classifier:numFeatures                     77   197.98    2.571     278.570   <2.2e-16
featureSelection:numFeatures               21    46.77    2.227     241.289   <2.2e-16
classifier:featureSelection:numFeatures   231    52.60    0.228      24.672   <2.2e-16
Table 3 shows the comparison of the feature selection methods applied to the whole
dataset. As can be seen, the performance of ReliefF is better than that of the other
feature selection methods in all cases; conversely, not selecting any features (NOS)
yields the worst valuation. ReliefF evaluates an attribute by repeatedly sampling an
instance and considering the attribute value obtained from the closest instance of the
same class and of a different one.
It is also interesting to note that applying the Information Gain Ratio algorithm
performs similarly to not using any selection method.
Table 3: Feature selection comparative

                              Feature Selection
Feature Selection   ReF       Chi       InG       GaR       NOS       TOTAL
ReF                  -         0.0603    0.0747    0.1063    0.1126    0.0885
Chi                 -0.0603    -         0.0145    0.0460    0.0523    0.0131
InG                 -0.0747   -0.0145    -         0.0316    0.0379   -0.0049
GaR                 -0.1063   -0.0460   -0.0316    -         0.0063   -0.0444
NOS                 -0.1126   -0.0523   -0.0379   -0.0063    -        -0.0523
TOTAL               -0.0885   -0.0131    0.0049    0.0444    0.0523    0.0000
The factor values represented are the following: (i) the ReliefF algorithm (ReF), (ii)
Chi-Squared (Chi), (iii) Information Gain (InG), (iv) Information Gain Ratio (GaR)
and (v) No-Selection (NOS).
Each cell contains the mean performance difference between the row method and the
column method according to the Tukey test; positive cells report better performance
in the row than in the column.
Table 4 presents the performance comparison of using different numbers of
features.
We observe the best results with a small number of attributes, while efficiency is
lost with larger attribute sets.
Table 4: Feature number comparative

                              Number of Attributes
Number of
Attributes   5        10       15       ALL      25       35       20       30       40       TOTAL
5             -        0.049    0.084    0.090    0.115    0.115    0.116    0.119    0.120    0.101
10           -0.049    -        0.034    0.041    0.066    0.066    0.067    0.069    0.071    0.046
15           -0.084   -0.034    -        0.006    0.031    0.032    0.032    0.035    0.037    0.007
ALL          -0.090   -0.041   -0.006    -        0.025    0.025    0.026    0.029    0.031    0.000
25           -0.115   -0.066   -0.031   -0.025    -        0.000    0.001    0.004    0.005   -0.029
35           -0.115   -0.066   -0.032   -0.025    0.000    -        0.001    0.003    0.005   -0.029
20           -0.116   -0.067   -0.032   -0.026   -0.001   -0.001    -        0.003    0.005   -0.029
30           -0.119   -0.069   -0.035   -0.029   -0.004   -0.003   -0.003    -        0.002   -0.033
40           -0.120   -0.071   -0.037   -0.031   -0.005   -0.005   -0.005   -0.002    -      -0.034
TOTAL        -0.101   -0.046   -0.007    0.000    0.029    0.029    0.029    0.033    0.034    0.000
As we can see in the previous table, the best number of features is five. It is
significant that using all attributes gives better results than using more than 25.
As previously mentioned in Subsection 3.3, Table 5 shows a comparison of the
classifiers' behaviour over the dataset.
Table 5: Classifiers comparative

                                              Classifier
Classifier  KNN     RaF     SMO     C4.5    DeT     AdB     SVL     SVR     BaN     NBK     NaB     NBS     TOTAL
KNN          -       0.001   0.008   0.017   0.055   0.089   0.165   0.210   0.469   0.472   0.517   0.527   0.230
RaF         -0.001   -       0.006   0.016   0.054   0.088   0.164   0.209   0.467   0.471   0.516   0.526   0.229
SMO         -0.008  -0.006   -       0.010   0.047   0.081   0.158   0.202   0.461   0.465   0.510   0.519   0.222
C4.5        -0.017  -0.016  -0.010   -       0.037   0.071   0.148   0.193   0.451   0.455   0.500   0.510   0.211
DeT         -0.055  -0.054  -0.047  -0.037   -       0.034   0.111   0.155   0.414   0.417   0.463   0.472   0.170
AdB         -0.089  -0.088  -0.081  -0.071  -0.034   -       0.077   0.121   0.380   0.383   0.429   0.438   0.133
SVL         -0.165  -0.164  -0.158  -0.148  -0.111  -0.077   -       0.045   0.303   0.307   0.352   0.362   0.050
SVR         -0.210  -0.209  -0.202  -0.193  -0.155  -0.121  -0.045   -       0.259   0.262   0.307   0.317   0.001
BaN         -0.469  -0.467  -0.461  -0.451  -0.414  -0.380  -0.303  -0.259   -       0.004   0.049   0.058  -0.281
NBK         -0.472  -0.471  -0.465  -0.455  -0.417  -0.383  -0.307  -0.262  -0.004   -       0.045   0.055  -0.285
NaB         -0.517  -0.516  -0.510  -0.500  -0.463  -0.429  -0.352  -0.307  -0.049  -0.045   -       0.010  -0.334
NBS         -0.527  -0.526  -0.519  -0.510  -0.472  -0.438  -0.362  -0.317  -0.058  -0.055  -0.010   -     -0.345
TOTAL       -0.230  -0.229  -0.222  -0.211  -0.170  -0.133  -0.050  -0.001   0.281   0.285   0.334   0.345   0.000
The classifiers used are the following: (i) Instance-Based KNN (KNN), (ii) Random
Forest (RaF), (iii) Sequential Minimal Optimization (SMO), (iv) the C4.5 algorithm
(C4.5), (v) Decision Tables (DeT), (vi) AdaBoost M1 (AdB), (vii) the SVL algorithm
(SVL), (viii) the SVR algorithm (SVR), (ix) Bayes Net (BaN), (x) Naive Bayes Kernel
(NBK), (xi) Naive Bayes (NaB) and (xii) Naive Bayes Simple (NBS).
Looking carefully at the table, we can see that the best results are obtained by
knowledge-based algorithms like kNN, decision trees (RaF, C4.5) and the support
vector machine (SMO). However, the worst results come from the probabilistic
methods: Bayes Net, Naïve Bayes, Naive Bayes Simple and Naive Bayes Kernel.
Nevertheless, an SVM achieves acceptable results when it is optimized; it would be
interesting to study optimized versions of the linear and RBF SVMs.
Table 6 shows the Kappa and Accuracy values. As we can see, the KNN algorithm
achieves better kappa and accuracy averages than the NBS algorithm.
Table 6: Average and standard deviation for kappa statistic and accuracy values

                  Kappa Statistic               Accuracy
Classifier    Average   Standard Dev.      Average   Standard Dev.
AdB           0.8072    0.1047             98.84%    0.57%
BaN           0.4275    0.3031             87.28%    9.82%
C4.5          0.8786    0.1000             99.24%    0.57%
DeT           0.8413    0.0938             99.00%    0.58%
KNN           0.8960    0.0800             99.34%    0.49%
NaB           0.3787    0.2352             86.90%    8.79%
NBK           0.4239    0.2683             88.31%    8.79%
NBS           0.3691    0.2413             86.05%    9.33%
RaF           0.8948    0.0802             99.34%    0.47%
SMO           0.8885    0.0942             99.27%    0.57%
SVL           0.7306    0.2320             98.55%    0.91%
SVR           0.6861    0.1364             98.38%    0.59%
This table confirms the results obtained in the Tukey test, where the KNN classifier
was identified as the best one.
Figure 3 shows a bar diagram with the average 'Kappa' value for each classifier. In
addition, we have added error bars to give a general idea of how accurate a
measurement is, or conversely, how far from the reported value the true (error-free)
value might be.
Figure 3. Kappa Statistic Average
The error bars represent the standard deviation. This representation helps to see
which methods are better: the Bayes methods are more volatile because they have a
bigger deviation.
5 Conclusions and Further work
In this work we have studied the influence of three factors when classifying over a
dataset of image features: (i) the selection method, (ii) the number of features and
(iii) the classifier.
The best selection method is ReliefF, because it considers the value of the selected
attribute for the nearest instance of the same and of a different class.
In addition, the best classifiers are based on instance learning (kNN) and on trees
(Random Forest). They are useful in combination with the ReliefF selection method.
As regards the number of features, the better option is to choose a low number, but
it is remarkable that taking all attributes is better than using 25, 30, etc. It may be
that the selection method fails when choosing some attribute, so that it does not
select a relevant instance (image).
References
1. Facebook (2012) http://www.facebook.com. Accessed February 2012
2. Twitter (2012) http://www.twitter.com. Accessed February 2012
3. Linkedin Corporation (2012) http://www.linkedin.com. Accessed February 2012
4. SmallRivers (2012) http://paper.li. Accessed March 2012
5. Eseinet (2012) http://www.eseinet.es. Accessed March 2012
6. W3Schools (1999-2012) HTML img tag. http://www.w3schools.com/tag_img.asp. Accessed February 2012
7. Wikipedia (2012) HTML. http://es.wikipedia.org/wiki/HTML. Accessed February 2012
8. Monzoncillo-Barreiros M. I. (2012) Characterization study of images and text for the development of a representative image detector in published news by digital newspapers
9. Wikipedia (2012) EXIF. http://es.wikipedia.org/wiki/Exchangeable_image_file_format. Accessed February 2012
10. Molina López J. M., García Herrero J. (2006) Técnicas de Análisis de Datos. Aplicaciones prácticas utilizando Microsoft Excel y Weka
11. Tsuda K., Shin H., Schölkopf B. (2005) Fast protein classification with multiple networks
12. Biggio B., Fumera G., Pillai I., Roli F. (2011) A survey and experimental evaluation of image spam filtering techniques
13. Chapelle O. (1999) Support vector machines for histogram-based image classification
14. Vailaya A., Jain A., Zhang H. J. (1998) On image classification: city images vs. landscapes
15. Benjamini Y., Braun H., John W. (2002) Tukey's Contributions to Multiple Comparisons
16. Guyon I. (2003) An Introduction to Variable and Feature Selection
17. Kononenko I. (1994) Estimating attributes: Analysis and extensions of RELIEF
18. Sun Y., Wu D. A RELIEF Based Feature Extraction Algorithm
19. Wolfram Research, Inc. (2009) Chi-squared distribution. http://mathworld.wolfram.com/Chi-SquaredDistribution.html. Accessed March 2012
20. Wikipedia (2012) Information gain in decision trees. http://en.wikipedia.org/wiki/Information_gain_in_decision_trees. Accessed March 2012
21. Pawling A., Chawla N. V., Chaudhary A. (2005) Computing Information Gain in Data Streams
22. Wikipedia (2012) Information gain ratio. http://en.wikipedia.org/wiki/Information_gain_ratio. Accessed March 2012
23. Mitchell T. M. (1997) Machine Learning
24. Pearl J. (1985) Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning. Proceedings of the 7th Conference of the Cognitive Science Society. University of California
25. John G. H., Langley P. (1995) Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers
26. Aha D. W., Kibler D., Albert M. K. (1991) Instance-Based Learning Algorithms. Machine Learning
27. Quinlan R. (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo
28. Ruggieri S. (2002) Efficient C4.5
29. Breiman L. (2001) Random Forests. Machine Learning
30. Platt J. (1999) Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods – support vector learning. MIT Press, Cambridge, MA, USA
31. Platt J. (1998) Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning
32. Falaki H. AdaBoost Algorithm. Computer Science Department, University of California
33. Kohavi R. (1995) The Power of Decision Tables. In Proc. European Conference on Machine Learning
34. Ben-David A. (2008) Comparison of Classification Accuracy using Cohen's Weighted Kappa. Expert Systems with Applications
35. Wikipedia (2012) Analysis of variance. http://en.wikipedia.org/wiki/Analysis_of_variance. Accessed March 2012
36. Kohavi R. (2006) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Computer Science Department, Stanford University
Appendix A: Summary
1 Introduction
Nowadays, the fastest way to stay in touch is to use social networks such as
Facebook [], Twitter [] or Linked-in [], or even to keep a personal digital newspaper
such as Paper.li [], or other private services such as ESEINET, which belongs to the
University of Vigo.
Through these services, information is shared via links, news items or blog entries,
among others. They usually contain text and images, and the classification of these
images is the focus of this study. In general terms, the images can be classified
according to their purpose as (i) decorative, which includes icons, backgrounds,
headers or footers, buttons, etc.; (ii) advertisements, which usually link to other web
pages and are commonly known as banners; and (iii) images inserted by the author of
the main content.
This work is oriented towards this classification in order to determine which of all
the images in a web page are representative of the main content.
2 Development process
The starting point is a previous feature selection study from which data about the
most representative attributes of the images in any web page are taken. These
attributes were extracted from HTML information and EXIF metadata, leaving a
dataset of 2300 instances (images) with 50 attributes each. All the images came from
a selection of websites with different contents (blogs, news, governmental
organizations, etc.).
A new feature selection process was carried out in order to determine which
features are relevant when identifying a representative image. For this purpose,
different selection methods were applied: (i) ReliefF, (ii) Chi-squared, (iii)
Information Gain and (iv) Information Gain Ratio. They were also compared with the
results obtained when no selection method is used.
The above methods were run with different numbers of features; in this way, groups
of 5, 10 and so on, up to the total number of attributes, were tested. The attributes
selected in this way were then passed to different classification algorithms.
The classification algorithms tested are of different types: (i) learning-based, (ii)
decision trees, (iii) support vector machines, (iv) boosting algorithms, (v) Bayesian
algorithms and (vi) decision tables.
3 ANOVA-Tukey test
The ANOVA test is run to determine which factors influence the classification and
how these factors interact with each other. Tukey is then executed to find the value of
each factor with which the best performance is achieved.
4 Results obtained
To evaluate the classifiers, 5 cross-validations with 10 folds were executed.
The ANOVA test indicates that all the factors are necessary (attribute selection,
number of attributes and classifier).
As selection method, the best result is obtained with the ReliefF algorithm.
Regarding the number of attributes, the best option is to select few of them,
although with more than 15 the best option is to select all of them instead of making
further partitions.
Regarding the classification, the best methods are based on learning (kNN) and on
trees (RaF).
5 Conclusions
The best methods are those based on learning, since they use the knowledge of the
neighbouring elements to determine the importance of an attribute (the ReliefF
selection algorithm and the kNN classification algorithm).
The number of attributes is important when performing the classification. In any
case, it is worth noting that, once 15 attributes have been selected, the next best
option is to select all of them (52 features), which is better than using an
intermediate number.
