Development of a representative image detector for digital newspapers

O. Malingre-Pérez
Higher Technical School of Computer Engineering, University of Vigo
As Lagoas Campus, 32004 Ourense, Spain
[email protected]

Abstract. Relating images to the content of a web page is difficult because of the number of images present in any web site; by image we mean any icon, background, banner, advertisement, picture, thumbnail, etc. This paper addresses the image classification problem of determining the most relevant image inside a given web page. The work analyses the images to extract their features, selects the most suitable ones and finally classifies them. We have used several selection algorithms to optimise the classification process, and we have also experimented with different numbers of attributes to determine whether this is a relevant factor. Furthermore, a set of classifiers has been applied and compared according to performance requirements. The three factors are evaluated through the kappa statistic and accuracy. The results show the best selection method to combine with the best classifier, and determine the optimal number of features.

Keywords: image feature selection, number of features, image classification.

1 Introduction and Motivation

Nowadays people keep in touch with each other through Internet social networks, such as Facebook [1], Twitter [2] or LinkedIn [3], as well as derived services, such as Paper.li [4], a personal on-line newspaper. There are also multiple social groups with their own networks, such as the University of Vigo with ESEINET [5]. Using these services implies content sharing, including links to external web pages, typically news, articles and blog entries, among others. These resources usually contain attached graphical information, which may or may not be related to the textual content.
In this sense, the images embedded in shared documents can be divided into the following groups according to their purpose: (i) decorative images, included to ornament the web page, such as backgrounds, header images, icons or buttons; (ii) advertising images, which link to other web pages and are usually called banners; and (iii) content-related images, deliberately inserted by the author of the content. The main problem is how to differentiate between these images, since all of them are represented identically in HTML code: every graphic object is inserted using the IMG tag [6, 7], so it is difficult to identify the relevant ones. This paper addresses the image classification problem of determining the most relevant image inside a given web page. The starting point of this work is previous research on feature ranking for web images [8]. We study the accuracy of classification algorithms as well as feature selection techniques over an image dataset containing 2300 instances coming from several kinds of web pages, including blogs, magazines and newspapers. Those images are represented with 210 HTML and EXIF (Exchangeable image file format) [9] attributes. In this sense, we will (i) apply feature selection techniques [10] to pick the best features for determining the relevance of an image, and (ii) compare several classifiers to establish which one solves this problem most accurately.

The present work is structured as follows. Section 2 introduces related work in image classification. Section 3 describes the experimental setup used to carry out the tests, while Section 4 presents and discusses the experimental results. Finally, Section 5 presents the conclusions and further work of the present study.

2 Related work

There are many works applying classification methods to many different areas.
We can find applications in biology (for example, classifying proteins [11]), anthropological studies, medical or psychological diagnosis, systems modelling and other subjects. All these classification studies share the same difficulty: they try to solve complex classification problems with large data sets. Classification methods can be divided into the following groups: (i) statistical processes, where classification is based on the statistical features of the data and where we find Bayesian classifiers; (ii) linear classifiers; (iii) instance-based learners; (iv) decision trees, where the classifier looks for promising attributes to iteratively split the instances into groups through decision branches; and (v) Artificial Intelligence-based techniques. As an example of image classification there is interesting research on image spam filtering techniques [12], where the authors use pattern recognition and image classification to detect whether an image is spam. Another work shows that support vector machines (SVMs) can generalise well on difficult image classification problems where the only features are high-dimensional histograms [13]. Other related works use image classification to build indices and classify images according to their semantic meaning [14].

3 Experimental setup

The proposed study implements a workflow comprising several steps: (i) analysis and image feature extraction, (ii) feature selection and (iii) classification, as shown in Figure 1. Owing to the size of the main problem, we started from a consolidated dataset, optimised in an explicit feature-ranking research work [8]. This section describes how we have performed (i) the evaluation of the selected classifier techniques, (ii) the study of the best number of features for the testing process, (iii) the execution of the classification methods, to observe their behaviour over the dataset, and (iv) the comparison of results by means of the ANOVA-Tukey test [15].
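The paper does not detail how step (i), the extraction of HTML-level image features, was implemented. As a rough stdlib-only sketch of the kind of raw attributes involved (the field names below are illustrative, not the paper's 210-attribute schema), one could walk the IMG tags of a page like this:

```python
from html.parser import HTMLParser

class ImgFeatureExtractor(HTMLParser):
    """Collects one raw feature record per IMG tag found in a page."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag == "img":
            a = dict(attrs)
            self.images.append({
                "src": a.get("src", ""),
                "width": int(a["width"]) if a.get("width", "").isdigit() else None,
                "height": int(a["height"]) if a.get("height", "").isdigit() else None,
                "has_alt": bool((a.get("alt") or "").strip()),
                "inside_link": self._link_depth > 0,  # banner-like images often link out
            })

    def handle_endtag(self, tag):
        if tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

page = ('<html><body>'
        '<a href="http://ads.example.com"><img src="banner.gif" width="468" height="60"></a>'
        '<img src="photo.jpg" width="640" height="480" alt="Mayor opens the new bridge">'
        '</body></html>')
extractor = ImgFeatureExtractor()
extractor.feed(page)
```

Records like these, together with EXIF metadata, are the kind of input from which the 210-attribute dataset used here was originally built.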
Figure 1. Experiment workflow: image data extracted from web pages (HTML and EXIF, 210 features) is preprocessed and reduced by feature selection to a 50-feature dataset, which feeds the feature-number study and the classifiers under 5 x (10-fold cross-validation).

3.1 Feature selection

As commented above, we use a dataset obtained from previous research on feature extraction from web-page images, and we concentrate on studying the behaviour of the model-evaluation methods. Most datasets contain a lot of non-relevant information, and sometimes the data are inconsistent or irrelevant. The aims of feature selection are to: (i) reduce the training dataset by deleting irrelevant features, (ii) improve the quality of the model and (iii) reduce the dimension of the problem. The feature selection [16] methods used are: (i) the ReliefF algorithm (ReF) [17, 18]; (ii) Chi-Squared (Chi) [19]; (iii) Information Gain (InG) [20, 21]; and (iv) Information Gain Ratio (GaR) [22]. In addition, we have tested the model without any feature selection algorithm; this is referred to as No-Selection (NOS). Table 1 summarises the algorithms used.

Table 1: Feature selection algorithms

  Selection Algorithm           Concept
  ReliefF (ReF)                 Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and of a different class.
  Chi-Squared (Chi)             Evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class.
  Information Gain (InG)        Evaluates the worth of an attribute by measuring the information gain with respect to the class.
  Information Gain Ratio (GaR)  Evaluates the worth of an attribute by measuring the gain ratio with respect to the class.

In order to compare the selection methods, we have included a scoreboard with the results of running the algorithms over the training dataset; it is presented in Section 4.
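Of the methods in Table 1, Information Gain is the simplest to state: score each attribute by how much the class entropy drops after splitting on it, then rank. A self-contained sketch on invented toy features (not the paper's attributes or scores) could be:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(class) minus the weighted entropy left after splitting on the feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - remainder

# Toy data: 'large' separates the classes perfectly, 'has_alt' not at all.
labels  = ["relevant", "relevant", "other", "other"]
features = {"large": [1, 1, 0, 0], "has_alt": [1, 0, 1, 0]}
ranking = sorted(features,
                 key=lambda f: information_gain(features[f], labels),
                 reverse=True)
```

On this toy set `large` scores 1 bit and `has_alt` scores 0, so `large` ranks first; the other methods of Table 1 differ only in the scoring function plugged into the same ranking scheme.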
3.2 Number of features

Starting from a dataset with a reduced number of features, we still have to optimise that number, since it is one of the important issues when estimating feature relevance. In order to establish it, we have run the training with samples of different sizes; the results are shown in a table in Section 4. We have tried different numbers of features to determine which option achieves the best performance with the classification methods, and we will see that this factor must indeed be considered to obtain better results.

3.3 Classification

Classification methods [23] are highly domain-dependent, and often it is not possible to apply every known classifier to the problem under study. We have chosen the algorithms that best fit the nature of the present problem, including (i) probabilistic models, (ii) instance-based learning, (iii) decision trees, (iv) Support Vector Machines (SVMs), (v) boosting and (vi) decision tables.

As probabilistic models, we have chosen Bayesian networks (BaN) [24], Naïve Bayes (NaB) [25], Naïve Bayes Simple (NBS) and Naïve Bayes Kernel (NBK). All of these algorithms are based on Bayes' theorem and on the variable-independence assumption.

As instance-based learning algorithm, we chose the K-nearest neighbours (KNN) algorithm [26]. The training stage just consists of storing the training samples. To classify a new case, its K nearest neighbours (i.e. the K samples most similar to it in terms of the concrete problem) are selected, and the most frequent class among these K neighbours is the class predicted for the new case.

As decision-tree learning algorithms, we chose C4.5 [27, 28] and Random Forest [29]. A Random Forest is an ensemble classifier consisting of many decision trees, where the output class is a combination of the output classes of the individual trees.
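The K-nearest-neighbours procedure described above fits in a few lines. The following is a generic sketch on invented two-feature toy data, not the implementation used in the experiments:

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query point."""
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda sample: math.dist(sample[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Invented toy data, e.g. (normalised width, normalised height) per image.
train_x = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2)]
train_y = ["relevant", "relevant", "relevant", "icon", "icon", "icon"]
prediction = knn_predict(train_x, train_y, (0.75, 0.80), k=3)
```

A query near the large images is voted "relevant" by its three nearest neighbours; note that, as the text says, all the work happens at prediction time, since training only stores the samples.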
As a result, a decision tree is built by recursively partitioning the training set with the aim of maximising the class homogeneity of the resulting subsets; the variable selected at each step is the one that ensures the maximal reduction of class heterogeneity, which in the case of C4.5 is measured through entropy.

Regarding Support Vector Machines (SVMs) [30], we chose Sequential Minimal Optimization (SMO) [31], a kind of SVM to which an optimisation of the training procedure is applied, together with two further SVMs: one using a radial basis function kernel (SVR) and the other using a linear kernel (SVL).

The boosting algorithm we have chosen is AdaBoost M1 (AdB) [32]. It is a meta-algorithm that can be used in conjunction with many other learning algorithms to improve their performance, and it is adaptive in the sense that subsequently built classifiers are tweaked in favour of those instances misclassified by previous classifiers.

Finally, we use decision tables (DeT) [33]. This algorithm chooses attribute subsets one by one, adding to each set the attributes not yet included, and tests the accuracy of each subset using cross-validation or the 'leave-one-out' error estimation method.

Figure 2 shows the classification workflow: the 50-feature dataset is fed to the classifier groups, namely probabilistic models (BaN, NaB, NBS, NBK), instance-based learning (KNN), decision trees (Random Forest, C4.5), linear classifiers (SMO, SVL, SVR), boosting (AdaBoost) and decision tables, under 5 x (10-fold cross-validation); performance is reported through kappa and accuracy.

Figure 2. Classification workflow

In order to measure the performance of each classifier we can use two indicators: (i) the percentage of correct classifications (accuracy) and (ii) Cohen's kappa statistic. The kappa statistic [34] is a chance-corrected measure of agreement between the classifications and the true classes.
The kappa value is calculated with the following expression:

    Kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))    (1)

where Pr(a) is the percentage agreement (for example, between the classifier and the ground truth) and Pr(e) is the chance agreement. Kappa = 1 indicates perfect agreement, whereas Kappa = 0 indicates chance agreement.

3.4 ANOVA-Tukey test

The ANOVA-Tukey procedure is designed to determine which factor means, among a set of factors, present statistically significant differences. ANOVA (Analysis of variance) [35] looks for the factors that affect classifier performance and for interactions between those factors; that is, when one factor is changed, ANOVA determines whether the effect of another factor changes too. Tukey's test is executed after ANOVA to compare all possible pairs of means: it computes the probability of obtaining the same value given the means and variances of two input factors (the null hypothesis). The combined test tries to determine, for each factor, the value that yields the best performance.

4 Experimental results

In this section we show the results of the statistical test applied over the dataset, comparing the results by pairs of factor values (as described in Subsection 3.1). In order to evaluate the classifiers, we have executed five cross-validations with ten folds each, i.e. 5 x 10 stratified cross-validation [36]. The goal is to obtain the best-fitting value of each factor when applied over the dataset. The tables score each classifier against the other methods, and factor values appear in descending order of capacity for evaluating image features. Table 2 shows the results provided by the ANOVA-Tukey test. This test allows us to see: (i) for each parameter, whether there are significant differences among its possible values, and (ii) whether there are interactions between parameters.
The scope of this test is to detect parameters that have a significant impact on the classification, as well as possible interactions among them. The ANOVA test reveals that all three factors (feature selection, number of features and classifier) influence classification performance, since the resulting p-values are under 0,05.

Table 2: ANOVA-Tukey results

FACTORS
  Factor             Df   Sum Sq   Mean Sq    F-value    p-value
  classifier         11   886,07   80,552    8727,311   <2,2e-16
  featureSelection    4    30,20    7,549     817,891   <2,2e-16
  numFeatures         7    32,29    4,631     499,747   <2,2e-16

INTERACTIONS
  Factor                                    Df   Sum Sq   Mean Sq   F-value   p-value
  classifier:featureSelection               44   122,82   2,791     302,425   <2,2e-16
  classifier:numFeatures                    77   197,98   2,571     278,570   <2,2e-16
  featureSelection:numFeatures              21    46,77   2,227     241,289   <2,2e-16
  classifier:featureSelection:numFeatures  231    52,60   0,228      24,672   <2,2e-16

Table 3 compares the feature selection methods applied to the whole dataset. As can be seen, ReliefF performs better than the other feature selection methods in all cases; recall that this method evaluates an attribute by repeatedly sampling an instance and considering the attribute value obtained from the closest instances of the same class and of a different one. Conversely, applying no feature selection at all (NOS) yields the worst valuation. It is also interesting to note that the Information Gain Ratio algorithm performs similarly to using no selection method.

Table 3: Feature selection comparison

            ReF       Chi       InG       GaR       NOS      TOTAL
  ReF                0,0603    0,0747    0,1063    0,1126    0,0885
  Chi    -0,0603               0,0145    0,0460    0,0523    0,0131
  InG    -0,0747   -0,0145               0,0316    0,0379   -0,0049
  GaR    -0,1063   -0,0460   -0,0316               0,0063   -0,0444
  NOS    -0,1126   -0,0523   -0,0379   -0,0063              -0,0523
  TOTAL  -0,0885   -0,0131    0,0049    0,0444    0,0523     0,0000

The factors represented are: (i) ReliefF (ReF), (ii) Chi-Squared (Chi), (iii) Information Gain (InG), (iv) Information Gain Ratio (GaR) and (v) No-Selection (NOS).
Each cell contains the mean performance difference between the row and the column methods according to the Tukey test; positive values mean that the row method performs better than the column method (p-value under 0,05).

Table 4 presents the performance comparison when using different numbers of features. We observe the best results with a small number of attributes, while efficiency is lost with larger attribute sets.

Table 4: Feature number comparison

            5       10      15      ALL     25      35      20      30      40     TOTAL
  5                0,049   0,084   0,090   0,115   0,115   0,116   0,119   0,120   0,101
  10     -0,049            0,034   0,041   0,066   0,066   0,067   0,069   0,071   0,046
  15     -0,084  -0,034            0,006   0,031   0,032   0,032   0,035   0,037   0,007
  ALL    -0,090  -0,041  -0,006            0,025   0,025   0,026   0,029   0,031   0,000
  25     -0,115  -0,066  -0,031  -0,025            0,000   0,001   0,004   0,005  -0,029
  35     -0,115  -0,066  -0,032  -0,025   0,000            0,001   0,003   0,005  -0,029
  20     -0,116  -0,067  -0,032  -0,026  -0,001  -0,001            0,003   0,005  -0,029
  30     -0,119  -0,069  -0,035  -0,029  -0,004  -0,003  -0,003            0,002  -0,033
  40     -0,120  -0,071  -0,037  -0,031  -0,005  -0,005  -0,005  -0,002           -0,034
  TOTAL  -0,101  -0,046  -0,007   0,000   0,029   0,029   0,029   0,033   0,034    0,000

As can be seen in the table, the best number of features is five. It is significant that using all the attributes gives better results than using more than 25 of them. As previously mentioned in Subsection 3.3, Table 5 compares the behaviour of the classifiers over the dataset.
Table 5: Classifier comparison

           KNN     RaF     SMO    C4.5     DeT     AdB     SVL     SVR     BaN     NBK     NaB     NBS    TOTAL
  KNN             0,001   0,008   0,017   0,055   0,089   0,165   0,210   0,469   0,472   0,517   0,527   0,230
  RaF    -0,001           0,006   0,016   0,054   0,088   0,164   0,209   0,467   0,471   0,516   0,526   0,229
  SMO    -0,008  -0,006           0,010   0,047   0,081   0,158   0,202   0,461   0,465   0,510   0,519   0,222
  C4.5   -0,017  -0,016  -0,010           0,037   0,071   0,148   0,193   0,451   0,455   0,500   0,510   0,211
  DeT    -0,055  -0,054  -0,047  -0,037           0,034   0,111   0,155   0,414   0,417   0,463   0,472   0,170
  AdB    -0,089  -0,088  -0,081  -0,071  -0,034           0,077   0,121   0,380   0,383   0,429   0,438   0,133
  SVL    -0,165  -0,164  -0,158  -0,148  -0,111  -0,077           0,045   0,303   0,307   0,352   0,362   0,050
  SVR    -0,210  -0,209  -0,202  -0,193  -0,155  -0,121  -0,045           0,259   0,262   0,307   0,317   0,001
  BaN    -0,469  -0,467  -0,461  -0,451  -0,414  -0,380  -0,303  -0,259           0,004   0,049   0,058  -0,281
  NBK    -0,472  -0,471  -0,465  -0,455  -0,417  -0,383  -0,307  -0,262  -0,004           0,045   0,055  -0,285
  NaB    -0,517  -0,516  -0,510  -0,500  -0,463  -0,429  -0,352  -0,307  -0,049  -0,045           0,010  -0,334
  NBS    -0,527  -0,526  -0,519  -0,510  -0,472  -0,438  -0,362  -0,317  -0,058  -0,055  -0,010          -0,345
  TOTAL  -0,230  -0,229  -0,222  -0,211  -0,170  -0,133  -0,050  -0,001   0,281   0,285   0,334   0,345    0,000

The classifiers used are: (i) instance-based KNN (KNN), (ii) Random Forest (RaF), (iii) Sequential Minimal Optimization (SMO), (iv) the C4.5 algorithm (C4.5), (v) Decision Tables (DeT), (vi) AdaBoost M1 (AdB), (vii) the linear-kernel SVM (SVL), (viii) the RBF-kernel SVM (SVR), (ix) Bayes Net (BaN), (x) Naive Bayes Kernel (NBK), (xi) Naive Bayes (NaB) and (xii) Naive Bayes Simple (NBS). Looking carefully at the table, we can see that the best results are obtained by knowledge-based algorithms such as kNN, the decision trees (RaF, C4.5) and the SMO support vector machine, whereas the worst results come from the probabilistic methods: Bayes Net, Naive Bayes, Naive Bayes Simple and Naive Bayes Kernel. Nevertheless, the SVM achieves acceptable results when its training is optimised.
It would be interesting to study optimised versions of the linear and RBF SVMs. Table 6 shows the kappa and accuracy values. As can be seen, the KNN algorithm achieves better average kappa and accuracy than the NBS algorithm.

Table 6: Average and standard deviation of the kappa statistic and accuracy

  Classifier   Kappa avg.   Kappa std. dev.   Accuracy avg.   Accuracy std. dev.
  AdB          0,8072       0,1047            98,84%          0,57%
  BaN          0,4275       0,3031            87,28%          9,82%
  C4.5         0,8786       0,1000            99,24%          0,57%
  DeT          0,8413       0,0938            99,00%          0,58%
  KNN          0,8960       0,0800            99,34%          0,49%
  NaB          0,3787       0,2352            86,90%          8,79%
  NBK          0,4239       0,2683            88,31%          8,79%
  NBS          0,3691       0,2413            86,05%          9,33%
  RaF          0,8948       0,0802            99,34%          0,47%
  SMO          0,8885       0,0942            99,27%          0,57%
  SVL          0,7306       0,2320            98,55%          0,91%
  SVR          0,6861       0,1364            98,38%          0,59%

This table confirms the results obtained in the Tukey test, where KNN was identified as the best classifier. Figure 3 shows a bar diagram with the average kappa value for each classifier. In addition, we have added error bars, which give a general idea of how accurate each measurement is, or, conversely, how far the true (error-free) value might be from the reported one.

Figure 3. Kappa statistic averages

The error bars represent the standard deviation. This representation helps to see which methods perform best: the Bayesian methods are more volatile because they have larger deviations.

5 Conclusions and Further work

In this work we have studied the influence of three factors on classification over a dataset of image features: (i) the selection method, (ii) the number of features and (iii) the classifier. The best selection method is ReliefF, because it considers the value of the selected attribute for the nearest instances of the same and of a different class. In addition, the best classifiers are based on instance learning (kNN) and on trees (Random Forest).
These classifiers combine well with the ReliefF selection method. As regards the number of features, the best option is to choose a low number, though it is remarkable that taking all the attributes is better than using 25, 30, etc. Perhaps the selection method sometimes fails when choosing an attribute and therefore does not select a relevant instance (image) as such.

References

1. Facebook (2012) http://www.facebook.com. Accessed February 2012
2. Twitter (2012) http://www.twitter.com. Accessed February 2012
3. LinkedIn Corporation (2012) http://www.linkedin.com. Accessed February 2012
4. SmallRivers (2012) http://paper.li. Accessed March 2012
5. Eseinet (2012) http://www.eseinet.es. Accessed March 2012
6. W3Schools (1999-2012) HTML img tag. http://www.w3schools.com/tag_img.asp. Accessed February 2012
7. Wikipedia (2012) HTML. http://es.wikipedia.org/wiki/HTML. Accessed February 2012
8. Monzoncillo-Barreiros, M. I. (2012) Characterization study of images and text for the development of a representative image detector in published news by digital newspapers
9. Wikipedia (2012) EXIF. http://es.wikipedia.org/wiki/Exchangeable_image_file_format. Accessed February 2012
10. Molina López, J. M., García Herrero, J. (2006) Técnicas de Análisis de Datos. Aplicaciones prácticas utilizando Microsoft Excel y Weka
11. Tsuda, K., Shin, H., Schölkopf, B. (2005) Fast protein classification with multiple networks
12. Biggio, B., Fumera, G., Pillai, I., Roli, F. (2011) A survey and experimental evaluation of image spam filtering techniques
13. Chapelle, O. (1999) Support vector machines for histogram-based image classification
14. Vailaya, A., Jain, A., Zhang, H. J. (1998) On image classification: city images vs. landscapes
15. Benjamini, Y., Braun, H. (2002) John W. Tukey's contributions to multiple comparisons
16. Guyon, I. (2003) An Introduction to Variable and Feature Selection
17. Kononenko, I. (1994) Estimating attributes: Analysis and extensions of RELIEF
18. Sun, Y., Wu, D. A RELIEF Based Feature Extraction Algorithm
19. Wolfram Research, Inc. (2009) Chi-squared distribution. http://mathworld.wolfram.com/Chi-SquaredDistribution.html. Accessed March 2012
20. Wikipedia (2012) Information gain in decision trees. http://en.wikipedia.org/wiki/Information_gain_in_decision_trees. Accessed March 2012
21. Pawling, A., Chawla, N. V., Chaudhary, A. (2005) Computing Information Gain in Data Streams
22. Wikipedia (2012) Information gain ratio. http://en.wikipedia.org/wiki/Information_gain_ratio. Accessed March 2012
23. Mitchell, T. M. (1997) Machine Learning
24. Pearl, J. (1985) Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning. Proceedings of the 7th Conference of the Cognitive Science Society, University of California
25. John, G. H., Langley, P. (1995) Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers
26. Aha, D. W., Kibler, D., Albert, M. K. (1991) Instance-Based Learning Algorithms. Machine Learning
27. Quinlan, R. (1993) C4.5: Programs for machine learning. Morgan Kaufmann Publishers, San Mateo
28. Ruggieri, S. (2002) Efficient C4.5
29. Breiman, L. (2001) Random Forests. Machine Learning
30. Platt, J. (1999) Fast training of support vector machines using sequential minimal optimization. Advances in kernel methods – support vector learning. MIT Press, Cambridge, MA, USA
31. Platt, J. (1998) Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods – Support Vector Learning
32. Falaki, H. AdaBoost Algorithm. Computer Science Department, University of California
33. Kohavi, R. (1995) The Power of Decision Tables. In Proc. European Conference on Machine Learning
34. Ben-David, A. (2008) Comparison of Classification Accuracy using Cohen's Weighted Kappa. Expert Systems with Applications
35. Wikipedia (2012) Analysis of variance.
http://en.wikipedia.org/wiki/Analysis_of_variance. Accessed March 2012
36. Kohavi, R. (1995) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Computer Science Department, Stanford University

Appendix A: Summary in Spanish

1 Introduction

Nowadays the fastest way to keep in touch is through social networks such as Facebook [1], Twitter [2] or LinkedIn [3], or even through a personal digital newspaper such as Paper.li [4] or other private services such as ESEINET, which belongs to the University of Vigo. Through these services, information is shared via links, news items or blog entries, among others, and these usually contain text and images. This study focuses on the classification of those images. Broadly speaking, they can be classified according to their purpose as (i) decorative images, which include icons, backgrounds, headers, footers, buttons, etc.; (ii) advertisements, which usually link to other web pages and are commonly known as banners; and (iii) images inserted by the author of the main content. This work is oriented towards this classification, in order to determine which of all the images of a web page are representative of its main content.

2 Development process

The starting point is a previous feature selection study from which data on the most representative attributes of the images of any web page are taken. These attributes were extracted from HTML information and EXIF metadata, leaving a dataset with a total of 2300 instances (images) with 50 attributes each. All the images came from a selection of web sites with different kinds of content (blogs, news, governmental organisations, etc.). A new feature selection process has been carried out to determine which features are relevant when identifying a representative image.
To that end, different selection methods have been applied: (i) ReliefF, (ii) Chi-squared, (iii) Information Gain and (iv) Information Gain Ratio. They have also been compared with the results obtained without using any selection method. Different numbers of features have been fed to the previous methods, testing groups of 5, 10 and so on, up to the total number of attributes. The selected attributes have then been passed to different classification algorithms of several kinds: (i) instance-based learning, (ii) decision trees, (iii) Support Vector Machines, (iv) boosting algorithms, (v) Bayesian algorithms and (vi) decision tables.

3 ANOVA-Tukey test

The ANOVA test is run to determine which factors influence the classification and how these factors interact with each other. Tukey's test is then executed to find, for each factor, the value that achieves the best performance.

4 Results

To evaluate the classifiers, five cross-validations with ten folds each were executed. The ANOVA test indicates that all the factors are relevant (attribute selection, number of attributes and classifier). As selection method, the best result is obtained with the ReliefF algorithm. Regarding the number of attributes, the best option is to select few of them, although beyond 15 the best option is to select them all rather than making further partitions. Regarding classification, the best methods are based on instance learning (kNN) and on trees (RaF).

5 Conclusions

The best methods are the learning-based ones, since they use knowledge of the neighbouring elements to determine the importance of an attribute (the ReliefF selection algorithm and the kNN classifier). The number of attributes is important when performing the classification.
In any case, it is important to highlight that, once 15 attributes have been selected, the next best option is to select all of them (the full 50-feature set), which is better than using an intermediate number.