Feature selection on hierarchy of web documents

Abstract

The paper describes feature subset selection as used in learning on text data (text learning) and gives a brief overview of the feature subset selection methods commonly used in machine learning. Several known and some new feature scoring measures appropriate for feature subset selection on large text data are described and related to each other. An experimental comparison of the described measures is given on real-world data collected from the Web. Machine learning techniques are applied to data collected from Yahoo, a large text hierarchy of Web documents. Our approach includes some original ideas for handling a large number of features, categories and documents. The high number of features is reduced by feature subset selection and additionally by using a 'stop-list', pruning low-frequency features, and using the short description of each document given in the hierarchy instead of the document itself. Documents are represented as feature vectors that include word sequences, rather than only the single words commonly used when learning on text data. An efficient approach to generating word sequences is proposed. Based on the hierarchical structure, we propose a way of dividing the problem into subproblems, each representing one of the categories included in the Yahoo hierarchy. In our learning experiments, a naive Bayesian classifier was used on the text data for each of the subproblems. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. The best performance was achieved by feature selection based on Odds ratio, a feature scoring measure known from information retrieval, using a relatively small number of features.
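
As an illustration of the Odds ratio scoring mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the formula, so the sketch assumes the commonly used log-odds form, log[P(W|pos)(1 - P(W|neg)) / ((1 - P(W|pos)) P(W|neg))], estimated from document frequencies for one category-specific subproblem. The function names, the `eps` clamping constant, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, eps=1e-6):
    """Score each word by log odds ratio for one binary subproblem.

    docs   -- list of token lists (one per document)
    labels -- list of bools, True if the document belongs to the category
    eps    -- assumed clamping constant that keeps probabilities off 0 and 1
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative document")

    # Document frequency of each word in positive and negative documents.
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y else neg_df).update(set(tokens))

    scores = {}
    for w in set(pos_df) | set(neg_df):
        # Clamp the estimates so the logarithm is always defined.
        p_pos = min(max(pos_df[w] / n_pos, eps), 1 - eps)
        p_neg = min(max(neg_df[w] / n_neg, eps), 1 - eps)
        scores[w] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_features(scores, k):
    """Return the k highest-scoring words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): two documents inside the category, two outside.
docs = [
    ["machine", "learning"],
    ["learning", "text"],
    ["football", "score"],
    ["score", "text"],
]
labels = [True, True, False, False]
scores = odds_ratio_scores(docs, labels)
top = select_top_features(scores, 1)
```

Under this measure, words that occur mostly in the category's own documents score high, words spread evenly across both groups score near zero, and words typical of other categories score negative, so keeping only the top-ranked words yields a small, category-specific feature set.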

Keywords: Text mining, Feature selection, Document categorization, Maintaining document ontology, Machine learning, Data mining

Review history: Available online 31 May 2002.

Official paper URL: https://doi.org/10.1016/S0167-9236(02)00097-0