Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/40410
Title: A refinement approach to handling model misfit in text categorization
Authors: Wu, H.
Phang, T.H. 
Liu, B. 
Li, L. 
Keywords: naïve Bayesian classifier
Rocchio algorithm
Text categorization
Issue Date: 2002
Source: Wu, H., Phang, T.H., Liu, B., Li, L. (2002). A refinement approach to handling model misfit in text categorization. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 207-216. ScholarBank@NUS Repository.
Abstract: Text categorization or classification is the automated assigning of text documents to pre-defined classes based on their contents. This problem has been studied in information retrieval, machine learning and data mining. So far, many effective techniques have been proposed. However, most techniques are based on some underlying models and/or assumptions. When the data fits the model well, the classification accuracy will be high. However, when the data does not fit the model well, the classification accuracy can be very low. In this paper, we propose a refinement approach to dealing with this problem of model misfit. We show that we do not need to change the classification technique itself (or its underlying model) to make it more flexible. Instead, we propose to use successive refinements of classification on the training data to correct the model misfit. We apply the proposed technique to improve the classification performance of two simple and efficient text classifiers, the Rocchio classifier and the naïve Bayesian classifier. These techniques are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a text classifier. Extensive experiments on two benchmark document corpora show that the proposed technique is able to improve text categorization accuracy of the two techniques dramatically. In particular, our refined model is able to improve the naïve Bayesian or Rocchio classifier prediction performance by 45% on average.
Source Title: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
URI: http://scholarbank.nus.edu.sg/handle/10635/40410
Appears in Collections: Staff Publications



