On the effectiveness of latent semantic analysis for the categorization of call centre records
Menon, R. ; Keerthi, S.S. ; Loh, H.T. ; Brombacher, A.C.
Abstract
Text categorization is an important component of many information management tasks, such as the real-time sorting of emails or files. An important consideration in text categorization performance is the choice of feature set for text representation. A popular approach to text representation is the vector space model, which represents the 'units of content' of a document as a vector. In most situations, each distinct word is used as a content unit. However, such a representation, called the bag-of-words approach, has drawbacks. Firstly, a large number of features is required for document representation. Secondly, it does not take into account the effects of synonymy and polysemy, which could have an impact on classification accuracy. Latent semantic analysis (LSA) addresses these shortcomings by simultaneously modelling all the interrelationships among terms and documents, using the singular value decomposition technique, which allows terms and documents to be represented in a reduced dimensional space. It has been widely used to enhance the performance of information retrieval systems and has recently been used for text classification as well. In this study, we further explore its use for the classification of call centre data sets obtained from a multinational company. These spontaneously created documents exhibit characteristics different from the benchmark data sets used in most studies, hence necessitating this investigation. The effect on classification of various weighting schemes and of the number of dimensions was also explored. Results revealed that the LSA approach marginally improved classification accuracy. It was also found that the weighting scheme used did not significantly affect classification performance, unlike in some retrieval applications, where as much as a 40% average improvement in performance has been observed. Further, the widely recommended use of 100 to 300 dimensions for document representation was found to be inapplicable to the investigated data sets. © 2004 IEEE.
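The LSA step summarized in the abstract (weight a term-document matrix, then use a truncated singular value decomposition to place documents in a reduced space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy documents, the TF-IDF-style weighting, and the choice of k = 2 are all assumptions made for illustration, and the classifier that would consume the reduced vectors is left out.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (vocab, counts); counts[i, j] is how often term i occurs in document j."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[index[w], j] += 1
    return vocab, counts

def tfidf_weight(counts):
    """One possible weighting scheme (assumed here): raw count times a smoothed IDF."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)   # number of documents containing each term
    idf = np.log(n_docs / df) + 1.0
    return counts * idf[:, None]

def lsa_document_vectors(weighted, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    # Each document is a row of V_k scaled by the top-k singular values.
    return (s[:k, None] * Vt[:k, :]).T

# Hypothetical call-centre-style notes, invented for this sketch.
docs = [
    "printer will not print",
    "printer paper jam",
    "screen is blank",
    "blank screen after boot",
]
vocab, counts = build_term_document_matrix(docs)
weighted = tfidf_weight(counts)
doc_vecs = lsa_document_vectors(weighted, k=2)   # shape: (number of documents, 2)
full_vecs = lsa_document_vectors(weighted, k=len(docs))
```

The rows of `doc_vecs` would then serve as the feature vectors passed to a classifier; varying `k` and swapping `tfidf_weight` for another scheme corresponds to the two factors the study reports examining.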
Keywords
Call Centre Records, Latent Semantic Analysis, Singular Value Decomposition, Support Vector Machines, Text Classification
Source Title
IEEE International Engineering Management Conference
Date
2004
Type
Conference Paper