Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/38930
DC Field / Value
dc.title: Semi-supervised text classification using partitioned EM
dc.contributor.author: Cong, G.
dc.contributor.author: Lee, W.S.
dc.contributor.author: Wu, H.
dc.contributor.author: Liu, B.
dc.date.accessioned: 2013-07-04T07:30:07Z
dc.date.available: 2013-07-04T07:30:07Z
dc.date.issued: 2004
dc.identifier.citation: Cong, G., Lee, W.S., Wu, H., Liu, B. (2004). Semi-supervised text classification using partitioned EM. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2973 : 482-493. ScholarBank@NUS Repository.
dc.identifier.issn: 0302-9743
dc.identifier.uri: http://scholarbank.nus.edu.sg/handle/10635/38930
dc.description.abstract: Text classification using a small labeled set and a large unlabeled set is a promising technique for reducing the labor-intensive and time-consuming effort of labeling training data to build accurate classifiers, since unlabeled data is easy to obtain from the Web. It has been demonstrated in [16] that an unlabeled set can improve classification accuracy significantly when only a small labeled training set is available. However, the Bayesian method used in [16] assumes that text documents are generated from a mixture model and that there is a one-to-one correspondence between the mixture components and the classes. In many real-life applications this assumption is violated: a class may cover documents from several different topics, and in such cases the resulting classifiers can be quite poor. In this paper, we propose a clustering-based partitioning technique to solve this problem. The method first partitions the training documents hierarchically using hard clustering. After running the expectation maximization (EM) algorithm in each partition, it prunes the tree using the labeled data. The remaining tree nodes or partitions are likely to satisfy the one-to-one correspondence condition. Extensive experiments demonstrate that this method achieves a dramatic gain in classification performance. © Springer-Verlag 2004.
dc.source: Scopus
dc.subject: Labeled and unlabeled data
dc.subject: Semi-supervised learning
dc.subject: Text classification
dc.subject: Text mining
dc.type: Article
dc.contributor.department: COMPUTER SCIENCE
dc.description.sourcetitle: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
dc.description.volume: 2973
dc.description.page: 482-493
dc.identifier.isiut: NOT_IN_WOS
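The partitioned-EM procedure outlined in the abstract (semi-supervised EM with a generative text model, run separately inside hard-clustered partitions) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the multinomial naive Bayes model follows the standard semi-supervised EM recipe cited as [16], while the function names and the flat, unpruned partition step are assumptions; the paper's hierarchical clustering and labeled-data pruning are omitted.

```python
# Hedged sketch of partitioned semi-supervised EM (assumed structure, not the
# authors' implementation). Documents are bag-of-words count matrices.
import numpy as np

def nb_fit(X, resp, alpha=1.0):
    """M-step: weighted multinomial naive Bayes. resp[i, c] = P(class c | doc i)."""
    n_classes = resp.shape[1]
    priors = (resp.sum(axis=0) + alpha) / (resp.sum() + alpha * n_classes)
    counts = resp.T @ X                      # per-class expected word counts
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return np.log(priors), np.log(theta)

def nb_posterior(X, log_prior, log_theta):
    """E-step: class posteriors P(c | doc) under the naive Bayes model."""
    joint = X @ log_theta.T + log_prior
    joint -= joint.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_l, y_l, X_u, n_classes, n_iter=20):
    """Semi-supervised EM: labeled responsibilities stay fixed at their labels."""
    resp_l = np.eye(n_classes)[y_l]
    log_prior, log_theta = nb_fit(X_l, resp_l)  # initialise from labeled data only
    for _ in range(n_iter):
        resp_u = nb_posterior(X_u, log_prior, log_theta)
        X = np.vstack([X_l, X_u])
        resp = np.vstack([resp_l, resp_u])
        log_prior, log_theta = nb_fit(X, resp)
    return log_prior, log_theta

def partitioned_em(X_l, y_l, X_u, n_classes, part_l, part_u):
    """Run EM independently in each hard-clustered partition.

    part_l / part_u are partition assignments for labeled / unlabeled docs,
    assumed to come from a separate hard-clustering step; the paper instead
    builds a cluster hierarchy and prunes it with the labeled data.
    """
    return {k: em_nb(X_l[part_l == k], y_l[part_l == k],
                     X_u[part_u == k], n_classes)
            for k in np.unique(part_l)}
```

The intended effect is that each partition is topically homogeneous enough for the one-to-one correspondence between mixture components and classes to hold locally, so the per-partition EM does not get misled by multi-topic classes.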
Appears in Collections: Staff Publications

Files in This Item:
There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.