Please use this identifier to cite or link to this item:
https://scholarbank.nus.edu.sg/handle/10635/16205
DC Field | Value | |
---|---|---|
dc.title | A new term weighting method for text categorization | |
dc.contributor.author | LAN MAN | |
dc.date.accessioned | 2010-04-08T11:02:10Z | |
dc.date.available | 2010-04-08T11:02:10Z | |
dc.date.issued | 2007-06-06 | |
dc.identifier.citation | LAN MAN (2007-06-06). A new term weighting method for text categorization. ScholarBank@NUS Repository. | |
dc.identifier.uri | http://scholarbank.nus.edu.sg/handle/10635/16205 | |
dc.description.abstract | Text representation is the task of transforming the content of a textual document into a compact form so that the document can be recognized and classified by a computer or a classifier. This thesis focuses on developing an effective and efficient term weighting method for the text categorization task. We selected the single token as the feature unit because previous research has shown that this simple feature type outperforms more complex feature types. We investigated several widely used unsupervised and supervised term weighting methods on several popular data collections in combination with the SVM and kNN algorithms. Based on the distribution of relevant documents in the collection and an analysis of each term's discriminating power, we propose a new term weighting scheme, $tf.rf$. The controlled experiments show that term weighting methods perform inconsistently across data sets with different category distributions and across learning algorithms. Most supervised term weighting methods based on information theory did not perform satisfactorily in our experiments. The newly proposed $tf.rf$ method, however, performs consistently better than the other term weighting methods. The widely used $tf.idf$ method, by contrast, does not perform uniformly well across data sets with different category distributions. | |
dc.language.iso | en | |
dc.subject | Text Categorization, Term Weighting Method, Support Vector Machine, kNN | |
dc.type | Thesis | |
dc.contributor.department | COMPUTER SCIENCE | |
dc.contributor.supervisor | TAN CHEW LIM | |
dc.description.degree | Ph.D | |
dc.description.degreeconferred | DOCTOR OF PHILOSOPHY | |
dc.identifier.isiut | NOT_IN_WOS | |
Appears in Collections: | Ph.D Theses (Open) |
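
The abstract names the $tf.rf$ scheme but does not give its formula. The sketch below assumes the form commonly cited from the author's published work, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. The function names and the toy counts are illustrative assumptions, not taken from the thesis.

```python
import math


def rf(pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Relevance frequency, assuming rf = log2(2 + a / max(1, c)).

    a = positive-category documents containing the term
    c = negative-category documents containing the term
    max(1, c) avoids division by zero when no negative document has the term.
    """
    return math.log2(2 + pos_docs_with_term / max(1, neg_docs_with_term))


def tf_rf(term_freq: int, pos_docs_with_term: int, neg_docs_with_term: int) -> float:
    """Weight of a term in one document: raw term frequency times rf."""
    return term_freq * rf(pos_docs_with_term, neg_docs_with_term)


def idf(n_docs: int, docs_with_term: int) -> float:
    """Plain inverse document frequency, for comparison (unsupervised)."""
    return math.log2(n_docs / docs_with_term)


if __name__ == "__main__":
    # Hypothetical counts: both terms occur in 8 of 100 documents,
    # but term A is concentrated in the positive category and term B is not.
    term_a = {"pos": 6, "neg": 2}
    term_b = {"pos": 2, "neg": 6}

    for name, t in (("A", term_a), ("B", term_b)):
        total = t["pos"] + t["neg"]
        print(name, "idf:", idf(100, total), "rf:", rf(t["pos"], t["neg"]))
```

With these toy counts both terms receive the same idf, since idf ignores category labels, while rf gives the larger weight to the term concentrated in the positive category. That is one way to read the abstract's remark about a term's discriminating power.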
Files in This Item:
File | Description | Size | Format | Access Settings | Version | |
---|---|---|---|---|---|---|
LanMan.pdf | | 1.92 MB | Adobe PDF | OPEN | None | View/Download |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.