Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/71871
|Title:||Straightforward feature selection for scalable latent semantic indexing|
|Authors:||Yan, J.; Yan, S.; Liu, N.; Chen, Z.|
|Issue Date:||2009|
|Citation:||Yan, J., Yan, S., Liu, N., Chen, Z. (2009). Straightforward feature selection for scalable latent semantic indexing. Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics 3: 1153-1164. ScholarBank@NUS Repository.|
|Abstract:||Latent Semantic Indexing (LSI) has been shown to be effective on many small-scale text collections. However, there is little evidence of its effectiveness on unsampled large-scale text corpora, owing to its high computational complexity. In this paper, we propose a straightforward feature selection strategy, named Feature Selection for Latent Semantic Indexing (FSLSI), as a preprocessing step so that LSI can be efficiently approximated on large-scale text corpora. We formulate LSI as a continuous optimization problem and propose to optimize its objective function as a discrete optimization problem, which leads to the FSLSI algorithm. We show that the closed-form solution of this optimization is as simple as scoring each feature by Frobenius norm and filtering out those with small scores. Theoretical analysis guarantees that the loss incurred by the features that FSLSI filters out is minimized for approximating LSI. Thus we offer a general way of studying and applying LSI on large-scale corpora. A large-scale study on more than 1 million TREC documents shows the effectiveness of FSLSI in Information Retrieval (IR) tasks.|
|Source Title:||Society for Industrial and Applied Mathematics - 9th SIAM International Conference on Data Mining 2009, Proceedings in Applied Mathematics|
|URI:||http://scholarbank.nus.edu.sg/handle/10635/71871|
|ISBN:||9781615671090|
|Appears in Collections:||Staff Publications|
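The abstract describes the selection step as scoring each feature by a norm and filtering out the features with small scores. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a term-document matrix stored as a list of rows (one row per term), reads "scoring each feature by Frobenius norm" as taking the Euclidean norm of each term's row, and uses the hypothetical names `feature_scores`, `select_features`, and `reduce_matrix`. The toy weights are made up for illustration.

```python
import math


def feature_scores(term_doc):
    """Score each term (row) by the Euclidean norm of its weights.

    Assumption: rows are terms and columns are documents; the paper's
    exact scoring formula is not given in the abstract.
    """
    return [math.sqrt(sum(w * w for w in row)) for row in term_doc]


def select_features(term_doc, k):
    """Return the indices of the k highest-scoring terms, in original order."""
    scores = feature_scores(term_doc)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def reduce_matrix(term_doc, k):
    """Keep only the rows of the k selected terms."""
    kept = select_features(term_doc, k)
    return kept, [term_doc[i] for i in kept]


# Toy term-document matrix: 4 terms x 3 documents (hypothetical weights).
toy = [
    [3.0, 0.0, 4.0],
    [0.0, 1.0, 0.0],
    [1.0, 2.0, 2.0],
    [0.0, 0.0, 0.5],
]
kept, reduced = reduce_matrix(toy, 2)
```

Under these assumptions, LSI (a truncated singular value decomposition) would then be run on `reduced` rather than on the full matrix, which is where the computational saving would come from.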
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.