Please use this identifier to cite or link to this item:
https://doi.org/10.1214/15-AOS1423
DC Field | Value
---|---
dc.title | Influential features PCA for high dimensional clustering
dc.contributor.author | Jin, J
dc.contributor.author | Wang, W
dc.date.accessioned | 2021-06-29T09:53:41Z
dc.date.available | 2021-06-29T09:53:41Z
dc.date.issued | 2016-12-01
dc.identifier.citation | Jin, J, Wang, W (2016-12-01). Influential features PCA for high dimensional clustering. Annals of Statistics 44 (6): 2323-2359. ScholarBank@NUS Repository. https://doi.org/10.1214/15-AOS1423
dc.identifier.issn | 00905364
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/192301
dc.description.abstract | We consider a clustering problem where we observe feature vectors X_i ∈ R^p, i = 1, 2, …, n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Influential Features PCA (IF-PCA) as a new clustering procedure. In IF-PCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, obtain the first (K - 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means procedure to these singular vectors. The only tuning parameter in this procedure is the threshold in the feature-selection step; we set it in a data-driven fashion by adapting the recent notion of Higher Criticism, so IF-PCA is a tuning-free clustering method. We apply IF-PCA to 10 gene microarray data sets, where it has competitive clustering performance. In particular, in three of the data sets, the error rates of IF-PCA are only 29% or less of those of the other methods. We have also rediscovered, on microarray data, the empirical-null phenomenon reported by Efron [J. Amer. Statist. Assoc. 99 (2004) 96-104]. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov-Smirnov statistics and show that IF-PCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it differs in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.
dc.publisher | Institute of Mathematical Statistics
dc.source | Elements
dc.subject | stat.ME
dc.subject | math.ST
dc.subject | stat.TH
dc.subject | Primary 62H30, 62G32, secondary 62E20, 62P10
dc.type | Article
dc.date.updated | 2021-06-29T08:27:16Z
dc.contributor.department | STATISTICS & APPLIED PROBABILITY
dc.description.doi | 10.1214/15-AOS1423
dc.description.sourcetitle | Annals of Statistics
dc.description.volume | 44
dc.description.issue | 6
dc.description.page | 2323-2359
dc.published.state | Published
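The abstract describes the IF-PCA pipeline concretely enough to sketch: standardize features, rank them by Kolmogorov-Smirnov score, keep the top-scoring ones, take the first K - 1 left singular vectors of the post-selection matrix, and run k-means on them. Below is a minimal, hypothetical sketch, not the authors' implementation: it assumes a standard-normal null for the KS scores and a user-supplied count of selected features in place of the paper's data-driven Higher Criticism threshold, and the function name `if_pca` is ours.

```python
# Hypothetical sketch of the IF-PCA pipeline (not the authors' code).
# Assumptions: standard-normal null for KS scores; a fixed n_selected
# cut-off instead of the paper's Higher Criticism threshold.
import numpy as np
from scipy.stats import kstest
from scipy.cluster.vq import kmeans2

def if_pca(X, K, n_selected):
    """Estimate K cluster labels for the rows of X (n samples x p features)."""
    # Normalize each feature to mean 0, variance 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # KS score of each feature against the standard-normal null:
    # features carrying cluster structure look non-Gaussian after normalization.
    ks = np.array([kstest(Z[:, j], "norm").statistic for j in range(Z.shape[1])])
    # Keep the features with the largest KS scores.
    keep = np.argsort(ks)[-n_selected:]
    # First (K - 1) left singular vectors of the post-selection matrix.
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    H = U[:, :K - 1]
    # Classical k-means on the singular vectors gives the label estimates.
    _, labels = kmeans2(H, K, minit="++", seed=0)
    return labels
```

On synthetic p ≫ n-style data with a few informative features, the top singular vector of the post-selection matrix separates the classes and k-means recovers the labels up to a permutation.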
Appears in Collections: Staff Publications Elements
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
IF-PCA.pdf | Accepted version | 705.24 kB | Adobe PDF | OPEN | Post-print
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.