Please use this identifier to cite or link to this item: https://doi.org/10.1214/16-aos1522
DC Field | Value
dc.title | Phase transitions for high dimensional clustering and related problems
dc.contributor.author | Jin, Jiashun
dc.contributor.author | Ke, Zheng Tracy
dc.contributor.author | Wang, Wanjie
dc.date.accessioned | 2019-07-30T06:54:59Z
dc.date.available | 2019-07-30T06:54:59Z
dc.date.issued | 2017-10-01
dc.identifier.citation | Jin, Jiashun, Ke, Zheng Tracy, Wang, Wanjie (2017-10-01). Phase transitions for high dimensional clustering and related problems. The Annals of Statistics 45 (5) : 2151-2189. ScholarBank@NUS Repository. https://doi.org/10.1214/16-aos1522
dc.identifier.issn | 00905364
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/157285
dc.description.abstract | Consider a two-class clustering problem where we observe $X_i = \ell_i \mu + Z_i$, $Z_i \stackrel{iid}{\sim} N(0, I_p)$, $1 \leq i \leq n$. The feature vector $\mu\in R^p$ is unknown but is presumably sparse. The class labels $\ell_i\in\{-1, 1\}$ are also unknown and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strengths of useful features, we find the precise demarcation for the Region of Impossibility and Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments. We also extend the study on statistical limits for clustering to that for signal recovery and that for hypothesis testing. We compare the statistical limits for three problems and expose some interesting insight. We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold $t > 0$, IF-PCA clusters by applying classical PCA to all columns of $X$ with an $L^2$-norm larger than $t$. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering. We find an interesting phase transition for IF-PCA. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.
dc.publisher | Institute of Mathematical Statistics
dc.source | Elements
dc.subject | math.ST
dc.subject | stat.ML
dc.subject | stat.TH
dc.subject | 62H30, 62H25 (Primary); 62G05, 62G10 (Secondary)
dc.type | Article
dc.date.updated | 2019-07-30T06:52:10Z
dc.contributor.department | STATISTICS & APPLIED PROBABILITY
dc.description.doi | 10.1214/16-aos1522
dc.description.sourcetitle | The Annals of Statistics
dc.description.volume | 45
dc.description.issue | 5
dc.description.page | 2151-2189
dc.published.state | Published
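
The IF-PCA step described in the abstract (keep the columns of $X$ whose $L^2$-norm exceeds $t$, then apply classical PCA) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "classical PCA" for two classes means labeling each row by the sign of the leading left singular vector; the function name `if_pca_cluster`, the simulation settings, and the threshold formula are all hypothetical choices made for illustration only.

```python
import numpy as np


def if_pca_cluster(X, t):
    """Sketch of IF-PCA: select columns by L2-norm, then sign of top singular vector.

    X is an n-by-p data matrix and t > 0 a threshold. Returns estimated labels in
    {-1, +1} and the boolean mask of selected columns. Illustrative only.
    """
    col_norms = np.linalg.norm(X, axis=0)   # L2-norm of each feature column
    keep = col_norms > t                    # feature selection step
    if not keep.any():
        raise ValueError("threshold t removed every column")
    # Classical PCA on the selected columns: leading left singular vector.
    U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    labels = np.sign(U[:, 0])
    labels[labels == 0] = 1                 # break exact ties arbitrarily
    return labels.astype(int), keep


# Simulate X_i = ell_i * mu + Z_i with a sparse mu (settings are assumptions).
rng = np.random.default_rng(0)
n, p, s = 100, 500, 10
ell = rng.choice([-1, 1], size=n)           # true class labels
mu = np.zeros(p)
mu[:s] = 1.5                                # s useful features, all others zero
X = np.outer(ell, mu) + rng.standard_normal((n, p))

# Hypothetical threshold: an upper-tail bound for the norm of a pure-noise column.
t = np.sqrt(n + 2 * np.sqrt(n * np.log(p)) + 2 * np.log(p))

labels, keep = if_pca_cluster(X, t)
# Labels are identifiable only up to a global sign flip.
err = min(np.mean(labels != ell), np.mean(labels != -ell))
print("selected columns:", int(keep.sum()), "clustering error:", err)
```

The signal in this simulation is deliberately strong, so the example sits well inside what the abstract calls the Region of Possibility; it says nothing about behavior near the phase boundary.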
Appears in Collections: Staff Publications; Elements

Files in This Item:
File | Description | Size | Format | Access Settings | Version
3Phase.pdf | Published version | 681.5 kB | Adobe PDF | OPEN | Post-print

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.