Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/237667
Title: SUPPORTING GENERIC CLUSTERING AND SIMILARITY SEARCH FOR MASSIVE DATASETS
Authors: LUO PINGYI
ORCID iD: orcid.org/0000-0002-7389-4266
Keywords: Clustering, Similarity Search, Generic Framework, Locality-Sensitive Hashing, Large-scale data, Parallel Computing
Issue Date: 1-Aug-2022
Citation: LUO PINGYI (2022-08-01). SUPPORTING GENERIC CLUSTERING AND SIMILARITY SEARCH FOR MASSIVE DATASETS. ScholarBank@NUS Repository.
Abstract: Clustering and similarity search are two crucial challenges in big data analytics. The large cardinality, high dimensionality, and diverse nature of big data require analytic methods that are both efficient and flexible enough to support various distance functions. However, little research has focused on generic functionality that simultaneously supports multiple distance functions. In this thesis, we first propose a generic distributed clustering framework that processes large-scale data under different distance functions by leveraging different Locality-Sensitive Hashing schemes. We further investigate the clustering problem for sparse data and introduce a new method, k-FreqItems, with a novel sparse center representation called FreqItem. Additionally, we present a seeding technique that significantly improves the convergence speed of k-FreqItems. Finally, with the support of our clustering methods, we combine the generic similarity search framework GENIE with a two-level distributed indexing structure and develop a scalable framework, dGENIE, for billion-scale data.
URI: https://scholarbank.nus.edu.sg/handle/10635/237667
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: LuoPY.pdf | Size: 13.87 MB | Format: Adobe PDF | Access Settings: OPEN | Version: None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.