Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/18629
Title: Efficiently indexing sparse wide tables in community systems
Authors: HUI MEI
Keywords: Community Web Management Systems, sparse wide table, iVA-File, nG-signature, inverted index, CW2I index
Issue Date: 25-May-2010
Source: HUI MEI (2010-05-25). Efficiently indexing sparse wide tables in community systems. ScholarBank@NUS Repository.
Abstract: The increasing popularity of Community Web Management Systems (CWMSs) calls for tailor-made data management approaches. In CWMSs, storage structures inspired by universal tables are increasingly used to manage sparse datasets. Such a sparse wide table (SWT) typically comprises thousands of attributes, many of which are undefined in any given tuple. Low-dimensional structured similarity search and general complex queries over a combination of numerical and text attributes are common operations. However, many properties of wide tables and their associated Web 2.0 services render most multi-dimensional indexing structures ineffective. Recent studies in this area have mainly focused on improving the efficiency of storage management and the deployment of inverted indexes; so far no new data structure has been proposed for indexing SWTs. The inverted index is fast to scan but does little to reduce random accesses to the data file, because it captures little information about the attributes and the content of attribute values. Furthermore, it is not sufficient for complex queries. In this thesis, we examine this problem and propose the iVA-file indexing structure for structured similarity queries and the CW2I indexing scheme for complex queries. The iVA-file works on the basis of approximate contents and guarantees scanning efficiency within a bounded range. We introduce the nG-signature to approximately represent data strings and improve the existing approximation vectors for numerical values. We also present an efficient query processing strategy for the iVA-file, which differs from the strategies used for existing scan-based indexes. Because the distance metric between a query and a tuple varies from application to application, the iVA-file has been designed to be metric-oblivious and to provide efficient filter-and-refine search based on any rational metric.
Extensive experiments on real datasets show that the iVA-file significantly outperforms existing proposals in query efficiency while maintaining good update speed. CW2I combines two effective indexing methods: an inverted index and a direct index for each attribute. The inverted index keeps, for each attribute value, a list of tuples sorted by tuple ID; the index itself is sorted by value. A separate direct index for each attribute provides fast access to the tuples for which that attribute is defined. The direct index is sorted by tuple ID, following a column-oriented architecture. Comparative experiments demonstrate that the proposed scheme outperforms other approaches for answering complex queries on community web data. In summary, this thesis proposes indexing techniques for efficient structured similarity queries and complex queries over sparse wide tables in community systems. Extensive performance studies show that the proposed indexes significantly improve query performance.
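The two-index layout described for CW2I can be illustrated with a minimal Python sketch. It is not the thesis implementation: the in-memory dictionaries, the function names (`build_indexes`, `lookup_equal`, `lookup_direct`, `query`) and the sample data are all assumptions made for illustration, and each attribute is assumed to hold values of a single comparable type. The sketch answers a query of the form "attribute A equals v AND attribute B satisfies a predicate" by reading the inverted index for A and probing the direct index for B.

```python
from bisect import bisect_left
from collections import defaultdict


def build_indexes(rows):
    """rows maps tuple ID -> dict holding only the attributes defined for that tuple."""
    postings = defaultdict(lambda: defaultdict(list))  # attr -> value -> [tuple IDs]
    direct = defaultdict(list)                         # attr -> [(tuple ID, value)]
    for tid in sorted(rows):  # ascending IDs keep every list sorted by tuple ID
        for attr, value in rows[tid].items():
            postings[attr][value].append(tid)
            direct[attr].append((tid, value))
    # Inverted index: per attribute, (value, tuple-ID list) pairs sorted by value.
    inverted = {attr: sorted(by_value.items()) for attr, by_value in postings.items()}
    return inverted, dict(direct)


def lookup_equal(inverted, attr, value):
    """Return the sorted tuple IDs whose attr equals value."""
    entries = inverted.get(attr, [])
    values = [v for v, _ in entries]  # rebuilt per call to keep the sketch short
    i = bisect_left(values, value)
    if i < len(entries) and entries[i][0] == value:
        return entries[i][1]
    return []


def lookup_direct(direct, attr, tid):
    """Return the value of attr for tuple tid, or None if attr is undefined there."""
    entries = direct.get(attr, [])
    i = bisect_left(entries, (tid,))  # (tid,) sorts just before any (tid, value)
    if i < len(entries) and entries[i][0] == tid:
        return entries[i][1]
    return None


def query(inverted, direct, eq_attr, eq_value, other_attr, predicate):
    """Tuples with eq_attr == eq_value whose other_attr is defined and passes predicate."""
    result = []
    for tid in lookup_equal(inverted, eq_attr, eq_value):
        value = lookup_direct(direct, other_attr, tid)
        if value is not None and predicate(value):
            result.append(tid)
    return result


# Hypothetical sparse table: tuple 2 has no "price" attribute.
rows = {
    1: {"brand": "acme", "price": 120},
    2: {"brand": "acme"},
    3: {"brand": "zenit", "price": 80},
    4: {"brand": "acme", "price": 60},
}
inverted, direct = build_indexes(rows)
print(query(inverted, direct, "brand", "acme", "price", lambda p: p < 100))  # prints [4]
```

In this sketch the inverted index narrows the candidates to tuples 1, 2 and 4, and the direct index on `price` then discards tuple 2 (attribute undefined) and tuple 1 (predicate fails) without touching the full rows.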
URI: http://scholarbank.nus.edu.sg/handle/10635/18629
Appears in Collections: Master's Theses (Open)

Files in This Item:
report.pdf (724.23 kB, Adobe PDF, open access)

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.