Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/77711
Title: ART: A Large Scale Microblogging Data Management System
Authors: LI FENG
Keywords: microblogging, distributed system, storage, search
Issue Date: 4-Feb-2014
Citation: LI FENG (2014-02-04). ART: A Large Scale Microblogging Data Management System. ScholarBank@NUS Repository.
Abstract: Microblogging, a new form of social networking, has attracted the interest of billions of users in recent years. As its data volume keeps increasing, it has become challenging to manage these data and process queries on them efficiently. Although considerable research has been conducted on large scale data management problems, and microblogging service providers have designed scalable parallel processing systems and distributed storage systems, these approaches are still inefficient compared to traditional DBMSs, which have been studied for decades. The performance of these systems can be improved with proper optimization strategies. This thesis aims to design a scalable, efficient and fully functional microblogging data management system. We propose ART (AQUA, R-Store and TI), a large scale microblogging data management system that is able to handle various user queries (such as updates and real-time search) and data analysis queries (such as join and aggregation queries). Furthermore, ART is specifically optimized for three types of queries: multi-way join queries, real-time aggregation queries and real-time search queries. ART includes three principal modules: an offline analytics module, an OLTP and real-time analysis module, and a real-time search module. For offline analytics, ART utilizes MapReduce as the batch parallel processing engine and implements AQUA, a cost-based optimizer on top of MapReduce. In AQUA, we propose a cost model to estimate the cost of each join plan, and a near-optimal plan is selected by the plan iteration algorithm. For OLTP and real-time analysis, we implement a distributed key/value store, R-Store, which handles OLTP and real-time aggregation query processing. A real-time data cube is maintained over the historical data, and newly updated data are merged with the data cube on the fly while a real-time query is processed. For real-time search, the last component of ART is TI, a distributed real-time indexing system. Its ranking function considers the social graphs and discussion topics in the microblogging data, and a partial indexing scheme is proposed to improve the throughput of updating the real-time inverted index. The results of experiments conducted on the TPC-H data set and a real Twitter data set demonstrate that (1) the join plan selected by AQUA significantly outperforms the manually optimized plan; (2) the real-time aggregation query processing approach implemented in R-Store performs better than the default one when the selectivity of the aggregation query is high; (3) the real-time search results returned by TI are more meaningful than those produced by current ranking methods. Overall, to the best of our knowledge, this thesis is the first work that systematically studies how these queries can be processed efficiently in a large scale microblogging system.
URI: http://scholarbank.nus.edu.sg/handle/10635/77711
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: thesis_lifeng.pdf
Size: 1.96 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.