SHAPE: Scalable hadoop-based analytical processing environment | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/25054

Title:	SHAPE: Scalable hadoop-based analytical processing environment
Authors:	GUO FEI
Keywords:	MapReduce, query processing, Hadoop, distributed, OLAP
Issue Date:	11-Jan-2011
Citation:	GUO FEI (2011-01-11). SHAPE: Scalable hadoop-based analytical processing environment. ScholarBank@NUS Repository.
Abstract:	MapReduce is a parallel programming model designed for data-intensive tasks processed on commodity hardware. It provides an interface with two "simple" functions, namely map and reduce, making programs amenable to a great degree of parallelism, load balancing, workload scheduling and fault tolerance in large clusters. However, because MapReduce was not designed for generic data analytic workloads, cloud-based analytical processing systems such as Hive and Pig need to translate a query into multiple MapReduce tasks, incurring significant overhead from startup latency and intermediate-result I/O. Further, this multi-stage process makes it more difficult to locate performance bottlenecks, limiting the potential use of self-tuning techniques. In this thesis, we present SHAPE, an efficient and scalable analytical processing environment based on Hadoop, an open source implementation of MapReduce. To ease OLAP on large-scale data sets, we provide a SQL engine to cloud application developers, who can easily plug in their own functions and optimization rules. Compared to Hive or Pig, SHAPE also introduces several key innovations. First, we adopt horizontal fragmentation from distributed DBMSs to exploit data locality. Second, we efficiently perform n-way joins and aggregation in a single MapReduce task. Such an integrated approach, which is the first of its kind, considerably improves query processing performance. Last but not least, our optimizer supports rule-based, cost-based and adaptive optimization, facilitating workload-specific performance optimization and providing good opportunities for self-tuning. Our preliminary experimental study using the TPC-H benchmark shows that SHAPE outperforms Hive by a wide margin.
URI:	http://scholarbank.nus.edu.sg/handle/10635/25054
Appears in Collections:	Master's Theses (Open)

Files in This Item:

File	Description	Size	Format	Access Settings	Version
thesis.pdf		336.37 kB	Adobe PDF	OPEN	None

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.