Title: Chimera: Large-scale Data Collection and Processing
Authors: GONG JIAN
Keywords: centralized,limits,distributed,comparison,stream,large-scale
Issue Date: 12-Aug-2011
Citation: GONG JIAN (2011-08-12). Chimera: Large-scale Data Collection and Processing. ScholarBank@NUS Repository.
Abstract: Companies depend on the analysis of data collected by their applications and services to improve their products. With the rise of large online services, massive amounts of data are being produced. Known as Big Data, these datasets are expected to reach 32.2ZB globally in 2011. As traditional tools are unable to process Big Data in a timely fashion, a new paradigm for handling it, known as Stream Processing, has been proposed. There has been a lot of work on this paradigm in both the academic and commercial worlds, leading to a large number of stream processing systems with varying designs. They can be broadly classified into two categories: centralized and distributed. The former processes data atomically, while the latter breaks up a processing operation, deploys the sub-operations across multiple nodes, and combines the output from those nodes to produce the final results. In this thesis, we attempt to understand the limits of a centralized stream processing system under real-world workloads. We do this by evaluating Esper, an open-source centralized stream processor, with data from a game deployed on Facebook. We also developed our own distributed stream processing system, called Chimera, and compared it with Esper to understand how much performance can be gained by processing the same data with a distributed system. We found that Esper's performance varies widely depending on the kind of queries given to it. While the performance is very good when the queries are simple, it deteriorates quickly as the queries become complex. Therefore, although a centralized system might seem attractive due to lower deployment costs, developers might be better off using a distributed system if they process data in a complex manner. We also found that a distributed system may perform better than Esper even when both are deployed on a single machine. This is because the distributed system may be simpler in design than Esper. Therefore, if developers do not need the various features offered by Esper, using a simpler stream processing system would provide them with better performance.
Appears in Collections: Master's Theses (Open)

Files in This Item:
File: GongJ.pdf
Size: 370.31 kB
Format: Adobe PDF



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.