Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/20928
Title: Exploring time related issues in data stream processing
Authors: WU JI
Keywords: data stream, stream database, stream join, data stream scheduling, scientific sensor data processing, stream query processing
Issue Date: 2-Jul-2010
Citation: WU JI (2010-07-02). Exploring time related issues in data stream processing. ScholarBank@NUS Repository.
Abstract: The past few years have witnessed a surge of data in the form of streams, such as network traffic, stock updates and readings from sensor devices. The fast, time-varying and unbounded nature of data streams, however, challenges the traditional database management paradigm, which is intended for store-based data only. The Data Stream Management System (DSMS) has been proposed by the database community to tackle the new issues that arise from processing persistent queries over such continuous data. One can say that a DSMS query is a DBMS query extended in the time domain. This implies that both the input and the output of a DSMS query are better modelled as functions of time than as static values or sets. This observation leads us to study DSMS with an emphasis on time, the critical aspect that distinguishes stream query processing from traditional query processing.
In the first piece of work, we study time issues on stream input. Because data is accessible only sequentially in stream processing, the order of the input becomes crucial. Most stream data are naturally sorted by the time at which they are generated. This temporal order, however, is often scrambled for various reasons as the data are transmitted over the network. A scrambled tuple order poses a significant challenge to memory management for stateful operations (such as join), because these operations need a large amount of memory to buffer the received input and absorb the impact of tuple disorder. Traditionally, memory management for these operations is query-driven: a query has to explicitly define a window for each (potentially unbounded) input to bound the size of the buffer allocated for that stream. However, because input characteristics are volatile, the output produced this way may not be desirable when the window size is not part of the intended query semantics. We propose a new data-driven memory management scheme that exploits the intrinsic properties of the stream input to allocate buffer space intelligently. Results show that our new scheme not only improves the accuracy of query results but also significantly reduces the memory overhead.
Time also plays an important role in stream output. Data stream applications often involve time-critical tasks such as disaster early warning, network intrusion detection and online financial analysis. These applications impose very strict requirements on the timeliness of output delivery. Experience shows that traditional operator-based stream scheduling strategies may not always be sufficient to fulfill such real-time requirements. In the second piece of work, we focus on tuple-based stream scheduling, which features fine-grained resource control to meet these timing requirements. By drawing an analogy between tuple scheduling and job scheduling, we propose several effective resource allocation strategies inspired by the classic job scheduling problem. We also compare the pros and cons of each strategy and discuss their applicability under different scenarios.
The last piece of work is devoted to a case study of data stream applications. We built a scientific sensor data processing engine with the aim of integrating data streams collected from heterogeneous sensor stations and offering a unified data platform to query, analyze and visualize sensor information, in order to facilitate scientific research and data exploration. The time issues discussed in the earlier pieces of work are revisited in the context of scientific data stream processing, to appreciate their significance for understanding the characteristics of stream processing and, consequently, how they can be leveraged to improve system performance in practice.
To summarize, we use time as the key to approaching several important issues in DSMS. Both the experiments and the case study show that our proposed algorithms and strategies are effective in boosting the performance of data stream processing.
URI: http://scholarbank.nus.edu.sg/handle/10635/20928
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: WuJ.pdf
Size: 1.35 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None


