WU SAI

Email Address
idmws@nus.edu.sg


Organizational Units
Organizational Unit
COMPUTING
faculty
Organizational Unit
COMPUTER SCIENCE
dept
Organizational Unit

Publication Search Results

Now showing 1 - 10 of 21
  • Publication
    ES2: A cloud data storage system for supporting both OLTP and OLAP
    (2011) Cao, Y.; Chen, C.; Guo, F.; Jiang, D.; Lin, Y.; Ooi, B.C.; Vo, H.T.; Wu, S.; Xu, Q.; COMPUTER SCIENCE
    Cloud computing represents a paradigm shift driven by the increasing demand of Web based applications for elastic, scalable and efficient system architectures that can efficiently support their ever-growing data volume and large-scale data analysis. A typical data management system has to deal with real-time updates by individual users, and as well as periodical large scale analytical processing, indexing, and data extraction. While such operations may take place in the same domain, the design and development of the systems have somehow evolved independently for transactional and periodical analytical processing. Such a system-level separation has resulted in problems such as data freshness as well as serious data storage redundancy. Ideally, it would be more efficient to apply ad-hoc analytical processing on the same data directly. However, to the best of our knowledge, such an approach has not been adopted in real implementation. Intrigued by such an observation, we have designed and implemented epiC, an elastic power-aware data-itensive Cloud platform for supporting both data intensive analytical operations (ref. as OLAP) and online transactions (ref. as OLTP). In this paper, we present ES2 - the elastic data storage system of epiC, which is designed to support both functionalities within the same storage. We present the system architecture and the functions of each system component, and experimental results which demonstrate the efficiency of the system. © 2011 IEEE.
  • Publication
    SiMPSON: Efficient similarity search in metric spaces over P2P structured overlay networks
    (2009) Vu, Q.H.; Lupu, M.; Wu, S.; COMPUTER SCIENCE
    Similarity search in metric spaces over centralized systems has been significantly studied in the database research community. However, not so much work has been done in the context of P2P networks. This paper introduces SiMPSON: a P2P system supporting similarity search in metric spaces. The aim is to answer queries faster and using less resources than existing systems. For this, each peer first clusters its own data using any off-the-shelf clustering algorithms. Then, the resulting clusters are mapped to one-dimensional values. Finally, these one-dimensional values are indexed into a structured P2P overlay. Our method slightly increases the indexing overhead, but allows us to greatly reduce the number of peers and messages involved in query processing: we trade a small amount of overhead in the data publishing process for a substantial reduction of costs in the querying phase. Based on this architecture, we propose algorithms for processing range and kNN queries. Extensive experimental results validate the claims of efficiency and effectiveness of SiMPSON. © 2009 Springer.
  • Publication
    Approximate aggregations in structured P2P networks
    (2011) Sun, D.; Wu, S.; Jiang, S.; Li, J.; COMPUTER SCIENCE
    In corporate networks, daily business data are generated in gigabytes or even terabytes. It is costly to process aggregate queries in those systems. In this paper, we propose PACA, a probably approximately correct aggregate query processing scheme, for answering aggregate queries in structured Peer-to-Peer (P2P) network. PACA retrieves random samples from peers' databases and applies the samples to process queries. Instead of scanning the entire database of each peer, PACA only accesses a small random number of data. Moreover, based on the query distribution,PACA publishes a precomputed synopsis and uses the synopsis to answer future queries. Most queries are expected to be answered by the precomputed synopsis partially or fully. And the synopsis is adaptively tuned to follow the query distribution. Experiments on the PlanetLab show the effectiveness of the approach. © 2011 IEEE.
  • Publication
    Online Aggregation
    (2013) Wu, S.; Ooi, B.C.; Tan, K.-L.; COMPUTER SCIENCE
    In this chapter, we introduce a new promising technique for query processing, online aggregation. Online aggregation is proposed based on the assumption that for some applications, the precise results are not always required. Instead, the approximate results can provide a good enough estimation. Compared to the precise results, computing the approximate ones are more cost effective, especially for large-scale datasets. To generate the approximate result, online aggregation retrieves samples continuously from the database. The samples are streamed to the query engine for processing the query. The accuracy of the approximate result is described by a statistical model. Normally, the result is refined as more samples are obtained. The user can terminate the processing at any time, when he/she is satisfied with the quality of the result. The performance of online aggregation relies on the sampling approach and estimation model. In this chapter, our discussion is focused on these two components. Besides introducing the basic principles of online aggregation, we also review some new applications built on top of it. We complete the chapter by discussing the challenges of online aggregation and some future directions. © Springer-Verlag Berlin Heidelberg 2013.
  • Publication
    Evaluating large graph processing in MapReduce based on message passing
    (2011) Pan, W.; Li, Z.-H.; Wu, S.; Chen, Q.; COMPUTER SCIENCE
    Since analyzing large-scale graph is usually difficult to be implemented on a single machine, how to design efficient parallel large-scale graph algorithms is receiving more and more attention. Constrained by embarrassingly parallel assumption, parallel graph algorithms are not easy to express in MapReduce. Inspired by Bulk Synchronous Parallel model, we propose a message-enhanced version of Hadoop MapReduce that breaks its key assumption. Enhanced implementation is compatible with original Hadoop MapReduce, existing Hadoop MapReduce programs can run directly on this platform without modification, and uses message passing mechanisms to facilitate interactive data communication between supersteps of tasks. It also provides a highly flexible self-defined message passing interface and two adaptive message passing mechanisms to support efficient implementation of graph algorithms with data transition and iterative computation. The experimental results on the real Stanford large network dataset collection demonstrate the superiority of enhanced version over original Hadoop MapReduce on PageRank algorithm.
  • Publication
    The performance of mapreduce: An indepth study
    (2010) Jiang, D.; Ooi, B.C.; Shi, L.; Wu, S.; COMPUTER SCIENCE
    MapReduce has been widely used for large-scale data analysis in the Cloud. The system is well recognized for its elastic scalability and fine-grained fault tolerance although its performance has been noted to be suboptimal in the database context. According to a recent study [19], Hadoop, an open source implementation of MapReduce, is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. MapReduce can achieve better performance with the allocation of more compute nodes from the cloud to speed up computation; however, this approach of "renting more nodes" is not cost effective in a pay-as-you-go environment. Users desire an economical elastically scalable data processing system, and therefore, are interested in whether MapReduce can offer both elastic scalability and effciency. In this paper, we conduct a performance study of MapReduce (Hadoop) on a 100-node cluster of Amazon EC2 with various levels of parallelism. We identify five design factors that affect the performance of Hadoop, and investigate alternative but known methods for each factor. We show that by carefully tuning these factors, the overall performance of Hadoop can be improved by a factor of 2.5 to 3.5 for the same benchmark used in [19], and is thus more comparable to that of parallel database systems. Our results show that it is therefore possible to build a cloud data processing system that is both elastically scalable and effcient. © 2010 VLDB Endowment.
  • Publication
    Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework
    (2011) Lin, Y.; Agrawal, D.; Chen, C.; Ooi, B.C.; Wu, S.; COMPUTER SCIENCE
    To achieve high reliability and scalability, most large-scale data warehouse systems have adopted the cluster-based architecture. In this paper, we propose the design of a new cluster-based data warehouse system, LLama, a hybrid data management system which combines the features of row-wise and column-wise database systems. In Llama, columns are formed into correlation groups to provide the basis for the vertical partitioning of tables. Llama employs a distributed file system (DFS) to disseminate data among cluster nodes. Above the DFS, a MapReduce-based query engine is supported. We design a new join algorithm to facilitate fast join processing. We present a performance study on TPC-H dataset and compare Llama with Hive, a data warehouse infrastructure built on top of Hadoop. The experiment is conducted on EC2. The results show that Llama has an excellent load performance and its query performance is significantly better than the traditional MapReduce framework based on row-wise storage. © 2011 ACM.
  • Publication
    Cross domain search by exploiting Wikipedia
    (2012) Liu, C.; Wu, S.; Jiang, S.; Tung, A.K.H.; COMPUTER SCIENCE
    The abundance of Web 2.0 resources in various media formats calls for better resource integration to enrich user experience. This naturally leads to a new cross-modal resource search requirement, in which a query is a resource in one modal and the results are closely related resources in other modalities. With cross-modal search, we can better exploit existing resources. Tags associated with Web 2.0 resources are intuitive medium to link resources with different modality together. However, tagging is by nature an ad hoc activity. They often contain noises and are affected by the subjective inclination of the tagger. Consequently, linking resources simply by tags will not be reliable. In this paper, we propose an approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. We develop effective methods for cross-modal search based on the concepts associated with resources. Extensive experiments were conducted, and the results show that our solution achieves good performance. © 2012 IEEE.
  • Publication
    Efficient btree based indexing for cloud data processing
    (2010) Wu, S.; Jiang, D.; Ooi, B.C.; Wu, K.; COMPUTER SCIENCE
    A Cloud may be seen as a type of flexible computing infrastructure consisting of many compute nodes, where resizable computing capacities can be provided to different customers. To fully harness the power of the Cloud, efficient data management is needed to handle huge volumes of data and support a large number of concurrent end users. To achieve that, a scalable and high-throughput indexing scheme is generally required. Such an indexing scheme must not only incur a low maintenance cost but also support parallel search to improve scalability. In this paper, we present a novel, scalable B +-tree based indexing scheme for efficient data processing in the Cloud. Our approach can be summarized as follows. First, we build a local B +-tree index for each compute node which only indexes data residing on the node. Second, we organize the compute nodes as a structured overlay and publish a portion of the local B +-tree nodes to the overlay for efficient query processing. Finally, we propose an adaptive algorithm to select the published B +-tree nodes according to query patterns. We conduct extensive experiments on Amazon's EC2, and the results demonstrate that our indexing scheme is dynamic, efficient and scalable. © 2010 VLDB Endowment.
  • Publication
    Distributed online aggregations
    (2009) Wu, S.; Jiang, S.; Ooi, B.C.; Tan, K.L.; COMPUTER SCIENCE
    In many decision making applications, users typically issue aggregate queries. To evaluate these computationally expensive queries, online aggregation has been developed to provide approximate answers (with their respective confidence intervals) quickly, and to continuously refine the answers. In this paper, we extend the online aggregation technique to a distributed context where sites are maintained in a DHT (Distributed Hash Table) network. Our Distributed Online Aggregation (DoA) scheme iteratively and progressively produces approximate aggregate answers as follows: in each iteration, a small set of random samples are retrieved from the data sites and distributed to the processing sites; at each processing site, a local aggregate is computed based on the allocated samples; at a coordinator site, these local aggregates are combined into a global aggregate. DoA adaptively grows the number of processing nodes as the sample size increases. To further reduce the sampling overhead, the samples are retained as a precomputed synopsis over the network to be used for processing future queries. We also study how these synopsis can be maintained incrementally. We have conducted extensive experiments on PlanetLab. The results show that our DoA scheme reduces the initial waiting time significantly and provides high quality approximate answers with running confidence intervals progressively. © 2009 VLDB Endowment.