Please use this identifier to cite or link to this item: http://scholarbank.nus.edu.sg/handle/10635/134921
Title: FROM RAW DATA TO PROCESSABLE INFORMATIVE DATA: TRAINING DATA MANAGEMENT FOR BIG DATA ANALYTICS
Authors: GAO JINYANG
Keywords: Big Data, Machine Learning, Indexing, Algorithm, Active Learning, Crowdsourcing
Issue Date: 21-Sep-2016
Source: GAO JINYANG (2016-09-21). FROM RAW DATA TO PROCESSABLE INFORMATIVE DATA: TRAINING DATA MANAGEMENT FOR BIG DATA ANALYTICS. ScholarBank@NUS Repository.
Abstract: Due to the surging volume of Big Data, data-driven approaches are playing an ever-increasing role in knowledge discovery and decision making. Although cheap raw data are produced everywhere from various sources, most of them cannot be used directly as training data to benefit analytics tasks. This is mainly because raw data are usually too large to be processed directly, and their informative value is lower than that of data collected from deliberately designed experiments. To make full use of Big Data, there is an increasing need for a training data management infrastructure that transforms raw data into processable, informative data by leveraging both human effort and computational resources. In this thesis, we aim to develop effective and efficient solutions for transforming Big Data into a processable and informative form. Two challenging problems are discussed and addressed. The first challenge is to increase the information value of Big Data, mainly by acquiring extra supervised information through data annotation. We propose a preference quantified model for annotating complex tasks in which the supervised information is difficult to represent with simple labels, and adapt an active learning approach to reduce the cost of human effort. To further reduce the cost of data annotation through crowdsourcing, we develop a cost-sensitive method for crowdsourced data quality management. The second challenge is to condense and reorganize the data into a processable form without losing much of the information in the original data, which typically involves representing, compressing, indexing and sampling the data to increase computational efficiency. We propose a hashing method that transforms the training data into a more compact representation while preserving both the internal information in each instance and the external relations among instances. Moreover, we index the data, which are usually high-dimensional, to support similarity queries based on the distance independent $k$-nearest neighbor measure. Finally, we study the effect of the data sampling pattern on the efficiency of analytics model training, aiming to provide the most informative data in a processable size to speed up model training.
URI: http://scholarbank.nus.edu.sg/handle/10635/134921
Appears in Collections: Ph.D Theses (Open)

Files in This Item:
File: thesis.pdf; Size: 2.27 MB; Format: Adobe PDF; Access Settings: OPEN; Version: None




Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.