A model driven approach to imbalanced data learning | ScholarBank@NUS

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/30689

Title:	A model driven approach to imbalanced data learning
Authors:	YIN HONGLI
Keywords:	Imbalanced Data, Model Driven Sampling, Data Sampling, Domain Knowledge, Context Sensitive, Asthma Control
Issue Date:	15-Mar-2011
Citation:	YIN HONGLI (2011-03-15). A model driven approach to imbalanced data learning. ScholarBank@NUS Repository.
Abstract:	Many real-life problems, especially in health care and biomedicine, are characterized by imbalanced data. In general, people tend to be more interested in rare events or phenomena. For example, in prognostic prediction, identifying the small group of patients who may not recover in time lets physicians take the necessary precautions to reduce their risks. Traditional machine learning algorithms often fail to predict the minorities that are of interest. The objective of imbalanced data learning is to correctly identify the rarities without sacrificing prediction of the majorities. In this thesis, we review the existing approaches to the imbalanced data problem, including data-level approaches and algorithm-level approaches. Most data sampling approaches are ad hoc, and the exact mechanisms by which they improve prediction performance are not clear. For example, random oversampling generates duplicate samples to "fool" the classifier into biasing its decision in favor of minorities. Oversampling often leads to overfitting, and undersampling tends to remove useful information from the original data set. The Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic data from the nearest neighbors, but it only makes use of local information and often leads to over-generalization. On the other hand, most of the algorithm-level approaches have been shown to be equivalent to data sampling approaches. Some other approaches make additional assumptions. For example, a popular approach is cost-sensitive learning, which assigns different cost values to different types of misclassification; but the cost values are usually unknown, and it is hard to discover the right ones. We propose a model driven sampling (MDS) approach that can generate new samples based on a global understanding of the entire data set and domain experts' knowledge.
This is a first attempt to make use of probabilistic graphical methods to represent the training space and generate synthetic data. Our empirical studies show that in a large class of problems, MDS generally outperforms previous approaches, and in the worst case performs comparably to the best previous approach. It performs especially well for extremely imbalanced data without complex connected structures. MDS also works well when domain knowledge is available, as the model created with domain knowledge is better "educated" than one constructed purely from training data, and thus the synthetic data generated are more meaningful. We have also extended MDS to context-sensitive MDS and progressive MDS. Context-sensitive MDS reduces the problem size by creating more accurate sub-models for each individual context. Therefore, the data sampled from context-sensitive MDS are more relevant to each context. Instead of assuming the optimal distribution is balanced, progressive MDS iterates over all possible data distributions and selects the best-performing data distribution as the optimal distribution. Therefore, progressive MDS improves over MDS by always obtaining the optimal data distribution, as shown by our empirical studies.
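
The two baseline data-level techniques that the abstract contrasts with MDS can be illustrated with a minimal sketch. This is not the thesis's MDS method and not any library's implementation; the function names `random_oversample` and `smote_like`, the toy `minority` data, and the choice of `k` are all assumptions made for illustration. The sketch shows why random oversampling only yields exact duplicates, while SMOTE-style sampling yields new points that stay on line segments between nearby minority samples (local information only).

```python
import random

def random_oversample(minority, n_new, rng):
    """Random oversampling: draw exact duplicates of existing minority rows."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, k, rng):
    """SMOTE-style sampling (illustrative): interpolate between a minority
    row and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority rows by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical toy minority class with two features.
rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
duplicates = random_oversample(minority, 6, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)
```

Every row in `duplicates` already exists in `minority`, which is the source of the overfitting risk mentioned above. Every row in `synthetic` lies between two neighbouring minority rows, so the new data reflect only the local neighbourhood rather than a global model of the data set.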
URI:	http://scholarbank.nus.edu.sg/handle/10635/30689
Appears in Collections:	Ph.D Theses (Open)

Files in This Item:

File	Description	Size	Format	Access Settings	Version
HL_PHD_thesis_Final_Submission.pdf		1.92 MB	Adobe PDF	OPEN	None
