Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks

Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/192780

DC Field	Value
dc.title	Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks
dc.contributor.author	Hargreaves, Carol Anne
dc.contributor.author	Heng, Wee Lin Eunice
dc.date.accessioned	2021-07-01T06:03:51Z
dc.date.available	2021-07-01T06:03:51Z
dc.date.issued	2021-05-31
dc.identifier.citation	Hargreaves, Carol Anne, Heng, Wee Lin Eunice (2021-05-31). Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks. Clinical Medicine Journal 7 : 49-11. ScholarBank@NUS Repository.
dc.identifier.issn	23817631
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/192780
dc.description.abstract	Generative Adversarial Networks (GANs) are a relatively new research avenue in the domain of Deep Learning and Artificial Intelligence. Over the past few years, GANs have been extensively researched because of their ability to generate realistic synthetic data. The creation of synthetic tabular data is especially useful when it is desirable to avoid using the original data set for privacy reasons. Therefore, the objective of this paper was to review the effectiveness of GANs in simulating synthetic tabular diabetes data. Methodology: Prior to GAN training, we applied min-max normalization to the features. To analyze the similarity between the real data set and each synthetic data set, we conducted exploratory data analysis before employing statistical methods. We compared the synthesized data with the original data using data visualizations, including histograms and boxplots. We also computed confidence intervals for the means of the real data variables and compared them with the confidence intervals for the means of the synthetic data. Results: The results showed that 8 of the 9 confidence intervals overlapped. We also checked whether the mean of a particular variable in the synthetic data set fell into the confidence interval of the same variable in the real data set. For each variable, we had two different probability distributions: the true distribution (from the real data) and an approximation of that distribution (from the synthetic data). To quantify the difference between the two distributions, we computed the Kullback-Leibler (KL) divergence score. The KL scores for all 8 predictors were relatively small and close to 0, which is ideal. One model for classifying patients as having diabetes was built using only the real data. Another was built using the combined real and synthetic data. The model using the combined real and synthetic data achieved an accuracy of 87.0%, compared with 78.7% when using only the real data. Conclusion: We built a realistic synthetic data set using generative adversarial networks. The synthetic data set proved to be very similar to the real data set and could successfully replace the real data for research analysis. Further, we verified that the availability of more training data for diabetes classification helped to improve the accuracy of the classifier, while achieving a relatively high recall.
dc.publisher	American Institute of Science
dc.source	Elements
dc.type	Article
dc.date.updated	2021-06-30T12:56:55Z
dc.contributor.department	STATISTICS & APPLIED PROBABILITY
dc.description.sourcetitle	Clinical Medicine Journal
dc.description.volume	7
dc.description.page	49-11
dc.description.place	Singapore
dc.published.state	Published
Appears in Collections:	Staff Publications Elements
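
The similarity checks named in the abstract (min-max normalization, confidence intervals for the means, and the KL divergence between the real and synthetic distributions) can be illustrated with a minimal sketch. The abstract does not state the confidence level, the binning scheme, or the software used, so the 95% normal-approximation interval, the 10 equal-width bins, the smoothing constant, and all function names below are assumptions made for illustration, not the authors' implementation. The sample values are randomly generated placeholders, not data from the paper.

```python
import math
import random
import statistics


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_confidence_interval(values, z=1.96):
    """Normal-approximation interval for the mean (z=1.96 is roughly 95%; assumed)."""
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two closed intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def histogram_probs(values, bins, lo, hi, eps=1e-9):
    """Equal-width histogram on [lo, hi], smoothed by eps so no bin has probability 0."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        counts[max(0, min(idx, bins - 1))] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL(P || Q) in nats; 0 means the two histograms are identical."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_real_vs_synthetic(real, synthetic, bins=10):
    """KL divergence of the real histogram from the synthetic one on a shared range."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    p = histogram_probs(real, bins, lo, hi)
    q = histogram_probs(synthetic, bins, lo, hi)
    return kl_divergence(p, q)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder values standing in for one real and one synthetic variable.
    real = [rng.gauss(120, 30) for _ in range(500)]
    synthetic = [rng.gauss(122, 31) for _ in range(500)]

    ci_real = mean_confidence_interval(real)
    ci_syn = mean_confidence_interval(synthetic)
    print("real CI:", ci_real)
    print("synthetic CI:", ci_syn)
    print("intervals overlap:", intervals_overlap(ci_real, ci_syn))
    print("synthetic mean inside real CI:",
          ci_real[0] <= statistics.fmean(synthetic) <= ci_real[1])
    print("KL(real || synthetic):", kl_real_vs_synthetic(real, synthetic))
    print("normalized range:", min(min_max_normalize(real)), max(min_max_normalize(real)))
```

Both histograms are built on a shared range so that the bins line up; normalizing each data set separately before binning would hide differences in location and scale.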

Files in This Item:

File	Description	Size	Format	Access Settings	Version
Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks.pdf		441.8 kB	Adobe PDF	OPEN	Published
