Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/192780
DC Field: Value
dc.title: Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks
dc.contributor.author: Hargreaves, Carol Anne
dc.contributor.author: Heng, Wee Lin Eunice
dc.date.accessioned: 2021-07-01T06:03:51Z
dc.date.available: 2021-07-01T06:03:51Z
dc.date.issued: 2021-05-31
dc.identifier.citation: Hargreaves, Carol Anne, Heng, Wee Lin Eunice (2021-05-31). Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks. Clinical Medicine Journal 7 : 49-11. ScholarBank@NUS Repository.
dc.identifier.issn: 23817631
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/192780
dc.description.abstract: Generative Adversarial Networks (GANs) are a relatively new research avenue in Deep Learning and Artificial Intelligence. Over the past few years, GANs have been researched extensively because of their ability to generate realistic synthetic data. The creation of synthetic tabular data is especially useful when the original data set cannot be used for privacy reasons. The objective of this paper was therefore to review the effectiveness of GANs in simulating synthetic tabular diabetes data. Methodology: Prior to GAN training, we applied min-max normalization to the features. To analyze the similarity between the real data set and each synthetic data set, we conducted exploratory data analysis before applying statistical methods. We compared the synthesized data with the original data using data visualizations, including histograms and boxplots. We also computed confidence intervals for the means of the real data variables and compared them with the confidence intervals for the means of the synthetic data. Results: The results showed that 8 of the 9 confidence intervals overlapped. We also checked whether the mean of each variable in the synthetic data set fell within the confidence interval of the same variable in the real data set. For each variable, we had two probability distributions: the true distribution (from the real data) and an approximation of that distribution (from the synthetic data). To quantify the difference between the two distributions, we computed the Kullback-Leibler (KL) divergence. The KL scores for all 8 predictors were relatively small and close to 0, which is ideal. One model for classifying patients as having diabetes was built using only the real data; another was built using the combined real and synthetic data. The model using the combined real and synthetic data achieved a much higher accuracy of 87.0%, compared with 78.7% when using only the real data. Conclusion: We built a realistic synthetic data set using generative adversarial networks. The synthetic data set proved to be very similar to the real data set and could replace the real data in analyses for research purposes. Further, we verified that the availability of more training data for diabetes classification helped to improve the accuracy of the classifier while achieving a relatively high recall.
dc.publisher: American Institute of Science
dc.source: Elements
dc.type: Article
dc.date.updated: 2021-06-30T12:56:55Z
dc.contributor.department: STATISTICS & APPLIED PROBABILITY
dc.description.sourcetitle: Clinical Medicine Journal
dc.description.volume: 7
dc.description.page: 49-11
dc.description.place: Singapore
dc.published.state: Published
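
The abstract names three checks: min-max normalization, comparison of confidence intervals for the means, and KL divergence between the real and synthetic distributions. The following is a minimal illustrative sketch of those checks in plain Python, not the authors' implementation. The function names, the normal-approximation 95% interval (z = 1.96), the 10-bin histogram estimate, the smoothing constant, and the toy Gaussian samples standing in for a real and a synthetic feature are all assumptions made for illustration; the record does not state how the paper computed these quantities.

```python
import math
import random
from statistics import mean, stdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]


def mean_confidence_interval(values, z=1.96):
    """Approximate CI for the mean (normal approximation; an assumption)."""
    m = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return m - half_width, m + half_width


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def histogram_probs(values, bins, lo, hi, eps=1e-9):
    """Binned probability estimate on [lo, hi] (requires hi > lo).

    eps smoothing keeps every bin strictly positive so the log is defined.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    probs = [c / len(values) + eps for c in counts]
    total = sum(probs)
    return [p / total for p in probs]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy stand-ins for one feature; these are NOT the paper's data or GAN output.
random.seed(0)
real = [random.gauss(120, 30) for _ in range(500)]
synthetic = [random.gauss(121, 31) for _ in range(500)]

real_scaled = min_max_normalize(real)

# Use a shared range so both histograms have identical bin edges.
lo, hi = min(real + synthetic), max(real + synthetic)
p = histogram_probs(real, 10, lo, hi)
q = histogram_probs(synthetic, 10, lo, hi)
kl = kl_divergence(p, q)

ci_real = mean_confidence_interval(real)
ci_syn = mean_confidence_interval(synthetic)
print("KL(real || synthetic):", kl)
print("CIs overlap:", intervals_overlap(ci_real, ci_syn))
print("synthetic mean inside real CI:", ci_real[0] <= mean(synthetic) <= ci_real[1])
```

A KL value near 0 means the two binned distributions are close. Because KL divergence is asymmetric, the order of the arguments (real first, synthetic second) is a choice made here and is not taken from the record.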
Appears in Collections: Staff Publications
Elements

Files in This Item:
File: Simulation of Synthetic Diabetes Tabular Data Using Generative Adversarial Networks.pdf
Size: 441.8 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Published

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.