Title: Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries
Authors: Kumar, P.
Ma, X. 
Liu, X.
Jia, J.
Bucong, H.
Xue, Y.
Li, Z.R.
Yang, S.Y.
Wei, Y.Q.
Chen, Y.Z. 
Keywords: Bioinformatics
Computer aided drug design
Drug safety
Machine learning
Statistical learning
Support vector machine
Issue Date: May-2011
Citation: Kumar, P., Ma, X., Liu, X., Jia, J., Bucong, H., Xue, Y., Li, Z.R., Yang, S.Y., Wei, Y.Q., Chen, Y.Z. (2011-05). Effect of training data size and noise level on support vector machines virtual screening of genotoxic compounds from large compound libraries. Journal of Computer-Aided Molecular Design 25 (5) : 455-467. ScholarBank@NUS Repository.
Abstract: Various in vitro and in-silico methods have been used for drug genotoxicity tests, which show limited genotoxicity (GT+) and non-genotoxicity (GT-) identification rates. New methods and combinatorial approaches have been explored for enhanced collective identification capability. The rates of in-silico methods may be further improved by significantly diversified training data enriched by the large number of recently reported GT+ and GT- compounds, but a major concern is the increased noise levels arising from high false-positive rates of in vitro data. In this work, we evaluated the effect of training data size and noise level on the performance of the support vector machines (SVM) method, which is known to tolerate high noise levels in training data. Two SVMs of different diversity/noise levels were developed and tested. H-SVM, trained by higher diversity higher noise data (GT+ in any in vivo or in vitro test), outperforms L-SVM, trained by lower noise lower diversity data (GT+ in in vivo or Ames test only). H-SVM trained by 4,763 GT+ compounds reported before 2008 and 8,232 GT- compounds excluding clinical trial drugs correctly identified 81.6% of the 38 GT+ compounds reported since 2008, predicted 83.1% of the 2,008 clinical trial drugs as GT-, and 23.96% of 168K MDDR and 27.23% of 17.86M PubChem compounds as GT+. These are comparable to the 43.1-51.9% GT+ and 75-93% GT- rates of existing in-silico methods, 58.8% GT+ and 79% GT- rates of Ames method, and the estimated percentages of 23% in vivo and 31-33% in vitro GT+ compounds in the "universe of chemicals". There is a substantial level of agreement between H-SVM and L-SVM predicted GT+ and GT- MDDR compounds and the prediction from TOPKAT. SVM showed good potential in identifying GT+ compounds from large compound libraries based on higher diversity and higher noise training data. © 2011 Springer Science+Business Media B.V.
Source Title: Journal of Computer-Aided Molecular Design
ISSN: 0920654X
DOI: 10.1007/s10822-011-9431-3
Appears in Collections: Staff Publications

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.