Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability?

Please use this identifier to cite or link to this item: https://doi.org/10.1002/cem.3349

DC Field	Value
dc.title	Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability?
dc.contributor.author	Lovrić, M.
dc.contributor.author	Pavlović, K.
dc.contributor.author	Žuvela, P.
dc.contributor.author	Spataru, Adrian
dc.contributor.author	Lučić, B.
dc.contributor.author	Kern, Roman
dc.contributor.author	Wong, Ming Wah
dc.date.accessioned	2022-10-11T08:08:48Z
dc.date.available	2022-10-11T08:08:48Z
dc.date.issued	2021-05-07
dc.identifier.citation	Lovrić, M., Pavlović, K., Žuvela, P., Spataru, Adrian, Lučić, B., Kern, Roman, Wong, Ming Wah (2021-05-07). Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability? Journal of Chemometrics 35 (7-8) : e3349. ScholarBank@NUS Repository. https://doi.org/10.1002/cem.3349
dc.identifier.issn	0886-9383
dc.identifier.uri	https://scholarbank.nus.edu.sg/handle/10635/232224
dc.description.abstract	We present a collection of publicly available intrinsic aqueous solubility data of 829 drug-like compounds. Four different machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]), coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization, were used for the prediction of solubility based on chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set, with a root mean square error, RMSE(test), of 0.70 log points, an R²(test) of 0.80, and 105 features. Taking into account the number of descriptors as well, an RF model achieves the best balance between complexity and predictive ability, with an RMSE(test) of 0.72 log points, an R²(test) of 0.78, and only 17 features. On a more aggressive test set (principal component analysis [PCA]-based split), better generalization was observed for the RF model. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. Out of the two best learners, a consensus model was built exhibiting the best predictive ability and generalization, with an RMSE(test) of 0.67 log points and an R²(test) of 0.81. © 2021 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.
dc.publisher	John Wiley and Sons Ltd
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.source	Scopus OA2021
dc.subject	consensus modeling
dc.subject	LASSO
dc.subject	LightGBM
dc.subject	PCA
dc.subject	permutation importance
dc.subject	QSAR
dc.subject	random forests
dc.type	Article
dc.contributor.department	CHEMISTRY
dc.description.doi	10.1002/cem.3349
dc.description.sourcetitle	Journal of Chemometrics
dc.description.volume	35
dc.description.issue	7-8
dc.description.page	e3349
Appears in Collections:	Elements Staff Publications
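
The abstract reports RMSE(test) and R²(test) for single learners and for a consensus of the two best learners. The sketch below shows one way such a comparison could be computed. It assumes the consensus is an unweighted average of two models' predictions, which the abstract does not state. The function names, the choice of LASSO and RF as the two learners, and all numeric values are hypothetical illustrations, not data or results from the paper.

```python
from math import sqrt
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def consensus(pred_a, pred_b):
    """Assumed consensus rule: unweighted mean of two models' predictions."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


# Made-up log-solubility values for four hypothetical test compounds.
y_test = [-1.0, -2.0, -3.0, -4.0]
pred_lasso = [-1.5, -2.5, -2.5, -3.5]  # hypothetical LASSO predictions
pred_rf = [-0.5, -2.0, -3.5, -4.0]     # hypothetical RF predictions
pred_cons = consensus(pred_lasso, pred_rf)

for name, pred in [("LASSO", pred_lasso), ("RF", pred_rf), ("consensus", pred_cons)]:
    print(f"{name}: RMSE={rmse(y_test, pred):.3f}, R2={r2(y_test, pred):.3f}")
```

In this toy data the two models err in partly opposite directions, so averaging lowers the error. That cancellation is the usual motivation for a consensus model, but it is not guaranteed for real predictions.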

Files in This Item:

File	Description	Size	Format	Access Settings	Version
10_1002_cem_3349.pdf		6.49 MB	Adobe PDF	OPEN	None	View/Download

This item is licensed under a Creative Commons License
