Please use this identifier to cite or link to this item: https://doi.org/10.1002/cem.3349
Title: Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability?
Authors: Lovrić, M.
Pavlović, K.
Žuvela, P.
Spataru, Adrian
Lučić, B.
Kern, Roman
Wong, Ming Wah
Keywords: consensus modeling
LASSO
LightGBM
PCA
permutation importance
QSAR
random forests
Issue Date: 7-May-2021
Publisher: John Wiley and Sons Ltd
Citation: Lovrić, M., Pavlović, K., Žuvela, P., Spataru, Adrian, Lučić, B., Kern, Roman, Wong, Ming Wah (2021-05-07). Machine learning in prediction of intrinsic aqueous solubility of drug-like compounds: Generalization, complexity, or predictive ability? Journal of Chemometrics 35 (7-8) : e3349. ScholarBank@NUS Repository. https://doi.org/10.1002/cem.3349
Rights: Attribution 4.0 International
Abstract: We present a collection of publicly available intrinsic aqueous solubility data of 829 drug-like compounds. Four different machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]) coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization were used for the prediction of solubility based on chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set with a root mean square error (RMSE) (test) of 0.70 log points, an R²(test) of 0.80, and 105 features. Taking into account the number of descriptors as well, an RF model achieves the best balance between complexity and predictive ability with an RMSE(test) of 0.72 log points, an R²(test) of 0.78, and with only 17 features. On a more aggressive test set (principal component analysis [PCA]-based split), better generalization was observed for the RF model. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. Out of the two best learners, a consensus model was built exhibiting the best predictive ability and generalization with an RMSE(test) of 0.67 log points and an R²(test) of 0.81. © 2021 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.
Source Title: Journal of Chemometrics
URI: https://scholarbank.nus.edu.sg/handle/10635/232224
ISSN: 0886-9383
DOI: 10.1002/cem.3349
Appears in Collections: Elements
Staff Publications

Files in This Item:
File: 10_1002_cem_3349.pdf
Size: 6.49 MB
Format: Adobe PDF
Access Settings: OPEN
Version: None
This item is licensed under a Creative Commons License.
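
Illustrative note on the abstract: the abstract reports RMSE(test) and R²(test) for individual learners and for a consensus of the two best learners. The sketch below shows how these two metrics are computed and how a consensus prediction could be formed. It assumes the consensus is a plain unweighted average of two models' predictions, which this record does not confirm. The function names and all numbers are hypothetical toy values and are not data or results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def consensus(pred_a, pred_b):
    """ASSUMPTION: consensus = unweighted mean of two models' predictions."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


# Hypothetical toy values on a log-solubility scale (not from the paper).
y_true = [-3.0, -1.0, -4.0, -2.0]
pred_model_a = [-2.5, -1.5, -3.5, -2.5]  # stand-in for one learner
pred_model_b = [-3.3, -0.7, -4.3, -1.7]  # stand-in for a second learner
pred_consensus = consensus(pred_model_a, pred_model_b)

for name, pred in [("model A", pred_model_a),
                   ("model B", pred_model_b),
                   ("consensus", pred_consensus)]:
    print(f"{name}: RMSE={rmse(y_true, pred):.2f}, R2={r2(y_true, pred):.2f}")
```

In this toy case the two models err in opposite directions, so averaging cancels part of the error. That is one common reason a consensus can outperform its members, though whether it does so on real data depends on how correlated the models' errors are.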