Fig. 2: Training of machine learning models and computational screening. | Nature Communications

From: Discovery of senolytics using machine learning

a Pipeline for model training, compound screening, and hit validation. Several classification scores were used as performance metrics to determine the most suitable model for the computational screen. b Results from three machine learning models trained on 2523 compounds (Fig. 1a) and a reduced set of 165 features (Supplementary Fig. 1a); bar plots show mean performance metrics from 5-fold cross-validation, with error bars denoting one standard deviation across the n = 5 folds. c Confusion matrices computed from models trained on 70% of the compounds and tested on the remaining 30% (17 positives and 740 negatives), which were held out from training. All models displayed poor performance metrics (Supplementary Table 1), and we chose the XGBoost algorithm for screening because of its lower number of false positives. d Results from the computational screen of the L2100 TargetMol Anticancer and L3800 Selleck FDA-approved & Passed Phase chemical libraries, totalling 4340 compounds. The XGBoost model is highly selective and scored most compounds with a low probability of having senolytic action; only N = 21 compounds were scored with P > 44%, and these were selected for experimental validation. e Compounds selected for experimental validation, ranked according to their z-score normalised prediction scores from the XGBoost model; the selected compounds are far outliers in the distribution of panel d. f Two-dimensional t-SNE visualisation of all compounds employed in this work; t-SNE plots were generated with perplexity 50, learning rate 200, and a maximum of 1200 iterations (ref. 65). Compounds with prediction scores above 44% from the XGBoost model are marked with orange circles.
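
The false-positive argument in panel c can be made concrete with a minimal sketch that derives precision and recall from confusion-matrix counts. Only the hold-out class sizes (17 positives, 740 negatives) come from the caption. The counts assigned to `model_a` and `model_b` are hypothetical, invented for illustration, and are not the values reported in the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (positive class = senolytic)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def precision(self) -> float:
        # Fraction of predicted positives that are real positives.
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        # Fraction of real positives that the model recovers.
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def confusion_from_labels(y_true, y_pred) -> Confusion:
    """Count outcomes from paired 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
    )


# Hypothetical counts on a hold-out set of 17 positives and 740 negatives.
model_a = Confusion(tp=5, fp=40, fn=12, tn=700)  # many false positives
model_b = Confusion(tp=4, fp=4, fn=13, tn=736)   # few false positives

for name, cm in (("model_a", model_a), ("model_b", model_b)):
    print(f"{name}: precision={cm.precision:.3f} "
          f"recall={cm.recall:.3f} f1={cm.f1:.3f}")
```

With so few positives in the hold-out set, a handful of extra false positives can dominate the list of predicted hits. In this made-up example the two models recover a similar number of true positives, yet the one with fewer false positives has much higher precision, which is the property that matters when every predicted hit must be validated experimentally.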
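
The selection step described for panels d and e (keep compounds scored above 44%, then rank them by z-score) might look roughly like the sketch below. It assumes z-scores are taken against the mean and population standard deviation of all screened scores; the caption does not say which estimator was used. The compound names and scores are made up, and `select_hits` is a hypothetical helper, not a function from the authors' code.

```python
from statistics import mean, pstdev


def z_scores(scores: dict) -> dict:
    """Z-score normalise prediction scores across the whole library."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return {name: (s - mu) / sigma for name, s in scores.items()}


def select_hits(scores: dict, threshold: float = 0.44) -> list:
    """Return (name, score, z) for scores above threshold, highest z first."""
    z = z_scores(scores)
    hits = [(name, s, z[name]) for name, s in scores.items() if s > threshold]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up predicted probabilities of senolytic action for six compounds.
library_scores = {
    "cmpd_A": 0.02,
    "cmpd_B": 0.05,
    "cmpd_C": 0.01,
    "cmpd_D": 0.61,
    "cmpd_E": 0.03,
    "cmpd_F": 0.47,
}

hits = select_hits(library_scores)
for name, score, z in hits:
    print(f"{name}: P={score:.2f} z={z:.2f}")
```

Because z-scores are a monotonic rescaling of the raw scores, ranking by z-score gives the same order as ranking by probability. What the normalisation adds is a measure of how far each hit sits from the bulk of the library.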
