Fig. 4: Our data-driven model of predictors identification. | Communications Chemistry

From: Predicting permeation of compounds across the outer membrane of P. aeruginosa using molecular descriptors

a A hierarchical clustering algorithm is used to select different combinations of x descriptors. A random forest classifier is trained on the x descriptors together with the IC50 ratios, and the descriptors' performance is scored accordingly. Over the course of several random selections of x descriptors, the aggregated descriptor scores are used to rank the clusters according to predictability. The lowest-ranked cluster is eliminated and the value of x is reduced. In parallel, for each classification run, the fitted model is tested on a separate set of compounds and the evaluation metrics are stored. b Model accuracy for each cycle of the model. Individual circles represent the average accuracy score of a single random combination of x descriptors using a random forest classifier over 50 random training/validation splits. The dashed green line represents the average accuracy score for a random forest classifier using the full set of 174 descriptors. c Top-9 clusters ranked according to their testing performance. The table in the left panel lists each cluster's number, its size (the number of descriptors in the cluster), and the type of descriptors it contains. The central panel shows the aggregated cluster score, where all values add up to 104, the total number of runs for a particular value of x. The right panel lists the top-9 optimal descriptors, which produce a testing accuracy of 96.2%.
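The elimination cycle in panel a can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the clusters are taken as given (the hierarchical clustering step is omitted), one descriptor is drawn at random from each cluster per run, and the random forest is abstracted as a hypothetical callable `score_fn` that returns a non-negative score per descriptor. Per-run scores are assumed to be normalized to sum to 1, which would make the aggregated cluster scores add up to the number of runs, as the caption notes for panel c. The names `eliminate_clusters`, `score_fn`, and the toy data are invented for the example.

```python
import random
from collections import defaultdict


def eliminate_clusters(clusters, score_fn, n_runs, min_clusters, seed=0):
    """Repeatedly drop the lowest-scoring cluster.

    clusters: dict mapping cluster id -> list of descriptor names.
    score_fn: hypothetical stand-in for the classifier; takes a list of
        descriptors and returns a dict descriptor -> non-negative score.
    """
    rng = random.Random(seed)
    active = dict(clusters)
    history = []  # (eliminated cluster id, aggregated scores) per cycle
    while len(active) > min_clusters:
        totals = defaultdict(float)
        for _ in range(n_runs):
            # Assumption: one representative descriptor per cluster.
            picks = {cid: rng.choice(members) for cid, members in active.items()}
            raw = score_fn(list(picks.values()))
            norm = sum(raw.values())
            if norm <= 0:
                raise ValueError("score_fn returned no positive scores")
            # Normalize so each run contributes a total score of 1.
            for cid, desc in picks.items():
                totals[cid] += raw[desc] / norm
        worst = min(totals, key=totals.get)
        history.append((worst, dict(totals)))
        del active[worst]  # reduces x by one for the next cycle
    return active, history


# Toy data: fixed weights play the role of classifier-derived scores.
toy_clusters = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"]}
toy_weights = {"a1": 5.0, "a2": 4.0, "b1": 2.0, "c1": 0.5, "c2": 0.1}


def toy_score(descriptors):
    return {d: toy_weights[d] for d in descriptors}


survivors, history = eliminate_clusters(
    toy_clusters, toy_score, n_runs=104, min_clusters=2
)
print(sorted(survivors))
print(history[0][0])
```

In a real pipeline, `score_fn` would train a classifier on the selected descriptors and return something like its feature importances, and a held-out set of compounds would be scored in the same loop to track testing accuracy.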
