Fig. 5: Model prediction analysis.

a Classification of compounds according to their predictability by our model. 100 random samples of 120 compounds each were tested on the remainder of the data. Compounds that were correctly predicted in every model realization are represented by a green bar pointing above the x-axis (set G). Compounds that were incorrectly predicted in every run are represented by a red bar pointing below the x-axis (set R). Compounds that were correctly predicted in some runs and incorrectly predicted in others are represented by blue bars pointing both ways (set B). The color bar at the bottom indicates the structural chemotype to which a given compound belongs, as defined in Fig. 1. b Probability density q(y) as a function of the probability value y associated with each category of descriptors (G, R, and B) for the dominant target class, i.e., \(y=\max \{{p}_{0},{p}_{1}\}\), where p0 and p1 are the probabilities of being a weak or a strong permeator, respectively. Vertical lines indicate the average probability \(\bar{y}\) for each case. c Number of compounds and percentage of each set (G, R, and B, as defined in a) for each structural chemotype, following the same color scheme and ordering as in a. d Analysis of three subgroups, selected according to a complete Tanimoto similarity analysis, that contain a relevant number of compounds from the sets R (inverted triangles) and B (squares). Each panel shows one of the subgroups (SB201, SB71, and SB168) in the space of two descriptors identified by our model (Fig. 4a), compared to their respective experimental class: strong permeator (red) and weak permeator (blue). The dashed line in the left panel is produced by a support vector machine classification algorithm.
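The set assignment in panel a and the quantity plotted in panel b can be illustrated with a minimal sketch. This is not the authors' code: the function names, the boolean `correct` and `tested` arrays (runs × compounds), and the `"-"` label for compounds that were never tested are assumptions introduced here for illustration.

```python
import numpy as np

def classify_predictability(correct, tested=None):
    """Assign each compound to set G, R, or B (hypothetical helper).

    correct: bool array (n_runs, n_compounds), True where the prediction
             matched the experimental class in that run.
    tested:  optional bool array of the same shape, True where the compound
             was part of the test data in that run (assumed: all True).
    """
    correct = np.asarray(correct, dtype=bool)
    if tested is None:
        tested = np.ones_like(correct, dtype=bool)
    else:
        tested = np.asarray(tested, dtype=bool)

    n_tested = tested.sum(axis=0)                # runs in which each compound was tested
    n_correct = (correct & tested).sum(axis=0)   # runs in which it was predicted correctly

    labels = np.full(correct.shape[1], "B", dtype="<U1")  # mixed outcome by default
    labels[n_correct == n_tested] = "G"          # correct in every run
    labels[n_correct == 0] = "R"                 # incorrect in every run
    labels[n_tested == 0] = "-"                  # never tested (assumed placeholder)
    return labels

def dominant_probability(p0, p1):
    """Return y = max{p0, p1}, the probability of the dominant class."""
    return np.maximum(np.asarray(p0, dtype=float), np.asarray(p1, dtype=float))

# Toy example: 3 runs, 3 compounds
runs = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [1, 0, 1]], dtype=bool)
print(classify_predictability(runs))
print(dominant_probability([0.2, 0.7], [0.8, 0.3]))
```

If p0 and p1 sum to one, y lies between 0.5 and 1, so values of \(\bar{y}\) close to 0.5 would correspond to predictions made with low confidence.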