Extended Data Figure 4: Development of the random forest classifier.
From: DNA methylation-based classification of central nervous system tumours

a, The random forest training consists of four steps. First, basic filtering removes probes that are not included on the EPIC array, probes located on the X and Y chromosomes, probes affected by single nucleotide polymorphisms, and probes not mapping uniquely to the genome. In the second step, the probe-wise batch effects between samples from FFPE and frozen material are estimated and adjusted using a linear model. In the third step, feature selection is performed by training a random forest using all probes and selecting the 10,000 probes with the highest variable importance. In the last step, the final random forest is trained using only the 10,000 selected probes. The validation of the random forest classifier involves a threefold nested cross-validation. In the outer loop of the cross-validation, the complete four-step training procedure described above is applied to the training data, and the resulting random forest is used to predict the test data to generate random forest scores. In the inner loop of the cross-validation, a threefold cross-validation is applied to the training data of the outer loop in order to generate random forest scores that are independent of the test data in the outer loop. These scores are then used to fit a calibration model, that is, an L2-penalized multinomial logistic regression, which takes the random forest scores of the test data in the outer cross-validation loop to estimate tumour class probabilities (P1, P2, P3). To fit the calibration model that estimates class probabilities of diagnostic samples using all data in the reference set, the random forest scores generated in the outer cross-validation loop are used.

b, Schematic depiction of three example binary decision trees of the random forest classifier (left), and magnification of five example decision nodes relevant for glioblastoma classification (right).
For prediction, a diagnostic sample enters the root node of each of the 10,000 trees. At every decision node, the decision path is determined by the methylation level of a single CpG, until the sample reaches a terminal node that provides the class prediction. The joint class prediction of all trees represents the raw prediction score. The colour code and abbreviations are identical to Fig. 1a.
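
The index bookkeeping of the threefold nested cross-validation in panel a can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the plain (unstratified) random fold assignment, and the sample count of 30 are assumptions made for the example. Comments mark where the four-step random forest training and the calibration fit would take place.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and deal them into k disjoint, roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for a k-fold nested CV.

    inner_splits is a list of (inner_train, inner_test) pairs that are drawn
    only from outer_train, so inner scores never depend on outer_test.
    """
    rng = random.Random(seed)
    outer = k_folds(range(n_samples), k, rng)
    for i, outer_test in enumerate(outer):
        outer_train = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = k_folds(outer_train, k, rng)
        inner_splits = []
        for m, inner_test in enumerate(inner):
            inner_train = [s for j, fold in enumerate(inner) if j != m for s in fold]
            # Here: run the four training steps on inner_train and score
            # inner_test to obtain scores for fitting the calibration model.
            inner_splits.append((inner_train, inner_test))
        # Here: run the four training steps on outer_train, score outer_test,
        # and pass those scores through the calibration model.
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(n_samples=30))
```

Every sample of an outer training set appears in exactly one inner test fold, so each receives one score from a forest that saw neither that sample nor the outer test data.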
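
The prediction step in panel b can be illustrated with a toy forest. The sketch below is hypothetical throughout: the probe names, thresholds, class labels, and the three hand-written trees are invented, and the raw score is assumed to be the fraction of trees voting for each class, which is the usual random forest convention rather than something the legend specifies.

```python
from collections import Counter

# A decision node is (probe, threshold, low_branch, high_branch);
# a terminal node is simply a class label string.


def predict_tree(tree, sample):
    """Follow one tree from its root to a terminal node and return the class."""
    node = tree
    while not isinstance(node, str):
        probe, threshold, low_branch, high_branch = node
        # Each decision uses the methylation level of a single CpG probe.
        node = low_branch if sample[probe] <= threshold else high_branch
    return node


def raw_scores(forest, sample):
    """Return the fraction of trees voting for each class (assumed raw score)."""
    votes = Counter(predict_tree(tree, sample) for tree in forest)
    return {label: count / len(forest) for label, count in votes.items()}


# Hypothetical three-tree forest with invented probes and thresholds.
forest = [
    ("cg_A", 0.5, "GLIOBLASTOMA", ("cg_B", 0.3, "OTHER", "GLIOBLASTOMA")),
    ("cg_B", 0.4, "OTHER", "GLIOBLASTOMA"),
    ("cg_C", 0.7, ("cg_A", 0.2, "GLIOBLASTOMA", "OTHER"), "GLIOBLASTOMA"),
]

# Hypothetical methylation levels (beta values between 0 and 1).
sample = {"cg_A": 0.8, "cg_B": 0.6, "cg_C": 0.6}

scores = raw_scores(forest, sample)
```

In this toy case two of the three trees end in the GLIOBLASTOMA leaf and one ends in OTHER, so the vote fractions are 2/3 and 1/3. In the figure, scores of this kind are what the calibration model converts into class probabilities.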