Table 1 Impact of Quality Control, Imputation of missing genotypes and Coding methods on the results of Lasso penalized Logistic Regression.

From: Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data

 

| QC/imputation/coding | AUC train | AUC test | \({N}_{{\rm{SNP}}}^{p}\) | \({N}_{{\rm{SNP}}\ne 0}\) | \({I}_{{\rm{SNP}}}^{(\ast )}\) | \({I}_{{\rm{Loci}}}^{(\ast )}\) | \({I}_{{\rm{topSNP}}}^{(\ast )}\) | \({I}_{{\rm{topLoci}}}^{(\ast )}\) | \({I}_{{\rm{Loci}}}^{({\rm{GWAS}})}\) | \({I}_{{\rm{topLoci}}}^{({\rm{GWAS}})}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| NoQC/Unkw/OHE | 0.925 ± 0.003 | 0.922 | 23583 | 2927 | 29% | 55% | 6% | 35% | 88% | 19% |
| QC/Unkw/OHE | 0.808 ± 0.008 | 0.802 | 21896 | 3198 | 69% | 87% | 38% | 48% | 89% | 25% |
| NoQC/Maj/sum | 0.901 ± 0.003 | 0.897 | 23583 | 3419 | 36% | 69% | 6% | 23% | 90% | 12% |
| QC/Maj/sum | 0.805 ± 0.008 | 0.800 | 21896 | 3553 | 91% | 100% | 64% | 63% | 91% | 27% |
| NoQC/HWc/sum | 0.812 ± 0.007 | 0.803 | 23583 | 2730 | 38% | 66% | 26% | 45% | 87% | 29% |
| **QC/HWc/sum** | **0.803 ± 0.008** | **0.800** | **21896** | **2575** |  |  |  |  | **89%** | **36%** |
| QC/HWc/OHE | 0.796 ± 0.008 | 0.786 | 21896 | 3242 | 72% | 89% | 49% | 60% | 89% | 29% |
| QC/HWc/raw | 0.800 ± 0.008 | 0.792 | 21896 | 2757 | 72% | 88% | 57% | 68% | 89% | 29% |
| QC/HWa/sum | 0.803 ± 0.008 | 0.799 | 21896 | 2579 | 94% | 99% | 91% | 96% | 89% | 36% |

  1. Area under the curve (AUC) obtained by 10-fold cross-validation on the train set and by evaluation on the test set, for Lasso penalized Logistic Regression applied to different combinations of QC/imputation/coding choices (notations as in the Materials and Methods section). The line in bold corresponds to our benchmark case (QC/HWc/sum). \({N}_{{\rm{SNP}}}^{p}\) indicates the number of preselected SNPs used as input to the model, and \({N}_{{\rm{SNP}}\ne 0}\) is the number of SNPs associated with a non-zero coefficient. \({I}_{{\rm{SNP}}}^{(\ast )}\) and \({I}_{{\rm{Loci}}}^{(\ast )}\) refer respectively to the percentage of SNPs and loci (as defined in the main text) with an associated non-zero coefficient that are in common with the benchmark case. The \({I}_{{\rm{topSNP}}}^{(\ast )}\) and \({I}_{{\rm{topLoci}}}^{(\ast )}\) columns show the same quantities for the 100 features with the highest weight (in absolute value). \({I}_{{\rm{Loci}}}^{({\rm{GWAS}})}\) and \({I}_{{\rm{topLoci}}}^{({\rm{GWAS}})}\) instead compare the same quantities to the list given in ref. 3.
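
The overlap columns can be read as set intersections between the features selected by a given configuration and those selected by a reference (the benchmark case or the GWAS list). The following is a minimal Python sketch of that idea, not the authors' code. The function names, the toy SNP identifiers and the weights are all hypothetical, and the choice of the reference set as the denominator is an assumption, since the caption does not state it.

```python
from typing import Dict, Set


def nonzero_features(coefs: Dict[str, float]) -> Set[str]:
    """Features (e.g. SNP ids) whose fitted coefficient is non-zero."""
    return {name for name, w in coefs.items() if w != 0.0}


def top_features(coefs: Dict[str, float], k: int = 100) -> Set[str]:
    """The k non-zero features with the largest absolute weight."""
    nonzero = [(name, w) for name, w in coefs.items() if w != 0.0]
    ranked = sorted(nonzero, key=lambda item: abs(item[1]), reverse=True)
    return {name for name, _ in ranked[:k]}


def overlap_percent(selected: Set[str], reference: Set[str]) -> float:
    """Percentage of the reference set that also appears in `selected`.

    ASSUMPTION: the denominator is the reference (benchmark) set;
    the table caption does not say which set is used.
    """
    if not reference:
        return float("nan")
    return 100.0 * len(selected & reference) / len(reference)


# Toy coefficients: hypothetical SNP ids and weights, not data from the paper.
benchmark = {"rs1": 0.9, "rs2": -0.5, "rs3": 0.0, "rs4": 0.2}
variant = {"rs1": 0.7, "rs2": 0.0, "rs4": -0.1, "rs5": 0.3}

# Analogue of I_SNP^(*): overlap of all non-zero features.
i_snp = overlap_percent(nonzero_features(variant), nonzero_features(benchmark))

# Analogue of I_topSNP^(*), with k=2 instead of 100 for the toy example.
i_top = overlap_percent(top_features(variant, k=2), top_features(benchmark, k=2))
```

The loci-level columns would follow the same pattern once each SNP is mapped to its locus, using the locus definition from the main text, which is not reproduced here.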