Table 1 Impact of Quality Control, Imputation of missing genotypes and Coding methods on the results of Lasso penalized Logistic Regression.

	AUC train	AUC test	\({N}_{{\rm{SNP}}}^{p}\)	\({N}_{{\rm{SNP}}\ne 0}\)	\({I}_{{\rm{SNP}}}^{(\ast )}\)	\({I}_{{\rm{Loci}}}^{(\ast )}\)	\({I}_{{\rm{topSNP}}}^{(\ast )}\)	\({I}_{{\rm{topLoci}}}^{(\ast )}\)	\({I}_{{\rm{Loci}}}^{({\rm{GWAS}})}\)	\({I}_{{\rm{topLoci}}}^{({\rm{GWAS}})}\)
NoQC/Unkw/OHE	0.925 ± 0.003	0.922	23583	2927	29%	55%	6%	35%	88%	19%
QC/Unkw/OHE	0.808 ± 0.008	0.802	21896	3198	69%	87%	38%	48%	89%	25%
NoQC/Maj/sum	0.901 ± 0.003	0.897	23583	3419	36%	69%	6%	23%	90%	12%
QC/Maj/sum	0.805 ± 0.008	0.800	21896	3553	91%	100%	64%	63%	91%	27%
NoQC/HW_c/sum	0.812 ± 0.007	0.803	23583	2730	38%	66%	26%	45%	87%	29%
QC/HW_c/sum	0.803 ± 0.008	0.800	21896	2575	—	—	—	—	89%	36%
QC/HW_c/OHE	0.796 ± 0.008	0.786	21896	3242	72%	89%	49%	60%	89%	29%
QC/HW_c/raw	0.800 ± 0.008	0.792	21896	2757	72%	88%	57%	68%	89%	29%
QC/HW_a/sum	0.803 ± 0.008	0.799	21896	2579	94%	99%	91%	96%	89%	36%

Area under the curve (AUC) obtained by 10-fold cross-validation on the train set and by evaluation on the test set, for Lasso penalized Logistic Regression applied to different combinations of QC/imputation/coding choices (notations as in the Materials and Methods section). The row QC/HW_c/sum is our benchmark case. \({N}_{{\rm{SNP}}}^{p}\) indicates the number of preselected SNPs used as input to the model, and \({N}_{{\rm{SNP}}\ne 0}\) is the number of SNPs associated with a non-zero coefficient. \({I}_{{\rm{SNP}}}^{(\ast )}\) and \({I}_{{\rm{Loci}}}^{(\ast )}\) refer respectively to the percentage of SNPs and loci (as defined in the main text) with an associated non-zero coefficient that are in common with the benchmark case. The \({I}_{{\rm{topSNP}}}^{(\ast )}\) and \({I}_{{\rm{topLoci}}}^{(\ast )}\) columns show the same quantities restricted to the 100 features with the highest weight (in absolute value). \({I}_{{\rm{Loci}}}^{({\rm{GWAS}})}\) and \({I}_{{\rm{topLoci}}}^{({\rm{GWAS}})}\) instead compare the same quantities to the list given in³.
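As an illustration of how overlap columns such as \({I}_{{\rm{SNP}}}^{(\ast )}\) and \({I}_{{\rm{topSNP}}}^{(\ast )}\) could be computed, the following minimal Python sketch compares the features selected by one model with those of a benchmark model. It rests on several assumptions not stated in the caption: each coefficient maps to exactly one SNP identifier (with OHE coding, several columns would first have to be mapped back to their SNP), the percentage is taken relative to the size of the benchmark set, and the function names and toy coefficients are hypothetical.

```python
def nonzero_features(coefs, names):
    """Return the set of feature names whose coefficient is non-zero."""
    return {n for c, n in zip(coefs, names) if c != 0}


def top_features(coefs, names, k=100):
    """Return the names of the k non-zero features with the largest |coefficient|."""
    ranked = sorted(
        ((abs(c), n) for c, n in zip(coefs, names) if c != 0),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return {n for _, n in ranked[:k]}


def overlap_percent(selected, benchmark):
    """Percentage of benchmark features also present in `selected`.

    Assumption: the denominator is the benchmark set size; the caption
    does not specify which set is used for normalisation.
    """
    if not benchmark:
        return float("nan")
    return 100.0 * len(selected & benchmark) / len(benchmark)


# Toy data (hypothetical SNP identifiers and coefficients).
names = ["rs1", "rs2", "rs3", "rs4", "rs5"]
benchmark_coefs = [0.5, 0.0, -1.2, 0.1, 0.0]
other_coefs = [0.4, 0.3, 0.0, 0.2, 0.0]

# Overlap over all features with a non-zero coefficient.
i_snp = overlap_percent(
    nonzero_features(other_coefs, names),
    nonzero_features(benchmark_coefs, names),
)

# Overlap restricted to the top-k features (k=2 here, 100 in the table).
i_top_snp = overlap_percent(
    top_features(other_coefs, names, k=2),
    top_features(benchmark_coefs, names, k=2),
)
```

The loci-level columns could be obtained in the same way by first replacing each SNP identifier with the locus it belongs to, and the GWAS columns by passing the published locus list as the reference set.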
