Extended Data Fig. 4: Quality control of proteomics and impact of copy number alteration on mRNA and protein expression.

a, Bar plot showing the number of genes detected in each batch. In total, 10,864 genes were detected. b, Principal component analysis (PCA) evaluating the batch effect, using all genes detected in more than 70% of the included samples after normalization and batch effect removal. c, Dot plots showing the Pearson’s correlation between technical replicates (samples within batches 33 and 34), using all genes detected in more than 70% of the included samples after normalization and batch effect removal. d, Venn diagrams depicting the cis-effect of CNA (FDR < 0.05) along the central dogma in this study and in the studies published by Mertins and colleagues10 (n = 74 tumors) and by Krug and colleagues11 (n = 122 tumors). e, f, Boxplots showing the mRNA and protein levels of WWP1 (e) and CCND1 (f) across different GISTIC scores in each PAM50 subtype. For the WWP1 analysis, the numbers of samples were as follows: LumA: n = 188 tumors in the RNA analysis and n = 52 tumors in the protein analysis; LumB: n = 198 tumors in the RNA analysis and n = 73 tumors in the protein analysis; HER2: n = 121 tumors in the RNA analysis and n = 49 tumors in the protein analysis; Basal: n = 88 tumors in the RNA analysis and n = 44 tumors in the protein analysis; Normal: n = 47 tumors in the RNA analysis and n = 20 tumors in the protein analysis. For the CCND1 analysis, the numbers of samples were as follows: LumA: n = 147 tumors in the RNA analysis and n = 41 tumors in the protein analysis; LumB: n = 163 tumors in the RNA analysis and n = 58 tumors in the protein analysis; HER2: n = 105 tumors in the RNA analysis and n = 41 tumors in the protein analysis; Basal: n = 76 tumors in the RNA analysis and n = 31 tumors in the protein analysis; Normal: n = 37 tumors in the RNA analysis and n = 12 tumors in the protein analysis.
In boxplots, the center line represents the median, the box limits represent the upper and lower quartiles, the whiskers represent 1.5× the interquartile range, and the points represent individual samples. g, h, Forest plots of multivariate Cox regression analysis for relapse-free survival, adjusted for PAM50 clusters, tumor size and lymph node status, in the overall population (n = 271 tumors) (g) and the HR+HER2- subgroup (n = 148 tumors) (h). Error bars represent the 95% confidence intervals (CI) of the hazard ratios (HR), and the centers of the error bars indicate the HRs. i, Gene set enrichment analysis (GSEA) comparing the molecular characteristics of each integrated cluster with those of the others. Pathways that were significantly enriched in a given cluster (FDR < 0.25) are shown. j, Heat map showing the abundance of immune cells in Cluster 3 (n = 75 tumors) and non-Cluster 3 (n = 196 tumors) breast cancers. Cell types that were significantly elevated in the Cluster 3 subgroup are marked with asterisks. k, Enrichment of immunotherapy predictive signatures in integrated clusters and PAM50 subtypes, as indicated by a logistic model, in the overall population (n = 271 tumors) and the HR+HER2- subgroup (n = 148 tumors). For d, P values were obtained from Spearman’s rank test with false discovery rate correction. For e, f, two-sided Wilcoxon rank tests were conducted to compare the mRNA or protein levels between samples with GISTIC scores of ‘0’ and ‘2’ in each PAM50 subtype. *: P value < 0.05; N.S.: not significant, P value > 0.05. For g, h, P values were obtained from two-sided multivariate Cox regression analysis. Bold font indicates a P value less than 0.05. For j, P values were obtained from unpaired two-sided t-tests.
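As an illustration of the quality-control steps described for panels b and c, the following is a minimal sketch, not the pipeline used in the study. It assumes a hypothetical abundance matrix stored as a dictionary that maps each gene to a list of per-sample values, with `None` marking a gene that was not detected in a sample. The function names (`detected_fraction`, `filter_genes`, `pearson`) are illustrative, and the 70% threshold is the only parameter taken from the legend.

```python
from math import sqrt


def detected_fraction(values):
    """Fraction of samples in which a gene has a non-missing value."""
    return sum(v is not None for v in values) / len(values)


def filter_genes(matrix, min_fraction=0.7):
    """Keep genes detected in more than `min_fraction` of samples.

    `matrix` is a hypothetical {gene: [value or None per sample]} mapping.
    """
    return {
        gene: values
        for gene, values in matrix.items()
        if detected_fraction(values) > min_fraction
    }


def pearson(x, y):
    """Pearson correlation between two replicates.

    Only positions where both replicates have a value are used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sd_x = sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
    sd_y = sqrt(sum((b - mean_y) ** 2 for _, b in pairs))
    return cov / (sd_x * sd_y)


# Toy data (invented for illustration only): two genes across four samples.
toy = {
    "GENE_A": [1.0, 2.0, 3.0, None],   # detected in 3 of 4 samples, kept
    "GENE_B": [1.0, None, None, 4.0],  # detected in 2 of 4 samples, dropped
}
kept = filter_genes(toy)
```

In this sketch the filter is strict (`>` rather than `>=`), matching the wording "more than 70%"; whether the original analysis treated the boundary the same way is an assumption.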