Introduction

In genomic medicine, clinical tests have a wide range of roles, applications and expectations. Liquid biopsies, CGP testing for hematopoietic tumors, whole exome analysis for intractable and rare diseases, and whole genome analysis are rapidly expanding techniques in actual clinical practice1,2,3,4,5,6. To ensure accurate and reproducible results, EQA/PT for genetic-related testing in clinical settings is necessary7,8,9,10,11. Genetic-related tests in clinical laboratories fall into three categories: nucleic acid (NA) testing of pathogenic microorganisms, testing of human DNA from somatic cells, and testing of human DNA from germline cells. ISO 15189 accreditation requires participation in EQA/PT, and a study on EQA for NGS testing across 48 countries demonstrated that annual assessments improved technical capabilities, including bioinformatics workflows12. In this study, we aggregated and analyzed the results from various institutions and examined differences in results based on NGS equipment and reagents as well as the detection rate for each genetic mutation. Further, we evaluated the characteristics of each method and NGS platform. This allowed us to assess the current state of cancer gene panel testing in Japan and gave each participating facility an opportunity to discuss the results. This study aimed to conduct an alternative, cost-effective EQA/PT on human DNA from somatic cells in 10 laboratory institutions. The DNA was extracted from cancer cell lines and from simulated tumor (T)/non-tumor (N) matched-pair samples in the form of commercially available FFPE specimens, to standardize gene-related tests such as companion diagnostics (CDx) for clinical applications. Furthermore, mutation detection may vary depending on the analysis pipeline used, the presence of multiple variants in the same gene, and the limit of detection (LOD). The VAF associated with a 95% detection rate may also differ depending on the reporting threshold established at each facility.
The present results aim to highlight potential pitfalls in CGP testing and contribute to the standardization of NGS analysis. Together, implementing EQA/PT procedures across laboratories will help ensure the delivery of high-quality genomic medicine.

Results

EQA/PT design and protocol

Representative CGP tests consist of three steps: pre-analytic (sampling and NA extraction), analytic (library preparation and DNA sequencing), and post-analytic (Fig. 1A)13,14. In this study, we categorized bioinformatics as part of the post-analytic phase to distinguish bioinformatics data processing from the process of obtaining raw sequence data from NGS. We therefore designed two independent EQA/PT rounds (Fig. 1A, B). In these EQAs, CGP testing methodology was compared among five university hospital laboratories and five commercial laboratories in Japan (Table 1). For the 1st EQA in 2022, NA was extracted from several cancer cell lines, and its quality and yield, as well as the detection rate of CDx genes at each facility, were examined and compared. The 2nd EQA was performed in 2023; the department of laboratory medicine at Chiba University Hospital delivered simulated matched-pair T/N FFPE samples, and each institute (Laboratories A–C, E–G, I, and J) extracted DNA/RNA and performed the subsequent steps on its own (Fig. 1B). First, to compare the analytic and post-analytic phases, EQA/PT was performed with delivered NAs of cancer cell lines. We did not observe a clear difference in the detection frequency of pathogenic variants among facilities, and the LOD, defined as the VAF with a detection rate of 95%, was approximately 6–6.1% for NAs of cancer cell lines and 10.8–10.9% for FFPE samples. Next, EQA/PT including NA extraction from FFPE as a pre-analytic process was performed using simulated tumor (T)/non-tumor (N) matched-pair samples.

Fig. 1

(A) Representative CGP tests indicated as pre-analytic (sampling and nucleic acid (DNA/RNA) extraction), analytic (library preparation and DNA sequencing), and post-analytic (bioinformatics). Validation, IQC, and EQA are required for all processes. For the 1st EQA, NA (DNA/RNA) extraction (pre-analysis) was carried out at laboratory D and DNA/RNA samples were delivered to the participating laboratories (Laboratories A–C, E–G, I, and J). (B) For the 2nd EQA, laboratory D delivered simulated matched-pair T/N FFPE samples, and each institute (Laboratories A–C, E–G, I, and J) extracted DNA/RNA and performed the rest of the process.

Table 1 Summary of analytic procedures in participating institutes.

The limit of detection (LOD) of pathogenic gene variants was 6 to 6.1% VAF in cancer cells

For the 1st EQA, five genomic DNA samples of cancer cell lines (1 reference and 4 test samples; Table S1) were prepared by the EQA organizer (Laboratory D at Chiba University) and delivered to the other nine laboratories (Laboratories A–I; Table 1). The analytic process in the participating laboratories included library preparation, DNA sequencing, and bioinformatics analysis (Fig. 1A). The NGS system, library preparation procedures, reagents for examination, reference human genome version (GRCh37/hg19 or GRCh38/hg38), and sensitivities for SNV/Indel in each institute are indicated in Table 1. In addition, gene name, type of variant (according to Human Genome Variation Society nomenclature), VAF (mean, %), relative standard deviation (%RSD), and %Detection (rate and sample count, N) were calculated (Table 2A). For example, for the pathogenic variant KRAS c.38G>A (p.G13D) in cell HCT 116: K-562, VAF (mean, %), %RSD and %Detection (rate and N) were 11.7%, 11%, and 100% (9/9), respectively (Table 2A). Among the four test samples, the Caco-2 sample mixed with K-562 (ratio of 1:3) had no variants in the BRAF, EGFR, KRAS, NRAS, or PIK3CA genes (Table S1). In the case of BRAF: c.1799T>A (p.V600E), VAF (mean, %), %RSD and %Detection (rate and N) were 4.9%, 16%, and 67% (6/9) in cell HT-29: K-562, whereas they were 12.5%, 8%, and 100% (9/9) in cell RKO: K-562. For BRAF: c.1799T>A (p.V600E) in cell HT-29: K-562, 3 out of 9 laboratories did not report this variant. However, these 3 laboratories confirmed the presence of this variant in their Binary Alignment/Map (BAM) files. These results suggest that a VAF of 5% may approximate the limit of detection (LOD) for the participating laboratories. However, as the LOD (defined as the VAF at which variants are detected 95% of the time) was not formally assessed in this study, this value should be considered an estimate rather than a definitive threshold.
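
The per-variant summary statistics described above (mean VAF, %RSD, and %Detection) can be illustrated with a minimal Python sketch. The per-laboratory VAF values are hypothetical and are not data from this study; a laboratory that did not report the variant is represented as `None`, and %RSD is assumed here to be the sample standard deviation divided by the mean over the reporting laboratories only.

```python
from statistics import mean, stdev

def summarize_variant(vafs):
    """Summarize per-laboratory VAFs (%) for one variant.

    `vafs` holds one entry per laboratory; None means "not reported".
    """
    reported = [v for v in vafs if v is not None]
    n_total = len(vafs)
    n_detected = len(reported)
    mean_vaf = mean(reported) if reported else None
    # %RSD = 100 * SD / mean; it needs at least two reported values
    rsd = 100 * stdev(reported) / mean_vaf if n_detected >= 2 else None
    detection = 100 * n_detected / n_total
    return {
        "mean_vaf": mean_vaf,
        "rsd": rsd,
        "detection": detection,
        "n": (n_detected, n_total),
    }

# Hypothetical values for nine laboratories, three of which did not report
example = [5.2, 4.1, None, 5.8, None, 4.4, 5.5, None, 4.6]
summary = summarize_variant(example)
print(summary)
```

Whether non-reporting laboratories should be excluded from the mean and %RSD, as assumed here, is a design choice that an EQA organizer would need to state explicitly.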

Table 2 Summary of 1st and 2nd EQA/PT.

DNA extraction from FFPE is critical for the VAF and detection of pathogenic variants and variants with companion diagnostic indications

Laboratory D at Chiba University delivered simulated tumor (T)/non-tumor (N) matched-pair samples to each institute (Laboratories A–C, E–G, I, and J), which extracted DNA using their own procedures (Fig. 1B). For the 2nd EQA, sufficient DNA yields were obtained from the T/N matched-pair samples by NA extraction. These simulated matched-pair samples were essential for EQA/PT of matched-pair analysis techniques, such as those used in the NCC Oncopanel™, which require both tumor and normal (non-tumor) samples for accurate evaluation. Regarding comprehensive genomic testing using NGS, detailed EQA/PT data for the curation of analytical processes are summarized in Table 2B. The FFPE samples carried a more comprehensive set of hotspot variants in AKT1, BRAF, EGFR, KIT, KRAS, NRAS, and PIK3CA (Table 2B). Most variants were accurately detected in the 10 participating laboratories. The mean VAF of each variant was < 20% (8.1–19.1%), and there was wide variability among laboratories (%RSD ranged from 13 to 60%). Variants with a low detection rate (< 80%) among laboratories were BRAF: c.1798_1799delinsAA (p.V600K), EGFR: c.2235_2249del (p.E746_A750del), and EGFR: c.2254_2277del (p.S752_I759del), all with < 10% VAFs. Notably, two laboratories reported BRAF V600M but not V600K. These incorrect results, observed when using the VarScan2 pipeline, were due to independent variant calling in BRAF, in which c.1798G>A and c.1799T>A were called as occurring on separate strands instead of together as c.1798_1799delinsAA (p.V600K) (Figure S1A–C). Thus, these results were affected not only by sensitivity but also by variant call accuracy. Correlation analysis of VAFs showed good correlation between laboratories that used the same instrument and reagents: Spearman’s rho was 0.75 for Labs A and B (instrument, Thermo; method, amplicon) and 0.87 for Labs F and G (instrument, Illumina; method, capture; Fig. 2, Table S1).
When amplicon and capture methods were compared, no significant differences were observed in variant frequency (Figure S2).

Fig. 2

Correlation matrix. Analysis and post-analysis of participating facilities using different procedures for CGP panel testing. A correlation matrix between facilities based on these results suggested that the analysis and post-analysis influenced the results.

The 11% VAF affected the detection rate of gene variants depending on DNA extraction procedures from FFPE

The relationship between %VAF and detection rate in the 1st and 2nd EQA is shown in Fig. 1B and Table 2. In the 1st EQA using DNA from cancer cell lines, 100% detection was achieved for VAF > 5–10% (Table 2A). However, in the 2nd EQA, the detection rate was 100% only for VAF > 15%. These findings highlight that the VAF required for detection is significantly influenced by DNA quality and quantity. Furthermore, DNA library preparation for NGS analysis affected VAF depending on the genetic variant. In the 1st EQA, which did not include deletion variants, there was no significant difference in the VAF of genetic variants between amplicon-based (Labs A–E) and capture-based (Labs F–J) methods for library preparation (Table 1). However, in the 2nd EQA, which included three EGFR deletion variants, the VAF for EGFR exon 19 deletion variants tended to be higher with the amplicon-based method than with the capture-based method (Fig. S2). Based on our findings, we hypothesize that for deletion variants, PCR efficiency may increase while capture efficiency decreases, depending on the deleted sequence. Further research is needed to corroborate this hypothesis. The process of library preparation is therefore critical for the standardization of CGP tests.

In routine clinical testing, whether in hospital or commercial laboratories, once an analytic pipeline has been established and validation tests reveal no problems, testing typically continues without any changes. Based on the results of this study, we therefore hope that conducting EQA/PT will provide an opportunity to review pipelines and address pitfalls that usually go unnoticed.

Challenges and considerations in detecting BRAF and EGFR pathogenic variants

Precise detection of BRAF V600K and EGFR pathogenic variants is crucial for molecular target therapy15. In this study, the detection rates for BRAF V600K and EGFR variants varied among participating laboratories: %Detection (rate and N) for BRAF c.1798_1799delinsAA (p.V600K) (Fig. S1A, B) was 70% (7/10), while EGFR c.2235_2249del (p.E746_A750del) and c.2254_2277del (p.S752_I759del) were detected at 70% (7/10) and 60% (6/10), respectively. In contrast, mutations such as KRAS c.34G>T (p.G12C) and NRAS c.181C>A (p.Q61K) were detected by all ten facilities, even when the %VAF was below 10% (Table 2).

The detection discrepancies for BRAF and EGFR variants were analyzed. For BRAF V600, differences in detection were attributed to the use of different variant callers (e.g., GATK Mutect2 and VarScan2), which likely influenced the results. The TT variant of BRAF V600K was misidentified as V600M due to misclassification on the sense or anti-sense DNA strands (Fig. S1A–C). The reasons for this incorrect annotation were thought to be as follows. GATK Mutect2 correctly detected BRAF 7: 140,453,136–140,453,137 (TC) and 140,453,136–140,453,137 (TT) as c.1799T>A (V600E) and c.1798_1799delinsAA (V600K), respectively (Fig. S1B, C); however, VarScan2 misidentified BRAF 7: 140,453,136–140,453,136 (T) and 140,453,137–140,453,137 (T), called on different strands, as c.1798G>A (V600M) and c.1799T>A (V600E) (Fig. S1C). For EGFR, multiple variants were in proximity, making their detection challenging. This issue can arise because regions with high discrepancy are sometimes filtered out by analysis pipelines. To address these issues, it is essential to select a pipeline that accounts for such biases and to perform visual inspection using the Integrative Genomics Viewer (IGV) to minimize errors in variant calling (Fig. S1A). While achieving 100% accuracy is unrealistic in clinical settings, sharing effective protocols and leveraging limited resources efficiently is critical to ensure the reliable detection of clinically significant pathogenic variants.
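
One simplified way to picture the V600K versus V600M discrepancy is to compare a caller that evaluates the two adjacent bases together on each read with one that evaluates each position independently. The sketch below is a toy model under stated assumptions: the read representation, function names, and 5% threshold are hypothetical, and it does not reproduce the internal logic of GATK Mutect2 or VarScan2.

```python
from collections import Counter

# Codon 600 of BRAF is GTG (Val); its first two bases are c.1798 and c.1799.
# Each read is reduced to the two bases it shows at those positions
# on the coding strand (hypothetical input format).
HAPLOTYPE_TO_PROTEIN = {"GA": "p.V600E", "AA": "p.V600K", "AT": "p.V600M"}

def phase_aware_calls(reads, min_vaf=0.05):
    """Call variants from the whole two-base haplotype seen on each read."""
    counts = Counter(reads)
    total = len(reads)
    return {
        HAPLOTYPE_TO_PROTEIN[hap]: n / total
        for hap, n in counts.items()
        if hap in HAPLOTYPE_TO_PROTEIN and n / total >= min_vaf
    }

def position_wise_calls(reads, min_vaf=0.05):
    """Call each position on its own, ignoring which reads carry both changes."""
    total = len(reads)
    first_a = sum(r[0] == "A" for r in reads) / total   # c.1798G>A alone -> p.V600M
    second_a = sum(r[1] == "A" for r in reads) / total  # c.1799T>A alone -> p.V600E
    calls = {}
    if first_a >= min_vaf:
        calls["p.V600M"] = first_a
    if second_a >= min_vaf:
        calls["p.V600E"] = second_a
    return calls

# Hypothetical pileup: 90 reference reads and 10 reads carrying AA in cis
reads = ["GT"] * 90 + ["AA"] * 10
print(phase_aware_calls(reads))    # reports p.V600K only
print(position_wise_calls(reads))  # reports p.V600M and p.V600E instead
```

In this toy model the position-wise approach never reports V600K, which is why visual confirmation of whether both changes occur on the same reads remains useful.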

When a facility is introducing CGP testing for the first time, EQA/PT can reveal pitfalls such as those identified in this project. Because a pipeline is not easy to modify once it has been created, it is important to perform EQA/PT in advance. As an example of assay versions changing over time, a previous report mentions that “5 patients had test results by Guardant360 v14.0, 4 by Guardant360 v15.0; 5 by Guardant360 v16.0, and 2 by Guardant360 v17.0”16. What is important in this process is that data from past samples are reanalyzed with updated pipelines to determine whether the modified pipelines still perform acceptably, preferably with improved overall performance.

In particular, objective evaluation should be conducted blindly. Furthermore, in the future, it would be advantageous in terms of cost if ethical issues could be resolved and EQA/PT using human clinical specimens could be implemented, even on a small scale. ISO 15189 requires evaluation of the suitability of services provided by external parties (including surveys such as this one), so we believe it would be ideal to be able to choose from a variety of programs. The variation in VAF among laboratories was unexpectedly large. We consider this EQA/PT suitable for evaluating the extent to which alleles with a relatively low allele frequency of 5% can be detected. To reduce the variation in VAF, it is necessary to equalize the sequencing read depth. Taken together, for high-quality DNA such as extracts from cell lines, the %VAF of EQA/PT samples needs to be 5–10%, whereas for low-quality DNA such as extracts from FFPE specimens, it needs to be 15%. This threshold corresponds to a tumor content of approximately 30%, which may indeed be high for some FFPE sections. This study therefore highlights the need for enrichment methods, such as macro- or microdissection, to improve the detection of clinically significant somatic mutations in samples with lower tumor content.
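
The correspondence between a 15% VAF and a tumor content of roughly 30% follows if the variant is assumed to be heterozygous in a diploid tumor without copy-number change, so that the VAF is about half the tumor cell fraction. A minimal Python sketch of this simple model is shown below; the function name and the copy-number parameters are illustrative assumptions, and real tumors with amplifications or loss of heterozygosity will deviate from it.

```python
def tumor_content_from_vaf(vaf_percent, mutant_copies=1, total_copies=2):
    """Tumor cell fraction (%) implied by a VAF under a simple model.

    Assumes tumor and normal cells both carry `total_copies` of the locus
    and that `mutant_copies` of them are mutated in every tumor cell.
    """
    return vaf_percent * total_copies / mutant_copies

print(tumor_content_from_vaf(15))  # 30.0
```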

Correlation matrix among laboratories

The analysis and post-analysis processes in the participating facilities included variations in library preparation methods, DNA sequencing reagents (Manufacturer), DNA sequencers (Instrument), human reference genome, and sensitivity (LOD; Table 1). We compared genetic variants and %VAF for CDx variants across facilities (Table S2), and created a correlation matrix between them. The results indicated that both the analysis and post-analysis processes influenced the detection results (Fig. 2). Accordingly, it was considered important for each facility to understand the characteristics of its own method when participating in EQA of NGS testing.
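
As an illustration of how a between-laboratory correlation matrix of this kind can be computed, the following Python sketch calculates Spearman's rho from ranked VAFs. It is not the authors' code (the study reports using R for correlation analyses); the laboratory names and VAF values are hypothetical, and every laboratory is assumed to have reported a VAF for every variant.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def correlation_matrix(vafs_by_lab):
    """Pairwise Spearman's rho between laboratories, keyed by (lab, lab)."""
    labs = sorted(vafs_by_lab)
    return {(a, b): spearman_rho(vafs_by_lab[a], vafs_by_lab[b])
            for a in labs for b in labs}

# Hypothetical %VAF values for the same five variants in two laboratories
vafs_by_lab = {
    "Lab X": [12.0, 8.5, 15.1, 9.9, 18.2],
    "Lab Y": [11.0, 9.0, 14.0, 8.0, 19.5],
}
matrix = correlation_matrix(vafs_by_lab)
print(round(matrix[("Lab X", "Lab Y")], 2))  # 0.9
```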

Discussion

In clinical practice, accurate detection is essential to ensure appropriate treatment17,18. EQA/PT plays a vital role in maintaining accuracy in cancer genome analysis19. This study among 10 clinical laboratories highlighted the importance of preanalytical processes, particularly NA extraction, in determining the quality of CGP tests. The content of this EQA closely resembles the requirements of ISO 15189, emphasizing the importance of standardized procedures to ensure test reliability. Variations in NA extraction methods among laboratories, along with updated histopathology protocols like shorter fixation times, could improve DNA specimen quality and enhance test performance.

Differences in the detection of BRAF and EGFR variants were observed among laboratories, which were attributed to variations in the pipelines used, including the choice of variant callers. The idea of “the most suitable pipeline” varies between laboratories. Evaluating individual components of the pipeline, such as variant callers and their integration into the overall analysis process, is essential. Sharing best practices within limited medical resources can improve testing reliability. The concept of “clinical validity” also underscores the need for pipelines to produce results that align with clinical requirements.

While tools like IGV can exclude false positives through visual inspection, they cannot detect variants that were missed in the pipeline. Although this study did not implement pipeline modifications based on EQA results, such changes will likely become critical in the future16. Unexpected variability in VAF across facilities may be due to differences in capture or PCR efficiency, experimental conditions, or analysis pipelines. The degree of error likely depends on the specific variant to be detected20, partially due to methodological differences in library preparation. Analyzing VAF and detection sensitivity through EQA provides a practical reference range for ensuring consistent detection of clinically significant variants across laboratories. For FFPE samples, we proposed a %VAF threshold of 15% for EQA/PT. However, this threshold corresponds to a tumor cell content of approximately 30%, which may be high for some FFPE sections. Without enrichment steps such as macro- or microdissection, laboratories may fail to detect clinically significant somatic mutations.

Using reference samples with known allele frequencies is a strength of this EQA study, as it allows systematic biases in VAF to be assessed. However, both patient-derived FFPE samples and artificially created materials have distinct advantages and limitations, making it difficult to definitively determine which is better for EQA. In brief, if EQA is performed repeatedly among laboratories, it will reveal the degree of systematic error that occurs with a particular test method21. In addition, traceability is important for commercially supplied samples, so that the stated allele frequency can be verified as correct and the material can be considered highly reliable22. Future studies are needed to determine the extent to which this difference depends on the methods used.

It is necessary to consider what to evaluate, including the effects of differences in depth and software (variant caller). The true answer is unknown in clinical human samples; therefore, it is necessary to consider what should be evaluated as a correct answer in an actual sample. For example, although tumor mutation burden (TMB) and microsatellite instability (MSI) were not included in the data requested for submission this time, it is important to look at them as well. Currently, artificial samples, such as commercially available standard materials spiked with fragments carrying artificially introduced genetic mutations or commercially available products subjected to various types of gene editing, have been used to perform QA23. This is advantageous because genetic mutations can be introduced systematically in advance and the material can be produced in large quantities. However, samples produced in this way do not ensure compatibility across measurement methods (commutability) and have different properties from the samples used for routine analysis22.

Patient-derived FFPE samples provide better compatibility with clinical testing methods, ensuring greater commutability. However, they present logistical challenges, including heterogeneity across sections and difficulties in obtaining sufficient material for consistent assessments among laboratories. On the other hand, artificially created samples, such as standard materials spiked with known mutations, offer scalability, consistency, and ease of production. Nonetheless, they may lack commutability with clinical specimens, potentially limiting their applicability to routine clinical testing.

Ideally, EQA programs should aim to integrate the benefits of both approaches by using patient-derived samples to focus assessments on the NGS analysis process while maintaining the scalability and standardization offered by artificial materials. However, the use of patient-derived FFPE samples for EQA outside of clinical research requires further discussion to address issues related to material consistency, logistical feasibility, and ethical considerations24,25. Achieving a consensus on their application will be essential for improving the reliability and accuracy of EQA in genomic testing.

Timely EQAs and reviews of analysis pipelines are critical for advancing genomic medicine26. Sharing library preparation methods, sequencing quality metrics, and performance indicators can improve test reliability27,28. While genome versions or annotation methods like MANE29 may not impact clinicians directly, aligning these notations would help EQA/PT providers evaluate laboratory performance more consistently.

In summary, this study highlights the critical role of EQAs in identifying systematic biases and improving genomic testing reliability across laboratories. To enhance the quality of CGP tests, it is essential to focus on all stages of the testing process, align pipelines with clinical needs, and use standardized reference materials. Monitoring changes over time through regular EQAs will help improve testing reliability. Future efforts should aim to develop more consistent EQA methods using patient-derived specimens while addressing challenges related to commutability and scalability. By fostering collaboration and sharing best practices, the reliability of cancer genomic testing can be further advanced, ultimately benefiting patient care.

Materials and methods

Participants

Ten clinical laboratories of five university hospitals (Chiba University, University of Tokyo, Kagoshima University, Mie University, and Fujita medical college) and five commercial companies in Japan (SRL, BML, RIKEN-Tsukiji, RIKEN-Kawasaki, and Sysmex Co. Ltd.) participated in this study (Table 1). Five laboratories used the amplicon method and the rest used the capture method for DNA detection. Manufacturers of library preparation kits were Thermo for four labs, Qiagen for one, and Agilent for five.

DNA extraction from cell lines

DNA samples were extracted from one leukemia cell line (K-562) and four colorectal cancer cell lines (Caco-2, HCT 116, HT-29, and RKO). DNA concentrations were adjusted to 25 ng/µL using a fluorometric assay. EQA/PT samples were then prepared by mixing the K-562 cell line, used as the baseline for matched-pair analysis, and each colorectal cancer cell line at a 3:1 ratio (Fig. 2). Microsatellite instability and EGFR, KRAS, NRAS, BRAF, PIK3CA, and TP53 variants were examined as CDx targets. K-562, a human leukemia cell line, was used as the standard because it has no pathogenic variants in the CDx genes. We conducted a simulated matched-pair study in which DNA derived from a cell line without the target gene mutation was treated as derived from normal (non-tumor) cells, and DNA derived from a cell line with the target gene mutation was treated as derived from tumor cells.

Sample cells (1–4) were prepared at laboratory D and delivered to the other laboratories participating in the EQA.

  (1) Sample cell (Caco-2: K-562): No mutations detected in EGFR, KRAS, NRAS, BRAF, PIK3CA. Microsatellite status, stable.

  (2) Sample cell (HCT 116: K-562): KRAS p.G13D mutation detected at approximately 12.5% and PIK3CA p.H1047R detected at approximately 12.5%. Microsatellite status, high.

  (3) Sample cell (HT-29: K-562): BRAF p.V600E mutation detected at approximately 6%. Microsatellite status, stable.

  (4) Sample cell (RKO: K-562): BRAF p.V600E mutation detected at approximately 16.75%, and PIK3CA p.H1047R detected at approximately 12.5%. Microsatellite status, high.
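
The approximate VAFs listed for the mixed samples are consistent with simple dilution of the colorectal cancer cell line DNA by variant-free K-562 DNA. The Python sketch below illustrates the arithmetic under stated assumptions: DNA concentrations are equal before mixing, copy number at the locus is the same in both cell lines, and the 50% allele fraction used in the example (a heterozygous variant in the undiluted cell line) is an assumption for illustration rather than a measured value.

```python
def expected_vaf(cell_line_vaf_percent, tumor_parts=1, baseline_parts=3):
    """Expected VAF (%) after diluting variant-carrying DNA with baseline DNA.

    `tumor_parts` and `baseline_parts` describe the mixing ratio
    (1 part colorectal cancer cell line to 3 parts K-562 by default).
    """
    fraction = tumor_parts / (tumor_parts + baseline_parts)
    return cell_line_vaf_percent * fraction

# A heterozygous variant (50% in the undiluted line, an assumed value)
print(expected_vaf(50))  # 12.5
```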

Evaluation genes, relatively common genes in solid tumors with variant hotspots:

  (1) EGFR exons 18–21: Mutations in EGFR are well-known drivers in non-small cell lung cancer and are targeted by approved EGFR tyrosine kinase inhibitors.

  (2) KRAS codons 12, 13, 59, 61, 117, 146: KRAS mutations are predictive biomarkers of resistance to anti-EGFR monoclonal antibodies in colorectal cancer.

  (3) BRAF codon 600: The BRAF p.V600E mutation is a clinically actionable biomarker that has been approved for treatment across multiple cancer types.

  (4) PIK3CA codons 542, 545, 1047: In breast cancer with activating mutations in PIK3CA, an AKT inhibitor and a selective estrogen receptor degrader (SERD) are indicated for hormone receptor-positive and HER2-negative cases.

FFPE samples and PT evaluation methods for matched paired T/N samples

The FFPE reference materials consisted of cells carrying synthetic DNA constructs, manufactured to more closely mimic the quality of patient tissue. Two specimens, Seraseq® Compromised FFPE Tumor DNA Reference Material (0710–1492) and Seraseq® Compromised FFPE WT (DNA/RNA) Reference Material (0710–1710), were purchased from LGC Clinical Diagnostic, Inc. (Milford, MA, USA). The Seraseq® Compromised FFPE Tumor DNA Reference Material consists of GM24385 cells engineered to carry 34 genetic variants, which are then formalin-treated and embedded in paraffin to impart “patient-like” characteristics. The Compromised FFPE WT Reference Material uses the GM24385 cell line, which has been formalin-treated to mimic the degraded DNA and RNA found in FFPE samples. These materials provide reproducible mutation-positive and wild-type samples with patient-like performance. Using the GM24385 cell line for both the mutation-carrying and WT reference materials mimics time-matched tumor/normal (non-tumor) samples. Because the artificial FFPE samples used in this study produce a transparent nucleic acid pellet smaller than that of a typical clinical specimen, care had to be taken to ensure that the sample was not lost during extraction. DNA analysis (RNA analysis optional) was performed at all facilities. The test was conducted using each facility's typical methods (EQA/PT). Nucleic acid extraction was performed using the method of each facility, and the results were reviewed together at a later date.

EQA evaluation protocol

Detected variants and their frequencies (% VAF) were examined in a mutually blinded manner using each facility’s detection methods.

The appropriateness of a sample was determined based on the match rate of detected variants across facilities. A match rate of ≥ 80% between facilities was considered suitable, indicating consistency in detection, while a match rate of < 80% was deemed inappropriate, suggesting inconsistency in results.
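
The 80% criterion can be sketched in Python as follows. The text does not give a formula for the match rate, so this example assumes one possible interpretation: for a given expected variant, the match rate is the percentage of facilities that reported it. The facility names and reported variants are hypothetical.

```python
def match_rate(calls_by_lab, variant):
    """Percentage of facilities whose reported variant set contains `variant`."""
    hits = sum(variant in calls for calls in calls_by_lab.values())
    return 100 * hits / len(calls_by_lab)

def is_suitable(rate, threshold=80.0):
    """A match rate at or above the threshold is considered suitable."""
    return rate >= threshold

# Hypothetical reports from five facilities
calls_by_lab = {
    "Facility 1": {"KRAS p.G12C", "BRAF p.V600K"},
    "Facility 2": {"KRAS p.G12C", "BRAF p.V600K"},
    "Facility 3": {"KRAS p.G12C", "BRAF p.V600M"},
    "Facility 4": {"KRAS p.G12C", "BRAF p.V600K"},
    "Facility 5": {"KRAS p.G12C"},
}

for variant in ("KRAS p.G12C", "BRAF p.V600K"):
    rate = match_rate(calls_by_lab, variant)
    print(variant, rate, "suitable" if is_suitable(rate) else "inappropriate")
```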

Statistical analysis

Statistical analysis (correlation analyses) was performed using R (version 4.4.0).