Research Resource, Cancer

A machine learning approach for somatic mutation discovery

Science Translational Medicine  05 Sep 2018:
Vol. 10, Issue 457, eaar7939
DOI: 10.1126/scitranslmed.aar7939
  • Fig. 1 Overview of Cerebro for somatic mutation detection.

    Paired tumor-normal whole-exome sequence data were mapped to the human reference genome using a dual alignment protocol for consensus mutation calling. Candidate mutations were assessed for confidence using an extremely randomized trees classification model (Cerebro) that evaluates a large set of decision trees to generate a confidence score for each candidate variant. A variety of characteristics are considered by the Cerebro model including distinct coverage, mutant allele frequency (MAF), GC content, and mapping quality.

  • Fig. 2 Comparison of Cerebro to mutation detection methods.

    (A) Positive predictive value versus sensitivity for simulated low-purity tumor data sets created from normal cell line sequence data. (B) Sensitivity stratified by mutation type, calculated from simulated mutations across mutation types and allele frequencies. (C) Positive predictive value versus sensitivity for cell line data sets with experimentally validated somatic mutations. DEL, deletion; INS, insertion; 1, 2, or 3, length of indel.

  • Fig. 3 Comparison of TCGA and Cerebro mutations for 1368 exomes.

    Somatic mutations from the TCGA MC3 project and Cerebro were compared for concordance. (A) Total mutational loads between the two mutation calls shared by cancer type. Mutation loads were defined as the total number of nonsynonymous mutations per sample. LUSC, lung squamous cell carcinoma; LUAD, lung adenocarcinoma; BLCA, bladder; STAD, stomach; COAD, colorectal; SARC, sarcoma; HNSC, head and neck squamous cell; SKCM, melanoma; UCEC, uterine; LIHC, liver; KIRC, kidney; *-H, set enriched for high–mutation load samples. (B) Unique/shared status for somatic mutations across all samples. C, colorectal; H, head and neck squamous cell.

  • Fig. 4 Analysis of cancer driver gene mutations.

    Evaluation of mutations in 66 oncogenes and tumor suppressor genes indicated (A) a large number of low-confidence mutations unique to TCGA associated with various problematic features. No High-Quality Alignment, no consistent alignment found with at least one mutant base with quality higher than 30; MAF < 5%, mutant allele frequency below 5%; TBQ < 30, tumor base quality below 30. (B) Distribution of problematic TCGA driver gene calls by genes with FDA-approved therapies (left panel) or ongoing clinical trials (right panel). (C) Quality characteristics of mutations uniquely called by TCGA were more problematic than those of other identified mutations. NMAF, normal mutant allele frequency; TMMQ, tumor mutant mapping quality; TBQ, tumor mutant base quality.

  • Fig. 5 Analysis of TMB in patients treated with immune checkpoint blockade.

    Comparison of Cerebro mutation calls with published calls associated with NSCLC (left panels) or melanoma (right panels). (A) Unique/shared mutation status for all patients. (B) Problematic mutations unique to original publications annotated by characteristic issue. Annotation, conflicting consequence; TDP < 3, tumor distinct pairs less than 3; TBQ < 30, tumor base quality less than 30; Unaligned, no alignment of mutated reads to the mutation position; MAF < 5%, mutant allele frequency less than 5%. (C) Kaplan-Meier analysis of progression-free survival (left) or overall survival (right) using tumor mutation loads from original publications. (D) Kaplan-Meier analysis of the same samples using Cerebro mutational loads. Log-rank P value shown for each survival plot. DCB, durable clinical benefit; NDB, no durable benefit.
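
The Fig. 1 caption describes scoring each candidate variant with an extremely randomized trees model. The sketch below is a minimal, standard-library illustration of that general technique and is not the Cerebro implementation: each tree draws one random cut point per feature, keeps the cut with the lowest weighted Gini impurity, and grows until its leaves are pure; the confidence score is the average vote over the trees. The four feature names follow the caption, but the feature values, labels, tree count, and all function names are made up for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow_tree(rows, labels, rng):
    """Grow one extremely randomized tree until every leaf is pure."""
    if len(set(labels)) == 1:
        return float(labels[0])                  # pure leaf
    best = None
    for f in range(len(rows[0])):                # every feature is a candidate
        lo = min(r[f] for r in rows)
        hi = max(r[f] for r in rows)
        if lo == hi:
            continue                             # constant feature, no split
        cut = rng.uniform(lo, hi)                # random cut, not optimized
        left = [i for i, r in enumerate(rows) if r[f] < cut]
        right = [i for i, r in enumerate(rows) if r[f] >= cut]
        if not left or not right:
            continue
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best is None or score < best[0]:
            best = (score, f, cut, left, right)
    if best is None:
        return sum(labels) / len(labels)         # no usable split: mixed leaf
    _, f, cut, left, right = best
    return (f, cut,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], rng))

def tree_score(tree, row):
    """Follow the splits down to a leaf and return its fraction of true calls."""
    while isinstance(tree, tuple):
        f, cut, left, right = tree
        tree = left if row[f] < cut else right
    return tree

def confidence(forest, row):
    """Average leaf value over all trees, between 0 and 1."""
    return sum(tree_score(t, row) for t in forest) / len(forest)

# Made-up training candidates (NOT data from the paper):
# [distinct coverage, mutant allele frequency, GC content, mapping quality]
rows = [
    [120, 0.35, 0.45, 60], [80, 0.22, 0.50, 58],
    [200, 0.48, 0.40, 60], [150, 0.30, 0.55, 59],
    [15, 0.03, 0.70, 20], [10, 0.02, 0.75, 12],
    [25, 0.04, 0.68, 30], [12, 0.01, 0.72, 25],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]                # 1 = true somatic, 0 = artifact

rng = random.Random(0)
forest = [grow_tree(rows, labels, rng) for _ in range(50)]
new_candidate = [90, 0.18, 0.52, 55]             # hypothetical unseen variant
print(round(confidence(forest, new_candidate), 2))
```

Because every tree here is grown on the full training set until its leaves are pure, training candidates score exactly 0 or 1; only unseen candidates receive intermediate scores. A production model would typically sample a subset of features per split and be validated on held-out data.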

  • Table 1 WES analyses and somatic mutation loads.
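    | Tumor type* | Number of samples | Cerebro: median mutation load | Cerebro: % load > 100 | Cerebro: % load > 250 | Cerebro: % load > 500 | Cerebro: % load > 1000 | TCGA MC3: median mutation load | TCGA MC3: % load > 100 | TCGA MC3: % load > 250 | TCGA MC3: % load > 500 | TCGA MC3: % load > 1000 |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | Lung adenocarcinoma (LUAD) | 473 | 141 | 60.3 | 29.4 | 8.9 | 2.1 | 152 | 62.8 | 31.3 | 10.1 | 2.1 |
    | Lung squamous cell (LUSC) | 134 | 178 | 82.8 | 27.6 | 6.7 | 1.5 | 188 | 84.3 | 34.3 | 7.5 | 1.5 |
    | Bladder (BLCA) | 360 | 136 | 64.4 | 23.1 | 5.6 | 1.1 | 141.5 | 65.0 | 23.9 | 5.8 | 1.1 |
    | Enriched for high–mutation load tumors | | | | | | | | | | | |
    | Liver (LIHC-H) | 141 | 75 | 25.5 | 3.5 | 1.4 | 0.0 | 72 | 24.1 | 3.5 | 1.4 | 0.7 |
    | Melanoma (SKCM-H) | 105 | 308 | 70.5 | 57.1 | 33.3 | 9.5 | 317 | 71.4 | 58.1 | 35.2 | 11.4 |
    | Colon (COAD-H) | 44 | 786.5 | 88.6 | 65.9 | 63.6 | 34.1 | 892.5 | 88.6 | 61.4 | 59.1 | 45.5 |
    | Uterine (UCEC-H) | 41 | 78 | 46.3 | 46.3 | 26.8 | 0.0 | 78 | 46.3 | 46.3 | 22.0 | 0.0 |
    | Kidney renal clear cell (KIRC-H) | 25 | 64 | 8.0 | 0.0 | 0.0 | 0.0 | 62 | 0.0 | 0.0 | 0.0 | 0.0 |
    | Head and neck (HNSC-H) | 15 | 493 | 100.0 | 86.7 | 46.7 | 13.3 | 486 | 100.0 | 93.3 | 46.7 | 13.3 |
    | Stomach (STAD-H) | 6 | 998 | 100.0 | 100.0 | 100.0 | 50.0 | 872.5 | 100.0 | 100.0 | 100.0 | 50.0 |
    | Sarcoma (SARC-H) | 5 | 540 | 100.0 | 100.0 | 60.0 | 0.0 | 522 | 60.0 | 60.0 | 60.0 | 0.0 |
    | Other-H | 19 | 734 | 84.2 | 73.7 | 68.4 | 36.8 | 668 | 94.7 | 84.2 | 68.4 | 36.8 |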

    *-H, tumor samples enriched for high mutation load.
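
The summary columns of Table 1 (median mutation load and the percentage of samples above each load threshold) can be derived from per-sample mutation counts along the following lines. This is a sketch under stated assumptions: the function name and the example loads are hypothetical, and "> 100" is read as strictly greater than 100 nonsynonymous mutations per sample.

```python
from statistics import median

# Load thresholds used as column headers in Table 1.
THRESHOLDS = (100, 250, 500, 1000)

def summarize_loads(loads, thresholds=THRESHOLDS):
    """Return the median load and the percentage of samples above each threshold."""
    n = len(loads)
    pct_above = {
        t: round(100.0 * sum(1 for x in loads if x > t) / n, 1)
        for t in thresholds
    }
    return median(loads), pct_above

# Hypothetical per-sample mutation loads (NOT data from the paper).
example_loads = [50, 80, 120, 300, 600, 1200]
med, pct = summarize_loads(example_loads)
print(med)   # 210.0
print(pct)   # {100: 66.7, 250: 50.0, 500: 33.3, 1000: 16.7}
```

With an even number of samples the median is the mean of the two middle values, which is why some medians in the table end in .5.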

    • Table 2 Comparison of NGS cancer sequencing panels.

      TP, true positive; FN, false negative; FP, false positive; 95% CI, 95% confidence interval; Indel, insertion or deletion; PPV, positive predictive value.

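      | Performance metric | CancerSelect 125: TP | CancerSelect 125: FN | CancerSelect 125: point estimate (%) | CancerSelect 125: 95% CI | Oncomine comprehensive assay: TP | Oncomine comprehensive assay: FN | Oncomine comprehensive assay: point estimate (%) | Oncomine comprehensive assay: 95% CI | TruSeq Amplicon Cancer Panel: TP | TruSeq Amplicon Cancer Panel: FN | TruSeq Amplicon Cancer Panel: point estimate (%) | TruSeq Amplicon Cancer Panel: 95% CI |
      |---|---|---|---|---|---|---|---|---|---|---|---|---|
      | SBS sensitivity | 30 | 0 | 100 | 85.9–100 | 29 | 1 | 96.7 | 80.9–99.8 | 15 | 15 | 50.0 | 33.2–66.8 |
      | Indel sensitivity | 6 | 0 | 100 | 51.7–100 | 5 | 1 | 83.3 | 36.4–99.1 | 3 | 3 | 50.0 | 18.8–81.2 |

      | Performance metric | CancerSelect 125: TP | CancerSelect 125: FP | CancerSelect 125: point estimate (%) | CancerSelect 125: 95% CI | Oncomine comprehensive assay: TP | Oncomine comprehensive assay: FP | Oncomine comprehensive assay: point estimate (%) | Oncomine comprehensive assay: 95% CI | TruSeq Amplicon Cancer Panel: TP | TruSeq Amplicon Cancer Panel: FP | TruSeq Amplicon Cancer Panel: point estimate (%) | TruSeq Amplicon Cancer Panel: 95% CI |
      |---|---|---|---|---|---|---|---|---|---|---|---|---|
      | SBS PPV | 30 | 0 | 100 | 85.9–100 | 29 | 17 | 63.0 | 47.5–76.4 | 15 | 169 | 8.2 | 4.8–13.3 |
      | Indel PPV | 6 | 0 | 100 | 51.7–100 | 5 | 0 | 100 | 46.3–100 | 3 | 6 | 33.3 | 9.0–69.1 |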
    • Table 3 Comparison of clinical cancer sequencing panels.
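      | Performance metric | CancerSelect 125: TP | CancerSelect 125: FN | CancerSelect 125: point estimate (%) | CancerSelect 125: 95% CI | MSK-IMPACT: TP | MSK-IMPACT: FN | MSK-IMPACT: point estimate (%) | MSK-IMPACT: 95% CI |
      |---|---|---|---|---|---|---|---|---|
      | SBS sensitivity | 36 | 0 | 100 | 88.0–100 | 35 | 1 | 97.2 | 83.8–99.9 |
      | Indel sensitivity | 10 | 0 | 100 | 65.5–100 | 10 | 0 | 100 | 65.5–100 |

      | Performance metric | CancerSelect 125: TP | CancerSelect 125: FP | CancerSelect 125: point estimate (%) | CancerSelect 125: 95% CI | MSK-IMPACT: TP | MSK-IMPACT: FP | MSK-IMPACT: point estimate (%) | MSK-IMPACT: 95% CI |
      |---|---|---|---|---|---|---|---|---|
      | SBS PPV | 36 | 0 | 100 | 88.0–100 | 35 | 0 | 100 | 87.7–100 |
      | Indel PPV | 10 | 0 | 100 | 65.5–100 | 10 | 0 | 100 | 65.5–100 |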
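
The point estimates in Tables 2 and 3 follow directly from the counts: sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP). The source does not state how the 95% confidence intervals were computed. The sketch below assumes a Wilson score interval with a continuity correction of 0.5 that is capped so it vanishes when the count equals half the total (the capping rule used, to the best of my knowledge, by R's `prop.test` with its default null proportion). That choice is an inference from the tabulated numbers, and the function names are hypothetical. Under this assumption the sketch gives bounds matching several tabulated rows, for example 29 of 30 and 15 of 184.

```python
from statistics import NormalDist

def proportion_ci(successes, total, level=0.95, null_p=0.5):
    """Point estimate and Wilson score interval with a capped continuity correction."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    est = successes / total
    # Correction of 0.5, reduced to zero when the count equals total * null_p
    # (assumed rule; not stated in the source).
    cc = min(0.5, abs(successes - total * null_p))
    z22n = z * z / (2.0 * total)

    def bound(p_c, sign):
        spread = z * (p_c * (1.0 - p_c) / total + z22n / (2.0 * total)) ** 0.5
        return (p_c + z22n + sign * spread) / (1.0 + 2.0 * z22n)

    p_low = est - cc / total
    p_high = est + cc / total
    lower = 0.0 if p_low <= 0.0 else bound(p_low, -1.0)
    upper = 1.0 if p_high >= 1.0 else bound(p_high, +1.0)
    return est, lower, upper

def sensitivity(tp, fn):
    """Fraction of true mutations that were detected."""
    return proportion_ci(tp, tp + fn)

def ppv(tp, fp):
    """Fraction of reported mutations that were true."""
    return proportion_ci(tp, tp + fp)

# Counts taken from Table 2: Oncomine SBS sensitivity and TruSeq SBS PPV.
for name, (est, lo, hi) in [("sensitivity 29/30", sensitivity(29, 1)),
                            ("PPV 15/184", ppv(15, 169))]:
    print(f"{name}: {100 * est:.1f}% ({100 * lo:.1f}-{100 * hi:.1f})")
```

With only 5 to 36 events per cell the intervals are wide, so a point estimate of 100% is still compatible with a true value well below that.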

    Supplementary Materials

    • www.sciencetranslationalmedicine.org/cgi/content/full/10/457/eaar7939/DC1

      Materials and Methods

      Fig. S1. False-positive evaluation for somatic mutation callers.

      Fig. S2. ddPCR mutation validation analyses.

      Fig. S3. Precision-recall and ROC curve analyses of Cerebro and other mutation callers using experimentally validated alterations.

      Fig. S4. Mutation loads of TCGA exomes using different mutation calling methods.

      Fig. S5. Concordance rates (% of total mutations) of Cerebro compared to other mutation call sets for TCGA exomes.

      Fig. S6. Response to checkpoint inhibitors associated with mutational load.

      Fig. S7. Survival analysis by mutation load stratified by discovery and validation cohorts.

      Fig. S8. Alterations confirmed by ddPCR in clinical NGS panel comparison.

      Fig. S9. Comparative results of three clinical sequencing panels.

      Table S1. Confidence scoring model features for the Cerebro machine learning algorithm.

      Table S2. Performance results for simulated low-purity tumors.

      Table S3. False-positive rates of mutation calling methods.

      Table S4. Sensitivity results by variant type of mutation calling methods.

      Table S5. Sensitivity results for substitutions by MAF.

      Table S6. Sensitivity results for insertion-deletions by MAF.

      Table S7. Results of ddPCR validation of somatic mutations.

      Table S8. Performance results for cell lines with validated somatic mutations.

      Table S9. Unique and shared mutation load results for Cerebro and TCGA (MC3).

      Table S10. Concordance results for Cerebro and TCGA for driver oncogenes and tumor suppressor genes.

      Table S11. Comparison of mutation calls in immunotherapy publications and Cerebro reanalysis.

      Table S12. Evaluation of TruSeq false-positive calls using raw CS125 and Oncomine sequence data.

      Table S13. Comparison of mutation callers for clinical samples.

      Table S14. Shared and unique somatic mutation calls between Cerebro and TCGA.

      Table S15. Genomic ROIs used in clinical panel comparisons.

    • The PDF file includes:

      • Materials and Methods
      • Figs. S1 to S9

      Other Supplementary Material for this manuscript includes the following:

      • Tables S1 to S15 (Microsoft Excel format)
