Research Article
GENETIC DIAGNOSIS

Diagnosis of genetic diseases in seriously ill children by rapid whole-genome sequencing and automated phenotyping and interpretation

Science Translational Medicine  24 Apr 2019:
Vol. 11, Issue 489, eaat6177
DOI: 10.1126/scitranslmed.aat6177
  • Fig. 1 Flow diagrams of the diagnosis of genetic diseases by standard genome sequencing and rWGS.

    (A) Steps in conventional clinical diagnosis of a single patient by genome sequencing (GS) with manual analysis and interpretation in a minimum of 26 hours but with a mean time to diagnosis of 16 days (8, 16–30). Genome sequencing was requested manually. We manually extracted genomic DNA from blood samples, assessed the DNA quality (QA), and manually normalized the DNA concentration. We then manually prepared TruSeq PCR-free DNA sequencing libraries, performed the QA again, and manually normalized the library concentration. Genome sequencing was performed on the HiSeq 2500 system (Illumina) in rapid run mode (RRM). Sequences were manually transferred to the DRAGEN Platform version 1 (Illumina) for alignment and variant calling. Phenotypic features were identified by manual review of the electronic health record (EHR). Variant files and phenotypic features were manually loaded into Opal software (Fabric), and interpretation was performed manually. (B) Steps in autonomous diagnosis of up to six patients concurrently in a minimum of 19 hours (fig. S3).
Steps included (i) automation of order entry from the EHR with a portal; (ii) manual or robotic preparation of Nextera DNA Flex sequencing libraries directly from the blood in 2.5 hours; (iii) rapid 40-fold coverage genome sequencing in 15.5 hours with the NovaSeq 6000 system and S1 flow cell (Illumina); (iv) automation of sequence transfer, alignment, and variant calling in 1 hour with the DRAGEN platform, version 2 (Illumina); (v) automated extraction of patient phenomes from the EHR by clinical natural language processing (CNLP) and translation to Human Phenotype Ontology (HPO) terms in 20 s; and (vi) automated transfer of variant and phenotype files and automated Bayesian comparison of the CNLP phenome with those of all genetic diseases (MOON, Diploid) combined with automated assessment of the pathogenicity of their genomic variants based on aggregated literature knowledge and in silico predictive tools (InterVar) and with automated display of the highest-ranked provisional diagnosis(es).
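
The 19-hour minimum quoted for the autonomous workflow can be checked against the per-step durations given in (B). The following is a minimal sketch, not part of the authors' software; the dictionary keys are illustrative labels, and steps for which the caption gives no duration (order entry, file transfer, and interpretation) are left out.

```python
from datetime import timedelta

# Step durations quoted in the Fig. 1B caption. The keys are illustrative
# labels chosen for this sketch, not identifiers from the authors' system.
STAGES = {
    "library_prep": timedelta(hours=2.5),
    "sequencing": timedelta(hours=15.5),
    "alignment_variant_calling": timedelta(hours=1),
    "cnlp_phenotyping": timedelta(seconds=20),
}


def total_turnaround(stages):
    """Add up the stage durations, assuming they run strictly one after another."""
    return sum(stages.values(), timedelta())


total = total_turnaround(STAGES)
print(total)  # 19:00:20
```

Under the assumption of strictly sequential steps, the quoted durations add up to just over 19 hours, with sequencing accounting for most of the time.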

  • Fig. 2 CNLP can extract a more detailed phenome than manual EHR review or OMIM clinical synopsis.

    (A) Example CNLP of a sentence from the EHR of an 8-day-old baby (patient 341) with maple syrup urine disease, showing four extracted HPO terms. ED, emergency department. (B) Hierarchical display of HPO phenotypic features extracted by manual review of the EHR of neonate 341 and by CNLP (red) and expected phenotypic features (from the OMIM Clinical Synopsis; blue). Yellow circles: Phenotypic features extracted by both CNLP and expert review. Purple circles: Phenotypic overlap between CNLP and OMIM. Gray circles: The location of parent terms of identified phenotypic features within the HPO hierarchy. The information content (IC) was defined by $\mathrm{IC}(\text{phenotype}) = -\log(p_{\text{phenotype}})$, where $p_{\text{phenotype}}$ was the probability of observing the exact term or one of its subclasses across all diseases in OMIM. IC increases from top (general) to bottom (specific).
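
The IC definition above can be illustrated with a small worked example. This is a sketch under stated assumptions: the term hierarchy and disease annotations are invented toy data rather than HPO or OMIM content, and the natural logarithm is assumed because the caption does not state the base.

```python
import math

# Toy is-a hierarchy (child -> parents). Term names are made up for illustration.
PARENTS = {
    "phenotypic_abnormality": [],
    "neuro_abnormality": ["phenotypic_abnormality"],
    "seizure": ["neuro_abnormality"],
    "focal_seizure": ["seizure"],
    "hypoglycemia": ["phenotypic_abnormality"],
}

# Toy disease annotations, standing in for per-disease clinical synopses.
DISEASES = {
    "disease_a": {"focal_seizure"},
    "disease_b": {"seizure", "hypoglycemia"},
    "disease_c": {"hypoglycemia"},
    "disease_d": {"neuro_abnormality"},
}


def with_ancestors(terms):
    """Return the terms together with all of their ancestors in the hierarchy."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(PARENTS[term])
    return seen


def information_content(term):
    """IC = -log(p), where p is the fraction of diseases annotated with the
    term itself or with any of its subclasses (natural log assumed)."""
    expanded = [with_ancestors(terms) for terms in DISEASES.values()]
    hits = sum(term in terms for terms in expanded)
    if hits == 0:
        raise ValueError(f"{term!r} is not annotated to any disease")
    return -math.log(hits / len(expanded))


for name in ("phenotypic_abnormality", "neuro_abnormality", "seizure", "focal_seizure"):
    print(name, round(abs(information_content(name)), 3))
```

In this toy data the root term is annotated (directly or through a subclass) to every disease, so its IC is zero, and IC rises along the path from general to specific terms, which is the top-to-bottom gradient described for panel (B).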

  • Fig. 3 Comparison of observed and expected phenotypic features of 375 children with suspected genetic diseases.

    (A to D) One hundred one children diagnosed with 105 genetic diseases. (E to H) Two hundred seventy-four children with suspected genetic diseases that were not diagnosed by genome sequencing. Phenotypic features identified by manual EHR review are in yellow, those identified by CNLP are in red, and the expected phenotypic features, derived from the OMIM Clinical Synopsis, are in blue. (A) Frequency distribution of the number of phenotypic features (log-transformed) in 101 children with genetic diseases. The mean number of features detected per patient was 4.2 (SD, 2.6; range, 1 to 16) for manual review, 116.1 (SD, 93.6; range, 13 to 521) for CNLP, and 27.3 (SD, 22.8; range, 1 to 100) for OMIM (OMIM versus manual, P < 0.0001; CNLP versus OMIM, P < 0.0001; CNLP versus manual, P < 0.0001; paired Wilcoxon tests). (B) Frequency distribution of IC for each phenotypic feature set in 101 diagnosed patients. The mean IC was 7.8 (SD, 2.0; range, 2.1 to 11.4) for manual review, 8.1 (SD, 2.0; range, 2.6 to 11.4) for CNLP, and 7.3 (SD, 1.7; range, 3.2 to 11.4) for OMIM (manual versus OMIM, P < 0.0001; CNLP versus OMIM, P < 0.0001; manual versus CNLP, P = 0.003; Mann-Whitney U tests). (C) Correlation of the mean IC of phenotypic terms with the number of phenotypic terms in each patient. Spearman’s rank correlation coefficient ($r_s$) was 0.24 for manually extracted phenotypic features (P = 0.02), 0.44 for CNLP (P < 0.0001), and −0.001 for OMIM (P > 0.05). (D) Venn diagram showing overlap of phenotypic terms by the three methods for diagnosed patients. Phenotypic features extracted by CNLP overlapped expected OMIM phenotypic features (mean, 4.31 terms; SD, 4.59; range, 0 to 32) significantly more than those extracted manually (mean, 0.92 terms; SD, 1.02; range, 0 to 4; P < 0.0001, paired Wilcoxon test for the difference in the number of terms that overlap with OMIM).
(E) Frequency distribution of the number of phenotypic features (log-transformed) in 274 children with suspected genetic diseases that were not diagnosed by genome sequencing. The mean number of features was 3.0 (SD, 1.9; range, 1 to 12) for manual review and 90.7 (SD, 81.1; range, 6 to 482) for CNLP (CNLP versus manual, P < 0.0001; paired Wilcoxon test). (F) Frequency distribution of IC for each phenotypic feature set in 274 undiagnosed patients. The mean IC was 7.7 (SD, 2.1; range, 2.1 to 11.4) for manual review and 8.1 (SD, 2.0; range, 2.6 to 11.4) for CNLP (manual versus CNLP, P < 0.0001; Mann-Whitney U test). (G) Correlation of the mean IC of phenotypic terms with the number of phenotypic terms in each patient. $r_s$ was 0.02 for manually extracted phenotypic features (P > 0.05) and 0.30 for CNLP (P < 0.0001). (H) Venn diagram showing overlap of phenotypic terms for undiagnosed patients by CNLP and manual methods.
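
Panels (C) and (G) report Spearman's rank correlation between the number of phenotypic terms per patient and their mean IC. The sketch below shows one way such a coefficient can be computed (Pearson correlation of average ranks); the per-patient values are invented for illustration and are not taken from the study data.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-patient values: number of extracted terms and their mean IC.
term_counts = [13, 40, 95, 210, 521]
mean_ic = [7.1, 7.9, 7.6, 8.4, 8.8]
print(round(spearman(term_counts, mean_ic), 3))  # 0.9
```

Because only ranks enter the calculation, the coefficient is unaffected by the log transformation applied to the term counts in panels (A) and (E).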

  • Table 1 Duration and metrics for the major steps in the diagnosis of genetic diseases by genome sequencing with rapid standard methods and a rapid, autonomous platform.

    Primary (1°) and secondary (2°) analyses: Conversion of raw data from base call to FASTQ format, read alignment to the reference genomes, and variant calling. Tertiary (3°) analysis processing: Time to process variants and phenotypic features and make them available for manual interpretation in Opal interpretation software (Fabric Genomics) or to display a provisional, automated diagnosis(es) in MOON interpretation software (Diploid). Std., rapid standard methods; auto., rapid, autonomous platform; dev. delay, global developmental delay; PPHN, persistent pulmonary hypertension of the newborn; HIE, hypoxic ischemic encephalopathy; n.a., not applicable. Patients 263, 6124, and 3003 were retrospectively analyzed by the autonomous system. Patient 263 was analyzed two times by the autonomous system. Patients 6194, 290, 352, 362, 412, and 7072 were prospectively analyzed by both autonomous and standard diagnostic methods.

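    Patient characteristics and diagnoses:

    | Subject ID | Use type | Age | Sex | Abbreviated presentation | Molecular diagnosis | Gene and causative variant(s) |
    |---|---|---|---|---|---|---|
    | 263 | Retrospective | 8 days | | Neonatal seizures | Early infantile epileptic encephalopathy 7 | KCNQ2 c.727C>G |
    | 6124 | Retrospective | 14 years | | Rhabdomyolysis | Glycogen storage disease V | PYGM c.2262delA, c.1726C>T |
    | 3003 | Retrospective | 1 year | | Dystonia, dev. delay | Dopa-responsive dystonia | TH c.785C>G, c.541C>T |
    | 6194 | Prospective | 5 days | | Hypoglycemia, seizures | None | n.a. |
    | 290 | Prospective | 3 days | | Pulmonary hemorrhage, PPHN | None | n.a. |
    | 352 | Prospective | 7 weeks | | Diabetic ketoacidosis | Permanent neonatal diabetes mellitus | INS c.26C>G |
    | 362 | Prospective | 4 weeks | | Neonatal seizures | None | n.a. |
    | 374 | Prospective | 2 days | | HIE, anemia | None | n.a. |
    | 7052 | Prospective | 17 months | | Pseudomonal septic shock | X-linked agammaglobulinemia 1 | BTK c.974+2T>C |
    | 412 | Prospective | 3 days | | Neonatal seizures | Benign familial neonatal seizures 1 | KCNQ2 c.1051C>G |

    Number of phenotypic features: 511151481422574103465111261243331

    Duration of each step per run (hours:minutes):

    | Subject ID | Method | Sample/library prep | NovaSeq loading | 2 × 101 nt sequencing | 1° & 2° analysis | 3° analysis processing | Total |
    |---|---|---|---|---|---|---|---|
    | 263 | Auto. | 3:20 | 0:20 | 15:36 | 1:03 | 0:06 | 20:25 |
    | 263 | Auto. | 2:55 | 0:17 | 15:31 | 1:02 | 0:05 | 19:56 |
    | 6124 | Auto. | 2:24 | 0:16 | 15:34 | 0:59 | 0:07 | 19:20 |
    | 3003 | Auto. | 2:22 | 0:20 | 15:27 | 0:59 | 0:05 | 19:14 |
    | 6194 | Auto. | 2:10 | 1:38* | 15:26 | 1:07 | 0:06 | 20:42* |
    | 6194 | Std. | 23:54 | 0:20 | 24:13 | 3:05 | 0:15 | 56:03 |
    | 290 | Auto. | 2:12 | 0:29 | 15:25 | 1:00 | 0:08 | 19:29 |
    | 290 | Std. | 22:05 | 0:22 | 24:08 | 1:57 | 0:14 | 48:46 |
    | 352 | Auto. | 2:13 | 0:30 | 15:21 | 1:01 | 0:06 | 19:11 |
    | 352 | Std. | 15:42 | 0:53 | 22:44 | 2:30 | 0:15 | 42:04 |
    | 362 | Auto. | 2:31 | 0:15 | 15:17 | 1:02 | 0:05 | 19:10 |
    | 362 | Std. | 18:30 | 2:30 | 33:36 | 2:30 | 0:15 | 57:21 |
    | 374 | Auto. | 3:30 | 0:45 | 15:17 | 1:02 | 10:28† | 31:02† |
    | 374 | Std. | 10:10 | 0:35 | 21:07 | 2:30 | 0:16 | 34:38 |
    | 7052 | Auto. | 4:30 | 1:00 | 15:19 | 1:09 | 0:06 | 22:04 |
    | 7052 | Std. | 12:10 | 1:00 | 22:46 | 2:25 | 0:16 | 38:37 |
    | 412 | Auto. | 3:05 | 0:20 | 15:58 | 1:24 | 0:06 | 20:53 |
    | 412 | Std. | 23:50 | 0:53 | 21:00 | 2:24 | 0:16 | 48:23 |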

    *Included time to thaw a second set of NovaSeq reagents.

    †Included 10:20 hours of downtime due to data center relocation.
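
    The step durations in Table 1 are given as hours:minutes, so comparing the two methods takes a little time arithmetic. The following is a minimal sketch using the values listed for patient 412; the helper names `parse_hmm` and `format_hmm` are hypothetical and chosen for this example only.

```python
def parse_hmm(text):
    """Convert an 'H:MM' duration into minutes, ignoring footnote markers."""
    hours, minutes = text.rstrip("*†").split(":")
    return int(hours) * 60 + int(minutes)


def format_hmm(total_minutes):
    """Convert minutes back into an 'H:MM' string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Step durations for patient 412 as listed in Table 1, in row order:
# sample/library prep, loading, sequencing, 1° & 2° analysis, 3° analysis.
AUTO_412 = ["3:05", "0:20", "15:58", "1:24", "0:06"]
STD_412 = ["23:50", "0:53", "21:00", "2:24", "0:16"]

auto_total = sum(parse_hmm(t) for t in AUTO_412)
std_total = sum(parse_hmm(t) for t in STD_412)

print(format_hmm(auto_total))              # 20:53
print(format_hmm(std_total))               # 48:23
print(format_hmm(std_total - auto_total))  # 27:30
```

    For this patient the summed steps reproduce the totals listed in the table (20:53 and 48:23), a difference of 27 hours 30 minutes in favor of the autonomous platform.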

    Supplementary Materials

    • stm.sciencemag.org/cgi/content/full/11/489/eaat6177/DC1

      Materials and Methods

      Fig. S1. Venn diagram showing overlap of observed and expected patient phenotypic features in 95 children diagnosed with 97 genetic diseases.

      Fig. S2. Precision, recall, and F1 score of phenotypic features identified manually and by CNLP and OMIM.

      Fig. S3. Flow diagram of the software components of the autonomous system for provisional diagnosis of genetic diseases by rWGS.

      Table S1. Comparison of the analytic performance of standard and new library preparation and genome sequencing in retrospective samples.

      Table S2. Comparison of the analytic performance of standard and rapid library preparation and genome sequencing methods in seven matched prospective samples.

      Table S3. Characteristics of 16 children with genetic diseases used to train CNLP.

      Table S4. Precision and recall of phenotypic features extracted by CNLP from EHRs in 10 children with genetic diseases.

      Table S5. Precision and recall of 26 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 201.

      Table S6. Precision and recall of 96 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 205.

      Table S7. Precision and recall of 95 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 213.

      Table S8. Precision and recall of 158 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 233.

      Table S9. Precision of 85 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 243.

      Table S10. Precision and recall of 90 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 6094.

      Table S11. Precision and recall of 96 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 6098.

      Table S12. Precision and recall of 83 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 6108.

      Table S13. Precision and recall of 44 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 7003.

      Table S14. Precision and recall of 71 phenotypic features extracted and proportion of OMIM clinical features detected by CNLP from the EHR of patient 7004.

      Table S15. The test cohort diagnosed manually by rWGS or rWES and interpreted retrospectively with an autonomous system.

      Table S16. Variant characteristics in rWGS or rWES of the 101 children with 105 genetic diseases.

      Table S17. Number of nucleotide variants shortlisted by MOON and rank of the causal variant in MOON in 84 children with 86 genetic diseases.

      Table S18. Number of SVs shortlisted by MOON and rank of the causal variant in MOON in 11 children with genetic diseases.

      Table S19. Summary statistics of provisional diagnoses reported for clinical rWGS.

      Data file S1. Mapping of HPO terms to SNOMED CT expressions.

      Data file S2. Phenotypic features of 101 children with genetic diseases that were manually extracted by experts from the EHR.

      Data file S3. Phenotypic features of 101 children with genetic diseases that were automatically extracted from the EHR by CNLP.

    • The supplementary PDF file includes the Materials and Methods, figs. S1 to S3, tables S1 to S4, S18, and S19, and the legends for tables S5 to S17 and data files S1 to S3.

    • Tables S5 to S17 and data files S1 to S3 are provided separately in Microsoft Excel format.
