Research Article | Genomics

# Exome Sequencing Can Improve Diagnosis and Alter Patient Management


Science Translational Medicine  13 Jun 2012:
Vol. 4, Issue 138, pp. 138ra78
DOI: 10.1126/scitranslmed.3003544

## Abstract

The translation of “next-generation” sequencing directly to the clinic is still being assessed but has the potential for genetic diseases to reduce costs, advance accuracy, and point to unsuspected yet treatable conditions. To study its capability in the clinic, we performed whole-exome sequencing in 118 probands with a diagnosis of a pediatric-onset neurodevelopmental disease in which most known causes had been excluded. Twenty-two genes not previously identified as disease-causing were identified in this study (19% of cohort), further establishing exome sequencing as a useful tool for gene discovery. New genes identified included EXOC8 in Joubert syndrome and GFM2 in a patient with microcephaly, simplified gyral pattern, and insulin-dependent diabetes. Exome sequencing uncovered 10 probands (8% of cohort) with mutations in genes known to cause a disease different from the initial diagnosis. Upon further medical evaluation, these mutations were found to account for each proband’s disease, leading to a change in diagnosis, some of which led to changes in patient management. Our data provide proof of principle that genomic strategies are useful in clarifying diagnosis in a proportion of patients with neurodevelopmental disorders.

## Introduction

Next-generation sequencing (NGS), in which the whole genome, or a portion thereof, is sequenced, has proven extraordinarily useful for identifying new causes of genetic disease, especially for Mendelian disorders. However, the application of NGS directly in the clinic is not straightforward because of the difficulties in determining which of the thousands of variants of unknown significance (1, 2) are relevant to the individual patient’s presenting signs and symptoms. Still, there is great anticipation that NGS, especially whole-exome sequencing, in which the 1% of the genome that codes for proteins is sequenced (3), will improve diagnostic approaches in genetic disease.

Neurodevelopmental disorders affecting 4 to 6% of the general population, most notably children, include intellectual disability, epilepsy, autism, structural brain diseases, and neuromuscular disorders. The U.S. Centers for Disease Control estimates that the annual cost of neurodevelopmental disorders accounts for 5 to 10% of total health care expenditure in the United States owing to the lifelong care required for these patients (4). The inaccessibility of neural tissue makes it difficult to arrive at a specific diagnosis for patients with neurodevelopmental disorders, so clinicians are left with categorical diagnoses or long differential diagnoses lists. Low-yield and expensive radiographic, electrophysiological, biochemical, and biopsy evaluations are the only prospect of narrowing these lists, often costing in excess of $10,000 per patient (5). Neurodevelopmental disorders exhibit both clinical and locus heterogeneity; therefore, genetic investigations are often limited to candidate sequencing of a single gene, or a small panel of genes, that requires the clinician to have a clear sense of the likely genetic cause before testing. Although chromosomal and copy number variations account for 10 to 20% of these cases (6, 7), the remaining cases have relatively little chance of achieving a genetic diagnosis. The failure to make a specific diagnosis for neurodevelopmental disorders is a major clinical problem because it limits prognostic information, anticipatory counseling, prevention strategies, quality of life, and initiation of potentially beneficial therapies (8). For these reasons, and the finding that many neurodevelopmental disorders have a genetic basis, the neurodevelopmental disorders clinic represents a fruitful area to explore the use of whole-exome sequencing. 
For this project, we identified 118 probands and their families from regions of the world with high rates of consanguinity, which enhances the power to identify recessive genetic mutations using homozygosity mapping (9). About one-sixth of the world’s population resides in these areas, making this an important population to study. In such populations displaying recessive disease, heterozygous alleles can usually be excluded as causative (table S1), greatly reducing the number of variants to be considered, and overcoming one of the potential drawbacks of whole-exome sequencing (5). It is estimated that 80% of the variants causing Mendelian disease are located within the exome, making whole-exome sequencing an attractive method to interrogate variants of high effect (10, 11). Furthermore, about 15% of suspected Mendelian disease has a recessive mode of inheritance, and genomic carrier burden for such disease is estimated at 2.8 mutations per genome in outbred populations, making this class of diseases an important part of the neurodevelopmental disorder spectrum (12). Here, we present data on the application of whole-exome sequencing to a large clinical cohort. Our data show that not only is whole-exome sequencing a useful tool for identifying disease-causing genes, but it is also able to correct or modify the diagnosis in ~10% of the families studied (n = 118), thereby providing proof of principle that whole-exome sequencing can be a useful tool for diagnosis in the clinic.

## Results

### Patient recruitment and diagnostic sequencing

We analyzed a total of 188 families by collecting pedigrees, phenotype information, and blood samples on each genetically informative subject. Initial medical diagnoses were generated by the collective medical team (that is, treating physician, geneticist, and medical specialists) at case conferences consistent with current medical practice (Table 1) and were termed “initial diagnosis” for the purposes of this study.
In some instances, the presenting features were too nonspecific to suggest a unique diagnosis, and in such cases, a categorical diagnosis was assigned. All families contained two or more affected individuals born to consanguineous parents. We used a standard protocol to exclude known disease-causing genes either by direct sequencing of all coding exons and splice sites or by excluding known loci with linkage exclusion mapping. Of the 188 probands (Fig. 1), 40 had mutations in one of the genes associated with the initial diagnosis and the mutation segregated with the phenotype in the family according to a recessive model. Such mutations were reported to the referring physician as part of this research protocol, and the families were not further studied. For the remaining families, mutations in known genes were not identified, and these families moved on to the next phase of analysis. In hindsight, whole-exome sequencing analysis of the 40 probands with mutations in known genes might have been more efficient and cost-effective than single gene sequencing methods because most subjects required evaluation at three or more genetic loci (Table 1).

Table 1. Summary of 10 probands and their families in which whole-exome sequencing corrected diagnosis.

### Linkage analysis

The remaining 148 probands and their families were subjected to genome-wide parametric linkage analysis using a panel of highly informative single-nucleotide polymorphism (SNP) markers. In 30 families, a single linkage peak was identified, and such families were not considered further because we viewed strategies other than whole-exome sequencing to be a more direct method of mutant gene identification. For these families, in no instance did the identified peak overlap with a genetic locus known to cause the initial diagnosis, suggesting that many of these peaks should reveal previously unidentified causes of disease.
In the remaining 118 families, we uncovered between two and eight peaks consistent with linkage, although about 30% of these families were not analyzed with linkage because they came to the study relatively late. These families were instead analyzed using homozygosity mapping from exome data (9).

### Exome sequencing and variant discovery

From the 118 families without single linkage peaks, one proband per family was evaluated using whole-exome sequencing, producing an average coverage at >10× read depth for 96% of the exome, which is within the expected coverage and depth for whole-exome sequencing studies (13) and is sufficient to assess most recessive disease variants. On average, a total of 26,393 ± 4971 (SD) variants were identified per proband for evaluation. Tabulation of the <10% of the genome that failed adequate recovery from whole-exome sequencing (<10× depth) was generated in case a causative variant could not be identified among those recorded. Variants were then filtered and prioritized according to the presumed recessive disease model to identify variants of high effect size (Fig. 2). On the basis of the HapMap project, the average haplotype block size from an offspring resulting from a first-cousin marriage is >10 centimorgan (cM) (14), so we focused on such blocks of homozygosity identified from either parametric [LOD (logarithm of the odds ratio for linkage) scores] or nonparametric (homozygosity mapping) linkage. The remaining variants were then prioritized according to type of mutation (deletion/insertion > nonsense > missense), amino acid conservation, predicted damage to the protein, and relevance of the candidate gene to the given disease. The final variant list contained a mean of 9 (range, 4 to 21) new, coding, homozygous variants in linkage or homozygous intervals per proband (tables S1 and S2).
Variants on the final filtered list were validated by Sanger sequencing, verified as homozygous in affecteds, and tested for segregation in the family to be consistent with the pedigree structure.

### Disease gene identification

In 22 of the 118 probands who were analyzed by whole-exome sequencing, we identified a single variant in a gene not previously implicated in disease, which fell within a region of homozygosity, and suggested a previously unidentified disease gene as the cause of the disorder. Two of these variants in which we have validated segregation are listed in Table 2. Specifically, we identified a mutation in GFM2 in a family with microcephaly, simplified gyral pattern, and insulin-dependent diabetes and a mutation in EXOC8 in a family with Joubert syndrome [Mendelian Inheritance in Man (MIM) number 213300].

Table 2. Whole-exome sequencing is a useful technique for identifying new disease-causing genes. Summary of two families analyzed in which whole-exome sequencing identified a causative gene not previously associated with disease.

GFM2 (also called EFG2) encodes the mitochondrial elongation factor G2 and is part of the mitochondrial translation complex essential for maintaining energy metabolism. The identified c.T2032A variant in family 650 changes p.D576E, but this variant also occurs in a conserved predicted splice site at the acceptor for exon 17 and is predicted to destroy the splice acceptor function based on NetGene2 and BDGP prediction algorithms (15, 16). The presentation is overlapping with Wolcott-Rallison syndrome (MIM 226980) (17), characterized by early-onset insulin-dependent diabetes and occasional microcephaly. Mutations in EIF2AK3 and IER3IP1, encoding a translational initiation factor kinase and an endoplasmic reticulum stress response factor, respectively, have been linked to Wolcott-Rallison syndrome (18, 19).
The p.D576E variant is the single variant found in a homozygous interval that segregates in the family, is not present in 200 ethnically matched controls, is predicted to damage the protein, and occurs in an evolutionarily conserved residue (Fig. 3). This mutation in a mitochondrial elongation factor is consistent with the model of Wolcott-Rallison syndrome as a defect in energy and cellular stress homeostasis, leading to altered neurogenesis and apoptosis. These findings suggest that GFM2 is a rational candidate for the disease and further support the use of whole-exome sequencing in identifying previously unidentified disease-causing genes for Mendelian disorders.

EXOC8 encodes the exocyst 84-kD subunit, one of the critical members of the eight-subunit complex required for targeting secretory vesicles to the plasma membrane during exocytosis (20). The p.E265G variant found in family 982 occurs in the B6 loop of the highly conserved pleckstrin homology (PH) domain, which is involved in binding phosphatidylinositol lipids for vesicular transport. This is the single, segregating variant in the family and is not present in 200 ethnically matched controls. It is predicted to be damaging according to POLYPHEN-2 (15, 16) and occurs in a fully conserved residue. Joubert syndrome is one of the “ciliopathy” diseases, and EXOC8 is part of the ciliary proteome (21). Further, the exocyst complex has been implicated in ciliary function (21). For these reasons, EXOC8 is a rational candidate for this disorder (Fig. 4).

In the remaining 86 probands, we found 2 to 10 variants of unknown significance per proband, some of which are good disease-causing candidates. Studies are ongoing in the lab to improve variant annotation and search for probands with similar phenotypes displaying variants in the same gene in an effort to demonstrate causality, similar to published work (22).
### Corrected patient diagnoses

In 10 of the 118 probands (Table 3), it was apparent that one of the variants occurred within a gene already listed in Online MIM (OMIM) to cause a neurodevelopmental disease phenotype that at least partially overlapped with the phenotype of the proband, suggesting that it might represent the causative mutation (figs. S1 to S10). In each of these 10 patients, however, the genetic diagnosis suggested from whole-exome sequencing differed from the initial diagnosis, leading us to question the veracity of the initial diagnosis. It was initially surprising to identify mutations in known disease genes, because for each initial diagnosis, we had excluded the genes most frequently mutated. For instance, in a family diagnosed with microcephaly, we excluded the genes for primary microcephaly (MCPH1, CDK5RAP2, MCPH4, ASPM, CENPJ, and STIL); in a family displaying ataxia with vitamin E deficiency, we excluded the causative gene (TTPA); and in a family with intellectual disability, we excluded the most commonly mutated gene for the recessive form of the disease (VPS13B) (Table 1).

Table 3. Initial diagnosis compared to genetic diagnosis after whole-exome sequencing in 10 probands. Summary of 10 families analyzed in which whole-exome sequencing corrected diagnosis. In each family, an identified mutation in a known disease-causing gene led to a modification of the diagnosis. Only G726E (family 1436) is a previously reported disease mutation. For all mutations leading to a premature stop codon (families 928, 992, 890, 1409, and 951), other stop codons have been reported with the respective disease.
For the missense mutations not previously reported (families 1004, 995, and 1002), and the splice mutation (family 702), each was located in an amino acid/base pair that is fully conserved across evolution (Supplementary Materials), located in a protein domain essential for protein function or splicing, is predicted to be damaging, and is not found in 200 ethnically matched controls. All mutations segregated normally with the phenotype in these families. These data, in addition to further scrutiny of the patient’s clinical profile, provide evidence that these mutations are the cause of the disorders seen in each family.

To understand this paradox, we returned to the patient charts to review the presentation and clinical course. In each case, we found that the genetic variant was sufficient to explain the full clinical presentation, suggesting that whole-exome sequencing was able to either modify or correct an initial diagnosis for each of these 10 cases.

### Clinical presentations

Family 890: Mutation in VLDLR. This family from Trabzon, Turkey, presented two affecteds at birth with microcephaly, nystagmus, congenital ataxia, mild spasticity, and arachnodactyly. Brain magnetic resonance imaging (MRI) analysis demonstrated severe hypoplasia of the midbrain, consistent with a diagnosis of pontocerebellar hypoplasia (MIM 607596), published as such in 2002 (23). The family was negative for mutations in the three known genes for pontocerebellar hypoplasia—TSEN2, TSEN34, and TSEN54—encoding transfer RNA (tRNA) splicing endonucleases (24), and linkage analysis demonstrated three potential linkage peaks not associated with any known pontocerebellar hypoplasia genes. Whole-exome sequencing identified a homozygous p.G1246fsX1305 alteration, which segregated in the family, leading to a protein frameshift in the VLDLR gene, encoding the very low-density lipoprotein receptor (fig. S1).
Reevaluation of the brain MRI was completely consistent with VLDLR-associated congenital cerebellar ataxia with intellectual disability syndrome (MIM 224050), demonstrating the classical very small, smooth cerebellum (25). The team concluded that the initial diagnosis was incorrect because the clinical phenotype in this family was different from the spectrum previously described for VLDLR-associated disease.

Family 951: Mutation in MAN2B1. This family from Islamabad, Pakistan, presented four affected children with intellectual disability. After a normal pregnancy, labor, and delivery except for low birth weight, there was intellectual disability noted by 2 years of age, as well as mild dysmorphic features including prominent forehead, wide-set eyes, and defects in hearing and speech. Routine metabolic screening and mass spectrometry were noncontributory. The affecteds received an initial diagnosis of recessive intellectual disability and were negative for alterations in the VPS13B gene, tested because of concordant obesity (26). SNP-based linkage analysis pointed to two potential linkage peaks, neither containing genes for autosomal recessive intellectual disability. Whole-exome sequencing demonstrated a homozygous p.W695* truncating mutation in the MAN2B1 gene that segregated fully in the family (fig. S2). The MAN2B1 gene is mutated in α-mannosidosis (MIM 248500) (27), a metabolic lysosomal storage condition caused by an inability to cleave α-linked mannose residues from the nonreducing end of N-linked glycoproteins. Reevaluation of the phenotype in light of this finding confirmed the typical facial appearance, enlarged liver, and vacuolated lymphocytes typical of type I α-mannosidosis (28). The anticipatory guidance and direction of therapy has been changed to reflect this genetic diagnosis (29). The team concluded that the initial diagnosis did not take into account this disease because of the nonspecific presenting features.

Family 1002: Mutation in SPG11.
Family 1002 from Marrakech, Morocco, presented three affected members with progressively unsteady gait from the age of 5 years, interpreted as ataxia. There was areflexia, positive Babinski sign, and loss of proprioception with intact cognition, and a normal brain computed tomography (CT) scan, leading to the initial diagnosis of a progressive ataxia or spasticity. Initial workup included reduced serum levels of ApoA1, high-density lipoprotein (HDL), and vitamin E, consistent with a diagnosis of ataxia with vitamin E deficiency (MIM 277460). The reduced serum levels were within the range of other patients we have evaluated with this condition, although they lacked the common Moroccan 744delA mutation in the TTPA gene (30). However, patients showed nominal improvement in function upon administration of daily exogenous vitamin E, supporting the diagnosis. Full sequence of the TTPA gene was negative for variation, and SNP-based linkage analysis suggested two potential peaks, neither of which contained the TTPA gene or known modulators of vitamin E metabolism. Whole-exome sequencing analysis identified one splice and two missense variants, two of which were fully conserved across species and one predicted to be damaging. Only a homozygous c.T5088G variant leading to a p.A1696G amino acid transversion in the SPG11 gene segregated according to the predicted mode of inheritance in the seven children in the family, providing compelling evidence that this mutation may cause this neurodevelopmental disorder (fig. S3). The SPG11 gene is a recently reported cause of hereditary spastic paraplegia with thin corpus callosum (MIM 604360) (31). The p.A1696G changes a nonpolar neutral amino acid to a polar negative amino acid and is predicted to be damaging to protein function according to POLYPHEN-2 software (15, 16). The p.A1696 residue is perfectly conserved across evolution and occurs within the leucine-rich repeat 3 domain, supporting its pathogenicity. 
This variant was not detected in chromosomes from 200 control Moroccan individuals. Subsequent reevaluation of the family led to reinterpretation of the ataxia as spasticity, and brain MRI analysis in two affecteds demonstrated the characteristic thin corpus callosum, consistent with a diagnosis of SPG11-associated disease. Vitamin E therapy has subsequently been halted without clinical consequence. In this situation, the team concluded that the original initial diagnosis was incorrect due to an initial misinterpretation of the clinical signs and false-positive chemistry studies.

Family 1004: Mutation in GJC2. This family from Cairo, Egypt, presented two affecteds with microcephaly and intellectual disability. The initial diagnosis of microcephaly was assigned on the basis of a head circumference of 48 cm at age 8 years (−2.5 SD) in an older male sibling and 45 cm at age 3 years (−2.5 SD) in a younger female sibling. Brain MRI showed thin corpus callosum, mild ventriculomegaly, and cerebellar hypoplasia. The family tested negative for mutations in the known primary microcephaly genes MCPH1, CDK5RAP2, MCPH4, ASPM, CENPJ, and STIL. As the children aged, they displayed signs of nystagmus, hyperreflexia, and spasticity, atypical for primary microcephaly, and the three linkage peaks identified from SNP-based analysis did not suggest any other microcephaly loci. Whole-exome sequencing demonstrated a homozygous c.C94T alteration in the GJC2 gene that segregated fully in the family and is known to cause hypomyelinating leukodystrophy II (MIM 608804) (32). This mutation leads to a p.R35C amino acid transversion in the connexin domain. The p.R35 residue is perfectly conserved across evolution, is predicted to be damaging, and was not found in chromosomes from 200 Egyptian control individuals (fig. S4).
Subsequent reevaluation of the family focusing on this variant led us to conclude that the spasticity and nystagmus were progressively worsening, along with the presence of mild peripheral axonal neuropathy. MRI reinterpretation showed a hypomyelinating leukodystrophy consistent with GJC2-associated disease. The team concluded that the initial diagnosis was too broadly categorized due to nonspecific presenting features, which precluded a more accurate diagnosis.

## Discussion

The main finding of this work is that whole-exome sequencing is beneficial over individual candidate gene sequencing in identifying mutations in genes not previously suspected in a given patient. This finding provides proof of principle that whole-exome sequencing has the potential to change clinical practice for genetic disease. Specifically, this work demonstrates the use of whole-exome sequencing in the clinic when applied to a group of patients with likely genetic disease for which the cause remained elusive before study. In our study, we found that in 10 cases of 118 probands undergoing whole-exome sequencing, there was a revision of the diagnosis and, in some cases, a change in management. Furthermore, in each of these 10 cases, genetic counseling, prenatal diagnostic options, and carrier testing were altered after diagnosis. We also identified likely causative mutations in two other families with neurodevelopmental disorders, which have the potential to lead to new therapies. Although the ability of NGS to provide an accurate genetic diagnosis has been established for single cases like 3,4-dihydroxyphenylalanine (dopa)–responsive dystonia and Charcot-Marie-Tooth neuropathy (33, 34), this report addresses the benefits of NGS in a large clinical cohort. In our cohort, we first excluded known genetic causes of disease by sequencing likely mutated genes on the basis of the initial diagnosis.
This enriched for patients with new genetic causes of disease and with an incorrect or partially correct diagnosis. From this group, we used whole-exome sequencing to further stratify patients into those with a likely new disease gene (22 of 118, or 19%), those with no obvious single disease gene candidate but rather numerous candidates (86 of 118, or 73%), and those with a mutation in a disease gene known to cause a disorder different from the initial diagnosis (10 of 118, or 8%). These findings should be compared to other diagnostic modalities, such as copy number variant (CNV) or de novo mutation identification in neurodevelopmental disorders, where success rates fall between 10 and 60% in selected populations (35–37). Although it is difficult to compare success rates due to differences in cohort structures, whole-exome sequencing in probands with recessive disease is likely to emerge as an attractive alternative approach to candidate gene sequencing. Our data show that in a substantial portion of patients in the neurodevelopmental disorders clinic, the initial diagnosis might not be accurate or might be too broadly classified. There are several potential reasons why an initial diagnosis might be incorrect or partially correct in the neurodevelopmental disorders spectrum. In our study, we attributed these differences to the following, and it is our experience that these limitations exist in the clinical setting in general: (i) Patient phenotypes differed partially or substantively from the spectrum previously described for a given gene, (ii) medical information or history was incomplete, and (iii) nonspecific clinical features were found in the patients. The field of genetic medicine has literally tens of thousands of unique syndromes. Even an experienced professional might not entertain all possible diagnoses for a given presentation due to the vast number of syndromes to consider.
Medical diagnostic software that helps to maintain a broad and systematic differential could help with this issue (38) and would make a powerful partner to help prioritize and filter data. In each case presented here, the medical team returned to the clinical information to determine why the initial diagnosis differed from the genetic diagnosis, and it was found that most differences were due to limitations in the clinical practice of medicine. Whole-exome sequencing was able to overcome many of these limitations. This study demonstrates the clinical use of whole-exome sequencing and points out potential benefits in correcting patient diagnosis. The current cost for whole-exome sequencing is ~$2000 to $4000 per patient (39) and is expected to drop substantially in the coming years. With similar costs for candidate gene sequencing, whole-exome sequencing should be considered an attractive alternative in families with a presumed genetic cause of disease. Whereas whole-genome sequencing is another technology that is sure to change the face of medicine in the future, whole-exome sequencing has captured the attention of the clinical genetics community because most genetic variants of large effect reside in the exome, because intronic mutations are difficult to interpret and to model, and because genome sequencing is still more expensive than whole-exome sequencing (10, 34). The data presented here suggest that whole-exome sequencing should be considered in a diagnostic context in the appropriate clinical settings.

Whereas whole-exome sequencing was used with some success in this study, it does suffer from limitations—even in the field of Mendelian genetics. Whole-exome sequencing in clinical applications lacks some sensitivity due to its inability to interrogate intronic sequence, the absence of recovery of some exons, and sequencing errors. Even more important is the difficulty in separating causative variants from the vast number of variants of unknown significance identified per patient (10, 13, 15). Furthermore, the recent finding that more than 25% of putative disease-causing variants in available databases are erroneous (12) makes interpretation all that much more difficult. These limitations were partially overcome by restricting analysis to consanguineous families with recessively inherited disease, and allowing exclusion of most variants of unknown significance using criteria specific to this model of disease. In addition, these families allowed for testing of segregation of each variant, thereby providing an additional level of certainty about the causation of each mutation. As human mutation databases become sufficiently populated and carefully curated, the ability to interpret whole-exome sequencing data will greatly improve. The introduction of whole-exome sequencing into routine clinical practice will require careful assessments of these issues. As for the future, the limitations of whole-exome sequencing seem tractable and there are solutions that should allow whole-exome sequencing to ultimately be used routinely in a clinical setting.

## Materials and Methods

### Study participants

The probands for this study were ascertained from the Middle East, North Africa, and Central Asia and were selected based on the criteria of (i) healthy parents with documented consanguinity, (ii) more than one affected child to enrich for recessive disease, and (iii) an initial diagnosis of a neurodevelopmental disorder of unknown genetic etiology. We excluded patients with clear single gene defects in which clinical features are absolutely characteristic, such as for Tay-Sachs disease, Niemann-Pick type C, and spinal muscular atrophy. This study was approved by the Institutional Review Board at the University of California, San Diego, and collaborating institutions; all study participants signed informed consent documents; and the study was performed in accordance with Health Insurance Portability and Accountability Act privacy rules.

### Phenotypic assessments

One or more of the authors who are board-certified in pediatrics, neurology, and/or genetics evaluated each patient. Standard evaluation included full general and neurological examination, height, weight, head circumference measurements, intelligence quotient (IQ), brain MRI or CT, and electroencephalogram (EEG) when clinically indicated. Diagnostic criteria were based on those standards in the field, and initial diagnoses were determined by consensus at clinical team meetings where differential diagnoses, genetic counseling, and care plans were discussed.

Linkage analysis was performed by genotyping all available and consenting members of the family in the affected and parental generations of the pedigree. DNA was extracted from peripheral blood leukocytes with salt extraction, genotyped with the Illumina Linkage IVb mapping panel (40), and analyzed with easyLINKAGE-Plus (41) software to generate multipoint LOD scores.
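As background for the LOD scores mentioned above, the following is a minimal sketch of a two-point LOD calculation for phase-known, fully informative meioses. It is an illustration only: the study used multipoint scores from easyLINKAGE-Plus, which are considerably more involved, and the function name and example counts here are hypothetical.

```python
import math


def two_point_lod(n_meioses: int, n_recombinants: int, theta: float) -> float:
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10( L(theta) / L(0.5) ), where
    L(theta) = theta^k * (1 - theta)^(n - k) for k recombinants in n meioses.
    This is a textbook simplification, not the multipoint method used in the study.
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("n_recombinants must be between 0 and n_meioses")
    non_recombinants = n_meioses - n_recombinants
    log_l_theta = (n_recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1 - theta))
    log_l_null = n_meioses * math.log10(0.5)  # free recombination
    return log_l_theta - log_l_null


# Hypothetical example: 10 informative meioses, no recombinants, theta = 0.01.
example_lod = two_point_lod(10, 0, 0.01)
```

At theta = 0.5 the score is zero by construction, and the score grows as more non-recombinant meioses are observed at a small theta.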

### Exome sequencing and variant analysis

For each sample, whole-exome sequencing was performed with the Agilent SureSelect Human All Exome 50 Mb Kit, which captures up to 50 Mb of the human exome and includes all exons annotated in the consensus CDS database, as well as 10 bases flanking each targeted exon and small noncoding RNAs. This kit provides >95% coverage at 1× and >88% coverage at 10×. Paired-end sequencing with a read length of 75 base pairs was done with Illumina GAIIx or HiSeq2000 instruments (42). Depth of sequencing was 30 ± 16 (SD) per exome.
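The coverage figures above (fraction of targeted bases reaching a given read depth, and mean depth) can be derived from per-base depths over the capture targets. The sketch below assumes those depths are already available as a list of integers. The function name, the use of "at least" thresholds, and the toy values are illustrative assumptions, not the study's own tooling.

```python
from statistics import mean


def coverage_summary(depths, thresholds=(1, 10)):
    """Summarize per-base read depths over the targeted exome.

    Returns the mean depth and, for each threshold t, the fraction of
    target bases covered by at least t reads.
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    n_bases = len(depths)
    fractions = {
        t: sum(1 for d in depths if d >= t) / n_bases for t in thresholds
    }
    return {"mean_depth": mean(depths), "fraction_at_least": fractions}


# Toy per-base depths for ten target positions (hypothetical values).
toy_depths = [0, 5, 12, 30, 45, 9, 60, 33, 10, 28]
summary = coverage_summary(toy_depths)
```

Positions that fall below the chosen threshold could be tabulated separately, in the spirit of the list of inadequately recovered regions described in the Results.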

Whole-exome sequencing generated large data sets that required extensive analysis to identify the important genetic variants. The process of isolating potential disease-causing variants involved several major steps: (i) align and ensure quality of DNA sequences for each exome; (ii) identify polymorphisms in the patient’s DNA sequence compared to a reference sequence with tested SNP and insertion/deletion (INDEL) calling software [that is, Genome Analysis Tool Kit (GATK) and in-house generated tools]; (iii) verify the quality, repeatability, and comparability to published results of SNP and INDEL data; (iv) filter out variants that are outside runs of homozygosity, outside of coding/splice regions, or found in homozygous form in the healthy population; (v) prioritize potential disease-causing variants by type of mutation, conservation of amino acid residue(s) across species, predicted damage to the protein, and relevance to the neurodevelopmental disorder; and (vi) validate potential disease-causing variants by assuring Mendelian segregation in the family, screening healthy, ethnically matched controls, and identifying mutations in the same gene in other families with the disorder.

Genetic variants were delineated with the GATK software for both SNPs and INDELs (43). Briefly, Illumina “qseq.txt” files from each exome were converted to FASTQ format with Illumina Pipeline Software and then aligned to a reference sequence with Burrows-Wheeler Aligner (BWA) software (SourceForge). Duplicate sequencing reads, which can be produced by polymerase chain reaction amplification, were then removed, and sequence quality scores were recalibrated to correct for sequencing errors and artifacts. Alignments that contain INDELs can lead to mismatches that resemble SNPs; therefore, alignment of sequences with Maq was also necessary to identify and isolate INDEL-containing reads. Next, a Bayesian-based SNP caller and INDEL Genotyper (GATK) were used to filter potential disease-causing variants from the aligned sequences. Each aligned exome was then assessed for sufficient quality with the following parameters: number of SNPs called in each lane (average ~26,000), accuracy, depth of coverage, and error rate per read position. Only exomes of high quality were analyzed further.

Exomes were then filtered to highlight likely false-positive variants. Variants were flagged with the following criteria: low SNP confidence, an overwhelmingly high population frequency of the reference base, low depth of coverage for the SNP, and presence of the SNP in homopolymer runs longer than three bases. Flagged variants were included in the subsequent “variant filtering and prioritization pipeline” but were pursued with caution in the “validation pipeline.” Both pipelines are described below.
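The flagging logic can be sketched as follows. This is an illustration only: the `Call` fields, the quality and depth thresholds, and the reason labels are assumptions, and the population-frequency criterion is omitted because it needs external data. As in the text, flagged calls are annotated rather than discarded.

```python
from dataclasses import dataclass


@dataclass
class Call:
    quality: float  # caller confidence, Phred-scaled (hypothetical field)
    depth: int      # read depth at the site
    context: str    # reference bases surrounding the site
    offset: int     # index of the variant site within `context`


def homopolymer_run_length(seq: str, i: int) -> int:
    """Length of the run of identical bases that contains position i."""
    base = seq[i]
    left = i
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = i
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def flag_reasons(call: Call, min_quality=30.0, min_depth=8, max_homopolymer=3):
    """Return the reasons a call looks like a possible false positive.

    Thresholds are illustrative; only "longer than three bases" comes
    from the text.
    """
    reasons = []
    if call.quality < min_quality:
        reasons.append("low_confidence")
    if call.depth < min_depth:
        reasons.append("low_depth")
    if homopolymer_run_length(call.context, call.offset) > max_homopolymer:
        reasons.append("homopolymer")
    return reasons
```

Returning a list of reasons instead of a boolean keeps each flagged variant in the call set while recording why it should be treated with caution during validation.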

Exome call sets underwent the following evaluations: the number of SNPs was compared to the quantities usually found per cleaned exome data set, overlap with the dbSNP database was determined, the transition/transversion ratio was determined as a measure of the false-positive rate in the data set, the data were compared to other exomes sequenced in the lab, and random variants found in the exome were compared to the Human Genome Browser. These data were used as quality measures to determine the overall integrity of the data set compared to published studies. The end result of this pipeline was a list of high-quality potential disease-causing variants that were further filtered and prioritized.
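The transition/transversion ratio used as a quality measure can be computed directly from the called substitutions. The sketch below assumes SNPs are supplied as (reference, alternate) base pairs; the function name and input format are hypothetical. Because each base has one possible transition and two possible transversions, purely random substitution errors would push the ratio toward 0.5, which is why a low ratio suggests a high false-positive rate.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ti_tv_ratio(snps):
    """Return transitions / transversions for (ref, alt) base pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
        if (ref, alt) in TRANSITIONS:
            ti += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv
```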

The “sequencing analysis pipeline” identified numerous potential disease-causing variants: on average, an exome from one individual from a first-cousin marriage contains about 26,000 SNPs and 1000 INDELs (from our data sets). Most of the variants identified were known and/or nondeleterious polymorphisms; however, a small number of them represented frameshift, missense, and nonsense mutations in a potential disease-causing gene. Therefore, variants were further filtered with our “potential disease-causing variants filtering and prioritization pipeline.” First, SNPs and INDELs present in stretches of homozygosity, as determined by either linkage analysis or publicly available homozygosity mapping software (Homozygosity Mapper), were isolated (44). Homozygosity mapping in consanguineous families with recessive disease is a proven method for identifying disease-causing mutations given that the DNA sequence flanking the mutation will be preferentially homozygous by descent in children from consanguineous marriages. Next, variants present in coding regions that lead to nonsynonymous amino acid changes, or found in splice sites, were isolated. Only those not found in the homozygous state in dbSNP genotype studies were pursued. The remaining variants were cross-compared to the Genome Variation Server (http://gvs.gs.washington.edu/GVS/) and SIFT (http://sift.jcvi.org/) databases to determine single-nucleotide evolutionary constraint and conservation scores (GERP, PhastCons), protein damage predictions (Polyphen), and relationship of variants to OMIM classifications (45). These measures were used to prioritize variants that showed high conservation across species, were predicted to damage the protein (or cause nonsense-mediated decay), and were not associated with any known, unrelated diseases. Finally, variants were reviewed for their expression profiles in the brain and their relevance to the neurodevelopmental disorders of interest (Fig. 2). The end result of this pipeline was a list of key potential disease-causing variants that underwent follow-up validation in patients and family members.
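The first three filtering steps described above (runs of homozygosity, coding or splice consequence, and absence of homozygotes in dbSNP) can be sketched as successive filters. This is a simplified illustration, not the in-house pipeline: the `Variant` fields, the effect labels, and the interval representation are assumptions, and the conservation and damage-prediction scoring is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    effect: str                # hypothetical annotation label
    homozygous_in_dbsnp: bool  # seen in homozygous state in dbSNP genotypes

# Consequences retained by the pipeline: nonsynonymous coding or splice site.
KEPT_EFFECTS = {"missense", "nonsense", "frameshift", "splice_site"}


def in_any_interval(variant, intervals):
    """True if the variant lies in any (chrom, start, end) interval, inclusive."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in intervals
    )


def filter_candidates(variants, roh_intervals):
    """Keep variants that pass all three filters, in input order."""
    kept = []
    for v in variants:
        if not in_any_interval(v, roh_intervals):
            continue  # outside runs of homozygosity
        if v.effect not in KEPT_EFFECTS:
            continue  # synonymous or noncoding
        if v.homozygous_in_dbsnp:
            continue  # homozygous in the healthy population
        kept.append(v)
    return kept
```

Applying the cheapest and most restrictive filter first (the homozygous intervals) reduces the number of variants that need annotation-based checks in later steps.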

Once key potential disease-causing variants were identified, each candidate was validated in our “potential disease-causing variants validation pipeline” to determine whether it was the disease-causing mutation. Specifically, we used direct sequencing to confirm that the polymorphism segregates within the family in a manner that is consistent with recessive inheritance, and we verified that the mutation was not present in a large cohort of healthy, ethnically matched control individuals. Thus, each variant was consistent with variant interpretation category 2 (unreported) and is of the type that is expected to cause the disorder according to the American College of Medical Genetics and Genomics guidelines (1, 2).
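The two validation checks can be expressed as simple predicates over genotypes. In this sketch, genotypes are coded as the number of copies of the candidate allele (0, 1, or 2); the function names, the coding, and the assumption that both parents are heterozygous carriers are illustrative and are not taken from the study's protocol.

```python
def consistent_with_recessive(affected, parents, unaffected_siblings):
    """Check segregation under a fully penetrant autosomal recessive model.

    Each argument is a list of genotypes coded as candidate-allele counts.
    """
    return (
        all(g == 2 for g in affected)                      # affected: homozygous
        and all(g == 1 for g in parents)                   # parents: carriers
        and all(g in (0, 1) for g in unaffected_siblings)  # never homozygous
    )


def absent_from_controls(control_genotypes):
    """True if no control individual carries the candidate allele."""
    return all(g == 0 for g in control_genotypes)


# Toy family: two affected children, carrier parents, two healthy siblings.
segregates = consistent_with_recessive([2, 2], [1, 1], [0, 1])
```

A healthy sibling who is homozygous for the candidate allele, or an affected child who is not, would argue against the variant under this model.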

## Supplementary Materials

www.sciencetranslationalmedicine.org/cgi/content/full/4/138/138ra78/DC1

Fig. S1. Genetic data for family 890.

Fig. S2. Genetic data for family 951.

Fig. S3. Genetic data for family 1002.

Fig. S4. Genetic data for family 1004.

Fig. S5. Genetic data for family 702.

Fig. S6. Genetic data for family 928.

Fig. S7. Genetic data for family 992.

Fig. S8. Genetic data for family 995.

Fig. S9. Genetic data for family 1409.

Fig. S10. Genetic data for family 1436.

Table S1. Estimated variants to be considered as causative from whole-exome sequencing are greatly reduced in recessive disease with documented consanguinity.

Table S2. Number of variants identified in each family at each step of the variant filtering and prioritization pipeline.

## References and Notes

1. Acknowledgments: We thank the families for their participation. Funding: Supported by Howard Hughes Medical Institute, NIH [National Institute of Neurological Disorders and Stroke grants R01NS041537, R01NS048453, and R01NS052455; National Human Genome Research Institute grants P01HD070494 (to J.G.G.) and U54HG003067 (to S.B.G. and C.R.); National Institute on Alcohol Abuse and Alcoholism/Center for Inherited Disease Research grant N01-HG-65403] for SNP genotyping, and NSF (grants III-081905 and CCF-1115206 to V.B.). Author contributions: T.J.D.-S., J.L.S., S.B., and J.G.G. recruited patients and designed and analyzed the experiments. N.U., J.S., A.E.S., J.O., V.B., A.G.F., G.N., and N.A. generated the bioinformatic pipeline for data analysis. M.S.Z., G.H.A.-S., L.A.M., L.S., S.A.-H., N.M., T.B.-O., N.A.A.-S., F.M.S., F.C., and M.A. identified and recruited families for study and ascertained clinical information. K.J.H. and A.C. assembled clinical data in tabular format. K.V.G., C.S., C.R., and S.B.G. generated sequencing results and provided preliminary analysis. T.J.D.-S. and J.G.G. wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: Data have been deposited into dbGap (phs000288).