Research Article, Genomics

The Predictive Capacity of Personal Genome Sequencing

Science Translational Medicine  09 May 2012:
Vol. 4, Issue 133, pp. 133ra58
DOI: 10.1126/scitranslmed.3003380


New DNA sequencing methods will soon make it possible to identify all germline variants in any individual at a reasonable cost. However, the ability of whole-genome sequencing to predict predisposition to common diseases in the general population is unknown. To estimate this predictive capacity, we use the concept of a “genometype.” A specific genometype represents the genomes in the population conferring a specific level of genetic risk for a specified disease. Using this concept, we estimated the maximum capacity of whole-genome sequencing to identify individuals at clinically significant risk for 24 different diseases. Our estimates were derived from the analysis of large numbers of monozygotic twin pairs; twins of a pair share the same genometype and therefore identical genetic risk factors. Our analyses indicate that (i) for 23 of the 24 diseases, most of the individuals will receive negative test results; (ii) these negative test results will, in general, not be very informative, because the risk of developing 19 of the 24 diseases in those who test negative will still be, at minimum, 50 to 80% of that in the general population; and (iii) on the positive side, in the best-case scenario, more than 90% of tested individuals might be alerted to a clinically significant predisposition to at least one disease. These results have important implications for the valuation of genetic testing by industry, health insurance companies, public policy-makers, and consumers.


As a result of continuing advances in high-throughput sequencing technologies (1–4), whole-genome sequencing will soon become an affordable approach to identify all sequence variants in an individual human. Recent evidence suggests that each human genome has more than 4.5 million sequence variants, some common, some infrequent (5). To date, several thousand genomic variants have been associated with human diseases, either as rare variants in Mendelian disorders or as common single-nucleotide polymorphisms in genome-wide association studies (GWAS) (6, 7). Whole-genome or whole-exome sequencing has recently been used to identify new disease-predisposing variants in various familial disorders, such as familial pancreatic cancer (8) and Miller syndrome (9). However, the potential utility of genome-wide sequencing for personalized medicine in the general population is unclear. Suppose, for example, that sequencing becomes sufficiently inexpensive that all individuals, at birth, could have their genomes sequenced at negligible cost. What fraction of the population would benefit from such sequencing? “Benefit” in this context is defined as receiving information indicating that the risk of disease is increased or decreased to a degree that would alter an individual’s life-style or medical management.

On the surface, it might seem impossible to answer this question at present, because there are millions of genetic variants in every individual and the contribution of nearly all of these variants to any disease is unknown. However, there is one group of individuals in which this question can be immediately addressed: monozygotic twin pairs. If one twin of the pair has a disease, then the probability of the other twin developing that disease is dependent on the genome whenever that disease has some genetic component. We show below that when this logic is applied to large numbers of twins, estimates of the maximal benefit of genome-wide sequencing in the general (non-twin) population can be made.


Conceptual basis

The key to our analysis is the concept of a “genometype.” We do not know the genomic sequences of the twin pairs analyzed in the studies described herein, but we do know that each twin pair shares a nearly identical genome (10) and that a genome confers a particular genetic risk to every disease. For each disease, we group genomes that confer identical genetic risks into genometypes. For example, genometypes could be grouped into 20 bins, with genometypes in bin 1 conferring zero genetic risk, genometypes in bin 2 conferring 3% genetic risk, genometypes in bin 3 conferring 10% genetic risk, etc. We can then estimate what distributions of genometypes in the population best reflect the observed monozygotic twin concordancy and discordancy for any given disease.

In twin studies on diseases, heritability (defined in Table 1) is generally based on the difference in the incidence of a disease in monozygotic versus dizygotic twins (11, 12). Heritability reflects the average genetic contribution to disease in a twin population. We are interested in the distribution of genetic risks rather than the average. For example, a 30% average risk could reflect a small fraction of twin pairs with genometypes conferring high genetic risk or a larger fraction of twin pairs with genometypes conferring a moderate genetic risk. Among all the distributions of genometypes that are compatible with the twin epidemiologic data, we wished to find the distributions that maximized the potential clinical utility of identifying those genometypes by genomic sequencing.
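The difference between an average genetic risk and its distribution can be illustrated with a small sketch. It uses the concordance probability given in Materials and Methods (Eq. 6c); the nongenetic risk and the two genometype distributions below are hypothetical values chosen only for illustration, not data from the twin registries.

```python
def mean_genetic_risk(freqs, risks):
    """Population-average genetic risk: sum_i f_i * r_i."""
    return sum(f * r for f, r in zip(freqs, risks))


def concordant_probability(e, freqs, risks):
    """Probability that both twins of a pair are affected: sum_i f_i * (e + r_i)^2."""
    return sum(f * (e + r) ** 2 for f, r in zip(freqs, risks))


e = 0.05  # hypothetical nongenetic risk

# Hypothetical distribution A: a small fraction of pairs carries a high genetic risk.
concentrated = ([0.9, 0.1], [0.0, 0.30])
# Hypothetical distribution B: a larger fraction carries a moderate genetic risk.
diffuse = ([0.4, 0.6], [0.0, 0.05])

for name, (freqs, risks) in (("concentrated", concentrated), ("diffuse", diffuse)):
    print(name,
          "mean genetic risk:", round(mean_genetic_risk(freqs, risks), 4),
          "P(concordant):", round(concordant_probability(e, freqs, risks), 4))
```

Both distributions have the same average genetic risk, yet the concentrated one predicts more disease-concordant pairs, which is why concordance data constrain the distribution and not only the average.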

Table 1

Definition of terms.

Whole-genome sequencing-based tests, like any genetic test, can be informative in two ways: Negative and positive tests would indicate a substantially lower or higher risk, respectively, than that of the general population. The challenge is to define “substantially” in clinically meaningful and quantitative terms. An example might help put this challenge into perspective. Suppose a woman receives a whole-genome test result indicating that she has a 90% lifetime risk (the total risk over her entire life) of developing breast cancer. She may decide to have a prophylactic double mastectomy to prevent this outcome. Similarly, if the test indicated an 80% or even a 50% lifetime risk of developing breast cancer, she may consider mastectomy. On the other hand, if the test indicated only a 14% risk of developing breast cancer, then mastectomies would be considered by very few women, given that most women today do not opt for prophylactic mastectomies even though the lifetime risk of developing breast cancer in the general population is 12%.

This example illustrates that the risk threshold required for clinical utility represents a balance between the risk reduction afforded by an intervention and its negative consequences. A precedent exists for defining this threshold, in that the decision to implement genetic tests is often based on a positive predictive value (PPV) of at least 10%, implying that more than 1 in 10 patients with a positive test result are expected to develop disease (13). Although the choice of this threshold will depend on the specific intervention and should ideally be left to the individual, we use this 10% threshold for our population-level analyses of 20 of the 24 diseases analyzed (table S1). In the other four diseases (chronic fatigue syndrome, gastroesophageal reflux disorder, coronary heart disease–related death, and general dystocia), which occur at relatively high frequency in the population, this 10% threshold is inadequate to distinguish individuals with a substantially increased genetic risk from the rest of the population. For these four diseases (table S1), a more appropriate threshold corresponds to one conferring a genetic risk that is at least as great as that of the nongenetic component. Individuals with genometypes conferring this degree of genetic risk would therefore have a total risk at least twice as large as those without any genetic predisposing factors. This 2× threshold in relative risk is similar to those widely used as clinical benchmarks for common diseases (table S2) (14–18).

For whole-genome testing in healthy individuals, we thereby defined a threshold at which a positive test result would be clinically meaningful as follows. If the nongenetic risk was <5%, then the threshold was set at 10%. If the nongenetic risk was >5%, then the threshold was set at 2× the nongenetic contribution. Although we have used these particular thresholds in most of the examples described below, we also describe how these results varied when other thresholds were considered.
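The threshold rule just described can be written as a short function. This is a minimal sketch: the function name is hypothetical, and because the text does not say which branch applies when the nongenetic risk is exactly 5%, the sketch assigns that case to the 2× branch, where both rules give the same 10% value.

```python
def positive_test_threshold(nongenetic_risk):
    """Total-risk threshold at which a positive test counts as clinically meaningful.

    nongenetic_risk is a probability in [0, 1]. Below 5% the threshold is a flat
    10%; otherwise it is twice the nongenetic risk (a relative risk of 2).
    """
    if not 0.0 <= nongenetic_risk <= 1.0:
        raise ValueError("nongenetic_risk must be a probability in [0, 1]")
    if nongenetic_risk < 0.05:
        return 0.10
    return 2.0 * nongenetic_risk


# Hypothetical nongenetic risks, for illustration only.
for e in (0.01, 0.05, 0.08):
    print(e, positive_test_threshold(e))
```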

Twin data

We collated monozygotic twin pair data from the Swedish Twin Registry, Danish Twin Registry, Finnish Twin Cohort, Norwegian National Birth Registry, and the National Academy of Science–National Research Council World War II Veteran Twins Registry (19–31) (Table 2). From these registries, we selected data representing 24 diseases of diverse etiologies including autoimmune diseases, cancer, cardiovascular diseases, genitourinary diseases, neurological diseases, and obesity-associated diseases. Three of these conditions (coronary heart disease, cancer, and stroke) represent the leading causes of mortality in the United States, accounted for 54.2% of total deaths in 2007, and are therefore of major public health importance (32). The thresholds for a clinically meaningful test result, as defined above, were calculated from disease prevalence and nongenetic risks in the populations from which the twins were drawn (19–31) (Materials and Methods, Table 2, and table S2).

Table 2

Population-based twin studies used for analysis. Disease prevalence in cohort [cohort risk (CR)] was determined as described in Materials and Methods. MZ, monozygotic.

Mathematical model

We then developed computational methods to evaluate possible frequency (f) and genetic risk (r) combinations for a population containing 20 genometypes. Genometype frequency is defined as the proportion of twin pairs in the population that have a given genometype (Table 1). Genometype genetic risk is defined, for each disease, as the absolute increment in risk that an individual with that genometype will face compared to someone with no genetic risk at all (Table 1). For any combination of genometypes, each with a certain frequency and genetic risk, we obtain an expected distribution of disease-affected individuals among a monozygotic twin cohort. Many different combinations of genometype frequencies and genetic risks match the observed distributions in monozygotic twins; we are interested in those combinations (distributions) that maximize clinical utility, as noted above and further explained below. The mathematical framework for our study and associated statistical and technical issues are detailed in Materials and Methods.

Clinical implications

These analyses allowed us to address various measures of potential clinical utility. First, for each disease, what are the maximum and minimum fractions of patients with the disease who would receive a positive test, that is, a result indicating that they have a substantially increased risk of that disease? The answers to this question are shown graphically in Fig. 1 for each of the 24 diseases (for three diseases, we present different answers for males and females, resulting in a total of 27 disease categories). As can be seen from Fig. 1, the maximum fraction of patients who would receive a positive test varies widely from disease to disease. Most of the patients (>50%) who would ultimately develop 13 of the 27 disease categories would not test positive, even in the best-case scenario. On the other hand, there were four disease categories (thyroid autoimmunity, type I diabetes, Alzheimer’s disease, and coronary heart disease–related deaths in males) for which genetic tests might identify more than 75% of the patients who ultimately develop the disease. Genometype risk and frequency distributions for all diseases are shown in table S3 and graphically for representative diseases in fig. S1.

Fig. 1

The fraction of cases (that is, patients with disease) who would test positive by whole-genome sequencing. For each disease, the maximum and minimum fraction of cases that would test positive using the thresholds defined in table S1 are plotted.

We could also determine the maximum and minimum fraction of individuals in the population (rather than the fraction of patients with disease) who would receive positive test results for each disease. As shown in Fig. 2, this fraction is generally small, as expected, because the incidence of most diseases is relatively low. Are these negative tests, which would be received by the great majority of individuals for most diseases, informative? Negative tests could be informative to individual patients if they indicated a considerably lower total risk than would be assumed in the absence of testing. As can be seen from Fig. 3, though, negative tests are generally not very informative in the case of whole-genome sequencing, because such genetic tests are limited by the nongenetic component of risk. For 22 of the 27 disease categories studied, a negative test would not indicate a risk that is less than half that in the general population, even in the best-case scenario. This level of risk reduction is probably not sufficient to warrant changes of behavior, life-style, or preventative medical practices for these individuals (33, 34). On the other hand, there was one disease category (Alzheimer’s disease, Fig. 3) in which a negative test result might indicate as little as a ~12% relative risk of disease compared to the entire twin cohort, at least in the best-case scenario. Knowledge of such a reduced risk might be comforting and relieve anxiety, particularly to those with a family history of Alzheimer’s disease.

Fig. 2

Percentage of individuals in the general population who would test positive by whole-genome sequencing. For each disease, the maximum and minimum fraction of individuals in the population that would test positive using the thresholds defined in table S1 are plotted.

Fig. 3

Relative risk of disease in individuals testing negative by whole-genome sequencing. A relative risk of 100% represents the same risk as the general population, that is, the cohort risk. Relative risks were calculated using the genometype frequencies and genometype genetic risks that maximized or minimized sensitivity for disease detection.

What is the maximum fraction of individuals that could receive at least one positive test result, that is, a report indicating that he or she is at risk for at least 1 of the 24 diseases assessed? From the data depicted in Fig. 2, we estimate that >95% of men and >90% of women could receive at least one positive test result if the risk alleles were actually distributed in the way that produced maximal sensitivity in our model. We assumed that the risk alleles for these 24 diseases were independent in these estimates; if they were not independent, then these figures represent overestimates. On the other hand, these frequencies may represent underestimates because there are a number of additional diseases with hereditary components that have not yet been studied in monozygotic twins or included in our analyses. At the very least, if we consider only distinct disease categories whose pathogenesis is unlikely to be shared, our analyses suggest that, in the best-case scenario, most of the tested individuals might be alerted to a clinically meaningful risk by whole-genome sequencing.
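Under the independence assumption stated above, the chance of at least one positive result is one minus the product of the per-disease chances of a negative result. The sketch below shows only that arithmetic; the function name and the per-disease fractions are hypothetical and are not the values plotted in Fig. 2.

```python
def prob_at_least_one_positive(positive_fractions):
    """P(at least one positive test) = 1 - prod_i (1 - p_i), assuming independence."""
    p_all_negative = 1.0
    for p in positive_fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each fraction must be a probability in [0, 1]")
        p_all_negative *= 1.0 - p
    return 1.0 - p_all_negative


# Hypothetical example: 24 diseases, each with a 10% chance of a positive result.
print(prob_at_least_one_positive([0.10] * 24))
```

Even modest per-disease fractions compound quickly across many diseases, which is how a small positive fraction for each disease can coexist with a high chance of at least one positive result. If the risk alleles are positively correlated across diseases, this product rule overstates the combined figure.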

It was of interest to determine how the results described above varied with the threshold chosen for the analysis. For example, it might be argued that a threshold of 10% was too low for true clinical utility. Our analyses show that the maximum fraction of affected cases testing positive, as well as the maximum fraction of the total population that tests positive, is not changed much when the thresholds are changed to 20% (tables S4 and S5). With very high thresholds, however, both these measures of sensitivity decrease significantly (tables S4 and S5). Moreover, the maximum predictive value of a negative test drops precipitously at higher thresholds (table S6).


The general public does not appear to be aware that, despite their very similar height and appearance, monozygotic twins in general do not always develop or die from the same maladies (35, 36). This basic observation, that monozygotic twins of a pair are not always afflicted by the same maladies, combined with extensive epidemiologic studies of twins and statistical modeling, allows us to estimate upper and lower bounds of the predictive value of whole-genome sequencing.

On the negative side, our results show that most of the tested individuals would receive negative tests for most diseases (Fig. 2). Moreover, the predictive value of these negative tests would generally be small, because the total risk for acquiring the disease in an individual testing negative would be similar to that of the general population (Fig. 3). On the positive side, our results show that, at least in the best-case scenario, most of the patients might be alerted to a clinically meaningful risk for at least one disease through whole-genome sequencing.

These conclusions should be compared to other models as well as current knowledge about risk allele loci from GWAS (5–7, 37–39, and references therein). In general, GWAS have shown that many loci can predispose to disease and that each risk allele confers a relatively small effect (38, 39). For example, a recent analysis of large cohorts of individuals with colorectal cancer showed that only ~1.3% of phenotypic variance could be accounted for by the 10 loci discovered through GWAS (40). However, it could be argued that the relatively low level of utility that might be inferred from such studies is misleading. In particular, it is possible that a more complete knowledge of disease-associated variants and their epistatic relationships would be able to reliably predict who will and who will not develop disease in the general population. Modeling allows estimation of the maximum possible information that could be derived from such tests.

Several of our conclusions are based on the genometype frequency and risk distributions that would maximize the clinical utility of genetic testing, that is, are best-case scenarios. The actual frequency and risk distributions of genometypes in the population are not likely to be distributed in this way. Indeed, other distributions are also consistent with the monozygotic twin data on which our maxima are determined and all other distributions yield less clinical utility than those of the maxima. Moreover, in the real world, it is unlikely that the biomedical correlates of every genetic variant and the epistatic relationships among these variants will ever be completely known, or that the analytic validity of genetic testing will be perfect—as we assume in our ideal scenario.

Thus, our conclusions purposely overestimate the value of whole-genome sequencing that will be achieved—they represent an absolute upper bound that cannot be improved by improvements in technology or genetic knowledge. As a practical example of this principle, we estimate that a negative whole-genome sequencing–based test could indicate a nearly twofold decrease in risk for prostate cancer in men and a similar twofold decrease for urinary incontinence in women. But this twofold decrease would only apply in a world in which the risk alleles are distributed in a fashion that maximizes the sensitivity of whole-genome testing (Fig. 3). In the real world, the risk alleles are not likely to be distributed in this ideal fashion, and omniscience about every variant is not likely to be realized. Thus, the risk of these diseases in patients who test negative will likely be even more similar to that of the general population. For diseases with a lower heritable component, such as most forms of cancer, whole genome–based genetic tests will be even less informative. Thus, our results suggest that genetic testing, at its best, will not be the dominant determinant of patient care and will not be a substitute for preventative medicine strategies incorporating routine checkups and risk management based on the history, physical status, and life-style of the patient.

It is important to point out that our study focused on testing relatively common diseases in the general population and did not address the use of whole-genome sequencing to identify the genetic basis of rare monogenic diseases. In such unusual cases, it has already been shown that whole-genome sequencing can prove highly informative [for example, (8, 9)].

As with any model-based study, our conclusions have a number of caveats. Our analyses are based on data from twin studies and the assumptions made therein (11). Specifically, we do not model gene-environment interactions and rely on the prevalence of disease in the twin cohorts; this prevalence, as well as the operative nongenetic contributions, may differ from that in the general population. Although twins are likely to be representative of the general population, the estimates provided by our model could be improved through analyses of larger twin cohorts as these become available, as well as through a more complete phenotypic evaluation of twins of varying ethnicities. Another caveat is that our conclusions about potential utility are based on thresholds that represent a complex balance of personal choices, demographic influences, disease characteristics, and the clinical intervention(s) available. We have used a minimum 10% total risk and a minimum relative risk of 2 as the threshold in our analyses. Other thresholds may be more appropriate and meaningful for given situations, although the data in tables S4 to S6 show that our major conclusions are not altered much by the choice of threshold.

In sum, no result, including ours, can or should be used to conclude that whole-genome sequencing will be either useful or useless in an absolute sense. This utility will depend on the results of testing, the individual tested, and the perspectives of individuals and societies. What we hoped to accomplish with this study is to put the debate about the value of such sequencing in a mathematical and clinically relevant framework so that the potential merits and limitations of whole-genome sequencing, for any disease, can be quantitatively assessed. Recognition of these merits and limits can be useful to consumers, researchers, and industry, because they can minimize unrealistic expectations and foster fruitful investigations.

Materials and Methods

Twin studies used for genometype analyses

We used data from twin studies arising from population-based twin registries to investigate the distribution of disease risk within the population (19–31). The registries in our study included the Swedish Twin Registry, Danish Twin Registry, Finnish Twin Cohort, Norwegian National Birth Registry, and the National Academy of Science–National Research Council World War II Veteran Twins Registry. Traits were chosen that represented diverse etiologies or were conditions of significant public health importance. We evaluated diseases in the following categories: autoimmune (type 1 diabetes and thyroid autoantibodies), neoplastic (breast, colorectal, and prostate cancer), cardiovascular (coronary heart disease– and stroke-related death), genitourinary (general dystocia, pelvic organ prolapse, and urinary incontinence), unknown etiology (irritable bowel syndrome and chronic fatigue), neurological (Parkinson disease, Alzheimer’s disease, and dementia), and obesity-associated (type 2 diabetes and gallstone disease).

To be included in our analyses, the following data had to be available for each twin study: (i) $n_t$: total number of monozygotic twin pairs where the disease status of each twin was known; (ii) $n_c$: number of disease-concordant monozygotic twin pairs; (iii) $n_d$: number of disease-discordant monozygotic twin pairs; (iv) $n_h$: number of healthy-concordant monozygotic pairs; and (v) heritability (HER): calculated as the proportion of the polygenic liability variation associated with genetic factors.

Using the data from population-based twin studies, we define cohort risk (CR), the fraction of people in the cohort that had the disease, as follows:

$$\mathrm{CR} = \frac{2n_c + n_d}{2n_t} \tag{1}$$
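As a worked instance of Eq. 1, each concordant pair contributes two affected individuals and each discordant pair contributes one, out of two individuals per pair. The sketch below uses a hypothetical function name and made-up pair counts.

```python
def cohort_risk(n_concordant, n_discordant, n_total_pairs):
    """Cohort risk (Eq. 1): affected individuals divided by all individuals."""
    if n_total_pairs <= 0:
        raise ValueError("n_total_pairs must be positive")
    if n_concordant + n_discordant > n_total_pairs:
        raise ValueError("affected pairs cannot exceed the total number of pairs")
    return (2 * n_concordant + n_discordant) / (2 * n_total_pairs)


# Hypothetical cohort: 1000 pairs, 10 disease-concordant, 80 disease-discordant.
print(cohort_risk(10, 80, 1000))
```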

Model of the predictive capacity of personal genome sequencing

We define the following generative model that characterizes the joint distribution of an individual having a prespecified disease and a particular genometype. Each individual is characterized by (i) a binary (Bernoulli) random variable, Z, specifying whether he or she has the disease and (ii) a categorical random variable, G, indicating the genometype of the individual. This means that of the assumed extant genometypes, each individual can have only one of them. The joint distribution of both the disease and the genometype for an individual is given by P(Z, G). This joint distribution decomposes into a product of the likelihood of getting the disease given the genometype, P(Z|G), and the prior probability of having the genometype, P(G):

$$P(Z, G) = P(Z \mid G)\,P(G) \tag{2}$$

Thus, to proceed, we specify both the likelihood function, P(Z|G), and the prior, P(G). As mentioned above, G is a categorical random variable taking values $g_1, g_2, \ldots, g_d$, each with some probability. Therefore, we have

$$P(G = g_i) = f_i \tag{3}$$

for all i = 1, 2, …, d. In words, a person can have one of the d assumed extant genometypes, and the probability of having genometype i is given by $f_i$.

The probability of having the disease given a genometype is $q_i = e + r_i$. Thus, $q_i$ is the sum of a nongenetic risk, e, that is assumed to be constant for the whole population, and genetic risk, $r_i$ (note that $0 \le q_i \le 1$). Nongenetic risk (e) is the proportion of people in the population that would get the disease if all had the most favorable genometype possible. Nongenetic risk includes all factors that are not inherited, including environmental exposures (for example, diet and carcinogens), epigenetic alterations, and stochastic influences. We estimated it as follows: e = CR(1 − HER) (see below). This model assumes that all risks are either nongenetic or genetic, that is, no interactions. We require that the unknown parameters, $r_i$, must be between 0 and 1 − e, for all i. Therefore, the likelihood term for genometype i is given by

$$P(Z = z \mid G = g_i) = \begin{cases} e + r_i, & \text{if } z = 1 \\ 1 - e - r_i, & \text{if } z = 0 \end{cases} \tag{4}$$
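The two quantities defined in this paragraph, the nongenetic risk e = CR(1 − HER) and the total risk e + r for a genometype, can be sketched as follows. The function names and the numeric inputs are hypothetical.

```python
def nongenetic_risk(cohort_risk, heritability):
    """Nongenetic risk as estimated in the text: e = CR * (1 - HER)."""
    if not 0.0 <= cohort_risk <= 1.0 or not 0.0 <= heritability <= 1.0:
        raise ValueError("cohort_risk and heritability must lie in [0, 1]")
    return cohort_risk * (1.0 - heritability)


def total_risk(e, genetic_risk):
    """Total disease risk for a genometype: q = e + r, with 0 <= r <= 1 - e."""
    if not 0.0 <= genetic_risk <= 1.0 - e:
        raise ValueError("genetic_risk must lie between 0 and 1 - e")
    return e + genetic_risk


# Hypothetical disease: 10% cohort risk and 40% heritability.
e = nongenetic_risk(0.10, 0.40)
print(e, total_risk(e, 0.20))
```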

Thus, the joint distribution of disease and genometype can be written as follows:

$$P(Z = z, G = g_i) = f_i\,(e + r_i)^{z}\,(1 - e - r_i)^{1 - z}, \quad z \in \{0, 1\},\; g_i \in \{g_1, \ldots, g_d\} \tag{5}$$

If the available data included the genometype and disease status of each individual, then inferring estimates of the parameters, $r = (r_1, \ldots, r_d)$ and $f = (f_1, \ldots, f_d)$, would be relatively straightforward. However, the available data include only the disease status of monozygotic twins. Each monozygotic twin pair provides observations of disease status in two individuals with identical genometypes. Therefore, we can describe a joint distribution for monozygotic twins having a disease or not. Let $Z_j = Z(X_j)$ be the Bernoulli random variable indicating whether a particular individual has disease, and let $Z_k = Z(X_k)$ be the Bernoulli random variable for the co-twin. Similarly, let $G_j = G(X_j)$ and $G_k = G(X_k)$ be categorical random variables indicating whether twin j or k of a pair has some particular genometype. The distribution of disease within monozygotic twins can be divided into three distinct groups, namely, disease-concordant, disease-discordant, and healthy-concordant pairs.

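The probability of disease-concordant monozygotic twins is given by

\begin{align}
P(Z_j = Z_k = 1 \mid G_j = G_k) &= \sum_i P(Z_j = Z_k = 1 \mid G_j = G_k = g_i)\,P(G_j = G_k = g_i) \tag{6a} \\
&= \sum_i P(Z_j = 1 \mid G_j = g_i)\,P(Z_k = 1 \mid G_k = g_i)\,P(G_j = G_k = g_i) \tag{6b} \\
&= \sum_i (e + r_i)^2 f_i \tag{6c}
\end{align}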

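Similarly, the probability of healthy-concordant monozygotic twin pairs is given by

\begin{align}
P(Z_j = Z_k = 0 \mid G_j = G_k) &= \sum_i P(Z_j = Z_k = 0 \mid G_j = G_k = g_i)\,P(G_j = G_k = g_i) \tag{7a} \\
&= \sum_i (1 - e - r_i)^2 f_i \tag{7b}
\end{align}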

The probability of monozygotic twin pairs discordant for disease is given by

$$P(Z_j \ne Z_k \mid G_j = G_k) = 2 \sum_i (e + r_i)(1 - e - r_i) f_i \tag{8}$$


For each disease, let $n_c$, $n_h$, and $n_d$ correspond to the number of concordant diseased, concordant healthy, and discordant twin pairs, respectively. Assuming that there are d genometypes, the expected number of twin pairs of each of the three types is simply the total number of twin pairs times the probability of being each kind of twin pair (Eqs. 6c, 7b, and 8):

\begin{align}
E[n_c] &= n_t \sum_{i \in [d]} (e + r_i)^2 f_i \tag{9} \\
E[n_h] &= n_t \sum_{i \in [d]} (1 - e - r_i)^2 f_i \tag{10} \\
E[n_d] &= n_t \sum_{i \in [d]} 2 (e + r_i)(1 - e - r_i) f_i \tag{11}
\end{align}
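The three pair probabilities and the corresponding expected counts can be computed directly from a candidate set of genometype frequencies and genetic risks. The sketch below is illustrative only: the function names and the two-genometype parameter values are hypothetical, and the healthy-concordant term follows the healthy-concordant probability derived above.

```python
def pair_probabilities(e, freqs, risks):
    """Return (P(concordant diseased), P(concordant healthy), P(discordant))."""
    p_c = sum(f * (e + r) ** 2 for f, r in zip(freqs, risks))
    p_h = sum(f * (1.0 - e - r) ** 2 for f, r in zip(freqs, risks))
    p_d = sum(2.0 * f * (e + r) * (1.0 - e - r) for f, r in zip(freqs, risks))
    return p_c, p_h, p_d


def expected_pair_counts(n_total_pairs, e, freqs, risks):
    """Expected numbers of each pair type: total pairs times each probability."""
    return tuple(n_total_pairs * p for p in pair_probabilities(e, freqs, risks))


# Hypothetical parameters: nongenetic risk 0.05, two genometypes.
e = 0.05
freqs = [0.9, 0.1]   # genometype frequencies, summing to 1
risks = [0.0, 0.30]  # genetic risk of each genometype

print(pair_probabilities(e, freqs, risks))
print(expected_pair_counts(1000, e, freqs, risks))
```

Because every pair is concordant diseased, concordant healthy, or discordant, the three probabilities sum to 1 whenever the frequencies do, which is a convenient sanity check on any candidate parameter set.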

Because we are interested in the limits of utility of genetic testing, we search for a parameter set that maximizes or minimizes the fraction of patients who will receive a positive test result, given certain constraints. Formally, we define the positive fraction (PF) as the proportion of cases that have a genometype sufficient to change clinical action. In our notation:

$$\mathrm{PF}(t, e; f, r) = \frac{\sum_{i \in [d] \,\mid\, r_i > t} f_i \left[ (e + r_i)^2 + (e + r_i)(1 - e - r_i) \right]}{\sum_{i \in [d]} f_i \left[ (e + r_i)^2 + (e + r_i)(1 - e - r_i) \right]} \tag{12}$$

where t is the genetic risk required for a person to be at the threshold required for clinical utility and d is the maximum number of genometypes under consideration. The thresholds for each disease are provided in table S1, and for each disease, t is defined as this threshold minus e.
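A direct evaluation of PF for a candidate parameter set might look like the sketch below. The bracketed term in Eq. 12, (e + r)^2 + (e + r)(1 − e − r), simplifies algebraically to e + r, so each genometype is weighted by its frequency times its total disease risk. The function name and the example parameters are hypothetical.

```python
def positive_fraction(t, e, freqs, risks):
    """Positive fraction (Eq. 12): share of cases whose genetic risk exceeds t."""
    def case_weight(f, r):
        q = e + r
        # Written as in Eq. 12; algebraically this equals f * q.
        return f * (q ** 2 + q * (1.0 - q))

    all_cases = sum(case_weight(f, r) for f, r in zip(freqs, risks))
    if all_cases == 0.0:
        raise ValueError("no cases expected under these parameters")
    positive_cases = sum(case_weight(f, r) for f, r in zip(freqs, risks) if r > t)
    return positive_cases / all_cases


# Hypothetical parameters: nongenetic risk 0.05 and a 10% total-risk threshold,
# so the genetic-risk threshold is t = 0.10 - 0.05.
e = 0.05
t = 0.10 - e
print(positive_fraction(t, e, [0.9, 0.1], [0.0, 0.30]))
```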

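We therefore seek to solve the following optimization problem, for each disease:

$$\underset{f,\,r}{\text{maximize}} \quad \mathrm{PF}(t, e; f, r) \tag{13}$$

$$\text{subject to} \quad f_i \ge 0, \quad \sum_i f_i = 1, \quad r_i \in (0, 1 - e), \quad \sum_{x \in \{c, h, d\}} \left( \hat{n}_x - E[n_x] \right)^2 \le 0.25 \tag{14}$$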

where Eq. 14 enforces that none of the residual errors can be larger than 0.5. The estimated number of twin pairs, $\hat{n}_x$, is the estimated number of twin pairs of each type obtained by plugging the estimated parameters into Eqs. 9 to 11. This is therefore a quadratically constrained optimization problem. We use the following algorithm to obtain a local optimum.
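The fit constraint can be checked with a few lines of code. This sketch assumes that Eq. 14 bounds the summed squared residuals over the three pair types by 0.25 (that is, 0.5 squared), which guarantees that no single residual exceeds 0.5; the function name, dictionary keys, and counts are hypothetical.

```python
def fits_twin_data(observed, modeled, max_residual=0.5):
    """True if the summed squared residuals over pair types c, h, d are <= max_residual^2.

    observed and modeled map the keys "c", "h", "d" to numbers of twin pairs.
    """
    squared_error = sum((observed[k] - modeled[k]) ** 2 for k in ("c", "h", "d"))
    return squared_error <= max_residual ** 2


# Hypothetical observed counts and two candidate model fits.
observed = {"c": 15, "h": 854, "d": 131}
close_fit = {"c": 14.7, "h": 854.2, "d": 131.1}
poor_fit = {"c": 14.0, "h": 855.0, "d": 131.0}

print(fits_twin_data(observed, close_fit), fits_twin_data(observed, poor_fit))
```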

For d′ = 2, that is, starting with d′ = 2 genometypes, we implement a grid search over the parameter space and select the parameters that maximize the likelihood over a constrained search space. Let θ = (f, r) and Θ be the set of all θs under consideration, as defined by the feasible region specified in Eq. 14. We then discretize this space into 9 bins for each element of f and 100 bins for each element of r and denote P(Z|G) by $P_\theta(Z \mid G)$ to emphasize the dependence of the joint distribution on the parameter. Thus, we aim to solve the following optimization problem:

$$\hat{\theta}^{(2)} = \underset{\theta \in \Theta}{\arg\max} \prod_{i,j} P_\theta(Z_j, Z_k \mid G_j = G_k) \tag{15}$$

where $\hat{\theta}^{(d')} = (\hat{f}^{(d')}, \hat{r}^{(d')})$ is the parameter estimate assuming only d′ genometypes. For each d′ = 3, …, 20, we seek to solve the above optimization problem. To initialize, we pad the previous solution with zeros, yielding $\hat{f}_{(0)}^{(d'+1)} = (\hat{f}^{(d')}, 0)$ and similarly for $\hat{r}_{(0)}^{(d'+1)}$. Then, we use MATLAB’s fmincon to find a local maximum of PF given the constraints. If no improvement in PF is obtained for d′ + 1 genometypes using the default “padded” initialization, we try randomly initializing. We stop trying random initializations if any of the following criteria are met: (i) if we find an improvement in PF with the constraints satisfied, (ii) if we reach 100% PF, or (iii) if we reach 15 random initializations. If criterion (i) is met, we denote the parameters achieving the improvement $\hat{\theta}^{(d'+1)}$ and then increment d′ and continue. If criterion (ii) is met, we stop incrementing d′, because we have achieved the maximum possible PF, so adding additional genometypes cannot possibly increase it further. If criterion (iii) is met, we let $\hat{\theta}^{(d'+1)} = \hat{\theta}_{(0)}^{(d'+1)}$; that is, we let our final estimate for d′ + 1 simply be our estimate for d′ padded with a zero. We then increment d′.
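The padded initialization is a safe starting point because a genometype with zero frequency and zero genetic risk changes neither the predicted twin-pair probabilities nor PF. The sketch below illustrates only that invariance, not the grid search or the fmincon step; all function names and parameter values are hypothetical.

```python
def pair_probabilities(e, freqs, risks):
    """(P(concordant diseased), P(concordant healthy), P(discordant)) for one pair."""
    p_c = sum(f * (e + r) ** 2 for f, r in zip(freqs, risks))
    p_h = sum(f * (1.0 - e - r) ** 2 for f, r in zip(freqs, risks))
    p_d = sum(2.0 * f * (e + r) * (1.0 - e - r) for f, r in zip(freqs, risks))
    return p_c, p_h, p_d


def positive_fraction(t, e, freqs, risks):
    """Share of cases carried by genometypes whose genetic risk exceeds t."""
    all_cases = sum(f * (e + r) for f, r in zip(freqs, risks))
    positive_cases = sum(f * (e + r) for f, r in zip(freqs, risks) if r > t)
    return positive_cases / all_cases


def pad_with_zero(freqs, risks):
    """Extend a d-genometype solution to d + 1 by appending f = 0 and r = 0."""
    return list(freqs) + [0.0], list(risks) + [0.0]


# Hypothetical two-genometype solution and its padded three-genometype start.
e, t = 0.05, 0.05
freqs, risks = [0.9, 0.1], [0.0, 0.30]
padded_freqs, padded_risks = pad_with_zero(freqs, risks)

print(pair_probabilities(e, freqs, risks) == pair_probabilities(e, padded_freqs, padded_risks))
print(positive_fraction(t, e, freqs, risks) == positive_fraction(t, e, padded_freqs, padded_risks))
```

Starting from a point that already satisfies the constraints means a local optimizer can only hold or improve PF from the previous step, which is consistent with the stopping rule that falls back to the padded estimate.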

We repeat the above approach for each disease. The parameters that we determined using this approach to maximize PF were then used to estimate the percentage of the population testing positive for a given disease, as well as the relative risk of disease for those individuals testing negative, as defined below. We apply this approach separately for each disease, thus assuming independence. To find the minimum PFs compatible with the twin data, we used a similar procedure.

Relative risk of disease if testing negative

We determined the relative risk of disease of individuals whose whole-genome sequencing tests were negative after maximizing or minimizing the sensitivity (PF) of the test. Disease risk in the population testing negative (DRneg) is the ratio of the number of disease cases testing negative to the number of individuals in the population testing negative:

$$DR_{neg} = \frac{(2n_c + n_d)(1 - PF)}{2n_t \sum_{i \in [d]:\, r_i < t} f_i} \tag{16}$$

To determine the relative risk of disease if testing negative (RRneg), we calculated the ratio of disease risk of individuals testing negative to the disease risk in the twin cohort (CR):

$$RR_{neg} = \frac{DR_{neg}}{CR} \tag{17}$$
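As a rough illustration, Eqs. 16 and 17 can be computed as follows. The sketch assumes that n_c and n_d are the numbers of concordant and discordant twin pairs, that n_t is the total number of pairs, and that t is the risk threshold for a positive test; the function names and the numbers in the usage note are hypothetical and are not taken from the twin data.

```python
from typing import Sequence


def dr_neg(n_c: int, n_d: int, n_t: int, pf: float,
           f: Sequence[float], r: Sequence[float], threshold: float) -> float:
    """Disease risk among individuals testing negative (Eq. 16)."""
    cases = 2 * n_c + n_d  # affected individuals in the cohort (assumed meaning)
    # Fraction of the population whose genometype risk is below the threshold.
    frac_negative = sum(fi for fi, ri in zip(f, r) if ri < threshold)
    if frac_negative <= 0:
        raise ValueError("no genometype falls below the threshold")
    return cases * (1.0 - pf) / (2 * n_t * frac_negative)


def rr_neg(dr_negative: float, cr: float) -> float:
    """Relative risk if testing negative (Eq. 17)."""
    return dr_negative / cr
```

For example, with 10 concordant and 80 discordant pairs among 1000 pairs, PF = 0.5, and 90% of the population below the threshold, `dr_neg(10, 80, 1000, 0.5, [0.9, 0.1], [0.02, 0.5], 0.1)` gives 50/1800; dividing by a cohort risk of 100/2000 yields an RRneg of about 0.56.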

Calculation of relative risks

We defined relative risk in table S1 as the minimum total risk of individuals with genometypes carrying a given genetic risk compared to the total risk of individuals with genometypes carrying a genetic risk of 0% (that is, determined solely by nongenetic factors). The minimum total risk was determined using the standard 10% risk threshold described in the text as well as others (tables S4 to S6). In all cases,

$$RR = \frac{PPV + CR(1 - HER)}{CR(1 - HER)} \tag{18}$$
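A short worked example of Eq. 18 may help. The function name and the input values below are hypothetical and chosen only to make the arithmetic easy to follow; they do not correspond to any disease in the tables.

```python
def relative_risk(ppv: float, cr: float, her: float) -> float:
    """Eq. 18: total risk at the threshold relative to the purely nongenetic risk."""
    nongenetic = cr * (1.0 - her)  # e = CR(1 - HER)
    if nongenetic <= 0:
        raise ValueError("nongenetic risk must be positive")
    return (ppv + nongenetic) / nongenetic


# Hypothetical inputs: 10% threshold, 1% cohort risk, HER of 0.5.
rr = relative_risk(ppv=0.10, cr=0.01, her=0.5)
print(round(rr, 6))
```

With these inputs the nongenetic risk is 0.005, so the relative risk is 0.105/0.005, or about 21.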

Other parameters and models

Equation 14 enforces that none of the residual errors can be larger than 0.5, such that upon rounding we obtain a perfect fit. Changing this parameter from 0.5 to 0.01 did not alter the PFs depicted in Fig. 1 for any disease.

As an alternative to maximizing PFs, we also determined the distributions of genometype risks (ri) and frequencies (fi) that would minimize the relative risk of disease of individuals whose whole-genome sequencing tests were negative. This independent optimization yielded results nearly identical to those reported in Figs. 1 to 3.

As noted above, we estimated the nongenetic risk as e = CR(1 − HER). This risk is somewhat higher than that derived from the standard liability threshold (LT) model. However, it has recently been shown that the LT model underestimates the nongenetic contribution to disease because it does not take into account synergistic interactions among genes (41). The model described herein does not make any assumptions about the nature of the interactions between genes, such as additivity. Nevertheless, the LT model can also be used to approximate the maximum capacity of whole-genome sequencing to detect individuals at predefined risks under certain simplifying assumptions about the distribution of risk alleles in the population. The PF predictions from the LT model using 10% thresholds are provided in table S4, where they can be compared to the results of the current model with 10% thresholds.

Finally, our model can be used to calculate the potential clinical utility of whole-genome sequencing under any assumption about the proportion of nongenetic contributions to disease risk, or estimates thereof. Representative values for each disease, with nongenetic contributions ranging from 10 to 90%, are provided in table S7.

Supplementary Materials

Fig. S1. Graphical representation of genometype frequency and risk distributions for (A) leukemia, (B) Alzheimer’s disease, and (C) pancreatic cancer.

Table S1. Examples of known risk factors for common human diseases.

Table S2. Thresholds and other parameters used to analyze each disease.

Table S3. The risks and frequencies of each of the 20 genometypes providing maximum sensitivity (PF) for detection of each disease.

Table S4. Percentage of cases (that is, individuals with disease) testing positive with whole-genome sequencing at varying risk thresholds or with the liability threshold (LT) model.

Table S5. Percentage of population testing positive with whole-genome sequencing at varying risk thresholds.

Table S6. Relative risk of disease if testing negative with whole-genome sequencing at varying risk thresholds.

Table S7. Percentage of cases testing positive with whole-genome sequencing at varying estimates of nongenetic contributions.

References and Notes

  1. Acknowledgments: We thank N. Wray and D. Geman for critical comments regarding the manuscript, and K. Kinzler for technical assistance. Funding: The project was supported by NIH grant CA121113, the Virginia & D. K. Ludwig Fund for Cancer Research, and American Association for Cancer Research Stand Up To Cancer–Dream Team Translational Cancer Research Grant. Author contributions: N.J.R., J.T.V., G.P., K.W.K., B.V., and V.E.V. designed the study; N.J.R., J.T.V., and V.E.V. generated and analyzed the data; N.J.R., J.T.V., B.V., and V.E.V. wrote the manuscript. Competing interests: B.V., K.W.K., and V.E.V. are co-founders of Inostics and Personal Genome Diagnostics and are members of their Scientific Advisory Boards. K.W.K., B.V., and V.E.V. own Inostics and Personal Genome Diagnostics stock, which is subject to certain restrictions under University policy. The terms of these arrangements are managed by the Johns Hopkins University in accordance with its conflict-of-interest policies. G.P. is on the scientific advisory board of Counsyl.