Using genetics to prioritize diagnoses for rheumatology outpatients with inflammatory arthritis


Science Translational Medicine  27 May 2020:
Vol. 12, Issue 545, eaay1548
DOI: 10.1126/scitranslmed.aay1548

A genetic tool for triaging diagnoses

Multiple slowly progressing diseases initially present with inflammatory arthritis, and it can be difficult to clinically differentiate these conditions. Knevel et al. show that genetic data could be used to triage inflammatory arthritis–causing diagnoses at a patient’s first visit, improving the likelihood of a correct initial diagnosis and potentially expediting appropriate treatment. Their genetic diagnostic tool, here optimized for rheumatic disease diagnosis, could, in principle, be calibrated for other phenotypically similar diseases with different underlying genetics.


It is challenging to quickly diagnose slowly progressing diseases. To prioritize multiple related diagnoses, we developed G-PROB (Genetic Probability tool) to calculate the probability of different diseases for a patient using genetic risk scores. We tested G-PROB for inflammatory arthritis–causing diseases (rheumatoid arthritis, systemic lupus erythematosus, spondyloarthropathy, psoriatic arthritis, and gout). After validating on simulated data, we tested G-PROB in three cohorts: 1211 patients identified by International Classification of Diseases (ICD) codes within the eMERGE database, 245 patients identified through ICD codes and medical record review within the Partners Biobank, and 243 patients first presenting with unexplained inflammatory arthritis and with final diagnoses by record review within the Partners Biobank. Calibration of G-probabilities with disease status was high, with regression coefficients from 0.90 to 1.08 (1.00 is ideal). G-probabilities discriminated true diagnoses across the three cohorts with pooled areas under the curve (95% CI) of 0.69 (0.67 to 0.71), 0.81 (0.76 to 0.84), and 0.84 (0.81 to 0.86), respectively. For all patients, at least one disease could be ruled out, and in 45% of patients, a likely diagnosis was identified with a 64% positive predictive value. In 35% of cases, the clinician’s initial diagnosis was incorrect. Initial clinical diagnosis explained 39% of the variance in final disease, which improved to 51% (P < 0.0001) after adding G-probabilities. Converting genotype information before a clinical visit into an interpretable probability value for five different inflammatory arthritides could potentially be used to improve the diagnostic efficiency of rheumatic diseases in clinical practice.


The prevalence of patients with whole-genome genotyping data is rapidly increasing (1–3) as genome-wide genetic data are collected for biobanking efforts, routine care, and direct-to-consumer genotyping. Genotype data provide a patient-specific, time-independent risk profile that could be used to prioritize different diagnoses. In the case of complex rheumatic diseases, genetic data may not be particularly informative without patient signs or symptoms, as these diseases tend to be rare (4–11). However, genetic data available at an initial doctor visit could be used in ongoing clinical care in real time (12, 13).

Many patients in rheumatology outpatient clinics present with synovitis or joint swelling as the first symptom of inflammatory arthritis. Although such patients are often misdiagnosed at their first visit, about 80% of patients with inflammatory arthritis are eventually diagnosed with rheumatoid arthritis (RA) (14, 15), systemic lupus erythematosus (SLE) (16), spondyloarthropathy (SpA) (17–19), psoriatic arthritis (PsA) (20), or gout (21). If the correct diagnosis for patients with inflammatory arthritis could be obtained more quickly, then therapies could be started sooner, thereby lessening the chance of disability and permanent damage (22–26) and avoiding use of inappropriate immunomodulatory therapies.

Many risk loci have been identified for rheumatic diseases (27–34), and genetic risk scores have been studied both for prediction of rheumatic disease progression (5–7) and for susceptibility (8–11). For instance, a previous study built a genetic model for gout susceptibility (28). Most other risk scores have had modest predictive value in determining case versus control status. Given the low prevalence of rheumatic diseases, most tests perform poorly on a population level since the pretest disease probability is low (35). In the outpatient setting, however, symptom-based selection substantially increases the pretest probability for disease, resulting in an increased posttest probability that may render probabilistic predictions more effective in the clinical setting. This is particularly the case for inflammatory arthritis, which is not present in healthy individuals. To our knowledge, the use of genetics to discriminate between multiple rheumatic diseases has not been investigated in a practical setting. Here, we explored whether genetic data can facilitate disease differentiation in patients with similar early disease stage symptoms of inflammatory arthritis at their first visit to an outpatient clinic.


Summary of methods

G-PROB (Genetic Probability tool) combines genetic information in a genetic risk score for each of multiple diseases to calculate a given patient's conditional probability of each disease, assuming that one of the diseases is present (Fig. 1). We call these probabilities G-probabilities. In this proof-of-principle study, we calculated the probabilities of RA, SLE, PsA, SpA, and gout for each patient using bias-adjusted odds ratios (ORs) of uncorrelated risk single-nucleotide polymorphisms (SNPs) and human leukocyte antigen (HLA) variants, as reported for European samples, together with sex-dependent population risk (27–34). We note that we reduced the effect of each SNP using an arbitrary shrinkage factor to correct for the common overestimation of published effect sizes (36). We tested G-PROB in multiple settings with different patient selection criteria (figs. S1 to S6 and tables S1 to S4).

Fig. 1 Schematic depiction of G-PROB and study design.

(A) For patients with symptoms (inflammatory arthritis, suggesting possible synovitis at clinical exam in this example), G-PROB assigns a genetic information–based probability that the patient has each of the possible diseases. The magnitude of and difference between disease probabilities can guide clinicians to subsequent tests. dsDNA, double-stranded DNA; RF, rheumatoid factor; ANA, antinuclear antibody; X-SI, X-ray of sacroiliac joints; MRI-SI, MRI of sacroiliac joints. (B) The genetic probability calculation consists of two steps: First, the population-based probability is calculated using a weighted genetic risk score. Next, within-patient probabilities are obtained by normalizing the population-based probabilities. The end product is, for each patient, a probability for each disease. (C) The multiple testing and validation phases of this study are summarized by this flow chart.
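The two-step calculation described in (B) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the odds ratios, allele dosages, population risks, and function names below are hypothetical, and the shrinkage factor of 0.5 is the arbitrary value mentioned in the text.

```python
import numpy as np

def g_probabilities(dosages, log_ors, prior_risk, shrinkage=0.5):
    """Sketch of the two-step G-probability calculation.

    dosages    : (n_snps,) risk-allele counts (0, 1, 2) for one patient
    log_ors    : (n_diseases, n_snps) published log odds ratios per disease
    prior_risk : (n_diseases,) population (pretest) risk of each disease
    shrinkage  : factor tempering overestimated published effect sizes
                 (the paper uses an arbitrary 0.5)
    """
    dosages = np.asarray(dosages, dtype=float)
    log_ors = np.asarray(log_ors, dtype=float)
    prior_risk = np.asarray(prior_risk, dtype=float)

    # Step 1: weighted genetic risk score -> population-based probability.
    # Add the shrunken patient-specific score to the prior log-odds and
    # map back to a probability with the logistic function.
    grs = (shrinkage * log_ors) @ dosages
    prior_logodds = np.log(prior_risk / (1.0 - prior_risk))
    pop_prob = 1.0 / (1.0 + np.exp(-(prior_logodds + grs)))

    # Step 2: normalize so this patient's probabilities sum to one,
    # i.e. condition on exactly one of the diseases being present.
    return pop_prob / pop_prob.sum()
```

Because of the normalization in step 2, a disease with a higher genetic risk score receives a proportionally larger share of the patient's total probability mass.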

Simulated data

We first tested whether simulated genetic information on its own offered the potential to discriminate between different diagnoses, assuming an individual has one of the inflammatory arthritides being considered. We simulated a population of 1 million individuals and assigned 49,151 cases with one of our five synovitis-causing diseases (fig. S1). We stochastically assigned individuals as having a disease or not, taking into account the differences in their underlying genetic risk. We assigned case status after assigning an individual population-based disease risk for each of the five diseases as a random binomial trial for each of the diseases. As we assumed a disease prevalence of 1% for each disease in the simulated population, each disease was equally represented in our case set. This resulted in a simulated population of cases solely defined by genetics and a random factor.

G-PROB is a multiclass classifier, calculating probabilities for several diseases. For each disease, we tested whether the probabilities matched the simulated disease status. If the G-probabilities perform well, then the patient’s true disease status should correspond to a high G-probability for that specific disease, whereas the G-probabilities for other diseases should be low. We will further describe probabilities that correspond to a patient’s real disease as disease matching and probabilities that correspond to one of the other diseases as non–disease matching.
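Pooling per-patient probabilities into disease-matching and non–disease-matching values, as described above, can be sketched as follows (a minimal illustration; function and variable names are ours, not the authors'):

```python
import numpy as np

def pool_probabilities(g_probs, true_idx):
    """Flatten a (patients x diseases) G-probability matrix into pooled
    (probability, disease-matching) pairs for evaluation.

    g_probs  : (n_patients, n_diseases) per-patient G-probabilities
    true_idx : (n_patients,) column index of each patient's true disease
    """
    g_probs = np.asarray(g_probs, dtype=float)
    n, k = g_probs.shape
    # Mark exactly one disease-matching entry per patient.
    matched = np.zeros((n, k), dtype=bool)
    matched[np.arange(n), true_idx] = True
    return g_probs.ravel(), matched.ravel()
```

The pooled pairs produced this way are the inputs for calibration, NPV/PPV, and AUC calculations reported for each setting.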

Within the simulated cases, we observed that the distribution of the genetic probabilities matching the patients' simulated diseases was clearly higher than the genetic probabilities for the other diseases (Fig. 2A and fig. S7). Next, we tested whether the magnitude of the probabilities calibrated well with the real disease. For this, we estimated the regression line between the probabilities and disease match (yes/no) using linear regression with the intercept constrained to zero, where the resulting β (regression coefficient) gave the exact calibration. The β was almost optimal in the simulated data (β = 1.01; Fig. 2B). Whether G-PROB is useful to prioritize patients depends partly on how often the probabilities take highly informative (very low or very high) values. In the simulated data, 32% of all probabilities were ≤0.05, making their corresponding diseases very unlikely. This cutoff corresponded to a negative predictive value (NPV) of 0.98 (Table 1). In contrast, 13% of the probabilities were ≥0.5, corresponding to a positive predictive value (PPV) of 0.70. Each patient had five disease probabilities assigned, and we were frequently able to rule in or rule out one of the five diseases: 90% of patients had at least one probability ≤0.05, and 64% had at least one disease probability ≥0.5.
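The calibration β described above, a linear regression of disease match on G-probability with the intercept constrained to zero, reduces to a closed-form least-squares expression. A minimal sketch (our naming, not the authors' code):

```python
import numpy as np

def calibration_beta(probs, matched):
    """Regression coefficient of disease match on G-probability with the
    intercept constrained to zero; beta == 1 indicates exact calibration.

    probs   : (n,) pooled G-probabilities (one per patient-disease pair)
    matched : (n,) 1 if that probability's disease is the true diagnosis
    """
    probs = np.asarray(probs, dtype=float)
    matched = np.asarray(matched, dtype=float)
    # Least squares through the origin: beta = sum(x * y) / sum(x * x)
    return float((probs @ matched) / (probs @ probs))
```

A β above 1 indicates the probabilities are, on average, under-confident (true diagnoses occur more often than the probabilities suggest); a β below 1 indicates over-confidence.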

Fig. 2 Discriminative ability of G-PROB and the concordance of predictions with observed disease occurrence.

The performance of G-PROB is measured by comparing the magnitude of its estimated probabilities with known disease status. All analyses consider at least five categories: RA, SLE, PsA, SpA, and gout. In setting III, “other” is added as a sixth category. (A) The distribution of probabilities estimated by G-PROB for the correct diagnosis (green) and incorrect diagnoses (orange). See fig. S10 for analyses on the individual disease level. (B) The concordance of G-probabilities with patients’ real disease status. Ideally, the higher the inferred G-probability for a disease, the more likely it is the actual diagnosis. Here, we estimated the regression line between the disease G-probabilities and disease match (yes/no) using linear regression, constraining the intercept to zero. A β (regression coefficient) of one indicates exact calibration. The solid line is the regression line. In the case of perfect test performance, the solid line would lie exactly on the identity line, represented by the dashed diagonal line. For visualization, we placed G-probabilities (x axis) into five equally sized bins and plotted the proportion of instances where the predicted disease is the actual disease (y axis). (C) Receiver operating characteristic (ROC) curve describing the balance between sensitivity and specificity and thereby the overall discriminative ability of G-probabilities. The area under the curve (AUC) summarizes performance for the pooled data across all diseases. See fig. S7 for the depiction on the individual disease level.

Table 1 Performance of G-probabilities in ruling out and pointing toward disease diagnoses.

(A) The NPV of G-probabilities at 0.05 and 0.2. We consider a G-probability <0.05 for a specific disease to indicate that disease is highly unlikely for that patient. (B) The PPV of G-probabilities at 0.2 and 0.5. A G-probability higher than 0.5 for a specific disease suggests that disease is more likely than not to be the diagnosis. As each patient has multiple G-probabilities (one for each disease), the table provides the percentage of probabilities below (0.05, 0.2) or above (0.2, 0.5) each cutoff, as well as the percentage of patients who have at least one disease with a probability below (0.05, 0.2) or above (0.2, 0.5) each cutoff. More extensive test characteristics are depicted in fig. S11. NPV, negative predictive value; PPV, positive predictive value; % of probs, the percentage of all probabilities; % of pts, the percentage of patients who had probabilities for at least one disease below or above the given cutoff.
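The pooled NPV and PPV reported in Table 1 can be computed from per-pair probabilities and disease-match labels roughly as follows (an illustrative sketch; the cutoffs are those from Table 1, but the function is ours):

```python
import numpy as np

def npv_ppv(probs, matched, low_cut=0.05, high_cut=0.5):
    """NPV for probabilities at or below `low_cut` and PPV for those at
    or above `high_cut`, pooled over all patient-disease pairs.

    probs   : (n,) pooled G-probabilities
    matched : (n,) True if that probability's disease is the true one
    """
    probs = np.asarray(probs, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    low, high = probs <= low_cut, probs >= high_cut
    # NPV: fraction of low-probability pairs that are indeed not the
    # true disease; PPV: fraction of high-probability pairs that are.
    npv = float(np.mean(~matched[low])) if low.any() else float("nan")
    ppv = float(np.mean(matched[high])) if high.any() else float("nan")
    return npv, ppv
```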


We next wished to determine whether G-PROB could correctly classify rheumatic disease status in the simulated data. To test this, we depicted receiver operating characteristic (ROC) curves and summarized the total performance with the area under the curve (AUC) for G-probabilities with disease match. The overall discriminatory capacity of the G-probabilities was highly accurate, with AUC of 0.86 [95% confidence interval (CI), 0.86 to 0.86] (Fig. 2C and table S5). The precision-recall curve is provided in the Supplementary Materials (fig. S8).
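The pooled AUC is equivalent to the normalized Mann-Whitney U statistic: the probability that a randomly chosen disease-matching G-probability exceeds a randomly chosen non-matching one. A minimal sketch of this computation (our naming, not the authors' code):

```python
import numpy as np

def pooled_auc(probs, matched):
    """AUC of pooled G-probabilities against disease match, computed as
    the normalized Mann-Whitney U statistic.

    probs   : (n,) pooled G-probabilities
    matched : (n,) True if that probability's disease is the true one
    """
    probs = np.asarray(probs, dtype=float)
    matched = np.asarray(matched, dtype=bool)
    pos, neg = probs[matched], probs[~matched]
    # Count pairs where a matching probability beats a non-matching one;
    # ties contribute half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

An AUC of 0.5 corresponds to chance-level discrimination; 1.0 to perfect separation of matching from non-matching probabilities.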

Setting I: Assigning patient diagnoses based on billing codes

The simulation showed that genetic information was substantially different for the five diseases and could have the potential to prioritize diagnoses. However, the simulations may differ from real patient data because of differences in reported effect sizes and unappreciated environmental factors. Therefore, we tested how well genetic information separated real patients using individual-level data from three biobank cohorts (settings) in which genetic data were linked to electronic medical records (EMRs).

Setting I comprises patients from eMERGE (Electronic Medical Records and Genomics), a network that has amassed clinical data and genome-wide genotyping data from 12 health care networks throughout the United States (37). This consortium includes medical centers with biobank genetic information linked with EMR data, such as diagnosis [International Classification of Diseases and Related Health Problems (ICD) 9 and ICD10] billing codes. Among 83,717 subjects in eMERGE, we included 72,624 individuals who were self-described as white and excluded individuals enrolled in Partners Biobank (used in settings II and III), resulting in a total of 52,623 individuals in setting I. After testing the performance of the number of ICD codes for case identification with EMR-reviewed data as the gold standard (fig. S2 and table S1), we chose a cutoff of ≥20 ICD9/ICD10 codes for each disease to define the eventual true diagnoses for each patient. Applying this cutoff of ≥20 codes, we identified 1211 patients with one of five diseases (fig. S3 and table S2).

The G-probabilities in setting I were well calibrated with disease status, although with a slightly lower β (0.90) (Fig. 2B), reflecting the uncertainty of billing code–based diagnoses. We observed that 13% of the G-probabilities were ≤0.05, corresponding to an NPV of 0.94. Using this cutoff, we could identify and classify one or more of the five diseases as highly unlikely in 619 (51%) patients, two or more diseases in 135 (11%) patients, and three or more diseases in 10 (<1%) patients. For 24% of patients in this dataset, there was a single disease with ≥0.5 probability, corresponding to 5% of the total assigned probabilities and a PPV of 0.39. G-PROB showed a modest AUC discriminating between diseases [AUC, 0.69 (95% CI, 0.67 to 0.71); Fig. 2C].

Setting II: Assigning patient diagnoses based on medical records

Misclassification of gold standard diagnoses due to the use of ICD codes may have reduced the accuracy of results for setting I. Therefore, in setting II, we used a more precise method of patient selection: a complete manual review of patients’ medical records including data from the first to most recent visit. We used data from the Boston-based Partners HealthCare Biobank comprising >80,000 subjects (22). We obtained data on 12,604 self-described white patients with available genotype data. The medical charts of these patients were reviewed by a rheumatologist applying the clinical criteria of the American College of Rheumatology (ACR) or the European League Against Rheumatism (14–21) to define the final diagnosis (fig. S4 and table S3). We identified 245 patients with one of the five diseases of interest.

We observed greater correspondence between G-probabilities and disease match (β = 1.08) (Fig. 2B) in setting II than in setting I. This likely reflects the more accurate gold standard patient diagnoses based on complete medical charts in setting II compared with the ICD-based selection in setting I. Here, 11% of all probabilities were ≤0.05, corresponding to an NPV of 0.96. At this cutoff, it was possible to deprioritize at least one disease for 119 (49%) patients, two or more diseases for 13 (5%) patients, and three or more diseases for 3 (1%) patients. Furthermore, 25% of the patients had a single disease with a G-probability ≥0.5 (Table 1), which represented 5% of all calculated probabilities. A G-probability ≥0.5 corresponded to a PPV of 0.66. The accuracy was closer to that in the simulated data, with an AUC of 0.81 (95% CI, 0.76 to 0.84) (Fig. 2C).

Sensitivity analyses

To ensure that one strongly genetically determined disease did not skew the results, we conducted several sensitivity analyses. When we calculated the AUC for each individual disease, G-PROB showed similar performance across classes (fig. S9). We ran G-PROB five times, excluding each of the diseases in turn, and observed similar AUC values (fig. S10). As our shrinkage factor of 0.5 was arbitrarily chosen, we tested the calibration of the G-probabilities with disease outcome, the log-likelihood, and the entropy score (38) using different shrinkage factors for the ORs in the genetic risk score and found that the results did not substantially differ (fig. S11).

We grouped peripheral SpA (17) with axial SpA (18, 19) but separated PsA (20) as a different disease phenotype, as there are extensive data from genome-wide association studies on axial SpA and PsA (17–20). We tested whether the inclusion of peripheral SpA cases had influenced the result by rerunning analyses without those cases. The results were similar [β = 1.09 (95% CI, 1.01 to 1.18); AUC, 0.81 (95% CI, 0.78 to 0.84)].

Setting III: Selecting patients presenting with inflammatory arthritis at their first visit

As G-PROB was developed to discriminate patients presenting with similar symptoms, we tested this exact hypothesis in setting III by manually reviewing the records of all patients who received ICD codes from a rheumatology clinic. In contrast to setting II, we restricted this analysis to patients with documented but unexplained inflammatory arthritis at an initial encounter with a rheumatologist. From the 1808 Partners Biobank patients who visited a rheumatology outpatient clinic, 282 had inflammatory arthritis without a previous diagnosis. We note that settings II and III are not completely independent since they were obtained from the same patient resource and they share 107 individual patients. We excluded seven cases in which the patient did not fulfill any classification criteria, but the rheumatologist eventually diagnosed the patient with one of the five diseases; we could not determine for these patients whether the medical record lacked information or the patient was misclassified by the rheumatologist. Of the remaining patients, 79.4% were diagnosed with one of the diseases of interest (fig. S5 and table S4), a similar proportion as found in other studies (39, 40). We classified the remaining 20.6% as “other diseases” and included this as a sixth category in G-PROB, accounting for the fact that patients with inflammatory arthritis can have conditions in addition to the five most common diseases. In addition, for the G-PROB calculation, we used the diseases’ prevalence within the outpatient clinic. After these two adjustments, made possible by the quasi-prospective design, the G-probabilities reflected the most realistic disease risks. We excluded patients with RA without documented anti–cyclic citrullinated peptide (CCP) antibody status, resulting in 243 patients for analysis.

Again, G-probabilities corresponded well with real disease status (β = 0.99). Here, 39% of the G-probabilities were ≤0.05. At this threshold, we could deprioritize at least one disease in 100% of patients, at least two diseases in 203 patients (84%), at least three diseases in 98 patients (40%), and four diseases in 27 patients (11%), on the basis of genetic data alone. Furthermore, 45% of the patients had a single disease with G-probability ≥0.5, representing 8% of the calculated probabilities (PPV, 0.64). The ability of G-PROB to discriminate between different diseases was similarly high as in previous settings [AUC, 0.84 (95% CI, 0.81 to 0.86)] (Fig. 2, B and C). Possibly, the difference in prevalence for disease influenced the results in setting III. Subanalyses assigning uniform prevalence across diseases showed improved performance for the less prevalent diseases (fig. S7).

G-PROB’s performance compared to clinical knowledge

The pseudo-prospective patient identification enabled us to compare G-PROB’s performance with the diagnosis of the rheumatologist at a patient’s first visit. Compared with the final diagnosis after complete follow-up, we observed that 35% of the patients were misclassified by their rheumatologist at their first visit (Fig. 3A). The misclassification compared to eventual diagnosis was 35% for RA, 33% for SLE, 50% for SpA, 42% for PsA, 11% for gout, and 37% for other diseases. Of these initial misdiagnoses, 43% had a genetic probability ≤0.5, 29% had a genetic probability ≤0.2, and 9% had a genetic probability of ≤0.05. The difference of G-probabilities between the eventual diagnosis and the incorrect initial diagnosis favored the probability of the eventual diagnosis (Fig. 3B): For 65% of patients, the G-probability of the correct disease was higher than the G-probability of the initial diagnosis of the rheumatologist.

Fig. 3 Diagnostic value of adding G-probabilities to clinical information at first visit.

(A) For setting III, we collected the highest-ranked rheumatologist diagnosis at a patient’s first visit and matched this with the final diagnosis. Because rheumatologic diseases can be hard to classify at first visit, the final diagnosis can differ from the first. (B) The density of G-probabilities of both real (correct) diagnoses (green) and the incorrect initial (clinical) diagnosis from (A) (orange; left) and the difference between the two (right). (C) McFadden’s R2 for three multinomial logistic regression models: G-probabilities only, rheumatologist diagnosis at first visit only, and both first-visit diagnosis and G-probabilities. RA, rheumatoid arthritis; SLE, systemic lupus erythematosus; SpA, spondyloarthropathy; PsA, psoriatic arthritis; Dx, diagnosis.

In 53% of individuals, the disease with the highest G-probability corresponded to the final diagnosis. In 77% of individuals, the correct diagnosis was one of the two with the highest G-probabilities; a total of 87% of individuals had the correct diagnosis in the top three. The McFadden’s R2 (41) was 17% for the G-probabilities alone and 39% for the diagnosis at first visit. Adding G-probabilities to the clinical information significantly improved the model, increasing R2 by 12 percentage points to 51% (P < 0.00001) (Fig. 3C). We infer that the availability of G-PROB’s information would have improved the rheumatologist’s differential diagnosis.
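McFadden's R2 compares the log-likelihood of a fitted multinomial model against that of an intercept-only (null) model. The sketch below shows the pseudo-R2 calculation given predicted class probabilities; it is an illustration of the metric, not the authors' regression code.

```python
import numpy as np

def multinomial_loglik(pred_probs, true_idx):
    """Log-likelihood of the observed diagnoses under a model's predicted
    class probabilities (rows: patients, columns: diseases)."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    rows = np.arange(len(true_idx))
    # Sum of log-probabilities assigned to each patient's true disease.
    return float(np.sum(np.log(pred_probs[rows, true_idx])))

def mcfadden_r2(loglik_model, loglik_null):
    """McFadden's pseudo-R^2: 1 - LL(model) / LL(null), where the null
    model predicts only the observed class frequencies."""
    return 1.0 - loglik_model / loglik_null
```

A model no better than the null yields an R2 of 0; adding informative predictors (such as G-probabilities) raises the model log-likelihood and hence the R2.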

Next, we explored whether serologic data could improve clinical insight when added to our model. Serologic data improved a model with clinical and genetic data by 22%. As several studies have described a strong association between genetics and serology (42–44), we were surprised to find that serologic testing improved the model’s accuracy in addition to the genetic information. Assuming a setting where serology is available at the first visit, G-PROB still significantly improved R2 by 12% (P < 0.0001; table S6).


The number of patients with DNA genotyped before their first visit to a clinic is rapidly expanding. We investigated whether genetic information can help clinicians initially prioritize likely diagnoses and deprioritize unlikely ones among patients presenting with inflammatory arthritis. We observed that in a rheumatology setting, genetic information adds value to the clinical information obtained at the initial encounter, even when serologic data are available. Together, preexisting genetic data could be considered part of a patient’s medical history, given their potential to improve precision medicine in the modern outpatient clinic.

Our results demonstrate that genetic data can provide probabilistic information to discriminate between multiple diseases presenting with similar clinical signs and symptoms. We investigated inflammatory arthritis, a hallmark of many rheumatic diseases that is associated about 80% of the time with the diseases we focused on in our study: RA, SLE, SpA, PsA, and gout (14–21). Because of the evolution of symptoms over time, disease classification by a clinician at a first visit is challenging, as evidenced by the 35% misclassification of patients by the rheumatologist at first visit. Our data strongly suggest that using increasingly available genetic information would reduce this misclassification rate.

In settings I and II, we selected patients and assigned diagnoses using criteria based on billing codes and medical record review. We were more stringent in setting III, where we selected patients who presented with inflammatory arthritis for the first time at a rheumatology clinic; this setting was particularly illuminating since physician diagnoses changed over time. If a patient had inflammatory arthritis at their first visit, we followed their disease course until the final diagnosis. This thorough selection ensured that the analysis represented the study hypothesis that genetic data can facilitate disease differentiation in patients with early symptoms. Although this setting had relatively low patient numbers and an extra level of complexity due to the addition of the “other diseases” category, G-PROB differentiated disease categories in patients presenting with inflammatory arthritis similarly to setting II.

Although clinicians have several tests to prioritize their differential diagnosis, complex diseases such as rheumatic disease take notoriously long to diagnose; a third of patients with arthritis are initially classified as undifferentiated even when all test results are available (45), and 48% of SLE patients have to wait >6 months to receive their diagnosis (46). We expect that in the future, an increasing number of patients will have genetic data available before their visits, making it possible for a clinician to request a G-PROB calculation while reviewing a patient’s medical history before their visit. When more information is known about a patient’s disease (for instance, at later visits), the value of G-PROB will inevitably decrease. We have not tested how much it decreases over time. Since we found that G-PROB substantially contributes to disease classification even when serologic data are present, we conclude that G-PROB can complement the arsenal of tests available to future clinicians.

We developed G-PROB to discriminate between multiple related diagnoses by estimating probabilities of different diseases based on patients’ genetic data. The advantage of providing probabilities instead of a binary diagnostic classification is that it synergizes with the probabilistic reasoning of clinicians and can easily be incorporated into a differential diagnosis. In instances where the genetic probability for a disease is sufficiently high or low, a clinician can prioritize diagnostic testing accordingly.

The importance of differentiating diseases in patients with similar presenting symptoms is relevant not only in rheumatology but also in many other clinical settings in which patients present with similar symptoms, for example, endocrinology (late-onset type 1 diabetes versus type 2 diabetes), pulmonology (chronic obstructive pulmonary disease versus asthma), and cardiology (subsets of heart failure). Although genetic risk factors are known for many of these diseases, they have yet to be translated into clinical practice. The G-PROB method can serve as a template to develop symptom-tailored, disease-differentiating tests.

We acknowledge several limitations of our study. In settings I and II, patients with less severe disease may have been excluded. It is likely that patients with less severe disease have less genetic risk and that including such patients would decrease the value of G-PROB. This limitation was overcome in setting III. Nonetheless, to study the value of G-PROB in addition to clinical information, a prospective study is required to compare diagnostic efficiency when G-probabilities are available versus not. Second, nongenetic information available at a first visit, such as family history and age, could further improve the performance of predictive tests such as G-PROB. Future studies could combine both genetic and nongenetic factors, aiming for a more precise prediction model. We restricted our study to genetic factors as we designed a proof-of-principle study of the value of genetics to differentiate diseases in patients with inflammatory arthritis. Third, we note that genetic data from genome-wide association studies of rheumatic diseases were primarily available from self-reported white individuals. Lack of diversity in genetic studies could potentially cripple the clinical applicability of G-PROB and other genetic risk score strategies (43). Last, data may be lacking on key disease subtypes. However, biobank data are rapidly expanding, with a wide range of phenotypic information for patients of different backgrounds being collected alongside genetic data. These resources, such as the U.K. Biobank (47), may offer clearer data on the differences and similarities in the genetics of different inflammatory arthritic conditions, such as axial and peripheral SpA.


Study design

We developed a tool, G-PROB, that translates a patient’s genetic profile into a risk score ranging from 0 to 1 (called a G-probability) for each possible disease of interest. We tested this tool in the situation where a patient presents with inflammatory arthritis and there are five likely disease diagnoses: RA, PsA, SpA, SLE, and gout. Thus, the genetic profile of a patient is translated into five probabilities, one for each disease, that together add up to one. We tested how well these probabilities corresponded to a patient’s real disease. For example, for a patient with RA, ideal performance by G-PROB would result in the patient’s G-probability for RA being high and the G-probabilities of the other diseases being much lower.

We tested G-PROB in four settings (n = 50,743): simulation, two retrospective data collections, and one pseudo-prospective data collection. In the two retrospective settings, patients were selected using information at any time of their disease, whereas in the latter, patients had to have unexplained inflammatory arthritis at their first visit. This is described in more detail below.

Simulated data. To validate G-PROB in an optimal setting, we simulated a population of 1 million samples using the population-based allele frequencies of uncorrelated risk SNPs as reported for European samples (48). First, we calculated the number of samples with each genotype using the reported minor allele frequency (MAF) of Europeans from Ensembl (49), ensuring an allele distribution as expected under Hardy-Weinberg equilibrium:

No. of samples homozygous for the minor allele = MAF · MAF · 1 million
No. of samples heterozygous = 2 · MAF · (1 − MAF) · 1 million
No. of samples homozygous for the major allele = (1 − MAF) · (1 − MAF) · 1 million
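The Hardy-Weinberg genotype counts above can be sketched directly (an illustrative helper; the function name is ours):

```python
def hwe_counts(maf, n=1_000_000):
    """Expected genotype counts under Hardy-Weinberg equilibrium for a
    biallelic SNP with minor allele frequency `maf` in `n` samples."""
    hom_minor = maf * maf * n            # homozygous minor: MAF^2 * n
    het = 2 * maf * (1 - maf) * n        # heterozygous: 2 * MAF * (1 - MAF) * n
    hom_major = (1 - maf) * (1 - maf) * n  # homozygous major: (1 - MAF)^2 * n
    return hom_minor, het, hom_major
```

The three counts always sum to `n`, since the genotype frequencies are the terms of the binomial expansion of (MAF + (1 − MAF))².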

We then randomly assigned an SNP genotype to each sample by sampling without replacement and repeated this for all non-HLA SNPs. For the HLA region, we randomly sampled complete profiles [from a reference panel (50)] to ensure that the strong linkage structure of the HLA region would be present in our simulated data.

Next, we calculated the population-based disease probability for each simulated patient (Eq. 3) for each of the five rheumatic diseases. Case status was assigned when a patient’s disease probability was higher than a number sampled uniformly between 0 and 1 (equivalent to a Bernoulli trial per disease). Using this approach, we created a set of 49,151 patients having one of the five diseases (fig. S1). In this set of 49,151 patients, we calculated the within-patient risk for each disease (Eqs. 4 and 5).
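The simulation steps above (Hardy-Weinberg genotype sampling followed by a uniform-draw case assignment) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function names and the reduced sample size are choices made here for brevity.

```python
import random

random.seed(0)
N = 100_000  # reduced from the paper's 1 million for illustration

def simulate_genotypes(maf, n=N):
    """Draw genotypes (0/1/2 minor-allele counts) in Hardy-Weinberg proportions."""
    # Expected counts under HWE: MAF^2, 2*MAF*(1-MAF), (1-MAF)^2
    counts = [round(maf * maf * n),
              round(2 * maf * (1 - maf) * n)]
    counts.append(n - sum(counts))  # homozygous major fills the remainder
    genotypes = [2] * counts[0] + [1] * counts[1] + [0] * counts[2]
    random.shuffle(genotypes)  # random assignment without replacement
    return genotypes

g = simulate_genotypes(0.3)
het_frac = g.count(1) / N  # should be close to 2*0.3*0.7 = 0.42

def assign_case(p_disease):
    """Bernoulli draw: case if a uniform(0,1) number falls below the probability."""
    return random.random() < p_disease

cases = sum(assign_case(0.01) for _ in range(10_000))  # roughly 100 expected
```

Sampling a uniform number and comparing it with the probability yields case status with exactly the intended per-patient probability, which is what generates the simulated patient set.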

Setting I: ICD-based patient selection. The eMERGE network (37) is a National Institutes of Health–approved consortium of medical centers whose EMRs, including billing diagnosis codes (ICD), are linked with genetic information and biobank data. It includes 83,717 patients from 12 medical centers. We obtained consent from the network to use the imputed genotypes, self-reported ethnicity, and ICD9 and ICD10 data for this study. We selected patients with one of our five rheumatic diseases using relevant ICD codes.

Previous studies have demonstrated that a single ICD code is highly unreliable for identifying real cases (51, 52). To find the optimal number of ICD codes for patient selection, we explored ICD code performance in previously reviewed medical records for RA in the Partners Biobank (53). Here, one rheumatologist screened the medical records and assigned case status both by the ACR 2010 criteria (14) and by her expert opinion. We observed adequate PPV when patients had >20 ICD9 or ICD10 codes (fig. S2 and table S2) and thus applied this cutoff to the eMERGE data (fig. S3). Table S3 summarizes the patient characteristics of setting I.

Setting II: Patient selection through manual review of medical records. Partners HealthCare Biobank comprises >80,000 subjects from Boston-based hospital centers (Brigham and Women’s Hospital, Massachusetts General Hospital, Faulkner Hospital, Newton-Wellesley Hospital, McLean Hospital, North Shore Medical Center, and Spaulding Rehabilitation Network) recruited from about 1.5 million patients (53). Written consent was obtained from each patient before their data were included in the Biobank. We obtained approval from the Partners Institutional Review Board to use the genotypes of these patients and access their clinical records.

We preselected records with clinical features that are typical for rheumatologic diseases (fig. S4). The selected records were manually reviewed in full (fig. S6), and patients were identified by applying the disease classification criteria (14–21). Table S4 summarizes the patient characteristics of setting II.

Setting III: Pseudo-prospective patient collection. Setting III patients were collected from the same biobank as in setting II but with a different selection process. We identified patients in a pseudo-prospective manner by manually reviewing the records of all patients who received ICD codes from a rheumatology clinic (n = 1808) for the presence of inflammatory arthritis at their first visit. Next, we performed a complete manual chart review on the patients with inflammatory arthritis (fig. S6). Of the patients in setting II, 107 had unexplained, untreated inflammatory arthritis at their first visit and were therefore included in setting III. Patients who did not fulfill the criteria for any of our five diseases of interest were classified as other diseases. We also extracted the highest-ranked diagnosis made by the rheumatologist at the first visit. See fig. S5 and table S5 for detailed patient selection criteria.

G-PROB calculation

Genome-wide association studies have used logistic regression to identify disease susceptibility factors in patients by comparing them with healthy controls. The resulting ORs refer to a relative increase in population-based odds of getting that disease. Our genetic probability model, G-PROB, uses these logORs to create a weighted genetic risk score (54). G-PROB consists of two steps:

The genetic risk score of individual i for disease k is defined as

Σ_j β_kj·x_ij   (1)

where x_ij is the number of risk alleles of SNP j present for individual i and β_kj is the logOR for SNP j obtained from previous genome-wide association studies (27–34) for disease k.

To correct for possible overestimation of the effect sizes due to publication bias (36), we shrank the logORs of each genetic variant by multiplying them by 0.5. We refer to this as the shrinkage factor. The genetic risk score can be used to calculate a population-level disease probability (P_ki) of each patient i for each disease k, using the logistic regression formula

P_ki = 1 / (1 + exp[−(α_k + Σ_j β_kj·x_ij)])   (2)

The genetic risk score of each disease is combined with an intercept that ensures that the mean probability is equal to the predefined disease prevalence. α_k is an unknown constant, estimated by assuming that the mean predicted pretest probability is equal to the population prevalence of disease k, by minimizing the following estimation equation

[(1/n)·Σ_{i=1}^{n} 1 / (1 + exp[−(α_k + Σ_j β_kj·x_ij)]) − PopPrev_k]²   (3)

where PopPrev_k is the assumed prevalence for disease k in the general population and n is the number of subjects.
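Because the mean predicted probability in Eq. 3 is monotonically increasing in α_k, the intercept can be found by simple one-dimensional root finding. The sketch below (an assumption of ours, not the paper's code) uses bisection; `betas_x` and the toy scores are hypothetical inputs.

```python
import math

def solve_intercept(betas_x, target_prev, lo=-30.0, hi=10.0, tol=1e-10):
    """Find alpha_k such that the mean predicted probability equals the
    target population prevalence (Eq. 3), via bisection.
    betas_x holds each subject's genetic risk score sum(beta_kj * x_ij)."""
    def mean_prob(alpha):
        return sum(1.0 / (1.0 + math.exp(-(alpha + s))) for s in betas_x) / len(betas_x)
    # mean_prob is monotonically increasing in alpha, so bisection converges
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_prev:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

scores = [0.1, -0.4, 0.9, 0.0, 0.3]  # toy genetic risk scores
alpha = solve_intercept(scores, target_prev=0.01)
mean_p = sum(1 / (1 + math.exp(-(alpha + s))) for s in scores) / len(scores)
```

After solving, the mean predicted probability matches the assumed 1% prevalence, which is exactly the constraint Eq. 3 minimizes.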

The next step is to calculate the conditional probability of each disease of interest k, given that the patient has one of the five diseases [Pr(Y_k = 1 | Σ_k Y_k = 1)]. These within-patient genetic probabilities (cP_ki) of each patient i for each disease k were obtained through normalization of the population risk

cP_ki = P_ki / Σ_{k=1}^{K} P_ki   (4)

where K is the number of diseases considered.

G-PROB gives each patient a probability (G-probability) for each of the five diseases of our interest. The final product of G-PROB per patient is as follows

G-Prob_i = [cP_1i, …, cP_Ki]   (5)

To summarize, G-PROB assigns a probability to each disease of interest based on a weighted genetic risk score for that disease, taking into account predefined, sex-specific disease prevalences and assuming that a patient has exactly one of the possible diseases. G-PROB can be used to discriminate between any set of diseases, provided that there is sufficient knowledge of the genetic risk.
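The two-step procedure of Eqs. 1 to 5 can be condensed into a short sketch for a single patient. This is an illustrative reimplementation under stated assumptions: the SNP counts, logORs, and precomputed intercepts below are hypothetical toy values, and the 0.5 shrinkage factor from the text is applied to each logOR.

```python
import math

def g_prob(snp_counts, log_ors, alphas, shrinkage=0.5):
    """Toy G-PROB for one patient (hypothetical inputs).
    snp_counts: risk-allele counts x_ij per SNP
    log_ors:    {disease: [logOR per SNP]} as taken from published GWAS
    alphas:     {disease: alpha_k}, precomputed so that the mean population
                probability matches the assumed prevalence (Eq. 3)
    """
    pop_prob = {}
    for disease, betas in log_ors.items():
        # Eq. 1, with the shrinkage factor applied to each logOR
        grs = sum(shrinkage * b * x for b, x in zip(betas, snp_counts))
        # Eq. 2: population-level probability via the logistic function
        pop_prob[disease] = 1.0 / (1.0 + math.exp(-(alphas[disease] + grs)))
    # Eq. 4: normalize so the within-patient probabilities sum to one
    total = sum(pop_prob.values())
    return {d: p / total for d, p in pop_prob.items()}  # Eq. 5, as a dict

# Hypothetical example with three SNPs and two diseases
probs = g_prob(
    snp_counts=[2, 0, 1],
    log_ors={"RA": [0.3, 0.1, 0.2], "gout": [0.0, 0.4, 0.1]},
    alphas={"RA": -4.6, "gout": -4.6},
)
```

Because the population probabilities are renormalized, the output is directly interpretable as "given that this patient has one of these diseases, which is most likely."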

Application of G-PROB to rheumatological diseases
Selecting the prevalences

We considered the five most common inflammatory arthritis–causing diseases: RA, SLE, SpA, PsA, and gout. For the purely retrospective analyses (simulated data, settings I and II), we assumed a population prevalence of 1% for each disease. Because sex is genetically determined, we incorporated it by adjusting the disease prevalences to sex-specific prevalences. For example, the risk for RA is three times higher in women than in men; assuming an overall RA prevalence of 1% and a 1:1 ratio of women to men, the RA prevalence is 1.5% in women and 0.5% in men. Thus, Eq. 3 was minimized for men and women separately. We used F:M risk ratios of 3:1, 9:1, 0.3:1, 1:1, and 0.3:1 for RA, SLE, SpA, PsA, and gout, respectively (55).
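The RA arithmetic above generalizes to any F:M risk ratio. A minimal sketch (function name and default sex fraction are our assumptions) that keeps the population-weighted mean equal to the overall prevalence:

```python
def sex_specific_prevalence(overall_prev, f_to_m_ratio, frac_female=0.5):
    """Split an overall prevalence into female/male prevalences given an F:M
    risk ratio, keeping the weighted mean equal to the overall prevalence."""
    # overall = frac_female * prev_f + (1 - frac_female) * prev_m,
    # with prev_f = f_to_m_ratio * prev_m; solve for prev_m first
    prev_m = overall_prev / (frac_female * f_to_m_ratio + (1 - frac_female))
    prev_f = f_to_m_ratio * prev_m
    return prev_f, prev_m

# RA example from the text: 1% overall, F:M = 3:1 -> 1.5% in women, 0.5% in men
prev_f, prev_m = sex_specific_prevalence(0.01, 3.0)
```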

Derivation of genetic risk scores for rheumatological diseases

We took genetic risk variants that reached genome-wide significance (P < 5 × 10−8) for one or more of the five diseases from ImmunoBase (48) or (when not available) from the most recent genome-wide study of people of European descent (27–34). For all diseases except gout, we then calculated disease-specific ORs such that SNPs contributing to the susceptibility of several rheumatic conditions had different ORs for each disease. If an SNP was not associated with the other diseases, its OR for those diseases was set to 1.0. If two SNPs associated with the same disease were in linkage disequilibrium with each other (r2 > 0.8), we selected the SNP with the highest OR.

We calculated genetic risk scores for gout in a different manner. A previous study developed a prediction model for gout that translated the genetic effect sizes (βs) for uric acid levels into risk groups and assigned an OR to each group (28). We used this risk group categorization (β values per uric acid increase) and the corresponding ORs to calculate the genetic risk score for gout.

For HLA, we searched recent large studies that provide mutually adjusted estimates for the HLA variants for each disease. This ensured that the ORs used in our study were corrected for the strong linkage disequilibrium in this region. Because the HLA risk differs between CCP+ and CCP− RA, we created separate genetic risk scores for CCP+ and CCP− RA using their specific HLA ORs in the datasets where CCP status was available (simulated data, settings II and III). We combined CCP+ and CCP− patients into one population probability for RA for each patient. In settings where CCP status was unknown, we used the HLA variants for CCP+ RA. The final G-PROB model consisted of 208 SNPs outside the HLA region and 42 HLA variants (SNPs, haplotypes, and alleles). The number of variants per disease ranged from zero HLA variants (gout) to 21 HLA variants (CCP+ RA), and from 18 non-HLA variants (PsA) to 93 non-HLA variants (CCP+ RA). Data file S1 summarizes the included risk variant information.

Optimizing G-PROB in pseudo-prospective setting III

In our pseudo-prospective analysis (setting III), we added other diseases as a sixth category alongside the five diseases. The ORs for the genetic risk score of the other-diseases group were all set to 1.0, as no data on the genetic risk profile of these patients are available thus far. We also obtained the real outpatient clinic disease prevalences, which we incorporated into the G-PROB calculation (data file S2).

Statistical analysis

All analyses were performed in R version ≥3.2 (56). Our primary test was the overall performance of G-PROB across all diseases combined. G-PROB gives each patient a probability (a so-called G-probability) for each disease of interest, assuming that exactly one of these diseases is the patient’s real disease. We tested this for rheumatic diseases, but in principle any set of mutually exclusive diseases can be implemented in G-PROB. For analyses of G-PROB’s performance, we grouped the probabilities of all patients into one vector and created a corresponding binary vector indicating whether each probability corresponded to the patient’s real disease. With five diseases per patient, this disease-match vector contained 20% matches and 80% nonmatches. If G-probabilities correctly differentiated diseases, the probabilities matching the real disease would be higher than the nonmatching probabilities.

We tested the performance of the G-probabilities as follows. First, we examined how well G-probabilities were calibrated with disease outcome. We fit a linear regression model without intercept, using the probabilities as independent variable and disease match as dependent variable. The regression coefficient (β) describes how well the model was calibrated (ideally, β is one) (57). Next, we explored the ability of G-probabilities to correctly classify disease status, using the AUC-ROC for multiclass classification (58). The ROC depicts the true-positive rate [sensitivity (y axis)] against the false-positive rate [1 − specificity (x axis)], with the probabilities as test variable and disease match as gold standard. Higher AUCs indicate better classification: AUC = 0.5 for a random predictor and AUC = 1 for a perfect predictor. Given our two-vector data summary, our AUCs are so-called microAUCs, as described in the R package multiROC (59). We also calculated macroAUCs, which are averages of the AUCs of each disease group. We obtained 95% CIs by bootstrapping (100 resamplings) (table S5). We statistically compared the AUCs of different G-PROBs (for instance, in the sensitivity analyses where we constructed G-PROB with different diseases) using DeLong’s method within the R package pROC (58).
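The microAUC over the pooled probability and disease-match vectors is equivalent to the normalized Mann-Whitney U statistic. A minimal rank-based sketch (our own illustration, not the multiROC implementation; the toy vectors are hypothetical):

```python
def micro_auc(probs, matches):
    """Micro-averaged AUC over pooled probability/match vectors: the fraction
    of (match, nonmatch) pairs in which the matching probability ranks higher,
    counting ties as half (Mann-Whitney U / (n_pos * n_neg))."""
    pos = [p for p, m in zip(probs, matches) if m == 1]
    neg = [p for p, m in zip(probs, matches) if m == 0]
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Two toy patients, two diseases each; each matching disease gets the
# higher probability, so the pooled AUC is perfect
probs = [0.7, 0.3, 0.6, 0.4]
matches = [1, 0, 1, 0]
auc = micro_auc(probs, matches)
```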

To test whether our choice of shrinkage factor influenced our results, we reran G-PROB with 10 different shrinkage factors between zero and one. For each shrinkage factor, we tested the calibration, the log-likelihood (a measure of how well a model fits the data; the higher the log-likelihood, the better the fit), and the entropy score (a measure of the disorganization of values; the lower the entropy, the more organized the values)

Entropy score = −Σ_i P_i·log(P_i)   (6)
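Eq. 6 rewards probability vectors that are concentrated on one disease. A short sketch of the score on hypothetical toy vectors, assuming the natural logarithm:

```python
import math

def entropy_score(probs):
    """Entropy of a probability vector (Eq. 6): lower means the probability
    mass is more concentrated on a single disease."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident G-probability vector scores lower than an uninformative one
confident = entropy_score([0.9, 0.05, 0.05])
uninformed = entropy_score([1 / 3, 1 / 3, 1 / 3])  # maximal: log(3)
```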

To test the performance of G-probabilities in setting III, we assigned a final diagnosis to patients presenting to a rheumatologist with synovial inflammation by complete chart review of data across all available visits. We also noted the rheumatologist’s initial diagnosis at the first visit; because the rheumatologist had access to clinical information (physical exam, history, laboratory tests, referral records, etc.) when making this initial diagnosis, we considered it a proxy for all clinical information.

We applied multinomial logistic regression with the six disease categories as dependent variable and the disease probabilities and the clinical information (initial clinical diagnoses) as independent variables. We calculated McFadden’s pseudo-R2 (the logistic regression equivalent of the explained variance in linear regression) (41) to compare how well different models predict the final diagnosis: (I) G-PROB alone, (II) the clinical information alone, and (III) the combination of the clinical information with G-PROB

R2 = 1 − {L(M_Intercept) / L(M_Full)}^(2/N)   (7)

We assessed whether the addition of the G-probabilities to the clinical information significantly improved the model using a likelihood ratio test.
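Both quantities are simple functions of the fitted models' log-likelihoods. A sketch under stated assumptions (the log-likelihood values below are hypothetical, and Eq. 7 is implemented exactly as displayed, with the likelihood ratio raised to 2/N):

```python
import math

def pseudo_r2(ll_intercept, ll_full, n):
    """Pseudo-R^2 as in Eq. 7, computed from the log-likelihoods of the
    intercept-only and full models: 1 - (L_intercept / L_full)^(2/n)."""
    return 1.0 - math.exp((2.0 / n) * (ll_intercept - ll_full))

def lr_statistic(ll_reduced, ll_full):
    """Likelihood ratio test statistic, 2*(llf - llr); under H0 it follows a
    chi-squared distribution with df = difference in parameter count."""
    return 2.0 * (ll_full - ll_reduced)

# Hypothetical log-likelihoods for n = 243 patients (setting III sample size)
r2 = pseudo_r2(ll_intercept=-150.0, ll_full=-120.0, n=243)
lr = lr_statistic(ll_reduced=-130.0, ll_full=-120.0)
```

Comparing the clinical-only model with the clinical-plus-G-probabilities model via this likelihood ratio statistic is what the significance test in the text amounts to.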

Genotyping and imputation. Setting I (eMERGE) samples were genotyped in 78 Illumina and Affymetrix array batches with different genome coverage (34). We imputed missing genotypes using the Michigan Imputation Server with the minimac3 imputation algorithm (51) and the HRCv1.1 reference panel (60). Before imputation, data were curated using PLINK version 1.9b3 (61), applying the following settings: maximum per-SNP missingness of <0.02, maximum per-person missingness of <0.02, MAF of >0.01, and Hardy-Weinberg equilibrium P value of >0.00001.

For settings II and III, at the time this study was initiated, 15,047 patients of the Partners Biobank had been genotyped: 4930 on the Illumina Multi-Ethnic Genotyping Array (MEGA) chip and the remainder on the Expanded MEGA (MEGAEX) chip (62). To obtain all relevant SNPs for the genetic risk score, we imputed missing genotypes on the Michigan Imputation Server with the minimac3 imputation algorithm and the 1000 Genomes Phase 3 version 5 reference panel. Before imputation, data were curated using PLINK version 1.9b3 (61), applying the following settings: maximum per-SNP missingness of <0.02, maximum per-person missingness of <0.02, MAF of >0.01, and Hardy-Weinberg equilibrium P value (exact) of >0.00001. In all settings, we required each genetic variant to have an imputation quality of ≥0.8. If a variant of interest was not present in the postimputation, postcuration data, we searched for a proxy in close linkage with the original variant (r2 > 0.8). The included SNPs are listed in data file S1.

For all settings, we imputed HLA regions using SNP2HLA (54) using the T1DCG version 1.0.3 reference panel (50). This tool imputes amino acid polymorphisms and SNPs in HLA within the major histocompatibility complex region on chromosome 6.


Fig. S1. Flowchart of the simulation study.

Fig. S2. Test characteristics of different ICD9 cutoffs for identification of RA cases using reviewed medical record data as the gold standard.

Fig. S3. Flowchart of patient selection in setting I.

Fig. S4. Flowchart of patient selection in setting II.

Fig. S5. Flowchart of patient selection in setting III.

Fig. S6. Flowchart of the medical record review procedure.

Fig. S7. Density plots of G-probabilities per disease.

Fig. S8. Precision recall curves.

Fig. S9. Sensitivity analysis of the performance of G-PROB per disease.

Fig. S10. Sensitivity analysis of the influence of individual diseases on G-PROB’s performance.

Fig. S11. Sensitivity analysis comparing different shrinkage factors.

Fig. S12. Test characteristics for the probabilities at different cutoffs.

Table S1. ICD9 and ICD10 codes used to identify patients in setting I (eMERGE).

Table S2. Patient characteristics in setting I.

Table S3. Patient characteristics in setting II.

Table S4. Patient characteristics in setting III.

Table S5. Area under the receiver operating curve per disease.

Table S6. McFadden’s R2 from multinomial logistic regression testing how much of the variance in the final disease diagnosis was explained by clinical, genetic, or serologic information.

Data file S1. ORs of curated risk variants for RA, RAneg, SLE, PsA, SpA, and gout.

Data file S2. Disease prevalence used in G-PROB per setting.


Funding: R.K. is supported by ReumaNederland 15-3-301 and Niels Stensen Fellowship. K.P.L. is supported by the Harold and DuVal Bowen Fund. K.S. is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) F31 AR070582. This work is also supported by the NIH K24AR066109, R01 AR057327, and R01 AR049880 (to K.H.C.); P30 AR072577 (to K.P.L.); U01 HG008685 and 1OT2OD026553 (to E.W.K.); and U01 HG009379 and R01AR063759 (to S.R.). Author contributions: R.K. and S.R. jointly developed the project, designed the study, and wrote the initial manuscript. R.K. collected, curated, and analyzed the data. S.R. supervised the analysis. R.K., K.H.C., and K.P.L. reviewed patient charts. K.P.L., E.W.K., S.l.C., T.W.J.H., and K.P.L. critically appraised the study design. J.C. conducted HLA imputation. K.S. identified SNP proxies. C.C.T. and S.l.C. critically appraised the statistical strategy. All authors were involved in writing, reviewing, and critically appraising the manuscript. Competing interests: S.R. has recently served as a consultant for Merck, Pfizer, and AbbVie and is currently serving as a consultant for Gilead and Biogen Idec. S.l.C. has consulted to Danone and DSMB in the past. Data and materials availability: All data associated with this study are present in the paper or the Supplementary Materials. Genetic variants, G-probabilities, and final diagnoses are accessible through dbGAP under accessions phs000944.v1.p1 and phs001584.v1.p1.
