Research ArticlePRECISION MEDICINE

# Identification of type 2 diabetes subgroups through topological analysis of patient similarity

See allHide authors and affiliations

Science Translational Medicine  28 Oct 2015:
Vol. 7, Issue 311, pp. 311ra174
DOI: 10.1126/scitranslmed.aaa9364

## Networks work for diabetes

Big problems require big solutions, and for complex diseases such as cancer or diabetes, the big solution is big data. One long-term goal of U.S. President Barack Obama’s Precision Medicine Initiative is to assemble medical and genetic data from at least one million volunteers. But how might researchers use all those data? Li et al. provide one answer by using patient electronic medical records (EMRs) and genotype data from Mount Sinai Medical Center in New York to characterize new subtypes of type 2 diabetes (T2D).

The group first clustered EMR data to identify T2D patients within the larger group. Topological analysis of the T2D group identified three new T2D subtypes on the basis of distinct patterns of clinical characteristics and disease comorbidities. Genetic association analysis identified more than 300 single nucleotide polymorphisms (SNPs) specific to each subtype. The authors found that classical T2D features such as obesity, high blood sugar, kidney disease, and eye disease, were limited to subtype 1, whereas other comorbidities such as cancer and neurological diseases were specific to subtypes 2 and 3, respectively. These distinctions might call for tailored treatment regimens rather than a one-size-fits-all approach for T2D. Although a larger sample size is needed to determine causal relationships, this study demonstrates the potential of precision medicine.

## Abstract

Type 2 diabetes (T2D) is a heterogeneous complex disease affecting more than 29 million Americans alone with a rising prevalence trending toward steady increases in the coming decades. Thus, there is a pressing clinical need to improve early prevention and clinical management of T2D and its complications. Clinicians have understood that patients who carry the T2D diagnosis have a variety of phenotypes and susceptibilities to diabetes-related complications. We used a precision medicine approach to characterize the complexity of T2D patient populations based on high-dimensional electronic medical records (EMRs) and genotype data from 11,210 individuals. We successfully identified three distinct subgroups of T2D from topology-based patient-patient networks. Subtype 1 was characterized by T2D complications diabetic nephropathy and diabetic retinopathy; subtype 2 was enriched for cancer malignancy and cardiovascular diseases; and subtype 3 was associated most strongly with cardiovascular diseases, neurological diseases, allergies, and HIV infections. We performed a genetic association analysis of the emergent T2D subtypes to identify subtype-specific genetic markers and identified 1279, 1227, and 1338 single-nucleotide polymorphisms (SNPs) that mapped to 425, 322, and 437 unique genes specific to subtypes 1, 2, and 3, respectively. By assessing the human disease–SNP association for each subtype, the enriched phenotypes and biological functions at the gene level for each subtype matched with the disease comorbidities and clinical differences that we identified through EMRs. Our approach demonstrates the utility of applying the precision medicine paradigm in T2D and the promise of extending the approach to the study of other complex, multifactorial diseases.

## INTRODUCTION

Type 2 diabetes (T2D) is a complex, multifactorial disease that has emerged as an increasing prevalent worldwide health concern associated with high economic and physiological burdens. An estimated 29.1 million Americans (9.3% of the population) were estimated to have some form of diabetes in 2012—up 13% from 2010—with T2D representing up to 95% of all diagnosed cases (1, 2). Risk factors for T2D include obesity, family history of diabetes, physical inactivity, ethnicity, and advanced age (1, 2). Diabetes and its complications now rank among the leading causes of death in the United States (2). In fact, diabetes is the leading cause of nontraumatic foot amputation, adult blindness, and need for kidney dialysis, and multiplies risk for myocardial infarction, peripheral artery disease, and cerebrovascular disease (36). The total estimated direct medical cost attributable to diabetes in the United States in 2012 was $176 billion, with an estimated$76 billion attributable to hospital inpatient care alone. There is a great need to improve understanding of T2D and its complex factors to facilitate prevention, early detection, and improvements in clinical management.

A more precise characterization of T2D patient populations can enhance our understanding of T2D pathophysiology (7, 8). Current clinical definitions classify diabetes into three major subtypes: type 1 diabetes (T1D), T2D, and maturity-onset diabetes of the young. Other subtypes based on phenotype bridge the gap between T1D and T2D, for example, latent autoimmune diabetes in adults (LADA) (7) and ketosis-prone T2D. The current categories indicate that the traditional definition of diabetes, especially T2D, might comprise additional subtypes with distinct clinical characteristics. A recent analysis of the longitudinal Whitehall II cohort study demonstrated improved assessment of cardiovascular risks when subgrouping T2D patients according to glucose concentration criteria (9). Genetic association studies reveal that the genetic architecture of T2D is profoundly complex (1012). Identified T2D-associated risk variants exhibit allelic heterogeneity and directional differentiation among populations (13, 14). The apparent clinical and genetic complexity and heterogeneity of T2D patient populations suggest that there are opportunities to refine the current, predominantly symptom-based, definition of T2D into additional subtypes (7).

Because etiological and pathophysiological differences exist among T2D patients, we hypothesize that a data-driven analysis of a clinical population could identify new T2D subtypes and factors. Here, we develop a data-driven, topology-based approach to (i) map the complexity of patient populations using clinical data from electronic medical records (EMRs) and (ii) identify new, emergent T2D patient subgroups with subtype-specific clinical and genetic characteristics. We apply this approach to a data set comprising matched EMRs and genotype data from more than 11,000 individuals. Topological analysis of these data revealed three distinct T2D subtypes that exhibited distinct patterns of clinical characteristics and disease comorbidities. Further, we identified genetic markers associated with each T2D subtype and performed gene- and pathway-level analysis of subtype genetic associations. Biological and phenotypic features enriched in the genetic analysis corroborated clinical disparities observed among subgroups. Our findings suggest that data-driven, topological analysis of patient cohorts has utility in precision medicine efforts to refine our understanding of T2D toward improving patient care.

## RESULTS

### T2D-specific patient network

We developed and applied an unsupervised, topology-based approach that uses EMR-derived clinical data to infer a patient-patient similarity network as the computational model to represent a complex patient population. In the resulting patient-patient network, patients (nodes) are connected to one another by edges if they exhibit clinical similarity across many clinical dimensions (for example, laboratory tests). Patients who exhibited very high degrees of similarity were grouped into single nodes (see Materials and Methods). We identified two distinct clusters in the resulting patient-patient network (Fig. 1A) that contained 3889 and 7321 unique patients (the left and right clusters, respectively). The left cluster (n = 3889) was significantly enriched [least absolute shrinkage and selection operator (LASSO), P < 0.05] for endocrine and metabolic diseases, immunity disorders, infectious disease, mental illness, diseases of the circulatory and genitourinary systems, and symptoms/signs/ill-defined conditions and factors that influence health status. The right cluster (n = 7321) was significantly enriched for complications of pregnancy, respiratory diseases, and unclassified E code (external causes of injury) (15). Next, we identified T2D patients in the network to evaluate the heterogeneity of T2D patient groups across the patient-patient topology. We used a previously validated EMRs and genomics (eMERGE) network electronic phenotyping algorithm (16, 17) to define the T2D phenotype (n = 2551) and evaluated the network for topological enrichment of T2D patients. The red areas in Fig. 1A indicate that T2D patients are enriched in that particular location in the network, where the color scheme reflects the P value from hypergeometric enrichment analysis of topological enrichment (see Materials and Methods). We observed multiple distinct clusters or subnetworks of T2D patient enrichment.

We then rebuilt the patient-patient network, using the same topology analysis pipeline, with only the 2551 T2D patients identified with the T2D electronic phenotyping algorithm. The filtering step resulted in 73 clinical features that were used for topological inference of the patient-patient similarity network (table S1). From the resulting patient-patient network, we identified three completely segregated clusters with 762 (subtype 1), 617 (subtype 2), and 1096 (subtype 3) patients, respectively (Fig. 1B). We evaluated the network for enrichment of gender and did not observe any elevated enrichment of male or female patients in any of the clusters, suggesting that gender is not an organizing factor in the topology.

To assess the reproducibility of the T2D subtypes identified from the patient-patient network, we examined the performance on random samplings of training and test sets. First, we randomly split the 2551 T2D patients into two groups, with two-thirds as a training set and one-third as a test set. We then rebuilt the patient-patient network using the same 73 clinical features, distance metrics, and filter functions from the topology analysis pipeline. These steps were repeated 10 times. Last, we calculated the average of the precision [positive predictive value (PPV)] and recall (sensitivity) for the 10 tests, for training and test sets individually. The average precisions were 100, 91, and 98%, and the average recalls were 99, 96, and 94% for subtype 1, subtype 2, and subtype 3, respectively, in the training sets. In the test sets, the average precisions were 100, 90, and 97%, and the average recalls were 99, 96, and 93% for subtype 1, subtype 2, and subtype 3, respectively. The overall accuracy was 96% for both the training sets and test sets.

### Significant characteristics and clinical features specific to T2D subtypes

We identified 33 clinical variables significantly specific to subtype 1 (n = 761) compared to both of the two other subtypes individually or combined. Three of these variables overlapped with clinical variables that were also specific to subtype 3, resulting in 29 variables unique to subtype 1. In addition, we identified 3 and 11 clinical variables significantly specific to subtype 2 (n = 617) and subtype 3 (n = 1096), respectively, with one shared variable. The only variable the three subtypes had in common was insulin administration (Table 1, A to C).

Table 1.

Clinical variables specific to subtypes. S-1, subtype 1; S-2, subtype 2; S-3, subtype 3; BMI, body mass index.

View this table:

Patients in subtype 1 were the youngest (59.76 ± 0.45 years) and were notable for features classically associated with T2D, such as the highest BMI (33.07 ± 0.29 kg/m2) and highest serum glucose concentrations at point-of-care testing (POCT) (193.69 ± 11.45 mM). Patients in subtype 1 had the lowest complete blood count, including the lowest white blood cell counts (5.32 ± 0.57 × 109/liter), neutrophil counts (2.50 ± 0.58 × 109/liter), eosinophil counts (0.09 ± 0.02 × 109/liter), and mean platelet volumes (9.97 ± 0.37 fl). In addition, patients in subtype 1 had a considerably lower platelet count, with more than 50% of patients below the reference range (98.36 ± 17.86 × 109/liter). Adding to this curious hematological finding was a prolonged prothrombin time at POCT (29.18 ± 3.64 s), which corresponded to an elevated international normalized ratio (INR) (2.57 ± 0.34). Patients in subtype 1 also displayed the highest serum albumin (4.27 ± 0.02 g/dl) and lowest creatinine (1.0 ± 0.02 mg/dl) levels. Although these patients had better kidney function compared to those in the other two subtypes, estimated glomerular filtration rate (GFR) was below the reference range (72.26 ± 1.47 ml/min/1.73 m2; range, 17.3 to 149.7). In addition, patients in subtype 1 had the highest total blood CO2 (26.6 ± 0.13 mmHg) and fewer respirations per minute (16.65 ± 0.16), and lower prescription rates for calcium channel blockers (CCB; 19.55%), angiotensin II receptor blockers and angiotensin-converting enzyme inhibitors (ARB/ACEI, 48.16%) (commonly prescribed for hypertension), dipeptidyl peptidase 4 inhibitor (DPP4, 1.05%), and metformin (MET, 6.43%) (the last two are both prescribed for T2D).

Patients in subtype 2 had the lowest weight (85.17 ± 1.14 kg) compared with those in the other subtypes. Patients in subtype 3 had the highest systolic blood pressure (135.7 ± 0.7 mmHg), serum chloride levels (102.03 ± 0.11 mEq/liter), and troponin I levels (0.36 ± 0.09 μg/liter) and were more often prescribed ARB/ACEI (62.96%) for the treatment of hypertension and statins (56%) for cholesterol reduction. A full list of variables that were significantly specific to each subtype is provided in Table 1 (A to C).

### Disease comorbidity associated withT2D subtypes

We applied the disease Clinical Classifications Software (CCS; see Materials and Methods) (18) on more than 7000 ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes in our cohort to aggregate the large number of ICD-9-CM codes into a manageable number of either 281 single-level disease categories or 18 level 1 (broader) categories in the multilevel disease categories. By adjusting patient age, gender, and self-reported race, we found that the patients in subtype 1 (n = 762) were more likely to associate with the following ICD-9-CM codes: diseases in the “other upper respiratory infections” [relative risk (RR), mean, 1.68; range, 1.34 to 2.11]; immunization and screening for infectious disease (RR, 1.65; range, 1.32 to 2.06); diabetes mellitus with complications (RR, 1.50; range, 1.22 to 1.84); other skin disorders (RR, 1.41; range, 1.13 to 1.76); and blindness and vision defects (RR, 1.32; range, 1.04 to 1.67), than were the other two subtypes (Table 2A). Patients in subtype 2 (n = 617) were more likely to associate with diseases of cancer of bronchus: lung (RR, 3.76; range, 1.14 to 12.39); malignant neoplasm without specification of site (RR, 3.46; range, 1.23 to 9.70); tuberculosis (RR, 2.93; range, 1.30 to 6.64); coronary atherosclerosis and other heart disease (RR, 1.28; range, 1.01 to 1.61); and other circulatory disease (RR, 1.27; range, 1.02 to 1.58), than were the other two subtypes (Table 2B). Patients in subtype 3 (n = 1096) were more often diagnosed with HIV infection (RR, 1.92; range, 1.30 to 2.85) and were associated with E codes (that is, external causes of injury care) (RR, 1.84; range, 1.41 to 2.39); aortic and peripheral arterial embolism or thrombosis (RR, 1.79; range, 1.18 to 2.71); hypertension with complications and secondary hypertension (RR, 1.66; range, 1.29 to 2.15); coronary atherosclerosis and other heart disease (RR, 1.41; range, 1.15 to 1.72); allergic reactions (RR, 1.42; range, 1.19 to 1.70); deficiency and other anemia (RR, 1.39; range, 1.14 to 1.68); and screening and history of mental health and substance abuse code (RR, 1.30; range, 1.07 to 1.58) (Table 2C).

Table 2.

Significant associated disease categories. MHSA, mental health and substance abuse; LCI, lower confidence interval; UCI, upper confidence interval.

View this table:

### Significant disease–genetic variant enrichments specific to T2D subtypes

We next evaluated the genetic variants significantly associated with each of the three subtypes. Observed genetic associations and gene-level [that is, single-nucleotide polymorphisms (SNPs) mapped to gene-level annotations] enrichments by hypergeometric analysis are considered independent of the clinical phenotype–based network topology, because patient genetic data were not used in the determination of the patient-patient network topology. We identified 1279, 1227, and 1338 genetic variants specific to subtypes 1, 2, and 3, respectively, using a hypergeometric enrichment approach (see Materials and Methods) (significant SNPs are shown in table S3, A to C). After mapping the variants to gene regions, we identified 425, 322, and 437 unique genes specific to subtypes 1, 2, and 3, respectively. We used a comprehensive human disease–SNP association database (VarDi) (19) to assess the agreement between genetic-disease associations and disease comorbidities associated with each subtype. We analyzed the enrichment of phenotypes including both diagnosis (for example, diabetic nephropathy) and laboratory measurements (for example, creatinine levels) associated with the genetic variants at the gene level.

We observed 27 gene-phenotype associations enriched (hypergeometric analysis, P ≤ 0.05) among the genetic variants unique to subtype 1 (Table 3A and Fig. 2). Many of the enriched gene-level phenotype annotations have known associations with T2D, such as increased serum retinol levels (20), increased B cell counts (21), increased albumin-to-creatinine ratios (22), increased diabetes mellitus, increased serum alanine transaminase levels (23), increased diabetic nephropathy (22, 24), increased leptin receptor (a single-transmembrane domain receptor) (25), increased serum levels of mannose-binding lectin (26), increased forced expiratory volume (27), and increased serum vitamin D concentrations (28). A complete list of subtype 1–specific enriched phenotypes is displayed in Table 3A.

Table 3. Significant phenotypes.
View this table:

We observed 25 gene-phenotype associations significantly enriched among the genetic variants unique to subtype 2. The four enriched gene-level phenotype annotations for subtype 2 were related to either cancer or treatment of cancer including bleomycin sensitivity, epirubicin-induced adverse drug reactions, stem cell transplantation, and follicular lymphoma. In addition, we identified two cardiovascular phenotypes, left ventricular internal diastolic dimensions and atrial fibrillation. The enriched gene-level phenotypes matched with patient comorbidities associated with subtype 2 (Table 3B and Fig. 2), suggesting a possible link between observed disease comorbidities and underlying subtype genetics.

We observed 28 gene-phenotype associations significantly enriched among the genetic variants unique to subtype 3 (Table 3C and Fig. 2). Ten phenotypes were related to mental and neurological diseases, including spinocerebellar ataxia type 1, intraventricular septal thickness, anxiety disorders, cognitive decline, dementia, impaired play skills, intelligence, depression, θ power of electroencephalogram, and HIV-associated neurocognitive disorders. Three were related to the cardiovascular system, including heart rate interval (RR), peripartum cardiomyopathy, and atrial fibrillation. Increased serum vitamin D concentrations (28) were recently implicated as a risk factor for T2D and also were enriched in subtype 1. Furthermore, two phenotypes, allergy and response to statins, were enriched for genetic variants that matched with the identified clinical variables and phenotype comorbidities specific to subtype 3, including cardiovascular disease and mental illness. Disease comorbidities and clinical variables associated with subtype 3 matched particularly well with the gene-level phenotype enrichments. A complete list of enriched phenotypes for subtype 3 is shown in Table 3C.

The network of genetic variants in gene-level and associated phenotypes for the three T2D subtypes is shown in Fig. 2 (produced with Cytoscape 3.2.0) (29).

### Significant pathway and toxicity functions specific to T2D subtypes

We assessed the toxicity functions and signaling pathways for gene-level enrichments unique to each subtype (425, 322, and 437 gene-level enrichments specific to subtypes 1, 2, and 3, respectively) using Qiagen’s Ingenuity Pathway Analysis (IPA) program. Canonical pathways include metabolic and cell signaling pathways that have been curated from the literature by IPA. We identified five, two, and six canonical pathways to subtypes 1, 2, and 3, respectively (P < 0.01), by Fisher’s exact test right-tailed for enrichment.

Pathways that were enriched in subtype 1 were fatty acid β-oxidation III, which is increased in diabetic liver disease (30), acetate conversion to acetyl-CoA, which is involved in the metabolism of carbon sugars (3133), and cAMP (adenosine 3′,5′-monophosphate)–mediated signaling, which normalizes glucose-stimulated insulin secretion in uncoupling protein 2–overexpressing pancreatic β cells (34). Two pathways were associated with disease comorbidities for subtype 1, including netrin signaling, which acts in a protective role during diabetic nephropathy (35), and GABA (γ-aminobutyric acid) receptor signaling, which can often be detected early in the course of diabetic retinopathy (36, 37).

Pathways enriched in subtype 2 include those involved in pattern recognition receptors in the recognition of bacteria and viruses, which might explain why patients in subtype 2 had an increased prevalence of tuberculosis. We also found an enrichment for thrombopoietin signaling, which activates a number of secondary messengers that promote cell survival, proliferation, and differentiation (38). Increased thrombopoietin levels might contribute to the development and progression of coronary artery disease (39, 40).

Pathways enriched in subtype 3 include α-adrenergic signaling, which is implicated in diverse physiological functions, in particular those of the cardiovascular and central nervous systems (41, 42); synaptic long-term depression (43); CREB (cAMP response element–binding protein) signaling in neurons, which has a well-documented role in neuronal plasticity and long-term memory formation in the brain (44) as well as therapeutic potential for patients who have Alzheimer’s disease (45); glutamate receptor signaling, which has been implicated in brain pathologies in neurological diseases (46); hepatic fibrosis and hepatic stellate cell activation; and sperm motility. The complete list of pathways and their related genes for all subtypes are shown in Table 4.

Table 4.

Canonical pathways at gene level for each T2D subtype. ns, not significant.

View this table:

Enriched toxicity functions included hepatotoxicity, nephrotoxicity, cardiovascular toxicity, and clinical pathology endpoints. We identified nine, three, and three toxicity functions enriched in subtypes 1, 2, and 3, respectively (P < 0.01). In subtype 1, four of the nine functions are related to renal dysfunction, including glomerular injury, renal hypertrophy, renal proliferation, and renal degeneration, suggesting that diabetic nephropathy exists in the subtype 1 cohort (47, 48). The remaining five functions are related to liver dysfunction, which match the two liver enzymes, alanine transaminase levels and aspartyl phenylalanine levels, identified by VarDi (19). Surprisingly, subtypes 2 and 3 were both associated with cardiac arteriopathy, even though they were associated with different sets of genes. Most toxicity functions that are related to cardiovascular disorders and liver fibrosis match the findings that both cohorts have high risk for cardiovascular diseases, as deduced on the basis of disease comorbidities from the EMRs and genetic variant associations by VarDi (19). The complete list of enriched toxicity functions for all subtypes and their related genes are listed in Table 5.

Table 5. Toxicity functions at the gene level for each T2D subtype.
View this table:

Together, these results suggest that the current clinical definition of T2D subsumes more nuanced subtypes whose definition and recognition might inform important clinical distinctions. Furthermore, the genetic findings suggest that these differences between T2D subtypes are potentially rooted in biological differences that relate to the observed clinical differences, and these biological differences might suggest new opportunities for biomarker discovery or improving our understanding of disease mechanisms.

## DISCUSSION

Previous efforts to analyze or mine large clinical populations with associated genome-wide genotyping information have largely focused on replicating known clinical genotype-phenotype correlations, or discovering new correlations from more narrowly defined clinical phenotypes that can be extracted from EMRs (49, 50). Previous efforts to develop and apply phenome-wide association study (PheWAS) approaches represent a new approach in which data from EMRs are integrated and used for systematic discovery of new clinical genotype-phenotype correlations (51). However, the goal of PheWAS is to discover new pleiotropic genotype-phenotype associations—that is, to identify many clinical phenotypes linked to a single genetic locus. The goal of our study was to develop a precision medicine approach to characterize the complexity of T2D patient populations through data-driven, topological analysis of patient-patient similarity across clinical phenotype traits. Our approach is distinct from previous efforts in that we developed and applied a patient-centric clinical phenotype similarity network and then used the topology of the resulting patient-patient similarity network to define patient subgroups, which were subsequently used as the basis of clinical and genotype risk factor associations.

We hypothesized that topological analysis of patient populations in high-dimensional clinical phenotype space may identify meaningful subpopulations of T2D patients. We focused our analysis on T2D patients, who are of high clinical importance and the most prevalent disease group in the population. We identified 2551 T2D patients in our outpatient cohort as determined by the eMERGE T2D electronic phenotyping algorithm (16, 17). Using our data-driven, topology-based approach, we identified three distinct subtypes of T2D. Subtype 1 comprises ~30% (n = 761) of the overall T2D cases and was enriched for diabetic nephropathy and diabetic retinopathy, both microvascular complications. Subtype 2 comprises ~24% (n = 617) of all T2D cases and was enriched for cancer malignancy and cardiovascular diseases. Subtype 3 comprises ~43% (n = 1096) of all T2D cases and associated most strongly with cardiovascular diseases, neurological diseases, allergies, and HIV infections. Macrovascular complications are generally best averted by stringent control of blood pressure and low-density lipoprotein. We identified 1279, 1227, and 1338 SNPs, which mapped to 425, 322, and 437 genes, specific to subtypes 1, 2, and 3, respectively. The enriched phenotypes and biological functions defined at the gene level for each subtype matched with the disease comorbidities and clinical differences that we identified through EMR-based topology data analysis (TDA). This observed agreement is likely meaningful mechanistically because the genetic data were not used to inform patient subgroup topology.

The patient-patient network representation was constructed using cosine distance metric with two filter functions to assess the similarity of the clinical variables from EMRs. The clinical data set comprises more than 500 clinical variables represented in the EMRs, including patient demographics, laboratory tests, and medication orders.

The observed differences in comorbidity and genetic associations between T2D subtypes might serve as useful features for informing the clinical characterization of T2D patients. We found several notable associations between disease diagnosis categories and T2D subtypes. We used CCS developed by the U.S. Agency for Healthcare Research and Quality (AHRQ) (18) to narrow down more than 7000 ICD-9-CM diagnosis codes in our cohort to higher-order single-level disease categories (n = 281) that include exclusively mental health and substance abuse (CCS-MHSA) general categories, which were more useful for presenting data at a descriptive statistical categorical level than using individual ICD-9-CM codes. Patients in subtype 1 associated most with prototypical microvascular diabetic complications, namely, diabetic nephropathy and diabetic retinopathy, which was supported by both clinical data and genotype data independently. In support of a genetic etiology for subtype 1 phenotype manifestation, the ACE gene, which encodes angiotensin I converting enzyme and was specifically associated with this cohort (Table 3A and Fig. 2), has been implicated in diabetic nephropathy (52, 53) and also in platelet aggregation (53). Accordingly, this association could reasonably suggest a mechanism to explain the lower platelet counts observed in subtype 1 patients (54). In addition, we extracted hemoglobin A1c (HbA1c) levels from our EMRs and found that patients in subtype 1 had the highest HbA1c levels compared with other two groups (7.68 ± 1.75, 7.45 ± 1.87, and 7.47 ± 1.78 in subtypes 1, 2, and 3, respectively, P < 0.05), which confirmed that subtype 1 was most likely enriched with microvascular diabetic complications best prevented by glycemic control (55).

Patients in subtype 2 were more likely to associate with cancer of the bronchus and lung (RR, 3.76; range, 1.14 to 12.39) and malignant neoplasm without specification of site (RR, 3.46; range, 1.23 to 9.7). Epidemiological studies have demonstrated an association between T2D and cancer (56). To try to unravel a putatively causal ordering for this disease link, we compared the first diagnosis dates for both diseases in our cohort to determine whether one more often predated the other. We identified 40% patients who were diagnosed with T2D before any instance of cancer and 60% of patients who were diagnosed with a cancer before T2D. This pattern indicates that T2D can be either the risk factor for or consequence of many forms of cancer (56, 57). Patients in subtype 3 were most likely to be associated with cardiovascular diseases and mental illness according to clinical data and genotype data independently. These patients were more often prescribed the top psychiatric medications to treat anxiety and depression (58), with 3.4% (P = 0.01) and 8.3% (P = 0.02), compared with other two subtypes from χ2 tests, respectively, as well as insulin treatment (45%, P < 0.0001). The 61 patients diagnosed with HIV infection could have a poorer response to therapy for diabetes because antiretroviral agents and chronic inflammation could adversely affect glycemic control (59). To address any potential bias from HIV infection or treatment, we removed these HIV patients from the cohort and reanalyzed the data using the LASSO algorithm (60). Except for allergies, disease comorbidities remained the same, dismissing the possibility of HIV infection bias and exhibiting the robustness of our methodology. Furthermore, the FHIT gene, which encodes the fragile histidine triad protein and was specifically associated with the subtype 3 cohort, has been associated with allergy and neurological disorders, including anxiety and depression (Table 3C and Fig. 2) (6163), indicating that FHIT could be a driver for these conditions and could explain why patients who had allergies also had an increased rate of suicide (6467). Although patients in subtypes 2 and 3 had significantly lower BMIs than those in subtype 1 (P < 0.0001, Table 1B), both were enriched for cardiovascular morbidity, whereas patients in subtype 1 were not. A recent study showed that weight loss does not reduce the rate of cardiovascular events in obese adults with T2D (68, 69). These data suggest that the cardiovascular morbidity seen in patients in subtypes 2 and 3 might be independent from obesity and potentially driven by genetic variants. Another interesting finding along these lines is our observation that hypertensive macrovascular variants were associated with subtypes 2 and 3, whereas hyperglycemic microvascular variants were associated with subtype 1.

Our study has several potential limitations. We identified 2551 T2D patients on the basis of an eMERGE algorithm (16, 17) from an 11,210 genotyped outpatient cohort. The sample size is relatively modest for identifying risk variants from a genome-wide association study (GWAS) point of view. Given that we investigated 38 million variants, it was a great challenge to control for false discovery rate. In our study, however, we derived our genetic data from more than ~10,000 published GWAS at the P < 1 × 10−6 significance level. The stringency of this inclusion criterion adds a measure of control to the procedure because subtype enrichments were identified using these disease-associated variants.

Another limitation is the lack of a deep consideration for the temporal aspects of disease trajectories. In analyzing the EMRs in Mount Sinai Medical Center (MSMC), we cannot always be clear when and where the first diagnosis of disease took place. Specifically, we cannot determine whether the patient had been diagnosed beforehand in other hospitals and, if so, how long the patient had the diagnosed disease before his or her first observed ICD-9-CM diagnosis. One possible solution is to explore the integration of insurance claims data. We will explore an extension of our analytical framework that incorporates temporal analysis in future studies.

In addition, T2D inclusion and exclusion criteria were precisely refined by the eMERGE algorithm (16, 17), and the other disease categories developed by AHRQ were all based on the current ICD-9-CM diagnosis code. Furthermore, CCS developed by AHRQ (18) only assigns one disease classification of a disease. As of now, only 20 phenotypes have been validated by eMERGE (70) using iteratively refined phenotype algorithms incorporating both structured and unstructured data to achieve high PPVs to identify true cases and controls from EMRs.

Our approach combines imputed variant information from the whole genome with high-dimensional EMRs, which facilitates pinpointing the differences between clinical and genetic factors specific to each subtype. This provides a tractable framework that enables initial steps toward the T2D redefinition informed by genetic markers. Our genetic analysis used the imputed variants from the 1000 Genome Projects, not limiting the variants in the genotyping arrays. This strategy offers better coverage on the intergenic and noncoding regions when investigating the associations between variants and phenotypes. The Encyclopedia of DNA Elements (ENCODE) project has shown that ~95% of known variants within sequenced genomes and 88% of those variants from GWAS fall outside of coding regions (71), and a functional SNP most strongly supported by experimental evidence is an SNP in the linkage disequilibrium region (72). The technique of imputation uses information of haplotypes from a more comprehensive whole-genome sequencing study (the 1000 Genome Projects) to infer variants that were not profiled by the original technology (73). With the information on variants from the whole genome, we were able to identify more variants associated with subtypes as well as to achieve better mapping of the identified variants to published GWAS.

Our study offers several important conclusions for translational research. First, our approach demonstrates the utility and promise of applying the precision medicine paradigm in T2D, and can be extended toward the study of other complex, multifactorial diseases. Next, our study demonstrates the utility of using higher-dimensional clinical data to first define the complex topology of a clinical phenotype before genetic marker discovery. This stands in contrast with previous precision medicine efforts that begin with molecular stratification and rely on established clinical phenotype definitions. Furthermore, the subtype-specific genetic factors identified by this study can be further explored through additional population genetic and experimental work to evaluate their utility for identifying subtype-specific biomarkers or to improve understanding of T2D disease mechanisms. Last, incorporation of the temporal dimension in future development of our topology-based approach might provide additional insight into the complexity of T2D patient populations along the natural history of disease and inform disease prevention efforts.

## MATERIALS AND METHODS

### Study design

The aim of our study was to develop a precision medicine approach to better understand and to characterize the complexity of T2D patient populations through data-driven, topological analysis of patient-patient similarity across clinical phenotype traits. We performed topological analysis for the data set, which comprises EMRs and genotype data from 11,210 individuals from MSMC’s large outpatient population. T2D and non-T2D control phenotypes were defined by the eMERGE phenotyping algorithm (16, 17). We assessed the disease comorbidities and human disease–SNP association for each subtype in T2D, as well as the enriched phenotypes and biological functions at gene level for each subtype.

### Patient population

We recruited and analyzed 11,210 unique patients who are consented participants in the Mount Sinai BioMe Biobank Program, an ongoing, EMR-linked bio- and data repository. The data set comprises adult patients recruited nonselectively from MSMC’s large outpatient population. Participants are predominantly recruited from local diverse communities in New York with 46% Hispanic, 32% African American, 20% European white, and 2% others as self-reported. The data were composed of 6857 (61%) females and 4350 (39%) males, and the average age is 55.5 years for overall, female, and male populations (fig. S1). The overall characteristics of 11,210 Biobank patients are shown in table S2. The individuals represented in the clinical data set are drawn from diverse racial, ethnic, and socioeconomic backgrounds. The EMR data are deidentified, and this study was governed by institutional review board approval and informed consent.

### Genotype data processing and identification of genetic variants and genes

A total of 11,210 unique patients were genotyped for genome-wide Illumina OmniExpress and Illumina Human Exome BeadChip arrays. We used a default GenCall score cutoff of 0.15 in GenomeStudio (v2011.1) as recommended by Illumina. Quality control was performed by zCall (74) for SNP quality. SNPs were removed if they had (i) a call rate of <95%, (ii) no minor alleles, (iii) Hardy-Weinberg equilibrium within population (P < 5 × 10−5), and (iv) removed A/T and G/C SNPs and any SNPs that deviate from 1 kg (<40% versus >60% and vice versa). After quality control for call quality and population equilibrium, the genotype data were phased by ShapeIt v2 r644 (75), yielding 850,067 SNPs, and then imputed by IMPUTE2.3 (73) using the 1000 Genomes Project (76) version 3 and integrated variant set (August 2012) as the reference panel, resulting in 38,068,758 variants. A complete list of the number of variants, in coding regions, and genes in both original genotype and the imputed data using genome build GRCh37/hg19 is shown in table S4. The rationale for using the 1000 Genomes Project as reference panel for imputation is that it contains the largest sample size of most diverse ethnicity background. Given the diversity in the Mount Sinai Biobank patients, using the 1000 Genomes Project allows us to identify the closest individuals for each patient and impute for genotypes that were not profiled in the original array. We mapped the imputed variants to gene regions by SnpEff v2 r644 (77) and AILUN [(78); http://ailun.stanford.edu] using human genome assembly (GRCh37/hg19) reference genome (UCSC Genome Browser, http://genome.ucsc.edu). The imputed variants data covering variants originally profiled by the genotyping arrays as well as variants observed in the 1000 Genomes Projects were then used for association analysis.

### Clinical phenotype data

We generated a pseudo cross-sectional data set from our deidentified patient records using the following phenotypic logic scheme. Using the initial enrollment date into the BioMe program (D1) as an anchor, we populated all (first) laboratory values, vitals, and specified medications ±30 days from D1. We collected the last laboratory/vital/medication date (D2) where the upper bound of the D2 date was constrained to D1 +30 days, and the lower bound constrained to D2 = D1. In most cases, D2 = D1. We then populated all ICD-9-CM codes for patients, where ICD-9-CM date ≤ D2 date. We then populated all medication orders for patient, where medication orders date ≤ D2 date. The data set also includes self-reported demographic data collected at D1.

T2D and non-T2D control phenotypes were defined by an electronic phenotyping algorithm that was developed by the eMERGE network (16, 17) based on ICD-9-CM diagnosis codes, laboratory tests (LONIC), prescribed medications (RxNorm), physician notes (natural language processing), and family history. Interim results were vetted by subject matter experts (SMEs) to verify that the queries were capturing the specified data appropriately. Adjustments to the queries were implemented iteratively as per the feedback received. Once the SMEs were satisfied with the algorithm components, the separate queries were packaged into a single job flow and executed against the base population datamart, resulting in the identification of cases and controls. We randomly selected samples of 100 cases and 100 controls for manual chart review by clinical experts from the endocrinology division at Mount Sinai Hospital and performance statistics generated. The algorithm achieved a PPV of 96% for cases and 100% for controls.

The processed data were then assembled into a data matrix of n patients by P clinical variables. The data set used for analysis represented 11,210 individual patients, 505 clinical variables (480 of which were clinical laboratory measures), and 7097 unique ICD-9-CM codes (1 to 218 per patient). On average, there were 64 clinical variables collected per patient (range, 25 to 212). To avoid overfitting, we selected the clinical variables with at least 50% of patients who had the values, resulting in 73 variables to perform the analysis (table S1).

### Disease classification

Each individual patient had at least one ICD-9-CM code diagnosis at the time his or her DNA sample was collected. CCS is a tool that was developed at AHRQ for clustering patient diagnoses and procedures into a manageable number of clinically meaningful categories (18). The single level of CCS is used to classify all diagnoses and procedures into unique groups based on the patient’s ICD-9-CM codes. The multilevel characterization of CCS is used to group single-level CCS categories into broader body systems or condition categories (for example, “Diseases of the Circulatory System,” “Mental Disorders”). The multilevel system has four levels of groupings for diagnoses, and we use the highest, most broad level to examine and assess general groupings for the disease category (18). In our study, we used 281 mutually exclusive single-level and 18 multilevel categories (broadest level) from CCS to map the disease categories based on their ICD-9-CM codes.

### TDA pipeline

We developed a novel TDA-based approach to perform unsupervised clustering of patients using various clinical features to produce a patient-patient network organized according to the high-dimensional clinical phenotype similarity among patients. We use Ayasdi 3.0 (79, 80) (http://ayasdi.com, Ayasdi Inc.) to perform the TDA analysis. We used TDA pipeline for overall patients, random samplings of training and test data sets. A cosine distance metric was used to assess the similarity of the data points based on clinical variables (Eq. 1). Two filter functions, L-infinity centrality and principal metric singular value decomposition (SVD1), were used to generate the patient-patient network based on clinical variables. L-infinity centrality is defined for each data point y to be the maximum distance from y to any other data point in the data set. It produces a more detailed and succinct description of the data set than a typical scatter plots display (80). Large values of this function correspond to points that are far from the center of the data set. SVD1 also was used in the data matrix to obtain subspaces within the column space, and dimensionality reduction is accomplished by projection on these subspaces (80). This is done with standard linear algebraic techniques when possible, and when the number of points is too large, numerical optimization techniques are used.(1)where D1 and D2 represent two individual data points.

### Statistical analysis

We used Ayasdi 3.0 (79, 80) (http://ayasdi.com, Ayasdi Inc.) to perform TDA for generating the patient-patient network. We used Qiagen’s IPA program version 24390178 (IPA, Qiagen, http://qiagen.com/ingenuity) to assess the toxicity functions and pathways for significant genes associated with each subtype. For imputed SNPs, we performed hypergeometric analysis to identify the significant SNPs associated with each subtypes based on their allele frequency and then examined the disease enrichment associated with the genes mapped from SNPs. The goal of performing hypergeometric tests is to identify genes that are highly associated with each subtype, which would lead to distinct phenotypes associated with each subtype. Such analysis is by nature different from traditional GWAS, where the goal is to identify disease-causing variants. Therefore, the hypergeometric test P values were used as an association measure instead of the evaluation of significance for individual SNPs. Similar analysis can also be seen in gene set–based gene expression analysis such as gene set enrichment analysis (81). We used our curated VarDi (19) to assess the significance of the genotype-phenotype enrichment. VarDi (19) is composed of 24,435 variants mapped to 3694 unique genes in 904 distinct phenotypes with a significant level (P < 1 × 10−6) from over ~13,000 GWAS, and we used P < 1 × 10−6 to identify variants from VarDi (19). LASSO provides stability and robustness statistics, which are used to inform consistency and sparsity. LASSO seeks a model that not only fits well but also is “simple” to avoid large variation, which occurs in estimating complex models (60). We used the LASSO algorithm with corrected Akaike information criterion statistic (AICC) (Eq. 2) (82) for feature selection and logistic regression for RR estimate of disease comorbidities based on CCS disease classification. We used analysis of variance (ANOVA), two-tailed t test, or χ2 tests to compare multiple or two-class continuous or categorical clinical variables. Data were presented as means ± SE. Statistical analyses and random samplings were carried out using SAS 9.3.2 (SAS Institute) and R 2.15.1 (83). We used Cytoscape 3.2.0 (29) to visualize the networks for the significant genotype-phenotype association identified from VarDi (19) specific to each of the T2D subtypes.(2)where k is the number of parameters in the model, and n is the sample size.

## SUPPLEMENTARY MATERIALS

www.sciencetranslationalmedicine.org/cgi/content/full/7/311/311ra174/DC1

Fig. S1. Age distributions for overall, female, and male populations.

Table S1. Clinical features.

Table S2. Patient characteristics across entire Biobank cohort.

Table S3. Significant SNPs specific for each T2D subtype.

 S4. Genes and variants.

## REFERENCES AND NOTES

1. Acknowledgments: We thank D. Ruderfer for insights on Biobank data; M. Menon for helpful comments and suggestions; and the IT group in Icahn School of Medicine at Mount Sinai for Hadoop computing and database support. Funding: This study was supported by funding from the NIH National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) (R01DK098242) and National Cancer Institute (NCI) (U54CA189201). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDDK, NCI, or NIH. Author contributions: Conceived and designed the study: L.L. and J.T.D. Performed the TDA and statistical analysis: L.L. Analyzed the EMRs: L.L. Analyzed the genotyping data: L.L. and W.-Y.C. Contributed VarDi analysis tools: R.C. Contributed Biobank genotyping data: J.T.D., O.G., and E.P.B. Contributed clinical interpretation: L.L. and R.T. Wrote and edited the paper: L.L., W.-Y.C., J.T.D., B.S.G., O.G., and R.T. Competing interests: The authors declare that they have no competing interests.
View Abstract