Research ArticleGenomics

Phenome-wide scanning identifies multiple diseases and disease severity phenotypes associated with HLA variants

See allHide authors and affiliations

Science Translational Medicine  10 May 2017:
Vol. 9, Issue 389, eaai8708
DOI: 10.1126/scitranslmed.aai8708

Hints on health and disease from HLA

Each of us expresses a mix of different human leukocyte antigens (HLAs), which present self- and foreign peptides to T cells. Because slightly different peptides are presented by each HLA type, HLA expression can influence an individual’s susceptibility to disease. Karnes et al. scrutinized electronic health record information from tens of thousands of people in two distinct cohorts to compare their phenotypes to the HLA alleles they express. This study confirmed previously identified HLA associations and also identified new ones; most associations were related to autoimmune diseases. The researchers have made the catalog freely available so that other groups can mine the data for future discoveries about how HLAs drive different phenotypes.


Although many phenotypes have been associated with variants in human leukocyte antigen (HLA) genes, the full phenotypic impact of HLA variants across all diseases is unknown. We imputed HLA genomic variation from two populations of 28,839 and 8431 European ancestry individuals and tested association of HLA variation with 1368 phenotypes. A total of 104 four-digit and 92 two-digit HLA allele phenotype associations were significant in both discovery and replication cohorts, the strongest being HLA-DQB1*03:02 and type 1 diabetes. Four previously unidentified associations were identified across the spectrum of disease with two- and four-digit HLA alleles and 10 with nonsynonymous variants. Some conditions associated with multiple HLA variants and stronger associations with more severe disease manifestations were identified. A comprehensive, publicly available catalog of clinical phenotypes associated with HLA variation is provided. Examining HLA variant disease associations in this large data set allows comprehensive definition of disease associations to drive further mechanistic insights.


The major histocompatibility complex (MHC) and human leukocyte antigen (HLA) genes have been extensively studied because of their key role in immune responses (1). Despite containing only a small fraction of all human genes, the MHC contains the greatest number of significant associations from genome-wide association studies (GWAS) relative to any other region in the genome (2). In GWAS alone, single-nucleotide polymorphisms (SNPs) in the MHC have been associated with more than 100 different phenotypes according to the National Human Genome Research Institute’s GWAS catalog (3). Four-digit HLA alleles, which characterize MHC proteins at the amino acid level, and key HLA amino acid residues are often found to have critical roles in immune-mediated disease. This HLA variation can be now imputed with high accuracy from SNP-level data (46). Previous investigations including GWAS have used a phenotype-to-genotype approach, which involves identifying individuals with a specific phenotype first and then determining genetic associations with that phenotype. The broader question of the influence of HLA variation in these genes across multiple human disease phenotypes has not been comprehensively investigated (7).

Here, we report the results of a large phenome-wide association study (PheWAS) that systematically examined relationships between variation in the MHC and a broad set of human disease phenotypes. PheWAS using electronic health record (EHR) systems have replicated hundreds of previous GWAS associations (including many associations with MHC SNPs), identified new associations, and have been used in other applications, such as an evaluation of the phenotypic legacy of Neandertals in modern humans (713). We conducted a PheWAS on European ancestry individuals genotyped on the Illumina HumanExome BeadChip and tested the association of imputed HLA variants with 1368 EHR phenotypes. Populations were derived from two institutions with DNA biobanks coupled to deidentified EHRs: the Vanderbilt University Medical Center’s BioVU (n = 28,839) and the Marshfield Clinic’s Personalized Medicine Research Project (PMRP; n = 8431). The results are freely available as a web catalog of HLA risk loci and their associated phenotypes at


Primary PheWAS analysis

A total of 99 four-digit HLA alleles, 77 two-digit alleles, and 988 nonsynonymous variants were imputed using SNP2HLA (6) and included in the analysis (Table 1). In a subset of 3152 BioVU individuals of European ancestry who also underwent direct HLA sequencing, the overall concordance rate and call rate for SNP2HLA-imputed four-digit alleles compared to sequencing were 96.93 and 99.98%, respectively. Using a false discovery rate (FDR) adjustment of q < 0.10 in the discovery cohort (BioVU) and P < 0.05 in the replication cohort (PMRP), a total of 104 significant associations with the same direction of effect were identified between four-digit HLA alleles and the 1368 phenotypes tested (Fig. 1 and fig. S1). A total of 284 four-digit allele associations were significant in BioVU before replication in PMRP. The 104 significant associations were observed with 29 distinct phenotypes and 29 unique alleles. After meta-analysis of associations in BioVU and PMRP, there were 12 four-digit allelic associations with a meta-analysis P ≤ 1.9 × 10−25 (other associations are presented at; Table 2). The strongest was observed between type 1 diabetes (T1D) and HLA-DQB1*03:02 [odds ratio (OR), 3.62; meta-analysis P = 1.04 × 10−54]. In the analysis of two-digit alleles, 92 significant associations were observed with 24 distinct phenotypes and 25 unique alleles. A total of 205 two-digit allele associations were significant in BioVU before replication in PMRP. There were nine meta-analysis associations with P < 1.9 × 10−25 (other associations are presented at; Table 2), and the strongest was between HLA-B*27 and ankylosing spondylitis (AS; OR, 32.48; meta-analysis P = 3.06 × 10−45). In the analysis of nonsynonymous variants, a total of 1759 phenotype nonsynonymous variant associations were observed with 53 distinct phenotypes and 467 unique amino acid substitutions. A total of 3862 nonsynonymous variant associations were significant in BioVU before replication in PMRP. The strongest association observed was between AS and an asparagine substitution at position 97 of HLA-B (OR, 33.25; meta-analysis P = 1.17 × 10−45). Quantile-quantile (QQ) plots for association analyses are presented in figs. S2 to S4. Results acquired using the additive genetic model were similar to results acquired using the dominant model.

Table 1. Number of alleles or variants observed by HLA gene in BioVU.

NS, nonsynonymous variant.

View this table:
Fig. 1. Association of all imputed four-digit HLA alleles with PheWAS phenotypes by HLA locus.

HLA alleles were imputed from the HumanExome BeadChip using SNP2HLA. Strength of the association is plotted along the y axis as −log(P), and phenotypes are represented on the x axis categorized by the PheWAS code group. Bronch., acute bronchitis and bronchiolitis; CeD, celiac disease; Cholang., cholangitis; CircDis, circulatory disease; EarDis, degenerative and vascular disorders of the ear; Derm, dermatomyositis; DM, diabetes mellitus; DR, diabetic retinopathy; CTDis, diffuse diseases of connective tissue; CNDis, disorders of other cranial nerves; GD, Graves’ disease; HypoTh, hypothyroidism; Ileost., ileostomy status; IBD, inflammatory bowel disease and other gastroenteritis and colitis; JuvRA, juvenile rheumatoid arthritis; LOC, lack of coordination; Tooth loss, loss of teeth or edentulism; Lupus, localized and systemic lupus; MS, multiple sclerosis; CNSDis, other demyelinating diseases of central nervous system; Endocrine, other endocrine disorders; Spondyl., other inflammatory spondylopathies; PR, polymyalgia rheumatica; Wound inf., posttraumatic wound infection; PBC, primary biliary cirrhosis; Psoriasis*, psoriasis and related disorders; PsoriasisV, psoriasis vulgaris; RA, rheumatoid arthritis; RA*, rheumatoid arthritis and other inflammatory polyarthropathies; 2° neopl., secondary malignant neoplasm of digestive systems; SS, Sicca syndrome; SLE, systemic lupus erythematosus; T1DK, T1D with ketoacidosis; T1DN, T1D with neurological manifestations; T1DO, T1D with ophthalmic manifestations; T1DR, T1D with renal manifestations; Thyrotox., thyrotoxicosis with or without goiter; Uveitis, noninfectious or not otherwise specified (Uveitis).

Table 2. Most significant associations in two- and four-digit HLA allele PheWAS analysis.

PMRP, Personalized Medicine Research Project.

View this table:

Table 3 compares major HLA genetic associations for common autoimmune/inflammatory diseases based on previous literature and the association results from the PheWAS analysis in BioVU. The major association for each disease was highly significant with effect sizes similar to those previously published, suggesting the validity of the approach and results. Further evaluation of PheWAS results compared to previous associations is presented in the Supplementary Materials (see Materials and Methods).

Table 3. Major HLA genetic determinant for specific autoimmune/inflammatory diseases in the BioVU PheWAS analysis.

Major HLA genetic determinants for specific autoimmune/inflammatory diseases and their association results from the PheWAS analysis. ORs, 95% confidence intervals (CIs), and P values were generated using logistic regression adjusted for age, gender, and two PCs. MZ, monozygotic.

View this table:

The results also provide supportive evidence for associations previously reported but not yet widely accepted. For instance, one of the strongest associations observed with MS was for a nonsynonymous variant at position 71 in HLA-DRB1 (OR, 2.66; meta-analysis P = 3.54 × 10−16), consistent with previous reports (14, 15). Similarly, cervical cancer has been linked with class II HLA alleles and class II HLA haplotypes (16, 17), and we observed a significant association between “cervical cancer and dysplasia” and nonsynonymous variation at position 37 of HLA-DRB1 (OR, 1.43; meta-analysis P = 7.22 × 10−7).

Previously unidentified associations

This PheWAS also identified associations between immune-mediated diseases and nonsynonymous variants where only a four- or two-digit allele association was previously reported. For instance, several HLA alleles have been associated with increased risk for Sicca syndrome (SS) in European populations, including HLA-DQB1*02:01 (18, 19), but the association of nonsynonymous variants with SS is largely unknown (20). This PheWAS identified HLA-DQB1*02:01 as the strongest HLA risk factor in our population and also implicated nonsynonymous variation at position 77 in HLA-DRB1 with SS susceptibility. Similarly, uveitis is known to be associated with HLA-B*27 (21), and the PheWAS now implicates nonsynonymous variation at position 114 of HLA-B with uveitis susceptibility. Notably, this nonsynonymous variant association with uveitis was observed to have a higher OR and stronger association than the four-digit allele. These results may clarify the driving association for a given phenotype is a two- or four-digit allele, or a nonsynonymous variant, based on the strength of association, and the work supports further attempts to identify critical and more shared peptide residues among autoimmune diseases.

The PheWAS also identified associations with diseases for which evidence for predisposing HLA variation is minimal or has not been previously reported. These were primarily with nonsynonymous variation (Table 4) and were observed across organ systems (neurologic, gastrointestinal, genitourinary, endocrine, infectious, oncologic, neurologic, cardiologic, and respiratory conditions). The large numbers of subjects studied also allowed evaluation of “subphenotypes” such as those indicating specific T1D complications or manifestations. In this analysis, a stronger OR was consistently observed for the T1D subphenotype than for the T1D phenotype (Table 5). For instance, the OR for the association between HLA-DQB1*03:02 and T1D (OR, 4.42) was reduced compared to that between HLA-DQB1*03:02 and T1D with ketoacidosis (OR, 7.09), T1D with neurological manifestations (OR, 6.26), T1D with ophthalmic manifestations (OR, 7.33), and T1D with renal manifestations (OR, 6.66). Similar significant associations were observed for phenotypes indicating more severe manifestations of disease, including nonsynonymous variant associations with psoriatic arthropathy compared to psoriasis (Table 5).

Table 4. Previously unidentified associations with HLA variation.

aa, amino acid.

View this table:
Table 5. Associations with T1D and psoriasis and their disease complications.

manif., manifestations.

View this table:

Pleiotropic associations

For four- and two-digit HLA alleles, we found consistent evidence of association with multiple phenotypes. Figure 2 represents a heatmap of four-digit HLA allele associations. To determine cross-phenotype associations potentially driven by pleiotropic effects, we implemented the trait-based association test (TATES) (22). TATES was implemented for all phenotypes that had more than 40 cases and crossed the FDR significance threshold for any four-digit HLA allele (MAF > 1%) in BioVU. We found 60 HLA alleles with significant (P < 0.05) evidence for cross-phenotypic effects with the strongest evidence for HLA-B*27:05 (TATES P = 7.76 × 10−68), which was associated with AS and uveitis (table S1). Among four-digit alleles associated with multiple phenotypes, HLA-DQA1*01:02 and HLA-DQB1*02:01 were significantly associated with eight phenotypes, the largest number of associations with any allele in our data set. The most significant results of the TATES analysis and the HLA alleles with multiple phenotypic associations are presented in table S1. Shared HLA signatures have been previously observed for T1D (2325), and T1D was associated with most of the alleles that had the strongest evidence of cross-phenotypic associations. The HLA-DQB1*02:01 and HLA-DRB1*03:01 alleles shared associations with T1D, SS, systemic lupus erythematosus, and celiac disease. Shared associations were also common among the nonsynonymous variants. For instance, substitutions at position 26 of HLA-DRB1 were significantly associated with an increased risk of T1D, RA, celiac disease, SS, and diffuse disease of connective tissue.

Fig. 2. Heatmap of four-digit HLA allelic associations with P < 1 × 10−5.

Red indicates an increased OR, and blue indicates a decreased OR. Phenotypes and HLA alleles are clustered by similarity based on association results. Carats indicate groupings that were present in more than 90% of 1000 bootstrap simulations, calculated from sampling with replacement from the actual association results.

To further refine our models of pleiotropy, we conducted conditional analyses of phenotypes in a pairwise fashion. The purpose of the conditional analysis was to determine whether the association of multiple phenotypes with the same allele was due to co-occurrence of phenotypes rather than independent association of both phenotypes with the allele. This analysis showed that HLA associations with T1D, RA, MS, SS, and celiac disease were independent. Notably, the HLA associations with RA and polymyalgia rheumatica appeared to be independent in conditional analyses, as did HLA associations with AS and uveitis in addition to T1D and diabetic retinopathy. However, conditional analyses suggested that the optic neuritis, “other demyelinating diseases of central nervous system,” and “lack of coordination” phenotypes are more strongly associated with MS than with HLA variation; that is, these associations were driven by an association with MS.

To further understand the relationships between HLA loci, alleles, and phenotypes, we conducted a network analysis using nodes to represent PheWAS codes, HLA genes, and HLA alleles with links representing associations that were significant at P < 1 × 10−5. HLA genes and their alleles are further organized into a hierarchical structure by linking alleles to HLA genes. A total of 135 PheWAS analysis associations between four-digit HLA alleles and PheWAS codes were observed in the BioVU data set. The majority of the most significant associations (at P ≤ 1 × 10−15) are among T1D with HLA-DQB1*02:01, HLA-DRB1*03:01, and HLA-DRB1*03:01; T1D/RA with HLA-DQA1*03:01 and HLA-DRB1*01:02; and T1D/MS with HLA-DQB1*06:02 and HLA-DRB1*04:01 but with the opposite (protective) effect on MS. Figure 3 presents an undirected graph with nodes representing PheWAS codes, HLA genes, and four-digit HLA alleles. Nodes with more significant associations and that share more connections are topologically clustered together. The layout is intuitive to visualize the complex PheWAS associations where clusters of PheWAS codes stay closer due to shared genetic associations and vice versa. The genes HLA-DRB1, HLA-DQB1, and HLA-DQA1 cluster together strongly with autoimmune diseases, with T1D and T1D subphenotypes at the focus. An interactive version of Fig. 3 is available at

Fig. 3. Network diagram of 171 PheWAS analysis associations between four-digit HLA alleles and PheWAS codes at P < 1 × 10−5.

Large circular nodes indicate HLA genes, small circular nodes represent four-digit HLA alleles, and colored squares represent PheWAS phenotypes. Lines indicate an association, and distance between nodes indicates the strength of the association by P value. Dashed lines represent a protective effect (OR, <1), and solid lines indicate a risk effect (OR, >1). The shade of the lines indicated three different levels of statistical significances based on P values, with the darkest level for the most significant associations at P < 1 × 10−15, the intermediate level for the significances at P value between 1 × 10−8 and 1 × 10−15, and the light level for those at P < 1 × 10−5. An interactive version is available at EndoDis, other endocrine disorders; HypoTh*, hypothyroidism NOS; Ileostomy, ileostomy status; T1Dins, insulin pump user; CervixDis, noninflammatory disorders of cervix; UC, ulcerative colitis.


We conducted a systematic test of phenomic associations with HLA variation. Significant associations with large effect sizes were replicated in autoimmune conditions well represented in the population, such as T1D, AS, MS, and RA, providing strong evidence of the validity of the PheWAS approach and HLA imputation. We find evidence for several previously undescribed associations with MS and cervical cancer. Previously unidentified associations were also observed between HLA variation and multiple phenotypes, including gastrointestinal hemorrhage, nodular lymphoma, cancer of the brain and nervous system, cervical cancer and dysplasia, kidney transplant, partial epilepsy, and atherosclerosis of the extremities. We found consistent evidence for pleiotropy between HLA variation and common autoimmune diseases, especially for class II HLA alleles associated with T1D and other autoimmune diseases. Our analysis confirms the ability of PheWAS to identify multiple associations from well-powered studies. In each of the common diseases studied, the most significant known HLA contributor to disease was identified as a risk factor in this study, suggesting the validity of the results. Our results also provide a publicly available, comprehensive set of HLA risk loci for validation of previous findings and hypothesis generation for future research.

The strongest observed associations in our study were with T1D, in keeping with the fact that T1D is a common, dependably diagnosed disease and 40 to 50% of T1D heritability has been attributed to HLA variability (26). Genetic influences on T1D risk include a hierarchy of predisposing and protective HLA haplotypes, with the greatest risk conferred by the DR3-DQ2 and DR4-DQ7/8 haplotypes, especially in patients heterozygous for both haplotypes (27, 28). These haplotypes are well represented among our observed associations. The strongest association observed overall was for a nonsynonymous variant at position 57 of HLA-DQB1, a locus that was identified early on as a susceptibility locus for T1D (29, 30). Our data support this observation and the notion that class II HLA allelic associations for T1D are a reflection of underlying amino acid substitutions involved directly in autoimmune etiology. Here, the nonsynonymous 57 HLA-DQB1 variant was a stronger association than any other HLA variation for T1D, suggesting that the nonsynonymous 57 HLA-DQB1 variant is the driver of the association, which is in line with previous observations (29). Our observations were consistent with previous disease models for a number of other autoimmune diseases, such as RA, AS, MS, celiac disease, psoriasis, primary biliary cirrhosis, and SS.

This is one of the first systematic studies of clinical phenotypes with HLA variation in a single population. Previous studies investigating phenotype associations with HLA variation have identified similar associations through the recruitment of specific case/control cohorts. The availability of extensive phenotype data enables an in-depth investigation of pleiotropy. This extensive phenotype data also allow investigation of subphenotypes, such as T1D with various manifestations of the disease. A larger OR was consistently observed for the T1D subphenotype than for the T1D phenotype. This might indicate a stronger HLA association with T1D disease severity or potentially a phenotype with less misclassification. Finally, we consider the paucity of observed associations for nonautoimmune phenotypes to be interesting and informative. The relatively small number of associations with infectious diseases, psychiatric disorders, or oncologic disorders could be due to the lack of a significant biological effect of HLA variation on these disorders or a lack of power for these relatively uncommon phenotypes within our data set in the presence of multiple comparisons.

One limitation of the work is that the observed strength of association depends on the accuracy of the International Classification of Diseases, Ninth Revision (ICD9), code diagnosis, accuracy of imputation, and frequency of both diagnosis and allele in our population. Any misclassification error should underestimate the magnitude of effect size estimates, thus leading to a bias toward a null result. PheWAS has been previously shown to accurately replicate diverse phenotype-genotype associations (8, 12), and data presented here replicate many previous reports of HLA associations, as noted above. Similarly, our results depend on the accuracy of prediction of HLA alleles and nonsynonymous variants based on SNP data from the HumanExome BeadChip. Previous literature suggests that HLA alleles can be imputed accurately (46, 31), and alleles were imputed with high accuracy compared to sequence data in a subset of our population. Rare alleles were not included here, and the study does have limited power to detect significant associations for phenotypes with limited sample sizes in our population. Although the total number of individuals within our study is large, the number of individuals with specific phenotypes was relatively small, which limits our ability to detect associations with modest effect sizes. Additional diagnostic information that would add specificity, such as anti–cyclic citrullinated peptides antibodies in RA, was not used in this study. Finally, we encountered challenges in the selection of appropriate methods for multiple-comparison adjustment that minimized false-positive results while accounting for strong linkage disequilibrium in the HLA region. Such challenges in defining statistical significance with extensive correlations in the genotypes and phenotypes have likely resulted in some false-positive associations.

We provide large-scale replication of phenotypic associations with variation in the HLA region. Our observations are valuable because they add evidence to known and uncertain HLA disease associations and may help uncover HLA variation that drives a given immune-mediated phenotype and identify key residues that provide mechanistic insights. The associated type of HLA variation (that is, four-digit allele versus nonsynonymous variant) provides clues to the molecular and structural basis of disease. For some diseases, variation in HLA may be necessary but not sufficient, and in some cases, HLA alleles may be protective with effects that outweigh risk alleles. A two-digit versus four-digit association may point to the serotype, the HLA protein that is expressed from the respective HLA gene and is detected by the respective antibody, more broadly associated with general phenotypes of diseases and may relate to shared peptide binding specificities. Locations of nonsynonymous variants in peptide binding domains or killer cell immunoglobulin-like receptor interaction sites may also inform disease mechanisms. Our results are freely available and provide a comprehensive set of HLA risk loci for additional research, allowing an in-depth evaluation of the underlying genetic architectures for a range of diseases.


Study design

The objective of our study was to determine the associations between variation in the HLA region and a broad set of phenotypes. We used a PheWAS approach, which tests the association of genomic variation with multiple phenotypes (8). Multiple phenotypes for each individual were derived from deidentified EHR data. HLA variation was imputed using SNP-level data from GWAS platforms using two large biobanks that link DNA extracted from discarded blood samples to deidentified EHRs. The discovery analysis was performed using BioVU, the Vanderbilt DNA databank, and the replication analysis was performed in the Marshfield Clinic’s PMRP. All individuals with sufficient EHR and genotype data (described below) from the replication and discovery cohorts were included in a nonrandomized and nonblinded fashion.

Patient populations

The discovery cohort included a population from BioVU, the Vanderbilt DNA databank that links DNA extracted from discarded blood samples to deidentified EHRs (32). BioVU accrues DNA samples extracted from blood drawn for routine clinical testing. A full description of this resource and its associated ethics, privacy, and other protections has been previously published (32). At the time of this study, BioVU contained about 210,000 patients with DNA, and 28,839 self-identified white/non-Hispanic patients with available Illumina HumanCoreExome BeadChip genotyping were included. The mean age of the individuals was 55.4 years, and 53% were female. BioVU operated on an opt-out basis until January 2015 and on an opt-in basis since (32). The phenotypic data in BioVU are all deidentified, and the study was designated “nonhuman subjects” research by the Vanderbilt Institutional Review Board.

The replication cohort consisted of self-identified white/non-Hispanic Marshfield Clinic patients recruited into the PMRP, as described previously (33). PMRP represents a homogeneous population, with 77% of participants claiming German ancestry. In this PheWAS, 8431 patients were included for independent validation of BioVU signals. PMRP discovery set participants were all more than 40 years of age, with more than 30 years of EHR data on average, and have been genotyped by the Illumina HumanExome BeadChip.

Genotyping and HLA imputation

All samples were genotyped using the Illumina HumanExome BeadChip, which contains 2061 HLA tag SNPs. Before imputation, SNP data were cleaned using the quality control pipeline developed by the eMERGE Genomics Working Group (34, 35). Classical two- and four-digit HLA alleles and HLA nonsynonymous variants were imputed from SNP data on the HumanExome BeadChip using SNP2HLA (6) with the Type 1 Diabetes Genetics Consortium reference panel, a marker window size of 1000, and a posterior probability (gprob) threshold of 0.5. We restricted our analysis to individuals of at least 90% European descent, as defined using 3242 ancestry informative markers (AIMs) from the HumanExome BeadChip input into STRUCTURE using HapMap reference populations (36). Identity-by-descent analysis was performed, and related individuals (parent-child and full sibling) were removed from the data set. PCs analysis was performed on AIMs using the EIGENSTRAT method implemented in PLINK (37, 38). To minimize confounding by population stratification, association analyses were adjusted for PCs 1 and 2. We imputed individual dosages for classical four-digit alleles, two-digit alleles, and polymorphic nonsynonymous variants at HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, and HLA-DRB1 from 28,839 individuals of genetically determined European ancestry. All HLA alleles and nonsynonymous variants with MAFs greater than 0.01 were included in the analysis.

To confirm accuracy of HLA imputation for four-digit alleles, high-resolution, four-digit HLA sequencing was performed in 3152 subjects from the Vanderbilt Electronic Systems for Pharmacogenomic Assessment cohort (39) for HLA-A, HLA-B, HLA-C, HLA-DR, HLA-DP, and HLA-DQ at the Institute for Immunology and Infectious Diseases (IIID) at the Murdoch University in Perth, Australia. The IIID is accredited by the American Society for Histocompatibility and Immunogenetics and the National Association of Testing Authorities and is an established center for sequencing the HLA loci and participating in large, multicenter trials that use HLA sequence data (40, 41). All samples that were successfully typed were included in the study population. Concordance rates and call rates for four-digit HLA alleles were then calculated for SNP2HLA compared to direct HLA sequencing results. A detailed description of the HLA sequencing methods is available in the Supplementary Materials.

PheWAS analysis

Phenotypes were defined by patient EHR data, as previously described (8, 11, 13). The ICD9 coding system describes diseases, signs and symptoms, injuries, poisonings, procedures, and screening codes (42). We translated the ICD9 data into the custom “case phenotypes” (phecodes) and their algorithm-defined control groups using version 1.1 of the phecode mappings (available from (8). Phecodes aggregate codes so that their granularity is appropriate for research purposes, including large-scale genomic studies (11). We used a minimum code count threshold of two distinct days to define a case for a phenotype. We tested only phenotypes with at least 40 cases, including a total of 1368 phenotypes, and used a dominant genetic model adjusted for age, gender, and PCs 1 and 2.

In the primary-discovery PheWAS analysis, a total of 99 four-digit HLA alleles, 77 two-digit HLA alleles, and 988 nonsynonymous variants were tested across 1368 phenotypes that had at least 40 cases. Phenotypes reflected diagnoses extracted from the EHR and included a broad range of diseases grouped into the following categories: infectious, neoplastic, endocrine and metabolic, hematopoietic, psychiatric, neurologic, cardiovascular, pulmonary, digestive, genitourinary, dermatologic, musculoskeletal, symptoms and signs, and injuries.

In the discovery cohort, we used the FDR method to correct for multiple comparisons in association tests by variant type (two- or four-digit HLA allele, nonsynonymous variant) (43). We declared statistical significance at a q value cutoff of 0.10 in the discovery cohort (BioVU) combined with P < 0.05 in the replication cohort (PMRP). All predicted HLA alleles and nonsynonymous variants were tested for the association with phenotypes derived from the EHR in a dominant model adjusted for age, gender, and PCs 1 and 2. Although the dominant genetic model was used for the primary analysis, an additive model was also performed for comparison. We conducted association tests using logistic regression to predict case/control status based on the HLA allele, age, sex, and a categorical ascertainment variable based on the experiment under which the genotyping was originally performed. Fixed-effects meta-analysis was used to determine the overall effects of HLA variation in both discovery and replication cohorts. The analysis was conducted using PLINK and Perl scripts, and graphics were generated using the R PheWAS package (11, 44).

Pleiotropy assessment and network visualization of PheWAS associations

To assess the pleiotropic or cross-phenotypic associations for four-digit HLA alleles in the discovery cohort, we first calculated the number of phenotypes significantly associated with each allele. We then performed conditional analysis when multiple significant associations with the same allele were present, adjusting for the second phenotype in a logistic regression to predict the first phenotype. The purpose of the conditional analysis was to determine whether the association of multiple phenotypes with the same allele was due to the co-occurrence of phenotypes rather than independent association of both phenotypes with the allele. For instance, the association of an allele with both multiple sclerosis and the lack of coordination phenotype might suggest that co-occurrence of lack of coordination and MS is driving this association. To adjust for this effect, we performed regressions adjusted for the second phenotype whenever an allele was significantly associated with multiple phenotypes. If an allelic association became nonsignificant after this adjustment, we concluded that the allelic association was due to co-occurrence with another phenotype rather than a true association of the allele.

TATES (22) was conducted using R ( for all phenotypes that had more than 40 cases and crossed the FDR significance threshold for any four-digit HLA allele (MAF > 1%) in BioVU. TATES combines the P values obtained in single-marker associations to determine a global P value while correcting for observed correlation structure among phenotypes (22, 45). The TATES procedure affords improved power to detect a combined effect for a group of correlated phenotypes. Thereby, we examined the significance of the TATES result to indicate a pleiotropic effect. All pleiotropy analyses were performed on a data set pruned of redundant phenotypes, including phecode subphenotypes that indicated specific T1D complications such as T1D with ketoacidosis, neurological manifestations, ophthalmic manifestations, or renal manifestations. To minimize co-occurrence of related phenotypes in the TATES analysis, we removed redundant phenotypes related to complications (for example, “T1D with renal manifestations”) and broad versus narrow phenotype classifications (for example, “hypotension” was pruned, whereas “orthostatic hypotension” was preserved). We also eliminated sex-specific phenotypes such as “dysplasia of female genital organs.”

To visualize the complex many-to-many relationships between phenotypes and HLA genes from the PheWAS analysis, we generated both a heatmap and a network map for four-digit allelic associations. The heatmap was generated using the ORs of associations between four-digit HLA alleles and phenotypes. Heatmaps were generated with the heatmap.2 function in the R package gplots, and colors were chosen from the R package RColorBrewer. All phenotypes and HLA alleles with at least one association with P < 1 × 10−5 were included in the heatmap. Dendrograms of phenotypes and HLA alleles clustering were generated with heatmap.2 and used Euclidean distance. To determine which clusters represented actual structure in the data set, a bootstrap analysis was performed. For both the HLA allele and phenotype clustering, 1000 simulations were performed, sampling with replacement from the actual data set for each simulation. Clusters that were present in more than 90% of the simulations were identified and indicated in the heatmap.

For the network map, we used R package igraph 1.0.0 and constructed a graph representation that maps associations between phenotypes and HLA genes/alleles (46). In this network map, we use nodes to represent either PheWAS codes or HLA genes/alleles, and links between HLA alleles and PheWAS codes represent the significant associations between four-digit HLA alleles and PheWAS codes that are considered significant at α = 1 × 10−5. In addition, HLA genes and their alleles are further organized into a hierarchical structure by linking all alleles to its HLA genes. A force-directed layout is then applied to the graph to stimulate the network graph as a physical system, so links act as springs that attract connected nodes closer together and are weighted by P values; thus, more significantly associated nodes have stronger attraction forces to stay closer. As a result, nodes with more significant associations and share more connections are topologically clustered together. The layout is intuitive to visualize the complex PheWAS associations where clusters of PheWAS codes stay closer due to shared genetic associations and vice versa.


Materials and Methods

Fig. S1. PheWAS plot of all four-digit HLA allele associations.

Fig. S2. QQ plot of associations from four-digit HLA allele PheWAS analysis.

Fig. S3. QQ plot of associations from two-digit HLA allele PheWAS analysis.

Fig. S4. QQ plot of associations from nonsynonymous HLA variant PheWAS analysis.

Table S1. Most significant results from TATES.

References (4765)


Funding: The data set used in the analyses described was obtained from the Vanderbilt University Medical Center’s BioVU, which is supported by institutional funding, and by the Vanderbilt Clinical and Translational Science Award (CTSA) grant ULTR000445 from the National Center for Advancing Translational Sciences (NCATS)/NIH. HumanExome BeadChip genotyping was supported institutionally. The PheWAS method and website deployment are supported by R01 LM010685. BioVU is supported by institutional funding and by the Vanderbilt CTSA grant UL1 TR000445 from the NCATS/NIH. Marshfield investigators and the PMRP were supported by the generous donors of the Marshfield Clinic, NIH NCATS grant number UL1TR000427, National Human Genome Research Institute grant number 1U01HG006389, National Institute of General Medical Science grant number 1R01GM114128, and National Library of Medicine grant number 1K22LM011938. This work was supported by the NIH, including P50 GM115305, U19 HL65962, P30 AI110527, 1R01AI103348, and 5T32AI007474. J.H.K. receives support from the American Heart Association (16SDG29090005 and 15POST22660017), the Vanderbilt University Medical Center Clinical Pharmacology Training grant (T32 GM07569), and the American College of Clinical Pharmacy Research Institute (Futures grant). S.R. is supported by 5U01GM092691-04 and 1R01AR062886-01. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Author contributions: J.H.K., L.B., S.R., D.M.R., E.J.P., and J.C.D. designed the study and interpreted the results. L.B., C.M.S., D.M.R., and J.C.D. coordinated and managed the genotype and phenotype data ascertainment in the BioVU discovery cohort. Z.Y., J.G.M., M.H.B., and S.J.H. coordinated and managed the genotype and phenotype data ascertainment in the PRMP replication cohort. S.G., S.M., and E.J.P. performed the HLA typing and analysis of HLA data. L.B., C.M.S., Y.X., A.M.G., J.D.M., and S.Z. performed the statistical and bioinformatic analyses. J.H.K. drafted the manuscript. All authors contributed to the final version of the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: The results are freely available as a web catalog of HLA risk loci and their associated phenotypes at These data will also be available in the Database of Genotypes and Phenotypes (dbGaP).
View Abstract

Stay Connected to Science Translational Medicine

Navigate This Article