Research ArticleDrug Development

# The druggable genome and support for target identification and validation in drug development

See allHide authors and affiliations

Science Translational Medicine  29 Mar 2017:
Vol. 9, Issue 383, eaag1166
DOI: 10.1126/scitranslmed.aag1166

## An organized way to drug the genome

Many drugs that are already approved for specific diseases have known protein targets, which may be relevant for other disease types as well. In addition, a systematic way of identifying druggable genes in various diseases should help streamline the process of developing new drugs for these targets, even if no specific drugs are available for them yet. Finan et al. designed a computational approach to do this, combining data from numerous existing genome-wide association studies to identify druggable proteins, connect them with known drugs where available, and facilitate the design of new targeted therapeutics.

## Abstract

Target identification (determining the correct drug targets for a disease) and target validation (demonstrating an effect of target perturbation on disease biomarkers and disease end points) are important steps in drug development. Clinically relevant associations of variants in genes encoding drug targets model the effect of modifying the same targets pharmacologically. To delineate drug development (including repurposing) opportunities arising from this paradigm, we connected complex disease- and biomarker-associated loci from genome-wide association studies to an updated set of genes encoding druggable human proteins, to agents with bioactivity against these targets, and, where there were licensed drugs, to clinical indications. We used this set of genes to inform the design of a new genotyping array, which will enable association studies of druggable genes for drug target selection and validation in human disease.

## INTRODUCTION

Only 4% of drug development programs yield licensed drugs (1, 2), largely because of two unresolved systemic flaws: (i) Preclinical experiments in cells, tissues, and animal models and early-phase clinical testing to support drug target identification and validation are poorly predictive of eventual therapeutic efficacy and (ii) definitive evidence of the validity of a new drug target for a disease is not obtained until late-phase development [in phase 2 or 3 randomized controlled trials (RCTs)]. Reasons for poor reliability of preclinical studies include suboptimal experimental design with infrequent use of randomization and blinding (3), species differences, inaccuracy of animal models of human disease (4, 5), and overinterpretation of nominally significant experimental results (68). Human observational studies can mislead for reasons of confounding and reverse causation. Evidence of target validity from phase 1 clinical studies can also be inadequate (because phase 1 studies primarily investigate pharmacokinetics and tolerability, are typically small in size, are of short duration and measure a narrow range of surrogate outcomes, and are often of uncertain relevance to perturbation of the target of interest) (9). Because the target hypothesis advanced by preclinical and early-phase clinical studies is all too frequently false, expensive late-stage failure in RCTs from lack of efficacy is a common problem affecting many therapeutic areas (10), posing a threat to the economic sustainability of the current model of drug development.

Genetic studies in human populations can imitate the design of an RCT without requiring a drug intervention (1113). This is because genotype is determined by a random allocation at conception according to Mendel’s second law (Mendelian randomization) (12, 14). Single-nucleotide polymorphisms (SNPs) acting in cis (variants in or near a gene that associate with the activity or expression of the encoded protein) can therefore be used as a tool to deduce the effect of pharmacological action on the same protein in an RCT. Numerous proof-of-concept examples have now been reported (11, 13, 1519), including the marked correlation between 80 circulating metabolites’ association with a SNP in the HMGCR gene that encodes the target for statin drugs and the effect of statin treatment on the same set of metabolites (20). SNPs acting in cis are a general feature of the human genome (21), and population and patient data sets with stored DNA and genotypes linked to biological phenotypes and disease outcome measures are now widely available for this type of study.

By extension, disease-associated SNPs identified by genome-wide association studies (GWAS) could be explicitly interpreted as an underused source of randomized human evidence to aid drug target identification and validation. For illustration, loci for type 2 diabetes identified by GWAS include genes encoding targets for the glitazone and sulphonylurea drug classes already used to treat diabetes (22, 23). Apparently, sporadic observations such as this suggest that numerous, currently unexploited disease-specific drug targets should exist among the thousands of other loci identified by GWAS and similar high-quality genetic association studies. Recent studies of advanced or completed drug development programs (mostly based on established approaches to target identification) have also indicated that those with incidental genomic support had a higher rate of developmental success (2427).

Fulfilling the potential of GWAS (and studies using disease-focused genotyping arrays) for drug development requires mapping disease- or biomarker-associated SNPs to genes encoding druggable proteins and to their cognate drugs and drug-like compounds. The set of proteins with potential to be modulated by a drug-like small molecule has been predicted on the basis of sequence and structural similarity to the targets of existing drugs, the set of encoding genes being referred to as the druggable genome. Hopkins and Groom (28) identified 130 protein families and domains found in targets of drug-like small molecules known at the time and more than 3000 potentially druggable proteins containing these domains. A similar estimate was made by Russ and Lampel (29), using a later human genome build. Kumar et al. (30) used these protein families (plus other families of particular relevance to cancer) to manually curate lists of druggable proteins for inclusion in the dGene data set. More recently, the Drug Gene Interaction database (DGIdb) has been developed (31), which integrates data from each of the previous efforts together with a recently compiled list of drug candidates and targets in clinical development (32) as well as information from the PharmGKB (33), Therapeutic Target Database (34), DrugBank (35) databases, and others.

However, earlier estimates of the druggable genome predated contemporary genome builds and gene annotations and also did not explicitly include the targets of biotherapeutics, which formed more than a quarter of the 45 new drugs approved by the U.S. Food and Drug Administration’s (FDA’s) Center for Drug Evaluation and Research in 2015 (36), reflecting their increasing importance in pharmaceutical development. We therefore updated the set of genes comprising the druggable genome. We then linked GWAS findings curated by the National Human Genome Research Institute and European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI) GWAS catalog (37) to this updated gene set, as well as to encoded proteins and associated drugs or drug-like compounds curated in the ChEMBL (38) and First Databank (FDB) (39) databases. We used the connection to explore the potential for genetic associations with complex diseases and traits for informing drug target identification and validation, as well as to repurpose drugs from one indication for another. In addition, to better support future genetic studies for disease-specific drug target identification and validation, we assembled the marker content of a new genotyping array designed for high-density coverage of the druggable genome and compared this focused array with genotyping arrays previously used in GWAS.

## RESULTS

### Redefining the druggable genome

We estimated 4479 (22%) of the 20,300 protein-coding genes annotated in Ensembl version 73 to be drugged or druggable. This adds 2282 genes to previous estimates made by Hopkins and Groom, Russ and Lampel, or Kumar et al., by inclusion of targets of first-in-class drugs licensed since 2005; the targets of drugs currently in late-phase clinical development; information on the growing number of preclinical phase small molecules with protein binding measurements reported in the ChEMBL database; as well as genes encoding secreted or plasma membrane proteins that form potential targets of monoclonal antibodies and other biotherapeutics. A set of 432 genes that was included in all other proposed druggable gene sets but not the DrugDev set consists mainly of olfactory receptors and phosphatases; both protein families have major limitations for future exploitation as drug targets (Fig. 1) (40, 41). We stratified the druggable gene set into three tiers corresponding to position in the drug development pipeline. Tier 1 (1427 genes) included efficacy targets of approved small molecules and biotherapeutic drugs as well as clinical-phase drug candidates. Tier 2 was composed of 682 genes encoding targets with known bioactive drug-like small-molecule binding partners as well as those with ≥50% identity (over ≥75% of the sequence) with approved drug targets. Tier 3 contained 2370 genes encoding secreted or extracellular proteins, proteins with more distant similarity to approved drug targets, and members of key druggable gene families not already included in tier 1 or 2 [G protein (heterotrimeric guanine nucleotide–binding protein)–coupled receptors (GPCRs), nuclear hormone receptors, ion channels, kinases, and phosphodiesterases]. A full list of genes is provided in table S1. An overview of the 15 most frequently occurring protein domain types for each tier can be found in table S2, based on the Pfam-A database of protein families.

### Connecting loci identified by GWAS to the druggable genome

We retrieved 21,406 associations from 2155 GWAS, of which 9178 surpassed the significance threshold of P ≤ 5 × 10−8. The retrieved associations spanned 315 Medical Subject Heading (MeSH) disease terms, which can be stratified into 24 MeSH root disease areas and 3 MeSH psychiatry and psychology areas (Table 1). Variants associated with common diseases and biomarkers had median minor allele frequency (MAF) of 0.29 [interquartile range (IQR), 0.21] based on a subset of 7387 records with risk allele frequency data, reflecting the preponderance of common variants on widely used genotyping arrays. The median odds ratio (OR) for studies of disease end points was 1.24 (IQR, 0.31) (based on the 3367 results with effect size data). We examined sequence ontology consequence types (42) of disease- and biomarker-associated variants and found most to be noncoding, mainly intronic, presumably altering or marking variants that alter mRNA expression or availability, or marking variants that alter structure or activity of encoded proteins (fig. S1).

Table 1. Count of GWAS published per disease area.
View this table:

Of the 9178 GWAS significant associations (P ≤ 5 × 10−8), 8879 mapped to 5084 unique intervals defined as containing all SNPs in linkage disequilibrium (LD) (with an r2 ≥ 0.5), with the SNP exhibiting the most significant association, applying an upper physical bound of 1 million base pairs (Mbp) on either side of this variant. The remaining 299 associations were either not in LD with any other variants or not present in the 1000 Genomes reference panel (phase 3 version). Such associations were assigned a nominal interval of 2.5 kilo–base pair (kbp) on either side of the variant. The frequency distribution of genes and druggable genes in such LD intervals was right skewed (Fig. 2), and there was a correlation between LD interval size and the number of resident genes (fig. S2).

Of the 5084 unique LD intervals, 1533 (30.2%) contained a single gene. Of these, 532 contained a gene from the druggable set: 233 from tier 1, 76 from tier 2, and 223 from tier 3. Of the remaining genomic intervals, 17.3% (880) mapped to intervals containing two genes, 10.1% (511) contained three genes, 6.7% (343) contained four genes, and 25.2% (1281) contained five or more genes. In addition, 536 (10.5%) regions had no gene in the LD interval. For the 1624 LD intervals containing two or more genes, of which at least one was druggable, the median distance of the closest druggable gene to the reported GWAS variant was 4.98 kbp (IQR, 37.7 kbp), where the distance was set to 0 bp for GWAS variants lying within a gene, and a druggable gene was among the two most proximal genes in 67.1% of these LD intervals (1089) (Fig. 3). We identified a total of 3052 genes in the druggable set that were not represented in any of the LD intervals corresponding to a GWAS association: 62.7, 69.2, and 71.6% of tiers 1, 2, and 3 genes, respectively.

We found that 1291 GWAS associations defined 1072 LD intervals containing 532 druggable genes from tier 1, which includes the targets of licensed drugs. Four hundred seventy-nine of the intervals contained a single drug target, and 593 contained two or more targets. For the set of LD intervals containing genes encoding the targets of licensed drugs, two clinically qualified curators blinded to the identity of the genes independently evaluated the correspondence between the disease association from the GWAS and the treatment indication(s) for drug(s) acting on the target(s) encoded by a druggable gene in the interval (Table 2). Our curators identified 56 unique associations (30 unique drug targets), where the treatment indication and genetic association were precisely concordant, and 13 associations (9 targets), where the indication and association came from the same disease area (for example, a GWAS in one form of epilepsy identifying a drug target for a different form of epilepsy). Ninety-seven associations (mapping to 37 licensed drug targets) corresponded to biomarkers known to be altered by treatment with the corresponding drug [for example, an LD interval containing the gene encoding the interleukin-6 receptor was identified in a GWAS of C-reactive protein, a biomarker altered by the action of the interleukin-6 receptor blocker tocilizumab (43)]. A further 76 associations (27 licensed drug targets) were identified through a genetic association with a mechanism-based adverse effect, such as in a GWAS of heart rate, where the SNP rs3143709 defined an LD interval containing the gene ACHE (acetylcholinesterase), encoding the target of cholinesterase inhibitors used in the treatment of myasthenia gravis, which have the side effect of lowering heart rate (44). A further 32 genetic associations (corresponding to eight targets) were with a quantitative trait that could be either a marker of therapeutic efficacy or a mechanism-based side effect, as in the case of QT interval in the context of antiarrhythmic drug therapy. In all, GWAS “rediscovered” 74 licensed drug targets through disease indications, mechanism of action, or mechanism-based adverse effects (the numbers for the categories above are nonadditive because some targets overlap categories). Illustrative examples of the curation are shown in table S3.

Table 2. Number of unique GWAS associations mapping to drug targets for licensed drugs curated according to the correspondence between the GWAS association and drug indication.
View this table:

Manual curation identified 1523 discordant pairings of drug indications and disease associations, corresponding to 144 drug targets that were interpreted as plausible repurposing opportunities (Fig. 4). After manual curation, uncertainty remained for 108 associations (52 targets) as to whether discordance represented a repurposing opportunity or an unrecognized mechanism-based side effect. The remaining targets of licensed drugs mapped to LD intervals corresponding to GWAS traits unlikely to be of therapeutic interest (for example, hair color) or to a genetic association with a new biomarker of uncertain biological function (such as a metabolite measured by a new metabolomics platform). Curators disagreed on the coding for GWAS associations corresponding to four licensed targets. For LD intervals corresponding to GWAS rediscoveries, the interval length was smaller and contained fewer genes, and the druggable gene was closer to the lead SNP than for those LD intervals where the indication and genetic association were discordant (table S4).

### Translational opportunities unveiled by the data linkage

Figure 5 and figs. S3 and S4 illustrate the result of mapping disease associations in the GWAS catalog to the full set of druggable genes, the encoded proteins, and compounds exhibiting binding affinity to these targets, regardless of development phase. For example, 84 studies in the GWAS catalog reported findings pertaining to cardiovascular system diseases (39 disease subcategories), reporting 388 GWAS associations, mapping to 228 unique LD intervals containing 670 genes, of which 135 were in the druggable set. Of these, 29 genes were either the solitary occupant or one of only a pair of genes in the LD interval. We linked all 135 druggable genes identified in the cardiovascular category to 19,844 compounds with measured activities in ChEMBL, 512 of which had a United States Adopted Name (USAN) International Nonproprietary Name (INN) or were in late-phase development, and 168 of which were previously licensed drugs. On the basis of comparisons between GWAS phenotype terms and treatment indications in the cardiovascular category, 8 drug target indications and genetic associations were concordant (target “rediscovery”) and 19 were discordant. Figure 6 illustrates the results of a similar mapping exercise for seven specific diseases (type 2 diabetes, hypertension, inflammatory bowel disease, asthma, coronary heart disease, schizophrenia, and Alzheimer’s disease). The proportion of druggable genes in LD intervals defined by GWAS SNPs for digestive system diseases [0.20; 95% confidence interval (CI), 0.12 to 0.27], neoplasms (0.15; 95% CI, 0.10 to 0.20), nervous system diseases (0.17; 95% CI, 0.10 to 0.24), cardiovascular diseases (0.20; 95% CI, 0.12 to 0.29), respiratory diseases (0.19; 95% CI, 0.08 to 0.31), skin and connective tissue diseases (0.17; 95% CI, 0.10 to 0.24), immune system diseases (0.19; 95% CI, 0.12 to 0.26), and mental health (0.16; 95% CI, 0.08 to 0.24) was similar to the proportion of druggable genes in the genome overall (4479/20,300 = 0.22).

### Coverage of the druggable genome by Illumina DrugDev and other widely used genotyping arrays

Capture of variation in druggable genes by the widely used genotyping arrays is illustrated in Fig. 7, with reference to the 1000 Genomes European superpopulation ancestry panels (45). Disease-focused genotyping arrays and whole-genome arrays with fewer than 600,000 SNPs used for many of the discoveries curated in the GWAS catalog provided less comprehensive capture of variation in the druggable genome than the more recently developed arrays with several million SNPs (such as the Illumina Human Omni 2.5 Exome 8 and Illumina Omni 5). However, because no array to date has been designed specifically to ensure capture of variation in genes encoding druggable targets, we designed the content for an array (the Illumina DrugDev array) using the Illumina Infinium platform, which combines genome-wide tag SNP content of the Illumina Human Core array with 182,375 bespoke markers in 4479 druggable genes. The median number of variants captured per kilo–base pair of the druggable genome was very similar to that of the Illumina Human Omni 2.5 Exome 8 and Illumina Omni 5 (Fig. 7 and figs. S5 and S6) with an average of around 2.5 SNPs per kilo–base pair of the druggable genome, at an average of nearly 50 variants per gene array wide, with even denser coverage of tier 1 and 2 genes.

With the exception of Illumina Omni products, all available genotyping arrays captured druggable genome variation most efficiently among populations of European descent and most poorly among populations of African descent (Fig. 7 and figs. S5 and S6). Outside of the European populations, the high-density Illumina Omni arrays gave superior coverage (for both directly genotyped variants and tagged variants) to all other genotyping arrays. The Affymetrix UK Biobank array displayed similar coverage to the Illumina DrugDev array in European populations but less complete coverage in non-European populations. A heat map summarizing the coverage for each druggable gene, stratified by tier and 1000 Genomes population groups, is shown in Fig. 8. Results for directly typed and tagged variants in 1000 Genomes subpopulations are shown in figs. S7 and S8, respectively.

## DISCUSSION

By first reestimating the boundaries of the druggable genome and then mapping biomarker and disease-associated loci from GWAS to genes encoding druggable targets, we have demonstrated the extent to which GWAS have already rediscovered target-disease indications or mechanism-based adverse effects of licensed drugs. These findings indicate the potential of genetic association studies to systematically and accurately identify disease-specific drug targets across the spectrum of human diseases, addressing one of the key productivity-limiting steps in drug development.

For example, we found substantial potential for repurposing of drugs with licensed indications from one disease area to another (Fig. 4), in keeping with previous analyses from the GWAS catalog that indicated that 17% of genes exhibit associations with more than one phenotype (46). We also mapped genetic associations to drug target and compound annotations in ChEMBL to evaluate the potential for progressing or repositioning compounds at earlier developmental stages (Fig. 5).

Despite the many therapeutic opportunities already arising from the mapping of existing genetic association findings to drug targets and compounds, there are strong reasons to suspect that the potential of this approach has yet to be maximized. Our analysis identified target-disease indication pairings (defined as a gene encoding a druggable target mapping to an LD interval containing a lead SNP from a GWAS) for 1427 of the 4479 druggable genes and 240 of the 652 genes encoding targets of licensed drugs. We might not have discovered associations for all genes in our druggable set because targets of drugs in development may truly play no role in any disease. However, alternative explanations are that only a fraction of diseases have been subjected to GWAS [451 of 3022 conditions (the denominator is based on the number of bottom-level MeSH disease areas)]; that for many of the diseases that have been investigated by GWAS, the sample sizes have been too small to detect all the responsible genes; or that there may have been incomplete coverage of certain druggable genes by the arrays most widely deployed in GWAS.

Genome-wide association analyses continue to be published in new disease areas and in new ethnic groups. Additional genetic discoveries are also being made with other types of arrays, such as the dense, locus-centric SNP arrays following up on GWAS findings that are currently not systematically captured by the GWAS catalog, including Cardiochip (48), CardioMetabochip (49), and Immunochip (50), and by increases in sample size. Exome array analyses are also unveiling rare, disease-associated variants underrepresented in whole-genome arrays. Therefore, we anticipate that the current gap between druggable genes and GWAS findings will be reduced over time, particularly if such studies are extended to electronic health record data sets, which form rich repositories of phenotypic traits and diagnostic codes. In the future, as cost falls and the pipelines for interpreting individual sequence variation are streamlined, whole-genome sequencing may replace genotyping arrays as the major source of information on genetic variation used for drug target identification and validation.

Genetic profiling of a promising target against a range of outcomes can help evaluate the efficacy and safety of a target for the primary indication as well as the identification of additional disease indications to help plan drug development priorities. To stimulate the wider use of genetic association studies in drug development and to ensure that such studies have comprehensive coverage of the druggable genome, we designed the content of a new array that combines focused coverage of the druggable genome with a whole-genome scaffold. This array could be deployed to boost sample size and power in diseases already studied by GWAS to identify additional susceptibility loci and druggable targets. The Illumina list price for the array DrugDev ($56 per sample) is lower than that of the Omni 2.5 Exome array ($177 per sample) and Omni 5 array (\$273 per sample), thus allowing a three- to fivefold increase in sample size under a fixed budget. It could also help stimulate new druggable GWAS prioritized according to unmet therapeutic need. This would automatically result in an abundance of target profiling information encompassing both efficacy and safety outcomes. This will need to be captured systematically and curated consistently to help develop a repository of human drug targets linked to the predicted consequences of their pharmacological modification.

Some limitations of our analysis are noteworthy. The identification of repurposing opportunities in the current data set relied on detecting discordance between a gene-disease association and the corresponding target-disease indication for a licensed drug and excluding instances where this was likely to be due to a mechanism-based adverse effect. However, the lack of standardized vocabulary in licensing agency approval documents and the scientific literature currently hampers this effort. We therefore used a combination of experimental factor ontology (EFO) and MeSH terms to harmonize nomenclature. Two qualified physicians then compared the annotations using a prespecified classification system developed in a pilot study involving one-fifth of the data set. Greater efforts to harmonize terms from the different ontologies (EFO, MeSH terms, the Disease Ontology, and the Human Phenotype Ontology) (5153) and from vocabularies for drug indications from the Anatomical Therapeutic Chemical (ATC) classification, electronic BNF, and eMC+ terms would help generate standardized terminology to improve the efficiency of similar efforts in the future.

In general, antagonist or inhibitor drugs are easier to develop than agonists or activators. However, it was not straightforward to establish a single workflow that would allow recommendation of agonist or antagonist development in the light of a GWAS finding. This is because alleles reported in GWAS sometimes associate with increased, and sometimes with reduced, disease risk. Moreover, alleles reported for their association with a disease biomarker could have an opposite (yet unreported) association with disease outcome if the biomarker and disease risk are inversely correlated. We recommend that this issue should be considered on a case-by-case basis whenever a specific drug development program is predicated on a genetic association at a locus encoding a druggable target.

Where several genes occupy the same LD interval as a GWAS SNP, it may be difficult to determine which is responsible for the disease or biomarker association. We took a pragmatic approach to this problem by classifying LD intervals containing druggable genes according to the total number of genes in the interval and the number and proximity of any druggable gene to the associated SNP. About 529 unique LD intervals containing a variant with a significant association from a GWAS (P ≤ 5 × 10−8) contained a single druggable gene. Such genes are strong positional candidates for the association. For the remainder, the LD interval included 2 to 146 genes (median, 4 genes; excluding the 536 regions containing 0 genes; Fig. 3), but a druggable gene was first or next most proximal gene to the association signal in 36.1% of these cases. The rediscovery of 183 target indication– or mechanism-based adverse pairings for licensed drugs using this information supports the validity of this approach. Previous Mendelian randomization studies also provide reassurance that associations of SNPs in proximity to genes encoding druggable targets recapitulate the effects of drugs modifying the encoded proteins pharmacologically (13, 18, 43). Nevertheless, we recognize that some misclassification is possible, for example, a causal signal arising from a gene encoding a nondruggable protein that occupies the same LD interval as a gene encoding a druggable target (confounding by LD). Integrating information from feature annotation databases such as ENCODE (54), National Institutes of Health Roadmap (55), and the Single Amino Acid Polymorphism database (56) could help reduce misclassification. Localization of causal genes could also be aided by evidence on the effect of genetic variants on RNA transcription or on the activity or concentration of proteins and metabolites, combining new proteomics and metabolomics technologies that are scalable to large population studies (57, 58) with statistical approaches to assess whether association signals from the same region are consistent with the same causal variant (59). Note, also, that even where GWAS identify a gene outside the druggable set, the findings also have the potential to inform drug development indirectly, by highlighting pathways and processes involved in disease pathogenesis that may contain other druggable targets.

The Mendelian randomization paradigm that underpins this strategy validates targets (within a defined disease context) and not compounds, although comparing the profile of effects of a genetic variant with those of a drug or developmental compound can help distinguish on- from off-target effects (13, 18). For this reason, RCTs will not be superseded by the approach we describe, because any new molecule developed for a target of interest could have off-target actions that cannot be modeled genetically. In addition, the effect of altering the expression or function of a target may only be seen beyond some lower threshold so that a weak genetic effect may not adequately model the effect of modifying the target pharmacologically (26). Genetic evidence of a causal mechanism also does not guarantee its reversibility through pharmacological modification. For example, immune system–related genetic variants associate with the risk of developing type 1 diabetes, but useful therapies arising from this knowledge may be difficult to realize, because by the time the disease is diagnosed, immune-mediated damage to the pancreatic β cells may be too advanced (26). Despite these theoretical limitations, evidence is emerging that Mendelian randomization studies have wide-ranging potential to improve the efficiency of drug development and reduce the risk of expensive late-stage failure.

In summary, we have shown an approach to focus and catalyze the use of genomic information to support drug target validation, accurately match targets to disease indications, and identify rational repurposing opportunities for licensed drugs. The approach aligns well with proposals to “reengineer” translational science (60). It could help address the efficiency and innovation problem and serve as a basis for reinvigorating drug development.

## MATERIALS AND METHODS

### Study design

Work in this paper extended the concept of Mendelian randomization studies for drug development from individual targets to the whole genome. The study (i) defined a set of genes that encode actual (or potential) drug targets and are likely to be responsible for genetic associations with complex diseases from earlier GWAS, (ii) allowed us to design a genotyping array with enriched SNP coverage of these genes, (iii) and linked the proteins encoded by this gene set to licensed drugs or to compounds with bioactivities against these targets. A variety of bioinformatics resources and other in silico tools were used to achieve these aims. The integrity of the analysis was evaluated through a comparison of the consistency between licensed drug indications and GWAS associations through manual curation and blinded clinical expert review. This analysis showed that GWAS have already rediscovered around 70 or so of about 600 targets of licensed drugs through associations with disease indications, disease-related biomarkers, or mechanism-based adverse effects. The data set was then used to draw inferences about the potential for drug repositioning and the more systematic application of genomics for drug target–disease indication mapping in the future.

### Assembly of a druggable gene set

The reference set of genes used to redefine the druggable genome was composed of gene annotations from Ensembl version 73 with a biotype of “protein coding.” To this, we added T cell receptor and immunoglobulin genes, polymorphic pseudogenes, and a number of additional genes that were annotated in Ensembl version 73 as nonprotein coding but which were nevertheless believed to encode important proteins (for example, SRD5A2 and CYP4F8). Data were extracted via Biomart (grch37.ensembl.org/biomart/martview). The content was assembled in three tiers.

Tier 1.

This tier incorporated the targets of approved drugs and drugs in clinical development. Proteins that are targets of approved small-molecule and biotherapeutic drugs were identified using manually curated efficacy target information from release 17 of the ChEMBL database (61). An efficacy target was defined as the target for the intended drug indication as opposed to any other potential targets for which the drug shows high-affinity binding. Where binding site information was available in ChEMBL, a non–drug-binding subunit of a protein complex was assigned to tier 3, whereas the drug-binding subunit was included in tier 1. Drugs in clinical development were identified from a number of sources: investor pipeline information from a number of large pharmaceutical companies [including Pfizer, Roche, GlaxoSmithKline, Novartis (oncology only), AstraZeneca, Sanofi, Lilly, Merck, Bayer, and Johnson & Johnson—accessed June to August 2013], monoclonal antibody candidates and USAN applications from the ChEMBL database (release 17), and drugs in active clinical trials from clinicaltrials.gov (62). Targets for these drug candidates were assigned from company pipeline information and scientific literature, where available. Where no reported target information could be found, a potential target was assigned through analysis of bioactivity data in ChEMBL, with the target having the highest dose-response measurement ≤100 nM for the compound being assigned. All other human targets having a median inhibitory concentration (IC50)/median effective concentration (EC50)/median growth inhibitory concentration (GI50)/median effective or inhibitory concentration (XC50, AC50)/dissociation constant (Kd)/inhibition constant (Ki)/potency ≤ 100 nM for an approved drug or USAN compound were also included in tier 1. Genes involved in ADME (absorption, distribution, metabolism, and excretion)/drug disposition (phase 1 and 2 metabolic enzymes, transporters, and modifiers) were identified from the PharmaADME.org extended set (63).

Tier 2.

This tier incorporated proteins closely related to drug targets or with associated drug-like compounds. Proteins closely related to targets of approved drugs were identified through a BLAST search (blastp) of Ensembl peptide sequences against the set of approved drug efficacy targets identified from ChEMBL previously (38). Any genes where one or more Ensembl peptide sequences shared ≥50% identity (over ≥75% of the sequence) with an approved drug target were included. Putative targets with drug-like (Lipinski rule-of-five compliant) compounds having an IC50/EC50/GI50/XC50/AC50/Kd/Ki/potency ≤1 μM were identified from ChEMBL and were also included in tier 2.

Tier 3.

This tier incorporated extracellular proteins and members of key drug target families. Proteins distantly related to drug targets were identified through a BLAST search against the set of approved drug targets (as above), with any proteins sharing ≥25% identity over ≥75% of the sequence and with E value of ≤0.001 being included in the set. Members of the five major “druggable” protein families (GPCRs, kinases, ion channels, nuclear hormone receptors, and phosphodiesterases) were extracted from Kinase SARfari (64), GPCR SARfari (65), and IUPHAR/BPS Guide to PHARMACOLOGY (66) and included in the tier 3. Extracellular proteins were identified using annotation in UniProt (67) and Gene Ontology (GO) (68). Because the potential size of the secreted/extracellular portion of the proteome (potential targets for monoclonal antibodies) is large, and the number of markers available for inclusion on the array was limited, this data set was restricted to those proteins for which higher confidence annotations of extracellular localization were available (not solely prediction of a signal peptide). Proteins annotated in UniProt as having a “secreted” subcellular location, those containing a signal peptide, or those annotated as “extracellular” (where these annotations were supported by the following evidence types: experimental, probable, and by similarity) were included in tier 3. Proteins annotated in GO with cellular component terms GO:0005576 (extracellular region), GO:0005615 (extracellular space), GO:0005578 (proteinaceous extracellular matrix), GO:0031233 (intrinsic to external side of plasma membrane), GO:0031232 (extrinsic to external side of plasma membrane), GO:0071575 (integral to external side of plasma membrane), GO:0031362 (anchored to external side of plasma membrane), GO:0009897 (external side of plasma membrane, and GO:0044214 [fully spanning plasma membrane and supported by strong evidence (EXP, IDA, TAS)] were also included in the tier. Finally, proteins known to be cluster of differentiation antigens (CD antigens) according to UniProt were also added to tier 3. Because the final set of genes included in tier 3 was large (2370 genes), this tier was further subdivided to prioritize those genes that were in proximity (±50 kbp) to a GWAS SNP and had an extracellular location (tier 3A). The remainder of the genes was assigned to tier 3B.

To evaluate the Pfam-A domain content for druggable genes, gene identifiers were converted to UniProt accession keys using the UniProt web services (67). Only UniProt accessions matching the regular expression pattern “[OPQ][0-9][A-Z0-9]{3}[0-9]” were retained for further analysis. Pfam-A domains were extracted using the Xfam API (69). For genes mapping to multiple UniProt accessions, we retained domain annotations for the UniProt accession mapping to the highest number of unique Pfam-A domains.

### Comparison of druggable gene sets

For comparison with genes covered on the Illumina DrugDev array, sets of druggable genes defined by Hopkins and Groom in 2002 (28), Russ and Lampel in 2005 (29), and Kumar et al. (30) were obtained from DGIdb (31). Gene names were converted to Ensembl gene identifiers using the Ensembl REST API (70). The overlap between the three sets was determined and visualized using the Python module matplotlib_venn.

### Compilation of GWAS results

The GWAS catalog was downloaded from www.ebi.ac.uk/gwas/api/search/downloads/alternative on 21 July 2015. Several quality control and further postprocessing steps were then taken. The identifiers of associated variants were validated against Ensembl (version 79, build 37) using the Perl API. This step returned the latest identifier and the human genome build 37 chromosome coordinates; 707 associated variants could not be validated and were excluded. The GWAS catalog provides numerical effect estimates but does not specify the type of effect, such as OR or β coefficient. We attempted to resolve the effect type by using data in other fields (such as the presence of case or control in the discovery population fields) to classify the effect type as OR, β, or unknown. The discovery population field was also processed using a set of regular expressions to determine the sample size and populations used. The populations were then mapped to an appropriate 1000 Genomes superpopulation. Where no population name could be identified, EUR was used as a default because most of the studies in the GWAS catalog were performed on Europeans. The PubMed identifier field was used to search PubMed using the Biopython API. MeSH terms for the publications were mapped to the association to provide structured phenotype descriptions. However, these study-level descriptions may not apply to every association reported by the study; therefore, the MeSH terms were manually curated for each association. These supplemented the EFO terms that are already present in the GWAS catalog. Finally, the associations were filtered for those that are ≤5 × 10−8, so all data used in this study exceeded genome-wide significance.

### Assignment of LD intervals

The complete 1000 Genomes phase 3 data (release 5) were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502. BCFtools (v1.2 using HTSlib 1.2.1) was used to subset the vcf files into sub- and superpopulation files (71). For each population group, Plink v1.90b3d (72) was used to perform pairwise LD (r2) calculations between all variants in the processed GWAS catalog and biallelic 1000 Genomes variants within a 1-Mbp flank on either side of the GWAS variant having a MAF of ≥0.005. To reduce file size, only r2 values of ≥0.2 were output. The boundaries of the LD region surrounding each GWAS SNP were defined by the positions of the variants furthest upstream and downstream of this SNP with an r2 value of ≥0.5. Associated variants that were not present in the 1000 Genomes panel that were not in LD with any other variants were given a nominal flank of 2.5 kbp on either side of the association.

### Linking GWAS and drug target data

Gene annotations were extracted from Ensembl version 79. After filtering out pseudogenes, 38,352 genes remained. The set of genes was further reduced to those that overlapped an LD region surrounding an association. Within each associated LD region, the absolute base pair distance of the closest point of a gene from the associated variant was calculated. Variants located within a gene were given a distance of 0 bp. Genes were given a distance rank value according to their base pair distance. In the event of a distance rank tie, the gene with the oldest annotation date was assigned the lower rank.

Drug targets in ChEMBL 20 are annotated with UniProt accessions. The accessions were converted to Ensembl gene identifiers using the UniProt ID mapper (www.uniprot.org/uploadlists/). Drug target Ensembl gene IDs were then intersected with the IDs of genes within LD regions to give a set of drug targets in the proximity of associated variants.

### Evaluation of consistency between licensed drug indications and GWAS disease/biomarker traits

We evaluated the concordance between drug indication and disease association for those LD intervals defined by a GWAS SNP containing one or more genes encoding the target or targets of licensed drugs (fig. S9). Two experienced clinicians used a prespecified classification system developed in a pilot study of one-fifth of the total data set. Each physician was blinded to the identity of the gene encoding the druggable target within each LD interval. The outputs from the two physician-curators were then compared, any coding errors were corrected, and inconsistencies between curators were resolved by consensus where agreement could be reached. Category 0 referred to a situation where coding could not be completed because of missing data, 1 referred to a precise drug indication–target gene–disease association match, 2 referred to a drug indication–target gene–disease area association match, and 3 referred to a drug indication–target gene–mechanism of action association match. Categories 1 to 3 were defined as “concordant.” Category 4 referred to a drug mechanism–based adverse effect–target gene–disease association match; 5 referred to a drug indication–target gene–disease association mismatch with prior biological plausibility, and 6 without prior biological plausibility; 7 referred to a trait unlikely to be of therapeutic interest (such as hair color); and 8 referred to a genetic association with a new biomarker of uncertain biological function (such as a metabolite measured by a metabolomics platform). For certain drug targets/genes, a 34 code was used to indicate that the genetic association finding could reflect both a mechanism of action and mechanism-based adverse effect rediscovery. For example, the modification of certain electrocardiographic parameters by variants in the targets of certain antiarrhythmic drugs could reflect both their mechanism of action and the mechanism by which such drugs produce their adverse effects. A 54 code was used when there was uncertainty about the direction of effect. A 9 code was assigned to the four cases where consensus could not be reached between the two curators. Categories 4, 5, 54, and 6 were referred to as discordant. Categories 1 to 4 and 34 were referred to collectively as “GWAS rediscoveries” of known drug effects.

### Design of the Illumina DrugDev array and comparative analysis of coverage of variation in the druggable genome

Selection of custom SNP content.

The design was based on three tiers, corresponding to the level of evidence for druggability of the encoded proteins, with the highest priority given to genes in tiers 1 and 2. Tag SNPs were selected from the 1000 Genomes European ancestry populations (CEU/GBR/FIN/TSI). Associations (tagging) between SNPs were identified on the basis of LD (r2 > 0.8). SNPs already covered or tagged by the Human Core base content were not duplicated. Only SNPs with a MAF of ≥1.5% were considered for inclusion. The tagging threshold was defined as the number of variants a SNP tags (including itself) and was varied according to the tier. For tiers 1 and 2, a tagging threshold of 1 was applied, meaning that all SNPs were considered for inclusion, even if they only tag themselves. For tier 3A, we used a tagging threshold of 3, and for tier 3B, we used a tagging threshold of 4. SNPs were selected only if they were positioned within ±2.5 kbp of the druggable genes selected in the three tiers (defined as a region of 2.5 kbp upstream of the Ensembl gene start position to 2.5 kbp downstream of the Ensembl gene end position). SNPs from the Illumina Exome array were also included in the custom content, where these were found within genes in tiers 1, 2, and 3A. Again, any redundancy with the Human Core and selected tag SNP content was eliminated. A collection of mitochondrial tag SNPs from the Broad Institute, designed to capture common variation within the mitochondrial genome, was also included in the custom content (www.broadinstitute.org/mpg/tagger/mito.html). This set is composed of 64 SNPs, but only 56 of these loci were designable and included in the array. Finally, the remaining space was filled with lead SNPs for any disease or trait association from the GWAS catalog, prioritizing SNPs located within 50 kbp of a druggable gene or within the boundaries of any protein-coding gene.

For tier 1 genes, 99,102 custom markers were selected, including tag SNPs and Human Exome content. A further 17,944 of the Human Core markers also fell within tier 1 gene regions, giving 117,046 markers in total. Tier 2 included 40,943 custom markers, and an additional 6270 markers from the Human Core fell within tier 2 gene regions, resulting in a total of 47,213 markers. Genes in tier 3 were represented by 38,858 custom markers. A further 21,626 Human Core markers fell within tier 3 gene regions, yielding 60,484 markers in total. In addition to coverage of genes encoding druggable targets, 6400 SNPs associated with complex diseases or traits identified from the GWAS catalog and from selected gene-centric studies were also incorporated in the array content. Of these SNPs, 2996 were already covered in the Human Core or previously included in the custom content, leaving 3410 variants to be added (of which 1395 were within tier 1 to 3 gene regions). Finally, 53 mitochondrial genome tag SNPs were also included, along with 9 mitochondrial genome exome SNPs. Considering all content, 226,138 markers were located in, or within ±2.5 kbp of, genes in the selected drugged, druggable, and ADME sets. For the array as a whole, 78,175 markers were exonic, 286,577 were intronic, 27,393 were located in 5′ untranslated region, and 41,171 were located in 3′ untranslated region.

We used variants in the 1000 Genomes phase 3 reference panel populations to compare coverage of the druggable genome by the new array and other commonly used genotyping arrays (see previous section). For this analysis, the variants in each array were first mapped to the 1000 Genomes phase 3 reference panel, and coverage was then compared using two metrics: variant density (per kilo–base pair of the druggable gene) and the proportion of the variants in the druggable genome that were captured. We defined complete coverage of druggable genome as capture of all the biallelic variants in a 1000 Genomes phase 3 reference panel population with a MAF of ≥0.005 (representing low-frequency to common variants). Because of differences in variant content reported in successive genome builds, not all the content of the genotyping arrays could be mapped back to the 1000 Genomes phase 3 reference set. However, the proportion of variants captured by each array that could be mapped to the 1000 Genomes reference panel was very similar (fig. S10).

Evaluating genotyping array coverage of the DrugDev array.

The build 37 genotyping array content for the Illumina arrays was downloaded from W. Rayner's array strand website (www.well.ox.ac.uk/~wrayner/strand). Where multiple versions of an array exist, the latest version number was downloaded. The Affymetrix array annotations were downloaded as SQLite databases from the Affymetrix website. 1000 Genomes data were processed as described in the method for creating LD regions. Variants present on the genotyping arrays were mapped to 1000 Genomes phase 3 using the following sequence: variants with rs identifiers were searched against the 1000 Genomes sites file, and if no match was obtained, then synonyms of the rs identifier (obtained from Ensembl version 79 build 37) were searched. Variants not mapping by rs identifier were then mapped by chromosome, position, and alleles (flipping the strand of the alleles where appropriate). Allele frequencies and variant tagging for each subpopulation group were calculated using Plink [v1.90b3d (73)]. Tagging was restricted to biallelic low-frequency and common variants (MAF of ≥0.005) within 1 Mbp of the source SNP. Baseline 1000 Genomes coverage of the druggable genome in the different subpopulations was ascertained using BEDTools (v2.22.1) to intersect 1000 Genomes variants with a MAF of ≥0.005 against the druggable gene list (including 2.5 kbp upstream/downstream). Proportional coverage of the druggable genome by the different genotyping arrays was then ascertained by intersecting the baseline coverage with the 1000 Genomes mapped array content.

Indications and adverse effects of licensed therapies. Drug indication data were obtained from several sources. The primary source was the FDB database (www.fdbhealth.co.uk/). This is a commercial database used by University College London Hospitals (UCLH), and a one off single release was provided for research purposes by FDB Europe Ltd. Because FDB is used clinically, this was regarded as the “gold standard” indication set used for the manual categorization of concordant/discordant drug/GWAS links (see above). FDB drug indications are tagged with Universal Medical Language System (UMLS) concept identifiers (CUIs) and could be mapped into MeSH and other ontologies within the UMLS meta-thesaurus (51, 74). Drug indication data were obtained from ChEMBL 21 by manual curation and mapping of data from FDA drug labels (https://dailymed.nlm.nih.gov/dailymed/), World Health Organization ATC classification (www.whocc.no/atc_ddd_index/), and ClinicalTrials.gov (https://clinicaltrials.gov). This was used to supplement the FDB data and fill in indication data for drugs that were not present in the FDB release.

Side effect data were obtained from the Side Effect Resource (SIDER) database (75). The drug identifiers used in SIDER were mapped back to ChEMBL identifiers using a mapping file provided by SIDER. The side effects are provided as MedRA terms and UMLS CUIs and were mapped to MeSH terms using the UMLS.

### Statistical analysis

The proportion of druggable genes in LD intervals specified by GWAS associations in each MeSH disease or MeSH psychiatry category was calculated by dividing the number of druggable genes by the number of all genes with 95% CIs calculated assuming a binomial distribution, on the assumption that each study was independent.

## SUPPLEMENTARY MATERIALS

www.sciencetranslationalmedicine.org/cgi/content/full/9/383/eaag1166/DC1

Fig. S1. The distribution of sequence ontology functional consequences of GWAS significant variant associations (P ≤ 5 × 10−8).

Fig. S2. Number of protein-coding genes in LD region.

Fig. S3. Translational potential of GWAS disease associations across all MeSH root disease areas and mental disorders.

Fig. S4. Translational potential of GWAS biomarker associations with relevance to disease across all MeSH root disease areas and mental disorders.

Fig. S5. Density of variant coverage across the druggable genome in commercial genotyping arrays (directly typed variants).

Fig. S6. Density of variant coverage across the druggable genome in commercial genotyping arrays (directly typed and tagged variants).

Fig. S7. Directly typed variant coverage of druggable genes on commercial genotyping arrays expressed as a proportion of 1000 Genomes phase 3 variants (directly typed variants).

Fig. S8. Directly typed variant coverage of druggable genes on commercial genotyping arrays expressed as a proportion of 1000 Genomes phase 3 variants (directly typed and tagged variants).

Fig. S9. Criteria for manual evaluation of the concordance/discordance of GWAS phenotypes, drug indications, and side effects.

Fig. S10. Proportion of content in various commercial arrays mapped to the 1000 Genomes phase 3 set of variants.

Table S1. The druggable genome.