Research ArticleDrug Discovery

Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data

See allHide authors and affiliations

Science Translational Medicine  17 Aug 2011:
Vol. 3, Issue 96, pp. 96ra77
DOI: 10.1126/scitranslmed.3001318

This article has a correction. Please see:


The application of established drug compounds to new therapeutic indications, known as drug repositioning, offers several advantages over traditional drug development, including reduced development costs and shorter paths to approval. Recent approaches to drug repositioning use high-throughput experimental approaches to assess a compound’s potential therapeutic qualities. Here, we present a systematic computational approach to predict novel therapeutic indications on the basis of comprehensive testing of molecular signatures in drug-disease pairs. We integrated gene expression measurements from 100 diseases and gene expression measurements on 164 drug compounds, yielding predicted therapeutic potentials for these drugs. We recovered many known drug and disease relationships using computationally derived therapeutic potentials and also predict many new indications for these 164 drugs. We experimentally validated a prediction for the antiulcer drug cimetidine as a candidate therapeutic in the treatment of lung adenocarcinoma, and demonstrate its efficacy both in vitro and in vivo using mouse xenograft models. This computational method provides a systematic approach for repositioning established drugs to treat a wide range of human diseases.


The identification of new disease indications for approved drugs, or drug repositioning, offers several advantages over traditional drug development (1). The traditional paradigm of drug discovery is generally regarded as protracted and costly, with studies showing that it takes about 15 years and more than $1 billion to develop and bring a new drug to market (2). A substantial portion of drug development costs are incurred during early development and toxicity testing, with more than 90% of drugs failing to move beyond these early stages (3). The repositioning of drugs already approved for human use mitigates the costs and risks associated with early stages of drug development and offers shorter routes to approval for therapeutic indications. Successful examples of drug repositioning include the indication of sildenafil for erectile dysfunction and pulmonary hypertension, thalidomide for severe erythema nodosum leprosum, and retinoic acid for acute promyelocytic leukemia (4).

The prevailing approach to drug repositioning is based on established tools and techniques developed for screening libraries of lead compounds against biological targets of interest in early-stage drug discovery. In the case of drug repositioning, libraries of approved drugs are screened across a broad set of putative biological targets with high-throughput screening (HTS) technologies that incorporate multiplex biological assays (1, 4). Computational approaches to the discovery of novel biological targets for approved drugs have been developed to overcome the high costs and other logistical limitations associated with experimental HTS approaches, allowing for greatly expanded searches for novel repositioning opportunities (510). Although these approaches provide comprehensive and systematic means to discover new biological targets for existing drugs, they do not address the challenge of moving from in silico or in vitro binding of a drug to a biological target to realized therapeutic use in affected patients.

Gene expression microarrays (11, 12) enable the measurement of genome-wide expression, and they are regularly and broadly applied in clinical studies of human diseases. Comparative gene expression analysis of primary affected, peripheral and secondary organs, tissues, and biofluids is used to study the molecular pathophysiology of a disease condition or to identify expression patterns that serve as prognostic or diagnostic indicators. By examining the sets of genes that are up- and down-regulated in a disease state compared to a normal state, it is possible to create a gene expression profile, or signature, of a disease (1317). When derived from clinical samples, these signatures represent the accurate and consistent gene expression state imparted by immunological, metabolic, and other complex factors that make up the broad physiological manifestation of a disease (18, 19).

Microarrays are also used to discover gene expression patterns that signify the perturbation of biological systems by drug compounds (2024). A collection of genome-wide transcriptional expression data from cultured human cells treated with a broad range of Food and Drug Administration (FDA)–approved bioactive small molecules (25) has been used to study the molecular basis of drug effects and particular diseases of interest like colon cancer (26) or prostate cancer (27).

Here, we performed a large-scale integration of expression signatures of human diseases from the public data with previously described drug-effect signatures (25). Using this approach, we built a compendium of disease-drug relationship predictions based on matching genome-wide signatures of disease pathophysiology and drug effect. Instead of examining a single drug-disease pair or even looking at reactions of a large set of drugs on a single disease, we focused on discovering connections between drugs and diseases across all the available gene measurements. Here, we present evidence that this strategy can confirm already known therapeutic uses for drugs and uncover new uses for FDA-approved drugs.


We identified and combined data from publicly available microarray data sets from National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) representing 100 diseases with gene expression data from several human cancer cell lines treated with 164 drugs or small molecules to predict previously undescribed therapeutic drug-disease relationships. We then defined a disease signature for each of the 100 diseases as a set of genes that are significantly up- and down-regulated for that disease compared to normal values using significance analysis of microarrays (SAM) (19, 28). Next, we statistically compared each of the disease signatures to each of the reference drug expression signatures from the Connectivity Map (25) to infer possible therapeutic indications. A schematic overview of our methodology is shown in Fig. 1. We calculate a similarity score for every pairing of a drug and disease reflecting the similarity of the drug and disease signatures, ranging from +1, indicating a perfect correlation of signatures, to −1, indicating exactly opposite signatures. Our hypothesis is that if a disease state is signified by a specific set of genome-wide expression changes, and if exposure to a particular drug causes the reverse set of changes in a model cell line (that is, negative similarity scores indicating an opposing relationship), then that drug has the potential to have a therapeutic effect on that disease.

Fig. 1

Analytic workflow. (A) Two gene expression collections are used: a set of disease-associated gene expression data with corresponding controls and a set of gene expression data from tissue treated with drugs and small molecules with corresponding controls. SAM is used to obtain a signature of significantly up- and down-regulated genes for each disease. Rank normalization and the preprocessing procedure previously described (25) are used to create a reference database of drug gene expression. (B) A modification to the Connectivity Map method (25) is used to query the disease signature against the drug reference expression set to assign a drug-disease score to each drug-disease pair based on profile similarity. These scores are interpreted, resulting in a list of candidate therapeutics for each disease of interest.

To evaluate the significance of our predictions, we used a permutation approach under which random drug signatures were generated and the analysis was repeated 100 times for each disease. We computed the false discovery rate (FDR) for individual drug-disease similarity score values, or q value, by calculating the expected number of false positives, given the actual distribution of similarity scores and the distribution of scores after randomization for each disease signature.

Of more than 16,000 possible drug-disease pairings, 2664 were statistically significant (q < 0.05), more than half of which suggest a therapeutic (negative or opposite) relationship. Overall, our method provided significant candidate therapeutic drug-disease relationships for 53 of the 100 diseases at an FDR of 5%. The other 47 diseases could not be significantly associated with any of the 164 drugs we tried and were excluded from further consideration. Each of the 164 drugs was significantly associated with at least 1 of the 53 remaining diseases.

Many cancers, including gastric cancer, melanoma, and transitional cell carcinoma, showed the highest number of significant matches to therapies (Table 1). The drug predicted to be efficacious for the largest set of diseases was vorinostat, a histone deacetylase (HDAC) inhibitor with significant therapeutic association with 21 of the 100 diseases in our data set. Other HDAC inhibitors, such as trichostatin A and Helminthosporium carbonum (HC) toxin, as well as gefitinib, an epidermal growth factor receptor (EGFR) inhibitor used in cancer treatment, had more than 20 significant predicted therapeutic indications (Table 1).

Table 1

Drugs and diseases with the most indications.

View this table:

Global properties of drug-disease connections

By examining the relationships between the 164 drugs and 53 diseases based on predicted therapeutic scores, we identified major clusters of predicted therapeutic relationships between drugs and diseases (Fig. 2). We then looked separately at the clustering of drugs on the basis of their similarity scores across diseases (Fig. 3A). Drugs with similar mechanisms of action clustered together. For example, the HDAC inhibitors vorinostat, HC toxin, and trichostatin A formed a cluster, as did the phosphatidylinositol 3-kinase (PI3K) inhibitors LY-294002 and wortmannin, and the salicylate anti-inflammatory drugs sulfasalazine, mesalazine, and acetylsalicylic acid (Table 2). We also found clusters of drugs that are known to affect different aspects of the same biological process. For example, geldanamycin, raloxifene, monorden, and sodium phenylbutyrate are all known to perturb the activity of the heat shock protein 90 (HSP90) chaperone complex, either through direct inhibition or through perturbation of upstream factors (29), and these agents also clustered together (Table 2).

Fig. 2

Heat map of drug-disease scores. Most of the heat map is white, indicating that most of the drug and gene expression profiles are not significantly concordant. Yellow indicates a negative (therapeutic) drug-disease score, meaning the expression profiles of the two are opposing, and that the drug might be a potential treatment option of the disease. Blue indicates a positive drug-disease score, meaning the expression profiles of the two are similar, and therefore, the drug may exacerbate the disease. This figure can be enlarged in the full-text version of the paper at or in the Supplementary Material, which contains a copy of this figure.

Fig. 3

Hierarchical clustering of drugs and diseases by predicted therapeutic scores. (A) Drugs are clustered on the basis of their prediction scores. Several groups are highlighted in color, indicating clusters of drugs with known shared mechanisms of action. (B) Diseases are labeled along with Unified Medical Language System Concept Unique Identifiers, and are clustered on the basis of their prediction scores. Groups highlighted with color are known to share characteristic pathophysiology. This figure can be enlarged in the full-text version of the paper at or in the Supplementary Material, which contains a copy of this figure.

Table 2

Significant drug clusters. Drugs were clustered in an unsupervised manner by their predicted therapeutic score across 100 diseases (see Fig. 3A).

View this table:

Diseases similarly clustered on the basis of their computed therapeutic scores across the panel of drugs (Table 3 and Fig. 3B). We found a large cluster of cancers, including lung, stomach, and other cancers, pointing to potential commonality of predicted therapeutic response between these conditions (Table 3 and Fig. 3B, green). The inflammatory bowel diseases (IBDs) ulcerative colitis (UC) and Crohn’s disease (CD) appeared together as part of a larger cluster of diseases for which corticosteroids and other immunosuppressive drugs are broadly indicated (Table 3 and Fig. 3B, yellow). IBD was clustered near infection by Yersinia enterocolitica, which can present clinical symptoms similar to those of IBD (30). The reported clusters were determined to be significant by bootstrap analysis (31).

Table 3

Significant disease clusters. Diseases were clustered in an unsupervised manner according to their predicted therapeutic response across 164 drug compounds (see Fig. 3B).

View this table:

Prediction of known and previously unassociated drugs for individual diseases

In addition to examining overarching relationships between groups of drugs and diseases, we also determined individual therapeutic predictions based on the hypothesis that if a drug has a gene expression signature that is opposite to a disease signature, that drug could potentially be used as a treatment for that disease. One of the strongest therapeutic predictions for CD and UC was the corticosteroid prednisolone, a well-known treatment for these conditions (32), with a score of −0.216. We found that topiramate, an anticonvulsant drug currently used to treat epilepsy, had a stronger therapeutic score for CD than the established therapeutic prednisolone, with a score of −0.220, as shown by the comparison of the CD signature to the gene expression profile of topiramate. Genes that were up-regulated in disease were down-regulated by the drug and vice versa, reflecting the significant score. Topiramate was also one of the strongest predicted therapies for UC, based on our analysis. We established the efficacy of topiramate in ameliorating an IBD phenotype in an independent study using a trinitrobenzenesulfonic acid (TNBS)–treated rat model of colitis and reported in an accompanying paper (33).

Prediction and validation of cimetidine as a previously undescribed indication for lung adenocarcinoma

To perform an initial experimental evaluation of our approach, we chose to evaluate one of the therapeutic predictions for lung adenocarcinoma (LA), because lung cancer contributes the greatest burden of cancer mortality and incidence in Europe and the United States (34). Although our methodology predicted multiple new therapeutic relationships for LA, we chose to test cimetidine because it is an off-patent and inexpensive drug available over the counter in the United States and has a favorable side effect profile (35). Our prediction score of −0.088 for cimetidine was moderate, but still more significant than the score of −0.075 for gefitinib, a well-known therapy for LA.

To evaluate the efficacy of cimetidine against LA in vitro, we performed an MTT [3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide] colorimetric assay to assess growth and proliferation of LA cells (A549) after exposure to cimetidine, and assessed apoptosis through detection of DNA fragmentation by TUNEL (terminal deoxynucleotidyl transferase–mediated deoxyuridine triphosphate nick end labeling) assay. LA cells exposed to increasing concentrations of cimetidine exhibited a dose-dependent reduction in growth and proliferation when compared to cells treated with phosphate-buffered saline (PBS) vehicle (Fig. 4A). LA cells also exhibited extensive apoptosis 1 day after direct exposure to 2000 μM cimetidine (Fig. 4B, right; green, TUNEL-positive), whereas apoptosis was not detectable by TUNEL fluorescence microscopy in LA cells from the same line after exposure to the PBS vehicle solution.

Fig. 4

Experimental validation of cimetidine for lung adenocarcinoma (LA). (A) MTT calorimetric assay showing dose-dependent inhibition of LA cell growth after exposure to cimetidine in vitro. (B) Evaluation of apoptosis by TUNEL assay. LA cells treated with 2000 μM cimetidine exhibit a significant increase in TUNEL-positive (green) nuclei compared to vehicle (PBS)–treated control. The scale bar is 75 μM. (C) Results from a tumor xenograft experiment testing the efficacy of H2 agonist cimetidine in inhibiting the growth of A549 LA cell line tumors in SCID mice. Three treatment groups (25, 50, or 100 mg/kg per injection) and one control group (PBS) were used. Another group was treated with doxorubicin as a positive control. (D) Representative images of tumors treated with high dose of cimetidine (left) and control (right).

To further evaluate the efficacy of cimetidine against LA in vivo, we tested three doses of cimetidine in mice implanted with a human A549 LA cell line and followed the growth of these tumors for 12 days. Tumors in mice treated with vehicle (Fig. 4C, red line) grew to 3.25 times their original volume, whereas tumors in mice treated with our positive control dosing with the chemotherapeutic doxorubicin grew to twice the original volume (Fig. 4C, orange line). Cimetidine caused a statistically significant reduction in growth compared to control (100 mg/kg per day group versus PBS control on day 12; P < 0.001, t test), with the highest dose of cimetidine approaching the efficacy of the positive control therapy doxorubicin. Cimetidine exhibited its antineoplastic properties in a dose-dependent manner, with tumors in mice treated with 50 mg/kg per day (Fig. 4C, purple line) growing to 2.8 times the original size, and those treated with 100 mg/kg per day (Fig. 4C, light blue line) growing only 2.3 times as large [P < 0.05, Tukey’s honest significant difference (HSD)]. Figure 4D shows representative examples of tumors treated with cimetidine (100 mg/kg per day) (left) in comparison to control (right).

To test the specificity of our predictions, we performed a similar experiment in mice using the ACHN renal cell carcinoma cell line as the origin of the implants (fig. S1). We chose ACHN as a control for this experiment because it did not have any significant association with cimetidine according to our analysis (score 0, P = 1.0). Concordant with our computational prediction, we found that cimetidine had no effect on renal cell carcinoma xenograft tumors in a mouse model (fig. S1).


Here, we sought to discover previously undescribed therapeutic relationships between drug compounds and diseases through the large-scale integration of public gene expression data for drugs and disease. Our hypothesis was that a comprehensive and systematic comparison of drug and disease gene expression signatures could be used to build a global space of drug-disease relationships, which could be explored to identify previously unknown therapeutic relationships between drugs and disease conditions. The global clusterings of drugs and diseases by predicted therapeutic scores revealed known therapeutic clusters of drugs and diseases and also provided pathophysiological context to support interpretation of novel predicted therapeutic relationships.

Drugs that clustered on the basis of their gene signature similarity across diseases exhibited shared mechanisms of action (for example, HDAC inhibition) or shared physiological processes (for example, immunomodulation). In some cases, the axis of commonality is not apparent within drug clusters, and these situations may inform our understanding of a drug’s unknown mechanism of action or even suggest new biology. For example, we found a cluster containing the salicylate drugs sulfasalazine, mesalazine (a metabolite of sulfasalazine), and acetylsalicylic acid grouped with the calcium channel blocker verapamil and its R-enantiomer, dexverapamil. The two distinct drug types represented in this cluster are not known to act via the same mechanism and are not generally considered to be interchangeable or complementary therapeutic options for a common disease indication. Although the underlying mechanism is unclear, both mesalazine and sulfasalazine are used as maintenance drugs in the treatment of the inflammatory bowel disorders CD and UC. A possible therapeutic role for verapamil in IBD is indicated by several studies that show verapamil to reduce inflammation in UC through leukotriene inhibition (36) and promote mucosal healing in experimental colitis (37). Further study into the genes and potential additional drug targets driving this clustering could reveal new information about IBD biology and new therapeutic directions.

The diseases that clustered on the basis of their gene signature profiles showed within-cluster consistency of established therapeutic conventions and likely shared pathophysiology. The major cluster containing CD and lung transplant rejection also harbored a number of diseases for which corticosteroids and other immunosuppressants are regularly prescribed. The large cluster of cancers shares many of the same antineoplastic and immunomodulatory therapeutics. Less evident groupings, such as that between polycystic ovary syndrome (PCOS) and glioblastoma, may present opportunities for drug repositioning within a cluster. However, in some cases, such groupings may have alternative explanations. For example, the peculiar clustering of cardiomyopathies with cancers could be a reflection of the fact that many chemotherapeutics induce cardiomyopathies as a side effect (38).

In addition to examining broader relationships among groups of drugs and diseases, we also discovered individual therapeutic predictions for 53 of the 100 diseases in our data set. It is not surprising that a large proportion of the diseases we studied did not have any significant therapeutic predictions, because the set of drugs we examined was limited, and the baseline assumption was that most drugs will not promiscuously exhibit therapeutic potential for many diseases.

Many of our therapeutic predictions recapitulate established therapeutic knowledge. For instance, we predict that HDAC inhibitors (trichostatin A, valproic acid, vorinostat, and HC toxin) have therapeutic effects in treatment of different types of brain tumors (astrocytoma, glioblastoma, and oligodendroglioma) as well as other cancers (esophagus, lung, and colon). HDAC inhibitors are anticancer agents that act by inhibiting cancer cell proliferation and inducing apoptosis (3941). Trichostatin A was recently shown to restore the loss of NECL1, a tumor suppressor in glioma (42). Valproic acid is a drug used to treat epilepsy and bipolar disorder but has been more recently reported to be effective in the treatment of various cancers including glioma (43), myeloma (44), and melanoma (45), often in combination with other therapies. Vorinostat is currently used for the treatment of several lymphomas but has recently undergone successful phase II trials for treatment of recurrent glioblastoma multiforme (GBM), which showed that vorinostat is well tolerated in patients with recurrent GBM and has modest single-agent activity (46). Recent studies have also shown that HC toxin effectively suppresses the malignant phenotype of neuroblastoma cells (47, 48). Our prediction that celastrol, an antioxidant and anti-inflammatory drug, would be effective against melanoma is also supported by previous research, which shows that the compound resembles the well-characterized ATF2 (activating transcription factor 2) peptide and may therefore offer new approaches for treatment (49).

After computationally screening our large set of compounds, we predicted and validated cimetidine as a possible therapeutic agent for LA. Cimetidine is a histamine H2 receptor (H2R) antagonist that acts as an inhibitor of acid production in the stomach. It is used and prescribed most often as an over-the-counter drug for heartburn and peptic ulcers. Cimetidine has previously demonstrated efficacy in other adenocarcinoma models (5053). A molecular mechanism or other molecular basis for the observed efficacy of cimetidine against LA has not been previously reported or otherwise described. Initial reports of cimetidine’s antitumor activity proposed that cimetidine inhibits tumor growth via immunomodulation (5457). However, as demonstrated by the in vitro and in vivo experimental results of this study, cimetidine inhibits tumor growth and cellular proliferation in immune-deficient hosts and by direct exposure to LA cells. More recent studies have demonstrated that cimetidine can inhibit vascular development in tumors, suggesting that cimetidine might work to inhibit tumor-associated angiogenesis (52, 58). Additional studies into the antineoplastic properties of cimetidine have revealed putative roles in the inhibition of tumor cell adhesion and induction of apoptosis (59, 60). It is also possible that the efficacy of cimetidine for LA is mediated by some yet unknown interaction between cimetidine and a target outside of its canonical H2R target. The characteristics of the gene signatures can be used to learn more about the potential mechanism of action of the drug.

Despite computational and experimental validation of our approach, there are several caveats. One potential drawback to the approach described here is the need to have gene expression profile measurements on the candidate drugs being evaluated. Such data are publicly available for more than 1000 compounds in the second release of the Connectivity Map; however, there are numerous compounds that are not part of the database. In addition, it is not clear whether drug performance in a breast cancer cell line, used extensively to measure transcriptional response of drugs in Connectivity Map, is relevant to all types of diseases. However, the number of publicly available data reporting the effects of drugs on gene expression across disease tissues continues to grow (61). Furthermore, disease-related microarray data could be combined with other types of knowledge on drugs (62, 63) for computational prediction of drug side effects.

Therapeutic efficacy is always more complex than just a simple matching of expression profiles. For example, compounds have to reach the desired or appropriate tissue to have an effect. However, the tissue-agnostic methodology used in this analysis (the disease and drug gene expression was not measured on the same tissues) might be suitable to find both direct and distant effects of drugs. Although our findings for cimetidine will need further preclinical testing and demonstration in larger randomized prospective clinical trials, our results validate the concept of computational analysis of public gene expression databases as a potentially useful approach to drug discovery that may uncover additional uses for approved drugs.

Materials and Methods

We combined data from publicly available microarray data sets representing 100 diseases and gene expression data from human cell lines treated with 164 drugs or small molecules to predict therapeutic drug-disease interactions.

Drug and disease gene expression data

Disease expression data were obtained from the NCBI GEO (64, 65), with methods previously described (6668). We used 176 gene expression microarray data sets, each with samples reflecting disease tissue and control nondisease tissue (table S1). These data sets covered 100 diseases, totaling 3113 microarrays (table S1). A data set with measurements for LA and adjacent normal tissues measured from human patients (GSE2514) was among the 176 data sets used for the study.

Rank normalization was applied to the disease expression data to carry out robust cross-platform analysis (69, 70). Because experiments were run on different platforms, we standardize gene identifiers from chip-specific probe identifiers to NCBI GeneID identifiers using AILUN (71) to be able to reason across multiple experiments. In cases where multiple microarray probes mapped to the same NCBI GeneID, we averaged across individual probe expression values. On each data set, we performed SAM (28), comparing control samples to disease samples to generate a list of up- and down-regulated genes for each disease state. We used an FDR threshold of 0.05 for q values. The list of significantly up- and down-regulated genes for each disease-control comparison constituted a disease signature (Fig. 1).

Drug-exposure gene expression microarray data measured on several types of cancer cell lines for 164 distinct small molecules were obtained from Lamb et al. (25). Most of the experiments were carried out on the breast cancer epithelial cell line MCF7, but a subset of the compounds was also profiled in the prostate cancer epithelial cell line PC3 and the nonepithelial lines HL60 (leukemia) and SKMEL5 (melanoma). The original experiments were carried out on two platforms: GEO Platform (GPL) 96 Affymetrix GeneChip Human Genome U133 Array Set HG-U133A and GPL3921 Affymetrix GeneChip HT-HG_U133A Arrays. To be able to reason between drug and disease expression data, we standardized gene identifiers from microarray-specific probe identifiers to NCBI GeneID identifiers, averaging across individual probe expression values (Fig. 1). We followed the preprocessing and normalization steps described in the supplementary material in Lamb et al. (25).

Computational pipeline

We used the drug-exposure expression data as a reference database and queried this database with each individual disease signature by applying a nonparametric, rank-based, pattern-matching strategy based on the methodology originally introduced by Lamb et al. (25) to generate a ranked list of potential treatments for each of the diseases of interest (Fig. 1). Given a disease signature, we evaluated its similarity to each of the reference expression drug profiles in the drug expression set by computing an enrichment score for the up- and down-regulated disease genes. If the up-regulated disease genes appear near the top (up-regulated) of the rank-ordered drug gene expression list and the down-regulated disease genes fall near the bottom (down-regulated) of the rank-ordered drug gene expression list, we can conclude that the drug and the disease expression profiles are similar, and thus the drug might cause a change in tissue expression similar to having the disease. More interestingly, if the up-regulated disease genes fall near the bottom of the rank-ordered drug gene expression list and the down-regulated disease genes are near the top of the rank-ordered drug gene expression, then the given drug and disease have complimentary expression profiles and the drug might be a possible treatment option for the disease of interest.

For each disease, we computed an enrichment score (es) separately for the set of up- or down-regulated genes in the signature: esup and esdown. We constructed a reference drug expression set by taking the difference between the gene expression values of the samples treated with a compound of interest and untreated samples. Furthermore, we carried out rank normalization on the resulting gene expression differences. Let r be the total number of genes in the reference drug expression set and let s be the number of up- or down-regulated of each gene in the disease signature. We first constructed a vector V of the position (1… r) for each of the genes in the disease signature on the basis of the values from the reference drug expression set. Those were sorted in ascending order such that V(j) is the position of disease gene j, where j = 1, 2, … s. Then, for each set of up- and down-regulated disease genes, we computed aup, adown, bup, and bdown defined as follows.

If aup/down > bup/down, we set esup/down to aup/down, and if bup/down > aup/down, we set esup/down to −bup/down. The drug-disease score dds is set to zero, where esup and esdown have the same algebraic sign. Otherwise, we set dds = esup − esdown. For more details, see the supplementary material in Lamb et al. (25).

In the original method proposed by Lamb et al., the final drug-disease score was scaled from +1 to −1. In using this scaling, every disease signature was able to match some drug profile at a maximal level. Because the scores for each disease were assigned a unique scaling factor, the strength or confidence of one drug-disease association could not be compared with another drug-disease association. We thus chose not to carry out this final scaling step, instead using the resulting raw values as drug-disease scores. To present the results in a more readable fashion, we chose to average the scores by drug dose and cell line, although we realize that this might dampen some of the signal. Finally, we ranked the list of drug and small-molecule instances for each disease from most correlated to most anticorrelated, according to the computed similarity drug-disease score.

As a control, for every disease signature in our data set, we chose a random signature of the same size and recomputed the drug-disease scores for each drug. We repeated this experiment 100 times. The variance of the original scores is statistically significantly greater than the scores generated from random signatures (P < 2.2 × 10−16, Levene’s test), meaning that there were more significant therapeutic indications in our resulting data set than expected by chance.

To obtain a measure of significance on the drug-disease scores, we started by computing a P value for each drug-disease score by comparing the distribution of actual scores to those obtained on randomized data for each disease signature. We further used the q value (72) statistical package to compute tail area–based FDR for each drug-disease score in our result given the previously computed P values. On the basis of this analysis, we were able to establish the cutoff for significance for drug-disease scores for each disease signature, yielding an FDR of 5%.

To identify disease and drug classes, we applied hierarchical cluster analysis to the data using the computed Pearson correlation coefficients as a distance metric between disease or drug pairs. Initially, each disease or drug was assigned to its own cluster. The algorithm proceeded iteratively, at each stage joining the two most similar clusters, until there was just a single cluster left. We used the Pvclust R package (31) to compute a bootstrap analysis of the clusters. The bootstrap probability of a cluster corresponds to the frequency with which the cluster appears in bootstrap samples of the data. Approximately unbiased (AU) probability values were computed with bootstrap samples of various sizes and indicate how strongly the cluster is supported by data (AU > 95%).

Evaluation of cimetidine as a chemotherapeutic agent for non–small cell lung carcinoma

To evaluate our prediction, we performed experimental validation using tumor xenograft models of non–small cell lung carcinoma. Our experimental design was based on two human cancer cell lines that were implanted into severe combined immunodeficient (SCID) mice. Thirty mice were implanted with the A549 human adenocarcinoma cell line. Five million cells per microliter of PBS were injected into upper flanks of the animals. Thirty mice were divided into five groups of six mice each. Three groups received varying doses of cimetidine ranging from 25 to 100 mg/kg daily via intraperitoneal injection. One group was injected with doxorubicin (2 mg/kg) intraperitoneally biweekly as a positive control, and the final group received only PBS/vehicle as a control. The experiment was carried out for 12 days after the tumor implants reached a volume of 100 mm3. Tumor volumes were measured by caliper with the following equation: mm3 = 0.52 × [width (cm)]2 × height (cm).

To evaluate the specificity of our approach, we performed a parallel experiment using the ACHN renal carcinoma cell line (5 million cells per microliter of PBS injection into upper flanks), because we predicted that cimetidine would not be efficacious against this cancer type. We used the same experimental design and protocol for the mice implanted with the A549 cell line as for the animals implanted with ACHN. Differences in tumor volume between groups were evaluated on the final day of dosing with Tukey’s HSD test and also Student’s t test between the highest dose of cimetidine and the PBS control. Mouse xenograft experiments were performed by the Transgenic Mouse Research Center core facility at the Stanford Comprehensive Cancer Center. Cell lines were verified to be free of murine pathogens by Research Animal Diagnostic and Investigative Laboratory. All chemicals and reagents were purchased from Sigma-Aldrich Corp.

Colorimetric MTT assay for cell survival and proliferation

MTT assay reagents from Millipore were used for cell survival and proliferation. The assay was carried out according to the manufacturer’s instructions. A549 cells were seeded at 5000 cells per well in a 96-well plate 15 to 18 hours before the start of the treatment. Cimetidine solution (250 mM) was prepared by adding 0.63 g of cimetidine in 8 ml of PBS and then 10 N HCl (200 μl) and 10 N NaOH (150 μl). Then, the pH was adjusted to the physiological level and total volume was adjusted to 10 ml. Cells were treated three times with cimetidine (250, 500, 1000, 1500, and 2000 μM) with 250 mM stock solution at 11 a.m., 6 p.m., and 11 a.m. the next day followed by MTT assay at 4 p.m. After the treatments, 0.01 ml of MTT (Millipore CT01-5, 50 mg/ml in PBS) solution was added to each well, and the cells were incubated for 4 hours at 37°C for the cleavage of MTT to occur. Then, color development solution (isopropanol with 0.04 N HCl, 0.1 ml each) was added and mixed thoroughly. Within an hour, absorbance was measured at 570 nm and at a reference wavelength at 630 nm. For the calculation, absorbances measured at 570 nm were subtracted from those measured at 630 nm.

TUNEL assay

TUNEL assay was carried out according to the manufacturer’s instructions (Roche). A549 cells were seeded at 20,000 cells per well in Nunc four-well Lab-Tek chamber slides in Dulbecco’s modified Eagle’s medium (DMEM) with 10% fetal bovine serum (FBS). Cells were seeded 15 to 18 hours before the start of the treatment. Cells were treated two times with cimetidine (500 to 3000 μM with 250 mM stock solution) at 11 a.m. and 6 p.m. and were stained the next day. Medium was removed, cells were washed in cold PBS, and the cells were air-dried. Then, freshly made 4% paraformaldehyde solution was added and cells were fixed for 15 min in room temperature. Cells were rinsed with PBS and were incubated with permeabilization solution (0.1% Triton X-100) for 2 min on ice. Next, terminal deoxynucleotidyl transferase and label solution were mixed and 50 μl each of the mixture was added to each well. Cells were incubated for 1 hour in a humidified chamber at 37°C. For negative controls, no label solution was added. Slides were covered with coverslips to ensure proper staining. Cells were then washed with cold PBS three times and dried. Finally, Hoechst solution for nuclei staining (Sigma) was diluted 1:10,000 in PBS and was added to the cells for 1 to 2 min. Cells were rinsed with PBS three and two times in water. Cells were applied with aqueous mounting medium and coverslipped. Immunofluorescence was measured with a Leica confocal microscope at the Cell Sciences Imaging Center at Stanford Cancer Center.

Supplementary Material

Fig. S1. Effect of the H2 agonist cimetidine on the growth of ACHN renal carcinoma cell line tumors in SCID mice.

Fig. S2. Heat map of drug-disease scores.

Fig. S3A. Hierarchical clustering of drugs by predicted therapeutic scores.

Fig. S3B. Hierarchical clustering of diseases by predicted therapeutic scores.


  • * These authors contributed equally to this work.

References and Notes

  1. Acknowledgments: We thank A. Skrenchuk and B. Oskotsky (Stanford University) for computer support and E. Davydov, C. Do, S. Gross, and M. Schaub (Stanford University) for constructive discussion. We thank Q. Zheng of the Transgenic Mouse Research Center at the Stanford Comprehensive Cancer Center for help in conducting the mouse xenograft experiments. Funding: This work was supported by Lucile Packard Foundation for Children’s Health, the Hewlett Packard Foundation, National Institute of General Medical Sciences (R01 GM079719), National Cancer Institute (R01 CA138256), National Library of Medicine (T15 LM007033), Howard Hughes Medical Institute, and Pharmaceutical Research and Manufacturers of America Foundation. Author contributions: M.S., J.T.D., A.P.C., and A.J.B. designed the study and carried out the analysis. J.K., A.S.-C., and J.S. carried out validation experiments. A.A.M. and A.P.C. assisted in the analysis. M.S., J.T.D., and A.J.B. wrote the paper. Competing interests: A.J.B. is on the scientific advisory board of NuMedii Inc. and Personalis and is a paid consultant for Lilly, Johnson and Johnson, Genstruct, Tercica, Ansh Labs, and Prevendia. J.T.D. is a consultant for NuMedii Inc. The other authors declare that they have no competing interests.
  • Citation: M. Sirota, J. T. Dudley, J. Kim, A. P. Chiang, A. A. Morgan, A. Sweet-Cordero, J. Sage, A. J. Butte, Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Sci. Transl. Med. 3, 96ra77 (2011).

View Abstract

Navigate This Article