Why Most Gene Expression Signatures of Tumors Have Not Been Useful in the Clinic


Science Translational Medicine  13 Jan 2010:
Vol. 2, Issue 14, pp. 14ps2
DOI: 10.1126/scitranslmed.3000313


Omics technologies are expected to enhance our understanding of a variety of diseases and to open the door to patient-specific personalized medicine. Despite the extensive literature on the use of gene expression arrays to predict prognosis in cancer patients, little progress has been made in translating gene expression signatures into clinical use. Breast cancer provides a ripe arena for an analysis of why such signatures have failed to fulfill their promise.

The first reference in PubMed containing the words “gene” and “microarray” was published in 1995 (1). About 10 years elapsed between this first report of “differential expression measurements of 45 Arabidopsis genes” and the marketing of pangenomic cDNA microarrays that allow analysis of more than 40,000 genes simultaneously. As of December 2009, more than 28,000 peer-reviewed articles containing these two words can be found in PubMed. The cost of microarray technology has declined, and genomics is considered to be the most mature of all the “omic” technologies (2). These and all future related technologies strive to enhance our knowledge in a variety of arenas and to open the door to patient-specific personalized medicine. Before dreaming of new “arrays of opportunities” (3), however, one should critically consider how useful gene expression signatures have been in the clinics. For such an analysis, I use breast cancer as an example.


This recurring statement is present in most clinical papers about DNA microarray analyses. One of the main arguments in the quest for genomic signatures is that “patients with the same clinicopathological parameters can have markedly different clinical courses” (4). This observation has evolved into the expectation that differences between tumors at the gene level should explain everything. Gene expression profiles currently are used to classify tumors according to two strategies: (i) unsupervised analyses that classify tumors according to gene expression and (ii) supervised analyses that classify tumors according to clinical characteristics.

Unsupervised classifications are based on clustering algorithms. In the case of breast cancer, this strategy was used by Perou et al. (5), who identified four main categories: luminal, basal-like, normal-like, and HER2-enriched tumors. These four categories were initially defined according to the expression of a so-called intrinsic gene set that comprised 496 genes (5). In subsequent publications, the number of genes rose to 1300 (6), then dropped to 37 (7), and, finally, the same team proposed recently that these four categories be defined according to the expression of 50 genes (8). The names of the four categories are used by clinicians to classify tumors. However, instead of using gene expression profiles, clinicians use clinical surrogates based on hormone receptor expression, HER2 status, and immunohistochemical staining of tumor cells with specific antibodies.
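As a generic illustration of what a clustering-based (unsupervised) classification does, the sketch below groups tumors by the similarity of their expression profiles using average-linkage agglomerative clustering with a correlation distance. It is not the algorithm behind the intrinsic subtypes; the choice of linkage, the distance, the function names, and the toy expression values are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(
                    correlation_distance(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: 4 tumors measured on 4 genes.
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 1.0],
]
toy_clusters = average_linkage(toy_profiles, n_clusters=2)
```

Note that the clusters come out unlabeled: names such as "luminal" or "basal-like" are attached afterwards by people, which is where the overlap with existing clinical knowledge enters.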

There are two reasons why clinicians do not use gene expression to classify tumors. First, the resources required to acquire these measures (people, analytical devices, money) are not available; second, even if they were, none of the algorithms used to define tumor categories from gene expression data have been published, and thus none are publicly available. Because the four tumor classification categories are currently used by physicians, one may argue that the gene expression clustering algorithms have been useful in the clinics. However, this classification has been successful largely because the computer-generated gene clusters approximately coincide with tumor classes based on preexisting clinical knowledge: HER2-enriched tumors are mostly tumors with a HER2-positive status, and basal-like tumors (also named triple-negative tumors) have negative hormone receptor and HER2 statuses. If this overlap with the clinical classification had not existed, the gene cluster names would not have been used. And given the overlap, why choose a complicated, time-consuming, and expensive method that requires the acquisition and analysis of complex gene expression data instead of a simple one based on clinical parameters that are routinely available? The four tumor categories differ in terms of their prognosis and response to treatments; these observations confirm what we knew decades ago: that the hormone receptor and HER2 statuses of tumors are important prognostic and predictive factors in breast cancer.

Supervised classifications answer specific questions. For instance, what is the difference between patients who relapse and those who do not? This question was first addressed in the pioneering work on node-negative breast cancers performed by Van’t Veer et al. (9). The method used was remarkably simple. Gene expression was assessed in tumor tissue from 34 patients who relapsed during the 5 years after initial treatment and from 44 patients who did not. The expression level of each of the genes in patients who relapsed was compared to that of the same gene in patients who did not. Genes were ranked according to the P value of the test that compared their expression levels in the two groups. Researchers selected the 70 genes with the smallest P values—and thus the strongest differences between the two groups. The 70-gene signature corresponded to the mean expression values of these 70 genes in the group of patients who did not relapse. They then used this signature to define a rule by which to classify patients: Individuals for whom the correlation between the good 70-gene signature and tumor expression of the same 70 genes was >0.4 were classified as “good prognosis” patients, and the other patients were classified as “poor prognosis” cases.
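The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: the text does not name the statistical test, so a Welch t statistic is used here and genes are ranked by its absolute value as a stand-in for ranking by P value. The function names and the toy data (4 genes, 3 patients per group, 3 genes selected instead of 70) are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def welch_t(a, b):
    """Welch t statistic (assumed test; the source does not specify one)."""
    var_a = stdev(a) ** 2 / len(a)
    var_b = stdev(b) ** 2 / len(b)
    return (mean(a) - mean(b)) / sqrt(var_a + var_b)


def build_signature(relapsed, relapse_free, n_genes):
    """Pick the n_genes most discriminating genes and return their indices
    plus the mean expression profile of the relapse-free group."""
    n_total = len(relapse_free[0])
    scores = []
    for g in range(n_total):
        a = [patient[g] for patient in relapsed]
        b = [patient[g] for patient in relapse_free]
        scores.append((abs(welch_t(a, b)), g))
    top = sorted(scores, reverse=True)[:n_genes]
    selected = sorted(g for _, g in top)
    profile = [mean(patient[g] for patient in relapse_free) for g in selected]
    return selected, profile


def classify(tumor, selected, profile, threshold=0.4):
    """Apply the correlation rule: r > threshold means good prognosis."""
    r = pearson([tumor[g] for g in selected], profile)
    return "good prognosis" if r > threshold else "poor prognosis"


# Hypothetical toy data: rows are patients, columns are genes.
relapsed = [[5.0, 1.0, 4.0, 2.0], [6.0, 1.2, 4.2, 2.1], [5.5, 0.8, 3.8, 1.9]]
relapse_free = [[1.0, 5.0, 3.0, 2.0], [1.2, 6.0, 3.2, 1.8], [0.8, 5.5, 2.8, 2.2]]

selected, profile = build_signature(relapsed, relapse_free, n_genes=3)
label_a = classify([1.1, 5.2, 3.0, 2.0], selected, profile)
label_b = classify([5.8, 0.9, 4.1, 2.0], selected, profile)
```

The whole rule reduces to one gene list, one reference profile, and one correlation threshold, which is what makes the strategy so simple to state.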

This simple analytical strategy, developed in 2002, is still the best available in genomic analyses, and more complicated strategies have not outperformed it. In 2008, Haibe-Kains et al. (10) showed in a validation setting that complex models are not better predictors of prognosis than simpler ones. What we have learned from all of these supervised classifications is that the proliferative ability of tumor cells is a common denominator of many existing prognostic gene signatures (10, 11). Again, this realization confirms what we knew decades ago: that the ability of tumors to proliferate is an important prognostic and predictive factor in breast cancer.


A crucial step in the translation of gene signatures is validation in a clinical setting (12, 13). Ideally, validation of an experimental gene signature should be performed in an independent patient population, by an independent research team. The original validation of Van’t Veer’s signature was published in the same article that defined the signature (9) and was performed in 19 patients who came from the same patient population but were not included in the experiments that defined the signature. Seventeen of the 19 patients (89%) had an accurate prognosis prediction. However, this excellent result was not matched in subsequent validations. Van de Vijver et al. (14) studied a consecutive series of 295 breast cancer patients, including both node-positive and node-negative individuals. A major criticism of this validation was that it included 61 node-negative patients who had participated in the original study by Van’t Veer et al. (15). This lack of independence between the two populations led to an overestimation of the performance of the signature that became perceptible in subsequent evaluations. When the signature was evaluated in 180 patients who were not involved in the original study and whose metastasis status within the first 5 years of follow-up was known, the sensitivity (that is, the proportion of patients classified as high-risk with the 70-gene signature among the patients who relapsed) was 93% (95% confidence interval: 81–99%), and the specificity (that is, the proportion of patients correctly classified as low-risk with the 70-gene signature among the patients who did not relapse) was only 53% (44–61%). Similar results were obtained in 307 node-negative breast cancer patients, with a sensitivity of 90% (78–95%) and a specificity of 42% (36–48%) (16).
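To make the two definitions concrete, here is a small sketch that computes sensitivity and specificity from a 2×2 table and attaches a 95% confidence interval to each. The counts are hypothetical (the underlying tables of the validation studies are not reproduced here), and the Wilson score interval is an assumed choice; the intervals quoted above may have been computed with a different method.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


def sensitivity_specificity(tp, fn, tn, fp):
    """tp: relapsed and classified high-risk; fn: relapsed, classified low-risk;
    tn: no relapse and classified low-risk; fp: no relapse, classified high-risk."""
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts, chosen only to illustrate the calculation.
tp, fn, tn, fp = 18, 2, 40, 40
sens, spec = sensitivity_specificity(tp, fn, tn, fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
```

With a specificity near 50%, about half of the patients who never relapse are still labeled high-risk, which is the practical cost hidden behind a high sensitivity.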

These successive validations illustrate how inadequate validation leads to overestimation of a signature's performance. The 70-gene signature is probably the only one that has been so extensively validated, which is itself remarkable. The gene microarray literature is polluted with many signatures that have inadequate validation or even no validation at all (Fig. 1). This is not acceptable for many reasons: (i) the findings will not be reproduced, (ii) the studies are referenced by many scientists and so the research problems addressed in the studies are believed to be solved, (iii) inadequate information is never removed from the databases, and (iv) thus, the total number of signatures is artificially increased and the proportion of potentially useful ones decreased. The time of clinicians who are expected to use genomic signatures is too precious to be wasted on judging the relevance and quality of gene microarray signatures; this validation must be done by scientists, so that robust signatures can be delivered to the clinics. The number of published papers with inadequate validation casts doubt over the complete body of gene microarray literature, and these papers should be expunged from bibliographic databases. I suggest that an adaptable dictionary of published gene expression signatures be created that contains a critical analysis of every declared possible use of each signature, as well as comments on their statistical and clinical validity. A key bottleneck for such a project is guaranteeing that those in charge of the critical analyses have no real or perceived conflicts of interest.

Fig. 1. Too much in, nothing out.

The gene microarray literature is polluted with many gene expression signatures that have inadequate validation or no validation at all.



Two clinical trials have been launched to test the clinical usefulness of two prognostic signatures that are currently used by thousands of physicians in many countries to identify patients with a low risk of relapse. These patients are expected to derive no survival benefit from chemotherapy while being exposed to its serious side effects. In Europe, the MINDACT (Microarray In Node-negative and 1 to 3 positive lymph node Disease may Avoid ChemoTherapy) trial is a randomized study designed to compare the ability of MammaPrint (Agendia, Amsterdam, the Netherlands), the commercial version of the 70-gene signature, to identify women with a low risk of relapse with that of a clinico-pathological classification procedure (17). The TAILORx trial [Trial Assigning Individualized Options for Treatment (Rx)] was launched in the United States in 2006. In TAILORx, chemotherapy is assigned or randomized according to the recurrence score (RS) estimated with the Oncotype DX test (Genomic Health, Redwood City, CA): Patients with an RS less than or equal to 10 do not receive chemotherapy, patients with an RS above 25 receive chemotherapy systematically, and patients with an intermediate RS between 11 and 25 are randomized between chemotherapy and no chemotherapy groups (18).
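The TAILORx allocation rule as stated above can be written as a three-way threshold. The sketch assumes an integer recurrence score and uses a hypothetical function name and arm labels; it encodes only the cutoffs given in this text, not the trial protocol.

```python
def tailorx_arm(recurrence_score):
    """Map a recurrence score (RS) to a chemotherapy allocation,
    using the cutoffs quoted in the text (<=10, 11-25, >25)."""
    if recurrence_score <= 10:
        return "no chemotherapy"
    if recurrence_score <= 25:
        return "randomized: chemotherapy vs. no chemotherapy"
    return "chemotherapy"


# Boundary cases around the two cutoffs.
arms = {rs: tailorx_arm(rs) for rs in (10, 11, 25, 26)}
```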

The 10-year results of these two trials will not be available before the year 2020. Let us assume that MammaPrint ends up being better than the clinico-pathological classification. Studies comparing the 70-gene signature results and the 21-gene prognostic score, which forms the basis of Oncotype DX, have shown that the overall concordance is 82% (19). Among the patients with an intermediate RS, who constitute the very population in which the decision to give chemotherapy is in question, about 50% will have a good prognosis and 50% a poor prognosis according to MammaPrint (20, 21). These patients with an intermediate RS represent 37% of the Oncotype DX target population (22). If TAILORx shows that patients with an intermediate RS have to be treated with chemotherapy, then half of them will not receive it if they are treated according to MammaPrint. Conversely, if TAILORx shows that these patients should not be treated with chemotherapy, then half of them will receive it if they are treated according to MammaPrint. This means that for 37% of patients, the treatment will be defined by which genomic test is used.
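The arithmetic behind this argument can be laid out per 1000 patients in the Oncotype DX target population. The sketch takes the 37% and the approximate 50/50 split from the text at face value; the function name and the per-1000 framing are assumptions for illustration.

```python
def intermediate_group_per_1000(intermediate_pct=37, mammaprint_poor_pct=50):
    """Return (intermediate-RS patients, MammaPrint good prognosis,
    MammaPrint poor prognosis) per 1000 patients, using integer percentages."""
    intermediate = 1000 * intermediate_pct // 100
    poor = intermediate * mammaprint_poor_pct // 100
    good = intermediate - poor
    return intermediate, good, poor


intermediate, good, poor = intermediate_group_per_1000()
```

Under these figures, 370 of every 1000 patients fall in the intermediate group, and whichever way TAILORx turns out, roughly 185 of them would be treated differently under MammaPrint than under the Oncotype DX result.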


Gene microarrays have brought little progress to the clinical management of cancer since Schena et al.’s 1995 publication (1). Van’t Veer et al. (9) gave us a proof of concept when they showed that gene microarray information could be used to predict prognosis. Unfortunately, these predictions of prognosis are not very accurate and have not improved since 2002. This state of affairs is extremely disappointing given the potential of the technology. We still do not know how to read the messages within the genome. New technologies are on the horizon, and competition between these and gene microarrays might be the end of the latter (23). However, these new technologies will generate increasingly large databases that will be more and more difficult to analyze. The field urgently needs a breakthrough in the way we analyze such data, or we will end up with a collection of data sufficient to explain everything but unable to predict anything.


  • Citation: S. Koscielny, Why most gene expression signatures of tumors have not been useful in the clinic. Sci. Transl. Med. 2, 14ps2 (2010).


  • Acknowledgments: I thank E. Benhamou, J.-M. Guinebretière, C. Hill, and V. Ribrag for their critical comments and suggestions, and L. Saint-Ange for editing.
