Editors' Choice: Genomics

The future is unsupervised


Science Translational Medicine  06 Jul 2016:
Vol. 8, Issue 346, pp. 346ec108
DOI: 10.1126/scitranslmed.aag3101

Gene expression data reflect biological processes as well as technical artifacts. An analysis guided by known biology, one that considers only features corresponding to an understood process, may help avoid artifacts but could limit the novelty of the findings. So how do we discover biological features without limiting ourselves to what we already know?

Celik et al. developed INSPIRE: an unsupervised algorithm that finds the hidden factors, termed “latent variables,” shared across many studies of a disease. Because the disease biology is consistent across studies while technical artifacts vary from one study to the next, features that recur across datasets can be expected to be disease-associated factors rather than study-specific artifacts. The authors’ evaluations show that an unsupervised INSPIRE analysis reveals pathways and transcription factors.

Applying the method to nine expression datasets of ovarian cancer biopsies from different platforms identified 90 factors. These factors form a low-dimensional representation of ovarian cancer: a set of variables sufficient to explain the observed gene expression patterns. A subset of INSPIRE factors captured stromal patterns in the biopsies, and deeper analysis revealed a potential regulator of tumor-associated stroma, HOPX. The factor containing HOPX was associated with poor resectability in existing datasets. The authors found that HOPX protein expression in stroma characterizes tumors where the stroma is intertwined throughout the tumor. Combined with recent functional studies in mice, this suggests a potential mechanism by which HOPX contributes to differences in resectability.

This work demonstrates the power of cross-dataset unsupervised feature extraction. Researchers planning to apply INSPIRE should note that each gene’s expression is constrained to a single latent variable. This constraint could facilitate cross-dataset analysis of a single disease, but algorithms may need to be more flexible for cross-disease analysis.
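To make the single-latent-variable constraint concrete, the following is a minimal sketch and not the INSPIRE algorithm itself. It assumes a gene-to-module assignment is already known (INSPIRE learns its assignment from the data) and scores each module as the mean of its standardized member genes. The function names `standardize` and `module_scores`, the toy `assignment` array, and the simulated datasets are all hypothetical and chosen only for illustration.

```python
import numpy as np


def standardize(expr):
    """Z-score each gene (column) within one dataset (samples x genes)."""
    mu = expr.mean(axis=0)
    sd = expr.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    return (expr - mu) / sd


def module_scores(expr, assignment, n_modules):
    """One latent variable per module: mean of its standardized member genes.

    `assignment[g]` is the single module that gene g belongs to, so every
    gene contributes to exactly one latent variable (a hard assignment).
    """
    z = standardize(expr)
    scores = np.zeros((expr.shape[0], n_modules))
    for m in range(n_modules):
        members = assignment == m
        if members.any():
            scores[:, m] = z[:, members].mean(axis=1)
    return scores


# Toy example: six genes, two modules, shared by two simulated "studies".
rng = np.random.default_rng(0)
assignment = np.array([0, 0, 0, 1, 1, 1])  # hypothetical gene-to-module map


def simulate(n_samples, offset):
    """Simulate one study; `offset` mimics a study-specific technical shift."""
    hidden = rng.normal(size=(n_samples, 2))
    noise = 0.1 * rng.normal(size=(n_samples, assignment.size))
    return hidden[:, assignment] + noise + offset, hidden


expr_a, hidden_a = simulate(50, offset=0.0)
expr_b, hidden_b = simulate(80, offset=5.0)

# The same assignment is applied to both studies, whatever their sample counts.
scores_a = module_scores(expr_a, assignment, n_modules=2)
scores_b = module_scores(expr_b, assignment, n_modules=2)
```

Because the assignment is shared, both studies are described by the same two variables even though their sample counts and baseline shifts differ. The hard assignment is also what makes the representation restrictive: a gene that participates in several processes cannot be split across modules in this scheme.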

The growing availability of data will present new opportunities for cross-dataset analysis. Unsupervised methods, which can lead us out of our biological comfort zone, may reveal new questions that we have not thought to ask. Because these methods are not designed to look for a single pattern, they may require more data than a supervised approach typically needs to find a given pattern. In our data-rich world, however, we should expect more unsupervised success stories.

S. Celik et al., Extracting a low-dimensional description of multiple gene expression datasets reveals a potential driver for tumor-associated stroma in ovarian cancer. Genome Med. 10.1186/s13073-016-0319-7 (2016).
