How to know what we don’t

Science Translational Medicine  09 Nov 2016:
Vol. 8, Issue 364, pp. 364ec179
DOI: 10.1126/scitranslmed.aal0067

Biological systems are interconnected in complex ways. One protein’s abundance can influence that of many others through changes in transcription, posttranslational modification, or other mechanisms. Moreover, cells’ responses to one environment may parallel their responses to others, with subtle differences. This is an assumption on which precision medicine rests: Some shared indicator will reveal sets of patients who will respond favorably to a certain therapy, even if that therapy is ineffective for others with nominally the same disease.

Unfortunately, we cannot measure everything about an individual all the time. Even if cost were no object, we could not biopsy every tissue and cell type in the body every day, every hour, and every minute. At some point, there would be no tissue left to measure, and we would have to make guesses. So how do we guess better?

Zhou et al. developed a way to estimate the unknown while tackling a common challenge in experimental design. Researchers have collected data in the past, but technology has advanced, and now new platforms measure the same markers and more. The authors processed approximately 100,000 gene expression assays and trained a computer to predict the expression of genes measured only by new platforms from the expression of genes measured on old platforms.
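To make the idea concrete, here is a minimal Python sketch of cross-platform imputation. It is not the model Zhou et al. used, which this summary does not describe; it substitutes a simple k-nearest-neighbor average as a stand-in. The function name `knn_impute`, its parameters, and the toy expression values are all hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_impute(train_old, train_new, query_old, k=3):
    """Estimate new-platform-only genes for one old-platform sample.

    Illustrative stand-in only, not the method from the cited paper.

    train_old: training samples, each a list of values for the genes
               shared by the old and new platforms
    train_new: the same samples in the same order, each a list of values
               for genes measured only on the new platform
    query_old: shared-gene values for the sample to impute
    """
    if not 1 <= k <= len(train_old):
        raise ValueError("k must be between 1 and the number of training samples")
    # Rank training samples by similarity on the shared genes.
    order = sorted(
        range(len(train_old)),
        key=lambda i: euclidean(train_old[i], query_old),
    )
    nearest = order[:k]
    # Average each unmeasured gene across the k most similar samples.
    n_targets = len(train_new[0])
    return [sum(train_new[i][g] for i in nearest) / k for g in range(n_targets)]


# Hypothetical toy data: two shared genes, one new-platform-only gene.
train_old = [[1.0, 0.0], [1.1, 0.1], [5.0, 4.0], [5.2, 4.1]]
train_new = [[2.0], [2.2], [10.0], [10.4]]

# A sample assayed only on the old platform; estimate the unmeasured gene.
estimate = knn_impute(train_old, train_new, [1.05, 0.05], k=2)
print(estimate)
```

A nearest-neighbor average is used here only because it is short and easy to follow. A model trained on roughly 100,000 assays would more plausibly be a regression fitted per gene, but that choice is an assumption as well.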

When a data set derived from pediatric renal cancer biopsies was analyzed using only the markers detected by old technologies, the cancer types did not group perfectly. Once the computer guessed the unmeasured genes, the cancer types separated perfectly. This improved clarity is promising: Researchers can now reanalyze data that are not genome-wide (potentially including historical clinical trials or research data) in a genome-wide manner.

This work focused on the protein-coding transcriptome, but the approach could be similarly effective in other contexts, such as noncoding RNA. Perhaps an analogous approach could be used to estimate the abundance of lncRNAs or other classes of RNA in historical data sets that predate their discovery. Such computational and statistical approaches can stitch together multiple data sets, allowing us to infer what we could not or did not measure before.

W. Zhou et al., Imputing gene expression to maximize platform compatibility. Bioinformatics 10.1093/bioinformatics/btw664 (2016).
