Editors' Choice: Genomics

Nothing but a hound dog

Science Translational Medicine  25 May 2016:
Vol. 8, Issue 340, pp. 340ec83
DOI: 10.1126/scitranslmed.aaf9196

Deep learning algorithms have beaten world champions in the game of Go and have even tackled challenges as critical as finding cats in online videos. In recent work, Kelley et al. applied these methods to genome sequences, because predicting a region’s function simply from a sequence of bases has remained a challenge.

The authors trained a convolutional neural network (CNN) to recognize accessible DNA across 164 cell types. CNNs are a particularly useful deep-learning approach because they make the simplifying assumption that neighboring features (in this case, bases in a genome sequence) will usually be more informative than distant ones. Many types of data meet this assumption. The trained CNN can be used for in silico saturation mutagenesis, in which every possible single-base substitution in a sequence is scored. Researchers can input either a reference or a variant allele and predict its accessibility. This analysis predicts which variants change accessibility and suggests a mechanism by which specific variants affect certain phenotypes.
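To make the idea of in silico saturation mutagenesis concrete, here is a minimal Python sketch. It is not the authors' software: `toy_accessibility` is a hypothetical stand-in for a trained model, and the motif and example sequence are assumptions chosen only for illustration. The loop substitutes every alternative base at every position and records how the predicted score changes relative to the reference sequence.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = [BASES.index(b) for b in seq]
    return np.eye(4)[idx]

def toy_accessibility(x, motif):
    """Hypothetical stand-in for a trained model (NOT Basset):
    take the best motif match along the sequence and squash it to (0, 1)."""
    k = motif.shape[0]
    scores = [np.sum(x[i:i + k] * motif) for i in range(x.shape[0] - k + 1)]
    return 1.0 / (1.0 + np.exp(-(max(scores) - k / 2)))

def saturation_mutagenesis(seq, predict):
    """Return a (length, 4) matrix of score changes, one entry per
    single-base substitution; the reference base at each position stays 0."""
    ref_score = predict(one_hot(seq))
    delta = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for j, base in enumerate(BASES):
            if base == seq[pos]:
                continue
            variant = seq[:pos] + base + seq[pos + 1:]
            delta[pos, j] = predict(one_hot(variant)) - ref_score
    return delta

# Assumed motif and sequence, for illustration only.
MOTIF = one_hot("TGACTCA")
SEQ = "CCGTTGACTCAGGAAC"  # the motif occupies positions 4 through 10
DELTA = saturation_mutagenesis(SEQ, lambda x: toy_accessibility(x, MOTIF))
```

In this toy setup, substitutions inside the motif lower the predicted score and substitutions outside it leave the score unchanged. With a real model, the same matrix would highlight the bases the model considers most important for accessibility.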

This research team was not the first to train a CNN on genomic features, but it is the first to provide a software package designed to sniff out accessible regions in specific cell types. Good software is not enough: Training a Basset model for a new cell type could still take weeks or months. To get around this bottleneck, the authors trained a generic Basset model across many cell types and then further trained the computational hound on the scent of a specific cell type, a process that takes minutes or hours instead of weeks. This model’s accuracy was similar to that of the model trained over the complete collection of cell types.
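The pretrain-then-specialize strategy described above can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the authors' method: it uses a small logistic regression on synthetic data instead of a CNN on genome sequence, and all names, sizes, and step counts are hypothetical. It compares a few training steps that start from "generic" pretrained weights with the same few steps that start from zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Mean binary cross-entropy of weights w on data (X, y)."""
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(w, X, y, steps, lr=0.5):
    """Plain gradient descent on the logistic loss, starting from w."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

rng = np.random.default_rng(0)
n_features = 20

# Synthetic tasks: the "specific" task is a small perturbation of the "generic" one.
w_generic = rng.normal(size=n_features)
w_specific = w_generic + 0.1 * rng.normal(size=n_features)

X_generic = rng.normal(size=(2000, n_features))
y_generic = (X_generic @ w_generic > 0).astype(float)
X_specific = rng.normal(size=(2000, n_features))
y_specific = (X_specific @ w_specific > 0).astype(float)

# Long, expensive training on the generic task (done once).
pretrained = train(np.zeros(n_features), X_generic, y_generic, steps=300)

# Short training on the specific task, with and without the pretrained start.
fine_tuned = train(pretrained, X_specific, y_specific, steps=5)
from_scratch = train(np.zeros(n_features), X_specific, y_specific, steps=5)

LOSS_FINE_TUNED = log_loss(fine_tuned, X_specific, y_specific)
LOSS_FROM_SCRATCH = log_loss(from_scratch, X_specific, y_specific)
print(f"fine-tuned: {LOSS_FINE_TUNED:.3f}  from scratch: {LOSS_FROM_SCRATCH:.3f}")
```

Because the specific task resembles the generic one, the pretrained weights already sit close to a good solution, so a handful of extra steps is enough. This is the intuition behind specializing a generic model in minutes or hours rather than retraining for weeks.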

This advance means that any researcher can quickly train a cell type-specific hound. For scientists who study variant-phenotype associations, Basset can sniff out variants that are likely to affect genome accessibility and provide a shortcut to linking disease-associated variants to causal mechanisms.

D. R. Kelley et al., Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 10.1101/gr.200535.115 (2016).
