Deep Learning for Phylogenetic Inference

The research team will use deep neural networks to infer molecular phylogenies and extract phylogenetically useful patterns from amino acid or nucleotide sequences. This work will help researchers understand evolutionary mechanisms and build evolutionary models for a variety of analyses.

Molecular sequence data are passed through the layers of a neural network to produce a score for each possible tree topology, describing how well that tree fits the data.

A phylogeny is a tree structure depicting the evolutionary relationships among taxonomic units such as species, populations, or individuals. Phylogenies provide fundamental knowledge about evolutionary histories, which form the basis of many hypotheses, models, and theories on evolutionary processes and mechanisms. Furthermore, phylogenies provide the framework to organize and interpret data in areas well beyond evolutionary biology, such as developmental biology, cell lineage reconstruction, epidemiology, cancer biology, wildlife conservation, forensics, and linguistics. The pervasive use of phylogenies in biomedical research is evidenced by more than 180,000 papers in the biomedical literature database PubMed with an abstract containing the word “phylogeny”.

Given the importance of phylogenies, it is not surprising that biologists have long dreamed of reconstructing the Tree of Life, the phylogeny containing all living species on Earth. Unfortunately, with rare exceptions, phylogenies cannot be directly observed or measured and need to be inferred from various comparative data. Inferring phylogenies is extremely challenging, however, because there are N = (2m-5)!/[2^(m-3) (m-3)!] possible different unrooted tree topologies for m taxa (m ≥ 3), a number that grows faster than exponentially with m.
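To give a sense of how quickly the search space grows, the formula above can be evaluated directly. The following is a minimal Python sketch; the function name num_unrooted_topologies is chosen here for illustration and does not come from the project.

```python
from math import factorial


def num_unrooted_topologies(m: int) -> int:
    """Count unrooted bifurcating tree topologies for m labeled taxa.

    Implements N = (2m-5)! / [2^(m-3) * (m-3)!], which is defined for m >= 3.
    """
    if m < 3:
        raise ValueError("the formula applies only for m >= 3 taxa")
    # The division is always exact, so integer division loses nothing.
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))


# Print the count for a few taxon-set sizes.
for m in (4, 5, 10, 20):
    print(f"m = {m:2d}: {num_unrooted_topologies(m):,} unrooted topologies")
```

Four taxa admit only 3 unrooted topologies and five taxa admit 15, but ten taxa already admit 2,027,025, which is why exhaustive evaluation of every topology is impractical for all but the smallest data sets.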

Modern biology generally employs alignments of DNA or protein sequences from extant species as the primary data for inferring phylogenies, because sequences result from evolution (i.e., descent with modification) and can provide tens of millions of characters for analysis. Multiple tree inference methods have been developed in the last 50 years, roughly divided into two categories: distance-based and character-based. Deep neural networks have outstanding modeling capacities but have not previously been applied to phylogenetic inference. The team hypothesizes that deep neural networks can be used to evaluate tree topologies and to extract informative evolutionary patterns without the need to explicitly specify a mechanistic substitution model. Specifically, given molecular sequences from taxa of interest as input, the team believes that deliberately designed deep neural networks can extract phylogenetically useful substitution patterns from amino acid or nucleotide sequences and produce instructive output to guide predictions on phylogenetic relationships among taxa.
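The general idea of scoring tree topologies from sequence input can be illustrated with a toy example. The sketch below is not the team's architecture, which the text does not specify. It assumes four taxa (so exactly three unrooted topologies), a nucleotide alignment, one-hot encoding of each site, a single hidden layer with averaging over sites, and randomly initialized, untrained weights. All names (one_hot_alignment, init_params, score_topologies) are hypothetical.

```python
import numpy as np

NUCLEOTIDES = "ACGT"
# The three possible unrooted topologies for four taxa labeled 1-4.
QUARTET_TOPOLOGIES = ("((1,2),(3,4))", "((1,3),(2,4))", "((1,4),(2,3))")


def one_hot_alignment(seqs):
    """Encode aligned sequences as an array of shape (sites, taxa * 4)."""
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    x = np.zeros((n_sites, len(seqs), len(NUCLEOTIDES)))
    for t, seq in enumerate(seqs):
        for i, base in enumerate(seq.upper()):
            j = NUCLEOTIDES.find(base)
            if j >= 0:  # gaps and ambiguous characters stay all-zero
                x[i, t, j] = 1.0
    return x.reshape(n_sites, -1)


def init_params(n_taxa=4, hidden=16, seed=0):
    """Random, untrained weights (an assumption made for illustration only)."""
    rng = np.random.default_rng(seed)
    d = n_taxa * len(NUCLEOTIDES)
    return {
        "W1": rng.normal(0.0, 0.1, (d, hidden)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, len(QUARTET_TOPOLOGIES))),
        "b2": np.zeros(len(QUARTET_TOPOLOGIES)),
    }


def score_topologies(seqs, params):
    """Return one softmax score per quartet topology."""
    x = one_hot_alignment(seqs)
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # per-site ReLU features
    pooled = h.mean(axis=0)  # averaging makes the score independent of site order
    logits = pooled @ params["W2"] + params["b2"]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()


seqs = ["ACGTACGT", "ACGTACGA", "TCGTACGT", "TCGAACGT"]  # made-up toy alignment
for topology, score in zip(QUARTET_TOPOLOGIES, score_topologies(seqs, init_params())):
    print(f"{topology}: {score:.3f}")
```

Because the weights here are untrained, the scores carry no phylogenetic meaning; in practice such a network would presumably need to be trained, for example on alignments with known trees, before its output could guide predictions. Note also that enumerating topologies as output classes is only feasible for very small numbers of taxa.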

U-M Researchers

Yuanfang Guan