Using computational modeling to construct the evolutionary tree of life

By Wendy Sutton, Office of the Vice President of Research

Stephen Smith, associate chair of ecology and evolutionary biology

Stephen Smith’s childhood hikes with his father through the Appalachians planted the seeds of a lifelong interest in plants and trees. Around the same time, as an elementary school student, he was experimenting with an old computer and modem, using it to call his father and type out a sentence or two. What began as after-school fun grew into a love of coding. Two very different interests, one natural and the other manmade, eventually merged into one career.

Today, Stephen Smith, associate chair of ecology and evolutionary biology, uses computational phylogenetics to study the evolutionary history of plants and trees. The primary focus of his lab is to better understand how flowering plants have diversified over evolutionary time, specifically to explain why some lineages evolve quickly while others change very little over millions of years. To achieve this, his team studies how processes like gene duplication, gene tree conflict and the rate of molecular evolution shape patterns across the tree of life.

“One of the overarching questions for the lab is about the evolution of innovation, meaning how large changes in evolution occur,” Smith said. “One of the big unspoken secrets in biology is that we don’t actually understand how the big changes in evolution occur. Is it one big step at a single moment in time, or is it many small steps that accumulate?”

A typical plant genome contains roughly 10,000 to 40,000 genes, and each gene tells a story. For example, when comparing humans, chimpanzees and gorillas, most genes indicate that humans and chimpanzees are each other’s closest relatives. However, some genes indicate that chimpanzees are more closely related to gorillas. Approximately 30% of gene trees support alternative relationships, a conflict likely arising from processes such as incomplete lineage sorting or hybridization. In incomplete lineage sorting, ancestral genetic variation is passed down unevenly to descendant species, while hybridization occurs when species interbreed and exchange genes.
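To make the idea of gene tree conflict concrete, here is a minimal Python sketch that tallies how many gene trees agree or disagree with an assumed species relationship. The tree counts are toy numbers invented for illustration, chosen only to echo the roughly 30% figure above; the function name `conflict_summary` is hypothetical, and this is not the software used in Smith’s lab.

```python
from collections import Counter

# Toy data, invented for illustration only: each entry is the pair of species
# that a single gene tree places as each other's closest relatives.
gene_trees = (
    [frozenset({"human", "chimpanzee"})] * 70
    + [frozenset({"chimpanzee", "gorilla"})] * 16
    + [frozenset({"human", "gorilla"})] * 14
)


def conflict_summary(trees, species_tree_pair):
    """Summarize how many gene trees match the assumed species relationship."""
    counts = Counter(trees)
    total = len(trees)
    concordant = counts[frozenset(species_tree_pair)]
    return {
        "concordant_fraction": concordant / total,
        "conflicting_fraction": (total - concordant) / total,
        "counts": counts,
    }


# Assumption: humans and chimpanzees are sisters in the species tree.
summary = conflict_summary(gene_trees, {"human", "chimpanzee"})
print(f"{summary['conflicting_fraction']:.0%} of gene trees conflict")
```

Real analyses compare full tree topologies across thousands of genes and species, but the bookkeeping follows the same pattern: count which relationships each gene supports, then ask what process could explain the minority signals.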

Researchers can sequence whole genomes and all RNA in plants and animals. Sequencing the genome involves reading an organism’s full DNA to obtain its complete genetic blueprint. Sequencing an organism’s RNA, collectively known as the transcriptome, captures which genes are actively being expressed and at what levels. Because only a fraction of DNA is expressed as RNA and ultimately translated into proteins, transcriptomes offer a more focused view of functional genes. This is especially useful in plants, where genome sizes vary widely, which is why Smith was an early adopter of transcriptome sequencing.

Smith’s work requires an intensive computational approach, given that there are approximately 350,000 species of flowering plants. To understand why some groups evolve more rapidly than others, his team requires a diverse data set, analyzing thousands of different species. The lab mines massive public databases such as NCBI’s GenBank, while also generating its own large datasets. Because RNA cannot be sequenced from dead plants, Smith’s lab grows its own plants and then sequences them.

In 2019, Smith received a $60,000 Catalysis grant from MICDE for his project, “Hierarchical computing for dynamic evolutionary inference of complexity.” The grant allowed his team to develop new tools to handle the growing variety of biological data available to researchers. Rather than treating plant evolution as a single uniform question, the tools allow them to break larger questions into smaller, more manageable pieces. The team can also compare species across many different data types at once, drawing on physical traits, genetic sequences, biochemical activity and gene expression levels. As part of this shift in computational approach, the team pivoted from using a single large machine to many individual machines to analyze data.

“If you have a large, monolithic dataset, you can analyze it all at once because it is essentially telling one story,” Smith said. “But a large, heterogeneous dataset has many small stories to tell. Breaking that analysis into smaller parts isn’t just preferable in terms of computational resources, it also makes more biological sense to analyze separately. The MICDE Catalysis grant specifically supported our development of a hierarchical computing framework that makes this kind of dynamic, multiscale inference possible.”

The impact of Smith’s work goes far beyond plants and trees. The methods developed in his lab also apply to the study of infectious diseases, where his team uses them to analyze the diseases’ molecular origins and evolution. Smith also collaborates with the University of Michigan Medical School to study the evolution of antibiotic resistance.

Image courtesy of Stephen Smith: Phylogeny of plants

Smith’s work also helps inform conservation decisions. Understanding plant evolution and the relationships between species is critical to plant and tree management. By constructing large phylogenetic gene trees for plant species, conservationists can recognize which taxa are unique to a particular area. One of his current projects on North American flora aims to determine how resilient various plant species are to changes in land use and shifting climate.

Because few researchers are investigating the questions Smith is asking at the same scale, he had to develop his own software. Essentially, the software and the science had to advance together.

Middle school students participating in the WISE/GISE summer camp are taking photos of leaves. The photos will be used with the machine learning and AI tools Smith’s team created with funding from MICDE.

Through computational modeling, Smith and his team determined that the longer a plant lineage’s lifespan, the slower its rates of molecular evolution tend to be. This may answer, at least in part, why some species evolve slowly and others more rapidly, suggesting that lifespan itself plays a key role in shaping evolutionary rates. Yet the research also showed, unexpectedly, that the longer a species lives, the more gene tree conflict occurs.

His research also links rapid morphological change, meaning shifts in the physical appearance of plants, to gene tree conflict. For example, in branches of the phylogeny where major morphological innovations occurred, including the evolution of flowers or seeds, Smith observed both gene tree conflict and genome duplications.

Genome duplications occur when an organism inherits an extra copy of its entire genome. Over time, this duplicated genome can become fixed in the lineage. That extra genetic material serves as fuel for evolutionary innovation, providing the raw material from which new traits can evolve. Researchers have found that these genome duplications often coincide with periods of the greatest biological change in evolutionary history. Tracking those events is now helping Smith’s team identify precisely where to look to understand how major evolutionary leaps occurred.

Looking ahead, Smith’s lab is using AI to help gather data. Will Weaver, a graduate student in the lab, developed tools called LeafMachine and VoucherVision that use machine learning to identify and measure plant parts from museum and field images, extracting millions of measurements. Postdoc Shelly Gaynor is developing an AI package to detect genome duplications from raw sequence data. By integrating genomic data, museum collections, fossils and computational models across hundreds of thousands of species, Smith’s lab is working to understand how major evolutionary innovations arise and reshape the evolutionary tree of life.

“It’s quite liberating that we do our own computational work in addition to our biological work, because we’re free to ask whatever question we would like,” Smith said. “We are not limited by existing tools. Each set of results raises new biological questions, which drive us to develop new software to analyze new data. It’s an iterative approach, and that cycle is what fuels creativity in our lab.”