Tandy Warnow (UIUC) - Theoretical and Empirical Advances in Large-Scale Species Tree Estimation
The estimation of the “Tree of Life”, a phylogeny encompassing all life on Earth, is one of the grand challenges of science. Maximum likelihood (ML) is a standard approach for phylogeny estimation, but estimating ML trees for large heterogeneous datasets is challenging for two reasons: (1) ML tree estimation is NP-hard (and the best current heuristics can use hundreds of CPU years on relatively small datasets, just to find local optima), and (2) the statistical models used in ML tree estimation methods are much too simple, failing to acknowledge heterogeneity across genomes or across the Tree of Life. These two “big data” issues, dataset size and heterogeneity, impact the accuracy of phylogenetic methods and have consequences for downstream analyses.
In this talk, I will describe a new “divide-and-conquer” approach to phylogeny estimation that addresses both types of heterogeneity. Our protocol operates as follows: (1) we divide the set of species into disjoint subsets, (2) we construct a tree on each subset (using appropriate statistical methods), and (3) we combine the subset trees into a tree on the full species set using auxiliary information, such as a matrix of pairwise distances. I will present three strategies of this kind (all published in the last year) that improve the theoretical and empirical performance of phylogeny estimation methods. One of the main applications of this work is species tree estimation from multi-locus datasets when gene trees can differ from the species tree due to incomplete lineage sorting. This talk is largely based on joint work with my PhD student, Erin Molloy (Illinois).
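The three-step protocol can be illustrated with a toy sketch. This is not any of the methods presented in the talk: it assumes the disjoint subsets are already given, uses a simple UPGMA-style average-linkage clustering as a stand-in for the "appropriate statistical methods" of step 2, and merges the subset trees in step 3 by the same average-linkage rule on the distance matrix, which forces each subset tree to appear as a clade (real merging methods need not impose this). All function names, the nested-tuple tree encoding, and the distance values are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels of a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def avg_dist(tree_a, tree_b, dist):
    """Average pairwise distance between the leaves of two trees."""
    a, b = leaves(tree_a), leaves(tree_b)
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))


def agglomerate(trees, dist):
    """Repeatedly join the two closest trees (average linkage) until one remains."""
    trees = list(trees)
    while len(trees) > 1:
        i, j = min(
            combinations(range(len(trees)), 2),
            key=lambda p: avg_dist(trees[p[0]], trees[p[1]], dist),
        )
        joined = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [joined]
    return trees[0]


def divide_and_conquer(subsets, dist):
    """Toy pipeline: build a tree per subset, then merge the subset trees."""
    # Step 1 (divide) is assumed done: `subsets` is a disjoint partition.
    # Step 2: build one tree per subset (stand-in for a statistical method).
    subset_trees = [agglomerate(subset, dist) for subset in subsets]
    # Step 3: combine the subset trees using the pairwise distance matrix.
    return agglomerate(subset_trees, dist)


# Hypothetical distances for six species; keys are unordered pairs.
dist = {
    frozenset(pair): d
    for pair, d in {
        ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
        ("D", "E"): 2, ("D", "F"): 6, ("E", "F"): 6,
    }.items()
}
for x in "ABC":
    for y in "DEF":
        dist[frozenset((x, y))] = 10

subsets = [["A", "B", "C"], ["D", "E", "F"]]
tree = divide_and_conquer(subsets, dist)
print(tree)
```

The design point the sketch tries to convey is that the expensive tree-building step only ever runs on small subsets, while the merge step relies on cheap auxiliary information (here, the distance matrix) to relate the subsets to one another.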
Tandy Warnow
Tandy Warnow is the Founder Professor of Computer Science at the University of Illinois at Urbana-Champaign, where she is also an affiliate in Mathematics, Statistics, Bioengineering, Electrical and Computer Engineering, Animal Biology, Entomology, and Plant Biology. Tandy received her PhD in Mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavaré and Michael Waterman at USC. Her research combines computer science, statistics, and discrete mathematics, focusing on developing improved models and algorithms for reconstructing complex and large-scale evolutionary histories in biology and historical linguistics. She has published more than 160 papers and one textbook, graduated 11 PhD students, and has 5 current PhD students. She has been a visiting faculty member at many universities, including Princeton University, the University of Maryland, Yale University, Ecole Polytechnique Fédérale de Lausanne (EPFL), and Harvard University. Her awards include the NSF Young Investigator Award (1994), the David and Lucile Packard Foundation Award (1996), a Radcliffe Institute Fellowship (2006), and the John Simon Guggenheim Foundation Fellowship (2011). She was elected a Fellow of the Association for Computing Machinery (ACM) in 2015 and of the International Society for Computational Biology (ISCB) in 2017. Her national service includes serving as the lead NSF program officer for BigData (2012-2013) and chairing the BioData Management and Analysis (BDMA) study section at NIH (2010-2012). Tandy was also a member of the Big Data Senior Steering Group of the NITRD subcommittee of the National Science and Technology Council (coordinating federal agencies) in 2012-2013.