home

about

people

education

publications

news

upcoming events

annual programs

seminars

governance

   

 

 

Current Topics Workshop Titles and Abstracts


Author: James Farris, Moecular Systematics Laboratory, Swedish Museum of Natural History

Streaming Video: Real Media


Speaker: Walter M. Fitch, University of California, Irvine
Authors: HoangMinh HoDac, Robert Wallace and Walter M. Fitch
Title: Inferring migration from phylogenies

Streaming Video: Real Media

I will be presenting material based on the phylogenetic analysis of 181 hemagglutinin sequences from human viruses of the B type. Then, accepting the resulting phylogeny as substantially correct, I construct for those viruses a single character whose character states are the geographic locations from which the viruses were isolated. One can then determine the most parsimonious accounting of the geographic vector on the previously determined tree. A common vector consisting of nucleotides would interpret a change from C to T as a mutation from Cytidine to Thymine. But for my vector it would mean a migration of a virus from California to Texas. One might hope that the resulting data would lend, over several concatenated migration paths a general path by which influenza viruses spread. Initial results suggest multiple available paths which, in turn, will require more sequences to tease out the migration routes from the noise.


Author: Gonzalo Giribet, Museum of Comparative Zoology, Department of Organismic & Evolutionary Biology, Harvard University
Title: What is large? Analyzing sets of unaligned sequence data

Streaming Video: Real Media

While people focusing on "large data sets" generally refers to hundreds or thousands of terminals, the problem of analyzing sets of unaligned data considerably reduces the number of terminals that can be analyzed in an efficient manner. The reason for that is the relationship between the number of possible alignments that exist for a given set of sequences and the number of possible trees for a given number of terminals. Therefore, tree search for unaligned data includes a nested series of NP-hard problems. In this talk I will explore issues about analyzing medium-size data sets when they are not pre-aligned and the strategies we are exploring for such data sets.


Author: Pablo A. Goloboff, Instituto Superior de Entomología, Consejo Nacional de Investigaciones Científicas y Técnicas, Miguel Lillo 205, 4000 San Miguel de Tucumán, Argentina
Title: On divide and conquer strategies for parsimony analysis of large data sets

Streaming Video: Real Media

A recent paper (Roshan et al., 2004) described a "divide-and-conquer" technique for analysis of large data sets, rec-i-dcm3, and stated that it compares very favorably to results using TNT (the fastest parsimony program; Goloboff et al., 2003). The technique is based on creating reduced data sets, finding most parsimonious trees for the reduced data set, and then merging all parts together. The software Roshan et al. developed submits the reduced data sets to TNT; all their software does is creating the reduced data sets (selecting disjoint sets of terminals) and then combining the results produced by TNT. Although not stated in the paper, the programs for rec-i-dcm3 actually do a complete round of TBR swapping to the complete data set (via TNT) every time the results from the reduced data sets are combined (optimality for a reduced data set rarely produces score improvements for the entire data set). The divide-and-conquer method in TNT (sectorial searches; Goloboff, 1999), avoids this by selecting non-disjoint sets of HTU's (which produces exact length calculations). In this talk, I will show that Roshan et al.'s claims that rec-i-dcm3 outperforms the techniques in TNT is poorly substantiated. First, the settings they used for the TNT runs were extremely poor; very simple settings for TNT would have done a much better job. Second, having TNT analyze larger subproblems, with more exhaustive algorithms (earlier versions only used multiple random addition sequences with TBR, or a few rounds of tree-drifting), produces much better results (which is merely a difference in implementation; the basic algorithms are the same). When this is done, the combined techniques used in TNT clearly outperform rec-i-dcm3. On the other side, rec-i-dcm3 depends on a global round of TBR after each cycle of subdivision, so that using as search engine any program other than TNT becomes unfeasible in the case of very large data sets (e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP* runs about 800 times slower than the one in TNT). The global TBR becomes more and more critical as the data set is divided into smaller subproblems (because then the combination of the results produces a more suboptimal tree), so that there is a clear limit to the number of taxa that can be reasonably analyzed with rec-i-dcm3. Additionally, since creating reduced data sets to improve the results produces invariably a worse tree (which is subsequently improved by global TBR), rec-i-dcm3, is not truly a divide-and-conquer strategy (as publicized), but instead a technique for cyclic perturbations and improvements.

References:

Goloboff, P. 1999. Analyzing large data sets in reasonable times. Cladistics 15:415-428.

Goloboff, P., J. Farris, K. Nixon. 2003. T.N.T.: Tree Analysis Using New Technology. Program and documentation, available at http://www.zmuc.dk/public/phylogeny/tnt.

U. Roshan, B. M. E. Moret, T. L. Williams, T. Warnow, Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees, Proceedings of the IEEE Computational Systems Bioinformatics (CSB04) 2004, Stanford (CA), USA (2004). Software available at http://www.cs.njit.edu/usman/RecIDCM3.html.


Speaker: Daniel Janies, Department of Biomedical Informatics, The Ohio State University
Co-authors: Pablo Goloboff, Instituto Superior de Entomología, Consejo Nacional de Investigaciones Científicas y Técnicas; Diego Pol, Mathematical Biosciences Institute, The Ohio State University
Title: Applications of large-scale phylogenetic analysis for research in emerging infectious disease

Presentation materials: PPT

Streaming Video: Real Media

Here we report on the application of large-scale phylogenetic analysis to track host shifts and measure surveillance quality among influenza A. Emerging infectious diseases and organisms present critical issues for public health and economic welfare and are thus important to monitor.

For example, historically antigenic typing of the hemagglutinin (HA) and neuraminidase (N) genes is the lens through which the World Health Organization tracks influenza A and recommends vaccines. In recent years, highly pathogenic H5N1 subtype strains of avian influenza have emerged in wild and domestic birds and humans and other mammals in Asia. Alarmingly, in the past few months H5N1 has spread through avian populations in Eastern Europe and by some accounts threatens to become pandemic among humans. Related viruses of H5N2 subtype also circulate among birds in the Americas but these so far have only affected birds.

Recently, as demonstrated by the coordinated international response to severe acute respiratory syndrome and avian influenza, emerging infectious diseases are now being addressed via the collection of nucleotide sequence data. However, our ability to derive information from large sequence datasets lags far behind their acquisition. Our work is directed at addressing this gap between data collection and analysis.

To perform a longitudinal study of influenza A, we use the best data currently in the public domain. We have performed phylogenetic and character optimization analysis of 2359 hemagglutinin nucleotide sequences from isolates of influenza A using heuristic searches in parallel TNT. These sequences were derived from viral isolates that range temporally from 1902 to present, occur in human, avian, and several mammal hosts, and represent 16 antigenic subtypes.

Direct human infection by avian strains of influenza A is considered rare. It is hypothesized swine act as an intermediate host. Our phylogenetic analysis of reveals multiple independent events of avian to human transmission without intermediate hosts.

We also use this large comprehensive analysis to assess the quality of surveillance of influenza A. Compared to a null hypothesis of no correspondence among date of isolation of viruses and the temporal pattern implied by the phylogenetic hierarchy of viruses, our ability to reconstruct the spread of H5 viruses is better than expected. However, surveillance quality of H5 viruses in Asia is much better than in the Americas and this has been true since 1975.


Author: Bret Larget, Departments of Botany and Statistics, University of Wisconsin - Madison
Title: Bayesian MCMC Approaches for large gene-order phylogenies

Methods for the analysis of genome arrangements for phylogenetic inference are complicated by the relative size of the space of possible arrangements. Unlike DNA, amino acid, or codon sequences where sites can be modeled as independent and there are 4, 20, or 61 possible states, genome arrangements must be considered as a single complicated character. In the case of animal mitochondrial genomes with 37 genes arranged on a circle, there are over 2*10^{52} arrangements. This massive state space requires extensive modifications to the algorithms for both likelihood and parsimony based analysis.

In this talk I will describe the Bayesian approach to phylogenetic inference from genome arrangements with several example data sets. I will make comparisons between the Bayesian approach and the parsimony approach for genome arrangement data.


Author: Bernard Moret, Department of Computer Science, University of New Mexico
Title: Large-scale phylogenetic reconstruction, the Tree of Life, and CIPRES

Presentation materials: PDF

Streaming Video: Real Media

In this talk, I will review current computational activities aimed at reconstructing the Tree of Life, the evolutionary history of all living organisms (an estimated 10-100 million). Researchers and funding agencies worldwide have put renewed emphasis on the establishment of evolutionary relationships among living species; such relationships are fundamental to research in medicine, drug design, agriculture, ecology, and many other areas.

The CIPRES (Cyber Infrastructure for Phylogenetic Research), which I direct, was founded to develop the informatics tools required to attempt a reconstruction of the Tree of Life. I will sketch the goal and current achievements of CIPRES, comment on future needs, and relate its work to that of other research efforts in phylogeny.

I will then focus on specific challenges that arise in reconstructing trees with 100,000 or more leaves, with particular emphasis on sources of error and on the methodological advances we will need to evaluate the quality of such reconstructions.


Author: Diego Pol, Mathematical Biosciences Institute, The Ohio State University
Co-authors: Pablo Goloboff, CONICET, INSUE, Instituto Miguel Lillo, Tucuman, Argentina; and Daniel Janies, Department of Biomedical Informatics, The Ohio State University
Title: Strategies for Parallelizing of Heuristic Tree Searches Using Parsimony

Parallel heuristic tree searches for phylogenetic analyses are explored for datasets consisting of 2,000 to 3,000 taxa using the maximum parsimony criterion. The efficiency of alternative strategies is assessed for several benchmark datasets. The searches were conducted in parallel in Beowulf clusters (using 10 to 60 processors) using several heuristic algorithms implemented in TNT. The searches were driven with the scripting language of TNT.

The identification of an efficient combination of heuristic algorithms and tuning of the algorithms parameters in the context of the parallel environment is critical to the overall efficiency of the parallel tree search. Stopping rules for the heuristic search are discussed, including the convergence to the best-known score of each dataset during independent trials and the stabilization of the strict consensus of most parsimonious trees.

The problems of large datasets with poor phylogenetic structure is discussed through the analysis of 2359 hemagglutinin genes of influenza type A that required the use of iterative constraints to achieve independent hits to minimum lengths in reasonable times.


Author: Usman Roshan, Computer Science Department, NJIT-CCS
Title: Co-evolution of DCMs and base-methods for phylogeny reconstruction

The Disk-Covering Method (DCM) is a divide-and-conquer booster technique for improving upon a given base method. It computes a decomposition on the input set of species into smaller overlapping subproblems, applies an external base method to compute subtrees, merges them into a supertree, and further refines the supertree with a local search method if necessary. DCMs have co-evolved with different base methods and optimization criteria since they were first introduced.

In this talk I will discuss the general framework of DCMs and discuss in detail the dataset decomposition aspect. I will present the successful application of DCMs for large-scale phylogeny reconstruction using (1) the neighbor joining method (under distance-based reconstruction), (2) GRAPPA (under gene-order data), (3) PAUP* and TNT (under maximum parsimony), (4) RAxML (under maximum likelihood), and (5) Poy (under generalized tree alignment). I will highlight how DCM evolved with the different base-methods and optimization criteria accompanied by various experimental performance studies.


Author: Alexandros Stamatakis, Institute of Computer Science, Foundation for Research and Technology-Hellas
Title: Computing Huge Trees with Maximum Likelihood: An HPC Perspective

Presentation materials: PPT

Streaming Video: Real Media

The inference of phylogenies based on the Maximum Likelihood (ML) criterion has recently been demonstrated to be NP-hard. However, significant algorithmic advances have been achieved over the last years which currently allow for ML-based analyses of 1,000 organisms within less than 24 hours on a single CPU. To date, most state-of-the-art ML programs are limited by their memory consumption to tree sizes of approximately 3,000 to 5,000 taxa. Thus, a new category of performance problems arises which mainly concerns technical issues such as memory consumption/organization, cache efficiency, and optimization of the likelihood functions.

Within this context, the talk will -by example of RAxML (Randomized Axelerated ML)- initially focus on the rarely documented technical implementation details, which are of growing importance. RAxML has inherited and extended the technically extremely efficient implementation of fastDNAml. A novel (unpublished) and very simple technical optimization of RAxML, which yields run-time improvements of approximately factor 10 on huge alignments comprising 25,000 taxa, will also be presented. In addition, the exploitation of fine-grained loop-level parallelism on SMPs (Symmetric MultiProcessing) and GPUs (Graphics Processing Units) will be addressed. Finally, potential algorithmic and technical challenges as well as solutions for future large-scale inferences of 100,000 taxa will briefly be discussed.


Author: Andrés Varón, Division of Invertebrate Zoology, American Museum of Natural History
Title: Minimum Description Length Phylogenetic Analysis

With the growing size of molecular datasets, additional transformation events are required under certain, not clearly defined, conditions. Hybridizations, duplications, inversions, tandem repeats, horizontal gene transfers, can be taken into consideration, but their relevance for a particular dataset is difficult to assess. We propose the Minimum Description Length Principle (MDL) as a more fundamental optimality criterion for phylogenetic inference. We define a framework for the application in its crude and refined versions; for biological research purposes certain machine details of importance are not relevant in the original MDL, and so, we elaborate (and prefer) the crude version to perform full phylogenetic analysis. Finally both MP and ML under static and dynamic homologies are shown to be two particular cases of this more general framework.


Author: Ward Wheeler, Division of Invertebrate Zoology, American Museum of Natural History
Title: Kolmogorov Complexity, Links with Parsimony and Likelihood, and Tests of Methods and Monophyly

Kolmogorov Cmplexity and the Minimum Description Length Principle provide a foundation to compare and understand the relationship between parsimony and likelihood methods. The case of binary characters is presented showing that condition Kolmogorov Complexity (K) of a cladogram given the root, is equal to the parsimony score and that the Tuffley and Steel (1997) likelihood can be derived from this value via the Universal Distribution. Furthermore, the Farris (1973) "maximum evolutionary path" likelihood can be viewed as the composition of edge transformation functions. The role of root complexity is presented in light of tests of monophyly and multiple origins of taxa. An example using metazoan ribosomal data is presented.
 

current topics workshops

scientific 2002-2003
scientific 2003-2004
scientific 2004-2005
scientific 2005-2006
scientific 2006-2007
scientific 2007-2008
scientific 2008-2009
scientific 2009-2010

courses


postdoctoral fellows


long term visitors