|
Current Topics Workshop Titles and Abstracts
Author: James Farris, Moecular Systematics Laboratory, Swedish Museum of Natural History
Streaming Video: Real
Media
Speaker: Walter M. Fitch, University of California,
Irvine
Authors: HoangMinh HoDac, Robert Wallace and Walter M. Fitch
Title: Inferring migration from phylogenies
Streaming Video: Real
Media
I will be presenting material based on the phylogenetic analysis
of 181 hemagglutinin sequences from human viruses of the B type.
Then, accepting the resulting phylogeny as substantially correct,
I construct for those viruses a single character whose character
states are the geographic locations from which the viruses were
isolated. One can then determine the most parsimonious accounting
of the geographic vector on the previously determined tree. A common
vector consisting of nucleotides would interpret a change from C
to T as a mutation from Cytidine to Thymine. But for my vector it
would mean a migration of a virus from California to Texas. One
might hope that the resulting data would lend, over several concatenated
migration paths a general path by which influenza viruses spread.
Initial results suggest multiple available paths which, in turn,
will require more sequences to tease out the migration routes from
the noise.
Author: Gonzalo Giribet, Museum of Comparative
Zoology, Department of Organismic & Evolutionary Biology, Harvard
University
Title: What is large? Analyzing sets of unaligned sequence data
Streaming Video: Real
Media
While people focusing on "large data sets" generally refers to
hundreds or thousands of terminals, the problem of analyzing sets
of unaligned data considerably reduces the number of terminals that
can be analyzed in an efficient manner. The reason for that is the
relationship between the number of possible alignments that exist
for a given set of sequences and the number of possible trees for
a given number of terminals. Therefore, tree search for unaligned
data includes a nested series of NP-hard problems. In this talk
I will explore issues about analyzing medium-size data sets when
they are not pre-aligned and the strategies we are exploring for
such data sets.
Author: Pablo A. Goloboff, Instituto Superior
de Entomología, Consejo Nacional de Investigaciones Científicas
y Técnicas, Miguel Lillo 205, 4000 San Miguel de Tucumán,
Argentina
Title: On divide and conquer strategies for parsimony analysis of
large data sets
Streaming Video: Real
Media
A recent paper (Roshan et al., 2004) described a "divide-and-conquer"
technique for analysis of large data sets, rec-i-dcm3, and stated
that it compares very favorably to results using TNT (the fastest
parsimony program; Goloboff et al., 2003). The technique is based
on creating reduced data sets, finding most parsimonious trees for
the reduced data set, and then merging all parts together. The software
Roshan et al. developed submits the reduced data sets to TNT; all
their software does is creating the reduced data sets (selecting
disjoint sets of terminals) and then combining the results produced
by TNT. Although not stated in the paper, the programs for rec-i-dcm3
actually do a complete round of TBR swapping to the complete data
set (via TNT) every time the results from the reduced data sets
are combined (optimality for a reduced data set rarely produces
score improvements for the entire data set). The divide-and-conquer
method in TNT (sectorial searches; Goloboff, 1999), avoids this
by selecting non-disjoint sets of HTU's (which produces exact length
calculations). In this talk, I will show that Roshan et al.'s claims
that rec-i-dcm3 outperforms the techniques in TNT is poorly substantiated.
First, the settings they used for the TNT runs were extremely poor;
very simple settings for TNT would have done a much better job.
Second, having TNT analyze larger subproblems, with more exhaustive
algorithms (earlier versions only used multiple random addition
sequences with TBR, or a few rounds of tree-drifting), produces
much better results (which is merely a difference in implementation;
the basic algorithms are the same). When this is done, the combined
techniques used in TNT clearly outperform rec-i-dcm3. On the other
side, rec-i-dcm3 depends on a global round of TBR after each cycle
of subdivision, so that using as search engine any program other
than TNT becomes unfeasible in the case of very large data sets
(e.g. for Roshan et al.'s largest data set, the TBR swapper in PAUP*
runs about 800 times slower than the one in TNT). The global TBR
becomes more and more critical as the data set is divided into smaller
subproblems (because then the combination of the results produces
a more suboptimal tree), so that there is a clear limit to the number
of taxa that can be reasonably analyzed with rec-i-dcm3. Additionally,
since creating reduced data sets to improve the results produces
invariably a worse tree (which is subsequently improved by global
TBR), rec-i-dcm3, is not truly a divide-and-conquer strategy (as
publicized), but instead a technique for cyclic perturbations and
improvements.
References:
Goloboff, P. 1999. Analyzing large data sets in reasonable times.
Cladistics 15:415-428.
Goloboff, P., J. Farris, K. Nixon. 2003. T.N.T.: Tree Analysis
Using New Technology. Program and documentation, available at http://www.zmuc.dk/public/phylogeny/tnt.
U. Roshan, B. M. E. Moret, T. L. Williams, T. Warnow, Rec-I-DCM3:
A Fast Algorithmic Technique for Reconstructing Large Phylogenetic
Trees, Proceedings of the IEEE Computational Systems Bioinformatics
(CSB04) 2004, Stanford (CA), USA (2004). Software available at http://www.cs.njit.edu/usman/RecIDCM3.html.
Speaker: Daniel Janies, Department of Biomedical
Informatics, The Ohio State University
Co-authors: Pablo Goloboff, Instituto Superior de Entomología,
Consejo Nacional de Investigaciones Científicas y Técnicas;
Diego Pol, Mathematical Biosciences Institute, The Ohio State University
Title: Applications of large-scale phylogenetic analysis for research
in emerging infectious disease
Presentation materials: PPT
Streaming Video: Real
Media
Here we report on the application of large-scale phylogenetic analysis
to track host shifts and measure surveillance quality among influenza
A. Emerging infectious diseases and organisms present critical issues
for public health and economic welfare and are thus important to
monitor.
For example, historically antigenic typing of the hemagglutinin
(HA) and neuraminidase (N) genes is the lens through which the World
Health Organization tracks influenza A and recommends vaccines.
In recent years, highly pathogenic H5N1 subtype strains of avian
influenza have emerged in wild and domestic birds and humans and
other mammals in Asia. Alarmingly, in the past few months H5N1 has
spread through avian populations in Eastern Europe and by some accounts
threatens to become pandemic among humans. Related viruses of H5N2
subtype also circulate among birds in the Americas but these so
far have only affected birds.
Recently, as demonstrated by the coordinated international response
to severe acute respiratory syndrome and avian influenza, emerging
infectious diseases are now being addressed via the collection of
nucleotide sequence data. However, our ability to derive information
from large sequence datasets lags far behind their acquisition.
Our work is directed at addressing this gap between data collection
and analysis.
To perform a longitudinal study of influenza A, we use the best
data currently in the public domain. We have performed phylogenetic
and character optimization analysis of 2359 hemagglutinin nucleotide
sequences from isolates of influenza A using heuristic searches
in parallel TNT. These sequences were derived from viral isolates
that range temporally from 1902 to present, occur in human, avian,
and several mammal hosts, and represent 16 antigenic subtypes.
Direct human infection by avian strains of influenza A is considered
rare. It is hypothesized swine act as an intermediate host. Our
phylogenetic analysis of reveals multiple independent events of
avian to human transmission without intermediate hosts.
We also use this large comprehensive analysis to assess the quality
of surveillance of influenza A. Compared to a null hypothesis of
no correspondence among date of isolation of viruses and the temporal
pattern implied by the phylogenetic hierarchy of viruses, our ability
to reconstruct the spread of H5 viruses is better than expected.
However, surveillance quality of H5 viruses in Asia is much better
than in the Americas and this has been true since 1975.
Author: Bret Larget, Departments of Botany
and Statistics, University of Wisconsin - Madison
Title: Bayesian MCMC Approaches for large gene-order phylogenies
Methods for the analysis of genome arrangements for phylogenetic
inference are complicated by the relative size of the space of possible
arrangements. Unlike DNA, amino acid, or codon sequences where sites
can be modeled as independent and there are 4, 20, or 61 possible
states, genome arrangements must be considered as a single complicated
character. In the case of animal mitochondrial genomes with 37 genes
arranged on a circle, there are over 2*10^{52} arrangements. This
massive state space requires extensive modifications to the algorithms
for both likelihood and parsimony based analysis.
In this talk I will describe the Bayesian approach to phylogenetic
inference from genome arrangements with several example data sets.
I will make comparisons between the Bayesian approach and the parsimony
approach for genome arrangement data.
Author: Bernard Moret, Department of Computer
Science, University of New Mexico
Title: Large-scale phylogenetic reconstruction, the Tree of Life,
and CIPRES
Presentation materials: PDF
Streaming Video: Real
Media
In this talk, I will review current computational activities aimed
at reconstructing the Tree of Life, the evolutionary history of
all living organisms (an estimated 10-100 million). Researchers
and funding agencies worldwide have put renewed emphasis on the
establishment of evolutionary relationships among living species;
such relationships are fundamental to research in medicine, drug
design, agriculture, ecology, and many other areas.
The CIPRES (Cyber Infrastructure for Phylogenetic Research), which
I direct, was founded to develop the informatics tools required
to attempt a reconstruction of the Tree of Life. I will sketch the
goal and current achievements of CIPRES, comment on future needs,
and relate its work to that of other research efforts in phylogeny.
I will then focus on specific challenges that arise in reconstructing
trees with 100,000 or more leaves, with particular emphasis on sources
of error and on the methodological advances we will need to evaluate
the quality of such reconstructions.
Author: Diego Pol, Mathematical Biosciences
Institute, The Ohio State University
Co-authors: Pablo Goloboff, CONICET, INSUE, Instituto Miguel Lillo,
Tucuman, Argentina; and Daniel Janies, Department of Biomedical Informatics,
The Ohio State University
Title: Strategies for Parallelizing of Heuristic Tree Searches Using
Parsimony
Parallel heuristic tree searches for phylogenetic analyses are
explored for datasets consisting of 2,000 to 3,000 taxa using the
maximum parsimony criterion. The efficiency of alternative strategies
is assessed for several benchmark datasets. The searches were conducted
in parallel in Beowulf clusters (using 10 to 60 processors) using
several heuristic algorithms implemented in TNT. The searches were
driven with the scripting language of TNT.
The identification of an efficient combination of heuristic algorithms
and tuning of the algorithms parameters in the context of the parallel
environment is critical to the overall efficiency of the parallel
tree search. Stopping rules for the heuristic search are discussed,
including the convergence to the best-known score of each dataset
during independent trials and the stabilization of the strict consensus
of most parsimonious trees.
The problems of large datasets with poor phylogenetic structure
is discussed through the analysis of 2359 hemagglutinin genes of
influenza type A that required the use of iterative constraints
to achieve independent hits to minimum lengths in reasonable times.
Author: Usman Roshan, Computer Science Department,
NJIT-CCS
Title: Co-evolution of DCMs and base-methods for phylogeny reconstruction
The Disk-Covering Method (DCM) is a divide-and-conquer booster
technique for improving upon a given base method. It computes a
decomposition on the input set of species into smaller overlapping
subproblems, applies an external base method to compute subtrees,
merges them into a supertree, and further refines the supertree
with a local search method if necessary. DCMs have co-evolved with
different base methods and optimization criteria since they were
first introduced.
In this talk I will discuss the general framework of DCMs and discuss
in detail the dataset decomposition aspect. I will present the successful
application of DCMs for large-scale phylogeny reconstruction using
(1) the neighbor joining method (under distance-based reconstruction),
(2) GRAPPA (under gene-order data), (3) PAUP* and TNT (under maximum
parsimony), (4) RAxML (under maximum likelihood), and (5) Poy (under
generalized tree alignment). I will highlight how DCM evolved with
the different base-methods and optimization criteria accompanied
by various experimental performance studies.
Author: Alexandros Stamatakis, Institute
of Computer Science, Foundation for Research and Technology-Hellas
Title: Computing Huge Trees with Maximum Likelihood: An HPC Perspective
Presentation materials: PPT
Streaming Video: Real
Media
The inference of phylogenies based on the Maximum Likelihood (ML)
criterion has recently been demonstrated to be NP-hard. However,
significant algorithmic advances have been achieved over the last
years which currently allow for ML-based analyses of 1,000 organisms
within less than 24 hours on a single CPU. To date, most state-of-the-art
ML programs are limited by their memory consumption to tree sizes
of approximately 3,000 to 5,000 taxa. Thus, a new category of performance
problems arises which mainly concerns technical issues such as memory
consumption/organization, cache efficiency, and optimization of
the likelihood functions.
Within this context, the talk will -by example of RAxML (Randomized
Axelerated ML)- initially focus on the rarely documented technical
implementation details, which are of growing importance. RAxML has
inherited and extended the technically extremely efficient implementation
of fastDNAml. A novel (unpublished) and very simple technical optimization
of RAxML, which yields run-time improvements of approximately factor
10 on huge alignments comprising 25,000 taxa, will also be presented.
In addition, the exploitation of fine-grained loop-level parallelism
on SMPs (Symmetric MultiProcessing) and GPUs (Graphics Processing
Units) will be addressed. Finally, potential algorithmic and technical
challenges as well as solutions for future large-scale inferences
of 100,000 taxa will briefly be discussed.
Author: Andrés Varón, Division of Invertebrate Zoology,
American Museum of Natural History
Title: Minimum Description Length Phylogenetic Analysis
With the growing size of molecular datasets, additional transformation
events are required under certain, not clearly defined, conditions.
Hybridizations, duplications, inversions, tandem repeats, horizontal
gene transfers, can be taken into consideration, but their relevance
for a particular dataset is difficult to assess. We propose the
Minimum Description Length Principle (MDL) as a more fundamental
optimality criterion for phylogenetic inference. We define a framework
for the application in its crude and refined versions; for biological
research purposes certain machine details of importance are not
relevant in the original MDL, and so, we elaborate (and prefer)
the crude version to perform full phylogenetic analysis. Finally
both MP and ML under static and dynamic homologies are shown to
be two particular cases of this more general framework.
Author: Ward Wheeler, Division of Invertebrate Zoology, American
Museum of Natural History
Title: Kolmogorov Complexity, Links with Parsimony and Likelihood,
and Tests of Methods and Monophyly
Kolmogorov Cmplexity and the Minimum Description Length Principle
provide a foundation to compare and understand the relationship between
parsimony and likelihood methods. The case of binary characters is
presented showing that condition Kolmogorov Complexity (K) of a cladogram
given the root, is equal to the parsimony score and that the Tuffley
and Steel (1997) likelihood can be derived from this value via the
Universal Distribution. Furthermore, the Farris (1973) "maximum evolutionary
path" likelihood can be viewed as the composition of edge transformation
functions. The role of root complexity is presented in light of tests
of monophyly and multiple origins of taxa. An example using metazoan
ribosomal data is presented. |