Phylogenetic gathering in Leipzig in the week from May 23 to May 28, 2004.
The Max Planck Institute for Mathematics in the Sciences (MIS) and the Interdisciplinary Center for Bioinformatics (IZBI) in Leipzig will sponsor a (comparatively small and very informal) workshop on "Phylogenetic Combinatorics". Topics will range from the integration of palaeontological and molecular data to tree-space analysis, rank-based clustering methods, and the analysis of proteomics data, e.g. microarray data and fluorescence data.
The idea is that participants start arriving from Sunday, May 23, onwards, to meet in smaller or larger groups for entirely informal discussions and joint work on whatever they may want to work on, and that an officially announced workshop will start on Thursday, May 27.
From outside Germany, the following scientists have already agreed to join this meeting: Alex Grossmann (Paris), David Epstein (Warwick), Mike Steel (Christchurch, NZ), and Bill Atchley (Raleigh, NC, USA).
The distribution of language family sizes
When the sizes of language families of the world, measured by the number of languages contained in each family, are plotted against their size ranking, the distribution is seen to approximately follow a power law y = ax^(-b). It is suggested that this apparent power-law distribution of language family sizes is of relevance when evaluating overall classifications of the world's languages, for the analysis of taxonomic structures, and for developing hypotheses concerning the prehistory of the world's languages. It seems that three different major generating models will all eventually lead to power-law distributions: preferential attachment (in the case of networks), the Galton-Watson stochastic branching process (or some version thereof), and (some version of) the sand-pile model. General ingredients of a simulation procedure will be discussed, and the results of an initial attempt to simulate the distribution of language family sizes using the second of these models will be shown.
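As an illustration of the second generating model, the following minimal sketch simulates family sizes with a Galton-Watson branching process and prints the resulting rank-size list. The Poisson offspring distribution, its mean, and the number of generations are assumptions made purely for this illustration and are not the parameters used in the talk.

```python
# Minimal sketch: simulate "language family" sizes with a Galton-Watson
# branching process; plotting size against rank on a log-log scale would then
# show whether the simulated distribution resembles a power law.
import math
import random

def poisson(lam):
    """Simple Poisson sampler (Knuth's method), adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def galton_watson_size(mean_offspring=0.95, generations=30, cap=10000):
    """Population after a fixed number of generations, starting from one lineage.
    The offspring mean and generation count are illustrative assumptions."""
    population = 1
    for _ in range(generations):
        population = sum(poisson(mean_offspring) for _ in range(population))
        if population == 0 or population > cap:
            break
    return max(population, 1)   # count extinct families as single isolates

random.seed(1)
families = sorted((galton_watson_size() for _ in range(500)), reverse=True)
for rank, size in list(enumerate(families, start=1))[:10]:
    print(rank, size)
```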
Ranks of amino acid mutabilities as phylogenetic markers
Consider an arbitrary set of square matrices with non-negative entries. One may ask whether they can be represented in a way that mimics the logarithmic-scale display of one-dimensional data. Such a representation can be obtained by using elementary properties of matrix-valued logarithms.
The set of matrices may consist of the count matrices obtained from pairwise subalignments of an alignment of N sequences. The method associates a "rate" matrix to any pair of sequences, and makes it possible to consider two artificial continuous "evolutions" leading from one sequence to the other - if these "evolutions" exist.
The negative trace of the "rate" matrix is the LogDet distance between the two sequences.
The information contained in a "rate" matrix is richer than just its trace. The aim of this work is to exploit this additional information. This can be done, e.g., by focusing on the order of the values of the entries of the "rate" matrix, i.e. by rank methods.
We have used this method on proteins coded by a set of 123 mitochondrial genomes of metazoans.
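A hedged sketch of the construction described above, under assumed normalisation conventions: the pairwise count matrix is turned into a row-stochastic matrix, its matrix logarithm serves as the "rate" matrix, the negative trace reproduces a LogDet-style distance (since trace(log M) = log det M), and the ranks of the off-diagonal entries give the rank statistics. The 4x4 example counts are invented purely for illustration.

```python
import numpy as np
from scipy.linalg import logm
from scipy.stats import rankdata

# F: pairwise count matrix from a sub-alignment of two sequences (toy numbers).
F = np.array([[90,  3,  5,  2],
              [ 4, 80,  6, 10],
              [ 6,  5, 85,  4],
              [ 2,  9,  3, 86]], dtype=float)

P = F / F.sum(axis=1, keepdims=True)      # row-stochastic substitution matrix
Q = logm(P).real                          # "rate" matrix (principal log, assumed real)

logdet_distance = -np.trace(Q)            # equals -log(det(P))
print(logdet_distance, -np.log(np.linalg.det(P)))

off_diag = Q[~np.eye(4, dtype=bool)]      # the 12 off-diagonal "mutability" entries
print(rankdata(off_diag))                 # their ranks, as used by the rank method
```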
Report on Rank-based Analyses of B. subtilis Microarray Data
When is a protein a Myc?
A sequence signature is a small set of amino acid sites that, when considered simultaneously, provides an accurate identification of a specific set of proteins. Sequence signatures or predictive motifs can be powerful tools in proteomics. I will describe how we have integrated information theory, multivariate statistics and fuzzy logic searching procedures to find small contiguous sets of amino acids that give very accurate probabilistic identifications of protein families. This should be complemented by Hermann Ragg referring to his work on serpins.
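Since the full pipeline (information theory, multivariate statistics, fuzzy-logic searching) is not spelled out here, the following toy sketch only illustrates the information-theoretic ingredient: scoring alignment columns by their Shannon information content, so that conserved, family-specific sites stand out as candidate signature positions. The alignment and scoring details are assumptions for illustration only, not the authors' actual procedure.

```python
# Score each alignment column by Shannon information content (in bits);
# high-scoring columns are candidate signature sites.
import math
from collections import Counter

ALPHABET_SIZE = 20  # amino acids

def column_information(column):
    """log2(20) minus the Shannon entropy of the residues in this column."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(ALPHABET_SIZE) - entropy

alignment = ["MKVLAT",
             "MKVIAT",
             "MRVLGT",
             "MKVLAS"]          # toy sequences of equal length

columns = list(zip(*alignment))
scores = [column_information(col) for col in columns]
for pos, score in sorted(enumerate(scores, 1), key=lambda x: -x[1]):
    print(f"site {pos}: {score:.2f} bits")
```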
Subwords in reverse complement order
Given two natural numbers n and i, let F(n,i) denote the set of all strictly increasing maps from {1,2,...,i} into {1,2,...,n}.
Further, given an alphabet A := {a, a', b, b'} and a sequence s = s(1) s(2) ... s(n) of length n over A, let s' denote the sequence s' := s(n)' s(n-1)' ... s(1)', where x' := a' for x = a, x' := a for x = a', x' := b' for x = b, and x' := b for x = b'.
Finally, with n,i, A, and s as above, let Subword(s|i) denote the set Subword(s|i) := {s(f(1)) ... s(f(i)) : f in F(n,i)}.
Then, for any two sequences s, t of length n with entries from A, one has s = t or s = t' if and only if the union of the two sets Subword(s|[2n/3]) and Subword(s'|[2n/3]) coincides with the union of the two sets Subword(t|[2n/3]) and Subword(t'|[2n/3]).
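The following brute-force sketch implements the definitions above (writing the primed letters as upper-case characters and taking [2n/3] as the integer part) and searches short random words for violations of the stated equivalence; it is purely illustrative and not part of the talk.

```python
# Subword(s|i): all length-i subsequences of s; s': reverse complement of s.
from itertools import combinations
import random

COMPLEMENT = {"a": "A", "A": "a", "b": "B", "B": "b"}   # x' written as upper case

def rev_comp(s):
    """s' = complement of s, read backwards."""
    return "".join(COMPLEMENT[x] for x in reversed(s))

def subwords(s, i):
    """Subword(s|i) as a set of strings."""
    return {"".join(s[j] for j in f) for f in combinations(range(len(s)), i)}

def signature(s):
    i = (2 * len(s)) // 3
    return subwords(s, i) | subwords(rev_comp(s), i)

random.seed(0)
n = 9
for _ in range(200):
    s = "".join(random.choice("aAbB") for _ in range(n))
    t = "".join(random.choice("aAbB") for _ in range(n))
    same_signature = signature(s) == signature(t)
    same_word = (s == t) or (s == rev_comp(t))
    if same_signature != same_word:
        print("would contradict the stated claim:", s, t)
```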
A Distance Quartet Puzzling Algorithm
Phylogenetic analysis increasingly employs sophisticated mathematical tools, ranging from stochastic modelling of Markov processes, principal component analysis, and integer programming to various branches of combinatorics, including extremal combinatorics and the combinatorial analysis of multivariate relationships, in particular those derived from (dis)similarity data. The work presented here deals with the latter topic, namely the construction of phylogenetic trees from quartets (resolved trees on four leaves). Most formulations of this problem are NP-hard. Here we consider a new version that has a polynomial-time solution. We present applications of this algorithm for identifying putative clades and for elucidating spurious phylogenetic relationships. We also note that our algorithm can be applied to weighted sets of quartets.
Finally, we will present some output trees and differences using the four variants of the algorithm.
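For orientation, the elementary "distance quartet" step can be sketched as follows: a single quartet is resolved from a distance matrix via the four-point condition, i.e. by choosing the pairing with the smallest sum of distances. This is only the standard building block of quartet-based methods, not the polynomial-time puzzling algorithm presented in the talk; the toy distances come from a four-taxon tree with unit branch lengths.

```python
from itertools import combinations

def resolve_quartet(dmat, a, b, c, d):
    """Return the split supported by the distances, e.g. ((a, b), (c, d))."""
    pairings = [
        (dmat[a][b] + dmat[c][d], ((a, b), (c, d))),
        (dmat[a][c] + dmat[b][d], ((a, c), (b, d))),
        (dmat[a][d] + dmat[b][c], ((a, d), (b, c))),
    ]
    return min(pairings)[1]

# Additive distances on the tree ((A,B),(C,D)) with unit branch lengths.
dist = {
    "A": {"A": 0, "B": 2, "C": 3, "D": 3},
    "B": {"B": 0, "A": 2, "C": 3, "D": 3},
    "C": {"C": 0, "A": 3, "B": 3, "D": 2},
    "D": {"D": 0, "A": 3, "B": 3, "C": 2},
}

for quartet in combinations("ABCD", 4):
    print(quartet, "->", resolve_quartet(dist, *quartet))
```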
I describe some recent work with Charles Semple on 'phylogenetic clocks' which are a particularly simple type of phylogenetic network possessing some attractive combinatorial properties. I will also describe some recent work with Andreas Dress on the 'path index' of a tree and other combinatorial concepts (such as parsimony) that can be usefully investigated by taking an algebraic perspective.
There is a well-known topology on the space of all trees with given taxa, where each edge has a length (such trees are usually called weighted trees). I will describe this topology and define some metrics on it, including the L^2 metric of Billera, Holmes and Vogtmann. We have not studied these different metrics extensively in a biological context, but it seems as though the L^1 metric is the most biologically informative. Computing the L^1 distance between two trees is computationally trivial. Computing L^2 distances is a difficult problem, and it may turn out to be NP-hard, in one of the many senses of that term. On the other hand, I like the L^2 distance a lot, because it has beautiful mathematical properties (discovered by Billera, Holmes and Vogtmann). I will present a nice algorithm for computing the L^2 distance, much quicker than any naive algorithm, and will sketch the non-trivial proof that the algorithm gives the correct answer.
If I can prepare the results before the meeting, I may present some applications in biology of some of our computer programs.
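The "computationally trivial" L^1 computation can be sketched by representing each weighted tree as a map from its splits to edge lengths and summing absolute differences, with splits absent from a tree counted as length 0. The representation and the toy trees below are assumptions made purely for illustration, not the speaker's code.

```python
def l1_distance(tree_a, tree_b):
    """tree_a, tree_b: dicts mapping frozenset-of-taxa splits to edge lengths."""
    splits = set(tree_a) | set(tree_b)
    return sum(abs(tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) for s in splits)

# Two small weighted trees on taxa {A, B, C, D}; each edge is recorded by the
# split side not containing the reference taxon D.
t1 = {frozenset("A"): 1.0, frozenset("B"): 1.0, frozenset("C"): 1.0,
      frozenset("ABC"): 1.0, frozenset("AB"): 0.8}   # topology ((A,B),C,D)
t2 = {frozenset("A"): 1.2, frozenset("B"): 1.0, frozenset("C"): 0.9,
      frozenset("ABC"): 1.0, frozenset("AC"): 0.5}   # topology ((A,C),B,D)

print(l1_distance(t1, t2))   # 0.2 + 0.0 + 0.1 + 0.0 + 0.8 + 0.5 = 1.6
```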
Median networks, alias median graphs, have been rediscovered many times (in different mathematical languages). It is thus clear that they form a useful tool in many areas and are interesting on their own.
In the first part of the talk (given by S. Klavzar) we will recall some classical results on median graphs, like the fact that they are precisely the retracts of hypercubes. Then we will follow with some more recent results, for instance the so-called Euler-type formulas and the concept of cube polynomials (and its relation to median graphs).
In the second part (by W. Imrich), the focus will be on algorithms for the recognition of median graphs and related classes of graphs. Conceptually simple and easy-to-implement algorithms, more efficient but still practical ones, and algorithms of purely theoretical interest will be treated. The talk ends with a list of open problems.
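As a small illustration of the median operation behind median graphs (not of the recognition algorithms treated in the talk): in the hypercube, the unique median of three vertices is the coordinatewise majority vote, and it lies on a shortest path between every pair of the three.

```python
def hypercube_median(u, v, w):
    """Coordinatewise majority of three 0/1 vectors (vertices of a hypercube)."""
    return tuple(1 if x + y + z >= 2 else 0 for x, y, z in zip(u, v, w))

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

u, v, w = (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 1, 0)
m = hypercube_median(u, v, w)
print(m)

# The median lies on a geodesic between every pair of the three vertices.
for a, b in [(u, v), (u, w), (v, w)]:
    assert hamming(a, m) + hamming(m, b) == hamming(a, b)
```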
Tropical polytopes have recently been introduced by Develin and Sturmfels. They form a class of polyhedral complexes which generalize tight spans of finite metric spaces (and this is one reason why they are relevant for phylogenetic analysis). This renewed discrete geometry viewpoint on the subject paves the way to employ established algorithmic methods and to use existing software for standard problems in phylogenetics.
We report on recent implementations within the open source software framework polymake; see www.math.tu-berlin.de/polymake. These will become available with the new version 2.1, scheduled for release by the time of this talk.
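For readers unfamiliar with the subject, the min-plus ("tropical") arithmetic underlying tropical polytopes can be illustrated in a few lines. The snippet below is generic code, not polymake code, and makes no claim about the implementation reported in the talk.

```python
# Tropical addition is min, tropical multiplication is +; the tropical span of
# a set of points consists of all such min-plus linear combinations.
def tropical_combination(points, coefficients):
    """min-plus linear combination: min_i (lambda_i + p_i), coordinatewise."""
    dim = len(points[0])
    return tuple(min(lam + p[k] for lam, p in zip(coefficients, points))
                 for k in range(dim))

generators = [(0, 0, 0), (0, 2, 1), (0, 1, 3)]       # points in the tropical plane
print(tropical_combination(generators, (0, 0, 0)))   # -> (0, 0, 0)
print(tropical_combination(generators, (5, 0, 1)))   # -> (0, 2, 1), in the span
```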
(joint work with and )
I shall talk about preliminary results obtained by examining Kendall correlation matrices for expression data of E. coli and B. subtilis.
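The basic object mentioned here, a Kendall rank-correlation matrix computed from a genes-by-conditions expression matrix, can be sketched as follows; the random toy data merely stand in for the E. coli and B. subtilis expression sets, which are not reproduced.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
expression = rng.normal(size=(6, 20))     # 6 genes, 20 conditions (toy data)

# Pairwise Kendall tau between gene expression profiles.
n_genes = expression.shape[0]
corr = np.eye(n_genes)
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        tau, _ = kendalltau(expression[i], expression[j])
        corr[i, j] = corr[j, i] = tau

print(np.round(corr, 2))
```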
Usually, phylogenetic analyses make use of coding sequences. The reasons are historical ones: for a long time, geneticists focused on the evolution of coding sequences. They collected large amounts of orthologous genes with high levels of homology from distantly related species. Accordingly, the knowledge about evolutionary mechanisms and the strategies for multiple alignment and phylogenetic reconstruction were also derived from gene sequences.
With the emergence of non-coding sequences in the databases and adequate alignment tools, one could start to extract phylogenetic signals from orthologous non-coding sequences. These signals are more or less conserved non-coding nucleotides (CNCNs) that occur clustered at so-called phylogenetic footprints. In cases where the gene framework is highly conserved, large amounts of footprints can be extracted from adjacent non-coding sequences.
We used phylogenetic signals within the non-coding sequences of Hox clusters to look at sequence phylogenies and to shed light on the sequence of cluster duplication events during vertebrate evolution. Furthermore, we could use the large amounts of data together with statistical methods to study the loss and retention of conserved non-coding regions.
Hosts and their parasites are prominent model systems for studying coevolutionary processes. A fundamental problem in the theory of comparing host-parasite phylogenies is the reconstruction of past associations between hosts and parasites. Event-based methods for solving the reconstruction problem take advantage of knowledge about the likelihood of possible evolutionary events. In this talk, we give a short overview of event-based methods for the reconstruction problem, and present a tool that we have developed for solving this problem when extinction events are possible.
In my brief exposition, I will present (a) some of the reasons why historical-comparative linguists have been reluctant to adopt phylogenetic methods regularly used in other disciplines, and (b) some of the reasons why I think linguistics can benefit from considering such methods and perhaps even contribute to the refinement of such methods. One problem is the pervasiveness of "horizontal transmission", i.e. the effects of language contact, which can make it difficult or even impossible to arrive at a strictly cladistic account of language differentiation. Traditionally, linguists have used painstaking methods that give very accurate answers to certain questions (e.g. relative chronology of phonetic changes), but that leave many questions unanswered, and are (at least possibly) inapplicable beyond a relatively shallow time-depth (say, around 10,000 years). If historical linguistics is to break through this "time barrier", it is essential to consider methods that provide approximations at greater time-depths, even if the results of applying these methods are less solid than those linguists have been accustomed to considering.
In 1992, Bandelt and Dress introduced the split decomposition method, which takes a distance matrix as input and produces as output a collection of splits that are not necessarily compatible. This method was the motivation for the SplitsTree program, of which different versions have been written by Rainer Wetzel, Daniel Huson and David Bryant. In practice, split decomposition has turned out to be a very conservative method, and it becomes more and more timid as the size of the input set grows. More recent methods such as Neighbor-net, consensus networks, bootstrap networks or the Z-closure network are much more potent at producing splits, and this puts new demands on algorithms for representing them. In this talk we will illustrate this problem and discuss some ideas for addressing it.
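The quantity at the heart of split decomposition, the isolation index of a split, can be computed by brute force on small examples. The sketch below follows the Bandelt-Dress (1992) isolation-index formula, uses an invented four-taxon distance matrix, and is not the SplitsTree implementation; a split is retained when its index is positive.

```python
from itertools import combinations, product

def isolation_index(d, A, B):
    """alpha(A,B) = 1/2 * min over a,a' in A and b,b' in B of
    max(d(a,b)+d(a',b'), d(a,b')+d(a',b), d(a,a')+d(b,b')) - d(a,a') - d(b,b')."""
    best = float("inf")
    for a, a2 in product(A, repeat=2):
        for b, b2 in product(B, repeat=2):
            val = max(d[a][b] + d[a2][b2],
                      d[a][b2] + d[a2][b],
                      d[a][a2] + d[b][b2]) - d[a][a2] - d[b][b2]
            best = min(best, val)
    return best / 2.0

taxa = ["A", "B", "C", "D"]
d = {"A": {"A": 0, "B": 3, "C": 7, "D": 8},
     "B": {"A": 3, "B": 0, "C": 6, "D": 7},
     "C": {"A": 7, "B": 6, "C": 0, "D": 3},
     "D": {"A": 8, "B": 7, "C": 3, "D": 0}}

# Enumerate each split once by always keeping the first taxon on the A side.
first, rest = taxa[0], taxa[1:]
for k in range(len(rest)):
    for extra in combinations(rest, k):
        A = {first, *extra}
        B = set(taxa) - A
        print(sorted(A), "|", sorted(B), "->", isolation_index(d, A, B))
```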