Gene Family Histories and Graph Data in Phylogenetics
- Peter Stadler (IZBI, Leipzig University)
Abstract
Genes evolve within their species by means of gene duplication, gene loss, and occasional horizontal gene transfer, leading to a family of related genes distributed over a set of species. In each speciation event, all genes are faithfully transmitted into the separating lineages. As long as horizontal transfer is the exception rather than the rule for most gene families, species are well-defined and, like their genes, evolve along trees. The history of a gene family is then defined as the mapping of the gene tree $T$ into the species tree $S$, such that inner vertices that represent speciations in $T$ are mapped to inner vertices of $S$. Such evolutionary scenarios can be used to define several useful vertex-colored graphs that represent partial information on the gene family history. For example, in the orthology graph, two vertices are adjacent if the corresponding genes have a speciation event as their last common ancestor. In the best match graph, a directed edge points from $x$ to $y$, two genes from different species, if $y$ is, in its species, a closest relative of $x$. In the LDT graph, $x$ and $y$ are adjacent if their last common ancestor is younger than the last common ancestor of the species in which they reside. The interest in these and other graphs stems from the fact that they can be inferred more or less directly from sequence similarity data without the need to construct the gene tree $T$ or the species tree $S$. We discuss to what extent gene family histories are determined by these graphs. Since these graphs have very specific mathematical structures, correcting empirical estimates to conform to these structures provides a powerful way of reducing noise in the data.
Taken together, this suggests a graph-based approach to gene family histories that avoids many of the practical issues with classical phylogenetic approaches that require accurate gene and species trees as a first step.
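As an illustration of the graph definitions in the abstract, the following minimal Python sketch derives the orthology graph and the best match graph from a small, known gene tree. The toy tree, the event labels, and all names (`PARENT`, `EVENT`, `SPECIES`, `orthology_graph`, `best_match_graph`) are hypothetical and chosen only for this example. The sketch goes from a tree to the graphs, which is the reverse of the inference direction discussed in the talk, and it uses the depth of the last common ancestor as a stand-in for "closest relative".

```python
from itertools import combinations

# Hypothetical toy gene tree (child -> parent). A duplication at the root r
# is followed by a speciation s1; the B copy of the second paralog was lost.
PARENT = {"a1": "s1", "b1": "s1", "s1": "r", "a2": "r"}
EVENT = {"r": "duplication", "s1": "speciation"}   # labels of inner vertices
SPECIES = {"a1": "A", "b1": "B", "a2": "A"}        # vertex coloring of the genes
LEAVES = ["a1", "b1", "a2"]


def ancestors(v):
    """Return the path from v up to the root, starting with v itself."""
    path = [v]
    while v in PARENT:
        v = PARENT[v]
        path.append(v)
    return path


def lca(x, y):
    """Last common ancestor of x and y in the gene tree."""
    above_x = set(ancestors(x))
    for v in ancestors(y):
        if v in above_x:
            return v
    raise ValueError("vertices are not in the same tree")


def depth(v):
    """Number of edges between v and the root."""
    return len(ancestors(v)) - 1


def orthology_graph(leaves):
    """Undirected edges {x, y} whose last common ancestor is a speciation."""
    return {
        frozenset((x, y))
        for x, y in combinations(leaves, 2)
        if EVENT[lca(x, y)] == "speciation"
    }


def best_match_graph(leaves):
    """Arcs (x, y): y is a closest relative of x among the genes of y's species."""
    arcs = set()
    for x in leaves:
        for sp in set(SPECIES.values()) - {SPECIES[x]}:
            candidates = [y for y in leaves if SPECIES[y] == sp]
            # All lca(x, y) lie on the path from x to the root, so the
            # deepest one is the most recent common ancestor.
            best = max(depth(lca(x, y)) for y in candidates)
            arcs |= {(x, y) for y in candidates if depth(lca(x, y)) == best}
    return arcs


print(orthology_graph(LEAVES))
print(best_match_graph(LEAVES))
```

In this toy scenario the only orthologous pair is `a1` and `b1`, while the best match graph additionally contains the arc from `a2` to `b1`: after the loss, `b1` is the closest relative of `a2` in species B although the two genes are separated by a duplication. The reverse arc is absent because `b1` has the closer relative `a1` in species A, which shows why best matches are directed.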