New maximum likelihood methods for estimating empirical models of amino acid substitution
- Simon Whelan (Cambridge)
- Nick Goldman
Empirical models of amino acid substitution involve estimating, from a limited set of data, a matrix describing the relative rates of instantaneous change between the different amino acids, under the assumption that the recovered model is applicable to all future sets of observed data. Such models are useful for a variety of purposes, for example phylogenetic inference, detection of homologous sequences in protein databases, and protein structure prediction. Current maximum likelihood approaches based on a complete evolutionary model of the observed data provide a relatively accurate estimate of the empirical model, but they are limited in the amount of sequence data that can be analysed and consequently in the generality of the resulting model. Procedures based on naive counts of the number of amino acid replacements between large numbers of pairs of related sequences suffer from errors inherent in the counting process and discard a large amount of the information present in the sequences. We have recently introduced an approximation to the maximum likelihood method that can estimate an accurate model from large amounts of sequence data. This model has been compared with other widely used models of amino acid substitution for phylogenetic inference and has been found to provide a better description of the evolutionary process in the majority of cases. We have also recently assessed the impact of modelling rate variation when estimating amino acid substitution models.
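As a rough illustration of what "a matrix describing the relative rates of instantaneous change" means in practice, the sketch below builds a small rate matrix Q and converts it into substitution probabilities P(t) = exp(Qt) over an evolutionary distance t. It is not the estimation method described in this abstract. Several things here are assumptions made for the example: the time-reversible parameterisation (symmetric exchangeabilities multiplied by equilibrium frequencies), the three-state toy alphabet standing in for the 20 amino acids, the made-up numerical values, and the function names `build_rate_matrix` and `transition_probs`.

```python
def build_rate_matrix(exch, freqs):
    """Build a rate matrix Q with Q[i][j] = exch[i][j] * freqs[j] for i != j.

    Assumes exch is symmetric (a reversible model) and freqs sums to 1.
    Q is rescaled so that the expected substitution rate at equilibrium is 1.
    """
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i])
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[x * scale for x in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term becomes m^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(Q t) = (exp(Q t / 2^s))^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Toy values for illustration only (hypothetical, not estimated from data).
EXCH = [[0.0, 2.0, 0.5],
        [2.0, 0.0, 1.0],
        [0.5, 1.0, 0.0]]
FREQS = [0.5, 0.3, 0.2]

Q = build_rate_matrix(EXCH, FREQS)
P = transition_probs(Q, 0.5)

for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

Each row of P gives the probabilities of one state being replaced by each of the others (or remaining unchanged) after distance t, so every row sums to 1. For a real 20-state model, a numerical library routine such as `scipy.linalg.expm` or an eigendecomposition of Q would normally replace the hand-written exponential used here.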