Papers on Phylogenetic Inference and Molecular Evolution

  • Kimmen Sjolander Bayesian Evolutionary Tree Estimation

    To appear in the proceedings of the "Computing in the Genome Era" conference, Washington, DC, March, 1997

    Abstract

    This paper gives the essentials of the method to infer (or reconstruct) an evolutionary tree for a set of proteins using relative entropy, a distance metric from information theory, in combination with Dirichlet mixture densities over amino acid distributions. Relative entropy and Dirichlet mixture priors used together allow us to identify key structural or functional positions in the molecule from the amino acid sequence alone, and constrain tree topologies to preserve these important positions within subtrees.

    This paper provides experimental results on Bacteriorhodopsin and homologs, comparing this method, Bayesian Evolutionary Tree Estimation (Bete), against Maximum Likelihood and Star Decomposition from the MOLPHY suite, Maximum Parsimony and Neighbor-Joining from the PHYLIP suite, and a quartet method (Puzzle) from Arndt von Haeseler and Strimmer. These results and others (data not shown) suggest that Bete provides several advantages over existing tree-estimation methods. It is robust to differing evolutionary clocks among taxa and to differing mutation rates at sites in the molecule, handles deletions of portions of the molecule among taxa, and produces tree topologies that agree more closely with accepted phylogenies and functional subgroups within the data. Bete is also computationally efficient in the number of taxa (n^2 log(n), where n is the number of taxa), so that large numbers of sequences (in the hundreds) may be used as input to the tree-estimation process.
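    The relative entropy mentioned in the abstract above can be illustrated with a minimal sketch. It only shows the standard definition on a toy four-letter alphabet; the function name and the example distributions are hypothetical, and how Bete combines this quantity with Dirichlet mixture densities is not reproduced here.

```python
import math


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits between two discrete distributions.

    p and q are sequences of probabilities over the same alphabet.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same alphabet")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            if q_i == 0.0:
                # p puts mass where q has none: the divergence is infinite.
                return math.inf
            total += p_i * math.log2(p_i / q_i)
    return total


# Hypothetical toy distributions (not data from the paper).
conserved = [0.85, 0.05, 0.05, 0.05]   # a strongly conserved position
background = [0.25, 0.25, 0.25, 0.25]  # a uniform background

print(relative_entropy(conserved, background))
print(relative_entropy(background, conserved))
```

    The two printed values differ: relative entropy is not symmetric in its arguments, so the order in which two distributions are compared matters.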

  • Kimmen Sjolander Phylogenetic inference in protein superfamilies: Analysis of SH2 domains

    To appear in the proceedings of ISMB98, Montreal, Canada, June 1998

    Abstract

    This paper demonstrates the use of Bayesian Evolutionary Tree Estimation on protein superfamilies. Once a tree is inferred, we employ minimum-description-length principles to determine a cut of the tree into subtrees, to identify the subfamilies in the data. This method is demonstrated on SH2-domain-containing proteins, resulting in a change in the SwissProt subfamily assignment for Src2_drome, and a suggested evolutionary relationship between Nck_human and Drk_drome, Sem5_caeel, Grb2_human and Grb2_chick. Analysis of conservation patterns in the context of the subfamily decomposition shows high conservation at binding pockets, suggesting the applicability of this method as a predictive tool for experimental verification. Results of different tree-reconstruction methods are compared for this data.

Papers on Dirichlet Mixture Priors

  • Kimmen Sjolander, Kevin Karplus, Michael Brown, Richard Hughey, Anders Krogh, I. Saira Mian and David Haussler. Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology. CABIOS, 1996

    Abstract

    This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested.
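    The combination of a Dirichlet mixture with observed counts described above can be sketched as a posterior-mean estimate. This is a minimal illustration under stated assumptions: the function names, the three-letter alphabet, and the mixture parameters are hypothetical and are not the published mixtures. Each component's posterior weight is taken proportional to its mixture coefficient times B(alpha + n) / B(alpha), where B is the multivariate Beta function.

```python
import math


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def posterior_mean(counts, weights, alphas):
    """Estimate symbol probabilities from counts under a Dirichlet mixture.

    counts  : observed count per symbol at one position
    weights : mixture coefficient of each component (sums to 1)
    alphas  : one Dirichlet parameter vector per component (all entries > 0)
    """
    # Log posterior weight of each component, up to a shared constant.
    log_post = []
    for q, alpha in zip(weights, alphas):
        summed = [a + n for a, n in zip(alpha, counts)]
        log_post.append(math.log(q) + log_beta(summed) - log_beta(alpha))
    top = max(log_post)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_post]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Mix the per-component posterior means by those weights.
    total = sum(counts)
    estimate = [0.0] * len(counts)
    for r, alpha in zip(resp, alphas):
        denom = total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            estimate[i] += r * (n + a) / denom
    return estimate


# Hypothetical two-component mixture over a three-letter alphabet.
weights = [0.6, 0.4]
alphas = [[5.0, 0.5, 0.5], [1.0, 1.0, 1.0]]
counts = [3, 0, 1]

print(posterior_mean(counts, weights, alphas))
```

    With few observations the estimate stays close to the prior components; as counts grow, the observed frequencies dominate, which is the behaviour the abstract relies on for small sets of known sequences.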

  • M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjolander, and D. Haussler. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In L. Hunter, D. Searls, and J. Shavlik, editors, Proc. of First Int. Conf. on Intelligent Systems for Molecular Biology, pages 47--55, Menlo Park, CA, July 1993. AAAI/MIT Press.

    Abstract

    A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It is shown that this Bayesian method can improve the quality of HMMs produced from small training sets. Specific experiments on the EF-hand motif are reported, for which these priors are shown to produce HMMs with higher likelihood on unseen data, and fewer false positives and false negatives in a database search task.

Papers on Hidden Markov Models

  • A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501--1531, February 1994.

    Abstract

    Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the alpha-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling. This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
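    Measuring how closely a sequence fits an HMM, as in the discrimination tests above, amounts to computing the probability of the sequence under the model. The sketch below uses the standard forward algorithm on a small, generic HMM; it is not the model architecture or training procedure of the paper, and the two-state model, its parameters, and the two-letter alphabet are hypothetical.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of an observed sequence under an HMM (forward algorithm).

    obs   : sequence of symbols
    start : start[s] is the probability of beginning in state s
    trans : trans[s][t] is the probability of moving from state s to t
    emit  : emit[s][x] is the probability that state s emits symbol x
    """
    if len(obs) == 0:
        raise ValueError("sequence must contain at least one symbol")
    n_states = len(start)
    # alpha[s]: probability of the prefix seen so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in range(n_states))
            for t in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model over the toy alphabet {"H", "P"}.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [{"H": 0.8, "P": 0.2}, {"H": 0.3, "P": 0.7}]

print(forward_likelihood("HHPH", start, trans, emit))
```

    For sequences of realistic length the products underflow, so a practical version would work with logarithms or rescale alpha at each step; scores are also usually normalized for sequence length before family members and non-members are compared.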

  • Haussler, D., Krogh, A., Mian, I.S., Sjolander, K. "Protein Modeling using Hidden Markov Models: Analysis of Globins", Proceedings of the Hawaii International Conference on System Sciences, January, 1993. Voted best in the category AI Technologies for Molecular Biology Analysis.

Papers on Stochastic Context-Free Grammars

  • Sakakibara, Y., Brown, M., Hughey, R., Mian, S., Sjolander, K., Underwood, R., Haussler, D. Stochastic Context-Free Grammars for tRNA Modeling. Nucleic Acids Research, 22(23):5112--5120, 1994.

    Abstract

    Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of tRNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. Results show that after having been trained on as few as 20 tRNA sequences from only two tRNA subfamilies (mitochondrial and cytoplasmic), the model can discern general tRNA from similar-length RNA sequences of other kinds, can find secondary structure of new tRNA sequences, and can produce multiple alignments of large sets of tRNA sequences. Our results suggest potential improvements in the alignments of the D- and T-domains in some mitochondrial tRNAs that cannot be fit into the canonical secondary structure.

  • Sakakibara, Y., Brown, M., Hughey, R., Mian, S., Sjolander, K., Underwood, R. and Haussler, D. The application of stochastic context-free grammars to folding, aligning and modeling homologous RNA sequences. UCSC Technical Report, UCSC-CRL-94-14, 1993

  • Sakakibara, Y., Brown, M., Hughey, R., Mian, S., Sjolander, K., Underwood, R., Haussler, D. Recent Methods for RNA Modeling Using Stochastic Context-Free Grammars, in Proceedings of the Asilomar Conference on Combinatorial Pattern Matching, 1994.