Date(s) - 28/05/2018
11:00 am - 12:00 pm
Studio Villa Bosch
Beyond software tuning: scaling up comparative coding sequence analysis using approximations and models that adapt their complexity to the data
By Sergei L. Kosakovsky Pond, Department of Biology, Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, USA
Genetic sequence data are being generated at an ever-increasing pace, while many analytical techniques that are commonly used to make biologically meaningful inferences on these data are still “stuck” in the “small data” age. For example, a practical upper bound on the number of sequences that can be analyzed with many popular comparative phylogenetic methods is 1000, especially if codon-substitution models are used. These types of models are an essential tool for deciphering the action of natural selection on genetic sequences, and have been used extensively in biomedical and basic science applications, for example to quantify pathogen evolution: drug resistance, zoonotic adaptation, immune escape.
We show how this number can be raised by several orders of magnitude, enabling in-depth study of gene-sized alignments with 10,000 to 100,000 sequences, much more extensive model testing, or the implementation of more realistic models with added complexity. This can be accomplished by adapting machine learning techniques originally developed for large-scale data mining (latent Dirichlet allocation models) and for variable selection.
Specifically, we describe a relatively general approximation technique to limit the number of expensive likelihood function evaluations a priori, by discretizing a part of the parameter space to a fixed grid, estimating other parameters using much faster, simpler models, and integrating over the grid using MCMC or a variational Bayes approach. We demonstrate how this technique can achieve 100× or greater speedups for detecting sites subject to positive selection, while improving statistical performance. Other analyses where there are only 2-3 parameters of interest (e.g. detection of directional selection in protein sequences) can be accommodated. When discretization is not appropriate, it is often possible to develop methods that employ variable parametric complexity chosen with an information-theoretic criterion. For example, in the Adaptive Branch Site Random Effects model, we quickly select and apply models of different complexity to different branches in the phylogeny, and deliver statistical performance matching or exceeding best-in-class existing approaches, while running an order of magnitude faster.
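To give a feel for the grid idea, here is a minimal Python sketch, not the speaker's implementation. It assumes the expensive per-site likelihoods have already been evaluated once at each point of a fixed grid (the table `toy_likelihoods` is made-up illustrative data, and all function and variable names are hypothetical). For simplicity it estimates the grid weights with plain expectation-maximization, a deterministic stand-in for the MCMC or variational Bayes integration mentioned in the abstract.

```python
# Sketch: once likelihoods are cached on a fixed grid, fitting the mixture
# over grid points needs no further likelihood evaluations.

def fit_grid_weights(site_likelihoods, n_iter=200):
    """Estimate mixture weights over grid points with EM.

    site_likelihoods[s][k] is the precomputed likelihood of site s
    evaluated at grid point k (assumed strictly positive somewhere per site).
    """
    n_sites = len(site_likelihoods)
    n_grid = len(site_likelihoods[0])
    weights = [1.0 / n_grid] * n_grid  # start from a uniform mixture
    for _ in range(n_iter):
        totals = [0.0] * n_grid
        for row in site_likelihoods:
            # E-step: responsibility of each grid point for this site
            joint = [w * lik for w, lik in zip(weights, row)]
            norm = sum(joint)
            for k in range(n_grid):
                totals[k] += joint[k] / norm
        # M-step: new weight is the average responsibility
        weights = [t / n_sites for t in totals]
    return weights


def site_posteriors(site_likelihoods, weights):
    """Posterior probability of each grid point for each site."""
    posteriors = []
    for row in site_likelihoods:
        joint = [w * lik for w, lik in zip(weights, row)]
        norm = sum(joint)
        posteriors.append([j / norm for j in joint])
    return posteriors


# Hypothetical grid over a selection parameter (values are illustrative only).
grid = [0.5, 1.0, 2.0]

# Made-up cached likelihoods: 4 sites x 3 grid points.
toy_likelihoods = [
    [0.90, 0.08, 0.02],
    [0.80, 0.15, 0.05],
    [0.01, 0.09, 0.90],
    [0.85, 0.10, 0.05],
]

weights = fit_grid_weights(toy_likelihoods)
posteriors = site_posteriors(toy_likelihoods, weights)
```

The point of the design is that the costly step (filling the likelihood table) happens a fixed number of times, set by the grid size, before any fitting begins; everything afterwards is cheap arithmetic on the cached table.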
For registration please contact Benedicta Frech: email@example.com