Detecting introgression from phylogenetic invariant site patterns using machine learning
Patrick F. McKenzie, Deren A. R. Eaton
Abstract
Premise
Detecting historical introgression among populations or species from genomic data is a common goal in evolutionary genetics. Most current methods fall into two major categories: network inference and admixture inference. Network inference (e.g., SNaQ) is computationally challenging and typically requires first reducing large genomic datasets into a less informative collection of inferred gene trees. In contrast, admixture inference (e.g., ABBA‐BABA tests) can accommodate enormous single‐nucleotide polymorphism (SNP) datasets but is restricted to examining subsets of four to five samples at a time. Here, we demonstrate a new approach to evaluate SNP frequencies among quartet samples under a phylogenetic hypothesis (similar to ABBA‐BABA tests), while examining all quartet information simultaneously (similar to the network inference methods).
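For readers unfamiliar with the ABBA‐BABA test mentioned above, the following is a minimal sketch of how the standard Patterson's D statistic could be computed from four aligned sequences. The function name `d_statistic`, its argument names, and the toy sequences are illustrative assumptions and are not part of simcat; the sketch also omits the significance testing (e.g., block jackknife) that a real analysis would need.

```python
def d_statistic(p1, p2, p3, outgroup):
    """Count ABBA and BABA site patterns and return (abba, baba, D).

    Hypothetical helper for illustration only. Taxa are ordered as in the
    rooted quartet (((P1, P2), P3), O); the outgroup allele is treated as
    ancestral ("A") and the other allele as derived ("B").
    """
    abba = 0
    baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip sites with missing or ambiguous bases.
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        # Keep only biallelic sites.
        if len({a, b, c, o}) != 2:
            continue
        if a == o and b == c:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1  # P1 and P3 share the derived allele
    total = abba + baba
    d = (abba - baba) / total if total else 0.0
    return abba, baba, d


# Toy alignment (assumed data): three ABBA sites and one BABA site.
result = d_statistic("AAGTCA", "ATGAGC", "ATGTGC", "AAGACA")
print(result)  # -> (3, 1, 0.5)
```

A D value near zero is expected under incomplete lineage sorting alone, whereas a consistent excess of one pattern suggests introgression between P3 and one of the two sister taxa.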
Methods and Results
Our method, simcat, trains a neural network on coalescent simulations to discriminate among introgression scenarios based on learned SNP frequency patterns. We demonstrate the accuracy of simcat in classifying introgression events from simulations, evaluate its sensitivity to variation in species tree parameters, and apply it to an empirical dataset of oak trees (Quercus ser. Virentes).
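To make the idea of examining all quartets at once more concrete, the sketch below shows one way SNP pattern counts for every quartet of samples might be concatenated into a single feature vector for a classifier. This is an assumption made for illustration and is not simcat's actual encoding; the names `quartet_features`, `PATTERNS`, and `INDEX` are hypothetical, and the simulation and neural network training steps are omitted.

```python
from itertools import combinations, product

BASES = "ACGT"
# All 4^4 = 256 possible site patterns for four samples, in a fixed order.
PATTERNS = ["".join(p) for p in product(BASES, repeat=4)]
INDEX = {pattern: i for i, pattern in enumerate(PATTERNS)}


def quartet_features(alignment):
    """Return concatenated site-pattern counts over all quartets.

    `alignment` maps sample name -> aligned sequence (equal lengths).
    Hypothetical helper for illustration; not simcat's implementation.
    """
    names = sorted(alignment)  # fixed sample order keeps features comparable
    features = []
    for quartet in combinations(names, 4):
        counts = [0] * len(PATTERNS)
        for column in zip(*(alignment[name] for name in quartet)):
            key = "".join(column)
            # Columns with missing or ambiguous bases are skipped.
            if key in INDEX:
                counts[INDEX[key]] += 1
        features.extend(counts)
    return features


# Toy alignment (assumed data): five samples give C(5, 4) = 5 quartets.
toy = {"a": "ACG", "b": "ACG", "c": "ATG", "d": "ACG", "e": "NCG"}
vector = quartet_features(toy)
print(len(vector))  # -> 1280
```

In a workflow of the kind described above, vectors like this would be computed for each simulated dataset and labeled with the introgression scenario that generated it, so that a trained model could then be applied to the same features computed from empirical data.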
Conclusions
Our approach represents a first step towards leveraging machine learning to expand phylogenetic invariants–based methods beyond the scale of quartets to a larger phylogenetic context.