ZILA-SRM: a probabilistic framework with zero-inflated latent models for robust strain reconstruction from metagenomes
Saidi Wang, Mintong Chen, Di Jiao
ABSTRACT
Resolving bacterial strain diversity from shotgun metagenomic data is fundamental to understanding intra-host evolution, transmission dynamics, and phenotypic heterogeneity. However, current probabilistic approaches face a severe “identifiability limit” when disentangling highly similar genomes. Under high-noise conditions, sequencing errors, coverage overdispersion, and collinearity confound standard expectation-maximization algorithms, resulting in overfitting and spurious “ghost” strains. Here, we introduce zero-inflated latent allocation for strain reconstruction from metagenomes with adaptive sparsity regularization (ZILA-SRM) to overcome this barrier through three innovations. First, we integrate a zero-inflated Poisson mixture model to decouple “structural zeros” (true strain absence) from “sampling zeros” (stochastic dropout), addressing overdispersion in standard Poisson-based tools. Second, we impose a convex adaptive sparsity regularization penalty that leverages biological sparsity priors to shrink noise artifacts dynamically. Third, we implement a graph-theoretic refinement step using maximal clique enumeration to resolve haplotype collinearity. Benchmarking against StrainFinder and MixtureS on 702 synthetic datasets shows that ZILA-SRM achieves a 20% improvement in precision in high-complexity scenarios while maintaining over 80% recall for minor variants at 0.5% abundance. Re-analysis of deep-sequencing data from 195
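The distinction between structural and sampling zeros can be illustrated with a generic single-component zero-inflated Poisson. This is a minimal sketch under stated assumptions, not the ZILA-SRM implementation: the function names and the parameters `lam` (expected read count when the strain is present) and `pi` (prior probability of a structural zero) are hypothetical and do not come from the paper.

```python
import math


def zip_pmf(k, lam, pi):
    """Probability of observing count k under a zero-inflated Poisson.

    lam: Poisson mean when the strain is present (assumed name).
    pi:  probability of a structural zero, i.e. true absence (assumed name).
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    if k == 0:
        # A zero arises either from true absence or from Poisson dropout.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def structural_zero_posterior(lam, pi):
    """P(strain truly absent | observed count is zero), by Bayes' rule."""
    sampling_zero = (1.0 - pi) * math.exp(-lam)
    return pi / (pi + sampling_zero)


if __name__ == "__main__":
    # At low expected coverage a zero is ambiguous; at high coverage it is
    # almost certainly structural.
    for lam in (0.5, 3.0, 10.0):
        print(lam, round(structural_zero_posterior(lam, pi=0.2), 4))
```

In this toy setting, an observed zero at a site with high expected coverage is attributed almost entirely to true absence, whereas the same zero at low expected coverage remains ambiguous. A plain Poisson model has no way to express that difference.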
IMPORTANCE
Understanding microbial communities at the strain level is critical because closely related strains can differ dramatically in traits such as drug resistance, virulence, and ecological interactions. However, resolving individual strains from metagenomic sequencing data remains difficult, especially when strains are highly similar or present at low abundance. As a result, biologically meaningful diversity is often obscured or misinterpreted as noise. In this study, we introduce a new framework that improves the reliability of strain reconstruction from complex metagenomic data. By reducing false-positive strain detection while preserving sensitivity to rare variants, our approach enables more accurate characterization of microbial populations. This improved resolution reveals previously hidden subpopulations in clinical and microbiome datasets, providing clearer insights into microbial evolution, competition, and the emergence of clinically relevant traits such as antibiotic resistance.