HIV-V3Augur: A Novel Machine Learning Model for Predicting HIV-1 Tropism in Sub-Subtype A6 and CRF63_02A6, Predominant Variants in Russia and Countries of the Former Soviet Union
Kirill Elfimov, Ludmila Gotfrid, Alina Nokhova, Mariya Gashnikova, Vasiliy Ekushov, Maksim Halikov, Irina Osipova, Dmitriy Baboshko, Andrey Murzin, Ivan Kondeikin, Arina Kiryakina, Aleksey Totmenin, Aleksandr Agaphonov, Natalya Gashnikova
Determining HIV-1 tropism informs the prognosis of HIV infection and is required before prescribing maraviroc, an entry inhibitor that blocks the interaction between the viral gp120 and the CCR5 coreceptor. However, existing prediction algorithms have been developed primarily for the globally most prevalent subtypes (B, C, and CRF01_AE) and often show reduced performance for other HIV-1 genetic variants. Sub-subtype A6 and circulating recombinant form CRF63_02A6 dominate the HIV-1 epidemic in Russia and other Former Soviet Union (FSU) countries, yet the reliability of tropism prediction for these viruses remains virtually unexplored. We phenotypically determined the tropism of 25 clinical isolates (11 R5, 1 X4, and 7 dual-tropic R5/X4) using U87.CD4.CCR5 and U87.CD4.CXCR4 cell lines and performed a comparative analysis of eight existing genotypic tools (Geno2pheno, WebPSSM, T-CUP 2.0, the Delobel/Garrido rules, and others) or their modifications on a combined dataset that included Los Alamos National Laboratory (LANL) reference sequences (subtypes A, B, C, CRF01_AE, and CRF02_AG) and our laboratory-derived isolates. Most models achieved high accuracy for globally prevalent subtypes (≈95% for B, C, and CRF01_AE) but showed markedly reduced performance for sub-subtype A6 (best accuracy among existing models, 85%) and CRF63_02A6 (best accuracy, 72%), with a poor balance between sensitivity and specificity. To address this problem, we developed HIV-V3Augur, an ensemble stacking model based on the Random Forest and Support Vector Machine (SVM) machine learning algorithms, trained on Pseudo Amino Acid Composition (PseAAC) and Relative Synonymous Codon Usage (RSCU) features with 10-fold stratified cross-validation.
HIV-V3Augur achieved an accuracy of 77%, sensitivity of 79%, and specificity of 79% on sub-subtype A6, and on CRF63_02A6 it reached an accuracy of 95%, sensitivity of 87%, and specificity of 100%. Cross-validation demonstrated that HIV-V3Augur represents a balanced genotypic tropism prediction tool for understudied HIV-1 variants circulating in the FSU region. HIV-V3Augur can be used locally through a graphical user interface.
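For readers unfamiliar with the RSCU features named above, the following is a minimal sketch of the standard definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is not taken from the HIV-V3Augur code base. The function name `rscu`, the choice to assign 0.0 to codons of amino acids absent from the sequence, and the skipping of ambiguous codons are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(seq):
    """Return {codon: RSCU} for the 61 sense codons of an in-frame sequence.

    Hypothetical helper: RSCU = count * (family size) / (family total).
    Codons whose amino acid never occurs get 0.0 (an assumption).
    """
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Codons with ambiguous bases (e.g. N) are not in the table and are skipped.
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families, one per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for fam in families.values():
        total = sum(counts[c] for c in fam)
        for c in fam:
            values[c] = counts[c] * len(fam) / total if total else 0.0
    return values


features = rscu("AAAAAAAAG")  # codons: AAA, AAA, AAG (all lysine)
print(round(features["AAA"], 3), round(features["AAG"], 3))
```

The resulting 61-value vector could then be concatenated with PseAAC descriptors before being passed to a classifier. How the two feature sets are actually combined in HIV-V3Augur is not specified in this section.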