Targeted Genomic Region Masking Supports Accurate Variant Calling While Suppressing Low-Complexity Sequencing Artifacts
Chrysoula Kaligerou, Athina Tsagkalidou, Vasiliki Pogka, Dimitrios Christos Tremoulis, Timokratis Karamitros

Background: False-positive variant calls generated within low-complexity regions (LCRs) remain a persistent bottleneck in clinical genomics and complicate downstream analysis. This study evaluates a targeted spatial masking strategy designed to suppress deterministic artifacts in short-read sequencing data while preserving clinically actionable variants that reside outside LCRs. Methods: We implemented a selective masking protocol prior to variant calling across analytical reference standards (EQA, NA12878) and two independent breast cancer whole-exome sequencing cohorts (n = 25). Callsets were evaluated for diagnostic sensitivity, precision gains, mutational signatures, variant allele frequency (VAF) behavior, pseudo-multiallelic noise, and ClinVar/dbSNP annotation. Results: The protocol removed thousands of sequencing and alignment artifacts while maintaining the retained biological callset, and a negligible number of disease-associated diagnostic variants were detected in the excluded artifact fraction. LCR masking preserved physiological Ti/Tv and Ins/Del profiles in retained calls and resolved pseudo-multiallelic noise, while excluded artifact calls were distinguished by distorted mutational and VAF signatures. dbSNP profiling showed cohort-dependent behavior: in TCGA-BRCA, excluded calls showed higher dbSNP annotation than retained calls, whereas AURORA showed the opposite direction. Conclusions: These findings indicate that database annotation alone is a potentially unreliable basis for variant authentication, and they highlight targeted spatial filtration as a critical, early pipeline intervention for high-fidelity clinical genomics of non-LCR-associated germline variants using short reads.
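The comparison between retained and excluded calls can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract describes masking applied before variant calling, whereas this example partitions an existing list of calls after the fact, which is only an approximation of that comparison. The function names (`build_index`, `in_mask`, `partition`, `ti_tv`), the tuple layout, and the toy coordinates are all hypothetical. It assumes BED-style 0-based half-open intervals for the LCR mask and VCF-style 1-based positions for the calls.

```python
from bisect import bisect_right

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def build_index(intervals):
    """Merge overlapping (chrom, start, end) intervals and index them per chromosome."""
    by_chrom = {}
    for chrom, start, end in sorted(intervals):
        merged = by_chrom.setdefault(chrom, [])
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Parallel sorted lists of starts and ends allow binary search.
    return {c: ([s for s, _ in iv], [e for _, e in iv]) for c, iv in by_chrom.items()}


def in_mask(index, chrom, pos):
    """Return True if a 1-based position falls inside a 0-based half-open interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos - 1) - 1
    return i >= 0 and pos - 1 < ends[i]


def partition(variants, index):
    """Split (chrom, pos, ref, alt) calls into retained (outside mask) and excluded."""
    retained, excluded = [], []
    for v in variants:
        (excluded if in_mask(index, v[0], v[1]) else retained).append(v)
    return retained, excluded


def ti_tv(variants):
    """Transition/transversion ratio over single-nucleotide calls; None if no transversions."""
    ti = tv = 0
    for _, _, ref, alt in variants:
        if len(ref) == 1 and len(alt) == 1 and ref != alt:
            if (ref, alt) in TRANSITIONS:
                ti += 1
            else:
                tv += 1
    return ti / tv if tv else None


# Toy data, invented for illustration only.
lcr = [("chr1", 100, 200), ("chr1", 150, 250)]
calls = [
    ("chr1", 50, "A", "G"),
    ("chr1", 101, "C", "A"),
    ("chr1", 250, "G", "T"),
    ("chr1", 251, "C", "T"),
    ("chr2", 120, "A", "C"),
]
retained, excluded = partition(calls, build_index(lcr))
```

Merging the intervals first keeps the lookup to a single binary search per call. Indels are skipped by `ti_tv`, so an Ins/Del profile would need a separate count.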