DOI: 10.35377/saucis...1832143 ISSN: 2636-8129

Signal Processing and Dimensionality Reduction Algorithms for High-Dimensional Time-Series Data Applied to Biomolecular Simulations

Kevser Kübra Kırboğa, Ecir Uğur Küçüksille
High-dimensional time-series datasets with stochastic noise present fundamental challenges in signal processing and data analytics across diverse computational domains. This study develops and validates a systematic framework that integrates digital signal processing techniques with dimensionality reduction algorithms to extract meaningful trends from noisy, high-dimensional temporal data. We implement and compare five filtering approaches (Savitzky-Golay polynomial regression, Moving Average, Gaussian, Butterworth, and Wavelet denoising) combined with Principal Component Analysis (PCA) on GPU-accelerated molecular dynamics (MD) simulation infrastructure (NVIDIA GPU based on the Blackwell architecture, CUDA 12.9), benchmarking performance on two contrasting biomolecular systems: Frataxin (119 residues, rigid) and Carbonic Anhydrase IX (CAIX) (257 residues, dynamic). A comprehensive five-filter comparison across both systems reveals consistent performance rankings: Wavelet achieves the highest signal-to-noise ratio (SNR) for both Frataxin (23.47 dB) and CAIX (13.80 dB), while Savitzky-Golay provides the optimal balance between noise reduction and low-frequency preservation (>99%). Critically, Savitzky-Golay’s improvement over the Moving Average is substantially greater for dynamic CAIX (16.3%) than for rigid Frataxin (3.1%), demonstrating enhanced performance precisely where distinguishing conformational transitions from thermal noise is most challenging. All filters preserve low-frequency conformational dynamics while reducing high-frequency noise, with the extracted noise components validated by Gaussian distribution analysis (σ ≈ 0.008 nm) across both systems and all filtering methods. PCA-based dimensionality reduction achieves 11.9:1 compression (357 dimensions → 30 principal components) for Frataxin while retaining 80% of conformational variance, with CAIX requiring only 11 components for equivalent variance capture due to dominant collective motions.
The complete analysis pipeline processes 50,001 frames in less than 0.2 seconds, a negligible overhead (<0.001%) relative to simulation time. Consistent filter rankings across both systems support the generalizability of the methodology to proteins spanning diverse dynamical regimes. All analysis pipelines are implemented in Python 3.13 with open-source libraries (NumPy, SciPy, Matplotlib) to ensure reproducibility and extensibility.
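
As a rough illustration of the two stages described above (filtering followed by PCA), the following minimal sketch applies a Savitzky-Golay filter and a moving average to a synthetic noisy trace, then counts the principal components needed to retain 80% of the variance. It is not the authors' implementation. The synthetic data, the window length of 51, the polynomial order of 3, the SNR definition (signal power over residual power), and the helper names moving_average, snr_db, and n_components_for_variance are all assumptions made for the example. Only the noise level (σ = 0.008) and the 80% variance target are taken from the abstract.

```python
import numpy as np
from scipy.signal import savgol_filter


def moving_average(x, window):
    """Centered moving average; edge padding keeps the output length equal to the input."""
    kernel = np.ones(window) / window
    padded = np.pad(x, window // 2, mode="edge")  # assumes an odd window
    return np.convolve(padded, kernel, mode="valid")


def snr_db(signal, noise):
    """SNR in dB as the ratio of signal power to noise power (one common definition)."""
    return 10.0 * np.log10(np.mean(signal**2) / np.mean(noise**2))


def n_components_for_variance(X, target=0.80):
    """Smallest number of principal components whose cumulative variance reaches `target`."""
    Xc = X - X.mean(axis=0)                       # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values of centered data
    cumulative = np.cumsum(s**2) / np.sum(s**2)   # cumulative explained variance ratio
    return int(np.searchsorted(cumulative, target) + 1)


# Synthetic stand-in for a single coordinate trace: slow motion plus Gaussian noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 5001)
clean = 0.1 * np.sin(2 * np.pi * 2 * t)
noisy = clean + rng.normal(0.0, 0.008, t.size)

# Hypothetical filter settings, chosen only for illustration.
sg = savgol_filter(noisy, window_length=51, polyorder=3)
ma = moving_average(noisy, 51)

snr_raw = snr_db(clean, noisy - clean)
snr_sg = snr_db(clean, sg - clean)
snr_ma = snr_db(clean, ma - clean)
print(f"SNR raw: {snr_raw:.2f} dB, Savitzky-Golay: {snr_sg:.2f} dB, moving average: {snr_ma:.2f} dB")

# Synthetic "trajectory": 200 frames x 30 features driven by 3 collective motions.
latent = rng.normal(size=(200, 3)) * np.array([5.0, 3.0, 2.0])
X = latent @ rng.normal(size=(3, 30)) + 0.01 * rng.normal(size=(200, 30))
k = n_components_for_variance(X, 0.80)
print(f"{k} of {X.shape[1]} components retain 80% of the variance ({X.shape[1] / k:.1f}:1)")
```

The compression ratio printed at the end follows the same arithmetic as the figure reported for Frataxin, where 357 dimensions reduced to 30 components gives 357 / 30 ≈ 11.9:1. On real trajectories, the relative ranking of the filters depends on the window settings and on how SNR is defined, so the numbers from this sketch should not be compared with the reported values.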
