DOI: 10.1142/s1469026826500264 ISSN: 1469-0268

Feedback-Aware ViT–LSTM for Deepfake Video Detection with Patch-Level Temporal Attention and Explainability

G C Akshatha, K Aditya Shastry, KVVLV Prasad, Deekshith G L

The rapid advancement of generative models has led to the widespread creation of deepfakes: highly realistic but manipulated media that pose serious risks to digital trust and security and that fuel societal misinformation. Among deepfake techniques, face-swapping methods are particularly concerning because of their visual realism and potential for misuse. This paper proposes a deepfake video detection framework, Feedback-Aware ViT–LSTM, that integrates Vision Transformers (ViTs) with Long Short-Term Memory (LSTM) networks through a dynamic feedback mechanism. The architecture exploits the spatial attention of ViTs and the temporal modeling strength of LSTMs to detect subtle manipulation artifacts in face-swapped videos. Frames are sampled from each video using a hybrid uniform–random strategy, and facial regions are detected with a YOLOv8 model. The detected faces are processed by the ViT to extract spatial embeddings, which are then passed to an LSTM that models temporal dependencies across frames. The key innovation is the feedback mechanism: temporal information from the LSTM dynamically guides the ViT’s attention toward patches that exhibit temporal inconsistency, reinforcing sensitivity to manipulation cues. To improve interpretability and transparency, the model incorporates Explainable AI techniques that highlight the facial patches influencing the classification decision. Experimental evaluation on benchmark deepfake datasets demonstrates that the proposed framework achieves superior detection performance, supporting its effectiveness for robust and interpretable deepfake detection.
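The abstract does not specify how the hybrid uniform–random sampling is carried out. One plausible reading, sketched below purely as an assumption, is stratified sampling: the video is split into equal-length segments (the uniform part) and one frame index is drawn at random inside each segment (the random part). The function name hybrid_sample_indices and its parameters are hypothetical and are not taken from the paper.

```python
import random


def hybrid_sample_indices(num_frames, num_samples, seed=None):
    """Pick frame indices with a stratified uniform-random scheme.

    Assumption (not confirmed by the paper): the video is divided into
    num_samples contiguous segments of near-equal length, and one frame
    is drawn uniformly at random from each segment.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never request more samples than there are frames.
    num_samples = min(num_samples, num_frames)
    rng = random.Random(seed)  # local generator, so results are reproducible
    indices = []
    for i in range(num_samples):
        start = (i * num_frames) // num_samples      # inclusive segment start
        end = ((i + 1) * num_frames) // num_samples  # exclusive segment end
        indices.append(rng.randrange(start, end))
    return indices


if __name__ == "__main__":
    # Example: choose 16 frames from a 300-frame clip.
    print(hybrid_sample_indices(300, 16, seed=0))
```

Because the segments are disjoint and ordered, the returned indices are strictly increasing and cover the whole clip, while the random offset inside each segment avoids always selecting the same frame positions. The selected frames would then be passed to the face detector.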
