YOLOv8-Based ASL Recognition in Real-Time with Ensemble CNN Classification

DOI: 10.2174/0126662558432653251203151915 ISSN: 2666-2558

Rupali H. Shende, Nuzhat F. Shaikh

Introduction:

Sign language is vital for communication among deaf and hard-of-hearing individuals, yet barriers persist due to limited public understanding. This study addresses four key challenges (environmental noise, gesture overlap, data imbalance, and limited device capacity) and aims to deliver a fast, accurate, and explainable American Sign Language (ASL) interpreter for mobile platforms.

Methods:

The approach involves building a multilingual ASL dataset with preprocessing and augmentation, followed by hand detection using YOLOv8. Cropped hand regions are classified through a custom CNN that captures spatial–temporal features. An ensemble of lightweight learners mitigates class imbalance, improving recognition of rare signs without increasing latency.
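The two hand-offs in this pipeline, from detector to classifier and from individual learners to the ensemble, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how boxes are cropped or how ensemble members are combined, so the margin-padded crop, the weighted soft voting, and the names `crop_with_margin` and `soft_vote` are all assumptions made for illustration. The frame is a plain nested list standing in for an image, and the probability vectors stand in for the outputs of trained models.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def crop_with_margin(frame: Sequence[Sequence], box: Box, margin: float = 0.1) -> List[list]:
    """Crop a detected hand box from a row-major frame.

    The box is widened by `margin` (a fraction of its width and height) and
    clipped to the frame so the crop never indexes outside the image.
    Hypothetical helper: the padding scheme is an assumption.
    """
    height, width = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    pad_x = int((x2 - x1) * margin)
    pad_y = int((y2 - y1) * margin)
    left, top = max(0, x1 - pad_x), max(0, y1 - pad_y)
    right, bottom = min(width, x2 + pad_x), min(height, y2 + pad_y)
    return [list(row[left:right]) for row in frame[top:bottom]]


def soft_vote(prob_sets: Sequence[Sequence[float]],
              weights: Optional[Sequence[float]] = None) -> Tuple[int, List[float]]:
    """Combine per-model class probabilities by weighted averaging.

    Returns the winning class index and the averaged distribution.
    Soft voting is an assumed combination rule, not one named in the paper.
    """
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    n_classes = len(prob_sets[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=averaged.__getitem__)
    return best, averaged


# Toy 6x8 "frame" whose pixels record their own (row, column) position.
frame = [[(r, c) for c in range(8)] for r in range(6)]
crop = crop_with_margin(frame, (2, 1, 6, 5), margin=0.25)

# Three hypothetical learners scoring the same crop over three sign classes.
member_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
label, averaged = soft_vote(member_probs)
```

With equal weights, the second and third learners outvote the first. Passing `weights=[3, 1, 1]` instead lets the first learner prevail, which is one way an ensemble could favor a member trained to recognize rare signs.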

Results:

The system’s performance is evaluated via accuracy, precision, recall, and edge-frame latency, while SHAP and Grad-CAM heat maps illustrate which pixels influence the final decision. Benchmarks show 98.2% accuracy in uncontrolled settings, well above the YOLOv5, SSD, and single-CNN baselines, while frame rates remain above 25 fps on the budget-grade NVIDIA Jetson Nano and mainstream Android devices. These results demonstrate that combining YOLOv8 for object detection, a CNN for classification, and an ensemble refinement step creates a scalable and interpretable system that closes communication gaps and advances inclusive human-computer interaction.
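As a reading aid for the reported metrics, the following Python sketch computes accuracy, macro-averaged precision and recall, and a frame rate from per-frame latencies. The abstract does not say whether precision and recall are macro- or micro-averaged, so macro averaging is an assumption here, as are the function names `summarize_predictions` and `frames_per_second`. The labels and latencies are toy values, not data from the study.

```python
from collections import Counter
from typing import Dict, Sequence


def summarize_predictions(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return accuracy plus macro-averaged precision and recall.

    Macro averaging gives every class equal weight, so rare signs count
    as much as frequent ones (an assumed choice, not stated in the paper).
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    precision = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recall = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    return {
        "accuracy": ratio(sum(tp.values()), len(y_true)),
        "macro_precision": sum(precision) / len(labels),
        "macro_recall": sum(recall) / len(labels),
    }


def frames_per_second(latencies_ms: Sequence[float]) -> float:
    """Convert per-frame latencies in milliseconds to an average frame rate."""
    return 1000.0 * len(latencies_ms) / sum(latencies_ms)


# Toy labels for three signs; not data from the study.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
report = summarize_predictions(y_true, y_pred)

# A steady 40 ms per frame corresponds to exactly 25 fps.
fps = frames_per_second([40.0, 40.0, 40.0, 40.0])
```

The last line makes the latency budget concrete: staying above 25 fps means processing each frame, detection and classification included, in under 40 ms on average.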

Discussion:

The framework effectively reduces noise interference, manages overlapping gestures, and improves recognition of rare signs through ensemble learning. Its lightweight design ensures smooth performance on mobile and edge devices, while SHAP and Grad-CAM enhance transparency. Overall, the system balances accuracy, efficiency, and interpretability, making it suitable for real-world ASL communication.

Conclusion:

The proposed framework addresses environmental noise, gesture overlap, data imbalance, and hardware constraints while maintaining high accuracy and interpretability. The combination of YOLOv8, CNN, and ensemble refinement ensures robustness and efficiency for real-world applications.