Real-Time Gesture Recognition via Deep Spatiotemporal Modeling for Human–Computer Interaction


Leopold Gravenhorst

Abstract

Accurate, real-time gesture recognition is critical for next-generation human–computer interaction systems, particularly in immersive and touchless environments. This study presents a spatiotemporal deep learning framework that combines a lightweight convolutional backbone with a temporal transformer encoder to capture dynamic motion patterns in video sequences. The model is trained on a curated dataset of 25,000 gesture samples spanning 30 predefined gesture classes, collected under varying lighting conditions and user backgrounds. The proposed approach achieves an overall recognition accuracy of 96.2%, surpassing conventional CNN-LSTM baselines (91.5%) and 3D CNN models (93.1%). In real-time testing, the system maintains latency below 40 ms, ensuring smooth user interaction. Cross-user generalization experiments show a performance drop of only 2.4%, demonstrating strong robustness to individual variability. The model is also resilient to noise and occlusion, maintaining 92.8% accuracy under partial hand visibility. These findings indicate that transformer-based temporal modeling substantially improves gesture recognition performance for real-world HCI applications.
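The pipeline the abstract describes, per-frame spatial encoding by a lightweight backbone followed by transformer-style temporal mixing and classification, can be illustrated with a minimal NumPy forward-pass sketch. All shapes, weight matrices, and helper names below are hypothetical stand-ins (the spatial backbone is reduced to a single linear projection, and one attention head replaces the full transformer encoder); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_features(frames, w):
    # Stand-in for the lightweight CNN backbone: flatten each frame
    # and apply one linear projection to get a per-frame embedding.
    return frames.reshape(frames.shape[0], -1) @ w

def temporal_self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over the time axis,
    # the core operation of the temporal transformer encoder.
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return weights @ v

# Illustrative sizes: 16 frames of 8x8 features, 32-d embeddings, 30 classes
T, H, W, D, C = 16, 8, 8, 32, 30
frames = rng.standard_normal((T, H, W))
w_feat = rng.standard_normal((H * W, D)) * 0.1
wq, wk, wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_cls = rng.standard_normal((D, C)) * 0.1

emb = frame_features(frames, w_feat)              # (T, D) per-frame embeddings
mixed = temporal_self_attention(emb, wq, wk, wv)  # (T, D) temporally mixed
probs = softmax(mixed.mean(axis=0) @ w_cls)       # (C,) gesture-class distribution
print(probs.shape)
```

Mean-pooling the temporally mixed embeddings before the classifier is one common design choice; a learned class token, as in standard transformer classifiers, would serve the same role.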

Article Details

How to Cite
Gravenhorst, L. (2026). Real-Time Gesture Recognition via Deep Spatiotemporal Modeling for Human–Computer Interaction. Journal of Computer Science and Software Applications, 6(4). Retrieved from https://mfacademia.org/index.php/jcssa/article/view/270