Transformer-Based Visual Recognition for Human Action Understanding: A Comprehensive Survey


Thayer Winslow

Abstract

Human action recognition has become a cornerstone of computer vision, supporting a wide range of applications such as intelligent surveillance, human–computer interaction, autonomous robotics, and healthcare monitoring. Traditional convolutional neural networks have demonstrated strong spatial feature extraction capabilities but are fundamentally limited in capturing long-range temporal dependencies and complex motion relationships across frames. Transformer architectures have reshaped visual recognition by introducing global self-attention mechanisms that model spatial and temporal dependencies in a unified, data-driven manner. By treating video frames as sequences of spatiotemporal tokens, transformer-based frameworks have surpassed conventional convolutional approaches in both flexibility and accuracy. These models enable more effective temporal reasoning, multimodal fusion with audio and pose information, and zero-shot generalization through large-scale pretraining. This survey provides a comprehensive overview of transformer-based visual recognition methods for human action understanding, emphasizing the evolution of key architectures, the integration of self-supervised and multimodal learning, and the use of large-scale benchmark datasets. Finally, it highlights current challenges related to efficiency, data dependency, and interpretability, and discusses how future developments may lead to general-purpose, multimodal, and explainable systems for real-world action understanding.
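
To make the tokenization idea concrete, the following is a minimal sketch in PyTorch of how a video clip can be split into spatiotemporal tokens and processed with global self-attention. The tubelet size, embedding width, depth, and class count here are illustrative assumptions, not the configuration of any specific model discussed in the survey.

```python
import torch
import torch.nn as nn

class VideoTransformer(nn.Module):
    """Minimal sketch: video clip -> spatiotemporal tokens -> self-attention.

    All hyperparameters (tubelet size, embed_dim, depth, num_classes)
    are illustrative, not drawn from any particular published model.
    """

    def __init__(self, num_classes=400, frames=16, size=224,
                 tubelet=(2, 16, 16), embed_dim=256, depth=4, heads=8):
        super().__init__()
        # Tubelet embedding: each 2x16x16 spatiotemporal patch becomes
        # one token, the 3-D analogue of ViT's 2-D patch embedding.
        self.embed = nn.Conv3d(3, embed_dim, kernel_size=tubelet, stride=tubelet)
        num_tokens = (frames // tubelet[0]) * (size // tubelet[1]) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        # Self-attention is applied jointly over space and time,
        # so every token can attend to every other token in the clip.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):              # video: (B, 3, T, H, W)
        x = self.embed(video)              # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])          # classify from the [CLS] token

clip = torch.randn(2, 3, 16, 224, 224)     # batch of two 16-frame clips
logits = VideoTransformer()(clip)          # -> (2, 400) action logits
```

Because every token attends to every other token across both space and time, temporal dependencies are modeled globally in a single layer, rather than accumulating through the stacked local receptive fields of a convolutional network.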

Article Details

How to Cite
Winslow, T. (2025). Transformer-Based Visual Recognition for Human Action Understanding: A Comprehensive Survey. Journal of Computer Science and Software Applications, 5(11). Retrieved from https://mfacademia.org/index.php/jcssa/article/view/247