Transformer-Based Visual Recognition for Human Action Understanding: A Comprehensive Survey
Abstract
Human action recognition has become a cornerstone of computer vision, supporting a wide range of applications such as intelligent surveillance, human–computer interaction, autonomous robotics, and healthcare monitoring. Traditional convolutional neural networks have demonstrated strong spatial feature extraction capabilities but are fundamentally limited in capturing long-range temporal dependencies and complex motion relationships across frames. Transformer architectures have reshaped visual recognition by introducing global self-attention mechanisms that model spatial and temporal dependencies in a unified, data-driven manner. By treating video frames as sequences of spatiotemporal tokens, transformer-based frameworks have surpassed conventional convolutional approaches in both flexibility and accuracy. These models enable more effective temporal reasoning, multimodal fusion with audio and pose information, and zero-shot generalization through large-scale pretraining. This survey provides a comprehensive overview of transformer-based visual recognition methods for human action understanding, emphasizing the evolution of key architectures, the integration of self-supervised and multimodal learning, and the use of large-scale benchmark datasets. Finally, it highlights current challenges related to efficiency, data dependency, and interpretability, and discusses how future developments may lead to general-purpose, multimodal, and explainable systems for real-world action understanding.
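As a rough illustration of the tokenization idea mentioned in the abstract, the sketch below shows how a video clip might be split into spatiotemporal "tube" tokens and passed through a standard self-attention layer. This is a minimal, hypothetical example with assumed shapes and module names; it is not drawn from any specific architecture covered by the survey.

```python
# Minimal sketch (illustrative assumptions only): tube-patch tokenization of a
# video clip followed by global self-attention over all spatiotemporal tokens.
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Splits a clip of shape (B, C, T, H, W) into non-overlapping tube patches
    and embeds each patch as one token."""
    def __init__(self, in_channels=3, embed_dim=192, tube=(2, 16, 16)):
        super().__init__()
        # A 3D convolution with stride equal to the kernel size acts as a patch embedding.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=tube, stride=tube)

    def forward(self, video):
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, D): one token per tube patch

tokenizer = VideoTokenizer()
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)

clip = torch.randn(1, 3, 8, 224, 224)        # toy clip: 8 RGB frames of 224x224
tokens = tokenizer(clip)                     # (1, 784, 192) spatiotemporal tokens
out, _ = attn(tokens, tokens, tokens)        # self-attention across all tokens
print(tokens.shape, out.shape)
```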
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This allows you to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that you give appropriate citation information.