Transformer-Based Visual Recognition for Human Action Understanding: A Comprehensive Survey
Abstract
Human action recognition has become a cornerstone of computer vision, supporting a wide range of applications such as intelligent surveillance, human–computer interaction, autonomous robotics, and healthcare monitoring. Traditional convolutional neural networks have demonstrated strong spatial feature extraction capabilities but are fundamentally limited in capturing long-range temporal dependencies and complex motion relationships across frames. Transformer architectures have reshaped visual recognition by introducing global self-attention mechanisms that model spatial and temporal dependencies in a unified, data-driven manner. By treating video frames as sequences of spatiotemporal tokens, transformer-based frameworks have surpassed conventional convolutional approaches in both flexibility and accuracy. These models enable more effective temporal reasoning, multimodal fusion with audio and pose information, and zero-shot generalization through large-scale pretraining. This survey provides a comprehensive overview of transformer-based visual recognition methods for human action understanding, emphasizing the evolution of key architectures, the integration of self-supervised and multimodal learning, and the use of large-scale benchmark datasets. Finally, it highlights current challenges related to efficiency, data dependency, and interpretability, and discusses how future developments may lead to general-purpose, multimodal, and explainable systems for real-world action understanding.
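As a rough illustration of the tokenization idea mentioned in the abstract, the sketch below shows how a video clip might be split into spatiotemporal "tube" tokens and passed through a standard self-attention layer. This is a minimal, hypothetical example with assumed shapes and module names; it is not drawn from any specific architecture covered by the survey.

```python
# Minimal sketch (illustrative assumptions only): tube-patch tokenization of a
# video clip followed by global self-attention over all spatiotemporal tokens.
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Splits a clip of shape (B, C, T, H, W) into non-overlapping tube patches
    and embeds each patch as one token."""
    def __init__(self, in_channels=3, embed_dim=192, tube=(2, 16, 16)):
        super().__init__()
        # A 3D convolution with stride equal to the kernel size acts as a patch embedding.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=tube, stride=tube)

    def forward(self, video):
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, D): one token per tube patch

tokenizer = VideoTokenizer()
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)

clip = torch.randn(1, 3, 8, 224, 224)        # toy clip: 8 RGB frames of 224x224
tokens = tokenizer(clip)                     # (1, 784, 192) spatiotemporal tokens
out, _ = attn(tokens, tokens, tokens)        # self-attention across all tokens
print(tokens.shape, out.shape)
```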
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0. This allows you to copy and redistribute the material in any medium or format for any purpose, even commercially, provided that you give appropriate citation information.