Image Classification via an Improved Vision Transformer: Enhancing Global and Local Feature Modeling

Lachlan Andrew

Abstract

This study proposes an image classification method based on an improved Vision Transformer to address the limitations of traditional convolutional neural networks in global modeling and long-range dependency capture. Input images are first divided into patches and mapped into embeddings so that local details are preserved during serialization. In the encoder, multi-head self-attention and feed-forward networks enhance cross-region feature interaction, while residual connections and normalization alleviate vanishing gradients in deep layers, enabling stable and efficient feature learning. At the classification stage, an aggregation function combines the patch representations into a global feature, which a fully connected layer maps to the final prediction. The CIFAR-100 dataset, covering diverse fine-grained categories, is used to evaluate the model's adaptability in complex scenarios. Systematic comparisons and sensitivity analyses show that the method outperforms the compared baselines in AUC, ACC, Precision, and Recall, demonstrating clear advantages in feature modeling and classification performance. This work enriches the theoretical exploration of Vision Transformer optimization and provides a robust and efficient solution for image classification tasks.
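The article does not include an implementation, so the following is a minimal PyTorch sketch of the pipeline the abstract describes: patch embedding, pre-norm encoder blocks combining multi-head self-attention and feed-forward networks with residual connections and layer normalization, aggregation of patch representations into a global feature, and a fully connected classification head for CIFAR-100-sized inputs. All hyperparameters (patch size, depth, widths) and the use of mean pooling as the aggregation function are illustrative assumptions, not values reported in the paper.

    # Minimal sketch of the described pipeline; hyperparameters are illustrative,
    # and mean pooling stands in for the unspecified aggregation function.
    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split the image into patches and map each patch to an embedding."""
        def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=256):
            super().__init__()
            self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
            num_patches = (img_size // patch_size) ** 2
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

        def forward(self, x):                      # x: (B, 3, H, W)
            x = self.proj(x)                       # (B, dim, H/ps, W/ps)
            x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
            return x + self.pos_embed

    class EncoderBlock(nn.Module):
        """Pre-norm Transformer block: multi-head self-attention and a feed-forward
        network, each wrapped in a residual connection to stabilize deep training."""
        def __init__(self, dim=256, heads=8, mlp_ratio=4.0, dropout=0.1):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, int(dim * mlp_ratio)),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(int(dim * mlp_ratio), dim),
            )

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]  # residual over attention
            x = x + self.mlp(self.norm2(x))                    # residual over feed-forward
            return x

    class ViTClassifier(nn.Module):
        """Patch embedding -> stacked encoder blocks -> aggregation -> linear head."""
        def __init__(self, num_classes=100, depth=6, dim=256):
            super().__init__()
            self.embed = PatchEmbedding(dim=dim)
            self.blocks = nn.Sequential(*[EncoderBlock(dim=dim) for _ in range(depth)])
            self.norm = nn.LayerNorm(dim)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, x):
            tokens = self.blocks(self.embed(x))
            global_feat = self.norm(tokens).mean(dim=1)  # aggregate patch tokens
            return self.head(global_feat)                # class logits

    if __name__ == "__main__":
        model = ViTClassifier()
        logits = model(torch.randn(2, 3, 32, 32))        # CIFAR-100-sized input
        print(logits.shape)                              # torch.Size([2, 100])

In this sketch a mean over patch tokens plays the role of the aggregation function; a class token or attention-based pooling would be equally consistent with the abstract's description.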

Article Details

How to Cite
Andrew, L. (2026). Image Classification via an Improved Vision Transformer: Enhancing Global and Local Feature Modeling. Journal of Computer Science and Software Applications, 6(1). Retrieved from https://mfacademia.org/index.php/jcssa/article/view/257