Efficient Compression of Large Language Models with Distillation and Fine-Tuning

Anda Kai
Lin Zhu
Jiangchuan Gong

Abstract

With the widespread adoption of large language models (LLMs), their massive parameter counts and high computational cost pose significant challenges for practical deployment. To address this, this study proposes a method that integrates knowledge distillation with parameter-efficient fine-tuning (PEFT) to reduce computational overhead while preserving strong performance. In the knowledge distillation phase, experiments with different temperature parameters analyze their impact on how the student model learns, and the role of distilling features from different layers in model compression is also explored. The results indicate that a moderate temperature improves the distillation effect, and that selecting an appropriate feature layer for distillation improves the student model's generalization. In the fine-tuning phase, LoRA (Low-Rank Adaptation) is compared with full-parameter fine-tuning: LoRA offers clear advantages in inference speed and computational efficiency, whereas full-parameter fine-tuning achieves higher accuracy and stronger language understanding. Taken together, the experiments confirm that a well-designed combination of knowledge distillation and fine-tuning can compress a model effectively while maintaining its performance. Future work can integrate additional compression techniques, such as pruning and quantization, to further improve adaptability and computational efficiency. The approach offers a promising path to deploying large-scale language models in low-resource environments.
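To make the temperature parameter concrete, the sketch below shows a standard temperature-scaled distillation loss of the kind the abstract describes. It is a minimal illustration rather than the authors' training code; the default values of `T` and `alpha` are assumptions, since the abstract reports only that moderate temperatures work best.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Temperature-scaled knowledge distillation loss.

    Blends a soft KL-divergence term (teacher vs. student distributions,
    both softened at temperature T) with the usual hard-label cross-entropy.
    T=2.0 and alpha=0.5 are illustrative defaults, not values from the paper.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Multiplying by T**2 keeps the soft-target gradient magnitude
    # comparable across different temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

A higher T flattens the teacher's output distribution, exposing more of its "dark knowledge" about relative class similarities; a very high T washes out the signal, which is consistent with the finding that moderate temperatures work best.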
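Similarly, a minimal LoRA sketch clarifies why it is cheaper than full-parameter fine-tuning: the frozen base weight matrix is left untouched and only two small low-rank factors A and B are trained. The rank `r` and scaling values below are hypothetical choices for illustration, not reported by the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update:
    y = W x + (B A) x * (lora_alpha / r). Only A and B receive gradients,
    so trainable parameters drop from d_out*d_in to r*(d_in + d_out).
    """
    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # A starts small and B starts at zero, so training begins
        # exactly from the base model's behavior.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = lora_alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Because the low-rank update can be merged back into the base weight after training (W ← W + scaling · B A), the adapted layer incurs no extra cost at inference time, which fits the abstract's observation that LoRA is advantageous in inference speed.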

Article Details

How to Cite
Kai, A., Zhu, L., & Gong, J. (2023). Efficient Compression of Large Language Models with Distillation and Fine-Tuning. Journal of Computer Science and Software Applications, 3(4), 30–38. https://doi.org/10.5281/zenodo.15165118