Improving Reasoning Stability in Large Language Models via Iterative Self-Questioning and Semantic Calibration
Abstract
Large language models (LLMs) have demonstrated strong capabilities in natural language understanding and generation; however, their reasoning stability and consistency remain limited, particularly in multi-step inference tasks. This paper proposes a novel framework that integrates iterative self-questioning with semantic calibration to improve reasoning robustness. The approach introduces a multi-stage reasoning loop in which intermediate outputs are recursively evaluated and refined using a confidence-aware scoring mechanism. Experiments are conducted on multiple benchmark datasets, including GSM8K, StrategyQA, and MultiArith. The proposed method improves reasoning accuracy from 78.4% to 85.9% (+7.5 percentage points) on GSM8K compared to standard chain-of-thought prompting. On StrategyQA, accuracy increases from 71.2% to 76.8%, while consistency across repeated runs improves by 12.3%. Furthermore, hallucination rates are reduced by 18.6%, as measured by factual consistency metrics. Ablation studies show that semantic calibration contributes the largest performance gain (+4.2 percentage points), followed by iterative refinement (+3.1 percentage points). These results demonstrate that structured reasoning enhancement mechanisms can substantially improve the reliability of LLM outputs in complex reasoning scenarios.
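To make the described loop concrete, the sketch below illustrates one plausible reading of the framework: draft an answer, score it with a confidence-aware mechanism, and refine it through self-questioning and semantic calibration until the score clears a threshold. All function names (query_model, confidence_score, semantic_calibration) and the stopping rule are illustrative placeholders inferred from the abstract, not the authors' implementation.

```python
# Minimal sketch of an iterative self-questioning loop with semantic calibration.
# Assumptions: query_model stands in for a real LLM API call, and confidence_score
# is a toy stand-in for the paper's confidence-aware scoring mechanism.

def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real API client."""
    return "draft answer to: " + prompt


def confidence_score(answer: str) -> float:
    """Toy confidence estimate (a real system might use logprobs or self-rating)."""
    return min(1.0, 0.5 + 0.05 * len(answer.split()))


def semantic_calibration(question: str, answer: str) -> str:
    """Check the answer against the question's intent, then revise it."""
    critique = query_model(f"Does this answer address '{question}'? Answer: {answer}")
    return query_model(f"Revise the answer using this critique: {critique}")


def iterative_reasoning(question: str, max_rounds: int = 3, threshold: float = 0.8) -> str:
    """Multi-stage loop: evaluate and refine intermediate outputs until confident."""
    answer = query_model(question)
    for _ in range(max_rounds):
        if confidence_score(answer) >= threshold:
            break  # stop once the confidence-aware score is high enough
        # Self-questioning: probe what is uncertain, then calibrate semantically.
        probe = query_model(f"What is uncertain or unsupported in: {answer}?")
        answer = semantic_calibration(question, answer + " " + probe)
    return answer


if __name__ == "__main__":
    print(iterative_reasoning("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
```

In a real system the threshold and the number of refinement rounds would be tuned per benchmark; the abstract does not specify these values, so the defaults above are arbitrary.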
Article Details

This work is licensed under a Creative Commons Attribution 4.0 International License.
Mind forge Academia also operates under the Creative Commons Licence CC-BY 4.0, which allows you to copy and redistribute the material in any medium or format, for any purpose, even commercially, provided that you give appropriate attribution.