DOI: 10.3390/info17070639 ISSN: 2078-2489

Improving Kazakh Abstractive Summarization with LLMs via Domain Adaptation and Self-Correction

Talgat Zhabayev, Ualsher Tukeyev, Dina Amirova, Nazym Rakhmanberdi

Recent advances in large language models (LLMs) have improved abstractive text summarization, yet their benefits remain limited for low-resource and morphologically rich languages such as Kazakh. This study investigates the adaptation of the instruction-tuned Gemma-3-4B-it (Google LLC, Mountain View, CA, USA) model for Kazakh news summarization. We propose a two-stage training pipeline combining domain-adaptive pretraining (DAPT) on an unlabeled Kazakh news corpus with parameter-efficient supervised fine-tuning (SFT) using LoRA on summarization datasets. We also examine an inference-time self-correction step in which the model revises its initial summaries to improve consistency and reduce factual errors. Experiments are conducted on Kazakh news data, including a translated version of XSum and corpora collected from BAQ and TengriNews. We compare the baseline, SFT-only, DAPT-only, and DAPT + SFT configurations, with and without self-correction. Evaluation is performed using ROUGE, BERTScore, chrF++, human assessment, and qualitative linguistic analysis. The results show that DAPT combined with SFT achieves the strongest overall performance, while self-correction further improves automatic scores in the DAPT + SFT setting. However, human evaluation indicates that metric gains do not always correspond to higher factual faithfulness or informativeness. These findings highlight the need for domain adaptation and careful evaluation when applying LLMs to Kazakh summarization.
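The inference-time self-correction step described in the abstract can be sketched as a draft-then-revise loop. This is a minimal illustration only: the abstract does not give the authors' prompts or decoding settings, so the prompt wording, the `summarize_with_self_correction` function, the `rounds` parameter, and the `generate` callable (any function that maps a prompt string to model output, for example a wrapper around the fine-tuned model) are all assumptions.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual prompts are not given in the abstract.
DRAFT_PROMPT = (
    "Summarize the following Kazakh news article in one or two sentences.\n\n"
    "Article:\n{article}\n\nSummary:"
)
REVISE_PROMPT = (
    "Below are a news article and a draft summary. Rewrite the summary so that "
    "every statement is supported by the article, and fix any factual errors.\n\n"
    "Article:\n{article}\n\nDraft summary:\n{draft}\n\nRevised summary:"
)


def summarize_with_self_correction(
    article: str,
    generate: Callable[[str], str],
    rounds: int = 1,
) -> str:
    """Produce a draft summary, then ask the same model to revise it.

    `generate` is an assumed interface: it takes a prompt and returns text.
    """
    summary = generate(DRAFT_PROMPT.format(article=article)).strip()
    for _ in range(rounds):
        revised = generate(
            REVISE_PROMPT.format(article=article, draft=summary)
        ).strip()
        # Keep the previous summary if the revision comes back empty.
        if revised:
            summary = revised
    return summary
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; the same function can wrap the baseline, SFT-only, DAPT-only, or DAPT + SFT checkpoints to compare them with and without the revision pass.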
