DOI: 10.3390/data11070154 ISSN: 2306-5729

Bridging the Gap in Arabic Legal NLP: A Novel Large-Scale Corpus and Benchmark for Domain-Adapted Summarisation-Classification

Omar T. Sayed, Amal E. Aboutabl, Amr S. Ghoneim

Significant progress in legal natural language processing (NLP) has enabled advances in tasks such as legal judgment prediction, case retrieval, and question answering. However, the development of analogous technologies for Arabic legal texts remains severely constrained by the scarcity of large-scale, publicly available benchmarks for summarisation and classification. This paper addresses this gap by introducing a novel, comprehensive dataset of 9699 Arabic legal cases sourced from the Saudi Board of Grievances. This corpus is unique in pairing full-length court decisions with expert-written abstractive summaries and multi-class category labels (Administrative, Commercial, and Criminal), establishing a dedicated benchmark for Arabic legal NLP. The dataset was constructed via a reproducible pipeline that ensures high textual fidelity, incorporating specialised optical character recognition (OCR) via Google Document AI and precise structural segmentation into facts, reasons, and summaries. To establish baselines, we conduct an extensive empirical evaluation of seven summarisation models: four extractive algorithms (TextRank, LexRank, Latent Semantic Analysis, and Luhn) and three transformer-based abstractive architectures (AraT5v2, AraBART, and mBART), each evaluated in both base and fine-tuned configurations. Results across ROUGE, BERTScore, and BLEU metrics, together with human evaluation, demonstrate substantial performance gains from domain-specific fine-tuning, with the fine-tuned AraBART model achieving the strongest performance among all evaluated models. Furthermore, we present a novel analysis of the downstream utility of the generated summaries by evaluating their performance on legal category classification using five machine learning models.
This investigation reveals a strong positive correlation between summarisation quality and classification accuracy, empirically demonstrating that domain-adapted abstractive summarisation not only enhances intrinsic evaluation scores but also significantly boosts extrinsic task performance. By providing this essential dataset and comprehensive benchmarking, our work contributes a much-needed resource to the field, facilitating future research and innovations in Arabic legal text analysis.
