DOI: 10.3390/bdcc10070212 ISSN: 2504-2289

CollectivIA: Two-Pipeline Multilingual Legal RAG for Moroccan Territorial Governance with LLM-Assisted and Regex-Based Chunking

Firiel Zouak, Omar El Beqqali, Jamal Riffi

Retrieval grounding is crucial for high-stakes administrative applications, since large language models remain prone to hallucination when answering legal questions. The problem is particularly acute in Moroccan territorial governance, where official legislative PDFs vary widely in digital quality, the legal corpus is bilingual Arabic–French, and users often interact in Moroccan Darija. This paper presents CollectivIA, a multilingual Retrieval-Augmented Generation (RAG) system built for Moroccan territorial governance law. The system supports queries in French, Arabic, and Moroccan Darija and indexes 2272 article-level segments from sixteen official legislative documents. We compare two end-to-end retrieval pipelines: an LLM-assisted semantic chunking architecture using Gemini and ChromaDB, and a regex-based chunking architecture using FAISS. On an expanded multilingual benchmark of 150 legal queries, with 50 queries per language group, the LLM-assisted pipeline achieved higher RAGAS scores than the regex-based pipeline, most notably raising Context Precision from 0.315 to 0.818. The multimodal Vision fallback recovered 456 articles that remained inaccessible to the regex-based pipeline. Overall, the LLM-assisted pipeline produced more coherent legal boundaries and more focused retrieved contexts, while the regex-based design retained broader source diversity. These results suggest that LLM-assisted semantic chunking with a multimodal fallback is a promising approach for multilingual legal RAG over heterogeneous Moroccan legal corpora.
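
To make the regex-based side of the comparison concrete, the following is a minimal sketch of article-level chunking in Python. It is an illustration only, not the authors' implementation: the heading pattern (French "Article N" / "Article premier" and Arabic "المادة N"), the `ARTICLE_HEADING` name, and the `split_articles` helper are all assumptions introduced here.

```python
import re

# Hypothetical heading pattern (an assumption, not taken from the paper):
# French "Article 12" / "Article premier", or Arabic "المادة 12",
# at the start of a line. The rest of the heading line is kept with the match.
ARTICLE_HEADING = re.compile(
    r"^[ \t]*(?:Article\s+(?:premier|\d+)|المادة\s+\d+)[^\n]*$",
    re.MULTILINE | re.IGNORECASE,
)


def split_articles(text):
    """Return one (heading, body) pair per detected article heading.

    Text before the first heading (for example a preamble) is discarded.
    """
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # An article's body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = match.group(0).strip()
        body = text[match.end():end].strip()
        chunks.append((heading, body))
    return chunks


sample = (
    "Préambule\n"
    "Article premier\n"
    "La commune est une collectivité.\n"
    "Article 2\n"
    "Elle gère ses affaires.\n"
    "المادة 3\n"
    "نص عربي.\n"
)
chunks = split_articles(sample)
```

A pattern of this kind can only find headings that exist in the extracted text layer, so a page with no usable text yields no chunks at all, whatever the pattern.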
