CollectivIA: Two-Pipeline Multilingual Legal RAG for Moroccan Territorial Governance with LLM-Assisted and Regex-Based Chunking
Firiel Zouak, Omar El Beqqali, Jamal Riffi

Retrieval grounding is crucial for high-stakes administrative applications, since large language models remain prone to hallucination when answering legal questions. The problem is particularly acute in Moroccan territorial governance, where official legislative PDFs vary widely in digital quality, users often write in Moroccan Darija, and the legal corpus is bilingual Arabic–French. This paper presents CollectivIA, a multilingual Retrieval-Augmented Generation (RAG) system for Moroccan territorial governance law. The system supports queries in French, Arabic, and Moroccan Darija and indexes 2272 article-level segments from sixteen official legislative documents. We compare two end-to-end retrieval pipelines: an LLM-assisted semantic chunking architecture using Gemini and ChromaDB, and a regex-based chunking architecture using FAISS. On an expanded multilingual benchmark of 150 legal queries, with 50 queries per language group, the LLM-assisted pipeline achieved higher RAGAS scores than the regex-based pipeline, most notably raising Context Precision from 0.315 to 0.818. The multimodal Vision fallback recovered 456 articles that remained inaccessible under the regex-based pipeline. Overall, the LLM-assisted pipeline produced more coherent legal boundaries and more focused retrieved contexts, while the regex-based design retained broader source diversity. These results suggest that LLM-assisted semantic chunking with multimodal fallback is a promising approach for improving multilingual legal RAG over heterogeneous Moroccan legal corpora.
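
To make the regex-based baseline concrete, the following is a minimal sketch of article-level chunking over already-extracted text. It is not the authors' implementation: the heading pattern (`Article N`, `Article premier`, `المادة N`), the function name `split_into_articles`, and the sample text are all assumptions chosen for illustration. A splitter of this kind only works when the PDF yields a usable text layer, which is one plausible reason a purely regex-based pipeline could miss articles in low-quality scans.

```python
import re

# Hypothetical heading pattern (an assumption, not taken from the paper):
# a line starting with "Article <n>", "Article premier", or the Arabic
# equivalent "المادة <n>" / "المادة الأولى".
ARTICLE_HEADING = re.compile(
    r"^[ \t]*(?:Article\s+(?:premier|\d+)|المادة\s+(?:الأولى|\d+))",
    re.MULTILINE | re.IGNORECASE,
)


def split_into_articles(text):
    """Split extracted legal text into one chunk per article heading.

    Text before the first heading (title, preamble) is dropped in this sketch.
    """
    matches = list(ARTICLE_HEADING.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # Each chunk runs from its heading to the start of the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(
            {
                "heading": match.group(0).strip(),
                "text": text[match.start():end].strip(),
            }
        )
    return chunks


if __name__ == "__main__":
    sample = (
        "Article premier\nLa commune est une collectivité territoriale.\n"
        "Article 2\nLe conseil règle les affaires de la commune.\n"
        "المادة 3\nنص المادة.\n"
    )
    for chunk in split_into_articles(sample):
        print(chunk["heading"], "|", len(chunk["text"]), "characters")
```

Each resulting chunk could then be embedded and stored in a vector index. An LLM-assisted chunker replaces the fixed pattern with model-proposed boundaries, trading determinism for robustness to irregular headings.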