Comparative evaluation of embedding and generative model combinations in retrieval-augmented generation pipelines
Mateo Hitl, Marina Bagić Babac, Vedran Mornar
Purpose
Retrieval-augmented generation (RAG) systems integrate information retrieval with generative language models to improve the relevance, accuracy and explainability of AI-driven responses. This study evaluates how different configurations of embedding and generative models influence the performance of RAG pipelines for knowledge management (KM) scenarios.
Design/methodology/approach
The study combines a broad benchmark of embedding and generation components with a contemporary open-weight comparison centered on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 and Gemma-2-9B-It. Retrieval configurations are evaluated through recall, latency and storage trade-offs, while generation quality is assessed using ROUGE-L, exact match (EM), token-level F1, BERTScore F1, semantic similarity, answer relevance and faithfulness. The benchmark also includes complementary evaluation on SQuAD and HotpotQA, grounded prompting, abstention prompting, error analysis and long-context stress testing.
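For readers unfamiliar with the lexical metrics named above, the following is a minimal sketch of exact match and token-level F1. It assumes the normalization commonly used for SQuAD-style evaluation (lowercasing, removing punctuation and English articles); the abstract does not state which normalization the study applied, and the function names are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization, not confirmed by the study.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the shared token bag."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a prediction that contains the reference answer plus extra tokens scores zero on EM but receives partial credit on token-F1, which is why the two metrics are usually reported together.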
Findings
Retrieval quality remained the main determinant of end-to-end RAG quality. The strongest shared retrieval setup combined all-mpnet-base-v2, 256-token chunking with 64-token overlap and top-1 retrieval, reaching Recall@1 = 0.938. Among the open-weight generators, Gemma-2-9B-It achieved the strongest lexical and semantic matching, with its best grounded-abstain configuration reaching ROUGE-L = 0.631, EM = 0.456, token-F1 = 0.631 and BERTScore F1 = 0.767. Llama-3-8B-Instruct produced the strongest faithfulness score in the best grounded setting (0.241), while Mistral-7B-Instruct-v0.3 occupied a more conservative operating point with lower answer matching but stronger abstention behavior. HNSW matched exact-search quality for equivalent retrieval configurations while reducing query latency.
Practical implications
The findings support retrieval chunking, top-1 retrieval and grounded prompting as robust design choices for question-answering-oriented RAG. They also suggest that safer abstention-oriented prompting should be treated as a different operating point rather than as a universal default.
Social implications
More reliable RAG systems can improve access to institutional knowledge, support organizational learning and reduce barriers to expertise discovery, especially when system designs balance quality, latency and computational cost.
Originality/value
The paper contributes a component-level benchmark for RAG in KM settings, richer evaluation dimensions and a more explicit treatment of retrieval/generation trade-offs across historical and contemporary open-weight baselines. The design narrows practical claims to what is supported by multi-dataset evidence, error analysis and long-context testing.