Comparative evaluation of embedding and generative model combinations in retrieval-augmented generation pipelines

DOI: 10.1108/dta-07-2025-0609 ISSN: 2514-9288

Mateo Hitl, Marina Bagić Babac, Vedran Mornar

Purpose

Retrieval-augmented generation (RAG) systems integrate information retrieval with generative language models to improve the relevance, accuracy and explainability of AI-driven responses. This study evaluates how different configurations of embedding and generative models influence the performance of RAG pipelines for knowledge management (KM) scenarios.

Design/methodology/approach

The study combines a broad benchmark of embedding and generation components with a contemporary open-weight comparison centered on Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3 and Gemma-2-9B-It. Retrieval configurations are evaluated through recall, latency and storage trade-offs, while generation quality is assessed using ROUGE-L, exact match (EM), token-level F1, BERTScore F1, semantic similarity, answer relevance and faithfulness. The benchmark also includes complementary evaluation on SQuAD and HotpotQA, grounded prompting, abstention prompting, error analysis and long-context stress testing.
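To make two of the answer-matching metrics concrete, the following is a minimal sketch of exact match (EM) and token-level F1. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the abstract does not state which normalization the study used, and the function names `normalize`, `exact_match` and `token_f1` are illustrative only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: SQuAD-style normalization; the study may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over the bag-of-token overlap."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, a prediction of "in Paris, France" against the reference "Paris" scores EM = 0 but token-F1 = 0.5 (precision 1/3, recall 1), which illustrates why the two metrics can diverge for verbose but correct answers.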

Findings

Retrieval quality remained the main determinant of end-to-end RAG quality. The strongest shared retrieval setup combined all-mpnet-base-v2, 256-token chunking with 64-token overlap and top-1 retrieval, reaching Recall@1 = 0.938. Among the open-weight generators, Gemma-2-9B-It achieved the strongest lexical and semantic matching, with its best grounded-abstain configuration reaching ROUGE-L = 0.631, EM = 0.456, token-F1 = 0.631 and BERTScore F1 = 0.767. Llama-3-8B-Instruct produced the strongest faithfulness score in the best grounded setting (0.241), while Mistral-7B-Instruct-v0.3 occupied a more conservative operating point with lower answer matching but stronger abstention behavior. HNSW matched exact-search quality for equivalent retrieval configurations while reducing query latency.
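The chunking and Recall@k terms used in these findings can be illustrated with a short sketch. It assumes a sliding window over an already tokenized document and a single gold chunk per query; neither detail is specified in the abstract, and `chunk_tokens` and `recall_at_k` are hypothetical helper names rather than the authors' code.

```python
from typing import Sequence


def chunk_tokens(tokens: Sequence, size: int = 256, overlap: int = 64) -> list:
    """Split a token sequence into windows of `size` tokens, where each window
    shares its first `overlap` tokens with the end of the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def recall_at_k(ranked_ids: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                k: int = 1) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results.

    Assumption: exactly one gold chunk per query.
    """
    if not gold_ids or len(ranked_ids) != len(gold_ids):
        raise ValueError("need one ranked list per non-empty gold list entry")
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)
```

With the defaults above, a 600-token document yields three chunks of 256, 256 and 216 tokens, and consecutive chunks share 64 tokens. Under this definition, Recall@1 = 0.938 would mean that the top-ranked chunk is the gold chunk for roughly 94% of queries.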

Practical implications

The findings support retrieval chunking, top-1 retrieval and grounded prompting as robust design choices for question-answering-oriented RAG. They also suggest that safer abstention-oriented prompting should be treated as a different operating point rather than as a universal default.

Social implications

More reliable RAG systems can improve access to institutional knowledge, support organizational learning and reduce barriers to expertise discovery, especially when system designs balance quality, latency and computational cost.

Originality/value

The paper contributes a component-level benchmark for RAG in KM settings, richer evaluation dimensions and a more explicit treatment of retrieval/generation trade-offs across historical and contemporary open-weight baselines. The design narrows practical claims to what is supported by multi-dataset evidence, error analysis and long-context testing.
