Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian

Publication date

2025-07-30T15:21:19Z

2025



Abstract

Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel


Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
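
The metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formulas, so the definitions below are assumptions: Overlap@K is taken as the share of the top-K documents retrieved for the active query that also appear in the top-K for its passive counterpart, and MRR and Recall@K follow their standard information-retrieval definitions. All function names and document IDs are hypothetical.

```python
from typing import Hashable, Sequence


def overlap_at_k(active_ids: Sequence[Hashable],
                 passive_ids: Sequence[Hashable], k: int) -> float:
    """Assumed definition: fraction of the top-k shared by both rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(active_ids[:k]) & set(passive_ids[:k])
    # Dividing by k (not by list length) penalises rankings shorter than k.
    return len(shared) / k


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_id: Hashable) -> float:
    """1 / rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[Hashable]],
                         relevant_ids: Sequence[Hashable]) -> float:
    """Average reciprocal rank over queries (one relevant document each)."""
    if len(rankings) != len(relevant_ids):
        raise ValueError("need exactly one relevant id per ranking")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant_ids))
    return total / len(rankings)


def recall_at_k(ranked_ids: Sequence[Hashable],
                relevant_ids: Sequence[Hashable], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Hypothetical rankings for one active/passive question pair.
active_ranking = ["d1", "d2", "d3", "d4"]
passive_ranking = ["d2", "d1", "d5", "d3"]

pair_overlap = overlap_at_k(active_ranking, passive_ranking, k=3)
pair_mrr = mean_reciprocal_rank([active_ranking, passive_ranking], ["d1", "d1"])
passive_recall_at_1 = recall_at_k(passive_ranking, ["d1"], k=1)
```

In this toy pair the two rankings share two of their top three documents, and the relevant document drops from rank 1 to rank 2 under the passive query. This is the kind of voice-induced instability the consistency and retrieval metrics are meant to expose.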

Document Type

Master's final project

Language

English

Subjects and keywords

Russian language -- Passive voice

Rights

Creative Commons Attribution 4.0 International license (CC BY 4.0)

https://creativecommons.org/licenses/by/4.0/
