Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian

Fomicheva, Marina

To access the full text documents, please follow this link: http://hdl.handle.net/10230/71040

Author

Fomicheva, Marina

Publication date

2025-07-30T15:21:19Z

2025

Abstract

Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel

Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
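
The abstract names three metrics (Overlap@K, MRR, Recall@K) without defining them on this page. The following is a minimal sketch under common definitions, not the thesis's own implementation: Overlap@K is assumed here to be the share of documents that the top-K results for the active and the passive query have in common, and all function names and document IDs are hypothetical.

```python
from typing import Sequence, Set


def overlap_at_k(ranking_a: Sequence[str], ranking_b: Sequence[str], k: int) -> float:
    """Assumed definition: fraction of the top-k documents shared by two rankings."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def recall_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranking: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         relevants: Sequence[Set[str]]) -> float:
    """Average reciprocal rank over a set of queries."""
    if not rankings:
        return 0.0
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Hypothetical retrieval results for one question in active and passive voice.
active_ranking = ["d3", "d1", "d7", "d2"]
passive_ranking = ["d1", "d5", "d3", "d9"]
gold = {"d1"}

print("Overlap@3:", overlap_at_k(active_ranking, passive_ranking, 3))
print("Recall@3 (active):", recall_at_k(active_ranking, gold, 3))
print("MRR:", mean_reciprocal_rank([active_ranking, passive_ranking], [gold, gold]))
```

Under this reading, Overlap@K measures consistency without reference to gold labels: two voice variants can retrieve the same documents (high overlap) and still both miss the relevant one, which is why the retrieval-quality metrics are reported separately.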

Document Type

Master's final project

Language

English

Subjects and keywords

Russian language -- Passive voice

Rights

CC Attribution 4.0 International license (CC BY 4.0)

https://creativecommons.org/licenses/by/4.0/

This item appears in the following Collection(s)

Treballs d'estudiants (Student works) [4945]
