Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian

dc.contributor.author
Fomicheva, Marina
dc.date.accessioned
2025-08-01T06:11:34Z
dc.date.available
2025-08-01T06:11:34Z
dc.date.issued
2025-07-30T15:21:19Z
dc.date.issued
2025
dc.identifier
http://hdl.handle.net/10230/71040
dc.identifier.uri
http://hdl.handle.net/10230/71040
dc.description.abstract
Master's thesis in Theoretical and Applied Linguistics. Supervisor: Dr. Núria Bel
dc.description.abstract
Retrieval-Augmented Generation (RAG) systems depend on dense embeddings to retrieve relevant context for open-domain question answering. A critical requirement for these embeddings is semantic consistency – the ability to remain stable across meaning-preserving variation. This study examines how modern encoder-only models handle active/passive voice alternations in English and Russian. Using a bilingual dataset of 500 factual question pairs, we evaluate semantic consistency (Overlap@K) and retrieval quality (MRR, Recall@K) in raw and fine-tuned versions of EuroBERT and RuModernBERT. Findings show that representations of raw encoders are only partially semantic: they are sensitive to word order, morphology, and query length. Consistency was significantly higher in English, indicating that morphologically rich languages like Russian are more challenging. EuroBERT performed poorly on Russian due to limited training exposure and subword fragmentation. RuModernBERT performed better on Russian passives, likely reflecting its training data. Contrastive fine-tuning substantially improved performance, though not all fine-tuned models benefited equally – EuroBERT_FT and LaBSE showed limitations tied to tokenization and training objectives.
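The metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact formulas, so the definitions below are assumptions based on common usage: Overlap@K is taken as the share of top-K documents retrieved for both the active and the passive phrasing of a question, MRR as the mean of the reciprocal rank of the first relevant document, and Recall@K as the fraction of relevant documents found in the top K. All function names and document IDs are hypothetical.

```python
from typing import Sequence, Set


def overlap_at_k(active_ranked: Sequence[str], passive_ranked: Sequence[str], k: int) -> float:
    """Assumed definition: |top-k(active) & top-k(passive)| / k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(active_ranked[:k]) & set(passive_ranked[:k])) / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]], gold: Sequence[Set[str]]) -> float:
    """Average reciprocal rank over all queries (one ranking and one gold set each)."""
    if not runs or len(runs) != len(gold):
        raise ValueError("runs and gold must be non-empty and of equal length")
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(runs)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical rankings for the active and passive version of one question.
active = ["d1", "d2", "d3", "d4"]
passive = ["d2", "d5", "d1", "d6"]

consistency = overlap_at_k(active, passive, k=3)                 # d1 and d2 are shared
mrr = mean_reciprocal_rank([active, passive], [{"d1"}, {"d1"}])  # ranks 1 and 3
recall = recall_at_k(passive, {"d1", "d6"}, k=3)                 # only d1 is in the top 3
```

Under these assumed definitions, a model whose embeddings are stable across voice alternation would score close to 1.0 on Overlap@K, independently of whether the retrieved documents are actually relevant. That is why the abstract reports consistency separately from MRR and Recall@K.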
dc.format
application/pdf
dc.language
eng
dc.rights
CC Attribution 4.0 International License (CC BY 4.0)
dc.rights
https://creativecommons.org/licenses/by/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Russian language -- Passive voice
dc.title
Semantic consistency in RAG: evaluating modern encoder-only models on active and passive voice in English and Russian
dc.type
info:eu-repo/semantics/masterThesis

