Quality-driven synthetic text generation for multilingual speech translation with audio large language models


dc.contributor
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.contributor
Hernando Pericás, Francisco Javier
dc.contributor.author
Val Vila, Xavier
dc.date.accessioned
2026-03-07T05:39:02Z
dc.date.available
2026-03-07T05:39:02Z
dc.date.issued
2025-10-28
dc.identifier
https://hdl.handle.net/2117/456950
dc.identifier
ETSETB-230.197783
dc.identifier.uri
https://hdl.handle.net/2117/456950
dc.description.abstract
This thesis explores quality-driven synthetic data generation as a scalable solution for multilingual speech-to-text translation (S2TT), focusing on Iberian languages with limited natural resources. Leveraging large language models (LLMs) and rigorous reference-free quality filtering via BLASER 2.0, an end-to-end pipeline was implemented to generate millions of high-quality synthetic translations. The approach demonstrates substantial improvements in translation quality and semantic similarity for low-resource languages such as Asturian and Occitan, while enabling efficient scaling to diverse linguistic domains. Experimental results reveal that models trained on filtered synthetic data achieve competitive and often state-of-the-art performance in S2TT tasks, and narrow the gap between direct and Chain-of-Thought cascade architectures. This work lays foundational evidence that scalable, quality-centric synthetic data pipelines are powerful enablers for inclusive, robust multilingual speech technologies, especially where manual annotation remains costly or infeasible.
dc.format
application/pdf
dc.language
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights
Dissemination of the work is authorized under the Creative Commons license or similar 'Attribution-NonCommercial-NoDerivatives'
dc.rights
Open Access
dc.subject
UPC subject areas::Computer science::Artificial intelligence::Machine learning
dc.subject
Text-to-speech software
dc.subject
Machine learning
dc.subject
Speech
dc.subject
LLM
dc.subject
Text
dc.subject
Generation
dc.subject
Synthetic
dc.subject
Quality
dc.subject
Speech synthesis (Software)
dc.subject
Machine learning
dc.title
Quality-driven synthetic text generation for multilingual speech translation with audio large language models
dc.type
Master thesis

