Quality-driven synthetic text generation for multilingual speech translation with audio large language models


dc.contributor
Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions
dc.contributor
Hernando Pericás, Francisco Javier
dc.contributor.author
Val Vila, Xavier
dc.date.accessioned
2026-03-07T05:39:02Z
dc.date.available
2026-03-07T05:39:02Z
dc.date.issued
2025-10-28
dc.identifier
https://hdl.handle.net/2117/456950
dc.identifier
ETSETB-230.197783
dc.identifier.uri
https://hdl.handle.net/2117/456950
dc.description.abstract
This thesis explores quality-driven synthetic data generation as a scalable solution for multilingual speech-to-text translation (S2TT), focusing on Iberian languages with limited natural resources. Leveraging large language models (LLMs) and rigorous reference-free quality filtering via BLASER 2.0, an end-to-end pipeline was implemented to generate millions of high-quality synthetic translations. The approach demonstrates substantial improvements in translation quality and semantic similarity for low-resource languages such as Asturian and Occitan, while enabling efficient scaling to diverse linguistic domains. Experimental results reveal that models trained on filtered synthetic data achieve competitive and often state-of-the-art performance in S2TT tasks, and narrow the gap between direct and Chain-of-Thought cascade architectures. This work lays foundational evidence that scalable, quality-centric synthetic data pipelines are powerful enablers for inclusive, robust multilingual speech technologies, especially where manual annotation remains costly or infeasible.
dc.format
application/pdf
dc.language
eng
dc.publisher
Universitat Politècnica de Catalunya
dc.rights
Distribution of the work is authorized under the Creative Commons license or similar 'Attribution-NonCommercial-NoDerivatives'
dc.rights
Open Access
dc.subject
UPC subject areas::Computer science::Artificial intelligence::Machine learning
dc.subject
Text-to-speech software
dc.subject
Machine learning
dc.subject
Speech
dc.subject
LLM
dc.subject
Text
dc.subject
Generation
dc.subject
Synthetic
dc.subject
Quality
dc.subject
Speech synthesis (Software)
dc.subject
Machine learning
dc.title
Quality-driven synthetic text generation for multilingual speech translation with audio large language models
dc.title
Quality-Driven Synthetic Text Generation for Multilingual Speech Translation with Audio Large Language Models
dc.type
Master thesis

