Quality-driven synthetic text generation for multilingual speech translation with audio large language models


Other authors

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

Hernando Pericás, Francisco Javier

Publication date

2025-10-28



Abstract



This thesis explores quality-driven synthetic data generation as a scalable solution for multilingual speech-to-text translation (S2TT), focusing on Iberian languages with limited natural resources. Leveraging large language models (LLMs) and rigorous reference-free quality filtering via BLASER 2.0, an end-to-end pipeline was implemented to generate millions of high-quality synthetic translations. The approach demonstrates substantial improvements in translation quality and semantic similarity for low-resource languages such as Asturian and Occitan, while enabling efficient scaling to diverse linguistic domains. Experimental results reveal that models trained on filtered synthetic data achieve competitive and often state-of-the-art performance in S2TT tasks, and narrow the gap between direct and Chain-of-Thought cascade architectures. This work lays foundational evidence that scalable, quality-centric synthetic data pipelines are powerful enablers for inclusive, robust multilingual speech technologies, especially where manual annotation remains costly or infeasible.
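
The filtering step mentioned in the abstract can be pictured with a minimal sketch. It is not the thesis's implementation: the `Candidate` record, the `filter_by_quality` helper, the threshold value, and the toy lookup scorer are all hypothetical, introduced only for illustration. In a real pipeline, `score_fn` would wrap a reference-free quality estimator such as BLASER 2.0; the abstract does not state the score scale or cutoff used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    source_id: str    # identifier of the source speech segment (hypothetical)
    translation: str  # synthetic translation produced by an LLM


def filter_by_quality(
    candidates: Iterable[Candidate],
    score_fn: Callable[[Candidate], float],
    threshold: float,
) -> List[Candidate]:
    """Keep candidates whose reference-free quality score is >= threshold."""
    return [c for c in candidates if score_fn(c) >= threshold]


# Stand-in scorer for illustration only: a lookup table, not a real quality model.
toy_scores = {"utt1": 4.2, "utt2": 2.1, "utt3": 3.5}
candidates = [
    Candidate("utt1", "Bon dia a tothom."),
    Candidate("utt2", "???"),
    Candidate("utt3", "Gràcies per venir."),
]

# The threshold of 3.5 is an assumed value, not one reported in the thesis.
kept = filter_by_quality(candidates, lambda c: toy_scores[c.source_id], threshold=3.5)
kept_ids = [c.source_id for c in kept]
```

Passing the scorer in as a function keeps the filter independent of any particular quality model, so the same loop can be reused if the estimator or cutoff changes.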

Document type

Master thesis

Language

English

Published by

Universitat Politècnica de Catalunya

Recommended citation

This citation was generated automatically.

Rights

Distribution of this work is authorized under the Creative Commons license or similar 'Attribution-NonCommercial-NoDerivatives'

Open Access

This item appears in the following collection(s)