Exploring the integration of large language models for automatic emotion labeling in speech

dc.contributor.author
Yun Chien, Yi
dc.date.accessioned
2025-10-22T19:46:56Z
dc.date.available
2025-10-22T19:46:56Z
dc.date.issued
2025-10-20T16:01:56Z
dc.date.issued
2025
dc.identifier
http://hdl.handle.net/10230/71584
dc.identifier.uri
http://hdl.handle.net/10230/71584
dc.description.abstract
Master's thesis for the Master in Intelligent Interactive Systems
dc.description.abstract
Supervisor: Prof. María Inés Torres Barañano
dc.description.abstract
In this work, we present a comprehensive comparison of methodologies for speech emotion recognition (SER), focusing on evaluating the effectiveness of large language models (LLMs) in this domain. Our study is structured in three parts. First, we extract audio embeddings using models such as WavLM, HuBERT, and Dasheng, and use classical machine learning classifiers, Support Vector Machine (SVM) and Multilayer Perceptron (MLP), for emotion prediction. This approach serves as a baseline for comparison. Second, we investigate the capacity of LLMs such as GPT-4o, Qwen2-Audio, and Amazon Nova Sonic to analyze audio features, including speaker attributes such as gender, thereby extending their application beyond traditional natural language processing. Third, we explore a more integrated approach that feeds raw audio directly into an audio-capable LLM, such as Qwen2-Audio-7B-Instruct, for end-to-end emotion classification, without the need for traditional signal-processing-based feature extraction. We evaluate and compare the performance of these methodologies on metrics such as accuracy, precision, recall, and F1-score, with a primary focus on the results obtained from LLM-based models. Our results reveal several key insights: (1) data distribution significantly affects classifier performance; (2) different audio embeddings yield different results even with the same classifier and dataset; and (3) despite their capability, current LLMs still underperform classical classifiers such as SVM and MLP in emotion prediction tasks.
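The baseline described in the abstract (pretrained speech embeddings fed to a classical classifier) can be sketched as follows. This is a minimal illustration, not the thesis's actual setup: the random vectors stand in for WavLM/HuBERT/Dasheng utterance embeddings, and the embedding dimension, class count, and split are illustrative assumptions.

```python
# Baseline sketch: speech-emotion classification from fixed-size audio
# embeddings with an SVM. Placeholder random vectors stand in for real
# WavLM/HuBERT/Dasheng embeddings; all sizes below are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_clips, emb_dim = 200, 768                 # 768 = typical base-model hidden size
X = rng.normal(size=(n_clips, emb_dim))     # placeholder utterance embeddings
y = rng.integers(0, 4, size=n_clips)        # 4 emotion classes (e.g. angry/happy/neutral/sad)

# Hold out a test split, fit the SVM baseline, and score with macro F1,
# one of the metrics the study reports alongside accuracy/precision/recall.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
preds = clf.predict(X_te)
print("macro F1:", f1_score(y_te, preds, average="macro"))
```

In the real pipeline, `X` would come from mean-pooling the hidden states of a pretrained speech encoder over each utterance; swapping the encoder while keeping the classifier fixed is what exposes insight (2) above.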
dc.format
application/pdf
dc.language
eng
dc.rights
CC Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)
dc.rights
https://creativecommons.org/licenses/by-nc-sa/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Emotions
dc.title
Exploring the integration of large language models for automatic emotion labeling in speech
dc.type
info:eu-repo/semantics/masterThesis

