Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations

dc.contributor.author
Cárdenas Gracia, Sergio
dc.date.accessioned
2026-02-07T20:30:31Z
dc.date.available
2026-02-07T20:30:31Z
dc.date.issued
2026-02-06T13:11:55Z
dc.date.issued
2025
dc.identifier
https://hdl.handle.net/10230/72481
dc.identifier.uri
https://hdl.handle.net/10230/72481
dc.description.abstract
Master's thesis for: Master in Sound and Music Computing
dc.description.abstract
Supervisor: Pablo Alonso Jiménez
dc.description.abstract
Co-Supervisor: Dmitry Bogdanov
dc.description.abstract
This project investigates contrastive learning techniques for aligning audio and text representations in the music domain, focusing on scenarios with limited data and computational resources. We provide a comprehensive review of existing methods relevant to music-text contrastive learning. Two audio encoders, HTSAT and MAEST, initialized with pretrained weights, are integrated with a frozen RoBERTa text encoder within the LAION-AI CLAP framework and fine-tuned on the MTG-Jamendo dataset. Model performance is evaluated on three tasks: zero-shot genre classification on the GTZAN dataset, multi-label tag classification on the MagnaTagATune dataset, and text-to-music retrieval on the Song Describer dataset. Results show that HTSAT generalizes better in low-data settings, while MAEST tends to overfit, highlighting the impact of encoder complexity in resource-constrained environments. Attempts to mitigate MAEST’s overfitting with weight decay and learning rate decay were unsuccessful. Additionally, the study highlights the critical role of data volume and batch size in contrastive learning effectiveness. The source code for this work is publicly available at https://github.com/SerX610/smc-master-thesis
dc.format
application/pdf
dc.language
eng
dc.rights
Creative Commons license Attribution-NonCommercial-NoDerivs 4.0 International
dc.rights
Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights
https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
info:eu-repo/semantics/openAccess
dc.subject
Computer music
dc.title
Comparison of Audio Encoders for Audio-Text Contrastive Learning Representations
dc.type
info:eu-repo/semantics/masterThesis

