A study on universal language-agnostic sentence embeddings and applications

Ribalta Albado, Maria

Author

Ribalta Albado, Maria

Other authors

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

Rodríguez Fonollosa, José Adrián

Publication date

2021-06-28

Abstract

In this project we study universal language-agnostic sentence embeddings: internal neural-network representations of sentences that are independent of both the task and the language. More precisely, we focus on how combining the sentence embeddings of different models can improve results on benchmarks for well-known tasks. We also confirm our results by applying the methods to two self-created tasks involving a minority language, Occitan. We used four different architectures that produce four different encodings, each with its own characteristics and dimensions, and explored how they behave when ensembled via concatenation or addition. This simple methodology shows remarkable improvements without any further training or fine-tuning at any point in the experiments, which invites reconsidering whether new neural networks need to be created each time or whether the existing state of the art can be reused. Reusing existing models would save considerable time and resources: instead of training complex models from scratch, simple linear-cost operations lead to surprisingly good results.
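
The two ensembling operations named in the abstract, concatenation and addition, can be sketched as follows. This is only an illustrative sketch, not the thesis code: the function names, the L2 normalisation step, the random stand-in vectors and the dimensions 768 and 1024 are all assumptions made for the example. The abstract does not say how encodings of different dimensions are added, so the sketch only adds encodings of equal dimension.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row (one sentence embedding) to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def ensemble_concat(embeddings):
    """Concatenate per-model embeddings along the feature axis.

    Works for models with different dimensions; the output dimension
    is the sum of the input dimensions.
    """
    return np.concatenate([l2_normalize(e) for e in embeddings], axis=1)


def ensemble_add(embeddings):
    """Add per-model embeddings element-wise.

    Assumption: all models share the same dimension. How to handle
    mismatched dimensions is not specified in the abstract.
    """
    if len({e.shape for e in embeddings}) != 1:
        raise ValueError("addition needs embeddings of identical shape")
    return np.sum([l2_normalize(e) for e in embeddings], axis=0)


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T


# Random vectors stand in for the outputs of three hypothetical encoders
# applied to the same 3 sentences (placeholder data, not real results).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(3, 768))
emb_b = rng.normal(size=(3, 1024))
emb_c = rng.normal(size=(3, 768))

concat = ensemble_concat([emb_a, emb_b])
added = ensemble_add([emb_a, emb_c])

print(concat.shape)  # (3, 1792)
print(added.shape)   # (3, 768)
```

Both operations cost time linear in the embedding dimension per sentence and need no training, which matches the "linear-cost" claim in the abstract. The combined vectors can then be compared with `cosine_similarity` for similarity search.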

Document Type

Bachelor thesis

Language

English

Subjects and keywords

UPC subject areas::Telecommunications engineering::Signal processing::Speech and acoustic signal processing; Natural language processing (Computer science); Embedded computer systems; NLP; Sentence Embeddings; Ensemble; Multilingual; Semantic Similarity; Similarity Search; BERT; LASER; LaBSE

Publisher

Universitat Politècnica de Catalunya

Rights

Open Access

This item appears in the following Collection(s)

Treballs acadèmics [82541]
