<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-13T13:29:35Z</responseDate><request verb="GetRecord" identifier="oai:www.recercat.cat:2117/376954" metadataPrefix="didl">https://recercat.cat/oai/request</request><GetRecord><record><header><identifier>oai:recercat.cat:2117/376954</identifier><datestamp>2025-07-23T01:09:56Z</datestamp><setSpec>com_2072_1033</setSpec><setSpec>col_2072_452951</setSpec></header><metadata><d:DIDL xmlns:d="urn:mpeg:mpeg21:2002:02-DIDL-NS" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="urn:mpeg:mpeg21:2002:02-DIDL-NS http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-21_schema_files/did/didl.xsd">
   <d:Item id="hdl_2117_376954">
      <d:Descriptor>
         <d:Statement mimeType="application/xml; charset=utf-8">
            <dii:Identifier xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS" xsi:schemaLocation="urn:mpeg:mpeg21:2002:01-DII-NS http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-21_schema_files/dii/dii.xsd">urn:hdl:2117/376954</dii:Identifier>
         </d:Statement>
      </d:Descriptor>
      <d:Descriptor>
         <d:Statement mimeType="application/xml; charset=utf-8">
            <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
               <dc:title>Understanding and improving self-attention mechanisms</dc:title>
               <dc:creator>Pujol Perich, David</dc:creator>
               <dc:subject>Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural</dc:subject>
               <dc:subject>Deep learning (Machine learning)</dc:subject>
               <dc:subject>Natural language processing (Computer science)</dc:subject>
               <dc:subject>Computer vision</dc:subject>
               <dc:subject>aprenentatge profund</dc:subject>
               <dc:subject>self-attention</dc:subject>
               <dc:subject>processament de llenguatge natural</dc:subject>
               <dc:subject>visió per computació</dc:subject>
               <dc:subject>models d'ordre superior</dc:subject>
               <dc:subject>deep learning</dc:subject>
               <dc:subject>efficient Transformers</dc:subject>
               <dc:subject>natural language processing</dc:subject>
               <dc:subject>computer vision</dc:subject>
               <dc:subject>high-order models</dc:subject>
               <dc:subject>aprenentatge automàtic</dc:subject>
               <dc:subject>machine learning</dc:subject>
               <dc:subject>Transformers</dc:subject>
               <dc:subject>Aprenentatge profund</dc:subject>
               <dc:subject>Tractament del llenguatge natural (Informàtica)</dc:subject>
               <dc:subject>Visió per ordinador</dc:subject>
                <dc:description>Recent years have seen the vast potential of the Transformer model, which is arguably the first general-purpose architecture in the sense that it achieves state-of-the-art performance in numerous fields (e.g., Computer Vision, Natural Language Processing, autonomous driving) with minimal architectural modifications. The success of Transformers relies heavily on the self-attention mechanism, whose behaviour nevertheless remains somewhat obscure. In this thesis, we first focus on bridging this gap from an empirical perspective, studying its main inductive biases and limitations. We also propose the ReLA Nyströmformer, a novel architecture that attains linear complexity (considerably improving on the original quadratic complexity of self-attention) while proving empirically superior to a broad set of state-of-the-art benchmarks. We further observe that Transformer architectures are ultimately third-order-interaction-based models (i.e., they can be formalized as third-order polynomials) whose tractability strongly depends on a number of inductive biases. This motivates the last part of this thesis, where we discuss the suitability of devising higher-order models (i.e., based on higher-order polynomials) from both a predictive and an interpretability point of view. Finally, we propose two novel architectures, the Low-rank Deep Polynomial Network and the Adaptive attention, based on low-rank projections and automatic attention pattern learning. The state-of-the-art performance of both models underscores the need for further research in this direction to fully elucidate their potential.</dc:description>
               <dc:date>2022-07-10</dc:date>
               <dc:type>Master thesis</dc:type>
               <dc:rights>Restricted access - author's decision</dc:rights>
               <dc:publisher>Universitat Politècnica de Catalunya</dc:publisher>
            </oai_dc:dc>
         </d:Statement>
      </d:Descriptor>
   </d:Item>
</d:DIDL></metadata></record></GetRecord></OAI-PMH>