<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="static/style.xsl"?><OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2026-04-05T11:12:25Z</responseDate><request verb="GetRecord" identifier="oai:www.recercat.cat:2117/448978" metadataPrefix="rdf">https://recercat.cat/oai/request</request><GetRecord><record><header><identifier>oai:recercat.cat:2117/448978</identifier><datestamp>2026-02-07T07:30:10Z</datestamp><setSpec>com_2072_1033</setSpec><setSpec>col_2072_452950</setSpec></header><metadata><rdf:RDF xmlns:rdf="http://www.openarchives.org/OAI/2.0/rdf/" xmlns:ow="http://www.ontoweb.org/ontology/1#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ds="http://dspace.org/ds/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:doc="http://www.lyncode.com/xoai" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/rdf/ http://www.openarchives.org/OAI/2.0/rdf.xsd">
   <ow:Publication rdf:about="oai:recercat.cat:2117/448978">
      <dc:title>Empowering vector architectures for ML: The CAMP architecture for matrix multiplication</dc:title>
      <dc:creator>Esmali Nojehdeh, Mohammadreza</dc:creator>
      <dc:creator>Mokhtarnia, Hossein</dc:creator>
      <dc:creator>Pavón Rivera, Julián</dc:creator>
      <dc:creator>Rodas Quiroga, Narcís</dc:creator>
      <dc:creator>Figueras Bagué, Roger</dc:creator>
      <dc:creator>Reggiani, Enrico</dc:creator>
      <dc:creator>Moretó Planas, Miquel</dc:creator>
      <dc:creator>Unsal, Osman Sabri</dc:creator>
      <dc:creator>Cristal Kestelman, Adrián</dc:creator>
      <dc:creator>Ayguadé Parra, Eduard</dc:creator>
      <dc:subject>Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors</dc:subject>
      <dc:subject>Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Aprenentatge automàtic</dc:subject>
      <dc:subject>Vector architecture</dc:subject>
      <dc:subject>SIMD-support unit</dc:subject>
      <dc:subject>Vector processing unit</dc:subject>
      <dc:subject>Matrix multiplication</dc:subject>
      <dc:subject>Hardware-software co-design</dc:subject>
      <dc:subject>Quantization</dc:subject>
      <dc:subject>LLM</dc:subject>
      <dc:subject>CNN</dc:subject>
      <dc:description>This study presents the Cartesian Accumulative Matrix Pipeline (CAMP) architecture, a novel approach designed to enhance matrix multiplication in Vector Architectures (VAs) and Single Instruction, Multiple Data (SIMD) units. CAMP improves the processing efficiency of Quantized Neural Networks (QNNs). Matrix multiplication is a cornerstone of machine learning applications, and its quantized versions are increasingly popular for more efficient operations. Unfortunately, existing VAs and SIMD-support units struggle to handle these quantized formats efficiently. In this work, we propose CAMP, a simple yet effective architecture that leverages a hybrid multiplier. The CAMP architecture significantly advances the performance of vector architectures in handling quantized data, enabling more efficient execution of matrix multiplication across various platforms, specifically targeting the ARMv8 Scalable Vector Extension (SVE) and edge RISC-V with SIMD-support unit architectures. Thanks to its hierarchical design, CAMP natively supports multiple integer precisions; here we focus on 4-bit and 8-bit, while the unit remains applicable as a general-purpose integer multiplier across other bit widths. In addition to increasing throughput, CAMP’s architectural design also contributes to energy efficiency, making it an effective solution for low-power applications. Evaluations on a range of Large Language Models (LLMs) and Convolutional Neural Networks (CNNs) demonstrate that matrix multiplication operations using the proposed micro-architecture achieve up to 17× and 23× performance improvements compared to their respective baselines, the ARM A64FX core and a RISC-V-based edge System-on-Chip (SoC).
Furthermore, synthesis and place-and-route (PnR) of the CAMP micro-architecture, using Synopsys ICC2 for the ARM TSMC 7nm flow (A64FX) and Cadence Innovus for the GlobalFoundries 22nm flow (RISC-V SoC), add only 1% and 4% area overhead, respectively, compared to the baseline designs.</dc:description>
      <dc:description>This work has received funding from the DARE SGA1 Project, from the European High-Performance Computing Joint Undertaking (JU) under Grant Agreement No. 101202459, and from the PCI2024-161687-3 Project funded by MICIU/AEI/10.13039/501100011033 and the European Union’s “NextGenerationEU”/PRTR. The JU receives support from the European Union’s Horizon Europe research and innovation programme and Spain, Germany, Czechia, Italy, the Netherlands, Belgium, Finland, Greece, Croatia, Portugal, Poland, Sweden, France and Austria. This research was supported by the Spanish Ministry of Science and Innovation through contracts PID2023-146511NB-I00 and PID2023-147979NB-C21, and by the Ministry for Digital Transformation and Public Service, within the framework of the Recovery, Transformation and Resilience Plan – NextGenerationEU (REGAGE22e00058408992); the Generalitat de Catalunya through contract 2021-SGR-00763; and the Lenovo-BSC Framework Contract (2020).</dc:description>
      <dc:description>Peer Reviewed</dc:description>
      <dc:description>Postprint (published version)</dc:description>
      <dc:date>2025</dc:date>
      <dc:type>Conference report</dc:type>
      <dc:relation>https://dl.acm.org/doi/10.1145/3725843.3760547</dc:relation>
      <dc:relation>info:eu-repo/grantAgreement/AEI//PCI2024-161687-3</dc:relation>
      <dc:rights>http://creativecommons.org/licenses/by/4.0/</dc:rights>
      <dc:rights>Open Access</dc:rights>
      <dc:rights>Attribution 4.0 International</dc:rights>
      <dc:publisher>Association for Computing Machinery (ACM)</dc:publisher>
   </ow:Publication>
</rdf:RDF></metadata></record></GetRecord></OAI-PMH>