Cross-architecture benchmarking and performance evaluation of HPC systems

Other authors

Universitat Politècnica de Catalunya. Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial

Vicente Dorca, David

Banchelli Gracia, Fabio

Publication date

2025-07-02



Abstract

This thesis presents a comprehensive evaluation of the New Generation General Purpose Partition of the MareNostrum 5 supercomputer, deployed at the Barcelona Supercomputing Center and built on the NVIDIA Grace CPU architecture. The analysis is conducted at three levels: micro-benchmarks targeting specific system components, High Performance Computing benchmarks for system-to-system comparisons, and a performance study using a scientific application that simulates a real-world workload. Additionally, this work compares the new cluster to the two existing operational partitions of MareNostrum 5, the General Purpose Partition and the Accelerated Partition, both of which are based on Intel's Sapphire Rapids micro-architecture. This comparison aims to assess the maturity and competitiveness of the NVIDIA Grace-based system relative to established technologies. The study also explores the performance variability introduced by using different compilers and runtimes on the new cluster. The findings indicate that the NVIDIA Grace-based system is largely mature, delivering strong out-of-the-box performance with minimal tuning, notably for memory-bound workloads: its memory bandwidth reached the advertised 1 TB/s, more than twice that of the Intel-based clusters. However, some hardware characteristics remain opaque due to limited documentation, and the choice of compiler and runtime has a measurable impact on performance. In terms of scalability, the system demonstrates efficient node utilization and improved energy efficiency in standard HPC benchmarks. Although the CPU is not floating-point centric, as evidenced by its 5.50 TFlop/s per node in HPL compared to 6.61 TFlop/s on the General Purpose Partition, it is notably energy efficient: it reaches around 9.10 GFlop/(s × W) for a single node, outperforming the 7.59 GFlop/(s × W) observed on the General Purpose Partition. For real-life workloads, the new architecture outperforms the x86 systems in smaller-scale runs, but its advantage diminishes at larger scales, likely due to load-balancing issues.
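The HPL and energy-efficiency figures quoted in the abstract can be put side by side with a little arithmetic. The sketch below is only illustrative: the dictionary names and the `relative` helper are assumptions introduced here, not part of the thesis, and the numbers are the rounded single-node values from the abstract.

```python
# Single-node figures as quoted in the abstract (rounded values).
# "grace" = New Generation General Purpose Partition (NVIDIA Grace),
# "gpp"   = General Purpose Partition (Intel Sapphire Rapids).
# These labels are illustrative names, not identifiers from the thesis.
HPL_TFLOPS = {"grace": 5.50, "gpp": 6.61}   # TFlop/s per node in HPL
EFFICIENCY = {"grace": 9.10, "gpp": 7.59}   # GFlop/(s x W), single node


def relative(new: float, baseline: float) -> float:
    """Return the ratio new / baseline."""
    return new / baseline


# Ratio below 1 means Grace delivers less raw HPL throughput per node.
hpl_ratio = relative(HPL_TFLOPS["grace"], HPL_TFLOPS["gpp"])

# Ratio above 1 means Grace delivers more floating-point work per watt.
eff_ratio = relative(EFFICIENCY["grace"], EFFICIENCY["gpp"])

print(f"HPL throughput, Grace vs GPP:    {hpl_ratio:.3f}x")
print(f"Energy efficiency, Grace vs GPP: {eff_ratio:.3f}x")
```

Read this way, the Grace node delivers roughly five sixths of the General Purpose Partition's HPL throughput while doing about a fifth more floating-point work per watt, which is the trade-off the abstract describes.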

Document Type

Master thesis

Language

English

Publisher

Universitat Politècnica de Catalunya

Rights

Open Access
