SORS: Improving HPC performance and throughput through monitoring, analysis, and feedback

Author

Brandt, Jim

Publication date

2025-06-16



Abstract

The Lightweight Distributed Metric Service (LDMS) is a scalable, low-overhead High Performance Computing (HPC) monitoring framework that transports system resource utilization data as well as application and workflow progress and performance information. LDMS also includes plugins for pre-storage analysis and for a variety of storage methods, including publication to a Kafka distributed event bus. Because it supports bi-directional data flow, LDMS can also serve as a low-latency substrate for communicating conditions of interest from an analysis system back to system and/or application software, enabling runtime modification of behavior. This seminar will present the salient features of the LDMS ecosystem, how it is currently being deployed at other supercomputing sites, and current production and research activities in analysis, visualization, and active feedback. The seminar will also introduce the WorkVisualizer framework, an open-source profiling tool developed by NexGen Analytics (NGA) that offers high-level, interactive, visual HPC performance analysis. The LDMS project seeks to integrate WorkVisualizer into its ecosystem to help convert vast amounts of monitoring data into actionable intelligence.

Document type

Conference report

Language

English

Published by

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

This item appears in the following collection(s)

Conferences [11156]