SORS: Improving HPC performance and throughput through monitoring, analysis, and feedback

Author

Brandt, Jim

Publication date

2025-06-16



Abstract

The Lightweight Distributed Metric Service (LDMS) is a scalable, low-overhead High Performance Computing (HPC) monitoring framework for transporting system resource utilization data as well as application and workflow progress and performance information. LDMS also includes plugins for pre-storage analysis and for a variety of storage methods, including publication to a Kafka distributed event bus. Because it supports bi-directional data flow, LDMS can also serve as a low-latency substrate for communicating conditions of interest from an analysis system back to system and/or application software, enabling run-time modification of behavior. This seminar will present the salient features of the LDMS ecosystem, how it is currently being deployed at other supercomputing sites, and current production and research activities in analysis, visualization, and active feedback. The seminar will also introduce the WorkVisualizer framework, an open-source profiling tool developed by NexGen Analytics (NGA) that offers high-level, interactive, visual HPC performance analysis. The LDMS team seeks to integrate WorkVisualizer into its ecosystem to help convert vast amounts of monitoring data into actionable intelligence.

Document Type

Conference report

Language

English

Publisher

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access
