dc.contributor.author
Brandt, Jim
dc.date.accessioned
2026-02-11T01:37:20Z
dc.date.available
2026-02-11T01:37:20Z
dc.date.issued
2025-06-16
dc.identifier
Brandt, J. SORS: Improving HPC performance and throughput through monitoring, analysis, and feedback. In: Severo Ochoa Research Seminars at BSC. «10th Severo Ochoa Research Seminar Lectures at BSC, Barcelona, 2024-25». Barcelona: Barcelona Supercomputing Center, 2025, p. 148-149.
dc.identifier
https://hdl.handle.net/2117/454301
dc.identifier.uri
http://hdl.handle.net/2117/454301
dc.description.abstract
The Lightweight Distributed Metric Service (LDMS) is a scalable low-overhead
High Performance Computer (HPC) monitoring framework
for transport of system resource utilization data as well as
application/workflow progress and performance information. LDMS
also includes plugins for a variety of storage methods, including
publication to a Kafka distributed event bus, as well as pre-storage
analysis. Additionally, since it supports bi-directional data flow, LDMS
can be utilized as a low-latency substrate for communicating conditions
of interest from an analysis system back to system and/or application
software to enable run-time modification of behavior. This seminar will
present the salient features of the LDMS ecosystem, how it is currently
being deployed at other supercomputing sites, and current production
and research activities in analysis, visualization, and active feedback.
Furthermore, this seminar will introduce the WorkVisualizer
framework, an open-source profiling tool developed by NexGen
Analytics (NGA) that offers high-level, interactive, and visual HPC
performance analysis. LDMS seeks to integrate the WorkVisualizer
into its ecosystem in order to assist with converting vast amounts of
monitoring data into actionable intelligence.
dc.format
application/pdf
dc.publisher
Barcelona Supercomputing Center
dc.rights
http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject
UPC subject areas::Computer science::Computer architecture
dc.subject
High performance computing
dc.subject
High-performance computing (Computer science)
dc.title
SORS: Improving HPC performance and throughput through monitoring, analysis, and feedback
dc.type
Conference report