dc.contributor.author
Cooperman, Gene
dc.date.accessioned
2026-02-20T01:58:27Z
dc.date.available
2026-02-20T01:58:27Z
dc.date.issued
2022-07-01
dc.identifier
Cooperman, G. A future-proof design for long-running MPI-based computations in HPC. In: Severo Ochoa Research Seminars at BSC. «Research Seminar Lectures at BSC, Barcelona, 2021-22». Barcelona: Barcelona Supercomputing Center, 2022, p. 75-76.
dc.identifier
https://hdl.handle.net/2117/455523
dc.identifier.uri
https://hdl.handle.net/2117/455523
dc.description.abstract
Support for long-running computations on supercomputers has long been a pain point. To maintain scheduling flexibility, sysadmins set a maximum resource allocation (e.g., 48 hours) for HPC jobs. Sysadmins also often offer short-duration queues at a discount (e.g., 2 hours at a 75% discount) in order to make use of idle cycles. Transparent checkpointing offers the dream of robust, fault-tolerant, long-running jobs at scale that can run in either of the two types of queues. MANA-2.0 (MPI-Agnostic, Network-Agnostic checkpointing) is an effort to achieve this dream. Like the original MANA academic prototype, MANA-2.0 operates over any MPI implementation and network interconnect that supports the MPI API standard. MANA-2.0 is also future-proof, in the sense that it runs independently of the underlying MPI and network libraries. Details of the new algorithms required for its robustness will be presented. MANA-2.0 is being tested on: (i) NERSC's Cori supercomputer (proprietary Cray MPI and Cray GNI network); (ii) NERSC's Perlmutter (#5 supercomputer; proprietary Cray MPI and HPE Cray Slingshot network); and (iii) CentOS Linux for other HPC sites. Like all large projects, this has been a years-long collaboration that is only now coming to fruition. The many participants will be credited in the talk.
dc.format
application/pdf
dc.publisher
Barcelona Supercomputing Center
dc.rights
http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
Attribution-NonCommercial-NoDerivatives 4.0 International
dc.subject
UPC subject areas::Computer science::Computer architecture
dc.subject
High performance computing
dc.subject
High performance computing (Computer science)
dc.title
A future-proof design for long-running MPI-based computations in HPC
dc.type
Conference report