A future-proof design for long-running MPI-based computations in HPC

Publication date

2022-07-01



Abstract

Support for long-running computations on supercomputers has long been a pain point. To maintain scheduling flexibility, sysadmins set a maximum resource allocation (e.g., 48 hours) for HPC jobs. Sysadmins also often offer short-duration queues at a discount (e.g., 2 hours at a 75% discount) to make use of idle cycles. Transparent checkpointing offers the dream of robust, fault-tolerant, long-running jobs at scale that can run in either of the two types of queues. MANA-2.0 (MPI-Agnostic, Network-Agnostic checkpointing) is an effort to achieve this dream. Like the original MANA academic prototype, MANA-2.0 operates over any MPI implementation and network interconnect that supports the MPI API standard. MANA-2.0 is also future-proof, in the sense that it runs independently of the underlying MPI and network libraries. Details of the new algorithms required for its robustness will be presented. MANA-2.0 is being tested on: (i) NERSC's Cori supercomputer (proprietary Cray MPI and Cray GNI network); (ii) NERSC's Perlmutter (#5 supercomputer; proprietary Cray MPI and HPE Cray Slingshot network); and (iii) CentOS Linux for other HPC sites. Like all large projects, this has been a years-long collaboration that is only now coming to fruition. The many participants will be credited in the talk.

Document Type

Conference report

Language

English

Publisher

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International
