Author

Cooperman, Gene

Publication date

2022-07-01



Abstract

Support for long-running computations on supercomputers has long been a pain point. To maintain scheduling flexibility, sysadmins set a maximum resource allocation (e.g., 48 hours) for HPC jobs. Sysadmins also often offer short-duration queues at a discount (e.g., 2 hours at a 75% discount) to make use of idle cycles. Transparent checkpointing offers the dream of robust, fault-tolerant, long-running jobs at scale that can run in either type of queue. MANA-2.0 (MPI-Agnostic, Network-Agnostic checkpointing) is an effort to achieve this dream. Like the original MANA academic prototype, MANA-2.0 operates over any MPI implementation and network interconnect that supports the MPI API standard. MANA-2.0 is also future-proof, in the sense that it runs independently of the underlying MPI and network libraries. Details of the new algorithms required for its robustness will be presented. MANA-2.0 is being tested on: (i) NERSC's Cori supercomputer (proprietary Cray MPI and Cray GNI network); (ii) NERSC's Perlmutter (#5 supercomputer; proprietary Cray MPI and HPE Cray Slingshot network); and (iii) CentOS Linux for other HPC sites. Like all large projects, this has been a years-long collaboration that is only now coming to fruition. The many participants will be credited in the talk.

Document type

Conference report

Language

English

Published by

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International

This item appears in the following collection(s)

Congressos [11156]