Título:
|
Designing and modelling selective replication for fault-tolerant HPC applications
|
Autor/a:
|
Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José
|
Otros autores:
|
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
Abstract:
|
Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user. |
Abstract:
|
This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant
agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P. |
Abstract:
|
Peer Reviewed |
Materia(s):
|
-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors -Fault-tolerant computing -Parallel processing (Electronic computers) -Computational modeling -Computer crashes -Mathematical model -Markov processes -Hardware -Reliability theory -Tolerància als errors (Informàtica) -Processament en paral·lel (Ordinadors) |
Derechos:
|
|
Tipo de documento:
|
Artículo - Versión presentada Objeto de conferencia |
Editor:
|
Institute of Electrical and Electronics Engineers (IEEE)
|
Compartir:
|
|