Title:
|
A runtime heuristic to selectively replicate tasks for application-specific reliability targets
|
Author:
|
Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad; Unsal, Osman Sabri; Labarta Mancho, Jesús José
|
Other authors:
|
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
Abstract:
|
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated. |
Abstract:
|
This work was supported by FI-DGR 2013 scholarship and the European Community’s
Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2
Project (www.montblanc-project.eu), grant agreement no. 610402 and in part by the
European Union (FEDER funds) under contract TIN2015-65316-P. |
Abstract:
|
Peer Reviewed |
Subject(s):
|
-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors -Parallel processing (Electronic computers) -Dataflow programming -Selective replication -HPC and exascale computing -Task parallelism -Processament en paral·lel (Ordinadors) |
Rights:
|
|
Document type:
|
Article - Submitted version Conference Object |
Published by:
|
Institute of Electrical and Electronics Engineers (IEEE)
|
Share:
|
|