Title:
|
CRC-based memory reliability for task-parallel HPC applications
|
Author:
|
Subasi, Omer; Unsal, Osman Sabri; Labarta Mancho, Jesús José; Yalcin, Gulay; Cristal Kestelman, Adrián
|
Other authors:
|
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions |
Abstract:
|
Memory reliability will be one of the major concerns for future HPC and Exascale systems. This concern is mostly attributed to the expected massive increase in memory capacity and the number of memory devices in Exascale systems. For memory systems Error Correcting Codes (ECC) are the mostcommonly used mechanism.
However state-of-the art hardware ECCs will not be sufficient in terms of error coverage for future computing systems and stronger hardware ECCs providing more coverage have prohibitive costs in terms of area, power and latency. Software-based solutions are needed to cooperate with hardware. In this work, we propose a Cyclic Redundancy
Checks (CRCs) based software mechanism for task-parallel HPC applications. Our mechanism incurs only 1.7% performance overheadwith hardware acceleration while being highly scalable at large scale. Our mathematical analysis demonstrates the effectiveness of our scheme and its error coverage. Results show that our CRC-based mechanism reduces the memory vulnerability by 87% on average with up to 32-bit burst (consecutive) and 5-bit arbitrary error
correction capability. |
Abstract:
|
This work was supported by FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and
TIN2015-65316-P. |
Abstract:
|
Peer Reviewed |
Subject(s):
|
-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures paral·leles -Parallel processing (Electronic computers) -Application programs -Data flow analysis -Error correction -Errors -Hardware -Reconfigurable hardware -Reliability -Cyclic redundancy check -Dataflow model -Error correction capability -Hardware acceleration -Mathematical analysis -Memory reliability -Software-based solutions -Task parallelism -Processament en paral·lel (Ordinadors) |
Rights:
|
|
Document type:
|
Article - Published version Conference Object |
Published by:
|
Institute of Electrical and Electronics Engineers (IEEE)
|
Share:
|
|