Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers

Gizopoulos, Dimitris

To access the full-text documents, please follow this link: http://hdl.handle.net/2117/450276

Author

Gizopoulos, Dimitris

Publication date

2023-02-01

Abstract

Hyperscalers (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions – SDCs) more frequently than ever imagined, and these corruptions propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications for large-scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature, and on how modelling efforts at different layers of abstraction can shed light on the accurate measurement of SDC rates. Fast and effective quantification of these rates, along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and by system and software integrators.

Document type

Conference report

Language

English

Subjects and keywords

UPC subject areas::Computer science::Computer architecture; High performance computing; Intensive computing (Computer science)

Published by

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International

This item appears in the following collection(s)

Congressos [11159]
