Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers

Publication date

2023-02-01



Abstract

Hyperscalers (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions, SDCs) more frequently than previously imagined, and these errors propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications for large-scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature, and on how modelling efforts at different layers of abstraction can shed light on the accurate measurement of SDC rates. Fast and effective quantification of these rates, along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and by system and software integrators.

Document type

Conference report

Language

English

Published by

Barcelona Supercomputing Center

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International

This item appears in the following collection(s)

Congressos [11159]