Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers

Publication date

2023-02-01



Abstract

Hyperscale operators of computing systems (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions, or SDCs) more frequently than ever imagined, and these propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications in large-scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature, and on how modelling efforts at different layers of abstraction can shed light on the accurate measurement of SDC rates. Fast and effective quantification of these rates, along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and by system and software integrators.

Document type

Conference report

Language

English

Published by

Barcelona Supercomputing Center

Recommended citation

This citation was generated automatically.

Rights

http://creativecommons.org/licenses/by-nc-nd/4.0/

Open Access

Attribution-NonCommercial-NoDerivatives 4.0 International

This item appears in the following collection(s)

Conferences [11156]