Hyperscale’s of computing systems (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions – SDCs) more frequently than ever imagined and propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications in large scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature and how modelling efforts at different layers of abstraction can shed light to accurate measurement of SDCs rates. Fast and effective quantification of the rates along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and system and software integrators.
Conference report
Inglés
Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors; High performance computing; Càlcul intensiu (Informàtica)
Barcelona Supercomputing Center
http://creativecommons.org/licenses/by-nc-nd/4.0/
Open Access
Attribution-NonCommercial-NoDerivatives 4.0 International
Congressos [11159]