Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers

dc.contributor.author
Gizopoulos, Dimitris
dc.date.accessioned
2026-01-14T02:01:01Z
dc.date.available
2026-01-14T02:01:01Z
dc.date.issued
2023-02-01
dc.identifier
Gizopoulos, D. Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers. In: Severo Ochoa Research Seminars at BSC. «8th Severo Ochoa Research Seminar Lectures at BSC, Barcelona, 2022-23». Barcelona: Barcelona Supercomputing Center, 2023, p. 49-50.
dc.identifier
https://hdl.handle.net/2117/450276
dc.identifier.uri
http://hdl.handle.net/2117/450276
dc.description.abstract
Hyperscalers (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions, or SDCs) more frequently than ever imagined, and these propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications for large-scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature, and on how modelling efforts at different layers of abstraction can shed light on the accurate measurement of SDC rates. Fast and effective quantification of these rates, along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and by system and software integrators.
dc.format
2 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
Barcelona Supercomputing Center
dc.rights
http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights
Open Access
dc.rights
Attribution-NonCommercial-NoDerivatives 4.0 International
dc.subject
UPC subject areas::Computer science::Computer architecture
dc.subject
High performance computing
dc.subject
High performance computing (Computer science)
dc.title
Silent data corruptions in computing systems – modelling, measuring, mitigation across the layers
dc.type
Conference report


