Abstract:
|
Safety-relevant systems in the automotive domain often implement features such as lockstep execution for error detection, and reset and re-execution for error correction. Light-lockstep has already been adopted in some such systems because of its relatively low implementation cost: it does not require deep changes to nonlockstep hardware. Because only off-core activities (i.e., data and addresses sent) need to be compared across different cores, light-lockstep designs are minimally intrusive. This approach has been proven sufficient to guarantee the functional correctness of the system in the presence of errors in the cores, in particular with regard to certification against safety standards such as ISO 26262 in the automotive domain. However, error detection in light-lockstep systems may occur long after the error actually occurs, thus jeopardizing timing guarantees, which are as critical as functional ones in hard real-time systems. In this paper, we analyze the timing behavior of errors due to transient and permanent faults in light-lockstep systems. Our results show that the time elapsed until an error is detected can be inordinately large, especially for permanent faults. Based on this observation and building upon the specific characteristics of light-lockstep systems, we propose lightly verbose (LiVe), a new mechanism that enforces the early detection of errors due to both transient and permanent faults, thus enabling the computation of tight error detection timing bounds. We also analyze how existing mechanisms for error recovery in multicore systems become more effective when light-lockstep operates in LiVe mode in the context of mixed-criticality workloads. |