Abstract:
|
Compute and memory access units are two of the
most important resources to appropriately manage in current
and future multi–/many–core architectures. Memory bandwidth
and computational capacity need to be exploited in a combined
way to achieve the best system performance. Coarse–grain multi–
threading, also known as temporal multi–threading (TMT), is a
well–known technique that improves overall resource utilization
by time–multiplexing the execution of a small number of
hardware threads, switching between them on a high–latency
event such as a memory miss. Hence, the processor does not stall
on memory misses and the number of in–flight memory operations
increases, improving overall processor resource utilization.
In this paper, we propose a software–based implementation
of TMT that supports an unbounded number of threads
and enables a flexible combination of multiple computational
kernels. Our TMT implementation is based on micro–threads
that combine fast cooperative and preemptive context switches
to overcome some intrinsic limitations of current TMT hardware
implementations, such as the small, fixed number of
hardware threads available. Our proposal is demonstrated with
an implementation on the Cell/B.E., which is evaluated using heterogeneous
mixes of memory–/CPU–bound kernels. Experimental
results show that the proposed technique reduces the execution
time of several benchmarks by up to 78%. |