Emerging Architectures
Until recently, microprocessor manufacturers sought to continuously increase the clock frequency of their processors through superscalar techniques, very deep pipelines, and advanced microelectronic technologies. However, deep pipelines can become serious performance bottlenecks and, in addition, power consumption issues prevented any further increase in clock speed. As a result, a shift to multicore architectures has taken place in recent years, where each core has more "modest" specifications but a cleaner and more efficient microarchitecture. Nevertheless, even multicore architectures, in the form we know them today, are becoming questionable, especially in terms of scalability, and the transition to the manycore era is not so obvious.
Quite recently, architectures have emerged that can deliver performance unmatched by a classic superscalar multicore architecture, while at the same time being positioned at the low end of the high-performance computing market. Among the most promising of these architectures are the STI Cell Broadband Engine and Graphics Processing Units (GPUs), which are now capable of performing general-purpose computations. Although these architectures differ significantly, they share a basic concept: instead of devoting the majority of the die area to massive data caches, as conventional multicore architectures do, they dedicate it to specialized SIMD (Single Instruction Multiple Data) processing elements, which act as coprocessors to a main or host processor.
Our group is currently conducting research on both of these emerging architectures. Having evaluated the performance and the architecture of the STI Cell through a set of microbenchmarks (diploma thesis), we are now moving to port SpMV to the Cell. Since the Cell's architecture is quite different from that of a commodity multicore processor, it poses a set of challenges concerning the data partitioning of the input matrix (static vs. dynamic schemes, load balancing), as well as the algorithm itself.
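To make the partitioning issue concrete, the following is a minimal C sketch, not our actual Cell implementation, of a CSR-based SpMV whose rows are statically split into contiguous blocks, one per processing element. All type and function names are hypothetical. For matrices with an irregular non-zero distribution, splitting by rows rather than by non-zeros is exactly what causes the load imbalance mentioned above.

```c
/* Hypothetical CSR matrix: row_ptr has nrows+1 entries, col_idx/values hold the non-zeros. */
typedef struct {
    int nrows;
    int *row_ptr;
    int *col_idx;
    double *values;
} csr_t;

/* Reference SpMV over a contiguous block of rows [row_start, row_end). */
static void spmv_rows(const csr_t *A, const double *x, double *y,
                      int row_start, int row_end)
{
    for (int i = row_start; i < row_end; i++) {
        double sum = 0.0;
        for (int j = A->row_ptr[i]; j < A->row_ptr[i + 1]; j++)
            sum += A->values[j] * x[A->col_idx[j]];
        y[i] = sum;
    }
}

/* Naive static partitioning: split rows evenly across `nworkers` processing
 * elements (e.g. SPEs). For matrices with skewed non-zero distributions this
 * leads to load imbalance, motivating non-zero-based or dynamic schemes. */
static void spmv_static(const csr_t *A, const double *x, double *y, int nworkers)
{
    int rows_per_worker = (A->nrows + nworkers - 1) / nworkers;
    for (int w = 0; w < nworkers; w++) {   /* each iteration would run on one SPE */
        int start = w * rows_per_worker;
        int end   = start + rows_per_worker;
        if (end > A->nrows) end = A->nrows;
        spmv_rows(A, x, y, start, end);
    }
}
```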
Similarly, we are experimenting with and evaluating the performance of a modern GPU architecture, namely NVIDIA's G80. We have developed some preliminary microbenchmarks to evaluate the instruction throughput of the architecture, the memory bandwidth, and the host-to-device transfer rates. In addition, we have ported to the G80 some elementary algebraic operations, such as reduction and dense matrix-vector (DMV) multiplication, and investigated several optimization issues using the CUDA programming model (diploma thesis). We are currently studying the porting of SpMV to the G80 and whether alternative storage formats, such as BCSR, which reportedly perform well on commodity architectures, can offer similar performance benefits on massively threaded GPUs. As future research, we will investigate efficient ways of overlapping computation on the GPU with host/device PCI-E transfers, since preliminary results have shown that these transfers can greatly limit the overall perceived performance of a computational kernel.
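As an illustration of the elementary operations mentioned above, the following is a minimal, untuned CUDA sketch of a dense matrix-vector kernel with one thread per matrix row. The kernel name and the launch configuration in the comment are hypothetical; a tuned version would stage the vector in shared memory and coalesce accesses to the matrix.

```cuda
// Minimal dense matrix-vector (DMV) kernel: one thread per row of the
// row-major matrix A (m x n). Illustrative only.
__global__ void dmv_naive(int m, int n, const float *A, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        float sum = 0.0f;
        for (int col = 0; col < n; col++)
            sum += A[row * n + col] * x[col];
        y[row] = sum;
    }
}

// Example launch: 256 threads per block, enough blocks to cover all rows.
// dmv_naive<<<(m + 255) / 256, 256>>>(m, n, d_A, d_x, d_y);
```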
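For SpMV itself, the natural CUDA baseline against which blocked formats such as BCSR would be compared is a scalar CSR kernel where each thread computes one row of the result. The sketch below is illustrative and not our actual implementation; names and launch configuration are assumptions.

```cuda
// Scalar CSR SpMV kernel: one thread computes one row of y = A*x.
__global__ void spmv_csr_scalar(int nrows,
                                const int *row_ptr, const int *col_idx,
                                const float *values, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < nrows) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
            sum += values[j] * x[col_idx[j]];
        y[row] = sum;
    }
}

// Example launch:
// spmv_csr_scalar<<<(nrows + 255) / 256, 256>>>(nrows, d_row_ptr, d_col_idx,
//                                               d_values, d_x, d_y);
```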
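Regarding the overlap of computation and PCI-E transfers, the sketch below shows the general technique we intend to investigate: splitting the data into chunks and pipelining them through CUDA streams with asynchronous copies. It assumes page-locked host buffers and a device that supports copy/compute overlap; the placeholder kernel, buffer names, and chunking are hypothetical.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for real per-chunk work.
__global__ void process_chunk(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// Pipeline chunks of the input through two streams so that the transfer of
// one chunk can overlap with the computation on another. h_in/h_out must be
// allocated with cudaMallocHost (pinned memory) for the async copies to overlap.
void pipelined(const float *h_in, float *h_out, int n, int chunk)
{
    cudaStream_t streams[2];
    for (int s = 0; s < 2; s++)
        cudaStreamCreate(&streams[s]);

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));

    for (int off = 0, s = 0; off < n; off += chunk, s ^= 1) {
        int len = (off + chunk <= n) ? chunk : n - off;
        // Copy chunk in, launch its kernel, copy its result out, all in the
        // same stream; work in different streams may overlap.
        cudaMemcpyAsync(d_in + off, h_in + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_in + off,
                                                                 d_out + off, len);
        cudaMemcpyAsync(h_out + off, d_out + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    for (int s = 0; s < 2; s++)
        cudaStreamDestroy(streams[s]);
}
```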