|
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture | |
> > | A major focus of our group's research in computer architecture is the exploration and evaluation of modern and emerging architecture designs. Recent developments in microprocessor technology indicate a paradigm shift that is likely to alter present programming methodologies (see DDJ article).
We aim to explore these new architectures, focusing especially on multithreaded designs. Some examples of our involvement include:
- Typical (Intel Core Duo, Opteron) and aggressive (Niagara) multicore designs
- Simultaneous multithreading (SMT)
- Cell Broadband Engine (Cell)
- General-purpose computing on graphics processing units (GPGPU)
Relevant Project Activities
<--
Links
--> | |
META FILEATTACHMENT |
attr="h" autoattached="1" comment="" date="1204649531" name="04227947.pdf" path="04227947.pdf" size="792734" user="Main.ArisSotiropoulos" version="1" |
META FILEATTACHMENT |
attr="h" autoattached="1" comment="" date="1204649356" name="01386058.pdf" path="01386058.pdf" size="312577" user="Main.ArisSotiropoulos" version="1" |
|
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture | |
< < | Software Optimization
Previous research work has identified memory bandwidth as the main bottleneck of the ubiquitous Sparse Matrix-Vector Multiplication kernel. To attack this problem, we aim at reducing the overall data volume of the algorithm. Typical sparse matrix representation schemes store only the non-zero elements of the matrix and employ additional indexing information to properly iterate over these elements. In this paper we propose two distinct compression methods targeting index and numerical values respectively. We perform a set of experiments on a large real-world matrix set and demonstrate that the index compression method can be applied successfully to a wide range of matrices. Moreover, the value compression method is able to achieve impressive speedups in a more limited, yet important, class of sparse matrices that contain a small number of distinct values.
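The idea of trading extra indexing work for less memory traffic can be illustrated with a minimal sketch. It assumes the common CSR layout, a delta encoding for column indices, and a lookup table of distinct values addressed by one-byte indices. These are illustrative choices, not necessarily the compression schemes proposed in the paper, and all function names are hypothetical.

```python
from array import array


def csr_from_dense(dense):
    """Build CSR arrays (values, col_ind, row_ptr) from a dense list of rows."""
    values, col_ind, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(float(v))
                col_ind.append(j)
        row_ptr.append(len(values))
    return values, col_ind, row_ptr


def spmv_csr(values, col_ind, row_ptr, x):
    """Reference y = A*x over plain CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_ind[k]]
    return y


def compress_values(values):
    """Assumed scheme: table of distinct values plus a 1-byte index per non-zero."""
    table = sorted(set(values))
    if len(table) > 256:
        raise ValueError("too many distinct values for one-byte indices")
    position = {v: i for i, v in enumerate(table)}
    return array("d", table), array("B", (position[v] for v in values))


def compress_indices(col_ind, row_ptr):
    """Assumed scheme: store each column index as a delta from the previous one in its row."""
    deltas = []
    for i in range(len(row_ptr) - 1):
        prev = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            deltas.append(col_ind[k] - prev)
            prev = col_ind[k]
    return deltas


def spmv_compressed(table, val_idx, deltas, row_ptr, x):
    """y = A*x, decoding column deltas and value indices on the fly."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        col = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            col += deltas[k]
            y[i] += table[val_idx[k]] * x[col]
    return y


dense = [[2, 0, 0, 1],
         [0, 1, 0, 0],
         [2, 0, 1, 2]]
x = [1.0, 2.0, 3.0, 4.0]
values, col_ind, row_ptr = csr_from_dense(dense)
table, val_idx = compress_values(values)
deltas = compress_indices(col_ind, row_ptr)
y_plain = spmv_csr(values, col_ind, row_ptr, x)
y_packed = spmv_compressed(table, val_idx, deltas, row_ptr, x)
```

In this toy matrix the six non-zeros take only two distinct values, so each 8-byte value is replaced by a 1-byte table index. This is the kind of matrix the value compression method targets.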
Operating Systems
Efficient sharing of block devices over an interconnection network is an important step in deploying a shared-disk parallel file system on a cluster of SMPs. We present gmblock, a client/server system for network sharing of storage devices over Myrinet, which uses an optimized data path to transfer data directly from the storage medium to the NIC, bypassing the host CPU and main memory bus. Its design enhances existing programming abstractions, combining the user-level networking characteristics of Myrinet with Linux's virtual memory infrastructure, in order to construct the data path in a way that is independent of the type of block device used. Experimental evaluation of a prototype system shows that remote I/O bandwidth can improve by up to 36%, compared to an RDMA-based implementation. Moreover, interference on the main memory bus of the host is minimized, leading to an improvement of up to 41% in the execution time of memory-intensive applications.
Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. To overcome the architectural limitation of a low number of outstanding requests in gmblock, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively. | |
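Why overlapping the disk read and the network send of a single request helps can be seen with a back-of-the-envelope model. The sketch below assumes a request split into equal chunks with constant per-chunk read and send times. The chunking, the function names, and the numbers are hypothetical, and the model says nothing about how synchronized sends are implemented in the NIC firmware.

```python
def serialized_time(n_chunks, read_t, send_t):
    """Each chunk is fully read from disk before it is sent; no overlap."""
    return n_chunks * (read_t + send_t)


def overlapped_time(n_chunks, read_t, send_t):
    """Two-stage pipeline: chunk i is sent while chunk i+1 is being read."""
    if n_chunks == 0:
        return 0.0
    # Fill the pipeline with one read, run at the pace of the slower
    # stage for the remaining chunks, then drain with one final send.
    return read_t + (n_chunks - 1) * max(read_t, send_t) + send_t


# Hypothetical request: 4 chunks, 1.0 time units to read, 2.0 to send each.
t_serial = serialized_time(4, 1.0, 2.0)
t_overlap = overlapped_time(4, 1.0, 2.0)
```

Under these assumptions the overlapped transfer is bounded by the slower of the two stages rather than by their sum, which is why overlap can raise throughput even when only one request is outstanding.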
|
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture |
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture | |
Operating Systems | |
< < | Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. We study the performance of gmblock, an nbd server over Myrinet utilizing a direct disk-to-NIC data path which bypasses the CPU and main memory bus. To overcome the architectural limitation of a low number of outstanding requests, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively. | > > | Efficient sharing of block devices over an interconnection network is an important step in deploying a shared-disk parallel file system on a cluster of SMPs. We present gmblock, a client/server system for network sharing of storage devices over Myrinet, which uses an optimized data path to transfer data directly from the storage medium to the NIC, bypassing the host CPU and main memory bus. Its design enhances existing programming abstractions, combining the user-level networking characteristics of Myrinet with Linux's virtual memory infrastructure, in order to construct the data path in a way that is independent of the type of block device used. Experimental evaluation of a prototype system shows that remote I/O bandwidth can improve by up to 36%, compared to an RDMA-based implementation. Moreover, interference on the main memory bus of the host is minimized, leading to an improvement of up to 41% in the execution time of memory-intensive applications. | | | |
> > | Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. To overcome the architectural limitation of a low number of outstanding requests in gmblock, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively. | |
|
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture
Software Optimization | |
< < | Previous research work has identified memory bandwidth as the main bottleneck of the ubiquitous Sparse Matrix-Vector Multiplication kernel. To attack this problem, we aim at reducing the overall data volume of the algorithm. Typical sparse matrix representation schemes store only the non-zero elements of the matrix and employ additional indexing information to properly iterate over these elements. In this paper we propose two distinct compression methods targeting index and numerical values respectively. We perform a set of experiments on a large real-world matrix set and demonstrate that the index compression method can be applied successfully to a wide range of matrices. Moreover, the value compression method is able to achieve impressive speedups in a more limited yet important class of sparse matrix that contain a small number of distinct values. | > > | Previous research work has identified memory bandwidth as the main bottleneck of the ubiquitous Sparse Matrix-Vector Multiplication kernel. To attack this problem, we aim at reducing the overall data volume of the algorithm. Typical sparse matrix representation schemes store only the non-zero elements of the matrix and employ additional indexing information to properly iterate over these elements. In this paper we propose two distinct compression methods targeting index and numerical values respectively. We perform a set of experiments on a large real-world matrix set and demonstrate that the index compression method can be applied successfully to a wide range of matrices. Moreover, the value compression method is able to achieve impressive speedups in a more limited, yet important, class of sparse matrices that contain a small number of distinct values. | |
Operating Systems |
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture | | Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. We study the performance of gmblock, an nbd server over Myrinet utilizing a direct disk-to-NIC data path which bypasses the CPU and main memory bus. To overcome the architectural limitation of a low number of outstanding requests, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively. | |
< < | Interconnects
Publications
- M. Athanasaki, E. Koukis, N. Koziris, "Efficient Scheduling of Tiled Iteration Spaces onto a Fixed Size Parallel Architecture," 9th Panhellenic Conference in Informatics, pp. 178-192, Thessaloniki, Greece, November 21-23, 2003
- M. Athanasaki, E. Koukis, N. Koziris, "Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes," 12th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '04), pp. 424-433, A Coruna, Spain, February 11-13, 2004
- E. Koukis and N. Koziris, "Memory Bandwidth Aware Scheduling for SMP Cluster Nodes," Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '05), pp. 187-196, Lugano, Switzerland, 6-11 Feb. 2005
- E. Koukis and N. Koziris, "Memory and Network Bandwidth Aware Scheduling of Multiprogrammed Workloads on Clusters of SMPs," Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS 2006), pp. 345-354, Minneapolis, MN, USA, 12-15 July, 2006
- E. Koukis and N. Koziris, "Efficient Block Device Sharing over Myrinet with Memory Bypass," Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), p. 29, Long Beach, CA, USA, 26-30 March, 2007
- E. Koukis, A. Nanos and N. Koziris, "Synchronized Send Operations for Efficient Streaming Block I/O over Myrinet," Proceedings of the Workshop on Communication Architecture for Clusters (CAC 2008), held in conjunction with the 22nd International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, USA, 14-18 April, 2008, to appear
| |
|
|
META TOPICPARENT |
name="WebHome" |
Computer Architecture | |
Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. We study the performance of gmblock, an nbd server over Myrinet utilizing a direct disk-to-NIC data path which bypasses the CPU and main memory bus. To overcome the architectural limitation of a low number of outstanding requests, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively. | |
< < | Publications | | | |
< < | | | \ No newline at end of file | |
> > | Interconnects
Publications
- M. Athanasaki, E. Koukis, N. Koziris, "Efficient Scheduling of Tiled Iteration Spaces onto a Fixed Size Parallel Architecture," 9th Panhellenic Conference in Informatics, pp. 178-192, Thessaloniki, Greece, November 21-23, 2003
- M. Athanasaki, E. Koukis, N. Koziris, "Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes," 12th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '04), pp. 424-433, A Coruna, Spain, February 11-13, 2004
- E. Koukis and N. Koziris, "Memory Bandwidth Aware Scheduling for SMP Cluster Nodes," Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '05), pp. 187-196, Lugano, Switzerland, 6-11 Feb. 2005
- E. Koukis and N. Koziris, "Memory and Network Bandwidth Aware Scheduling of Multiprogrammed Workloads on Clusters of SMPs," Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS 2006), pp. 345-354, Minneapolis, MN, USA, 12-15 July, 2006
- E. Koukis and N. Koziris, "Efficient Block Device Sharing over Myrinet with Memory Bypass," Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), p. 29, Long Beach, CA, USA, 26-30 March, 2007
- E. Koukis, A. Nanos and N. Koziris, "Synchronized Send Operations for Efficient Streaming Block I/O over Myrinet," Proceedings of the Workshop on Communication Architecture for Clusters (CAC 2008), held in conjunction with the 22nd International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, USA, 14-18 April, 2008, to appear
META FILEATTACHMENT |
attr="h" autoattached="1" comment="" date="1204649449" name="01655680.pdf" path="01655680.pdf" size="237494" user="Main.ArisSotiropoulos" version="1" |
META FILEATTACHMENT |
attr="h" autoattached="1" comment="" date="1204649651" name="epy2003.pdf" path="epy2003.pdf" size="459614" user="Main.ArisSotiropoulos" version="1" |
META FILEATTACHMENT |
attr="h" autoattached="1" comment="Scheduling" date="1204648880" name="01271475.pdf" path="01271475.pdf" size="510858" user="Main.ArisSotiropoulos" version="1" |
|
|
> > |
META TOPICPARENT |
name="WebHome" |
Computer Architecture
Software Optimization
Previous research work has identified memory bandwidth as the main bottleneck of the ubiquitous Sparse Matrix-Vector Multiplication kernel. To attack this problem, we aim at reducing the overall data volume of the algorithm. Typical sparse matrix representation schemes store only the non-zero elements of the matrix and employ additional indexing information to properly iterate over these elements. In this paper we propose two distinct compression methods targeting index and numerical values respectively. We perform a set of experiments on a large real-world matrix set and demonstrate that the index compression method can be applied successfully to a wide range of matrices. Moreover, the value compression method is able to achieve impressive speedups in a more limited yet important class of sparse matrix that contain a small number of distinct values.
Operating Systems
Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. We study the performance of gmblock, an nbd server over Myrinet utilizing a direct disk-to-NIC data path which bypasses the CPU and main memory bus. To overcome the architectural limitation of a low number of outstanding requests, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively.
Publications
|
|