The Memory Bandwidth aware Userspace Scheduler (MemBUS)
Symmetric Multiprocessors (SMPs) are commonly used as the building blocks of scalable clustered systems, often combined with modern low-latency, high-bandwidth interconnects such as Myrinet. However, their design leads to contention among processors for access to shared resources, which can limit their efficiency significantly.
Such resources include the Front Side Bus (FSB), which in designs based on Intel chipsets interconnects the processors with one another and with the main memory controller (Northbridge); the peripheral bus (commonly PCI/PCI-X); and the node's Network Interface Card (NIC).
In the context of this activity, we first explored [PDP'05] the impact of memory contention on a single cluster node when running compute-intensive applications. The experiments showed that contention on the memory bus can limit the degree of parallelism achieved, leading to severe degradation in attainable performance. Moreover, we showed that the DMA engines on the Myrinet NIC can place significant load on the memory subsystem. This means that communication can interfere with computation even when using a zero-copy, OS-bypass protocol such as Myrinet/GM, which removes the CPU from the critical path of communication completely.
To attack the problem, we try to enhance local scheduling by taking into account run-time information on the memory bandwidth demands of each individual process. Memory behavior is monitored dynamically using the performance monitoring counters provided by modern microprocessors. The proposed scheduling policy aims to increase throughput for multiprogrammed workloads by adjusting the selection of processes to be run simultaneously on an SMP node so as to avoid saturating the memory bus.
The policy has been implemented in userspace, as the Memory Bandwidth aware Userspace Scheduler (MemBUS). Scheduling decisions are enforced with a combination of the perfctr performance-monitoring framework, the ptrace() Linux system call and standard UNIX SIGSTOP/SIGCONT signaling.
Later on, MemBUS was extended [ICPADS 2006] to support cluster-wide gang scheduling: context switches are coordinated so that all peer processes belonging to the same job are scheduled simultaneously across the cluster, while trying to minimize interference due to contention for access to main memory and to the NIC on each node. Experimental evaluation based on the NAS Parallel Benchmarks suite showed a significant increase in throughput compared to uncoordinated local scheduling.
Publications
- E. Koukis and N. Koziris, “Memory and Network Bandwidth Aware Scheduling of Multiprogrammed Workloads on Clusters of SMPs,” Proceedings of the 12th International Conference on Parallel and Distributed Systems (ICPADS 2006), pp. 345-354, Minneapolis, MN, USA, 12-15 July 2006.
- E. Koukis and N. Koziris, “Memory Bandwidth Aware Scheduling for SMP Cluster Nodes,” Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-based Processing (PDP '05), pp. 187-196, Lugano, Switzerland, 6-11 February 2005.