Abstract
Barrier synchronization is a common operation in parallel and distributed systems. A fast implementation matters because it allows fine-grained parallel programs to run more efficiently, so minimizing the latency of barrier operations is important. Modern network interface cards (NICs) have programmable processors that can be used to support collective communication operations such as barrier. In [4], we designed and implemented a NIC-based barrier over GM. This new NIC-based barrier raises several open questions. Does the NIC-based barrier perform better than the host-based barrier? How does its performance change with better NICs? Is it scalable? How does its performance affect the granularity of computation? How does it affect the performance of applications? In this paper, we address these questions. We find that the NIC-based barrier outperforms the host-based barrier, with a factor of improvement of up to 2.22 on an eight-node system at the MPI level. We also find that the factor of improvement increases with the number of nodes, indicating that the NIC-based barrier is more scalable. The NIC-based barrier also allows finer-grained computation without affecting the efficiency of the program. Finally, using synthetic applications on an eight-node system, we find a factor of improvement of up to 1.93 for applications using the NIC-based barrier compared with those using the host-based barrier. These results indicate that the NIC-based barrier can deliver significant performance benefits to applications in current and future clusters.