Abstract
By its nature, MPI leads to coarse-grained communication. All current MPI implementations deliver two orders of magnitude more bandwidth for large messages (kilobytes) than for small messages (bytes), so applications bundle their small communications into larger ones whenever possible. In modern implementations, this sacrifice in the granularity of communication translates directly into a sacrifice in the granularity of synchronization: MPI requires that the entire message arrive before any of its data can be delivered to the application, because message completion is the only synchronization semantic the network can expose to the processor. This paper explores the implications of providing synchronization between the network and the processor at the memory-word level, using a mechanism such as Full/Empty Bits. Word-level synchronization enables the application to begin computing as soon as the data for its first memory reference has arrived, without waiting for all of the data in the message.