Abstract
Data-parallel applications, especially those associated with user-facing web services, struggle to improve their worst-case performance. It is therefore important to improve the minimum amount of resources guaranteed to each application in a cluster. Existing cluster management frameworks, however, provide isolation only for computation resources (such as CPU), and are oblivious to network isolation guarantees. In this paper, we design, implement, and evaluate Libra, a new cluster management framework that helps to maximize the isolation guarantee for the bandwidth requirements of applications. We start with a theoretical analysis of the network sharing problem, which comprises two key steps: container placement and bandwidth allocation. By collecting the status of access links and the bandwidth demand of applications, we coordinate the placement of containers to minimize the system bottleneck, so that the bandwidth guarantee for applications can be optimized. We further employ host-based rate limiting to ensure that this maximized bandwidth guarantee can be reached without hurting network utilization. Both our testbed-based experiments and large-scale simulations demonstrate that Libra significantly improves the network isolation guarantee: in comparison with existing cluster managers and network schedulers, the performance gain is more than 105.59%. Meanwhile, it improves application performance by 57.71% and maintains high network utilization.
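
To make the two steps named above concrete, the following is a minimal illustrative sketch and not Libra's actual algorithm. It assumes each host has a single access link with a known capacity and each container has a single scalar bandwidth demand. The function names (`place_containers`, `bottleneck`, `guaranteed_rates`) and the greedy largest-demand-first heuristic are assumptions introduced here for illustration only.

```python
from typing import Dict, List, Tuple

Demand = Tuple[str, float]  # (container name, bandwidth demand), hypothetical format


def place_containers(capacity: Dict[str, float], demands: List[Demand]) -> Dict[str, str]:
    """Greedy placement: take containers in decreasing order of demand and put
    each on the host whose access-link utilization would be lowest afterwards."""
    load = {host: 0.0 for host in capacity}
    placement: Dict[str, str] = {}
    for name, demand in sorted(demands, key=lambda d: -d[1]):
        best = min(capacity, key=lambda h: (load[h] + demand) / capacity[h])
        load[best] += demand
        placement[name] = best
    return placement


def bottleneck(capacity: Dict[str, float], demands: List[Demand],
               placement: Dict[str, str]) -> float:
    """Utilization (load / capacity) of the most heavily loaded access link."""
    load = {host: 0.0 for host in capacity}
    for name, demand in demands:
        load[placement[name]] += demand
    return max(load[h] / capacity[h] for h in capacity)


def guaranteed_rates(capacity: Dict[str, float], demands: List[Demand],
                     placement: Dict[str, str]) -> Dict[str, float]:
    """Scale every demand by the same factor so that the bottleneck link is not
    oversubscribed; the resulting rates could be enforced by host rate limiters."""
    b = bottleneck(capacity, demands, placement)
    scale = 1.0 if b <= 1.0 else 1.0 / b
    return {name: demand * scale for name, demand in demands}


# Hypothetical inputs: two hosts with 10-unit access links, four containers.
capacity = {"h1": 10.0, "h2": 10.0}
demands = [("a1", 8.0), ("a2", 6.0), ("b1", 4.0), ("b2", 2.0)]
placement = place_containers(capacity, demands)
print(placement)
print(bottleneck(capacity, demands, placement))
print(guaranteed_rates(capacity, demands, placement))
```

In this sketch, a lower bottleneck utilization after placement leaves a larger uniform scaling factor for the allocation step, which is the intuition behind coordinating placement with bandwidth allocation. A real system would also have to account for per-application traffic matrices and for unused bandwidth that can be redistributed.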