Design of PC/104 Processor Module Based on ARM

Xiao Ke; Bai Wenle; An Yanwei

doi:10.1109/iCECE.2010.198

Abstract

Distributed machine learning (DML) has recently experienced widespread application. A major performance bottleneck is the costly communication for gradients synchronization. Recently, researchers have explored the use of programmable switches for in-network synchronous aggregation of gradients to mitigate the communication overhead. Nevertheless, the performance of in-network synchronous aggregation is significantly impacted by the stragglers. Unfortunately, the schedulers in existing DML systems are no longer effective in dealing with stragglers because of the ignorance of the aggregation progress that is offloaded from the parameter servers to the programmable switches. To address this gap, this paper presents VAKY, an adaptive scheduler specifically designed for in-network aggregation. At the heart of VAKY is the variable K-block sync method, where the aggregators stop waiting for updates from more workers once having received updates from the fastest K workers for each block of gradients. We propose an efficient solution that can dynamically choose the optimal values of K during the training process, in order to minimize the expected training completion time. We have integrated VAKY into PyTorch, and our experiments show that compared to the state-of-the-art in-network aggregation systems, VAKY improves the aggregation throughput by up to

40 %

$40 \%$ and reduces the training time by

25 %

$25 \%$ .

Design of PC/104 Processor Module Based on ARM

Authors

Abstract

Similar Articles