Abstract
The amount of resources available in a cloud constantly changes. However, current distributed DNN training frameworks do not allow dynamic scaling of a training cluster, so a cloud-based training cluster cannot flexibly scale in response to dynamically changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster maintains a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors by eavesdropping on gradient exchanges before it actually participates in training. Our evaluation showed that the proposed approach reduces scaling overhead by 13% compared to the conventional checkpoint-restore approach, and revealed opportunities for further improvement.
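The core idea of eavesdropping-based synchronization can be illustrated with a minimal sketch, assuming plain synchronous SGD: a joining node receives a one-time weight snapshot, logs the gradient exchanges it observes while the cluster keeps training, and replays them locally to catch up. The names here (snapshot, grad_log, sgd_step) are illustrative, not the paper's actual API.

```python
# Minimal sketch of eavesdropping-based weight synchronization.
# Assumes synchronous SGD where every worker applies the same update.
import numpy as np

LR = 0.1

def sgd_step(w, grad):
    # The same update rule every worker in the cluster applies.
    return w - LR * grad

rng = np.random.default_rng(0)
w_cluster = rng.standard_normal(4)   # current cluster weights
snapshot = w_cluster.copy()          # snapshot handed to the joining node

grad_log = []                        # gradients the new node eavesdrops on
for _ in range(5):                   # cluster keeps training meanwhile
    g = rng.standard_normal(4)
    grad_log.append(g)
    w_cluster = sgd_step(w_cluster, g)

# The new node replays the observed gradient exchanges on its snapshot,
# so its weights match the cluster's before it joins the training.
w_new = snapshot
for g in grad_log:
    w_new = sgd_step(w_new, g)

assert np.allclose(w_new, w_cluster)  # in sync; the node can now join
```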