2023 IEEE 41st International Conference on Computer Design (ICCD)

Abstract

Distributed deep neural network (DNN) training is important to support artificial intelligence (AI) applications, such as image classification, natural language processing, and autonomous driving. Unfortunately, the distributed property makes DNN training vulnerable to system failures. Checkpointing is commonly used to provide failure tolerance, but it suffers from high runtime overheads. To enable high-performance and low-latency checkpointing, we propose a lightweight checkpointing system for distributed DNN training, called LightCheck. To reduce checkpointing overheads, we leverage fine-grained asynchronous checkpointing by pipelining the checkpointing process in a layer-wise way. To further decrease checkpointing latency, we adopt a software-hardware co-design methodology, integrating new hardware devices into our checkpointing system through a persistent memory (PM) manager. Experimental results on six representative real-world DNN models demonstrate that LightCheck offers more than 10× higher checkpointing frequency with lower runtime overheads than state-of-the-art checkpointing schemes. The open-source code is publicly available at https://github.com/LighT-chenml/LightCheck.git.
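As a rough illustration of the layer-wise, pipelined idea described in the abstract, the following Python sketch snapshots a model one layer at a time on the training thread while a background thread persists earlier layers, so device-to-host copies overlap with storage writes. This is a minimal sketch assuming a PyTorch model; it uses plain file I/O rather than LightCheck's PM manager, and the function name `checkpoint_layerwise_async` is hypothetical, not LightCheck's actual API.

```python
import queue
import threading
import torch

def checkpoint_layerwise_async(model, path_prefix):
    """Pipeline checkpointing layer by layer: the caller snapshots one
    layer at a time while a background thread persists the previous
    ones, so snapshotting overlaps with storage writes."""
    work = queue.Queue()

    def persist():
        while True:
            item = work.get()
            if item is None:  # sentinel: all layers have been snapshotted
                break
            name, tensor = item
            # Each layer is written independently; with persistent memory,
            # this write would go through a PM manager instead of a file.
            torch.save(tensor, f"{path_prefix}.{name}.pt")

    worker = threading.Thread(target=persist, daemon=True)
    worker.start()

    for name, param in model.named_parameters():
        # Copying the layer out of device memory is the only step that
        # must happen synchronously; persistence of earlier layers
        # proceeds concurrently on the worker thread.
        work.put((name, param.detach().cpu().clone()))

    work.put(None)
    return worker
```

A caller would invoke `worker = checkpoint_layerwise_async(model, "ckpt/step100")` once per checkpoint interval and `worker.join()` before the next checkpoint, keeping the synchronous stall limited to per-layer snapshot copies rather than a full-model save.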