2024 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA)

Abstract

Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across various computer vision tasks. Because CNNs are computationally and data intensive, Field-Programmable Gate Arrays (FPGAs) are exceptionally well-suited for accelerating CNN inference. However, large-scale CNNs such as ResNet-84 contain so many parameters that they exceed the capacity of a single FPGA, rendering deployment on a single device impractical. In this paper, we develop an automated end-to-end design flow for mapping large-scale CNNs across multiple FPGAs, based on the HLS4ML dataflow architecture. We propose a graph optimization method that streamlines the CNN structure and reduces resource consumption. We also present a resource allocation algorithm that automatically determines the hardware resources required by each CNN layer. Furthermore, we introduce a partitioning methodology that effectively segments a CNN across multiple FPGAs and equips each subgraph with dedicated interfaces for inter-FPGA communication. To validate the methodology, we construct a multi-FPGA platform interconnected via LVDS. We select two representative networks, ResNet-8 and ResNet-84, as evaluation benchmarks. The experimental results demonstrate that our approach significantly outperforms existing solutions. For ResNet-8, it attains an 18.6-fold speedup over Vitis AI, a 2.2-fold improvement over FINN, and a 3.4-fold improvement over the original single-FPGA HLS4ML implementation. For ResNet-84, our method achieves a 33.6-fold speedup over Vitis AI. Moreover, compared with other non-automated multi-FPGA solutions, our methodology still delivers significant performance improvements.
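The abstract does not give the details of the partitioning algorithm, but the general idea of segmenting a layer sequence across devices can be sketched as follows. This is an illustrative greedy heuristic only, not the paper's method: the layer names, per-layer resource costs, and the per-FPGA budget below are all hypothetical.

```python
# Illustrative sketch only: the paper's actual partitioning and resource
# allocation algorithms are not described in the abstract. This greedy
# heuristic assigns consecutive CNN layers to FPGAs so that each device's
# estimated resource usage stays within a fixed budget.

def partition_layers(layer_costs, budget):
    """Split an ordered list of (name, cost) layers into consecutive
    subgraphs whose summed cost fits within a per-FPGA budget."""
    subgraphs, current, used = [], [], 0
    for name, cost in layer_costs:
        if cost > budget:
            raise ValueError(f"layer {name} exceeds a single FPGA's budget")
        if used + cost > budget:      # current FPGA is full: start a new one
            subgraphs.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        subgraphs.append(current)
    return subgraphs

# Hypothetical per-layer resource estimates (e.g., DSP slices)
layers = [("conv1", 40), ("conv2", 60), ("conv3", 55), ("fc", 30)]
print(partition_layers(layers, budget=100))
# → [['conv1', 'conv2'], ['conv3', 'fc']]
```

In a real flow, each resulting subgraph would additionally be wrapped with communication interfaces (here, LVDS links) so adjacent FPGAs can stream activations between partitions.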
