Abstract
Convolutional Neural Networks (CNNs) have demonstrated remarkable performance across a wide range of computer vision tasks. Because CNN inference is both computation- and data-intensive, Field-Programmable Gate Arrays (FPGAs) are exceptionally well suited to accelerating it. However, large-scale CNNs such as ResNet-84 contain far more parameters than a single FPGA can accommodate, rendering single-device deployment impractical. In this paper, we develop an automated end-to-end design flow that maps large-scale CNNs across multiple FPGAs, built on the HLS4ML dataflow architecture. We propose a graph optimization method that streamlines the CNN structure and reduces resource consumption. We also present a resource allocation algorithm that automatically determines the hardware resources required by each CNN layer. Furthermore, we introduce a partitioning methodology that effectively segments a CNN across multiple FPGAs and equips each subgraph with dedicated interfaces for communication between different FPGAs. To validate the methodology, we construct a multi-FPGA platform interconnected via LVDS. We select two representative networks, ResNet-8 and ResNet-84, as evaluation benchmarks. The experimental results demonstrate that our approach significantly outperforms existing solutions: for ResNet-8, it attains an 18.6-fold speedup over Vitis AI, a 2.2-fold speedup over FINN, and a 3.4-fold speedup over the original single-FPGA HLS4ML implementation. For ResNet-84, our method achieves a 33.6-fold speedup over Vitis AI. Moreover, compared with other, non-automated multi-FPGA solutions, our methodology still delivers substantial performance improvements.