Abstract
To deploy object detection algorithms such as YOLO on FPGAs while meeting the strict low-latency requirements of real-time detection, a range of optimizations is needed, from model quantization to hardware design. First, layer fusion and a bit-width quantization strategy are used to reduce computational complexity. Then, a column-based fine-grained pipeline architecture with padding-skip technology is used to reduce pipeline startup time. Next, a double-symbol multiplication correction circuit is introduced to shorten the CNN computation time. Finally, a design space exploration algorithm is used to solve the resource-allocation problem in the FPGA-based convolutional neural network hardware accelerator and to improve DSP efficiency. To verify the accelerator architecture, we implement the YOLOv2-tiny network on a Xilinx ZCU104. Compared with previous accelerators, latency is reduced by 1.88 to 2.07 times, and DSP efficiency reaches 90.9%.
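To illustrate the bit-width quantization step mentioned above, the following is a minimal sketch of symmetric fixed-point weight quantization, a common approach for CNN accelerators. The function names and the choice of a per-tensor symmetric scheme are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization: map float weights to signed integers.
    The bit width (8 here) is an illustrative choice, not the paper's setting."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax          # one scale for the tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from integers and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantize(w)
w_hat = dequantize(q, s)   # reconstruction error is bounded by scale / 2
```

On hardware, only the integer values and the scale are stored, so multiply-accumulate operations run on narrow integers, which is what lets DSP blocks be used more efficiently.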