Abstract
FPGAs are increasingly being used in the high performance and scientific computing community to implement floating-point based hardware accelerators. In this paper we analyze the floating-point multiplier and adder/subtractor units by considering the number of pipeline stages of the units as a parameter and use throughput/area as the metric. We achieve throughput rates of more than 240Mhz (200Mhz) for single (double) precision operations by deeply pipelining the units. To illustrate the impact of the floating-point units on a kernel, we implement a matrix multiplication kernel based on our floating-point units and show that a state-of-the-art FPGA device is capable of achieving about 15GFLOPS (8GFLOPS) for the single (double) precision .oating-point based matrix multiplication. We also show that FPGAs are capable of achieving upto 6x improvement (for single precision) in terms of the GFLOPS/W (performance per unit power) metric over that of general purpose processors. We then discuss the impact of floating-point units on the design of an energy efficient architecture for the matrix multiply kernel.