Abstract
Using dedicated hardware to do machine learning typically ends up in disaster because of cost, obsolescence, and poor software. The popularization of Graphic Processing Units (GPUs), which are now available on every PC, provides an attractive alternative. We propose a generic 2-layer fully connected neural network GPU implementation which yields over 3X speedup for both training and testing with respect to a 3GHz P4 CPU.