2017 IEEE International Conference on Big Data (Big Data)

Abstract

In order to find the best linear or polynomial regression model for a dataset, traditional methods must read the whole dataset repeatedly, incurring many unnecessary, slow I/O operations. Apache Spark can train regression models significantly more efficiently on distributed clusters thanks to its well-crafted in-memory computing architecture. However, if the dataset itself, or the temporary data produced during computation, is larger than the total physical memory of a Spark system, in-memory data must be spilled to secondary storage (such as hard drives or solid-state disks) and read back later when needed. These frequent I/O operations negatively affect the efficiency of Spark computation. Building on the per-row updateable data modeling concept we proposed previously, this work investigates finding the best Box-Cox transformation model on a Spark system. The major contribution of this work is that the information needed to compute a linear or polynomial regression model can be summarized in an Information Array. The size of this information array does not grow with the dataset; rather, it depends only on the number of features and the number of models under consideration. Because the information array is usually very small, it can be kept in memory at all times. With the proposed information array approach, the best linear or polynomial regression model can be obtained after a single scan of the raw data. The experimental results show that this approach is fast and efficient on Spark. When training 41 models, the proposed Box-Cox Information Array method is about 8 times faster than the existing Spark APIs, and it yields better prediction performance than linear regression models alone.
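The one-scan information-array idea described above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the function names and the tiny dataset are invented for this example, and Spark's distributed aggregation is replaced by a plain loop. For each candidate Box-Cox parameter λ, only a fixed-size vector of sufficient statistics is accumulated, so memory use does not grow with the data.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of y; lam = 0 is the log case (assumes y > 0)."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def scan(data, lambdas):
    """One pass over (x, y) pairs; returns a per-lambda 'information array'.

    Each array holds six running sums: n, Sx, Sxx, St, Stt, Sxt,
    where t is the Box-Cox-transformed target. Its size is fixed,
    independent of how many rows are scanned.
    """
    stats = {lam: [0.0] * 6 for lam in lambdas}
    for x, y in data:                       # single scan of the raw data
        for lam in lambdas:                 # update every candidate model
            t = boxcox(y, lam)
            s = stats[lam]
            s[0] += 1.0
            s[1] += x
            s[2] += x * x
            s[3] += t
            s[4] += t * t
            s[5] += x * t
    return stats

def fit(s):
    """Closed-form least-squares slope/intercept from an information array."""
    n, sx, sxx, st, stt, sxt = s
    slope = (n * sxt - sx * st) / (n * sxx - sx * sx)
    intercept = (st - slope * sx) / n
    return slope, intercept

# Tiny invented dataset, roughly y ≈ 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
stats = scan(data, [0.0, 0.5, 1.0])
slope, intercept = fit(stats[1.0])  # lambda = 1: shifted identity transform
```

In a Spark setting the inner accumulation would run per partition and the small per-partition arrays would be merged (e.g. via `aggregate` or `treeAggregate`), which is what makes the single-scan, memory-resident approach practical at scale.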
