Abstract
High-performance and parallel computing have traditionally been implemented on large, dedicated compute clusters. However, as many organizations adopt service-oriented, cloud-based infrastructures, we can expect parallel computing to move to the cloud as well. The goal of a parallel compute cluster is to divide a large job into several small jobs, execute the small jobs in parallel on many compute nodes, and then combine the results coherently. The biggest hurdle in moving this type of service to a cloud-based infrastructure is that performance is affected by many factors, particularly those introduced by hardware virtualization, such as memory and CPU overhead and limited resources. To understand how virtualization affects parallel computing in a small private cloud, we devised four case studies that examine the performance of Apache Hadoop in varying environments on our private cloud. The case studies comprise a baseline bare-metal (non-virtualized) cluster of seven nodes, a seven-node virtual machine cluster, a twenty-node virtual machine cluster, and an optimized seven-node virtual machine cluster. Results show that, although small data sets yield comparable job completion times, the performance of Apache Hadoop degrades greatly under virtualization as the data size increases, even when we attempt to optimize the configuration of our cloud.