Abstract
Cross-matching astronomical catalogs is a central operation in astronomical data integration and analysis. Because current commodity clusters typically combine heterogeneous processors, including both multi-core CPUs and GPUs, we study how to cross-match large astronomical catalogs efficiently on such clusters. Specifically, we develop a common three-phase algorithm for parallel cross-match and optimize it for three settings: a single GPU, multiple GPUs on a node, and a heterogeneous cluster of multiple nodes. Furthermore, we study the performance impact of data chunk size and of the inter-node communication mechanisms in the cluster. Our results show that, with suitable design choices and optimizations, cross-matching billion-record catalogs completes in under 10 minutes on a seven-node CPU-GPU cluster.
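
For readers unfamiliar with the operation, positional cross-match can be sketched as pairing every object in one catalog with the objects in another that lie within a small angular radius on the sky. The following is a brute-force, single-threaded illustration only, not the three-phase parallel algorithm developed in this work; the function names, the 1-arcsecond radius, and the toy catalogs are illustrative assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for very small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, radius_deg):
    """Return (i, j) index pairs whose angular separation is at most radius_deg.

    Each catalog is a list of (ra, dec) tuples in degrees. This compares every
    pair, so the cost is O(len(catalog_a) * len(catalog_b)).
    """
    matches = []
    for i, (ra_a, dec_a) in enumerate(catalog_a):
        for j, (ra_b, dec_b) in enumerate(catalog_b):
            if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                matches.append((i, j))
    return matches


# Toy catalogs (hypothetical positions) and an assumed 1-arcsecond match radius.
catalog_a = [(10.0, 20.0), (150.0, -30.0)]
catalog_b = [(10.0001, 20.0001), (200.0, 45.0), (150.0, -30.0002)]
radius_deg = 1.0 / 3600.0

pairs = cross_match(catalog_a, catalog_b, radius_deg)
print(pairs)
```

The quadratic cost of this all-pairs comparison is what makes billion-record catalogs impractical without partitioning the data and distributing the work across parallel processors.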