2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Abstract

Low-rank approximation methods, such as hierarchical (H) matrices and block low-rank (BLR) matrices, can approximate the dense matrices arising from integral equations in scientific applications, reducing computational costs and memory usage. To use a massive number of MPI processes efficiently on distributed-memory systems, we must balance the computational load and construct an efficient communication pattern among the MPI processes. Unfortunately, the complicated structure of H-matrices prevents us from meeting these requirements. Simplifying the matrix structure is one possible approach to this problem, and the lattice structure found in BLR-matrices is one of the most convenient structures for this purpose. However, as a trade-off, the memory usage increases from O(N log N) for H-matrices to O(N^1.5) for BLR-matrices. In this study, we propose a new method called "lattice H-matrices". In short, lattice H-matrices are constructed by utilising H-matrices as submatrices in the blocks of the lattice structure observed in BLR-matrices. By assigning the lattice blocks to MPI processes, we can utilise sophisticated existing parallel algorithms for dense matrices. We demonstrate how the lattice block size should be defined and confirm that the memory complexity of lattice H-matrices remains O(N log N) when appropriate block sizes are chosen depending on the number of MPI processes. Accordingly, lattice H-matrices retain the advantages of both H-matrices and BLR-matrices. We examine the efficiency of lattice H-matrices in arithmetic operations, such as H-matrix generation and H-matrix-vector multiplication, for large-scale problems on distributed-memory systems. In numerical experiments of electric field analyses, we confirmed that a relatively good load balance is maintained with lattice H-matrices even when a large number of processes is used. The lattice H-matrix implementation exhibits parallel speed-up up to about 4,000 MPI processes and is significantly faster than the normal H-matrix implementation for large process counts.
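As a rough illustration of the block-to-process assignment described in the abstract, the sketch below maps the lattice blocks of a lattice H-matrix onto a 2D process grid using a block-cyclic layout, the kind of distribution used by existing parallel dense-matrix algorithms. The function names (build_process_grid, owner_of_block), the specific cyclic layout, and the example sizes are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation): assign the lattice
# blocks of a lattice H-matrix to MPI ranks arranged in a 2D process grid,
# using a 2D block-cyclic layout as in parallel dense-matrix algorithms.
import math


def build_process_grid(num_procs):
    """Choose a near-square 2D grid (p_r x p_c) with p_r * p_c == num_procs."""
    p_r = int(math.sqrt(num_procs))
    while num_procs % p_r != 0:
        p_r -= 1
    return p_r, num_procs // p_r


def owner_of_block(i, j, p_r, p_c):
    """Rank owning lattice block (i, j) under a 2D block-cyclic distribution."""
    return (i % p_r) * p_c + (j % p_c)


if __name__ == "__main__":
    num_blocks = 8   # lattice blocks per dimension (hypothetical value)
    num_procs = 6    # MPI processes (hypothetical value)
    p_r, p_c = build_process_grid(num_procs)
    for i in range(num_blocks):
        row = [owner_of_block(i, j, p_r, p_c) for j in range(num_blocks)]
        print(row)   # each lattice block (itself an H-matrix) lives on one rank
```

Under such a layout, each rank holds roughly the same number of lattice blocks, which is consistent with the abstract's claim that assigning lattice blocks to MPI processes yields a relatively good load balance at large process counts.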
