DOI: 10.1177/10943420261462361 ISSN: 1094-3420

A randomized point-block Schwarz preconditioner for multiphysics problems on fully unstructured meshes on GPUs

Wenpeng Ma, Xiaofan Le, Xiao-Chuan Cai

In this paper, we propose a randomized point-block Schwarz preconditioner for a heterogeneous computing system consisting of both CPUs and GPUs. The preconditioner can be used to accelerate a Newton-Krylov method for solving nonlinear algebraic systems of equations arising from the discretization of multi-component partial differential equations. Roughly speaking, GPUs consist of a large number of lightweight processing cores for which the synchronization cost is high. On such a processor, the traditional algorithms for the subdomain LU factorization and the triangular solves in the Schwarz method are no longer efficient, since they require a large number of synchronizations among some or all of the processing cores. The key steps of the proposed method are (1) a randomized iterative method for the LU factorization of a matrix; (2) a randomized iterative triangular solver for the upper and lower triangular systems of equations; and (3) a symbolic inverse of point-block diagonal matrices whose size is determined by the warp size of the GPU. In the realization of these steps, we employ a warp-based thread partitioning strategy with an adjustable factor that controls the degree of parallelism for the randomized computation between threads. The target problems are matrices with block structure whose block size is small enough that the inverses of the diagonal blocks can be obtained symbolically. Two representative applications, the discretized incompressible Navier-Stokes equations and the hyperelasticity equations, verify the preconditioner's performance in a Newton-Krylov solver, with a speedup of more than 12x on 8 GPUs over 8 CPU cores. In the first case the randomized method achieves 2x–5x gains over the deterministic cuSPARSE baseline, and in the second case it performs nearly comparably to the baseline. A cerebral artery blood flow simulation demonstrates its superior scalability over the deterministic GPU implementation, attaining up to 45% parallel efficiency on 64 GPUs.
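The abstract does not spell out the randomized triangular solver, so the following is only a minimal sketch of the general idea behind step (2), not the authors' algorithm. It assumes a fixed-point update x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii applied to the unknowns in a random order on each sweep, with a sequential Python loop standing in for GPU threads. The function name `randomized_lower_solve` and the `sweeps` and `seed` parameters are hypothetical.

```python
import random


def randomized_lower_solve(L, b, sweeps, seed=0):
    """Approximately solve L x = b for a dense lower-triangular L.

    Illustrative assumption: each sweep visits every unknown once in a
    random order and updates it in place from the current values of the
    other unknowns, with no ordering constraint between updates.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    order = list(range(n))
    for _ in range(sweeps):
        rng.shuffle(order)  # random visiting order for this sweep
        for i in order:
            # Use whatever values of x[j], j < i, are currently available.
            s = sum(L[i][j] * x[j] for j in range(i))
            x[i] = (b[i] - s) / L[i][i]
    return x


# Small example with a known solution.
L = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, -1.0, 5.0]]
x_true = [1.0, -2.0, 0.5]
b = [sum(L[i][j] * x_true[j] for j in range(3)) for i in range(3)]

x = randomized_lower_solve(L, b, sweeps=3)
print(x)
```

For a triangular system this iteration reaches the exact solution in at most n sweeps regardless of the visiting order, because unknown i becomes exact as soon as all unknowns before it are exact. In a preconditioner a small, fixed number of sweeps would typically be used instead, trading accuracy for fewer synchronization points.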
